Website: https://kylebarron.dev/all-transit
All transit, as reported by the Transitland database. Inspired by All Streets. I have a blog post here with more information about the project.
The code for the website is in site/. It uses React, Gatsby, Deck.gl, and
React Map GL/Mapbox GL JS.
The static_image folder contains code to generate an SVG and PNG of all the
routes in the U.S. It uses d3 and
geo2svg.
Most of the data-generating code for this project is written in Bash, jq, GNU Parallel, SQLite, and a couple of Python scripts. Data is kept in newline-delimited JSON and newline-delimited GeoJSON for all intermediate steps, to facilitate streaming and keep memory use low.
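To make the streaming point concrete, here's a toy sketch (the file name and contents are made up): because each line of newline-delimited GeoJSON is a complete feature, a pipeline can handle one feature at a time and never hold the whole dataset in memory.

```shell
tmpdir="$(mktemp -d)"
# Made-up sample of a newline-delimited GeoJSON file: one feature per line
cat > "$tmpdir/features.geojson" <<'EOF'
{"type": "Feature", "properties": {"onestop_id": "o-1"}}
{"type": "Feature", "properties": {"onestop_id": "o-2"}}
{"type": "Feature", "properties": {"onestop_id": "o-3"}}
EOF

count=0
while IFS= read -r line; do
  # each $line is one complete JSON feature; process it, then discard it
  count=$((count + 1))
done < "$tmpdir/features.geojson"
echo "$count"
```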
Clone this Git repository and install the Python package I wrote to easily access the Transitland API.
```
git clone https://github.com/kylebarron/all-transit
cd all-transit
pip install transitland-wrapper
mkdir -p data
```

Each of the API endpoints accepts a bounding box. At first, I tried to pass a bounding box of the entire United States to these APIs and page through the results. Unsurprisingly, that method isn't successful for the endpoints that have more data to return, like stops and schedules. I found that for the schedules endpoint, the API was very slow and occasionally timed out when I requested something with `offset=100000`, presumably because it takes a long time to find the 100,000th row of a given query. Because of this, I found it best in general to split API queries into smaller pieces, e.g. by operator id or route id.
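The resulting per-id download pattern looks roughly like this. `fetch_route` here is a stand-in stub I've made up for illustration; the real calls use the `transitland` CLI shown below.

```shell
tmpdir="$(mktemp -d)"
printf 'r-abc\nr-def\nr-ghi\n' > "$tmpdir/route_ids.txt"

# Stub standing in for a real API client such as `transitland routes ...`
fetch_route() {
  echo "{\"route\": \"$1\"}"
}

# One small request per id instead of one huge offset-paged query
while read -r route_id; do
  fetch_route "$route_id" > "$tmpdir/$route_id.json"
done < "$tmpdir/route_ids.txt"

# One output file per route id
ls "$tmpdir"/*.json
```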
Download all operators whose service area intersects the continental US, and then extract their identifiers.
```
# All operators
transitland operators --page-all > data/operators.geojson

# All operator `onestop_id`s
cat data/operators.geojson \
    | jq '.properties.onestop_id' \
    | uniq \
    | tr -d \" \
    > data/operator_onestop_ids.txt
```

I downloaded routes by the geometry of the US, and later found it best to split the response into separate files by operator. If I were to run this download again, I'd download routes by operator from the start.
```
# All routes
rm -rf data/routes
mkdir -p data/routes
cat data/operator_onestop_ids.txt | while read operator_id
do
    transitland routes \
        --page-all \
        --operated-by $operator_id \
        --per-page 1000 \
        > data/routes/$operator_id.geojson
done
```

Now that the routes are downloaded, I extract the identifiers for all RouteStopPatterns and Routes.
```
mkdir -p data/route_stop_patterns_by_onestop_id/
cat data/operator_onestop_ids.txt | while read operator_id
do
    cat data/routes/$operator_id.geojson \
        | jq '.properties.route_stop_patterns_by_onestop_id[]' \
        | uniq \
        | tr -d \" \
        > data/route_stop_patterns_by_onestop_id/$operator_id.txt
done
```
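One caveat about the `uniq` calls in these extraction steps: `uniq` only collapses adjacent duplicate lines, so it fully dedupes only sorted input. If the same id can appear non-adjacently, `sort -u` is the safer choice:

```shell
tmpdir="$(mktemp -d)"
# The duplicate 'o-b' is not adjacent, so `uniq` alone won't remove it
printf 'o-b\no-a\no-b\n' > "$tmpdir/ids.txt"

uniq "$tmpdir/ids.txt" | wc -l     # still 3 lines: the split duplicate survives
sort -u "$tmpdir/ids.txt" | wc -l  # 2 lines: fully deduplicated
```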
```
mkdir -p data/routes_onestop_ids/
cat data/operator_onestop_ids.txt | while read operator_id
do
    cat data/routes/$operator_id.geojson \
        | jq '.properties.onestop_id' \
        | uniq \
        | tr -d \" \
        > data/routes_onestop_ids/$operator_id.txt
done
```

In order to split up how I later call the ScheduleStopPairs API endpoint, I split the Route identifiers into sections. There are just shy of 15,000 route identifiers, so I split them into 5 files of roughly 3,000 identifiers each.
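For what it's worth, the sed-based chunking below can also be done with coreutils `split`, which avoids hand-computing line ranges. A sketch (`xaa`...`xae` are split's default output names):

```shell
tmpdir="$(mktemp -d)"
seq 1 15000 > "$tmpdir/routes_onestop_ids.txt"

# 3000 lines per chunk -> five files named xaa ... xae
( cd "$tmpdir" && split -l 3000 routes_onestop_ids.txt )

ls "$tmpdir"/xa?
```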
```
# Combine the per-operator id files, then split into fifths so that I can
# call the ScheduleStopPairs API in sections
cat data/routes_onestop_ids/*.txt > data/routes_onestop_ids.txt
cat data/routes_onestop_ids.txt \
    | sed -n '1,2999p;3000q' \
    > data/routes_onestop_ids_1.txt
cat data/routes_onestop_ids.txt \
    | sed -n '3000,5999p;6000q' \
    > data/routes_onestop_ids_2.txt
cat data/routes_onestop_ids.txt \
    | sed -n '6000,8999p;9000q' \
    > data/routes_onestop_ids_3.txt
cat data/routes_onestop_ids.txt \
    | sed -n '9000,11999p;12000q' \
    > data/routes_onestop_ids_4.txt
cat data/routes_onestop_ids.txt \
    | sed -n '12000,15000p;15000q' \
    > data/routes_onestop_ids_5.txt
```

Stops are points along a Route or RouteStopPattern where passengers may get on or off.
Downloading stops by operator was necessary to keep the server from paging through overly long results. I was stupid and concatenated them all into a single file, which I later had to split apart with jq. If I were downloading these again, I'd write each Stops response into a file named by operator.
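That kind of split can be sketched with awk. This toy version assumes a simplified line format where the operator id is the first field; the real ND-GeoJSON would need jq to pull the id out of each line, and the ids here are made up.

```shell
tmpdir="$(mktemp -d)"
cat > "$tmpdir/all_stops.txt" <<'EOF'
o-bart stop-1
o-bart stop-2
o-actransit stop-3
EOF

# Route each line to a per-operator file, keyed on the first field
awk -v dir="$tmpdir" '{ print > (dir "/stops_" $1 ".txt") }' "$tmpdir/all_stops.txt"

ls "$tmpdir"/stops_*.txt
```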
```
# All stops
rm -rf data/stops
mkdir -p data/stops
cat data/operator_onestop_ids.txt | while read operator_id
do
    transitland stops \
        --page-all \
        --served-by $operator_id \
        --per-page 1000 \
        > data/stops/$operator_id.geojson
done
```

RouteStopPatterns are portions of a route. I think an easy way to think of the difference is that a Route can be a MultiLineString, while a RouteStopPattern is always a LineString.
So far I haven't actually needed to use RouteStopPatterns for anything. I would've ideally matched ScheduleStopPairs to RouteStopPatterns instead of to Routes, but I found that some ScheduleStopPairs are missing a RouteStopPattern, while the Route is apparently never missing.
```
mkdir -p data/route_stop_patterns/
cat data/operator_onestop_ids.txt | while read operator_id
do
    transitland onestop-id \
        --page-all \
        --file data/route_stop_patterns_by_onestop_id/$operator_id.txt \
        > data/route_stop_patterns/$operator_id.json
done
```

ScheduleStopPairs are edges along a Route or RouteStopPattern that define a single instance of transit moving between a pair of stops along the route.
At first I tried to download these by operator_id, but even that stalled the server, because some operators in big cities have millions of different ScheduleStopPairs. Instead I downloaded by route_id. Apparently you can only download by Route and not by RouteStopPattern, or else I probably would've chosen the latter, which might've made associating ScheduleStopPairs to geometries easier. I used each fifth of the Route identifiers from earlier so that I could make sure each portion was correctly downloaded.
```
# All schedule-stop-pairs
# Best to loop over route_id, not operator_id
mkdir -p data/ssp/
cat data/operator_onestop_ids.txt | while read operator_id
do
    cat data/routes_onestop_ids/$operator_id.txt | while read route_id
    do
        transitland schedule-stop-pairs \
            --page-all \
            --route-onestop-id $route_id \
            --per-page 1000 \
            --active \
            | gzip >> data/ssp/$operator_id.json.gz
    done
    # Mark this operator's download as complete
    touch data/ssp/$operator_id.finished
done

# Alternatively, loop over each fifth of the route ids
for i in {1..5}; do
    cat data/routes_onestop_ids_${i}.txt | while read route_id
    do
        transitland schedule-stop-pairs \
            --page-all \
            --route-onestop-id $route_id \
            --per-page 1000 --active \
            | gzip >> data/ssp/ssp${i}.json.gz
    done
done
```

I generate vector tiles for the routes, operators, and stops. I have jq filters in code/jq/ to reshape the GeoJSON into the format I want, so that the correct properties are included in the vector tiles.
In order to keep the size of the vector tiles small:

- The `stops` layer is only included at zoom 11
- The `routes` layer only includes metadata about the identifiers of the stops it passes at zoom 11
```
# Writes mbtiles to data/mbtiles/routes.mbtiles
# The -c is important so that each feature gets output onto a single line
find data/routes -type f -name '*.geojson' -exec cat {} \; \
    `# Apply jq filter at code/jq/routes.jq` \
    | jq -c -f code/jq/routes.jq \
    | bash code/tippecanoe/routes.sh

# Writes mbtiles to data/mbtiles/operators.mbtiles
bash code/tippecanoe/operators.sh data/operators.geojson

# Writes mbtiles to data/mbtiles/stops.mbtiles
# The -c is important so that each feature gets output onto a single line
find data/stops -type f -name '*.geojson' -exec cat {} \; \
    | jq -c -f code/jq/stops.jq \
    | bash code/tippecanoe/stops.sh
```

Combine into a single mbtiles file:
```
tile-join \
    -o data/mbtiles/all.mbtiles \
    `# Don't enforce size limits;` \
    `# Size limits already enforced individually for each sublayer` \
    --no-tile-size-limit \
    `# Overwrite existing mbtiles` \
    --force \
    `# Input files` \
    data/mbtiles/stops.mbtiles \
    data/mbtiles/operators.mbtiles \
    data/mbtiles/routes.mbtiles
```

Then publish! Host on a small server with mbtileserver, or export the mbtiles to a directory of individual tiles with mb-util and upload the individual files to S3.
I'll upload this to S3:
Export the mbtiles to a directory:
```
mb-util \
    `# Existing mbtiles` \
    data/mbtiles/all.mbtiles \
    `# New directory` \
    data/all \
    `# Set file extension to pbf` \
    --image_format=pbf
```

Then upload to S3:
```
# First the tile.json
aws s3 cp \
    code/tile/op_rt_st.json s3://data.kylebarron.dev/all-transit/op_rt_st/tile.json \
    --content-type application/json \
    `# Set to public read access` \
    --acl public-read

aws s3 cp \
    data/all s3://data.kylebarron.dev/all-transit/op_rt_st/ \
    --recursive \
    --content-type application/x-protobuf \
    --content-encoding gzip \
    `# Set to public read access` \
    --acl public-read \
    `# 6 hour cache; one day stale-while-revalidate` \
    --cache-control "public, max-age=21600, stale-while-revalidate=86400"
```

The schedule component is my favorite part of the project. You can see streaks moving around that correspond to transit vehicles: trains, buses, ferries. This data comes from actual schedule information in the Transitland API, matched to route geometries. (It's not real-time info, though, so it doesn't reflect delays.)
I use the deck.gl
TripsLayer
to render the schedule data as an animation. That means that I need to figure
out the best way to transport three-dimensional LineStrings (where the third
dimension refers to time) to the client. Unfortunately, at this time Tippecanoe
doesn't support three-dimensional
coordinates. The
recommendation in that thread was to reformat to have individual points with
properties. That would make it harder to associate the points to lines, however.
I eventually decided it was best to pack the data into tiled
gzipped-minified-GeoJSON. And since I know that all features are LineStrings,
and since I have no properties that I care about, I take only the coordinates,
so that the data the client receives is like:
```
[
  [
    [0, 1, 2],
    [1, 2, 3]
  ],
  [
    [],
    ...
  ]
]
```

I currently store the third coordinate as seconds of the day, so that 4pm is `16 * 60 * 60 = 57600`.
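The arithmetic for that encoding is simply hours * 3600 + minutes * 60 + seconds; a small bash sketch (the helper name is made up):

```shell
# Convert HH:MM:SS to seconds of the day; 16:00:00 (4pm) -> 57600
to_seconds_of_day() {
  IFS=: read -r h m s <<< "$1"
  # 10# forces base 10 so leading zeros aren't parsed as octal
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds_of_day "16:00:00"   # 57600
to_seconds_of_day "04:30:15"   # 16215
```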
In order to make the data download manageable, I cut each GeoJSON into xyz map tiles, so that only data pertaining to the current viewport is loaded. For dense cities like Washington DC and New York City, some of the LineStrings are very dense, so I keep the schedule tiles at full resolution at zoom 13, and then generate overview tiles for lower zooms that contain a fraction of the features of their child tiles.
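The lon/lat-to-tile math behind this kind of xyz tiling is presumably the standard Web Mercator "slippy map" formula; here's a sketch of it in awk (the function name is mine, not from the project's code):

```shell
# Standard Web Mercator lon/lat -> tile x/y (a sketch, not the project's script)
lonlat_to_tile() {
  awk -v lon="$1" -v lat="$2" -v z="$3" 'BEGIN {
    pi = 3.14159265358979
    n = 2 ^ z
    x = int((lon + 180) / 360 * n)
    lat_rad = lat * pi / 180
    # awk has no tan(); use sin/cos
    y = int((1 - log(sin(lat_rad) / cos(lat_rad) + 1 / cos(lat_rad)) / pi) / 2 * n)
    print x, y
  }'
}

lonlat_to_tile 0 0 13   # 4096 4096
```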
I generated tiles in this manner down to zoom 2, but discovered that performance was very poor on lower-powered devices like my phone. Because of that, I think it's best to have the schedule feature disabled by default.
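The overview-generation script isn't shown here, but the core idea of keeping only a fraction of features can be sketched as keeping every Nth line of a newline-delimited file:

```shell
tmpdir="$(mktemp -d)"
seq 1 100 > "$tmpdir/features.txt"   # stand-in for 100 ND-GeoJSON features

# Keep every 4th feature for a lower-zoom overview tile
awk 'NR % 4 == 1' "$tmpdir/features.txt" > "$tmpdir/overview.txt"

wc -l < "$tmpdir/overview.txt"   # 25
```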
I originally tried to do everything with jq, but the schedule data for all
routes in the US as uncompressed JSON is >100GB and things were too slow. I
tried SQLite and it's pretty amazing.
To import ScheduleStopPair data into SQLite, I first converted the JSON files
to CSV:
```
# Create CSV file with data
mkdir -p data/ssp_sqlite/
for i in {1..5}; do
    # Header line
    gunzip -c data/ssp/ssp${i}.json.gz \
        | head -n 1 \
        | jq -rf code/ssp/ssp_keys.jq \
        | gzip \
        > data/ssp_sqlite/ssp${i}.csv.gz

    # Data
    gunzip -c data/ssp/ssp${i}.json.gz \
        | jq -rf code/ssp/ssp_values.jq \
        | gzip \
        >> data/ssp_sqlite/ssp${i}.csv.gz
done
```

Then import the CSV files into SQLite:
```
for i in {1..5}; do
    gunzip -c data/ssp_sqlite/ssp${i}.csv.gz \
        | sqlite3 -csv data/ssp_sqlite/ssp.db '.import /dev/stdin ssp'
done
```

Create a SQLite index on route_onestop_id:
```
sqlite3 data/ssp_sqlite/ssp.db \
    'CREATE INDEX route_onestop_id_idx ON ssp(route_onestop_id);'
```

I found it best to loop over route_ids when matching schedules to route geometries. Here I create a crosswalk with the operator id for each route, so that I can pass to my Python script 1) ScheduleStopPairs pertaining to a route, 2) Stops by operator, and 3) Routes by operator.
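For illustration, each line of the resulting crosswalk is one small JSON object, and the operator for a given route can be pulled out with grep and sed (the ids here are made up; jq would work equally well):

```shell
tmpdir="$(mktemp -d)"
# Made-up sample of data/route_operator_xw.json lines
cat > "$tmpdir/route_operator_xw.json" <<'EOF'
{"route_id":"r-one","operator_id":"o-alpha"}
{"route_id":"r-two","operator_id":"o-beta"}
EOF

# Find the operator for route r-two
grep '"route_id":"r-two"' "$tmpdir/route_operator_xw.json" \
  | sed 's/.*"operator_id":"\([^"]*\)".*/\1/'
```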
```
# Make xw with route_id: operator_id
cat data/routes/*.geojson \
    | jq -c '{route_id: .properties.onestop_id, operator_id: .properties.operated_by_onestop_id}' \
    > data/route_operator_xw.json
```

Here's the meat of connecting schedules to route geometries. The bash script calls code/schedules/ssp_geom.py, and the general process of that script is:
- Load stops, routes, and route stop patterns for the operator
- Load the provided `ScheduleStopPair`s from stdin
- Iterate over every `ScheduleStopPair`. For each pair, try to find the route stop pattern it's associated with. If it exists, use the linear stop distances contained in the `ScheduleStopPair` and Shapely's linear referencing methods to take the substring of that `LineString`.
- If a route stop pattern isn't found directly, find the associated route, then find its associated route stop patterns, then try taking a substring of each of those, checking that the start/end points are very close to the start/end stops.
- As a fallback, skip route stop patterns entirely: find the starting/ending `Point`s, find the nearest point on the route for each of those points, and take the line between them.
- Get the time at which the vehicle leaves the start stop and at which it arrives at the destination stop, then linearly interpolate this along every coordinate of the `LineString`. This way, the finalized `LineString`s have the same geometry as the original routes, and every coordinate has a time.
```
# Loop over _routes_
num_cpu=12
for i in {1..5}; do
    cat data/routes_onestop_ids_${i}.txt \
        | parallel -P $num_cpu bash code/schedules/ssp_geom.sh {}
done
```

Now in data/ssp/geom I have a newline-delimited GeoJSON file for every route. I take all these individual features and cut them into individual tiles at a zoom that keeps all the original data with no simplification, which I currently have as zoom 13.
```
rm -rf data/ssp/tiles
mkdir -p data/ssp/tiles
find data/ssp/geom/ -type f -name 'r-*.geojson' -exec cat {} \; \
    | uniq \
    | python code/tile/tile_geojson.py \
    `# Set minimum and maximum tile zooms` \
    -z 13 -Z 13 \
    `# Only keep LineStrings` \
    --allowed-geom-type 'LineString' \
    `# Write tiles into the following root dir` \
    -d data/ssp/tiles
```

Create overview tiles for lower zooms:
```
python code/tile/create_overview_tiles.py \
    --min-zoom 10 \
    --existing-zoom 13 \
    --tile-dir data/ssp/tiles \
    --max-coords 150000
```

Make gzipped protobuf files from these tiles:
```
rm -rf data/ssp/pbf
mkdir -p data/ssp/pbf
num_cpu=15
for zoom in {10..13}; do
    find data/ssp/tiles/${zoom} -type f -name '*.geojson' \
        | parallel -P $num_cpu bash code/tile/compress_tiles_pbf.sh {}
done
```

Upload to AWS:
```
aws s3 cp \
    data/ssp/pbf/13 s3://data.kylebarron.dev/all-transit/pbfv2/schedule/4_16-20/13 \
    --recursive \
    --content-type application/x-protobuf \
    --content-encoding gzip \
    `# Set to public read access` \
    --acl public-read
```

Several data providers ask to be credited when you use their data.
Download all feed information:
```
transitland feeds --page-all > data/feeds.geojson

python code/generate_attribution.py data/feeds.geojson \
    | gzip \
    > data/attribution.json.gz

aws s3 cp \
    data/attribution.json.gz \
    s3://data.kylebarron.dev/all-transit/attribution.json \
    --content-type application/json \
    --content-encoding gzip \
    --acl public-read
```