Introducing the Subway Origin-Destination Ridership dataset

Updated July 25, 2024 3:00 p.m.

If you’ve ever used our subway ridership datasets, you know that the MTA has a very good sense of how many people are riding the subway and where they’re boarding. But this suggests another obvious question: where are those people going? This is one of the most requested datasets by policymakers and MTA Open Data users—and the answer is a little complicated. Our ridership data comes from MetroCard swipes and OMNY taps at entry turnstiles, but our subway system has no similar swipe or tap requirement on exit. However, even though we don’t have perfect exit data, what we do have are pretty good estimates of how our riders move throughout the system.

This week on Open Data, we are excited to share those “pretty good estimates” with the MTA’s new Subway Origin-Destination Ridership Estimate open datasets. These datasets provide an estimation of the number of riders that traveled between a given origin-destination pair for each hour of day and day of the week, averaged over a calendar month, like so:

Year | Month | Day of Week | Hour | Origin Station | Destination Station | Estimated Trips
2024 | 5     | Monday      | 8    | Times Sq-42 St | Court Sq            | 123.45
2024 | 5     | Monday      | 8    | Times Sq-42 St | Grand Central-42 St | 456.78
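For readers who want to work with rows like these in code, here is a minimal sketch of parsing them with Python's standard library. The CSV text and its header names simply mirror the column headings shown above and are an assumption; the published file's exact headers may differ, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical CSV text that reuses the column headings shown above.
# The published dataset's exact header names may differ.
SAMPLE_CSV = """Year,Month,Day of Week,Hour,Origin Station,Destination Station,Estimated Trips
2024,5,Monday,8,Times Sq-42 St,Court Sq,123.45
2024,5,Monday,8,Times Sq-42 St,Grand Central-42 St,456.78
"""

def load_rows(text_stream):
    """Parse origin-destination rows into dicts with typed values."""
    rows = []
    for record in csv.DictReader(text_stream):
        rows.append({
            "year": int(record["Year"]),
            "month": int(record["Month"]),
            "day_of_week": record["Day of Week"],
            "hour": int(record["Hour"]),
            "origin": record["Origin Station"],
            "destination": record["Destination Station"],
            # Estimates are fractional, so keep them as floats.
            "estimated_trips": float(record["Estimated Trips"]),
        })
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["destination"], rows[0]["estimated_trips"])
```

To read a downloaded file instead, the same function could be given an open file handle in place of the `io.StringIO` wrapper.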

This data shows us how the entire city travels, and how the approximately 4 million trips that take place on the subway on a typical weekday move the city. To get a high-level sense of how this works, take a look at the Kepler.gl animation below, which shows how riders moved across the system during an average week (Monday-Friday) in June 2024.

In this animation, each arc represents a journey for a number of users (identified by arc thickness), from an origin station (colored in light blue) to a destination station (colored in orange). Hundreds of thousands of riders travel to Manhattan’s central business district in the morning from across the city, followed by their return in the evening. Travel visibly varies across the city—with the many permutations that arise from millions of riders traveling freely across the subway’s 425 stations and station complexes.

About the data

Let’s talk a bit more about the data. This dataset is based on the ‘Destination Inference’ step of our ridership model, which we detailed in a previous blog post. As that post outlines, the basis of this model is the assumption that a subway trip’s destination is the station where the rider next swipes or taps. If a MetroCard swipes into Bowling Green at 9:15 a.m., and then that same MetroCard swipes into the 103 St stop in East Harlem later that afternoon, we make the imperfect (but pretty good) inference that this 9:15 a.m. trip traveled from Bowling Green to 103 St. These “linked trips” are what form the basis of our understanding of how riders travel across the system (Note 1).
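The "next entry is the previous trip's destination" idea can be sketched in a few lines. This is only an illustration of the assumption described above, not the MTA's actual model, which likely applies further rules. The function name, the card ID, the calendar date, and the 4:40 p.m. time are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def infer_destinations(entries):
    """Link each entry to the same card's next entry station.

    `entries` is an iterable of (card_id, entry_time, station) tuples.
    Returns (card_id, entry_time, origin, destination) tuples, where
    destination is None if the card has no later entry (an unlinked trip).
    """
    taps_by_card = defaultdict(list)
    for card_id, entry_time, station in entries:
        taps_by_card[card_id].append((entry_time, station))

    trips = []
    for card_id, taps in taps_by_card.items():
        taps.sort()  # chronological order for this card
        for i, (entry_time, origin) in enumerate(taps):
            has_next = i + 1 < len(taps)
            destination = taps[i + 1][1] if has_next else None
            trips.append((card_id, entry_time, origin, destination))
    return trips

# The Bowling Green example from the text, with made-up date and afternoon time.
entries = [
    ("card-A", datetime(2024, 5, 6, 9, 15), "Bowling Green"),
    ("card-A", datetime(2024, 5, 6, 16, 40), "103 St"),
]
trips = infer_destinations(entries)
print(trips[0][2], "->", trips[0][3])
```

The afternoon entry at 103 St has no later swipe in this toy input, so its own destination stays `None`, which is the kind of trip Note 1 calls unlinked.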

In this Subway Origin-Destination (OD) dataset, we’ve taken the destinations assigned by our destination inference process and aggregated them by origin-destination station complex pair and hour of day. These totals are then further aggregated by averaging over a calendar month. We remove personally identifying information, like MetroCard ID numbers, and aggregate ridership over a calendar month to protect the privacy of MTA riders: this prevents anyone from associating a single MetroCard swipe or subway trip with a specific person or hour. The format of this aggregated dataset lets users understand, for “an average 9 a.m. hour during the month of May,” roughly how many people traveled between two subway complexes.
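One plausible reading of this two-step aggregation is: count linked trips per weekday, hour, and station pair, then divide by the number of times that weekday occurs in the month. The sketch below follows that reading; it is an assumption about the method, not the MTA's published procedure, and the function names and sample trips are hypothetical.

```python
import calendar
from collections import defaultdict
from datetime import datetime

def weekday_occurrences(year, month, weekday):
    """Count how many times a weekday (0 = Monday) falls in a month."""
    weeks = calendar.monthcalendar(year, month)  # zeros pad days outside the month
    return sum(1 for week in weeks if week[weekday] != 0)

def monthly_average(trips, year, month):
    """Average trips per (weekday, hour, origin, destination) for one month.

    `trips` is an iterable of (entry_time, origin, destination) tuples.
    """
    totals = defaultdict(int)
    for entry_time, origin, destination in trips:
        if (entry_time.year, entry_time.month) == (year, month):
            key = (entry_time.weekday(), entry_time.hour, origin, destination)
            totals[key] += 1
    return {
        key: count / weekday_occurrences(year, month, key[0])
        for key, count in totals.items()
    }

# Made-up trips: three on Monday, May 6 and two on Monday, May 13, all in the 8 a.m. hour.
trips = [
    (datetime(2024, 5, 6, 8, 5), "Times Sq-42 St", "Court Sq"),
    (datetime(2024, 5, 6, 8, 20), "Times Sq-42 St", "Court Sq"),
    (datetime(2024, 5, 6, 8, 45), "Times Sq-42 St", "Court Sq"),
    (datetime(2024, 5, 13, 8, 10), "Times Sq-42 St", "Court Sq"),
    (datetime(2024, 5, 13, 8, 30), "Times Sq-42 St", "Court Sq"),
]
averages = monthly_average(trips, 2024, 5)
print(averages[(0, 8, "Times Sq-42 St", "Court Sq")])  # 5 trips / 4 Mondays = 1.25
```

Dividing by every Monday in the month, including Mondays with no observed trips, is what produces fractional values like 1.25 even before any modeling uncertainty is considered.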

It’s important to keep a few things in mind when using this data: 

  • Because this data is the result of a modeling process, the ridership numbers for each origin-destination pair are estimates, not exact values. This modeling process, as well as the monthly aggregation, results in fractional ridership values—we’ve intentionally left ridership estimates as decimals to reflect the uncertainty inherent in this dataset. 
  • Because this data represents a monthly average, users should be mindful that holidays, construction, or other important events that take place during a given month might impact ridership estimates. 
  • Since the modeling process only looks at subway station entries, we can’t quantify how many of these trips truly started and ended at these subway station complexes and how many may have included a transfer from or to another mode of transit (e.g. a bus) at either or both ends. 
  • When using the data to look at arrivals to a subway station, users should note that the timestamp for each OD pair is rounded down to the nearest hour of the entry swipe (or tap) and does not account for the travel time between the entry swipe and arrival at the destination (Note 2).
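The last point above is easy to misread, so here is a small sketch of the difference between the recorded entry hour and a rough arrival hour. The 65-minute travel time and the calendar date are purely hypothetical; the dataset itself contains no travel times, so any such offset would have to come from another source.

```python
from datetime import datetime, timedelta

def entry_hour(entry_time):
    """Hour bucket used for the trip: the entry time rounded down to the hour."""
    return entry_time.hour

def estimated_arrival_hour(entry_time, travel_minutes):
    """Hour bucket of the arrival, given an assumed travel time in minutes."""
    return (entry_time + timedelta(minutes=travel_minutes)).hour

entry = datetime(2024, 5, 6, 8, 50)  # an 8:50 a.m. entry swipe
print(entry_hour(entry))                  # 8
print(estimated_arrival_hour(entry, 65))  # 9, since 8:50 + 65 min = 9:55
```

Anyone estimating arrivals at a station may want to shift the hour this way for long trips, while keeping in mind that the shift is their own assumption.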

How can we dig deeper with this data?

Where are people going from a station?

One of the clearest use cases for this data is better understanding the journeys that riders are taking from a given subway station. Take the Court Sq station as an example. The following map and graph represent the various destinations for trips originating at this station for Saturdays in June 2024. We can see that riders entering at this station largely travel to destinations in Queens or Midtown Manhattan.

A bar chart showing the Top 15 Destinations for Trips from Court Square.
Figure 1: This bar chart shows the top 15 trip destinations from the Court Sq station for Saturdays in June 2024. Riders entering this station largely travel to destinations in Queens or Midtown Manhattan.
Data visualization of trips from Court Square on Saturdays in June 2024, showing the most trips are to Midtown Manhattan and Queens, with fewer trips to other areas of the city.
Figure 2: This map visualizes trip destinations from the Court Square station for Saturdays in June 2024, with arc thickness determined by the number of users. We can see the concentration of riders traveling to Midtown Manhattan and Queens.
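A ranking like the one in Figure 1 could be computed along these lines. This is a sketch under assumptions: each row is a dict whose keys (`origin`, `destination`, `day_of_week`, `estimated_trips`) are hypothetical names, and the sample values are the illustrative ones from the table near the top of this post rather than real Court Sq data.

```python
from collections import defaultdict

def top_destinations(rows, origin, day_of_week, n=15):
    """Rank destinations from one origin by estimated trips, summed over all hours."""
    totals = defaultdict(float)
    for row in rows:
        if row["origin"] == origin and row["day_of_week"] == day_of_week:
            totals[row["destination"]] += row["estimated_trips"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Illustrative rows only; the key names are assumptions, not official field names.
rows = [
    {"origin": "Times Sq-42 St", "destination": "Court Sq",
     "day_of_week": "Monday", "hour": 8, "estimated_trips": 123.45},
    {"origin": "Times Sq-42 St", "destination": "Grand Central-42 St",
     "day_of_week": "Monday", "hour": 8, "estimated_trips": 456.78},
]
ranking = top_destinations(rows, "Times Sq-42 St", "Monday", n=15)
print(ranking[0][0])  # Grand Central-42 St
```

Summing across hours gives the estimated trips for an average day of that type; filtering on `hour` as well would narrow the ranking to a single time of day.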

Estimating destination foot traffic

We can also use this data to zoom in on ridership patterns at specific subway station complexes to understand foot traffic in a certain area. For example, someone opening a new business and deciding on optimal operating hours and labor needs could use this dataset to investigate the approximate volume of subway trips arriving to nearby subway station(s) by hour of day for weekdays compared to weekends, or summer months compared to winter months. This data could also support an analysis of where these riders are coming from to better tailor marketing towards these riders.

Let's continue with our example of Court Sq. We can see that weekend ridership to this station doesn’t really pick up until 10 a.m. but stays high until 10 p.m., so a business owner might want to shift their opening hours later on the weekends. We can also see that on weekdays the morning and evening peaks are very similar in scale, suggesting that this station is in an area where many full-time workers both live (evening arrivals) and work (morning arrivals).

A line graph of Estimated Ridership to Court Square in May 2024. The graph shows a bimodal distribution with peaks during the morning and evening rush hours for Monday through Friday. On Saturdays and Sundays, there is a single, lower peak in the evening.
Figure 3: This figure shows estimated hourly ridership to the Court Sq station (as a destination) for May 2024. We can see the distinct weekend and weekday patterns. Plots like this can help us estimate foot traffic in and around the station complex.
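An hourly arrival profile like the one in Figure 3 could be built by summing estimated trips into each destination by day of week and hour. The sketch below assumes rows are dicts with hypothetical key names, and the sample numbers are made up for illustration.

```python
from collections import defaultdict

def hourly_arrivals(rows, destination):
    """Return {day_of_week: [24 hourly totals]} for trips ending at `destination`."""
    profile = defaultdict(lambda: [0.0] * 24)
    for row in rows:
        if row["destination"] == destination:
            profile[row["day_of_week"]][row["hour"]] += row["estimated_trips"]
    return dict(profile)

# Made-up rows; key names and values are assumptions for illustration.
rows = [
    {"origin": "Times Sq-42 St", "destination": "Court Sq",
     "day_of_week": "Saturday", "hour": 10, "estimated_trips": 40.0},
    {"origin": "Grand Central-42 St", "destination": "Court Sq",
     "day_of_week": "Saturday", "hour": 10, "estimated_trips": 25.0},
    {"origin": "Times Sq-42 St", "destination": "Court Sq",
     "day_of_week": "Saturday", "hour": 18, "estimated_trips": 90.0},
]
profile = hourly_arrivals(rows, "Court Sq")
saturday = profile["Saturday"]
busiest_hour = max(range(24), key=lambda hour: saturday[hour])
print(busiest_hour, saturday[busiest_hour])  # 18 90.0
```

Remember that the hour here is the entry hour at the origin, so the true arrival peak may fall somewhat later than the profile suggests.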

Taking a look at the origins for trips to Court Sq for Saturdays in May, we can see that many trips originate at stations in Queens and Midtown Manhattan, including the city’s busiest subway complexes: Times Square-42 St, Grand Central-42 St, and 34 St-Herald Sq. We could break this down further, for example to compare weekday trip origins to weekend trips, or by time of day to see how the origins for weekday commutes to this station complex are spread across the city.

A bar chart showing the Top 15 Origins for Trips to Court Square.
Figure 4: This figure shows the top 15 origin station complexes for trips to the Court Sq station for the average Saturday in May 2024.

Investigating impacts of special events

Even though this dataset represents averages across an entire month, if an event causes a big enough bump in ridership for a given origin or destination, the effects will be noticeable in the dataset. For example, average ridership estimates for Saturdays in May show the effects of various graduation ceremonies, as well as early summer activities such as baseball games and the RBC Brooklyn Half.

One of the most notable anomalies we see is for Columbia University’s graduate school recognition ceremonies at the Baker Athletic Complex on Saturday, May 11. When their MBA graduation event ended around 1:30 p.m., we see a large concentration of riders traveling from the 215 St station to the 125 St station as graduation attendees traveled en masse from the athletic complex back to Columbia’s Manhattanville campus.

Data visualization showing unusually high ridership between 125 St and 215 St.
Figure 5: This map illustrates the unusually high estimated ridership between 215 St and 125 St stations for Saturdays in May 2024 between 1 p.m. and 2 p.m., driven by Columbia University’s graduate school recognition ceremonies on May 11, 2024.

And remember this previous blog post which explored, among other topics, the ridership spike at the 72 St, 66 St–Lincoln Center, and 59 St–Columbus Circle stations related to the NYC Marathon? That analysis focused on MetroCard swipes and OMNY taps entering the subway stations, which only gives us data on departures from these stations. With the OD dataset, we can estimate arrivals to a particular subway station complex as well. We can get a sense of marathon spectator traffic by visualizing the flow of riders to multiple popular viewing spots for Sundays in November 2023. We could also visualize both sides of the travel patterns related to an event and see how interest in the event might be spread across different neighborhoods in the city.

The map below represents the difference in average hourly ridership to each subway station complex on Sundays in November 2023 compared to Sundays in October 2023. In addition to a spike in trips to the Upper West Side stations close to the race finish line in Central Park, we also see an increase in the number of trips to stations all along the race course as spectators come to cheer on runners: from the Nevins St and Atlantic Av-Barclays Ctr stations in Downtown Brooklyn to the Bedford Av station in Williamsburg, up to the Vernon Blvd-Jackson Av and Court Sq stations in Long Island City, and then across to the 72 St and 86 St stations on the Upper East Side.

A map showing the estimated average ridership to each station complex between October and November 2023 for the hour 10 a.m.-11 a.m. and the New York City Marathon route.
Figure 6: This map shows the impact of the NYC Marathon (11/5/2023) on ridership patterns for Sundays in November 2023, specifically the change in estimated average ridership to each station complex between October and November 2023 for 10 a.m.-11 a.m.
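The month-over-month comparison behind a map like Figure 6 could be sketched as follows: total the estimated arrivals at each station for a given day of week and hour in each month, then subtract. Row key names, function names, and all sample values here are hypothetical.

```python
from collections import defaultdict

def arrivals(rows, year, month, day_of_week, hour):
    """Total estimated trips arriving at each station for one month, day type, and hour."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["year"], row["month"], row["day_of_week"], row["hour"])
        if key == (year, month, day_of_week, hour):
            totals[row["destination"]] += row["estimated_trips"]
    return totals

def change_in_arrivals(rows, base, comparison, day_of_week, hour):
    """Per-station difference in arrivals: comparison month minus base month.

    `base` and `comparison` are (year, month) tuples.
    """
    before = arrivals(rows, *base, day_of_week, hour)
    after = arrivals(rows, *comparison, day_of_week, hour)
    stations = set(before) | set(after)
    return {s: after.get(s, 0.0) - before.get(s, 0.0) for s in stations}

# Made-up values for Sundays in the 10 a.m. hour.
rows = [
    {"year": 2023, "month": 10, "day_of_week": "Sunday", "hour": 10,
     "origin": "Court Sq", "destination": "72 St", "estimated_trips": 100.0},
    {"year": 2023, "month": 11, "day_of_week": "Sunday", "hour": 10,
     "origin": "Court Sq", "destination": "72 St", "estimated_trips": 250.0},
    {"year": 2023, "month": 11, "day_of_week": "Sunday", "hour": 10,
     "origin": "Court Sq", "destination": "Bedford Av", "estimated_trips": 80.0},
]
change = change_in_arrivals(rows, (2023, 10), (2023, 11), "Sunday", 10)
print(change["72 St"])  # 150.0
```

Because each monthly value is an average over all Sundays in the month, a one-day event shows up diluted, so a large difference here points to an even larger spike on the day itself.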

How can you use this data?

This dataset will be updated monthly on the NYS Open Data Portal, where users can also find additional documentation on how these origin-destination estimates are generated. We hope these examples have given you a small sense of some of the rich analyses that can be done with this dataset.

So please explore, ask us questions, and if you build something cool with this dataset we want to hear about it. Make your findings public and send us an email at opendata@mtahq.org!

About the authors

Julia Lynn is a post-graduate intern with the Data & Analytics team. As a resident of Williamsburg, she has a particular fondness for the Court Sq station.

Matt Yarri is a data science manager with the Data & Analytics team, covering MTA ridership data. He has cheered on many friends who have run the NYC Marathon, which is one of his favorite civic events of the year.

Notes

  1. While not all trips can be “linked” in this way, roughly 80% can be, which provides a fairly representative sample. For trips that can’t be linked in this way, those system entries are still accounted for in this dataset—with those entries allocated to destinations using the same distribution as the known, linked trips. 
  2. For example, if a rider swipes in at 8:50 a.m. at Wakefield-241 St and travels to Bowling Green, that trip will be recorded with an 8 a.m. timestamp for the entry swipe, even though the rider likely arrived at Bowling Green closer to 10 a.m.
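The allocation described in Note 1 can be illustrated with a short sketch: entries that could not be linked are spread across destinations in proportion to the linked trips. The grouping (for example per origin station and hour), the function name, and the numbers are assumptions made for illustration; the note does not spell out those details.

```python
def allocate_unlinked(linked_counts, unlinked_entries):
    """Spread unlinked entries across destinations using the linked-trip shares.

    `linked_counts` maps destination -> number of linked trips from one origin.
    Returns destination -> linked trips plus the proportional share of unlinked entries.
    """
    total_linked = sum(linked_counts.values())
    if total_linked == 0:
        raise ValueError("no linked trips to base the distribution on")
    return {
        destination: count + unlinked_entries * count / total_linked
        for destination, count in linked_counts.items()
    }

# Made-up example: 80 linked trips (roughly 80% of 100 entries) and 20 unlinked entries.
linked = {"Grand Central-42 St": 60, "Court Sq": 20}
estimated = allocate_unlinked(linked, 20)
print(estimated)  # {'Grand Central-42 St': 75.0, 'Court Sq': 25.0}
```

The estimated totals add back up to all 100 entries, which matches the note's point that every system entry is still accounted for.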