In 1953, the New York State Legislature created the New York City Transit Authority as a public corporation to oversee management and operation of the public transportation system. Now known as MTA New York City Transit, it operates the New York City subway and local and express New York City buses, alongside its sister MTA agencies, the Long Island Rail Road and Metro-North Railroad commuter rail services. The New York City subway system has the highest ridership of any subway in the US and the 7th highest in the world. It provided an average of 5.7 million weekday rides in 2015.
In this lesson, we will examine the turnstile data provided by the MTA. As we will see, turnstile data tracks the number of entries and exits of passengers in the subway system. However, the number of times a turnstile records an entrance or exit does not capture the total number of passengers who use the subway. Passengers who use one-way exit turnstiles are not counted. Passengers often use emergency exits instead of the turnstiles during peak hours or if the turnstiles are in an inconvenient location. People may also “jump the turnstiles” and enter the system without paying a fare (and without registering on the turnstile).
This lesson will examine passengers entering through turnstiles in some of the busiest and largest stations in the MTA subway system.
Side note: The MTA subway system charges one fare regardless of the distance traveled. Washington DC’s Metro subway and Seattle’s light rail system charge based on the number of zones traveled. Passengers must “tap in” when they enter the system and “tap out” when they exit so that the system can determine their fare. One consequence of this payment structure is that those transit systems have better data on the length and destinations of trips.
– Understand the uses and limitations of turnstile data
– Locate New York MTA historic subway turnstile data
– Create graphs to visualize patterns of turnstile data
The MTA Developers website provides a collection of free data resources to allow developers to create applications on top of their data services. They provide real-time data services such as Service Status and BusTime, the MTA real-time bus location service. They also provide static data feeds, such as GTFS schedule data, which does not update frequently. Finally, they host some historical transit data, such as fare card usage and turnstile data. Turnstile data counts the number of passengers entering and exiting the turnstiles of the subway system. Passengers swipe MTA MetroCards at the turnstile to enter the subway platform. We will be using this dataset in this lesson.
Create a Working Directory, a folder on your computer that is easily accessible.
Go to the “Agreement for Access to Metropolitan Transportation Authority (“MTA”) Data Feeds” page to access the data.
Scroll down, and click the link to agree to the Terms of Service, which will reveal a list of data sources available for download.
Scroll down and select “Turnstile Usage Data.”
On the Turnstile Usage Data page, we want the data for Saturday, November 25, 2017, which is the file turnstile_171125.txt. [link]. This file covers the week that begins on November 18, 2017.
On Windows, Right-Click on “November 25, 2017”, and save the file to your Working Directory.
On MacOS, Control-Click on “November 25, 2017”, and save the file to your Working Directory.
Finally, for the last file we need, download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 07-subway-turnstiles.R file to your Working Directory.
When you open RStudio, you might see files and data frames left open from the last time you used it. If that is the case, close all the files in the Source Pane by selecting File > Close All. Then, remove any Environment variables by selecting Session > Clear Workspace.
Open the file in RStudio, by selecting File > Open File > 07-subway-turnstiles.R.
Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. This step sets our Working Directory folder as the default location for our current session in RStudio. Any files we create in RStudio get saved to this location.
As we have done in previous modules, place the cursor on the first line of code. When ready, click on the Run button to execute a line of code. Note that comments are not executed; RStudio skips over them until it reaches the first line of code.
If you have not installed ggplot2 in a previous module, you can install it now. Otherwise, you can skip this line.
# Install ggplot (this is only required once)
install.packages('ggplot2', dependencies = TRUE)
Next, we load ggplot2 in RStudio. We will use this package later to graph our findings.
# Load ggplot into current session
library(ggplot2)
Read through and then click Run to execute the next three lines of R. First, we load the turnstile data file into RStudio as the rawturnstile data frame. Then we create a working copy of the data in a data frame named turnstile. With the third line, we open the data frame in a new tab in the Source Pane.
# Read in MTA turnstile csv data file
rawturnstile <- read.csv(file="./turnstile_171125.txt", header=TRUE, sep=",")
# Create a working dataframe
turnstile <- rawturnstile
# Open the working data frame in a new tab in the Source Pane
View(turnstile)
The MTA provides field descriptions of the files. However, the documentation on a dataset often requires further research before its contents can be understood. A good resource for asking questions and learning more about MTA datasets is the MTA Developer Resources Google Group. Information from the discussion forum provided additional details to the Turnstile Usage Data field descriptions.
View the MTA turnstile data. It contains the following columns:
C/A – The Control Area is the operator booth in a station. Some stations only have one operator booth. However, larger stations may have more than one.
UNIT – The Remote Unit, which is the collection of turnstiles. A station may have more than one Remote Unit.
SCP – The Subunit Channel Position represents the turnstile, and the number used may repeat across stations. The UNIT and SCP together are a unique identifier of a turnstile.
DATE – The Date is the date of the recording with the format MM/DD/YYYY.
TIME – The Time is the time for a recording, with the format: HH:MM:SS.
DESC – The DESC is the type of event of the reading. The turnstiles submit “Regular” readings every four hours. The exact time of the readings is staggered across the turnstiles and stations, which avoids having all the turnstiles update simultaneously. “Recover Audit” designates scheduled readings taken after a communication outage. Our analysis uses “Regular” and “Recover Audit” readings. We discard other values such as “DoorClose” and “DoorOpen”, which represent unscheduled maintenance readings.
ENTRIES – The ENTRIES are a cumulative count of turnstile entrances. Note that the ENTRIES do not reset each day or for each recording period. The turnstile entry count continues to increase until it reaches the device limit and then resets to zero.
EXITS – The EXITS are a cumulative count of the turnstile exits.
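To make the column layout concrete, here is a minimal sketch that reads two made-up rows laid out like the turnstile file. The values are invented for illustration and are not real MTA readings, and the header row (including the STATION, LINENAME and DIVISION columns, which are not described above) is an assumption about the file's layout. It also shows that read.csv, by default, turns the C/A header into C.A, because / is not allowed in an R name.

```r
# Made-up sample rows in the same comma-separated layout (not real MTA data).
# The header row is an assumption about the file's layout.
sample_text <- "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
A002,R051,02-00-00,59 ST,NQR456W,BMT,11/18/2017,03:00:00,REGULAR,100,50
A002,R051,02-00-00,59 ST,NQR456W,BMT,11/18/2017,07:00:00,REGULAR,112,61"

# text= lets read.csv parse a string instead of a file on disk
sample <- read.csv(text = sample_text, header = TRUE, sep = ",")

# By default read.csv makes syntactically valid names, so "C/A" becomes "C.A"
print(names(sample))

# Show the type R assigned to each column
str(sample)
```

This is why the first column is easier to work with after it has been given a simpler name.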
Next, we will clean and prepare the data for analysis. Read and Run the next block of R. First, we rename the first column to something more meaningful. In the next line, the which command in R selects only the REGULAR and RECOVR AUD readings.
# Rename CA column (the first column only)
colnames(turnstile)[1] <- "BOOTH"
# Keep Only Regular and Recovery Audit readings
turnstile <- turnstile[which(turnstile$DESC == "REGULAR" | turnstile$DESC == "RECOVR AUD"),]
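As an aside, the same kind of filter can be written with the %in% operator, which tests each value against a set of allowed values. The sketch below uses a small invented data frame (the SCP and DESC values are illustrative only) rather than the real turnstile data.

```r
# Toy data frame with made-up readings
readings <- data.frame(
  SCP  = c("00-00-00", "00-00-00", "00-00-01", "00-00-01"),
  DESC = c("REGULAR", "DOOR OPEN", "RECOVR AUD", "REGULAR"),
  stringsAsFactors = FALSE
)

# The reading types we want to keep
keep <- c("REGULAR", "RECOVR AUD")

# %in% returns TRUE for rows whose DESC is one of the kept values
filtered <- readings[readings$DESC %in% keep, ]
print(filtered)
```

Both spellings keep the same rows; %in% is simply shorter when the list of allowed values grows.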
In the turnstile dataframe, look at the Entries and Exits columns.
The turnstiles record each time a passenger enters or exits through them. Every four hours, they send a time-stamped running tally of the entries and exits. The data shows that the tally increases with every reading. The turnstiles reset when their counters reach their maximum limit. To determine how many people entered through a turnstile during an interval, we subtract the previous time-stamped Entries reading from the current Entries reading.
Read the next comment and line of code. R has a built-in function, diff, which outputs the difference between consecutive rows. The ave function applies a function to each subset of the data, here each combination of booth and turnstile. By combining the diff and ave functions, we can determine the number of entries at each turnstile between readings.
For in-depth explanations of the diff and ave functions, see their R help pages (run ?diff and ?ave in the console).
# The first argument is the data to be operated on, the next arguments define the groups, and the last argument applies diff(x) to each group.
turnstile$diff <- ave(turnstile$ENTRIES, turnstile$BOOTH, turnstile$SCP, FUN=function(x) c(0, diff(x)))
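To see what this line does, here is a small sketch on invented counter values for two turnstiles in one booth. The booth and turnstile labels and the counts are made up for illustration.

```r
# Toy cumulative counters for two turnstiles (made-up values)
counts <- data.frame(
  BOOTH   = c("A002", "A002", "A002", "A002", "A002", "A002"),
  SCP     = c("02-00-00", "02-00-00", "02-00-00",
              "02-00-01", "02-00-01", "02-00-01"),
  ENTRIES = c(100, 112, 140, 5000, 5003, 5010),
  stringsAsFactors = FALSE
)

# Within each BOOTH/SCP group, subtract the previous reading from the
# current one; the first reading of each group has no predecessor, so it gets 0
counts$diff <- ave(counts$ENTRIES, counts$BOOTH, counts$SCP,
                   FUN = function(x) c(0, diff(x)))

print(counts$diff)
# → 0 12 28 0 3 7
```

Note that the jump from 140 to 5000 between the two turnstiles does not produce a difference of 4860, because ave restarts the calculation for each group.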
We now have the number of passengers entering each turnstile over each time interval. However, on occasion, the calculation produces a negative number of passengers. Outages in the transmission, lapses in communication between the turnstile and the MTA backend servers, counter resets, and maintenance on the turnstile cause these readings.
Run the next line of R to remove these readings from our data.
# Remove negative entries (this test also drops intervals with zero entries)
turnstile <- turnstile[which(turnstile$diff > 0),]
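The following sketch, using invented counter values, shows how a counter reset produces a negative difference and what the filter keeps. The numbers are illustrative assumptions only.

```r
# Made-up cumulative readings; the counter resets between the 2nd and 3rd
entries <- c(990, 998, 3, 10)

# Difference from the previous reading, with 0 for the first reading
d <- c(0, diff(entries))
print(d)
# → 0 8 -995 7

# Keep only positive differences, as the lesson's filter does
kept <- d[which(d > 0)]
print(kept)
# → 8 7
```

The reset interval is discarded entirely, so the entries that occurred during it are lost; for a weekly comparison of busy stations this is usually a small share of the total.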
With that, our data is prepared for analysis.
Four major stations of the subway system are 34th St Penn Station, 42nd St Port Authority, 42nd St Grand Central, and Atlantic Avenue-Barclays Center. These are some of the busiest stations in the system, with connections to various other modes of transit such as Amtrak, the Long Island Rail Road, MTA Metro-North regional trains, and regional bus service. Let’s compare the ridership of these stations over a given week.
Run this line of R. It selects all the readings from the desired stations and discards the other stations.
# Select Terminals
terminals <- turnstile[which(turnstile$STATION == "34 ST-PENN STA" | turnstile$STATION == "42 ST-PORT AUTH" | turnstile$STATION == "GRD CNTRL-42 ST" | turnstile$STATION == "ATL AV-BARCLAY"),]
Then we calculate the Entries by our four stations and day. The aggregate function groups all our data into subsets and produces summary statistics on each subset. The first parameter of aggregate defines the variable we are summarizing. In our example, diff is the variable to summarize and the variables STATION and DATE define our groupings. The second parameter is the data frame. The third parameter is the summary statistic we want to calculate, which is the sum of the subsets. The final parameter defines what to do with missing values. Here, na.rm=TRUE removes any missing values.
Click Run and execute the next block of R to calculate the number of entries for our stations by day and view our new data frame.
# Sum all the entries by station and day. The first parameter of aggregate is a formula: diff is the variable to summarize, and STATION and DATE define the groups. The second parameter is the data frame. The third parameter is the summary statistic, in this case the sum of each subset. The final parameter defines what to do for missing values. na.rm=TRUE removes missing values.
daily_entries_by_station <- aggregate(diff ~ STATION + DATE, data=terminals, FUN=sum, na.rm=TRUE)
colnames(daily_entries_by_station) <- c("STATION", "DATE", "ENTRIES")  # name the summed column ENTRIES
View(daily_entries_by_station)
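Here is a toy version of the aggregation on an invented data frame, so the grouping is easy to check by hand. The station names "A" and "B" and all the numbers are made up; the formula uses bare column names, which aggregate looks up in the data frame passed as data.

```r
# Toy per-interval entries for two made-up stations over two days
toy <- data.frame(
  STATION = c("A", "A", "A", "B", "B", "B"),
  DATE    = c("11/18/2017", "11/18/2017", "11/19/2017",
              "11/18/2017", "11/19/2017", "11/19/2017"),
  diff    = c(10, 20, 5, 7, 1, 2),
  stringsAsFactors = FALSE
)

# Sum diff within each STATION/DATE combination
daily <- aggregate(diff ~ STATION + DATE, data = toy, FUN = sum, na.rm = TRUE)

# Give the summed column a clearer name
colnames(daily)[3] <- "ENTRIES"
print(daily)
```

Station A on 11/18/2017 has two intervals of 10 and 20 entries, so its daily total is 30; each of the four station and day combinations becomes one row of the result.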
Look at the new data frame, daily_entries_by_station. For each of our four stations, it displays the number of entries for each day of the week of November 18, 2017.
We can now visualize our data.
Visualize: Graphing our Data
What kind of chart should we use? A line chart is useful when comparing numeric values across time. We will use it to compare each station’s daily entries.
Click Run to execute the next two lines of R. First, we set the ggplot theme. Then, we initialize a ggplot object named entriesgraph, which uses data from daily_entries_by_station. The next parameters define the chart with the date on the x-axis and the number of entries on the y-axis, and instruct ggplot to use STATION to identify the group and assign the color for each line in the chart. The second line of R draws the chart to the Plots tab.
# Make Subway Ridership Linegraph
# Initialize a ggplot using the daily_entries_by_station dataframe, and define axes
entriesgraph <- ggplot(daily_entries_by_station, aes(x=DATE, y=ENTRIES, group=STATION, color=STATION))
The plot will appear in the Plots tab of the bottom-right pane.
To see the plot in a larger window, click on Zoom.
Next, Run the next lines to render the chart and assign colors to our stations.
entriesgraph <- entriesgraph + geom_point() + geom_line() + scale_color_manual(name="LINE",values=c("42 ST-PORT AUTH"="blue","34 ST-PENN STA"="green","GRD CNTRL-42 ST"="orange","ATL AV-BARCLAY"="red"))
# Draw the chart in the Plots tab
entriesgraph
Finally, Run these lines to add a title and redraw the graph.
entriesgraph <- entriesgraph + ggtitle("MTA Subway Daily Turnstile Entries Week of 11-18-2017")
# Redraw the chart with its title
entriesgraph
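One optional refinement, which is not part of the lesson's script: the DATE column holds text in MM/DD/YYYY format, so the x-axis is ordered as text. Within a single week of one year that gives the right order, but converting to R's Date class makes the ordering chronological in general and makes it easy to look up the day of the week. A minimal sketch with hand-typed dates from the week in question:

```r
# Dates as they appear in the DATE column (text, MM/DD/YYYY)
dates_text <- c("11/24/2017", "11/18/2017", "11/21/2017")

# Convert to Date objects; the format string describes the incoming text
dates <- as.Date(dates_text, format = "%m/%d/%Y")

# Date objects sort chronologically
print(sort(dates))

# ISO weekday number: 1 = Monday ... 7 = Sunday
print(format(dates, "%u"))
# → "5" "6" "2"
```

So November 24 is a Friday, November 18 a Saturday, and November 21 a Tuesday, which helps when reading the chart day by day.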
Our data starts with November 18th and 19th, a Saturday and Sunday respectively. In general, MTA subway ridership is lower on weekends because weekday commuters do not come into New York. Monday and Tuesday show the expected jump in subway entries as commuters return to work. What happened during the rest of the week? Thanksgiving was observed on Thursday, November 23rd. The slight drop on Wednesday suggests people taking the day off before the holiday. The drop on Thursday to weekend levels suggests people stayed home, which is not surprising. By Friday, riders returned, but to a level lower than Wednesday and far lower than Monday or Tuesday. The people using the subway on Friday might be a combination of people who did not take Friday as a holiday and people traveling for Black Friday shopping.
Looking at the graph, a narrative emerges to explain the fluctuation of subway entries. How can we test this narrative? One period to investigate is a “typical” week, which does not contain a holiday. The weeks before and after Thanksgiving are suitable candidates because they would share any seasonal trends, for instance the times of year when tourism peaks. Other useful weeks to explore are the Thanksgiving weeks of prior years. Fortunately, the MTA provides turnstile data going back to 2010. Running the same analysis on prior Thanksgiving weeks would reveal whether similar patterns appear, adding evidence for our explanation of riders' behavior. It might also raise new questions and narratives. We invite you to try it and find out.
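To repeat the analysis for another week, one option is to wrap the steps of this lesson in a function that takes a file path and a list of stations. The sketch below is an illustration, not part of the lesson's script: the function name daily_entries_for is hypothetical, and the demo file contains invented rows in an assumed file layout rather than real MTA data.

```r
# Hypothetical helper that repeats the lesson's steps for any weekly file
daily_entries_for <- function(path, stations) {
  raw <- read.csv(file = path, header = TRUE, sep = ",")
  colnames(raw)[1] <- "BOOTH"                           # rename the C/A column
  raw <- raw[raw$DESC %in% c("REGULAR", "RECOVR AUD"), ] # keep scheduled readings
  # Entries per interval for each booth and turnstile
  raw$diff <- ave(raw$ENTRIES, raw$BOOTH, raw$SCP,
                  FUN = function(x) c(0, diff(x)))
  raw <- raw[which(raw$diff > 0), ]                     # drop negative and zero readings
  raw <- raw[raw$STATION %in% stations, ]               # keep the requested stations
  out <- aggregate(diff ~ STATION + DATE, data = raw, FUN = sum, na.rm = TRUE)
  colnames(out)[3] <- "ENTRIES"
  out
}

# Demo on a temporary file of made-up rows (not real MTA data)
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS",
  "R238,R046,00-00-00,GRD CNTRL-42 ST,4567S,IRT,11/18/2017,00:00:00,REGULAR,1000,500",
  "R238,R046,00-00-00,GRD CNTRL-42 ST,4567S,IRT,11/18/2017,04:00:00,REGULAR,1040,520",
  "R238,R046,00-00-00,GRD CNTRL-42 ST,4567S,IRT,11/19/2017,00:00:00,REGULAR,1100,560",
  "A002,R051,02-00-00,59 ST,NQR456W,BMT,11/18/2017,03:00:00,REGULAR,100,50",
  "A002,R051,02-00-00,59 ST,NQR456W,BMT,11/18/2017,07:00:00,REGULAR,130,61"
), demo_file)

demo <- daily_entries_for(demo_file, stations = c("GRD CNTRL-42 ST"))
print(demo)
```

With a real download, the call would instead pass the path of another weekly file together with the four station names used in this lesson, and the result could be plotted in the same way as before.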
You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.