3 – Exploring Reliability – Subway Line

1. Introduction

An important component of the performance of a transit route is reliability. When a bus is consistently late or early, or when a large gap forms between buses, passengers miss their connections and cannot depend on being able to arrive at their destination when they expect to.

Reflect: What are different ways to define a transit system’s reliability? Think about a bus, train, or subway that you consider reliable or unreliable. What happened to lead you to this evaluation?
Is it because it always or never arrives at the scheduled time? Is it because you always feel like you are waiting longer than you are supposed to?
There are many ways to measure reliability. This introduction will discuss three methods: scheduled on-time performance, bunching, and wait time reliability.

Scheduled On-time Performance

One simple on-time performance measure is the percentage of vehicles that arrive within a defined window of time before and after the scheduled time. Each transit system defines its own window, that is, how early or late a vehicle can be and still count as on time.

On-time Performance = (# of vehicles arriving on time) / (total # of vehicle arrivals)

For example, the bus could be deemed on-time if it arrives between one minute before and five minutes after the time it is supposed to reach its stop. On-time performance is a good measure of how well a transit agency operates and schedules its vehicles, and for surface routes, how well the city’s department of transportation manages traffic congestion along the route. While this measure is easy to understand, it does not fully describe the rider experience. For example, if a train arrives on time 90% of the time, the number of delayed passengers is unknown. That is, it does not take into account the number of riders impacted by a train being early or late or how long they had to wait.
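The formula above can be sketched in a few lines of R. The schedule deviations below are made-up values, and the on-time window (one minute early to five minutes late) is taken from the example in the text; a real agency may define a different window.

```r
# Hypothetical schedule deviations in minutes (actual minus scheduled arrival).
# Negative values mean the vehicle arrived early.
deviation_min <- c(-2, 0, 1, 3, 6, -0.5, 4, 8, 2, 5)

# Assumed on-time window: up to 1 minute early, up to 5 minutes late.
early_limit <- -1
late_limit <- 5

# TRUE for each arrival that falls inside the window.
on_time <- deviation_min >= early_limit & deviation_min <= late_limit

# On-time performance = on-time arrivals / total arrivals.
on_time_performance <- sum(on_time) / length(deviation_min)
print(on_time_performance)
```

Note that this counts vehicles, not riders, which is the limitation described above.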


Bunching

Have you ever waited a long time for a bus, only to have two buses come one right after the other? That’s bunching. Have you ever arrived on a subway platform, just missing a train, only to have another train (usually empty) arrive immediately after it? That’s bunching working in your favor. In general, bunching occurs when transit vehicles that are planned to be evenly spaced get off schedule. Poor operational discipline, traffic, passengers, and other external factors may cause bunching. The result is that a few riders have a surprisingly short wait, and a larger number of riders face a longer than expected wait time.

In general, bus or train bunching is not itself a problem for riders; the problem is the large gap that precedes or follows the bunch. So why not measure gaps instead? Measuring bunches as a proxy for large gaps gives a more conservative estimate of the problem, and does not penalize the transit system for vehicles with malfunctioning location data.

A good resource on bus bunching is Bus Turnaround, a website and campaign focused on the reliability of New York’s bus system.
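As a rough illustration of how bunches and gaps could be flagged from arrival times, here is a minimal sketch in R. The arrival times, the scheduled headway, and both thresholds (25% and 150% of the scheduled headway) are assumptions chosen for the example, not an agency standard.

```r
# Hypothetical arrival times at one stop, in minutes after 8:00 AM.
arrival_min <- c(0, 10, 11, 30, 31, 40)

# Assumed planned spacing between vehicles, in minutes.
scheduled_headway <- 10

# Headway = time between consecutive arrivals.
headway <- diff(arrival_min)

# Assumed thresholds: "bunched" if the headway is under 25% of the schedule,
# "large gap" if it is over 150% of the schedule.
bunched <- headway < 0.25 * scheduled_headway
large_gap <- headway > 1.5 * scheduled_headway

# Share of headways that count as bunched.
bunching_rate <- mean(bunched)
print(data.frame(headway, bunched, large_gap))
```

In this toy data each bunch sits next to a long headway, which is the gap that riders actually experience.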

Wait Time Reliability

The Massachusetts Bay Transportation Authority (MBTA) uses a custom measure, Wait Time Reliability, to determine the reliability of its transit system. Wait Time Reliability estimates the percentage of passengers who waited no longer than the scheduled interval between trains. For example, on March 31, 2016, the MBTA Green Line had a Wait Time Reliability of 90%, which means 90% of riders waited no longer than the scheduled time between trains. This metric therefore describes the rider experience. On the MBTA subway, trains usually arrive 5 minutes apart during peak rush hours and 8-10 minutes apart during other times. The MBTA Data Blog has a detailed explanation of their Subway Reliability calculation.
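To make the definition concrete, here is a small sketch in R that computes the share of riders whose wait did not exceed the scheduled interval. The individual wait times are invented for illustration; the MBTA estimates passenger waits from its own data rather than observing them one by one.

```r
# Hypothetical wait times (minutes) for ten riders at one station during the peak.
wait_min <- c(1, 2, 3, 4, 5, 5, 6, 2, 9, 3)

# Assumed scheduled time between trains, in minutes.
scheduled_headway <- 5

# A rider is counted as served reliably if the wait was no longer than scheduled.
within_schedule <- wait_min <= scheduled_headway

# Wait Time Reliability = riders within schedule / all riders.
wait_time_reliability <- sum(within_schedule) / length(wait_min)
print(wait_time_reliability)
```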

This lesson introduces using transit data to measure the reliability of a transit system. In this lesson, we will be using R and RStudio. If you haven’t set up these tools, please refer to Getting Started: Tools to learn how to install R and RStudio.

We will use MBTA Wait Time Reliability data, which we will download from their site. In particular, we will examine the MBTA Green Line, which consists of four different branches: B, C, D, and E. The Green Line is the oldest line in Boston, and still uses a portion of the route which first opened in 1897.

2. Learning Outcomes

  • Understand how to measure the performance and reliability of a transit line
  • Locate MBTA reliability data
  • Wrangle the MBTA reliability data into a form suitable for analysis
  • Create a map of the reliability of the MBTA Green Line stations

3. Finding Data

We will be using MBTA Wait Time Reliability data to examine the peak hour reliability of each station served by Boston’s Green Line for this lesson. After that, we will map each station’s reliability.
We will only be looking at one month of the year to keep the data a manageable size. If you wish, you can repeat the exercise on your own using other months to get a sense of what a “typical” month looks like in terms of reliability, and what an atypical month looks like.
Go to the MBTA Back on Track Dashboard website which contains the system’s Reliability, Ridership, Financial, and Customer Satisfaction data. Explore the Dashboard and try to get an idea of the types of data and information that are available.

4. Set Up

Update: The MBTA recently updated their Reliability data format. While we work to update this lesson, please download this TDashboardData_reliability_20160301-20160331.csv file. You will have to unzip the file and move it to your Working Directory.
  • Create a Working Directory folder where you will save all files for this lesson.
  • Download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 03-reliability.R file to your Working Directory.
  • Go to the More Data section of this resource
  • Make sure “Reliability” is selected.
  • Then select the date range March 1, 2016 to March 31, 2016.
  • Click on Download.
  • The website will export the requested data into a csv file onto your computer.
  • The files will automatically have the title: TDashboardData_reliability_20160301-20160331.csv.
  • Move the csv file to your Working Directory.
Originally, we looked into using data for March 2017; however, data was missing for part of the month. We inquired with the MBTA via the MBTA Developers Google Group and were referred to a blog post describing a problem capturing data for the Green Line from February 8th 2017 to March 2nd 2017. Missing or incomplete data is a frequent challenge in data wrangling.

Figure 1. MBTA Performance Dashboard – More Data

  • The last step is to download the Data Dictionary. Click on the link PDF Data Dictionary.
  • Open the Data Dictionary. This PDF describes all the fields in the dataset. MBTA did a good job documenting their data. You may find other datasets have poor documentation. In these cases, you will need to investigate what the fields mean by searching the internet or contacting the organization that produces the data.

Read the definitions of “OTP_NUMERATOR” and “OTP_DENOMINATOR” for “rail.” What data do these fields contain?

The OTP_NUMERATOR for rail is the estimated number of passengers on that day for that transit station whose wait time was no longer than scheduled.

The OTP_DENOMINATOR for rail is the estimated number of total passengers on that day for that transit station.

5. Data Wrangling

  • Launch RStudio.
  • If you are new to RStudio or haven’t used it recently, take a look at our section Getting Started: Tools for a brief overview of R.
  • When RStudio opens, if there are files and a data frame open from a previous session, close them in the Source Pane by selecting: File > Close All or by manually closing each tab. Clear any data in the Environment tab by selecting: Session > Clear Workspace.

Now we are ready to start working with RStudio to prepare our data.

  • Open the file in RStudio. Select: File > Open File > 03-reliability.R

Next, set the Working Directory. The Working Directory is where RStudio will look for files. To Source Location means that RStudio will look for other files in the same folder as the source file, in this case, where we saved 03-reliability.R.

  • Select: Session > Set Working Directory > To Source Location

Read the first two lines of our R program in the Source Pane.

# Install leaflet (this is only required once)
install.packages('leaflet', dependencies = TRUE)

  • Click on the first line to place the cursor there. The first line is a comment which begins with the # symbol, and contains an explanation of the nearby code. While running the script, R ignores all comments.
  • Click the Run button twice to run the next two lines of the script.

This line installs the R package leaflet. Packages provide extra features that are not included in the basic installation of R. They only need to be installed once, so if you run this program again, you do not have to run these two lines again. leaflet is a popular package for working with spatial data and generating maps.

  • Run the next two lines of R. (Remember R skips the comments.) Although we installed the package, we still need to load it into R in order to use it. Unlike installing packages, we load the packages every time we run this script.

# Load library
library(leaflet)

  • Run the next two lines of R.

# Read in MBTA performance csv file
rawdata <- read.csv(file = "./TDashboardData_reliability_20160301-20160331.csv", header = TRUE, sep = ",")
View(rawdata)

What do you see?

The first line of R code reads the contents of our csv file and saves them to the data frame “rawdata.”

The second line of R code, the View command, opens the rawdata into a new tab in the Source Pane.

Examine the rawdata tab. Read through each field. Can you identify which fields contain information about Peak/Off-Peak hours, the Green Line, Station names, and the Reliability Metric?

Often understanding these field names is a challenge. If a field name is not immediately clear, a good first step is to go back to the data source and look to see if the data provider offers any documentation on how the data are named.

Going back to MBTA Back on Track, the website contains a Data Dictionary, which contains definitions of each column of data.
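Besides scrolling through the View tab, a few base R functions can help when exploring unfamiliar fields. The sketch below uses a tiny made-up stand-in for the downloaded file: the column names come from this lesson, but the rows and values (including the shortened "Peak" and "Off-Peak" labels) are invented so that the example runs on its own.

```r
# A tiny stand-in for the downloaded csv file. Values are made up;
# only the column names are taken from the lesson.
sample_csv <- "SERVICE_DATE,PEAK_OFFPEAK_IND,ROUTE_TYPE,ROUTE_OR_LINE,STOP,OTP_NUMERATOR,OTP_DENOMINATOR
2016-03-01,Peak,Green Line,Green Line B,Park Street ,900,1000
2016-03-01,Off-Peak,Green Line,Green Line C,Kenmore,400,500
2016-03-01,Peak,Red Line,Red Line,Harvard,850,1000"
sample_data <- read.csv(text = sample_csv, header = TRUE, stringsAsFactors = FALSE)

# List the field names, then show each field's type and first few values.
field_names <- names(sample_data)
print(field_names)
str(sample_data)

# List the distinct values in the fields we expect to filter on.
route_types <- unique(sample_data$ROUTE_TYPE)
peak_values <- unique(sample_data$PEAK_OFFPEAK_IND)
print(route_types)
print(peak_values)
```

Running the same calls on rawdata shows the exact labels used in the real file, which matters because the filters later in the script match on those exact strings.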

  • Click Run five times to run the next comments and lines of code below.
  • First, we select the reliability for peak data. Then, we remove the columns of data that we do not need for this analysis. Because we are only looking at the Green Line, the next line removes the other subway lines. Finally, we remove extra spaces from the STOP column, which contains the station stop names. This final step will be important later in the lesson when we match the station stop names with names from another dataset.
  • Look at the data frame to see if the data matches your expectations.

Note: Experimenting is important and helpful. If anything looks strange or wrong, you can always go back to the top of the script and Run through each line of R code again.

# Select Peak Service rows
reliablity <- rawdata[which(rawdata$PEAK_OFFPEAK_IND == "Peak Service (Weekdays 6:30-9:30AM, 3:30PM-6:30PM)"), ]
# Remove columns of data we don't need
# Select only the Green Line data
reliablity <- reliablity[which(reliablity$ROUTE_TYPE == "Green Line"), ]
# Remove any extra spaces at the end of the station names
reliablity$STOP <- trimws(reliablity$STOP, which = "right")

Figure 2. reliability Data Frame
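The listing above includes a comment for removing unneeded columns, and later steps rely on a column named rely, but the code for those two steps is not shown. The following is only a sketch of what they might look like. It assumes that the columns to keep are the ones used later in the lesson, and that rely is OTP_NUMERATOR divided by OTP_DENOMINATOR, with the numerator counting riders who waited no longer than scheduled (as the Wait Time Reliability definition in the introduction implies). The toy data frame stands in for the filtered data so the example runs on its own.

```r
# Toy stand-in for the filtered data frame; all values are made up.
reliablity <- data.frame(
  SERVICE_DATE = c("2016-03-01", "2016-03-01"),
  ROUTE_TYPE = c("Green Line", "Green Line"),
  ROUTE_OR_LINE = c("Green Line B", "Green Line C"),
  STOP = c("Park Street", "Kenmore"),
  OTP_NUMERATOR = c(900, 400),
  OTP_DENOMINATOR = c(1000, 500),
  stringsAsFactors = FALSE
)

# Assumption: keep only the columns that later steps of the lesson use.
keep <- c("SERVICE_DATE", "ROUTE_TYPE", "STOP", "OTP_NUMERATOR", "OTP_DENOMINATOR")
reliablity <- reliablity[keep]

# Assumption: rely is the share of riders who waited no longer than scheduled.
reliablity$rely <- reliablity$OTP_NUMERATOR / reliablity$OTP_DENOMINATOR
print(reliablity)
```

Check the Data Dictionary before reusing this: if the numerator counts riders who waited too long instead, the reliability would be one minus this ratio.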

Question: Go to the rawdata dataframe. Why did we select the ROUTE_TYPE instead of the ROUTE_OR_LINE?

We are interested in the Green Line. Do both fields include the Green Line? If so, are there any differences?

Answer: Looking at the data frame, you can see that in the ROUTE_OR_LINE field, the B, C, D, and E branches of the Green Line are listed separately, whereas in ROUTE_TYPE, the Green Line has the same name for each branch. We do not need the specific branch for this analysis. Therefore, we will use the ROUTE_TYPE field. Note: You might discover that we need the specific branch to answer a question in the future. In this case, you will have to go back and edit your R program. This iterative process occurs often in data wrangling, and is another reason to write good comments as you program.
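One way to compare two fields like these without reading every row is to cross-tabulate them. The sketch below uses a made-up data frame; the branch labels such as "Green Line B" are hypothetical and may not match the exact strings in the MBTA file.

```r
# Made-up rows; the branch labels are hypothetical.
sample_data <- data.frame(
  ROUTE_TYPE = c("Green Line", "Green Line", "Green Line", "Red Line"),
  ROUTE_OR_LINE = c("Green Line B", "Green Line C", "Green Line B", "Red Line"),
  stringsAsFactors = FALSE
)

# Count rows for every combination of the two fields.
branch_counts <- table(sample_data$ROUTE_TYPE, sample_data$ROUTE_OR_LINE)
print(branch_counts)

# List the distinct ROUTE_OR_LINE values that fall under the Green Line.
green_rows <- sample_data$ROUTE_TYPE == "Green Line"
green_branches <- unique(sample_data$ROUTE_OR_LINE[green_rows])
print(green_branches)
```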

6. Joining Data

We now have the Reliability for each rapid transit station on the Green Line. We want to present this data on a map.

Do we have enough data to make a map? If not, what additional data do we need and where might we find it?

As we saw in Lesson 1, in order to create a map, we need to get the location coordinates of the stations. However, location data is not available in the Reliability data from the MBTA Back On Track site. We also need to know which stations make up a line. You will often find that there isn’t a single dataset with all the information you need. Therefore, you will need to combine multiple datasets to be able to perform your analysis.

Download the file mbta_stations.csv. This file contains the station names, the line, and the location coordinates. This file was created from a shapefile downloaded from MassGIS, the Office of Geographic Information of the state of Massachusetts. A shapefile is a common geographic vector file format used in GIS. For your convenience, we used QGIS to extract the data for you to use in this lesson.

Question: What fields (columns) do mbta_stations.csv and our current dataframe have in common?

Answer: The field in common is the station name. Therefore, the next step is to match the coordinate data with reliability data based on the station name. This operation is called a “join,” and fortunately, R provides an easy way to accomplish this step.

  • Read and step through the next block from the R script to prepare our data.
  • We first import the csv file and select the Green Line stations. Then, we change the column name from STATION to STOP. If you recall, in our reliability data frame, the STOP column contains the station names. In order to join the two tables, the column that both data frames have in common must have exactly the same name. That way, R knows which columns to match.

# Import the MBTA station location data
stations <- read.csv("./mbta_stations.csv")
# Select only the Green Line data
locs <- stations[which (stations$LINE=='GREEN'),]

Figure 3. MBTA station location data

# Change the name of Station to Stop to match the other data frame
colnames(locs)[2] <- "STOP"

Now we are ready to perform the join. After joining the two datasets, we will save only the columns that we need for our analysis.

  • Run through the next block of R

# Join the location data to the reliability file
joindata <- merge(x = reliablity, y = locs, by = "STOP", all.x = TRUE)
# Select only the columns needed for analysis
joindata <- joindata[c("STOP", "SERVICE_DATE", "ROUTE_TYPE", "rely", "LONGITUDE", "LATITUDE")]
# Review the data to ensure that we selected the proper columns
View(joindata)

Review the joindata data frame. For each station, we now have the service date, route type, rely, longitude, and latitude.

Figure 4. joindata Data Frame
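Because the join matches on the exact station name, it is worth checking whether any stations failed to find coordinates. With all.x = TRUE, unmatched rows are kept and their coordinate columns are filled with NA. The sketch below shows the idea on toy data; the station rows, rely values, and coordinates are made up for illustration.

```r
# Toy reliability data; values are made up.
reliability_toy <- data.frame(
  STOP = c("Park Street", "Kenmore", "Science Park"),
  rely = c(0.90, 0.80, 0.85),
  stringsAsFactors = FALSE
)

# Toy location data that is missing one station; coordinates are illustrative.
locs_toy <- data.frame(
  STOP = c("Park Street", "Kenmore"),
  LONGITUDE = c(-71.0624, -71.0952),
  LATITUDE = c(42.3564, 42.3489),
  stringsAsFactors = FALSE
)

# Left join: keep every reliability row, even without a location match.
joined <- merge(x = reliability_toy, y = locs_toy, by = "STOP", all.x = TRUE)

# Stations with no coordinates after the join.
unmatched <- joined$STOP[is.na(joined$LONGITUDE)]
print(unmatched)
```

An unmatched name often points to a spelling, capitalization, or spacing difference between the two datasets, which is why the trailing spaces were trimmed earlier.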

7. Visualize

  • Run the next lines of R to create our map.

First, we create a color palette colorpal which defines the colors for the markers on the map.

Next, we define the parameters for our map. The setView parameters define the location of the map, which is Boston, and the initial zoom level. addCircles creates the circles which mark each Green Line stop. The placement of the markers comes from the longitude and latitude data found in the joindata data frame. The color of the points is a gradient of green, based on the rely value. The green’s saturation increases as the reliability increases.

# Map the Green Line stations
# Create a ColorBrewer Greens color palette.
colorpal <- colorNumeric(palette = "Greens", domain = joindata$rely)

# Lat Long coordinates from www.latlong.net
mbta_subway <- leaflet(joindata) %>%
  addTiles() %>%
  setView(-71.057083, 42.361145, zoom = 12) %>%
  addCircles(~LONGITUDE, ~LATITUDE, weight = 3, radius = 120,
             color = ~colorpal(rely), stroke = TRUE, fillOpacity = 0.8) %>%
  addLegend("topright", colorpal, values = ~rely,
            title = "MBTA Green Line Reliability, Data Source: MBTA Developer Portal")

  • Click Run to execute the final line of R to display the map in the Plots panel.
  • Click on Zoom in the Plots Pane to see the larger version of the map.

# Show the map
mbta_subway

Figure 5. MBTA Green Line Reliability

8. Analysis

Question: Look at your map. What do you see? Where are people waiting the longest?

Answer: We see longer wait times towards the eastern part of Boston near the Financial District. This is downtown Boston which is quite busy during working hours and might explain some of the delays. On the western end of the Green Line, the subway appears to be more dependable. However, this map by itself cannot explain if there is a pattern or any geographic influence on reliability.
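The map can be complemented with a simple ranking. Since the data contains one row per station per service date, averaging rely by station gives one number per station that can be sorted. This is a sketch on made-up values, not results from the MBTA data, and a plain mean of daily ratios is only one of several reasonable ways to summarize a month.

```r
# Toy stand-in for joindata with two service dates per station; values are made up.
joindata_toy <- data.frame(
  STOP = c("Park Street", "Park Street", "Kenmore", "Kenmore", "Riverside", "Riverside"),
  rely = c(0.80, 0.70, 0.90, 0.86, 0.95, 0.93),
  stringsAsFactors = FALSE
)

# Average reliability per station.
station_mean <- aggregate(rely ~ STOP, data = joindata_toy, FUN = mean)

# Sort so that the least reliable station comes first.
station_mean <- station_mean[order(station_mean$rely), ]
least_reliable <- as.character(station_mean$STOP[1])
print(station_mean)
```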

The Green Line Extension will extend the route into Somerville and Medford and has a planned completion date of 2021.

9. Further Reading

Beyond “On-Time Performance” by Human Transit

Bus Turnaround
Explaining Dashboard Metrics: Subway Reliability by MBTA Data Blog

You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.