2 – Exploring Ridership

1. Introduction

Ridership, the number of people who use a transit route or line, is a fundamental metric of transportation systems. There are multiple ways to define and measure it. The Massachusetts Bay Transportation Authority (MBTA) defines ridership as the average number of boardings per weekday in a month. One trip may consist of multiple boardings: a trip that consists of a bus ride to a subway to another bus counts as three boardings. Therefore, ridership does not measure the total number of trips by passengers. The New York Metropolitan Transportation Authority (MTA) defines subway ridership as the number of people who enter a subway station, and bus ridership as the number of paying riders who board a bus. Boardings and entries are easier to track because a payment is generally required, and payments are something transit authorities can count. However, not knowing when a passenger exits the vehicle limits the accuracy of ridership measurement. Northern California’s heavy rail system Caltrain performs a physical head count once a year.
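The difference between boardings and trips can be shown with a little arithmetic. The sketch below is only an illustration: the three trips are invented, and the variable names (trips, total_trips, total_boardings) are not part of the lesson script.

```r
# Each element is one passenger trip, listed as the vehicles boarded in order.
# These three trips are invented for illustration only.
trips <- list(
  c("bus", "subway", "bus"),  # the bus-subway-bus trip from the text
  c("subway"),
  c("bus", "subway")
)

total_trips <- length(trips)            # how many trips were taken
total_boardings <- sum(lengths(trips))  # how many boardings those trips produced

total_trips
total_boardings
```

A boardings-based ridership count would report the second number, not the first.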

Tracking and understanding ridership gives transit authorities insight into how their services are used. In this lesson, we explore the ridership of the MBTA subway system, locally referred to as the T. Boston’s subway is one of the oldest in the country. The Tremont Street Subway tunnel first opened in 1897 and is still in use. In March 2017, approximately 764,000 passengers rode the MBTA subway. This lesson will teach you how to graph the 2016 ridership levels for each T line, from finding and preparing the data to creating and reflecting on the visual display of a fundamental transit system measurement.

2. Learning Outcomes

Locate Ridership data from the MBTA Back On Track website
Prepare MBTA’s Ridership data for analysis
Create a graph of each line’s ridership levels
Reflect on limitations of the ridership data

3. Introduction to Finding Data

The MBTA Performance Dashboard, found at mbtabackontrack.com, provides performance information with easy-to-understand charts and graphs. The site also provides the underlying data that drives the dashboard. We will be using these data to explore the ridership of the MBTA.

First, go to mbtabackontrack.com. The site has data on Reliability, Ridership, Financials, and Customer Satisfaction. Feel free to explore each section. When you are ready, go to the Ridership section. Then click on More Data, where we will download the 2016 ridership data.

Under Ridership, click on 2016 to download the data for that year. Move it to your Working Directory (the folder we create in the Set up section below). Unzip the file, and then rename the weekday_ridership_2016.csv file to: mbta_ridership_2016.csv.

Back on the page where you downloaded the data, click on [PDF] Data Dictionary. This document describes the variables in the data. Fortunately, the MBTA Performance Dashboard provides detailed descriptions of all the data. Often, public datasets do not provide good documentation. In those cases, you may need to do additional research, such as contacting the maintainers of the data, when questions about the data arise.

 Figure 1. MBTA Back on Track Performance Dashboard

4. Set up

Now that we have our data, we need to set up our files and folders.

  • Create a working folder somewhere on your computer that is easy to access. Store all your files here. We will refer to this folder as your Working Directory.
  • Download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 02-ridership.R file to your Working Directory. This file contains the R script for preparing the data.
  • Now we are ready to use R to clean up our data. As before, refer to Getting Started: Tools – R and RStudio Overview [link] if you are new to R or need a refresher.
  • Launch RStudio.
  • When you open RStudio, you might see files and data frames open from the last time you used it. If that is the case, close all the files in the Source Pane by selecting File > Close All. Then, remove any Environment variables by selecting Session > Clear Workspace.
  • Open the file in RStudio, by selecting File > Open File > 02-ridership.R.

You may have to navigate to where you saved the file.

  • Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. This step sets our Working Directory folder as the default location for our current session in RStudio. Any files we create in RStudio will be saved here.
  • Also the Source Pane has the Run button, which we will be using frequently.
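The menu steps above have rough equivalents that can be typed in the Console. The following is a minimal sketch, not part of the lesson script; the names data_file and file_found are made up for illustration, and it assumes the data file has already been renamed as described earlier.

```r
# Remove all objects from the current environment
# (similar to clearing the workspace from the menu)
rm(list = ls())

# Print the folder R is currently using as its working directory
getwd()

# Check whether the renamed data file is in that folder
data_file <- "mbta_ridership_2016.csv"
file_found <- file.exists(data_file)
file_found
```

If file_found is FALSE, the file is either named differently or saved outside the working directory.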

We will also need to install ggplot2, a popular graphics package. R packages provide additional functionality that is not included with the basic installation of R. It is a common programming practice to install and access packages at the start of a script. We will further explain how to use ggplot2 when we are ready to graph our results.

In RStudio, the ggplots will appear in the Plots Tab on the lower right pane. It also contains a Zoom feature, which will open the ggplot in a new and larger window. There is also an Export feature for saving your ggplot to the computer.

Figure 2. RStudio Source Pane, Run button, and Plots tab.

5. Data Wrangling

We’re now ready to start working in R to get our data ready to be analyzed.


  • Click on the Source Pane, and place the cursor on the first line of the script. The # symbol designates the line as an R comment. Comments are not executable code. Programmers use comments to explain to other programmers the intention of the code.


  • Click on the Run button to execute the two lines that follow the comments here to install ggplot2 and load the package. If you already have ggplot2 installed, you may skip the install.packages line.
# Install ggplot2 (this is only required once)

install.packages('ggplot2', dependencies = TRUE)

# Load ggplot2 (this is required in every new R session)

library(ggplot2)
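If you rerun the script often, one common pattern is to install the package only when it is missing. This is a sketch of that pattern and an alternative to the lesson's approach, not what the lesson script does.

```r
# Install ggplot2 only when it is not already available on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", dependencies = TRUE)
}

# Load the package for the current session
library(ggplot2)
```

With this pattern the whole script can be run from the top without reinstalling the package each time.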


The next lines load our ridership data into a data frame and then display it in the Source Pane. A data frame is the standard way R stores a table of data. Review the columns to ensure the data loaded properly. Make sure the file name in the code exactly matches the name of the file you saved and that the file is saved in your Working Directory.

  • Click Run to execute the following line of code, which will import our ridership data.

# Read in MBTA ridership csv file
rawdata <- read.csv(file="./mbta_ridership_2016.csv", header=TRUE, sep=",")
  • Click Run again to display our data frame in the Source Pane.

# View the data frame
View(rawdata)

View the data to ensure it matches what we downloaded from the MBTA Performance Dashboard. Note that the R commands we execute also appear in the Console Pane below the Source Pane.

Figure 3. RStudio rawdata data frame and Console Pane.
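Besides viewing the table, a few base R functions give a quick summary of what was loaded. The sketch below runs on a tiny stand-in file so that it is self-contained: the column names follow the lesson, but the rows, the values, and the names toy_csv and toydata are invented for illustration.

```r
# Write a tiny stand-in CSV to a temporary file.
# Column names follow the lesson; the rows and values are invented.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "SERVICE_MONTH,MODE_TYPE,ROUTE_OR_LINE,AVERAGE_WEEKDAY_RIDERSHIP_COUNT",
  "2016-01-01,RAIL,RED LINE,100",
  "2016-01-01,BUS,ROUTE 1,50",
  "2016-02-01,RAIL,RED LINE,110"
), toy_csv)

toydata <- read.csv(file = toy_csv, header = TRUE, sep = ",")

names(toydata)  # column names
nrow(toydata)   # number of rows read
str(toydata)    # data type of each column
head(toydata)   # first few rows
```

The same four calls can be run on rawdata to compare its columns against the Data Dictionary.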

Our goal is to graph the ridership of each subway line. The next step toward that goal is to select only the ridership data we want. First, in the MODE_TYPE column, we select the rows containing RAIL (that is, subway) and discard the other modes of travel. We copy the selected data into a new data frame called railrides. Copying the raw data to a new data frame preserves the original rawdata data frame should we need to return to it. Then, we select only the columns we need: SERVICE_MONTH, MODE_TYPE, AVERAGE_WEEKDAY_RIDERSHIP_COUNT, and ROUTE_OR_LINE.

  • Click back to the Source Pane.
  • Click Run and execute the following lines of code.
  • Remember, the lines with a # are comments describing what the code underneath it will do. When you click Run, the # lines will be ignored.

# Select Rail (subway) rows

railrides <- rawdata[which (rawdata$MODE_TYPE =="RAIL"),]

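# Select columns for analysis

railrides <- railrides[, c("SERVICE_MONTH", "MODE_TYPE", "AVERAGE_WEEKDAY_RIDERSHIP_COUNT", "ROUTE_OR_LINE")]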
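To see what the row and column selection does, it can help to try it on a very small table first. The following sketch uses an invented data frame: the values, the EXTRA_COLUMN column, and the names toy and toyrail are assumptions for illustration and are not in the MBTA file or the lesson script.

```r
# Invented stand-in for rawdata, with one column we do not need
toy <- data.frame(
  SERVICE_MONTH = c("2016-01-01", "2016-01-01", "2016-02-01"),
  MODE_TYPE = c("RAIL", "BUS", "RAIL"),
  ROUTE_OR_LINE = c("RED LINE", "ROUTE 1", "RED LINE"),
  AVERAGE_WEEKDAY_RIDERSHIP_COUNT = c(100, 50, 110),
  EXTRA_COLUMN = c("a", "b", "c")
)

table(toy$MODE_TYPE)  # rows per mode before filtering

# Keep only the RAIL rows, then only the four columns used in the lesson
toyrail <- toy[which(toy$MODE_TYPE == "RAIL"), ]
toyrail <- toyrail[, c("SERVICE_MONTH", "MODE_TYPE",
                       "AVERAGE_WEEKDAY_RIDERSHIP_COUNT", "ROUTE_OR_LINE")]

unique(toyrail$MODE_TYPE)  # modes remaining after filtering
dim(toyrail)               # number of rows, number of columns
```

Running table() and dim() on rawdata and railrides gives the same kind of before-and-after check on the real data.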


Now that we have selected the rows and columns we need, we clean up their contents so that we can graph the data. First, we rename the columns to make them easier to understand. In R, a factor is a categorical variable, like colors, months, or a good, ok, bad scale. In general, when R imports a csv file, it does a good job of converting the data into the correct data type. That is, it takes an 8 and saves it as an integer rather than as text. However, sometimes R converts text into a factor (that is, categorical data), which is what happened to the Month, Mode, and Line columns in our data. Since version 4.0.0, R imports text columns as characters by default, so on a recent installation these columns may already be characters.

After renaming the columns, the next lines of R code convert the factors into characters (that is, text). The final step is to convert the Month column into a date data type. Place the cursor at the beginning of this block of R code, and Run through each step. You may view the data frame to ensure the data is updating as intended.

  • Run through these lines to make the variable names easier to read.

# Rename column names
colnames(railrides)[1] <- "Month"
colnames(railrides)[2] <- "Mode"
colnames(railrides)[3] <- "Avg_Weekday_Ridership"
colnames(railrides)[4] <- "Line"
  • Run through these lines to convert the Month column to be able to process it as a date.

# Convert columns to characters and date data type
tempdata <- sapply(railrides, is.factor)
railrides[tempdata] <- lapply(railrides[tempdata], as.character)
railrides$Month <- as.Date(railrides$Month, "%Y-%m-%d")

Figure 4. Subway ridership data ready to be graphed
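The conversion from factor to character to date can be checked on a small example. This sketch builds its own data frame with factor columns so the effect is visible on any version of R; the values and the names toy and is_factor are invented for illustration.

```r
# Invented data frame with factor columns, as older versions of R would import text
toy <- data.frame(
  Month = factor(c("2016-01-01", "2016-02-01")),
  Line = factor(c("RED LINE", "RED LINE")),
  Avg_Weekday_Ridership = c(100, 110)
)

sapply(toy, class)  # data type of each column before converting

# Same conversion steps as the lesson code
is_factor <- sapply(toy, is.factor)
toy[is_factor] <- lapply(toy[is_factor], as.character)
toy$Month <- as.Date(toy$Month, "%Y-%m-%d")

sapply(toy, class)  # data type of each column after converting
```

Once Month is a Date, R can order the months correctly and space them properly along the x axis of a graph.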

6. Plotting the data

Now we will make a graph of our data with ggplot2. Let’s break down each step in creating a ggplot, the plot object that ggplot2 produces, to understand how it is made. The ggplot2 package creates a ggplot and uses a + to add layers and themes to it. Each layer contains new graph elements, like defining color or adding a title. As we add layers and themes to our ggplot, the plot() function redraws the ggplot in the Plots tab of RStudio.

The first line initializes the ggplot and defines the data frame we’ll be using (in this case, railrides). The aes() function defines the data for the x axis and y axis, and also how the data will be grouped and colored. We give the ggplot the variable name subwayridesgraph.

  • Ensure your cursor is at the # Make Subway Ridership Linegraph comment.
  • Run through each step.

# Make Subway Ridership Linegraph

# Initialize a ggplot using railrides dataframe, and define axes
subwayridesgraph <- ggplot(railrides, aes(x=Month, y=Avg_Weekday_Ridership, group=Line, color=Line))

# draw the plot

plot(subwayridesgraph)


Click on Zoom to open the graph in a larger window.

Figure 5. Empty ggplot

We created our ggplot, but it is still just a plot without any data. Next we will add layers to it.

  • Run through the next lines to add points, lines, and assign colors. R uses a + to add each of the attributes to our ggplot.

# Add points, lines, and line colors to subwayridesgraph

subwayridesgraph <- subwayridesgraph + geom_point() + geom_line() + scale_color_manual(name="LINE",values=c("BLUE LINE"="blue","GREEN LINE"="green","ORANGE LINE"="orange","RED LINE"="red"))

# draw the plot

plot(subwayridesgraph)

Every graph should have a title, an x-axis label, a y-axis label, and a legend. When these items are included, a viewer without any prior knowledge of the data can still understand the graph.

Run the next line to add a title to our ggplot.

# Add a title to the ggplot

subwayridesgraph <- subwayridesgraph + ggtitle("Average Weekday MBTA Subway Ridership 2016")

# draw the plot

plot(subwayridesgraph)

Our final graph is below. In the Plots tab, click on Export to save a copy of the ggplot to your working directory.

Figure 6. Graph of MBTA subway ridership in 2016.
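The layers used in this section can also be chained in a single statement, with explicit axis labels added through labs() and the result saved from code with ggsave() instead of the Export button. The sketch below is self-contained and runs on invented numbers: the data frame toyrides, the plot toygraph, the label text, and the output file name are all assumptions for illustration, not MBTA data or part of the lesson script.

```r
library(ggplot2)

# Invented stand-in for railrides; every value here is made up
toyrides <- data.frame(
  Month = as.Date(c("2016-01-01", "2016-02-01", "2016-01-01", "2016-02-01")),
  Line = c("RED LINE", "RED LINE", "BLUE LINE", "BLUE LINE"),
  Avg_Weekday_Ridership = c(100, 110, 40, 45)
)

# Build the plot in one chain: points, lines, colors, title, and axis labels
toygraph <- ggplot(toyrides, aes(x = Month, y = Avg_Weekday_Ridership,
                                 group = Line, color = Line)) +
  geom_point() +
  geom_line() +
  scale_color_manual(name = "Line",
                     values = c("BLUE LINE" = "blue", "RED LINE" = "red")) +
  ggtitle("Toy Ridership Example") +
  labs(x = "Month", y = "Average weekday boardings")

# Save the plot to a file from code rather than with the Export button
outfile <- file.path(tempdir(), "toy_ridership.png")
ggsave(outfile, plot = toygraph, width = 8, height = 5)
```

Adding a similar labs() layer to subwayridesgraph would give the lesson graph axis labels that describe the data.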

7. Analysis

Ridership tells an important part of the public transportation story, illustrating the demand for and usefulness of a system in the community it serves. We use ridership to calculate other important metrics, like revenue and cost of operation per rider. That said, it is also just one part of the story of the impact of transit service on the surrounding community.

Now compare the Red and Blue Lines. Clearly, the Red Line has a much greater ridership than the Blue Line. What other information is useful in comparing these two lines? The number of stations and length of the line help add more context. The Red Line has 22 stations over 22.5 miles of track. In comparison, the Blue Line only has 12 stations and 6 miles of track. Therefore, the difference in ridership makes sense. To further understand the differences between the two lines, it would also be useful to know the population and densities of the neighborhoods the two lines serve. In this case, the Red Line serves high density areas such as downtown Boston, MIT, and Harvard University.
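The station and track figures quoted above can be put side by side in R to add context to the comparison. This is a minimal sketch that uses only the numbers from this lesson; the data frame lines_info and its column names are made up for illustration.

```r
# Station counts and track miles as quoted in this lesson
lines_info <- data.frame(
  Line = c("RED LINE", "BLUE LINE"),
  Stations = c(22, 12),
  Track_Miles = c(22.5, 6)
)

# Stations per mile of track for each line
lines_info$Stations_Per_Mile <- lines_info$Stations / lines_info$Track_Miles

lines_info
```

Dividing each line's Avg_Weekday_Ridership by its station count in the same way would give boardings per station, one way to put the two lines on a more comparable footing.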

Reflect: Is the Red Line more successful than the Blue Line? Not necessarily. Other factors can define the success of a transit system, for instance coverage.

Although ridership is an intuitive measure of success, it is only one factor in determining whether a transit system is successful. Unreliable and infrequent service may cause low ridership in high population density areas. Fast, reliable, frequent, and accessible service increases people’s use of public transportation, because they can depend on it to arrive on time and to get them to their destination in a reasonable amount of time. The elements of transit ridership are further discussed in TransitCenter’s Transit Ridership Recipe.


Cities and transit authorities may also include coverage as part of the mission of their public transportation system. In this case, maximizing the number of neighborhoods with access to buses and subways is another goal and factor in the allocation of resources. This ensures that less populated areas also have service, even if those routes or lines never achieve ridership levels of more populated areas.

Going Further:

Looking at our graph, what do you see? Comparing the Green and Orange Lines, the Green Line ridership dips below the Orange Line for January, February, and March. One factor may be that the Green Line is partially above ground, which may leave riders more exposed to winter weather. What are other potential factors for the ridership differences in these two lines?


 8. Further Reading

TransitCenter’s Ridership Recipe, TransitCenter

The Transit Ridership Recipe, Jarrett Walker

Blue Line (MBTA) Wikipedia entry

Green Line (MBTA) Wikipedia entry

Orange Line (MBTA) Wikipedia entry

Red Line (MBTA) Wikipedia entry


You’re done! Help us improve this site with your valuable feedback by taking our 5 minute survey.