TransitCenter’s Transit Ridership Recipes notes that frequency and reliability are essential ingredients for maintaining and growing ridership. In Lesson 3, we explored the reliability of the MBTA Green Line. We used the MBTA Wait Time Reliability metric, which measures the percentage of people who had to wait longer than expected for their train. The other ingredient, frequency, is the subject of this lesson.
Frequent transit decreases the amount of planning required to use the system. Commuter trains that run once an hour require passengers to plan precisely and arrive early to avoid missing a train. On the other hand, if riders know that buses or trains run frequently, they can take a “walk up and go” approach to using the system.
The frequency of a transit system is the number of expected vehicles over a period of time, for example, four buses at a stop over 60 minutes. A more common measurement for a transit system’s frequency is headway, which is the time between transit vehicles. The headway is the inverse of frequency. The headway of the previous example is 15 minutes. Transit Ridership Recipes notes that headways of 15 minutes or less attract more riders. In general, riders do not need to plan their arrival around a schedule when headways are less than 15 minutes. They can more or less show up, knowing that the next bus or train is not that far away. They have confidence that they can get to their destination at a predictable time.
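The relationship between frequency and headway can be sketched in a few lines of R. The variable names and the is_walk_up helper are hypothetical and are not part of the lesson's 05-headways.R file; the 15-minute threshold is the one quoted above.

```r
# Frequency: vehicles expected at a stop over a period of time
buses_per_hour <- 4

# Headway is the inverse of frequency, expressed here in minutes
headway_minutes <- 60 / buses_per_hour
print(headway_minutes)

# Hypothetical helper: does this headway support "walk up and go" use?
is_walk_up <- function(headway, threshold = 15) headway <= threshold
print(is_walk_up(headway_minutes))
print(is_walk_up(60))
```

Four buses an hour gives a 15-minute headway, which sits right at the threshold, while an hourly service does not.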
This lesson uses bus data from the Open Bus [link] website, which captures live-data feeds from the New York City MTA GPS-enabled buses. We will explore how to calculate the headway of the M3 bus route, which runs in Manhattan from Fort George down 5th Avenue to the East Village and back up Madison Avenue.
2. Learning Outcomes
– Understand how headways describe transit systems
– Locate New York MTA bus data
– Analyze the headway of an MTA bus
– Visualize headways using a bar chart
3. Finding Data
From 2011 to 2014, the MTA equipped all its buses with GPS devices that broadcast their real-time locations. MTA-created and third-party transit apps all use the data to present real-time updates on bus arrival times. We will be using this data, which is stored and made available on the Open Bus website. Justin Tyndall, an urban economist whose research includes transportation policy studies, created the website. It archives the live position data from the New York City bus data feed. The site also stores data from Citi Bike, New York City’s bike share program.
4. Set Up
- Create a Working Directory, a folder on your computer that is easily accessible.
- Download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 05-headways.R file to your Working Directory.
- Go to the Open Bus website [link] and navigate to Raw Data and then under Bus Data, select June 2016 – Present.
- This link will open a public Google Drive folder called Raw Data 2. From here, click on 2017/10.
- Scroll down to m3_rawdata.csv and right-click (Windows) or option-click (MacOS) to save the file to your Working Directory.
- Rename the data file m3_rawdata-2017-10.csv.
- Next, download the 05-headways.R file and save it to your Working Directory.
5. Data Wrangling
- Launch RStudio.
- When you open RStudio, you might see files and data frames open from the last time you used the application. If you see them, close all the files in the Source Pane by selecting File > Close All. Then, remove any Environment variables by selecting Session > Clear Workspace.
- Open the file in RStudio, by selecting File > Open File > 05-headways.R.
- Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. This option sets our Working Directory folder as the default location for our current session in RStudio. Any files we create in RStudio get saved to this location.
- As we have done in previous lessons, place the cursor on the first line of code. When ready, click on the Run button to execute a line of code. Also, recall comments are skipped over until the next line of R is reached.
- If you have not installed ggplot in a previous module, you can install it now. Otherwise, you can skip this line. But be sure to run the next line, to install the lubridate package.
# Install ggplot2 and lubridate (this is only required once)
install.packages('ggplot2', dependencies = TRUE)
install.packages('lubridate', dependencies = TRUE)
- Next, we load ggplot2 and lubridate in RStudio. We will use ggplot2 later to graph our findings, and lubridate to work with dates and times.
# Load ggplot2 and lubridate into the current session
library(ggplot2)
library(lubridate)
- Read through and then click Run and execute the next three lines of R. Remember RStudio ignores the comments.
- First, we take the time-stamped location data from the M3 bus data file, called m3_rawdata-2017-10.csv, and save it in the rawbustimes data frame.
- Then, we create a working copy of the data in the bustimes data frame.
- With the third line, we open the data frame in a new tab in the Source Pane. Examine the open data frame.
# Read in the MTA bus time csv data file
rawbustimes <- read.csv(file="./m3_rawdata-2017-10.csv", header=TRUE, sep=",")
# Create a working data frame
bustimes <- rawbustimes
# Open the working data frame in a new tab
View(bustimes)
- Looking at the open bustimes data frame, we see that the first column time is the time of the recording. The data frame also has a column for each stop on the M3 route.
The date and time data are saved in the Unix time format. Unix time is the number of seconds elapsed since Thursday, January 1, 1970, in the Coordinated Universal Time (UTC) time zone. Unix time is useful because it is easier to subtract two integers than two dates such as October 21, 2017 and March 14, 2017.
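As a small illustration of why Unix time is convenient, the sketch below subtracts two timestamps and converts one to a readable date. The two values are invented for illustration and are not taken from the M3 data file.

```r
# Two recordings stored as Unix time (seconds since 1970-01-01 00:00:00 UTC).
# These values are made up for illustration; they are not from the M3 file.
first_recording  <- 1507550400
second_recording <- 1507550580

# Subtracting two integers gives the elapsed seconds directly
elapsed_seconds <- second_recording - first_recording
print(elapsed_seconds)

# Convert to a human-readable date and time in UTC
readable <- as.POSIXct(first_recording, origin = "1970-01-01", tz = "UTC")
print(format(readable, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

The two recordings are 180 seconds apart, which matches the 3-minute recording interval used by the data set.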
Our bustimes data frame contains one column for each bus stop on the M3 route. Each stop column name contains the unique stop ID, which the MTA assigns to each bus stop. Every 3 minutes, the Open Bus website tracks the MTA bus system. It records a 1 if there is a bus approaching or at the stop. If not, it records a 0. The Open Bus Methodology documentation provides a detailed description of the capturing and processing of this data.
Let’s start cleaning and preparing our data for analysis.
Occasionally, bus stops in the dataset do not contain any data. In R, NA represents missing values, also called null values. The missing data could be due to temporary or permanent changes in the bus route. We will remove these columns. We can identify them by counting all the NA values in a column. The empty columns have a count equal to the number of rows. Therefore, we count all the NA values found in each column, and select the columns whose count of NA is not equal to the number of rows in the data frame.
- Run the next line of R to remove these empty columns.
# Remove any empty columns
bustimes <- bustimes[, colSums(is.na(bustimes)) != nrow(bustimes)]
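To see how this line behaves, here is a minimal sketch on a toy data frame. The column names and values are invented for illustration; only the colSums(is.na(...)) pattern comes from the lesson.

```r
# Toy data frame: stop_b has no data at all, stop_c is only partly missing.
# The column names and values are invented for illustration.
toy <- data.frame(
  time   = c(1, 2, 3),
  stop_a = c(0, 1, 0),
  stop_b = c(NA, NA, NA),
  stop_c = c(1, NA, 0)
)

# Count the NA values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep the columns whose NA count is not equal to the number of rows
toy_clean <- toy[, na_counts != nrow(toy)]
print(colnames(toy_clean))
```

Only stop_b is dropped. A column such as stop_c, which is missing some values but not all, is kept.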
Although Unix time is easier to program with, we do not think in Unix time. Let’s convert our time data into a format that is more human readable. Make sure the cursor is at the start of this block of code. Run the first line to convert the time column into a more familiar format: Year, Month, Day, Time. Note that the time is displayed using the 24-hour clock convention, also referred to as “military time” in the US. Then, Run the next two lines to transform the time zone from UTC to Eastern Daylight Time, the time zone when and where the data was captured.
# Convert time to POSIXct with UTC time
bustimes$time <- as.POSIXct(bustimes$time, origin="1970-01-01", tz = "UTC")
# Convert to local (Eastern) time and add it as the first column
localtime <- with_tz(bustimes$time, "America/New_York")
bustimes <- cbind(localtime,bustimes)
View the bustimes data frame and compare the two columns. The localtime column should be 4 hours behind the time column, because Eastern Daylight Time (EDT) is 4 hours behind Coordinated Universal Time (UTC).
Once you are satisfied that the local time has been converted properly to Eastern Daylight Time, Run the next line to remove the time column in UTC.
# Remove the UTC time column
bustimes <- within(bustimes, rm(time))
Working with dates, times, and time zones is always a challenge. It is hard to keep track of the time zone of the captured data along with other relevant time zones, such as those of other datasets being used. The Daylight Saving Time convention adds another complication. Always think carefully through the time data you have and how you intend to use it. One strategy is to always store your time data in UTC. Convert to a local time zone only when you need to, but never store it that way. That way, you always have the same time zone to use as a reference point.
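The "store in UTC, convert only for display" strategy might look like the sketch below. It uses base R's format function rather than lubridate so that it runs without extra packages, and the timestamps are invented for illustration.

```r
# Store timestamps in UTC (values invented for illustration)
stored_utc <- as.POSIXct(c("2017-10-09 12:00:00", "2017-12-11 12:00:00"), tz = "UTC")

# Convert only for display; the stored values are unchanged
shown_local <- format(stored_utc, "%Y-%m-%d %H:%M", tz = "America/New_York")
print(shown_local)

# The stored values are still the same instants in UTC
print(format(stored_utc, "%Y-%m-%d %H:%M", tz = "UTC"))
```

The October timestamp displays 4 hours behind UTC and the December one 5 hours behind, because Daylight Saving Time has ended by December. Keeping the stored values in UTC means this shift never affects the arithmetic.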
The bustimes data frame contains all the stops, but we will only examine one stop for this exercise. We will look at the northbound bus stop at Union Square East and East 15th Street. The stop has the MTA-designated stop ID 404120. Run the next line to create a new data frame with our time data and our stop.
# Create new data frame with just the M3 stop_404120
# UNION SQ E/E 15 ST Stopcode 404120
# This is a north bound bus
headways <- bustimes[, c('localtime','stop_404120')]
The following block of code finds the time between each identified bus, in other words, the headway. Recall that each row in our data frame is a recording of the live data stream, which occurs every 3 minutes. One limitation is that if two consecutive rows show a bus, we cannot tell whether it is the same bus or a different bus. We make the large assumption that these are different buses.
- Identified buses have a value of 1. Our code isolates the rows with 1 and calculates the time difference from the prior row with a value of 1, which is the previous bus. The code is admittedly complex. We suggest that you Run through each line of R code and refer back to the data frame after each step to make sure you understand each operation on our data frame.
- The first line creates an index column, which uses the cumulative sum function to incrementally add 1 every time a bus is at or approaching the stop.
- Then, we add 1 to every row to ensure that the column starts with a value of at least 1.
# The index increases incrementally every time a bus is at or approaching at a stop
# Add 1 to make sure the index doesn't start at zero
headways$index <- cumsum(headways$stop_404120)+1
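- Next, Run the next line to offset the index. It adds a 1 at the start of the index and shifts the remaining values down one row, dropping the last value. Offsetting the index lets us place the time of the previous bus on the same row as the current bus.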
# Add 1 to the start of the index, and shift the rest of the index down a row
headways$index <- c(1, head(headways$index, -1))
- We’re finally ready to use headways$index as an index to set the lastbus column to the time of the previous bus.
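# Look up the time of the previous bus for each row.
# The lookup vector holds the first recording time (a placeholder), followed by the time of each identified bus.
headways$lastbus <- c(headways$localtime[1], headways[which(headways$stop_404120==1), "localtime"])[headways$index]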
- Run the next two lines of R. Here, we subtract the previous bus time from the current time using the difftime function and store the result in our headway column. Then, we multiply it by the stop_404120 column, which sets the headway to 0 on every row without an identified bus.
# Find the difference (in seconds) between the time of the recording and the time of the last bus found
# Multiply by stop_404120 to isolate the time intervals where there was a bus at the time of the recording.
headways$headway <- difftime(headways$localtime, headways$lastbus, units="secs")
headways$headway <- headways$headway * headways$stop_404120
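Because these steps are easier to follow on a small example, here is a sketch of the same index-and-lookup idea on six invented recordings. The toy data frame, its stop column, and the bus_times vector are illustrative names, not part of the lesson file. The lookup vector is assumed to start with the first recording time as a placeholder for "no previous bus yet".

```r
# Toy recordings every 3 minutes; stop = 1 means a bus was at or approaching the stop.
# All values are invented for illustration.
start <- as.POSIXct("2017-10-09 12:00:00", tz = "UTC")
toy <- data.frame(
  localtime = start + 180 * (0:5),
  stop      = c(0, 1, 0, 0, 1, 1)
)

# Running count of buses seen so far, plus 1, shifted down one row so that
# each row points at the bus seen before it
toy$index <- cumsum(toy$stop) + 1
toy$index <- c(1, head(toy$index, -1))

# Lookup vector: a placeholder (the first recording time) followed by each bus time
bus_times <- c(toy$localtime[1], toy$localtime[toy$stop == 1])
toy$lastbus <- bus_times[toy$index]

# Seconds since the previous bus, kept only on rows where a bus was seen
toy$headway <- as.numeric(difftime(toy$localtime, toy$lastbus, units = "secs")) * toy$stop
print(toy[, c("stop", "index", "headway")])
```

The second bus (row 5) arrives 540 seconds after the first (row 2), and the third bus (row 6) arrives 180 seconds after the second. The headway on row 2 is measured from the placeholder rather than from a real bus, which is why rows with an index of 1 need to be discarded before averaging.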
Run the next block of code. The first line removes the initial recordings up to and including the first recorded bus. We do this step because we do not know the actual time of the previous bus: it passed the bus stop before the start of our data set. Then, we keep the rows where lastbus is on the same day as localtime.
We will select the data collected between 7 am EDT and 10 am EDT, and only on weekdays. We’ll use the helpful lubridate functions hour and wday, which make this much easier to program.
The next line selects a row if the hour of localtime is greater than or equal to 7 and less than 10, using the lubridate hour function to conveniently pull out the hour from localtime. Then, we select Monday through Friday, using the wday function. Lubridate treats 1 as Sunday, 2 as Monday, etc. View the headways data frame to confirm that the selection is correct.
# Remove the rows up to and including the first recorded bus
headways <- with(headways, headways[index != 1 , ])
# Keep rows where localtime and lastbus fall on the same day of the week
headways <- with(headways, headways[wday(localtime) == wday(lastbus) , ])
# Select times from 7 am up to (but not including) 10 am EDT using the hour function from lubridate
# localtime is already in Eastern time, so no UTC offset is needed here
headways <- with(headways, headways[hour(localtime) >= 7 & hour(localtime) < 10 , ] )
# Select Monday (2) through Friday (6) using the wday function from lubridate
headways <- with(headways, headways[wday(localtime) >= 2 & wday(localtime) <= 6 , ] )
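The hour and weekday selection can be checked on a handful of invented timestamps. This sketch uses base R's as.POSIXlt as a stand-in for lubridate's hour and wday so it runs without extra packages. Base R numbers weekdays from 0 (Sunday), so 1 is added to match the lubridate numbering described above.

```r
# Invented timestamps in Eastern time: a Sunday, then three on a Monday
times <- as.POSIXct(
  c("2017-10-08 08:00:00", "2017-10-09 06:57:00",
    "2017-10-09 07:00:00", "2017-10-09 10:00:00"),
  tz = "America/New_York"
)

# Base R stand-ins for lubridate's hour() and wday()
parts   <- as.POSIXlt(times, tz = "America/New_York")
hour_of <- parts$hour
wday_of <- parts$wday + 1   # 1 = Sunday, 2 = Monday, ..., 7 = Saturday

# 7:00 up to (not including) 10:00, Monday through Friday
keep <- hour_of >= 7 & hour_of < 10 & wday_of >= 2 & wday_of <= 6
print(data.frame(times, hour_of, wday_of, keep))
```

Only the Monday 7:00 recording is kept. The Sunday recording fails the weekday test, 6:57 is too early, and 10:00 falls just outside the window because the upper bound is exclusive.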
Now, our data is prepared, and we can calculate the weekday morning headway of a bus stop on the M3.
Run the next two lines of R. The first line converts the headway column from R’s special time difference data type into a numeric data type. Then, we use the mean function to find the average headway, which is the time in between buses. In this line, we also remove any null values and any values equal to 0. Finally, we divide by 60 to convert the average into minutes. The result is stored in the variable headway_mean.
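# Convert the headway column from a time difference (in seconds) to a number
headways$headway <- as.numeric(headways$headway)
# Calculate the average headway in minutes. Remove any 0 and null values from the calculation.
headway_mean <- mean(headways$headway[headways$headway != 0], na.rm=TRUE)/60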
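A short sketch with invented numbers shows why the zeros must be excluded. The headway_secs vector is hypothetical, with 0 marking recordings without a bus and NA marking a missing value.

```r
# Invented headways in seconds; 0 marks rows with no bus, NA a missing value
headway_secs <- c(0, 540, 0, 900, NA, 0, 720)

# Drop the zeros, ignore the NA, then convert seconds to minutes
toy_mean <- mean(headway_secs[headway_secs != 0], na.rm = TRUE) / 60
print(toy_mean)

# Leaving the zeros in would understate the headway
with_zeros <- mean(headway_secs, na.rm = TRUE) / 60
print(with_zeros)
```

With the zeros removed, the average of the three real headways is 12 minutes. Leaving them in halves the result to 6 minutes, because the recordings without a bus are counted as if they were zero-length gaps.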
In the RStudio Environment tab, we see that the value of headway_mean is 16.7. This variable is the average headway, in minutes, of the M3 bus during October 2017. If we go back to the Open Bus website, the site lists the average headway as 18 minutes, which helps to validate our calculation. Our value is lower, which is consistent with the more frequent bus schedule during the morning rush hour.
Next, we will calculate the headways for each day of the month. The first step, seen in the next line of R, is to create a new column that takes the date time column, strips out the time, and only saves the date into the day column. Then, we select all rows where a bus is identified.
# Identify the different days
# Convert to a Date object (remove the time and keep just the date). Then, remove any rows where no bus was identified (that is, rows where stop_404120 is 0).
headways$day <- as.Date(headways$localtime, tz = "America/New_York")
headways <- with(headways, headways[stop_404120 != 0 , ])
Finally, in the next lines of R, we calculate the average headways by day for the month.
- Run the next two lines of R.
- We implement the highly useful R function aggregate, which applies any function to subgroups in a data frame. In our case, our data frame is headways, the subgroup is day, the function to apply is mean, and we remove any null values.
- View the new data frame, named headway_mean_by_day, in the console. There is one row for each day from our data set. Each subsequent column has the average values for that day. The only column of interest is the headway column, which holds the average headway for each day.
- Before moving on, we divide that column by 60 to convert the headways into minutes, and save it in the headway_mean column.
# Calculate the average headway per weekday
headway_mean_by_day <- aggregate(x = headways, by=list(headways$day), FUN=mean, na.rm=TRUE)
headway_mean_by_day$headway_mean <- headway_mean_by_day$headway/60
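To see what aggregate produces, here is a minimal sketch on invented data. Unlike the lesson's call, it passes only the headway column and names the grouping column DAY inside the by list. Both choices are assumptions made to keep the output small, not the lesson's own code.

```r
# Invented headways (seconds) on two days
toy <- data.frame(
  day     = as.Date(c("2017-10-09", "2017-10-09", "2017-10-10", "2017-10-10")),
  headway = c(600, 1200, 480, 720)
)

# Apply mean() to the headway column within each day
by_day <- aggregate(x = toy["headway"], by = list(DAY = toy$day), FUN = mean, na.rm = TRUE)

# Convert seconds to minutes
by_day$headway_mean <- by_day$headway / 60
print(by_day)
```

The result has one row per day, with average headways of 15 and 10 minutes. When the by list is not named, aggregate calls the grouping column Group.1.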
Now, we have the average headway by day for October 2017, found in the column headway_mean.
6. Visualize: Graphing our Data
The final step is to plot our data.
- In preparation for graphing our data, Run through the following lines of R to rename the columns to something easier for new viewers to understand.
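# Rename the columns used in the chart
# Group.1 is the grouping column created by aggregate; it holds the day
colnames(headway_mean_by_day)[colnames(headway_mean_by_day) == "Group.1"] <- "DAY"
colnames(headway_mean_by_day)[colnames(headway_mean_by_day) == "headway_mean"] <- "HEADWAY"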
As in our previous lesson, we will use ggplot to create our graphs.
- Run the following lines of R. First, we set the plot theme.
- The next line defines a ggplot bar chart with the day as the x-axis, headway as the y-axis, and a bar color of orange.
- Then, add a title to our chart.
- In the final line, the plot command displays the ggplot in the Plots tab.
Recall, you can click on the Zoom button in this panel to display our chart in a popup window.
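# Define a bar chart with the day on the x-axis and the headway on the y-axis, add a title, and display it
headwaysgraph <- ggplot(headway_mean_by_day, aes(x=DAY, y=HEADWAY))+geom_bar(fill=rgb(0.9,0.6,0), stat="identity")
headwaysgraph <- headwaysgraph + ggtitle("MTA M3 Average Headways for October 2017")
plot(headwaysgraph)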
Our data runs from October 9th to October 31st. Note that we removed the weekends, which is reflected in the gaps in the data. No patterns or insights are immediately apparent. The three days with the highest headways fall on different days of the week: Monday, October 9th; Tuesday, October 10th; and Wednesday, October 18th. What could cause headways to vary between days? There are a number of factors, both internal and external. The MTA may not have had the scheduled number of buses on the street due to a shortage of operators or mechanical issues. Traffic and ridership might have been unusually heavy that day. Weather is another factor to consider. Public events might also be a factor. The M3 route runs down 5th Avenue, which is the parade route for many communities. (A follow-up question is: when do parades occur in New York?) These are the types of questions to ask as you explore data sets.
Researching the weather, or whether any public events were taking place on the days with higher headways, may shed light on what caused the increase. Another consideration is going back to our original data and looking for any suspicious data. The GPS system fitted on the buses is not perfect, and errors can occur in the recording, transmission, and storage of the data.
8. Going Further
Beyond the M3 Bus in October, what other datasets would be logical places to continue to explore? What are interesting questions to ask? One area of further exploration is performing a similar analysis on other routes besides the M3. Another area is looking at other months of the year to look for seasonal trends in bus headways. Is it possible to find a connection between headways and ridership?
Headway is an important tool for understanding a transit system. In conjunction with other data and research, we can use the actual (versus scheduled) headways to gain insight on how well the transit agency is delivering the service it intends to, and what can be improved to encourage more ridership.
You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.