6 – Transit System Vehicle Speed

1. Introduction

In the previous lesson, we examined the headway of buses, which is the time between buses on a route. Frequency is an important ingredient in encouraging riders to consistently use mass public transit over other modes of transportation. Another important component of a transit system is travel time, in which the speed of the vehicles plays a significant role.

By understanding how quickly vehicles can move through the area a transit system covers, transit planners can design and evaluate the system to ensure passengers can get to their destinations in a reasonable amount of time. Riders may not need the exact speed of their bus or train. However, a general understanding of how long a trip will take influences the options they choose.

For this lesson, we will calculate the average bus speed during morning rush hour. Traffic fluctuates throughout the day. Speed during a specific period, such as morning rush hour, is more informative than the average speed over the entire day because a bus will rarely travel at the daily average speed. The historic data we will use is provided by the New York MTA. In particular, we will look at bus service in the Bronx borough of New York City.

2. Learning Outcomes

Understand how speed is a tool for evaluating transit systems
Locate New York MTA historic Bustime data
Calculate the average speed of buses during rush hour
Create graphs to visualize the average speed of buses

3. Finding Data

The MTA Developers website provides a collection of free data resources to assist developers in accessing its data services. It provides real-time data services such as Service Status and Bus Time, which provides real-time location updates using GPS and wireless communication devices installed on all the buses. It also provides historic datasets. Shortly after the launch of Bus Time, the MTA archived the live feed transmitted by the GPS-enabled buses. The data has been processed by the MTA into an easy-to-use form. Having this data is particularly helpful because the raw GTFS data is not human-readable and is difficult to use.

4. Set Up

Figure 1. MTA Data Terms Page.

  • Scroll down, and click the link to agree to the Terms of Service, which will reveal a list of data sources available for download.

Figure 2. Agree to Terms of Service link.

  • Scroll down and select “Historical MTA Bus Time Data”

Figure 3. Historical MTA Bus Time Data link.

On the Historical MTA Bus Time Data page, we want data for Wednesday, October 8th, 2014, which is the file: http://s3.amazonaws.com/MTABusTime/AppQuest3/MTA-Bus-Time_.2014-10-08.txt.xz.

The .xz file extension denotes a compressed file format. After downloading the file, we will convert the file to a .txt file.

  • In Windows, right-click on the file name noted above and save the file to your Working Directory. Download the free open-source software 7-zip or similar file archiver application to decompress the file. If you use 7-zip, go to your Working Directory, right-click on the file, and select Open with 7-zip / Extract Here.

Figure 4. Convert file on Windows 10 with 7-Zip.

    • On macOS, control-click on the file name noted above and save the file to your Working Directory. Once in your Working Directory, double-click on the file to decompress it.
    • Go to your Working Directory, and make sure these files are there.
    • Download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 06-speed.R file to your Working Directory.
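
As an alternative to a separate archiver, base R can read xz-compressed text through an xzfile() connection. The sketch below is not part of the lesson's 06-speed.R script. It writes a tiny tab-separated file with made-up values to a temporary .xz file and reads it back, only to illustrate the idea. Reading the full MTA file this way should behave the same, though it may be slower than reading the decompressed .txt file.

```r
# Hypothetical miniature of the Bus Time file: tab-separated, with a header row
sample_rows <- data.frame(
  vehicle_id = c(101, 102),
  distance_along_trip = c(120.5, 980.0)
)

# Write the sample to a temporary xz-compressed file
xz_path <- tempfile(fileext = ".txt.xz")
con <- xzfile(xz_path, open = "wt")
write.table(sample_rows, con, sep = "\t", row.names = FALSE, quote = FALSE)
close(con)

# Read it back without decompressing first; read.csv opens and closes the connection
roundtrip <- read.csv(xzfile(xz_path), header = TRUE, sep = "\t")
print(roundtrip)
```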

5. Data Wrangling

  • Launch RStudio.
  • When you open RStudio, files and data frames might be open from the last time you used the application. If so, close all the files in the Source Pane by selecting File > Close All. Also remove any prior Environment variables by selecting Session > Clear Workspace.
  • Open the file in RStudio, by selecting File > Open File > 06-speed.R.
  • Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. Now, our Working Directory folder is the default location for our current session in RStudio. Any files we create in RStudio will be saved to this location.
    As we have done in previous modules, place the cursor on the first line of code. When ready, click on the Run button to execute a line of code. As a reminder, comments, which are designated with a #, are ignored and skipped over.
  • We’re using three packages: ggplot2, lubridate, and data.table. If you have not installed these in a previous module, you can install them now. Otherwise, you can skip these lines.

# Install ggplot2, lubridate and data.table (this is only required once)
install.packages('ggplot2', dependencies = TRUE)
install.packages('lubridate', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
  • Next, we load our packages in RStudio. ggplot2 graphs our findings, lubridate helps format our time data, and data.table adds features to make calculating the speed of the buses easier.
  • Run this block of R code to load our packages.

# Load ggplot2 into current session
library(ggplot2)
# Load lubridate
library(lubridate)
# Load data.table
library(data.table)

  • Read through and then click Run to execute the next three lines of R. First, we load our bus data file into RStudio, as the rawbustimes data frame. Note, this is a large dataset and could take more than a minute to load. Then, we create a working copy of the data into a data frame named bustimes. With the third line, we open the data frame in a new tab in the Source Pane.

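# Read in the tab-separated MTA Bus Time data file
rawbustimes <- read.csv(file = "./MTA-Bus-Time_.2014-10-08.txt", header = TRUE, sep = "\t")
# Create a working data frame
bustimes <- rawbustimes
# Open the working data frame in a new tab in the Source Pane
View(bustimes)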

Figure 5. MTA Subway Historic Bustimes Data.

  • Review the different fields of our data frame. Some of them are easy to understand; others are not. Fortunately, on the page where we downloaded our data, the MTA also provides field descriptions for the files. Our analysis uses the fields listed below. We can discard the others.

distance_along_trip – The distance in meters from the start of the bus route.

time_received – The time the observation (i.e., reading) was received by the MTA’s servers. The time zone is Coordinated Universal Time (UTC).

inferred_trip_id – The MTA gives a unique name to every trip a bus takes in a day. A trip is one way, not a round trip. The trip id is inferred from the time and location data transmitted by the server.

inferred_route_id – The route of a bus is inferred from the time and location data transmitted by the server.

# Select Buses in route
bustimes <- with(bustimes, bustimes[inferred_phase == "IN_PROGRESS" , ])

#Remove columns of data we don't need
bustimes <- bustimes[c("time_received","vehicle_id","distance_along_trip",
"inferred_direction_id", "inferred_route_id", "inferred_trip_id")]

# Select Bronx Buses
bustimes <- bustimes[grep("_BX", bustimes$inferred_route_id), ]

Notice in the Environment Pane that our bustimes data frame has almost one million observations (or rows), far fewer than the more than 5 million observations in our original rawbustimes data frame. Operations on bustimes will be much faster than operations on rawbustimes. Discarding unneeded data in a working data frame is a common practice. Take care not to discard or delete your raw data, which is the reason we use the working data frame.
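
To see how the pattern match behind the Bronx filter behaves, here is a minimal sketch on a handful of route IDs. The IDs below are illustrative and their exact formats are assumptions, not values taken from the MTA file.

```r
# Illustrative route IDs; the exact ID formats here are assumptions
routes <- c("MTA NYCT_BX19", "MTABC_BXM1", "MTA NYCT_M15", "MTA NYCT_B41")

# grep() returns the positions of the elements that contain "_BX"
bronx_idx <- grep("_BX", routes)
bronx_routes <- routes[bronx_idx]
print(bronx_routes)
```

Because the pattern includes the underscore and the X, a Brooklyn-style ID such as the hypothetical "MTA NYCT_B41" is not matched.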

Figure 6. Environment Pane.

The time_received field is imported into R as a Factor (categorical) data type, or as plain character text in R 4.0 and later. To perform calculations with the field, we need to convert it into R’s POSIXct data type. This data type allows R to recognize the field as a date and time. Run the next lines to convert the time_received field into a POSIXct data type.

# Create a POSIXct date and time variable using available data
bustimes$time_received <- as.character(bustimes$time_received)
# The MTA records times in UTC, so state the time zone explicitly
bustimes$time_received <- as.POSIXct(bustimes$time_received, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

Now, we can verify that the conversion worked by examining the Environment data frame again.
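
The same check can be made from the console. A minimal sketch, assuming a single made-up timestamp in the file's format and interpreting it as UTC: it confirms the POSIXct class and shows the same instant on a New York clock.

```r
# One timestamp in the file's format, interpreted as UTC
stamp <- as.POSIXct("2014-10-08 11:30:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Confirm the class, then view the same instant in New York local time
is_time <- inherits(stamp, "POSIXct")
local_clock <- format(stamp, "%H:%M", tz = "America/New_York")
print(is_time)
print(local_clock)
```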

Figure 7. time_received as POSIXct.

Now that our time_received is a POSIXct data type, we can use the lubridate package to select our observations for the morning. For this lesson, we consider morning rush hour to be between 7:00 am Eastern Daylight Time (EDT) and 10:00 am EDT. The MTA saves its time data in the UTC (Coordinated Universal Time) time zone, which is four hours ahead of EDT.

  • Run the next line of R to select the readings captured during rush hour and View our data.

# Select Peak time 07:00 EDT (11:00 UTC) to 10:00 EDT (14:00 UTC)
bustimes <- with(bustimes, bustimes[hour(time_received) >= 11 & hour(time_received) < 14 , ] )
View(bustimes)
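
Rather than adding the UTC offset by hand, the window can also be expressed in local time. The following is a minimal sketch, not the lesson's method, assuming timestamps parsed as UTC and using lubridate's with_tz() and hour(). The timestamps are made up to sit on the edges of the window.

```r
library(lubridate)

# Made-up UTC timestamps around the morning window
stamps <- as.POSIXct(
  c("2014-10-08 10:59:00", "2014-10-08 11:00:00",
    "2014-10-08 13:59:00", "2014-10-08 14:00:00"),
  tz = "UTC"
)

# Convert to New York clock time, then keep 7:00 am up to (not including) 10:00 am
local_hour <- hour(with_tz(stamps, "America/New_York"))
in_rush <- local_hour >= 7 & local_hour < 10
print(stamps[in_rush])
```

Only the two middle timestamps fall inside the window; the first is a minute too early and the last lands exactly on the excluded upper bound.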

Our data is now prepared for analysis, and we can create the data frames we will use for the speed calculation.

Looking at our data, we see all the different bus routes in the Bronx. For each route, the inferred_trip_id field lists all the trips. Speed is the distance traveled divided by the time taken. To calculate a single trip’s average speed, we find that trip’s observation with the greatest distance traveled and the observation with the least distance traveled, and record the difference between these two distances. Then, we find the time difference between those two observations. Dividing the distance by this time difference gives the speed of the bus between these two observations. After that, we calculate the average of the trip speeds for each route.
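
Before applying this to the full dataset, the calculation can be sketched for one trip in base R. The three observations below are made up; the lesson's script performs the equivalent steps for every trip at once.

```r
# Made-up observations for a single trip (distance in meters)
trip <- data.frame(
  time_received = as.POSIXct(
    c("2014-10-08 11:00:00", "2014-10-08 11:05:00", "2014-10-08 11:10:00"),
    tz = "UTC"
  ),
  distance_along_trip = c(200, 1400, 3200)
)

# Observations with the greatest and least distance traveled
far <- trip[which.max(trip$distance_along_trip), ]
near <- trip[which.min(trip$distance_along_trip), ]

# Distance and time between the two observations
distance_m <- far$distance_along_trip - near$distance_along_trip
seconds <- as.numeric(difftime(far$time_received, near$time_received, units = "secs"))

# Speed in meters per second
speed_m_per_sec <- distance_m / seconds
print(speed_m_per_sec)
```

Here the bus covers 3000 meters in 600 seconds, or 5 meters per second.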

First, we create a new data frame called speed and rename its last field to max_distance.
Next, we create another data frame called speed_min_distance, and rename its second and third columns to time and min_distance, respectively.

  • Run the next lines of R to set up new data frames.

# Create and format our analysis data frames
speed <- bustimes[c("inferred_trip_id","inferred_route_id", "time_received", "distance_along_trip")]
colnames(speed)[4] <- "max_distance"
speed_min_distance <- bustimes[c("inferred_trip_id", "time_received", "distance_along_trip")]
colnames(speed_min_distance)[2] <- "time"
colnames(speed_min_distance)[3] <- "min_distance"
  • Run the next two lines to convert the data frames into data tables.

# Convert the speed and speed_min_distance data frames to data.tables
speed <- data.table(speed)
speed_min_distance <- data.table(speed_min_distance)
  • Click Run to select and save the observations with the greatest and least distance traveled for each trip in the inferred_trip_id field.

# Select the maximum distance travelled for each trip ID
speed <- speed[ , .SD[which.max(max_distance)], by = inferred_trip_id]
# Select the minimum distance travelled for each trip ID
speed_min_distance <- speed_min_distance[ , .SD[which.min(min_distance)], by = inferred_trip_id]
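
To see what the .SD[which.max(...)] idiom returns, here is a minimal sketch on made-up observations for two hypothetical trips. Within each group defined by `by`, .SD holds that group's rows, and which.max() or which.min() picks a single row from it.

```r
library(data.table)

# Made-up observations for two hypothetical trips
obs <- data.table(
  inferred_trip_id = c("A", "A", "A", "B", "B"),
  time_received = c(1, 2, 3, 1, 2),
  distance = c(100, 400, 900, 50, 250)
)

# One row per trip: the observation with the greatest, then the least, distance
farthest <- obs[, .SD[which.max(distance)], by = inferred_trip_id]
nearest <- obs[, .SD[which.min(distance)], by = inferred_trip_id]
print(farthest)
print(nearest)
```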

Run these three lines, which join our speed and speed_min_distance data tables together. A join operation combines the columns of two tables into one table. Joining two data tables requires a “key” field. Our key field is inferred_trip_id. For every row in the speed_min_distance data table, R finds the row in the speed data table with a matching inferred_trip_id and combines their columns. Trips without a match are dropped.

# Key both tables on the trip ID, then join them, dropping unmatched trips
setkey(speed, inferred_trip_id)
setkey(speed_min_distance, inferred_trip_id)
speed <- speed[speed_min_distance, nomatch=0]
  • Scroll to the right of our speed data frame and we see that observations now have fields from the speed_min_distance data table.

Figure 8. speed data frame.
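
The effect of a keyed join with nomatch = 0 can be sketched on two small made-up tables. Trip "C" appears only in the first table and trip "D" only in the second, so neither survives the join.

```r
library(data.table)

# Made-up tables standing in for the maximum- and minimum-distance observations
far <- data.table(inferred_trip_id = c("A", "B", "C"), max_distance = c(900, 250, 700))
near <- data.table(inferred_trip_id = c("A", "B", "D"), min_distance = c(100, 50, 10))

# Key both tables on the trip ID
setkey(far, inferred_trip_id)
setkey(near, inferred_trip_id)

# For each row of near, look up the matching row of far; drop rows with no match
joined <- far[near, nomatch = 0]
print(joined)
```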

Next, we create two new columns. time_diff is the difference between the times of the two observations, time_received - time. distance is the distance traveled between those two time stamps, max_distance - min_distance.

  • Click Run and execute the next two lines.

speed$time_diff <- difftime(speed$time_received, speed$time, units = "secs")
speed$distance <- speed$max_distance - speed$min_distance
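
One detail worth knowing when subtracting times in R: plain subtraction of two POSIXct values returns a difftime whose units R chooses automatically, while difftime() lets us state the units. A minimal sketch with two made-up timestamps ten minutes apart:

```r
t1 <- as.POSIXct("2014-10-08 11:00:00", tz = "UTC")
t2 <- as.POSIXct("2014-10-08 11:10:00", tz = "UTC")

# Plain subtraction lets R pick the units; a 10-minute gap is reported in minutes
auto_diff <- t2 - t1
print(units(auto_diff))

# Pinning the units to seconds keeps later arithmetic in meters per second
secs_diff <- difftime(t2, t1, units = "secs")
print(as.numeric(secs_diff))
```

On a whole column, the automatic choice depends on the smallest difference present, so stating the units avoids surprises.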

Occasionally, an error in the data collection records the same distance for different time recordings. The exact cause of these errors is unknown. For this lesson, we will just remove any trips where time_diff or distance is equal to zero.

  • Run these two lines to clear out these instances, and then we are ready to find the speed of the trip.

# Remove any rows in which the time difference or distance travelled is not greater than zero
speed <- speed[speed$time_diff > 0 , ]
speed <- speed[speed$distance > 0 , ]

We need to convert our time_diff field into a numeric data type so that we can divide the distance by it. The MTA field description documentation states that our distance is in meters, and time_diff is measured in seconds. To convert meters per second to miles per hour, we multiply our result by 2.237.

  • Click Run on the next line to calculate the speed of each trip.

speed$m_per_sec <- speed$distance / as.numeric(speed$time_diff)

# meters per second to miles per hour = 1:2.237
speed$mph <- speed$m_per_sec * 2.237

We now have the speed in miles per hour for each trip of the bus route.
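
The 2.237 factor can be checked from first principles. A short sketch, using the international mile of 1609.344 meters and a made-up trip of 3000 meters in 600 seconds:

```r
meters_per_mile <- 1609.344   # international mile, in meters
seconds_per_hour <- 3600

# (miles per meter) * (seconds per hour)
mps_to_mph <- seconds_per_hour / meters_per_mile
print(round(mps_to_mph, 3))

# Example: a bus covering 3000 meters in 600 seconds
example_mph <- (3000 / 600) * mps_to_mph
print(round(example_mph, 1))
```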

Figure 9. updated speed data frame.

To find the average speed for each bus route, we take the mean of the trip speeds for each route found in inferred_route_id. Then, we rename the new field to avg mph.

  • Click Run on the next three lines to calculate the speed of each bus route and View the result.

# Find the average (mean) speed of all the trips of each individual route
average_speed <- speed[ , mean(mph), inferred_route_id ]
colnames(average_speed)[2] <- "avg mph"
View(average_speed)

Figure 10. average_speed data frame.
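
As a variation, data.table can name the result column in the same step as the grouping. This is a sketch on made-up speeds for two hypothetical routes, not the lesson's script. It uses avg_mph without a space, which is an assumption of convenience since such a name can be referenced later without backticks.

```r
library(data.table)

# Made-up trip speeds for two hypothetical routes
trips <- data.table(
  inferred_route_id = c("R1", "R1", "R2", "R2", "R2"),
  mph = c(8, 10, 6, 7, 11)
)

# Mean speed per route, naming the result column in the same step
avg <- trips[, .(avg_mph = mean(mph)), by = inferred_route_id]
print(avg)
```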

6. Visualize: Graphing our Data

View the speed and average_speed data frames to make sure the results make sense and appear as you expect. Once you are satisfied with your results, we plot our findings using ggplot. We need a useful way to plot the speeds for our 54 routes. Instead of putting all the data into one graph, we create a series of panels, one for each route. Each panel has the time of day on the x-axis and the speed of the bus on the y-axis.

# Plot
# Initialize a ggplot, and then create line graphs of bus speeds during the morning rush hour for each bus route.
speedplot <- ggplot(speed, aes(x = time_received, y = mph))
speedplot <- speedplot + geom_line() + facet_wrap(~inferred_route_id, ncol = 10) + theme_light()
# Display the plot in the Plots tab
speedplot

The plot will appear in the Plots tab in the bottom-right pane. However, the individual panels are hard to view at this size.

  • Click on Zoom to display our plot in a larger window.

Figure 11. Zoom button in Plots tab.

Figure 12. ggplot button in Plots tab.
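
Another way to inspect a crowded set of panels is to save the plot to a large image file with ggplot2's ggsave(). The sketch below is an optional addition rather than part of the lesson's script. It uses made-up speeds for two hypothetical routes and writes to a temporary file; the width and height (in inches) are arbitrary choices.

```r
library(ggplot2)

# Made-up speeds for two hypothetical routes
toy <- data.frame(
  inferred_route_id = rep(c("R1", "R2"), each = 3),
  time_received = rep(as.POSIXct(
    c("2014-10-08 11:00:00", "2014-10-08 12:00:00", "2014-10-08 13:00:00"),
    tz = "UTC"), times = 2),
  mph = c(8, 10, 9, 6, 7, 11)
)

# Same structure as the lesson's plot: one panel per route
toyplot <- ggplot(toy, aes(x = time_received, y = mph)) +
  geom_line() +
  facet_wrap(~inferred_route_id, ncol = 2) +
  theme_light()

# Save a large image to a temporary file; width and height are in inches
plot_path <- tempfile(fileext = ".png")
ggsave(plot_path, plot = toyplot, width = 16, height = 10)
```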

Note that a few of our panels have high speeds, which suggests errors in our data. These errors could have occurred during the recording and transmission of the data. For this exercise, we remove speeds greater than 30 mph. This maximum was chosen to remove the clear outliers and to give the panels more detail. Feel free to experiment with different cut-off levels. When representing data, these design decisions can shape how we interpret it. For now, Run the next line and then replot our graph.

# Remove any trips with a speed greater than 30 mph
speed <- speed[speed$mph < 30 , ]

# Replot with outliers removed
# Initialize a ggplot, and then create line graphs of bus speeds during the morning rush hour for each bus route.
speedplot <- ggplot(speed, aes(x = time_received, y = mph))
speedplot <- speedplot + geom_line() + facet_wrap(~inferred_route_id, ncol = 10) + theme_light()
speedplot <- speedplot + ggtitle("Average Speeds for MTA Bronx Buses from 7am-10am 2014-10-08")
# Display the plot in the Plots tab
speedplot

Figure 13. Final graph of Average Speeds of Bronx MTA buses
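
To experiment with the cut-off, it can help to count how many trips each candidate value would keep before replotting. A minimal sketch on made-up speeds; on the real data, the mph column of the speed table (before filtering) would take the place of the vector below.

```r
# Made-up trip speeds in mph, including two implausibly fast readings
mph <- c(6.5, 8.2, 9.9, 11.4, 14.0, 27.5, 45.0, 120.3)

# Number of trips that would remain under each candidate cut-off
cutoffs <- c(20, 30, 40)
kept <- sapply(cutoffs, function(limit) sum(mph < limit))
names(kept) <- paste0("<", cutoffs, " mph")
print(kept)
```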

7. Analysis

Looking at our data, patterns emerge. Which routes have more variance? Routes with _BXM in their name are express routes that go from the Bronx into Manhattan. They do not run as frequently as the other buses. Their panels are easy to identify because they have fewer data points. These routes also have a larger variance, which makes sense. They travel over a greater distance and have to cross between boroughs, which presents more opportunities for delays.
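
The visual impression of variance can be backed by a number. A minimal base R sketch on made-up speeds for two hypothetical routes, computing the standard deviation and trip count per route; on the real data, the speed table's mph and inferred_route_id columns would be used instead.

```r
# Made-up trip speeds for two hypothetical routes
trips <- data.frame(
  inferred_route_id = c("R1", "R1", "R1", "X1", "X1", "X1"),
  mph = c(8, 9, 10, 5, 12, 19)
)

# Standard deviation and trip count per route
spread <- tapply(trips$mph, trips$inferred_route_id, sd)
counts <- table(trips$inferred_route_id)
print(sort(spread, decreasing = TRUE))
print(counts)
```

Routes with few trips deserve caution, since a standard deviation from a handful of observations is itself unreliable.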

BX19 is among the slowest buses. It runs from the New York Botanical Garden and crosses the 145th Street Bridge into Manhattan. From there, it continues on West 145th Street to the Riverbank State Park in Harlem. Why might this bus be slow? Driving cross-town (that is east-west) is slower because the traffic lights are not timed as they are on north-south roads. The bridge might also be a bottleneck, especially during rush hour. That said, our data is limited to one day. We can also try to determine if the weather, an accident, or some other one-off event was the cause of the lower speed. How might we verify these causes?

8. Going Further

We can test our ideas on what is happening with the BX19 route. For people in New York, one way to understand more about this route would be to actually ride the bus on a weekday morning. We can also look up the weather for the day of our data collection on a website such as Weather Underground. Of course, we can always download more bus data from BX19 buses from the MTA Developers website. If we were to do that, which dates should we consider? Since our focus has been on morning rush hour, we should continue our focus on weekdays. However, we may wish to limit our investigation to October 2014 to ensure other seasonal effects do not enter our analysis.
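
Choosing the dates can be sketched in R. The code below lists the weekdays in October 2014 and builds candidate file names. The file-name pattern is an assumption based on the single file used in this lesson, so availability of each date should be confirmed on the MTA page before downloading.

```r
# All days in October 2014
days <- seq(as.Date("2014-10-01"), as.Date("2014-10-31"), by = "day")

# Keep Monday through Friday; "%u" gives the weekday number (1 = Monday, 7 = Sunday)
weekdays_only <- days[as.integer(format(days, "%u")) <= 5]

# Assumed file names, following the pattern of the file used in this lesson
file_names <- sprintf("MTA-Bus-Time_.%s.txt.xz", format(weekdays_only, "%Y-%m-%d"))
print(length(file_names))
print(head(file_names, 3))
```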

NY MTA Bronx Bus Schedule
Weather Underground

You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.