1 – Mapping A Transit System

1. Introduction

Welcome to the Open Transit Data Toolkit. These lessons teach you how to use the rich data that transit authorities make available today. Each task-based lesson uses different sets of transit data and its own strategies for wrangling the data into a desired form. We begin each lesson by finding data. Then, we walk through the steps of wrangling the data into a usable form. Finally, we analyze and visualize the data. We include scripts, which we review line by line. Even if you are new to programming and don’t understand all the syntax, you can still get into the weeds of the data and begin to experience wrangling it into something useful. This lesson uses R and RStudio. If you haven’t set up these tools, please refer to Getting Started: Tools for instructions on installing R and RStudio. If you get stuck along the way, check out Getting Started: Finding Help.

Maps are an important and common tool for understanding transit systems. Passengers use transit maps for navigation. By adding layers of information, we can also turn maps into valuable tools for deriving further insight into a transit system and for telling a story with the data. This lesson creates a simple map of the Massachusetts Bay Transportation Authority (MBTA) subway system, also called the T. In 2013, the MBTA and the Massachusetts Department of Transportation (MassDOT) held the MBTA System Maps Competition, a map design competition in honor of National Transportation Week. The event invited anyone to submit redesigns of the MBTA subway map. Figure 1 is the winning map, designed by Michael Kvrivishvili, on which the MBTA based the system map still in use today (2018).


Figure 1. Boston rapid transit map (v. E16-140513) by Michael Kvrivishvili, licensed under Creative Commons Attribution 2.0 Generic.

2. Learning Outcomes

Locate MBTA station location data
Wrangle MBTA location data into a form suitable for mapping using R
Map station location data using the R package leaflet
Include the map elements required to make a good map

3. Finding Data

The first thing we need to do is find a list of the MBTA stations and their locations. We will also need a base map, on which to overlay our data.
Reflect: How are locational data formatted? In other words, what are the different ways we can describe a location? Imagine a time when someone asked you the location of a place. How did you answer them? You might have offered cross streets, a street address, or a landmark. For this lesson, we need precise location information, such as the latitude and longitude coordinates of the stations to accurately place the stations on our digital map.
Fortunately, the MBTA provides data using the General Transit Feed Specification (GTFS) data standard. Google developed GTFS as a standard format that lets transit authorities get their transit data to appear in Google Maps. Before GTFS, transit authorities used their own custom formats, which made using transit data much more difficult.
GTFS has two specifications. The first is GTFS static, which is a collection of text files that describe stops, routes, schedules, and other kinds of system data. GTFS static data does not change often. Transit authorities, including the MBTA, often post a stable link to their GTFS static data, so that both people and computer applications can find it consistently. The second specification is GTFS real-time, which specifies real-time data of a transit system, such as vehicle locations, predicted arrival times, and system notifications. This lesson uses GTFS static data provided by the MBTA. Later Open Transit Data Toolkit lessons will address GTFS real-time data.

Note: When looking for data that is widely used, like subway stations, there may be multiple sources of similar but not identical data. Not surprisingly, different agencies may have different requirements and standards for the data they create and share. In the case of subway stations, MassGIS also releases MBTA station data in the shapefile format. A shapefile is a data file containing geospatial data. We will use shapefiles in later lessons. For now, we will use only the GTFS data, as it is in an easy and standard format adopted by many transit authorities.

4. Set Up

    • Create a working folder somewhere on your computer that is easy to access. Store all your files here. We will refer to this folder as your Working Directory.
    • Download the transit-data-toolkit file, which contains the R files. Open the zip file, and copy the 01-mapping-stations.R file to your Working Directory.
    • Go to the MBTA developer portal GTFS page and download the GTFS static file. Open the zip file, find the stops.txt file, and save it to your Working Directory.
    • Double check that both the 01-mapping-stations.R and stops.txt files are saved in your Working Directory. (The stops.txt and the other GTFS files may have been extracted into their own folder.)

Figure 2. MBTA Rider Tools Developer website.

After you have the data files, you are ready to start working with the file in R. Refer to Getting Started: Tools/R and RStudio Overview if you are new to R or need a refresher.

      • Launch RStudio.
      • When you open RStudio, you might see files and data frames open from the last time you used it. If that is the case, close all the files in the Source Pane by selecting File > Close All. Then, remove any Environment variables by selecting Session > Clear Workspace.
      • Open the file in RStudio, by selecting File > Open File > 01-mapping-stations.R.
      • Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. This step sets our Working Directory folder as the default location for our current session in RStudio. Any files we create in RStudio will be saved here.
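
If you want to confirm the set up from the Console, the following is a minimal sketch, assuming the two file names used in this lesson. It only reports what it finds and changes nothing.

```r
# Show the folder R is currently using as the Working Directory
print(getwd())

# The two files this lesson expects to find there
needed <- c("01-mapping-stations.R", "stops.txt")

# file.exists() returns TRUE or FALSE for each file name
found <- file.exists(needed)
names(found) <- needed
print(found)

# Report any file that is not in the Working Directory
if (!all(found)) {
  message("Missing from the Working Directory: ",
          paste(needed[!found], collapse = ", "))
}
```

If a file is reported as missing, check whether it was extracted into a subfolder, then copy it next to the script.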

Figure 3. RStudio – Source Pane.

5. Data Wrangling

In the Source Pane, read the first two lines.


# Install leaflet (this is only required once)
install.packages('leaflet', dependencies = TRUE)

  • Click on the first line to place the cursor there. The first line is a comment which begins with the # symbol, and contains an explanation of the nearby code. While running the script, R ignores all comments.
  • Click the Run button once to run the next line of the script.

This line installs the R package leaflet. Packages provide extra features that are not included in the basic installation of R. A package only needs to be installed once, so if you run this script again, you do not have to run this line again. Leaflet is a popular package for making maps.
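
One way to avoid reinstalling by accident is to check for the package first. The sketch below is not part of the lesson script; install_if_missing is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of the lesson script):
# install a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg, dependencies = TRUE)
  invisible(TRUE)              # an installation was attempted
}

# Example call, left commented out so that nothing is installed by accident:
# install_if_missing("leaflet")
```

With a helper like this, the install line can stay in the script and be run every time without downloading the package again.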

  • Run the next two lines of R. (Remember that R skips the comments.) Although we installed the package, we still need to load it into R in order to use it. Unlike installing a package, we load the package every time we run this script.

# Load library
library(leaflet)

Now that the package is loaded, we can read in our data. We will return to leaflet after we have wrangled our data.
Review the next three lines of R.

# Read in MBTA Station txt file
rawlocs <- read.csv(file = "./stops.txt", header = TRUE, sep = ",")
View(rawlocs)

  • Click on the first line to place the cursor there. As before, the first line is a comment, so R ignores it when running the script.
  • Click the Run button twice to run the next two lines of the script.
  • In the first line of code, we look for the file in the Working Directory. If R finds the file, R reads it into a data frame called rawlocs. If R cannot find the file, it will display an error in the Console Pane.
  • In the second line of code, we open the rawlocs data frame in a new tab in the Source Pane. Scroll left and right to see all the fields found in stops.txt. If we open stops.txt in Excel or a text editor, the data should match the data in RStudio. If it does not, double check that you are comparing the same file in your Working Directory.
  • When you are ready to continue, click back to the 01-mapping-stations.R tab in the Source Pane.
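
View() is handy for browsing, but a few Console commands summarize a data frame more quickly. The sketch below is illustrative only: it writes a tiny made-up file in the stops.txt layout (the rows are not real MBTA records) so that it runs without the real data.

```r
# Write a small made-up file so the example runs without the real stops.txt
demo_file <- tempfile(fileext = ".txt")
writeLines(c("stop_id,stop_name,stop_lat,stop_lon",
             "1001,Example Station - Inbound,42.3600,-71.0600",
             "1002,Example Station - Outbound,42.3601,-71.0601"),
           demo_file)

# Read it the same way the lesson reads stops.txt
demo_locs <- read.csv(file = demo_file, header = TRUE, sep = ",")

dim(demo_locs)      # number of rows and columns
names(demo_locs)    # column (field) names
head(demo_locs, 2)  # the first rows
str(demo_locs)      # the data type of every column
```

Running the same four commands on rawlocs shows how many stops the real file holds and how R typed each column.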


Figure 4. RStudio – Source Pane.

Read the next lines of the R script in the Source Pane.


# Make a copy of the raw data so that rawlocs stays untouched
station_locs <- rawlocs
# Select the columns we want and rename two of them to latitude and longitude
station_locs <- station_locs[c("stop_id", "stop_name", "stop_lat", "stop_lon")]
colnames(station_locs)[3] <- "latitude"
colnames(station_locs)[4] <- "longitude"

Run the first line, where we create a copy of rawlocs and name it station_locs. This step is important because we will be editing the data. A best practice in data wrangling is to keep your raw data untouched. That way, you can easily start from the beginning if you make a mistake or want to double check your work.
Then, in the next line, we select the columns in the data that are most important and discard the other columns.
In the final two lines, we rename two columns to make them easier to read.
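
Renaming by position works as long as the column order never changes. As a hedged alternative, the sketch below renames columns by matching their old names. The demo_raw data frame and its rows are made up for illustration.

```r
# Made-up rows standing in for rawlocs
demo_raw <- data.frame(
  stop_id   = c("1001", "1002"),
  stop_name = c("Example A", "Example B"),
  stop_desc = c("", ""),
  stop_lat  = c(42.3600, 42.3700),
  stop_lon  = c(-71.0600, -71.0700),
  stringsAsFactors = FALSE
)

# Keep four columns in a new data frame; demo_raw itself is left untouched
demo_locs <- demo_raw[c("stop_id", "stop_name", "stop_lat", "stop_lon")]

# Rename by matching the old name instead of counting positions
names(demo_locs)[names(demo_locs) == "stop_lat"] <- "latitude"
names(demo_locs)[names(demo_locs) == "stop_lon"] <- "longitude"

names(demo_locs)
```

Because the raw copy still has all of its columns, we can always go back and select a different set later.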

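R imported some of the data as a factor data type, instead of a character data type. The next lines of R code convert those columns into characters. (Since R 4.0.0, read.csv imports text columns as characters by default, so on a recent version of R these lines leave the data unchanged.)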


# Convert the columns imported as a factor to characters
station_locs$stop_id <- as.character(station_locs$stop_id)
station_locs$stop_name <- as.character(station_locs$stop_name)

Run these two lines of R.
– The first line converts the stop_id column into characters.
– The second line does the same for the stop_name column.
– If the station_locs data frame is not already open in a tab in the Source Pane, type View(station_locs) in the Console to display it.
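
To check which columns are factors before converting them, something like the sketch below can help. The data frame is made up, and it builds factor columns on purpose to mimic what older versions of R would import. The conversion matters because as.numeric() applied to a factor returns its internal level codes rather than the numbers shown in the column.

```r
# Made-up data frame with factor columns, as older versions of R would import it
demo_locs <- data.frame(
  stop_id   = factor(c("1001", "place-demo")),
  stop_name = factor(c("Example A", "Example B")),
  latitude  = c(42.36, 42.37)
)

# Which columns are factors?
is_factor <- sapply(demo_locs, is.factor)
print(is_factor)

# Convert every factor column to character in one step
demo_locs[is_factor] <- lapply(demo_locs[is_factor], as.character)

# Confirm the result
sapply(demo_locs, class)
```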

We only want to map the subway stations. However, the stops.txt file contains all the stops in the entire MBTA transit system, which also includes buses and commuter rail. Our next data wrangling challenge is to remove the data we don’t want. When working with data, we sometimes have to decide whether it is more efficient to prepare the data by hand or programmatically. Sometimes it is faster to manually correct or select data. The decision will partially depend upon an individual’s skill level. In this case, we could probably delete the buses and commuter trains from this file by hand. However, we will use R to practice the manipulation of data.
In the station_locs tab, compare the stop_id and the stop_name columns. We want to find the names of the stations. What are different strategies to find the names?

If you are not familiar with the names of the MBTA subway stations, open a list of the Red Line stations from the MBTA website in a web browser. We’ll see why we want the Red Line in a moment.

Then, in RStudio, search for the Alewife station in the station_locs data frame using the search field at the top right of the pane.

Several rows contain the string Alewife. Each search result must be investigated to find the stations in the data.

For convenience, we can start with row 6034, which has stop_id 70061, as seen below in Figure 5. You may want to write it down or just refer to the image.


Figure 5. Searching in the station_locs data frame.

Now, click on the X to clear the search term and display all of the rows again.
Scroll down to row 6034, which has stop_id 70061.
You will see that the rows around it match the subway stations of the T.
If you were doing this on your own, you would write down all the stop_id values from the search and then go through each of them until you found where the subway stations are in the dataset.


Figure 6. Finding Red Line stations in the station_locs data frame.

Now that we have found the subway stations, we want to extract them. The data we want is often a subset of the dataset we initially obtain. Figuring out how to extract data from a dataset is an important skill to practice in data wrangling. With practice, you will start to learn different strategies for extracting data. However, every dataset is unique and generally requires some customized solutions to get the data you want. In this case, some stop_id values combine text and numbers, but the stations we want have stop_id values made up only of numbers. We can use that fact to start isolating the rows of data we want.

Go back to the 01-mapping-stations.R in the Source Pane.
Click Run to execute the next line of R, which tests whether each stop_id is a number. If it is, R keeps the row; if it is not, R removes the row.

# Keep only the rows whose stop_id is a number
station_locs <- station_locs[!is.na(as.numeric(station_locs$stop_id)), ]

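Note: you may get the warning message below. It appears because as.numeric() cannot convert the stop_id values that contain text and returns NA for them, which is exactly what this line uses to drop those rows. You can ignore the warning.

Warning message:
In `[.data.frame`(station_locs, !is.na(as.numeric(station_locs$stop_id)), :
  NAs introduced by coercion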
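
To see the coercion at work on a small scale, here is a minimal sketch with made-up stop_id values. It also shows a digits-only test with grepl() as one possible alternative that produces no warning; that alternative is an illustration and is not used in the lesson script.

```r
# Made-up stop_id values: two are numeric, two contain text
ids <- c("70061", "place-demo", "70063", "door-demo")

# as.numeric() returns NA for text it cannot convert.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
as_num <- suppressWarnings(as.numeric(ids))
print(as_num)   # 70061 NA 70063 NA

# TRUE marks the values to keep
keep <- !is.na(as_num)
ids[keep]

# An alternative test that produces no warning: digits only
keep_regex <- grepl("^[0-9]+$", ids)
identical(keep, keep_regex)
```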

Next, we want to select the rows with subway stations. If you scroll through the rows, you can see that the stations that we want have a stop_id from 70000 to 70279. To select these rows, we first convert the stop_id column to a numeric type, and then keep only the relevant rows.

Continue to click Run to step through the next two comments and two lines of R, which convert the stop_id column to a numeric type and then select the rows we want.

# Convert the stop_id column into numbers
station_locs$stop_id <- as.numeric(station_locs$stop_id)
# Select the rows for MBTA T stations
station_locs <- station_locs[which((station_locs$stop_id >= 70000) & (station_locs$stop_id <= 70279)), ]
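
The range test can be tried on its own with a few made-up numbers. This short sketch shows that both ends of the range are included.

```r
# Made-up stop_id values around the boundaries of the range
demo_ids <- c(69999, 70000, 70150, 70279, 70280)

# >= and <= make both boundaries inclusive; & combines the two tests
in_range <- demo_ids >= 70000 & demo_ids <= 70279
print(in_range)

demo_ids[in_range]   # 70000 70150 70279
```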

Look at the rows with stop_id 70140, 70141, 70217, and 70218. These rows share the name “Saint Paul Street.” However, they do not all share the same coordinates. If we look back at the map of the MBTA subway system, we can see that there are indeed two “Saint Paul Street” stops, one on the Green B line and one on the Green C line. We need to change the names of the stations to make them unique and note the change in the R comments. It is important to document the change because the station names no longer match the original file. This change may be important for future users of our data to know.

Click Run to go through the next three comments and four lines of R to make the Saint Paul Street station names unique.

# Saint Paul Street Station names are altered to include their line.
# This change is done to be able to distinguish the two stations
# named Saint Paul Street on the B and C line.
station_locs$stop_name[station_locs$stop_id == 70140] <- "Saint Paul Street B Line"
station_locs$stop_name[station_locs$stop_id == 70141] <- "Saint Paul Street B Line"
station_locs$stop_name[station_locs$stop_id == 70217] <- "Saint Paul Street C Line"
station_locs$stop_name[station_locs$stop_id == 70218] <- "Saint Paul Street C Line"
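
We spotted the shared name by eye. As a rough sketch of how such a case could be found with R, the code below counts the distinct coordinate pairs per name in a made-up data frame. On real data, a check like this would also flag a station whose inbound and outbound platforms have slightly different coordinates, so treat the result as a list of rows to inspect rather than a list of errors.

```r
# Made-up stops: one name is shared by two different locations
demo <- data.frame(
  stop_name = c("Example Street", "Example Street", "Other Stop", "Other Stop"),
  latitude  = c(42.34, 42.35, 42.36, 42.36),
  longitude = c(-71.11, -71.12, -71.13, -71.13),
  stringsAsFactors = FALSE
)

# One row per distinct combination of name and coordinates
name_coords <- unique(demo[c("stop_name", "latitude", "longitude")])

# Count how many distinct locations each name has
counts <- table(name_coords$stop_name)

# Names that appear at more than one location
shared <- names(counts)[as.vector(counts > 1)]
shared
```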

We can see that many stops are listed twice, for inbound and outbound service; however, we only need one station.
What differentiates each station besides its name? Is there another field (column) we can use to uniquely identify a station? We can see that the latitude and longitude coordinates for each station are unique, so we can select rows based on them.

Click Run to select the unique latitudes and longitudes.

# Keep the first row for each unique latitude and longitude pair
station_locs <- station_locs[!duplicated(station_locs[c("latitude", "longitude")]),]
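
To see what duplicated() does, here is a minimal sketch on three made-up rows. It marks every repeat of a latitude and longitude pair as TRUE, so negating it with ! keeps only the first row of each pair.

```r
# Made-up stops: the first two rows share the same coordinates
demo <- data.frame(
  stop_name = c("Example - Inbound", "Example - Outbound", "Other - Inbound"),
  latitude  = c(42.36, 42.36, 42.40),
  longitude = c(-71.06, -71.06, -71.10),
  stringsAsFactors = FALSE
)

# TRUE marks a row whose coordinate pair has already appeared
dup <- duplicated(demo[c("latitude", "longitude")])
print(dup)   # FALSE TRUE FALSE

# Keep only the first row of each coordinate pair
demo[!dup, ]
```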

The next step is to clean up the stop_name column and remove extra “outbound” and “inbound” text that is no longer needed. We search each station name for a dash and remove any text after it. Then, we trim any extra spaces that remain in the text field.

Run through these lines of R to trim the stop_name.

# Remove "Inbound" and "Outbound" from the station names:
# delete the first dash and everything after it
station_locs$stop_name <- sub("\\-.*", "", station_locs$stop_name)
# Remove any extra spaces left at the end of the name
station_locs$stop_name <- trimws(station_locs$stop_name, which = "right")
View(station_locs)
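
The two clean-up steps can be tested on a few made-up names first. The sketch below also shows a caveat: because the pattern cuts at the first dash, a name that itself contains a hyphen is shortened too. The stricter pattern at the end is an assumption about how the suffix is written (space, dash, space) and is not part of the lesson script.

```r
# Made-up station names
demo_names <- c("Example Station - Outbound",
                "Other Station - Inbound",
                "Plain Station")

# Step 1: delete the first dash and everything after it
no_suffix <- sub("\\-.*", "", demo_names)

# Step 2: trim the space that is left at the end
clean <- trimws(no_suffix, which = "right")
clean

# Caveat: a hyphen inside the name is cut as well
sub("\\-.*", "", "Twin-City Stop - Inbound")    # "Twin"

# A stricter pattern that only matches " - " followed by the suffix
sub(" - .*$", "", "Twin-City Stop - Inbound")   # "Twin-City Stop"
```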

Review the station_locs data frame. It now contains only the MBTA subway stations, with the id, name, latitude, and longitude of each station. If you followed the steps correctly, you should have 121 entries, as seen in the count at the bottom of Figure 7.


Figure 7. MBTA subway stations with location data.

Let’s map our stations and their locations.

6. Visualize: Mapping our Data

Review the following block of R code. First, we create a leaflet object from station_locs and name it mbta_subway. addTiles adds the base map. setView in the next line defines the center and zoom level of our map. addCircles defines the color and size of the station markers. Next, addLegend adds a title and data source to our map, which is a best practice for all maps. Finally, we render our map in the Plots Pane.

# Map the stations
mbta_subway <- leaflet(station_locs) %>%
  addTiles() %>%
  setView(-71.057083, 42.361145, zoom = 12) %>%
  addCircles(~longitude, ~latitude, weight = 3, radius = 120,
             color = "#0073B2", stroke = TRUE, fillOpacity = 0.8) %>%
  addLegend("bottomleft", colors = "#0073B2",
            labels = "Data Source: MBTA Developer Portal",
            title = "MBTA Subway Stations")

# Show the map
mbta_subway

Figure 8. mbta_subway map in the Plots Pane.

Click on the Zoom button in the Plots Pane to see a larger version of our map.

Figure 9. Final version of the mbta_subway map.
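
The circles on this map show only location. As one possible extension, the sketch below uses the popup argument of addCircles so that clicking a circle shows the station name. The two stations and their coordinates are made up, and the example assumes the leaflet package is installed.

```r
library(leaflet)

# Made-up stations standing in for station_locs
demo_locs <- data.frame(
  stop_name = c("Example A", "Example B"),
  latitude  = c(42.3611, 42.3550),
  longitude = c(-71.0571, -71.0650),
  stringsAsFactors = FALSE
)

demo_map <- leaflet(demo_locs) %>%
  addTiles() %>%
  addCircles(lng = ~longitude, lat = ~latitude, radius = 120,
             color = "#0073B2",
             popup = ~stop_name)   # click a circle to see its name

# Show the map
demo_map
```

The same popup argument could be added to the addCircles call in the lesson script, since station_locs also has a stop_name column.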

7. Analysis

Reflect: What do the circles represent? Besides location, do they communicate any other information? If not, what else could we communicate with this map given additional data?
One thing missing from this map is the individual subway lines. If you refer back to the stops.txt file from our GTFS static data, the line of each station was not included. What are our options for including the line of a station? We could manually add the line to each station in our data file. Or, we could research other datasets that have the stations and their lines, and then combine them with our dataset. We cover the challenge of color coding the lines in a subsequent lesson.
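
As a rough idea of what combining datasets could look like, the sketch below joins a made-up station table to a made-up lookup table of lines using merge(). The demo_lines table and its values are hypothetical; a later lesson covers the real data.

```r
# Made-up stations
demo_locs <- data.frame(
  stop_id   = c(70001, 70002, 70003),
  stop_name = c("Example A", "Example B", "Example C"),
  stringsAsFactors = FALSE
)

# Made-up lookup table: which line serves which stop
demo_lines <- data.frame(
  stop_id = c(70001, 70002),
  line    = c("Red", "Green"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every station, even one without a matching line
with_lines <- merge(demo_locs, demo_lines, by = "stop_id", all.x = TRUE)
with_lines
```

A station with no match gets NA in the line column, which makes gaps in the second dataset easy to spot.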

You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.