1. Introduction

This capstone exercise takes the steps used in the previous lessons and asks you to apply them to your own research and interests. Unlike in the previous lessons, we do not provide any transit data or specific instructions. Instead, we walk through the same process step by step. In each section, we share a way to get started, recommend strategies and additional resources, and offer things to consider.

2. Learning Outcomes

Locate publicly available open datasets.
Clean and prepare a dataset for analysis.
Perform initial exploration of a dataset.
Visualize datasets using best practices.

3. Finding Data

As interest in open transit data grows, more cities and transit authorities are releasing data to share. However, each city decides for itself what data to share.

The previous lessons have used these resources:
New York City MTA Data Downloads
Boston MBTA Back on Track
San Francisco DataSF

Other data sources to explore include:
Chicago Data Portal – Transportation
Los Angeles Metro developer site
NYC Open Data – Department of Transportation
San Diego DataSD – Transportation

Start your transit data analysis project by asking yourself what you seek to learn from the data. Your answer to this question will determine which dataset will be most useful for your analysis. For example, if you would like to explore ridership trends in the New York City Subway, the fare data or the subway entrance and exit data are likely the first datasets to look into. Don’t worry if you do not initially know what to choose. Resources such as DataSF or NYC Open Data have search features to help you explore their collections. Take time to explore what data is available, or use the data from one of our previous lessons. Each of the datasets we’ve explored previously still has excellent potential for additional exploratory research.

If you see anything confusing, such as an ambiguous variable name, go back to the documentation provided by the data owner. If you still have questions, you may also contact the owner of the data. Contact information is often included with the data source. Be aware that dataset documentation varies in its level of detail, and the data owner may be slow to respond, if they respond at all.

Finally, many larger cities have set up Google Groups, which often have transit employees as members. Always search first to see if your question appears in a previous message. If you do not see it, feel free to ask. If you are not experienced with Google Groups and other kinds of forums, remember to be polite and as specific as possible to increase your chances of getting a response.

Google Groups:
Boston – MBTA Developers
NYC – MTA Developer Resources
SF – BART Developers

4. Set Up

Once you have your data, set up RStudio and your files in preparation for your analysis.

Create a Working Directory to store your files for each dataset.
When you start working in RStudio, always follow these steps:

  • Close any files left open in the Source Pane from prior use by selecting File > Close All.
  • Remove any Environment variables created previously by selecting Session > Clear Workspace.
  • Select Session > Set Working Directory > To Source File Location. RStudio now uses your Working Directory to read and save files.
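The same housekeeping can also be done from the R console. The lines below are a minimal sketch, not a replacement for the menu steps. The folder name transit_project is hypothetical, and it is created under tempdir() only so that the example runs anywhere. In practice, point project_dir at the folder that holds your own files.

```r
# Clear objects left over from earlier work (similar to clearing the workspace).
rm(list = ls())

# Hypothetical folder for this dataset; replace with your own path.
project_dir <- file.path(tempdir(), "transit_project")
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

# Make it the working directory, then confirm where R will read and save files.
setwd(project_dir)
getwd()
```

Running getwd() after setwd() is a quick way to confirm that R is pointed at the folder you expect before you load any data.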

5. Data Wrangling

Let’s start preparing your data. Data wrangling is challenging because the process of preparing your data for analysis is unique to each dataset. A survey of eighty data scientists found that they spend almost 80 percent of their time on data collection and preparation.

  • As we have done before, load your data into R as your raw_dataframe.
  • Then, save a copy of it as your working_dataframe. This is your working copy, which you can edit freely. You want to keep the original data intact so that you can easily restart your analysis should you need to do so.
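The two steps above might look like the following sketch. The CSV contents, column names, and file path are made up so that the example runs on its own. With real data, you would pass the path of your downloaded file to read.csv().

```r
# Write a tiny made-up CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "route,day,riders",
  "A,Mon,120",
  "A,Tue,135",
  "B,Mon,80",
  "B,Tue,NA"
), csv_path)

# Load the raw data, then make a working copy to edit freely.
raw_dataframe <- read.csv(csv_path, stringsAsFactors = FALSE)
working_dataframe <- raw_dataframe

# Example edit: drop rows with a missing rider count from the working copy only.
working_dataframe <- working_dataframe[!is.na(working_dataframe$riders), ]

nrow(raw_dataframe)      # the raw data still has every row
nrow(working_dataframe)  # the working copy has one row fewer
```

Because the edit touches only working_dataframe, you can start over at any point by copying raw_dataframe again.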

With your working_dataframe, it can be helpful to run some basic R commands to get a general sense of your data. head(working_dataframe, n) prints the first n rows of your data frame, and tail(working_dataframe, n) prints the last n rows. summary(working_dataframe) prints summary statistics for each column. For numeric columns, these include the minimum, maximum, median, mean, and quartile values.
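As a quick illustration, here is how these commands behave on a small made-up data frame. The route, day, and riders columns are assumptions for the example, and your own column names will differ.

```r
# Made-up ridership counts for two hypothetical routes.
working_dataframe <- data.frame(
  route  = c("A", "A", "A", "B", "B", "B"),
  day    = c("Mon", "Tue", "Wed", "Mon", "Tue", "Wed"),
  riders = c(120, 130, 140, 80, 90, 100)
)

first_rows <- head(working_dataframe, 2)  # first 2 rows
last_rows  <- tail(working_dataframe, 2)  # last 2 rows
first_rows
last_rows

# Summary statistics for every column; numeric columns show
# the minimum, quartiles, median, mean, and maximum.
summary(working_dataframe)
```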
Your data will often contain subgroups of data, for example, different bus or subway routes, or days of the week, or neighborhoods. You may want to find a statistical measure such as the mean for each subgroup in your data. The aggregate function allows you to perform an operation on each subgroup. While you work, look for patterns. Always ask yourself if your data makes sense.
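For example, the mean for each subgroup can be computed with aggregate() as in the sketch below. The data frame is made up, and route stands in for whatever grouping column your dataset has.

```r
# Made-up ridership counts for two hypothetical routes.
working_dataframe <- data.frame(
  route  = c("A", "A", "A", "B", "B", "B"),
  day    = c("Mon", "Tue", "Wed", "Mon", "Tue", "Wed"),
  riders = c(120, 130, 140, 80, 90, 100)
)

# Mean riders for each route: "riders ~ route" reads as "riders grouped by route".
mean_by_route <- aggregate(riders ~ route, data = working_dataframe, FUN = mean)
mean_by_route
```

Swapping FUN = mean for median, sum, or max gives other per-group statistics, and a formula such as riders ~ route + day groups by more than one column.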
If you want to learn more about analysis, try these resources:
Data Wrangling with R and RStudio webinar
R for Data Science (especially the Data Analysis section)
Curated list of R tutorials for Data Science

Remember that data analysis is an iterative process. Expect to return to previous steps as you discover the nuances of your dataset. Once you have your summary statistics, the next step is to graph one or several of them.

6. Visualize: Graphing our Data

It may not be clear what to graph. If this is the case, pick a variable and try to visualize it. As you graph your data, you will learn more about it. You may find outliers that lead to insights, or realize that there are errors in the dataset. This process is normal and will require some experimentation, and perhaps additional analysis, to understand what you are seeing.

A good visualization should tell the viewer a story. Following a few best practices helps make your visuals clear and easy to read.

  • Keep your charts simple and avoid graphing too much information into one visualization.
  • Avoid 3D graphs, other special effects, and too many colors.
  • Include a meaningful title.
  • Show the units of measurement on both axes.
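The sketch below applies these practices to a simple bar chart. It uses base R graphics so that it runs without extra packages, and the same ideas carry over to ggplot2. The routes and ridership values are made up for illustration.

```r
# Made-up average weekday ridership for two hypothetical routes.
avg_riders <- c("Route A" = 130, "Route B" = 90)

bar_midpoints <- barplot(
  avg_riders,
  main = "Average Weekday Ridership by Route",  # meaningful title
  xlab = "Bus route",                           # what the categories are
  ylab = "Riders per day",                      # unit of measurement
  col  = "steelblue"                            # a single color keeps it simple
)
```

The chart shows one variable, uses one color, and labels both axes, which is usually enough for a first exploratory look.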

The Colorbrewer website is a favorite tool for selecting color palettes for visualizations. Although it was created for cartographers, the color palettes available on the website work for all types of visualizations. Take care when you select colors for your visualizations. Accessibility is important, and Colorbrewer allows filtering for colorblind-safe color combinations. Contrast is another important accessibility factor.

Resources to aid in creating visualizations:
A11Y Project provides a useful list of resources on web accessibility.
The website has many examples of creating visualizations using ggplot2.
Edward Tufte’s The Visual Display of Quantitative Information is a classic reference book on visualizing data and is still relevant today.

7. Analysis

Now that you have your data and visualization, review them and ask yourself:
What patterns emerge? What relationships appear? Does the data help you answer your original question?

As you answer these questions, take care not to confuse correlation and causation. Correlation is a relationship between two variables, whereas causation claims that one thing causes the other. Take the hypothetical observation of an increase in bus ridership alongside an increase in the price of gasoline. A rise in the cost of gas is a plausible cause of an increase in bus ridership, as people shift from using a personal vehicle to mass transit. It is tempting to stop here. However, imagine that further investigation also found an increase in bus service and a decrease in bus fare during the same time period. We have now found three relationships, and it is much harder to declare the actual cause of the increase in bus ridership. In fact, it could be a combination of all three factors.
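The hypothetical above can be sketched in R with cor(). All figures below are made up for illustration only. The point is that several factors can each correlate strongly with ridership at the same time.

```r
# Made-up monthly figures for illustration only.
transit <- data.frame(
  gas_price     = c(3.0, 3.2, 3.4, 3.6, 3.8, 4.0),       # dollars per gallon
  service_hours = c(100, 105, 110, 115, 120, 125),       # bus service hours per day
  bus_fare      = c(2.75, 2.70, 2.65, 2.60, 2.55, 2.50), # dollars per ride
  ridership     = c(1000, 1050, 1120, 1160, 1230, 1280)  # riders per day
)

# Correlation of every column with ridership.
cors <- cor(transit)[, "ridership"]
round(cors, 3)

# Gas price and service hours are strongly positively correlated with ridership,
# and bus fare is strongly negatively correlated. The numbers alone cannot say
# which factor, or which combination of factors, caused ridership to rise.
```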

As you document your visualization and findings, describe the pattern suggested by your analysis. Address whether the relationship could be due to chance and what other factors might explain it. Finally, discuss what additional analysis would be of interest, even if you do not have access to the data for it.

You’re done! Help us improve this site with your valuable feedback by taking our 5-minute survey.