GETTING STARTED

Introduction

Welcome to the Open Transit Data Toolkit. This site provides ideas and instructions for making use of the growing amount of open data made publicly available by transit authorities across the US and around the world. The impetus for developing the Open Transit Data Toolkit was the widening gap between growing data availability and the lack of accessible resources for generating knowledge and actionable insight from the data.

The Open Transit Data Toolkit is a series of self-paced lessons that teach you how to use open transit data. Each lesson starts with a task of finding data, preparing it for analysis, and then creating a visualization of the data. With this Toolkit, you will learn some of the fundamentals of data wrangling, pick up tips for creating informative visual data displays, and gain insight into interpreting and analyzing available transit system data. When possible, the lessons use open source applications or services with a free version. The lessons have been designed to be worked through in order. However, each lesson is self-contained and can be done by itself. The skills and strategies covered in the lesson modules can be applied to other cities and available transit data.

Who is this for? / Who isn’t this for?

The Open Transit Data Toolkit is for anyone with an interest in using open transit data. The lessons provide step-by-step instructions on using transit data. However, some skills are required. Familiarity with basic high school math and statistics is expected. Desktop computing skills such as creating folders, saving files, and familiarity with installing software are required. A little programming experience will be helpful, but is not necessary. If you know what a variable is, and that a computer program is a set of instructions that you write, you should be able to follow along with the program walkthroughs. If you don’t, refer to our Resources section for suggestions on learning some basic programming skills.

If you are a novice programmer and unsure whether you have the skills to get the most out of this project, we suggest you try the first lessons. If you find you need to learn or refresh some skills, please refer to our Resources section [link] for free or inexpensive offerings to assist you.

For seasoned programmers, data analysts, and data scientists, the lessons will familiarize you with some of the nuances of transit data. We hope that you find the explanations of the various ways to measure and evaluate transit systems, and the steps provided for wrangling the transit data, helpful. If you can improve any of the scripts we provide, please feel free to submit a pull request to our GitHub account.

What you need to use the Open Transit Data Toolkit:

To use the Open Transit Data Toolkit, you need a computer, a web browser, and access to the internet. The lessons have been tested on Windows and Mac OS.

When possible, the Open Transit Data Toolkit uses open source tools such as R, RStudio, and QGIS. In some cases, we may use widely used applications, like Excel.


R and RStudio Overview:
R is a free and open source programming language that is popular in the statistics and data science communities. RStudio is a free Integrated Development Environment (IDE) for R. RStudio allows you to write and execute R scripts, view data, and export your analysis.

First, to install R, go to the R download site, select the version for your operating system, and follow the instructions. We recommend using the latest version. These lessons use version 3.3.3.

Next, install RStudio Desktop. Go to the RStudio download page and select the free Open Source License version. These lessons and screen grabs use Version 1.0.136.

Download RStudio

  • Launch RStudio.
  • If you are new to RStudio, take a look at the application.
Figure 1. RStudio Overview.

There are four main “Panes” or sections of the screen, which we will reference in this lesson.

  1. The first Pane of the RStudio application is the Source Pane. This section contains tabs for any R scripts you are working on. As you will see, you can also view your data frames here.
  2. The second Pane has two tabs by default. The Environment tab shows you the variables that you have already defined. The History tab shows you the list of the R commands you have already executed.
  3. The third Pane is the Console Pane. This area is where R commands are executed.

Click on the Console Pane. Your cursor will appear on the bottom row, next to the right arrow: >

Type 2+2 and press Enter.

The Console should return:

[1] 4

You can also execute commands from a script, which we will cover later. Error messages are also displayed here in the Console Pane. If something isn’t working as you expect, check this area to see if any errors appear.
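As a small sketch of what else you might try at the Console prompt, the lines below repeat the arithmetic above and then store values in variables. The variable names and numbers are made up for illustration; once assigned, the variables should appear in the Environment tab.

```r
# Arithmetic typed at the prompt is evaluated immediately.
2 + 2
# [1] 4

# Assigning a value with <- creates a variable.
# These names and values are hypothetical examples.
stops_per_route <- 12
routes <- 5

# Variables can be used in later commands.
total_stops <- stops_per_route * routes
total_stops
# [1] 60
```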

  4. The fourth Pane has five tabs by default: Files, Plots, Packages, Help, and Viewer. The Plots tab displays any graphs and charts generated from R commands. The Packages tab displays various packages, which extend the functionality of R. The Help tab contains R documentation. The Viewer tab can display external web content. Of these tabs, the lessons only use the Plots tab.

Note: RStudio automatically reopens the files and data from the last time you used the application, so clear them before we get started.

Close any scripts or data frames open in the Source Pane by selecting: File > Close All or by manually closing each tab.

Clear any data in the Environment tab by selecting: Session > Clear Workspace.
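If you prefer typing to menus, the Environment can also be cleared from the Console. The sketch below uses the base R functions ls() and rm(); the variable example_value is a made-up placeholder. We assume this has much the same effect as the menu item, but be aware that it removes every variable you have defined.

```r
# Create a placeholder variable so there is something to clear.
example_value <- 42

# ls() lists the names of the variables currently defined.
ls()

# rm(list = ls()) removes all of them at once.
rm(list = ls())

# The placeholder variable no longer exists.
exists("example_value")
# [1] FALSE
```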

Text Formatting in the Open Transit Data Toolkit:

File names are in italic. For example:

Open r_script_file.R in RStudio.

Features in software or tools are in bold. For example:

Review the file in the Source Pane in RStudio.

Code is in Courier typeface. For example:

rawlocs <- read.csv(file = "./stops.txt", header = TRUE, sep = ",")

File / Folders

Working Directory

A Working Directory is a folder and its subfolders that contain all the relevant files for a project. Keeping all the scripts and data in one place makes the project more manageable and helps avoid losing or overwriting your work.
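In R, relative file paths such as ./stops.txt are looked up starting from the current working directory, which you can check with getwd() and change with setwd(). The sketch below sets up a hypothetical project folder (the name transit_project and its location in R's temporary directory are assumptions for illustration; in practice you would pick a folder of your own).

```r
# Build a hypothetical project folder with a "data" subfolder.
project_dir <- file.path(tempdir(), "transit_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Remember the current working directory, then switch to the project folder.
old_dir <- getwd()
setwd(project_dir)

# Relative paths now resolve inside the project folder.
writeLines("stop_id,stop_name", "data/stops.txt")
file.exists("data/stops.txt")
# [1] TRUE

# Switch back to where we started.
setwd(old_dir)
```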

Zip Files

Zip files are compressed files that take up less storage space than the original files. You can create a zip file out of one file, a collection of files, or a folder. When dealing with large files, such as large transit data sets, zipping files makes them easier to upload or download for storage or sharing. When sharing a large number of files, it can be convenient to place them all in a folder and then create a zip file of the folder. The zip file will contain all the files within the original folder. Both Windows and Mac OS come with the ability to zip (compress) and unzip files.

On Windows, to zip a file, locate the file in the File Explorer Application. Right-click on the file and select Send to > Compressed (zipped) folder. A compressed copy is saved in the same folder as the original file. To open a zip file, right-click on the file in the File Explorer, select Extract All, and then choose where you want to save your file.

Figure 2. Zipping a file with Windows 10.

On Mac OS, zip a file by locating it in the Finder Application and selecting File > Compress ‘yourfilename.txt’. A compressed copy of your file is saved in the same folder with the new file name ‘yourfilename.txt.zip’. To open a zip file, select the file in the Finder Application, and then select File > Open. An unzipped copy of the file is saved in the same folder.

Figure 3. Zipping a file with Mac OS.

CSV

What is a CSV file? A Comma Separated Values (CSV) file is a plain text file that contains a table of data. Each line of the file is one row of the table, and the items within a row are separated by commas.
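To make this concrete, here is a minimal sketch that writes a tiny CSV file and reads it back into a data frame with read.csv(). The stop names and coordinates are invented for illustration, and the file is saved to a temporary location rather than a real project folder.

```r
# Three lines of CSV text: a header row followed by two data rows.
# The values are made up for this example.
csv_text <- c(
  "stop_id,stop_name,stop_lat,stop_lon",
  "S1,Main St,40.7128,-74.0060",
  "S2,Oak Ave,40.7306,-73.9866"
)

# Save the text to a temporary .csv file.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_text, csv_path)

# Read the file into a data frame; the first line supplies the column names.
stops <- read.csv(file = csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

nrow(stops)    # number of data rows
names(stops)   # column names taken from the header line
```

Because a CSV file is plain text, you can also open it in a text editor or in Excel to check what R is reading.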

Shapefile

A shapefile is a file format used to describe geospatial vector data, including location, shape, and other attributes. Shapefiles are commonly used in geographic information system (GIS) software.

Finding Help

Running into and overcoming obstacles is a constant part of data wrangling and programming. The Open Transit Data Toolkit has a forum to support the people working with these lessons.

There are also many other useful resources on the internet. Even the most experienced programmers use Google or other search engines to help fix bugs and solve problems. When debugging, a good approach is to copy and paste the error message, along with the name of the tool you are using (e.g. R or CARTO), into Google or another search engine. Quite often, other people have asked a similar question about the message somewhere on the internet.
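Normally you can copy the error text straight from the Console. As an optional sketch, the base R functions tryCatch() and conditionMessage() can also capture the message as text, which you can then combine with the tool name to form a search phrase. The variable undefined_variable is deliberately left undefined here to trigger an error.

```r
# Run a command that fails, and keep the error message instead of stopping.
msg <- tryCatch(
  undefined_variable + 1,                    # fails: the variable does not exist
  error = function(e) conditionMessage(e)    # return the message text
)

# Show the captured message.
msg

# Combine the tool name and the message into a phrase to paste into a search engine.
search_query <- paste("R", msg)
search_query
```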

Another valuable resource is Stack Overflow, a community-driven Q&A site for developers, where programming experts answer questions posted to the site. Many questions you may have will already have full or partial answers. Its sister site, Stack Exchange, is a Q&A community site with topics beyond programming, including GIS and CARTO.

Some transit authorities, like New York’s Metropolitan Transportation Authority (MTA) and Boston’s Massachusetts Bay Transportation Authority (MBTA), manage Google Groups discussion forums which are monitored by members of their data team.

Open Transit Data Toolkit Forums [link]

Google

Stack Overflow

Stack Overflow – Tagged R