9 – Passenger Surveys and Rideshare Usage

1. Introduction

The previous lessons used data collected from vehicles, turnstiles, and other devices. For this lesson, the basis for our analysis is data collected by surveying people. Survey data helps answer questions that are difficult or impossible to answer using other methods, for instance, what modes of transportation people use and why. The San Francisco Municipal Transportation Agency (SFMTA) started conducting the annual Transit Decision Survey in 2013 in an attempt to answer that very question. Each year, the SFMTA conducts the survey and then publishes a summary of the findings.

In 2017, the survey included responses from 804 people. The survey asked people how often they used transportation modes including walking, biking, driving, buses, subways, and ridesharing services. The survey also asked for demographic information such as age, income, and residential location. As with all data, survey data has limitations. For instance, surveys rely on people recalling their experiences accurately and answering truthfully.

Ridesharing is a popular and often controversial service provided by transportation network companies. Companies such as Lyft and Uber match passengers and drivers using a smartphone application. Some reports suggest that the popularity of ridesharing services affects the number of vehicles on the road and the ridership of other modes of transit, such as buses and subways. In this lesson, we will use the Transit Decision Survey to investigate the usage of ridesharing services in San Francisco.

2. Learning Outcomes

Explore survey data and how to evaluate it.
Locate SFMTA Transit Decision Survey data.
Calculate self-reported usage of ridesharing services by age and income.
Create graphs to visualize the behaviors.

3. Finding Data

The DataSF website contains open datasets from the city and county of San Francisco on subject areas ranging from the environment to housing. It also hosts the SFMTA Transit Decision Survey data. Each DataSF dataset also lists the department which produced the data and includes a form to contact the dataset owner. This feature is helpful if you have questions about the data.

4. Set Up

  • Get ready to work by creating a Working Directory folder on your computer, somewhere that is easily accessible.
  • Go to the Transit Decision Survey site, download the data, and save it to the Working Directory.


Figure 1. DataSF Travel Decision Survey Data 2017.

  • Open the xlsx file in Excel and under the File Menu, click on “Save As.”
  • In the open dialog box, select CSV as the File Format.
  • Click the Save button.

Note: depending upon your operating system and version of Excel, your exact steps may be slightly different.


Figure 2. Save As CSV file.

Finally, for the last file we need, copy the 09-survey.R file to your Working Directory. Download the transit-data-toolkit file, which contains the R files, if you need it.

5. Data Wrangling

  • Launch RStudio.
  • When you open RStudio, files and data frames might be open from the last time you used the application. If so, close all the files in the Source Pane by selecting File > Close All. Also remove any prior Environment variables by selecting Session > Clear Workspace.
  • Open the file in RStudio by selecting File > Open File > 09-survey.R.
  • Set the Working Directory. Select: Session > Set Working Directory > To Source File Location. Now, our Working Directory folder is the default location for our current session in RStudio. Any files we create in RStudio will be saved to this location.
  • As we have done in previous modules, place the cursor on the first line of code. When ready, click on the Run button to execute a line of code. As a reminder, comments, which are designated with a #, are ignored and skipped over.

We’re using three packages: ggplot2, lubridate, and data.table. If you have not installed these in a previous module, you can install them now. Otherwise, you can skip these lines.


# Install ggplot2, lubridate and data.table (this is only required once)
install.packages('ggplot2', dependencies = TRUE)
install.packages('lubridate', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
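
If you are not sure whether these packages are already installed, a check like the one below can help. This is a minimal sketch and not part of 09-survey.R; the helper name missing_packages is hypothetical, and it relies only on base R's requireNamespace function.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggplot2", "lubridate", "data.table"))
print(to_install)

# Uncomment the next line to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

An empty result means all three packages are available and the install lines can be skipped.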
  • Next, we load our packages in RStudio. ggplot2 graphs our findings and lubridate helps format our time data. We use data.table to prepare our data frames.
  • Run this block of R code to load our packages.

# Load ggplot into current session
library(ggplot2)
# Load lubridate
library(lubridate)
# Load data.table
library(data.table)
  • Read through and then click Run to execute the next three lines of R.

First, we load our survey data into the raw_survey_2017 data frame. Then, we create a working copy of the data frame, named survey_2017. With the third line, we open the data frame in a new tab in the Source Pane.


# Read in MTA SF Survey Results
raw_survey_2017 <- read.csv(file = "./TDS_202017_20Data-WEBPAGE.csv", header = TRUE, sep = ",")
# Create a working data frame
survey_2017 <- raw_survey_2017
View(survey_2017)


Figure 3. SFMTA Transit Decision Survey Data.

Review the data frame. Each row is a survey participant’s answer to the questions represented by each column. You can see that their responses are numbers rather than words. The definitions are in the original Excel file, in the tab, Data Dictionary.

We are interested in the following questions and their answers:
Q21A. “Have you tried any of these new travel options? If yes, how often do you use them? a. Lyft, Uber or other ridesharing companies.”

1 = Never tried
2 = Daily
3 = Weekly
4 = Monthly
5 = Rarely
6 = I’ve tried it, but I do not use it.

Q27. “How old are you?”
1 = 18 – 24
2 = 25 – 34
3 = 35 – 44
4 = 45 – 54
5 = 55 – 64
6 = 65+
7 = Refused

Q29. “Is your annual household income…”
1 = $15,000 or less
2 = $15,001 through $25,000
3 = $25,001 through $35,000
4 = $35,001 through $75,000
5 = $75,001-$100,000
6 = $100,001 through $200,000
7 = Over $200,000
8 = Refused

Note that the values we are using are categorical rather than continuous numbers. Therefore, we will adjust the assigned values to make sure they increase as the values they represent increase. For instance, the “Daily” rideshare usage is assigned a value of 2. However, we will reassign it the highest value and adjust the other categories similarly. Also note that some responses are open-ended ranges, for instance, the income answer of “Over $200,000.” When we eventually take the average of these responses, the averages will indicate a general, but not exact, measurement.
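
Before recoding, it can help to count how often each code appears in a column. The sketch below is illustrative only: toy_survey is a made-up data frame with hypothetical values, standing in for survey_2017. The same table call could be pointed at survey_2017$Q21A.

```r
# Made-up responses (hypothetical values, not real survey data)
toy_survey <- data.frame(
  Q21A = c(1, 2, 3, 1, NA, 5),
  Q27  = c(2, 2, 3, 6, 1, 7),
  Q29  = c(4, 8, NA, 6, 1, 5)
)

# Count each rideshare usage code, including missing answers
usage_counts <- table(toy_survey$Q21A, useNA = "ifany")
print(usage_counts)
```

The useNA argument makes missing answers visible in the counts, which is useful before deciding how to handle them.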

The first thing we will do is select these columns and discard the rest. Also, note that some respondents did not record every answer. Therefore, we want to discard any responses which are missing answers. We can do this with the complete.cases function.
Then, we will rename the columns to give them more meaningful names.
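
To see what complete.cases does, here is a minimal sketch on a made-up data frame (toy, with hypothetical values). It returns TRUE for rows with no missing values, and those are the rows we keep.

```r
# Made-up responses; NA marks an answer that was not recorded
toy <- data.frame(
  Q21A = c(1, 2, NA, 4),
  Q27  = c(2, NA, 3, 5),
  Q29  = c(4, 8, 6, 1)
)

# TRUE where every column in the row has a value
keep <- complete.cases(toy)
print(keep)

# Keep only the complete rows
toy_complete <- toy[keep, ]
print(toy_complete)
```

Note that a "Refused" answer is stored as a number (7 for age, 8 for income), so complete.cases does not remove it.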

  • Run the following lines of R.

rideshare <- survey_2017[c("Q21A","Q27","Q29")]
rideshare <- rideshare[complete.cases(rideshare), ]
colnames(rideshare)[1] = "USAGE"
colnames(rideshare)[2] = "AGE"
colnames(rideshare)[3] = "INCOME"


Figure 4. rideshare data frame.

  • Next, run the following lines of R, which use the cut function to assign easier-to-read labels to the age and income ranges.

# Separate the rows into age groups
rideshare$AGE_RANGE <- cut(rideshare$AGE,
  breaks = 7,
  labels = c("18-24 yrs", "25-34 yrs", "35-44 yrs", "45-54 yrs", "55-64 yrs", "65+ yrs", "NA"),
  right = FALSE)

# Separate the rows into income groups
rideshare$INCOME_RANGE <- cut(rideshare$INCOME,
  breaks = 8,
  labels = c("$15,000 or less", "$15,001-$25,000", "$25,001-$35,000", "$35,001-$75,000",
             "$75,001-$100,000", "$100,001-$200,000", "Over $200,000", "NA"),
  right = FALSE)
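
When breaks is a single number, cut divides the observed range of the data into that many equal intervals. The labels therefore line up with the codes only if every code appears in the column. As an alternative (not the approach used in 09-survey.R), factor with explicit levels maps each code directly to its label. The sketch below compares the two on a made-up vector, toy_age, in which some codes are absent.

```r
age_labels <- c("18-24 yrs", "25-34 yrs", "35-44 yrs", "45-54 yrs",
                "55-64 yrs", "65+ yrs", "NA")

# Hypothetical age codes; codes 1, 4, 5 and 7 do not occur
toy_age <- c(2, 3, 3, 6)

# cut() spreads 7 intervals over the observed range (2 to 6),
# so the labels no longer match the original codes
by_cut <- cut(toy_age, breaks = 7, labels = age_labels, right = FALSE)

# factor() ties code 1 to the first label, code 2 to the second, and so on
by_factor <- factor(toy_age, levels = 1:7, labels = age_labels)

print(data.frame(code = toy_age, by_cut = by_cut, by_factor = by_factor))
```

With the full survey, where every code is expected to appear, both approaches should give the same labels.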

Although the responses are not precise, we will use them as an estimate of respondents’ usage. For each of the questions, we want lower response categories to be assigned lower values. In the case of rideshare usage, the answer “I’ve tried it” gets a value of 6. However, we want to assign it a lower value. We want “Daily” to have the highest value of 5, and the values associated with the subsequent responses to decrease, with the “Never tried” response set to 0.

  • Run the next lines of R to set the values of USAGE_SCORE.

# Recode the usage score so that it increases with ridership
# Start every row at 0 ("Never tried"), then add the score for each matching code
rideshare$USAGE_SCORE <- 0
rideshare$USAGE_SCORE <- ((rideshare$USAGE == 2) * 5) + rideshare$USAGE_SCORE
rideshare$USAGE_SCORE <- ((rideshare$USAGE == 3) * 4) + rideshare$USAGE_SCORE
rideshare$USAGE_SCORE <- ((rideshare$USAGE == 4) * 3) + rideshare$USAGE_SCORE
rideshare$USAGE_SCORE <- ((rideshare$USAGE == 5) * 2) + rideshare$USAGE_SCORE
rideshare$USAGE_SCORE <- ((rideshare$USAGE == 6) * 1) + rideshare$USAGE_SCORE

Now our values for USAGE_SCORE are as follows:
0 = Never tried
1 = I’ve tried it, but I do not use it.
2 = Rarely
3 = Monthly
4 = Weekly
5 = Daily
You can compare USAGE_SCORE and USAGE to ensure the conversion is correct.
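
One way to make that comparison is a cross-tabulation of the old codes against the new scores. The sketch below uses a made-up vector, toy_usage, rather than the survey data. It also shows an alternative way to recode, a lookup vector indexed by the original code, which is offered only as an illustration and is not the method in 09-survey.R.

```r
# Hypothetical USAGE codes: one of each answer, plus a repeat
toy_usage <- c(1, 2, 3, 4, 5, 6, 2)

# Position i holds the score for original code i:
# 1 Never tried -> 0, 2 Daily -> 5, 3 Weekly -> 4,
# 4 Monthly -> 3, 5 Rarely -> 2, 6 Tried but do not use -> 1
score_lookup <- c(0, 5, 4, 3, 2, 1)
toy_score <- score_lookup[toy_usage]

# Each original code (row) should have counts in exactly one score column
check <- table(USAGE = toy_usage, USAGE_SCORE = toy_score)
print(check)
```

On the real data, the same check would be table(rideshare$USAGE, rideshare$USAGE_SCORE).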

With our categories of values defined, we calculate the mean usage score by age group and store it in a new data frame called rideshare_mean_byage, which holds the data for our visualization. Then, we will calculate the mean usage score by income.

  • Run the next lines of R to create rideshare_mean_byage.

# Find average usage score by age range
rideshare_mean_byage <- aggregate(USAGE_SCORE ~ AGE, data = rideshare, FUN = mean)
rideshare_mean_byage$USAGE_SCORE <- round(rideshare_mean_byage$USAGE_SCORE, digits = 2)
rideshare_mean_byage$AGE <- cut(rideshare_mean_byage$AGE,
  breaks = 7,
  labels = c("18-24 yrs", "25-34 yrs", "35-44 yrs", "45-54 yrs", "55-64 yrs", "65+ yrs", "NA"),
  right = FALSE)
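
To see what aggregate is doing here, consider a minimal sketch on a made-up data frame (toy, with hypothetical values). The formula USAGE_SCORE ~ AGE means "summarize USAGE_SCORE within each value of AGE." Swapping mean for length gives the number of respondents behind each average.

```r
# Made-up scores for two age codes (hypothetical values)
toy <- data.frame(
  AGE = c(1, 1, 2, 2, 2),
  USAGE_SCORE = c(5, 3, 0, 1, 2)
)

# Mean score per age code: (5 + 3) / 2 = 4 and (0 + 1 + 2) / 3 = 1
toy_means <- aggregate(USAGE_SCORE ~ AGE, data = toy, FUN = mean)
print(toy_means)

# Number of responses behind each mean
toy_counts <- aggregate(USAGE_SCORE ~ AGE, data = toy, FUN = length)
print(toy_counts)
```

Checking the counts alongside the means helps show how much weight each group's average can bear.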

Calculate the usage score by income.

  • Run the next lines of R to calculate the mean usage score by income range.


# Find average usage score by income range
rideshare_mean_byincome <- aggregate(USAGE_SCORE ~ INCOME, data = rideshare, FUN = mean)
rideshare_mean_byincome$USAGE_SCORE <- round(rideshare_mean_byincome$USAGE_SCORE, digits = 2)
rideshare_mean_byincome$INCOME <- cut(rideshare_mean_byincome$INCOME,
  breaks = 8,
  labels = c("$15,000 or less", "$15,001-$25,000", "$25,001-$35,000", "$35,001-$75,000",
             "$75,001-$100,000", "$100,001-$200,000", "Over $200,000", "NA"),
  right = FALSE)

Now our data is ready to visualize. View the rideshare data frame to see that our new fields are correct.


Figure 5. Final rideshare dataframe.

6. Visualize: Graphing our Data

First, we create a plot for the usage score by age.

  • Run these lines of R to plot our chart.


# Plot usage score by age
usagebyage <- ggplot(rideshare_mean_byage, aes(x=AGE, y=USAGE_SCORE, label=USAGE_SCORE)) +
  geom_point(stat='identity', fill='black', size=10) +
  geom_segment(aes(y = 0,
                   x = AGE,
                   yend = USAGE_SCORE,
                   xend = AGE),
               color = "black") +
  geom_text(color="white", size=3) +
  labs(title="Rideshare Usage by Age",
       subtitle="SFMTA Transit Decision Survey") +
  ylim(0, 6) +
  coord_flip()
plot(usagebyage)

The plot will appear in the Plots tab of the bottom-right pane.

  • Click on Zoom to display our plot in a larger window.


Figure 6. Zoom button in Plots tab.


Figure 7. Plot of Rideshare Usage by Age.

  • In the Plot pane, click the Export button and then Save as Image… to save a copy of our plot.
  • In the Save Plot as Image window, set the width to 800 and the height to 400.
  • Change the File Name to: rideshare_usage_byage.
  • Click the Save button.
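
The same export can also be scripted with ggplot2's ggsave function, which is convenient when the plot needs to be regenerated. This is a minimal sketch, not part of 09-survey.R: toy_means holds made-up scores standing in for rideshare_mean_byage, and the file is written to a temporary folder rather than the Working Directory.

```r
library(ggplot2)

# Made-up mean scores (hypothetical values, not survey results)
toy_means <- data.frame(
  AGE = c("18-24 yrs", "25-34 yrs", "35-44 yrs"),
  USAGE_SCORE = c(2.1, 2.4, 1.8)
)

toy_plot <- ggplot(toy_means, aes(x = AGE, y = USAGE_SCORE)) +
  geom_point(size = 10) +
  ylim(0, 6) +
  coord_flip()

# 8 x 4 inches at 100 dpi gives an 800 x 400 pixel image
out_file <- file.path(tempdir(), "rideshare_usage_byage.png")
ggsave(out_file, plot = toy_plot, width = 8, height = 4, dpi = 100)
```

To save the lesson's own plot, pass usagebyage as the plot argument and use a file name inside the Working Directory.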


Figure 8. Export Plot.


Figure 9. Save Plot as Image.

Next, we create a plot for the usage score by income range.

  • Run these lines of R to plot our chart.

# Plot usage score by income
usagebyincome <- ggplot(rideshare_mean_byincome, aes(x=INCOME, y=USAGE_SCORE, label=USAGE_SCORE)) +
  geom_point(stat='identity', fill='black', size=10) +
  geom_segment(aes(y = 0,
                   x = INCOME,
                   yend = USAGE_SCORE,
                   xend = INCOME),
               color = "black") +
  geom_text(color="white", size=3) +
  labs(title="Rideshare Usage by Income",
       subtitle="SFMTA Transit Decision Survey") +
  ylim(0, 6) +
  coord_flip()
plot(usagebyincome)


Figure 10. Plot of Rideshare Usage by Income.

Let’s save this graph in the same way we did with the Usage Score by Age.

  • In the Plot pane, click the Export button and then Save as Image… to save a copy of our plot.
  • In the Save Plot as Image window, set the width to 800 and the height to 400.
  • Change the File Name to: rideshare_usage_byincome.
  • Click the Save button.

7. Analysis

Look at our plots. We see that, overall, the usage score decreases as respondents’ ages increase. However, respondents in the 18-24 age group reported slightly less use than 25-34 year olds. This difference could reflect the fact that people in the youngest age bracket have less money for ridesharing services. We also see a fairly clear relationship between income and usage: the usage score increases as income increases. However, one important factor to consider is that as people get older, they frequently earn more money. Because age and income are related in this way, the correlation we see with age might actually be caused by income level, or vice versa.

8. Going Further

A natural next step is to try to segment the usage score by age and income together. However, with seven age categories and eight income categories (counting “Refused” in each), we have 56 groups (e.g., 25-34 year olds with a household income of $100,001 through $200,000). With only 804 respondents, even the ideal case, where people are evenly distributed among the groups, would have only about 14 responses per group. We most likely need more data to have a large enough sample to be confident in our analysis. Therefore, we may need to collect additional data or find another way to answer our question.
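
The arithmetic behind this estimate, and a way to inspect how respondents actually fall into the combined groups, can be sketched as follows. The category counts come from the data dictionary above (seven codes for Q27 and eight for Q29, including "Refused"). toy is a made-up data frame with hypothetical values; on the real data, the equivalent call would be table(rideshare$AGE, rideshare$INCOME).

```r
# Best case: respondents spread evenly over every age/income combination
n_respondents <- 804
n_age_codes <- 7      # Q27 codes 1-7
n_income_codes <- 8   # Q29 codes 1-8
n_cells <- n_age_codes * n_income_codes
per_cell <- n_respondents / n_cells
print(n_cells)              # 56
print(round(per_cell, 1))   # 14.4

# Made-up responses, to show how a cross-tabulation exposes empty groups
toy <- data.frame(
  AGE    = c(1, 1, 2, 2, 2, 3),
  INCOME = c(4, 4, 6, 6, 7, 1)
)
cell_counts <- table(AGE = toy$AGE, INCOME = toy$INCOME)
print(cell_counts)
```

In practice, responses are rarely spread evenly, so some groups will hold far fewer than the best-case number.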

Resources:

Transportation Decision Survey report

DataSF Travel Decision Survey data
