Bicycle sharing programs allow riders to share bicycles for short trips. These popular programs offer convenient use of bicycles in cities around the world. Today’s programs typically have bicycle stations with bicycle storage docks, which are unlocked using a key fob or mobile device. Riders use real-time system updates to find available bicycles and empty docks to return their bikes. Companies such as Ofo, Spin, and others, offer “dockless” bicycle sharing services in cities such as Beijing, Seattle, and Washington DC.
Bicycle sharing first appeared in experimental European initiatives decades ago. In 1965, Provo, a Dutch activist group, left bicycles painted white in the city of Amsterdam for anyone to use. However, authorities quickly confiscated the bicycles. Later, the Yellow Bicycle free bicycle sharing service launched in La Rochelle, France. Theft and vandalism doomed the experiment. Other small-scale programs followed soon after. However, poor treatment of the bicycles, and the tendency for bikes to collect in certain areas limited the success and adoption of the programs. That is, demand for bikes is not equally distributed across geography and time of day. For example, bicycles tend to collect at the bottom of a hill, as riders use them to travel downhill but not uphill. As well, stations in residential areas are often empty in the morning and full in the evening, as riders travel to work and home. Therefore, successful programs require rebalancing the fleet to ensure bicycles and empty slots are available when riders need them. For example, operators must bring the bikes back up the hill.
In the 2000s, the application of modern technology to the implementations of bicycle sharing not only improved the service, but also increased the feasibility of an economic business model to support the service. Now, riders can sign up in advance, pay by credit card, and find the nearest docking station with a bike to use or an open slot to park. The fees pay for the bikes and stations, and the staff to rebalance the bike loads throughout the day. When a rider is done, she may return to any docking station with space for a bike. While digital and telecommunications technologies aid in the success of these bicycle sharing programs, they do not guarantee success. Theft, vandalism, and rebalancing remain challenges, which cities such as Seattle and Baltimore have had difficulties overcoming.
New York introduced its Citi Bike bicycle sharing program in 2013 with 332 stations and 6,000 bicycles covering Manhattan below 59th Street and limited parts of Brooklyn. The private company Motivate currently owns and operates Citi Bike. The system has since grown to 706 stations and 12,000 bicycles. It now reaches 130th St. in Manhattan and additional Brooklyn neighborhoods such as Park Slope, Carroll Gardens, and Williamsburg. The borough of Queens and Jersey City, NJ also receive coverage.
This lesson uses data from the Open Bus project. We will look at the availability of bicycles, which provides insight into one measure of the performance of the station. Rebalancing the bicycles is an important task to keep the system functioning. Many people use Citi Bike as part of their daily commute. For riders to rely on Citi Bike, they need bikes to be available in the morning and stations to have empty docks in the evening. Specifically, this lesson will analyze how “full” a docking station is in the evening when many commuting riders return from work.
2. Learning Outcomes
– Locate Citi Bike bicycle sharing data.
– Wrangle bicycle sharing data into a form suitable for analysis using R.
– Visualize the data in the dock station data using R.
3. Finding Data
Once again, we are using the Open Bus website to find the data for our Citi Bike analysis. The site hosts monthly archives of all Citi Bike stations with the station identifier, station name, timestamp, location data, as well as the number of bicycles parked in the docking station, number of empty bike slots, and total number of slots for each dock. Justin Tyndall, an urban economist who researches transportation policy, created and operates the Open Bus website. The site also archives live position data from the NYC MTA bus system.
4. Set up
- Create a working folder somewhere on your computer that is easy to access. Save all the files for this module here. We will refer to this folder as your Working Directory.
- Download the transit-data-toolkit file, which contains the R files. Copy the 08-bikeshare.R file, which contains the R code to prepare, analyze, and visualize our bicycle sharing data. Save the file to your Working Directory.
- Go to the Open Bus website and navigate to Raw Data and under Bicycle Share Data, click on June 2016 – Present.
- The link opens the Google Drive folder called Raw Data 2. From here, click on 2017/10.
- Scroll down to bikeshare_nyc_raw.csv and right-click (Windows) or option-click (MacOS) to save the file to your Working Directory.
- If the downloaded file has a different name, rename it to bikeshare_nyc_raw.csv.
- Double-check that both the 08-bikeshare.R and bikeshare_nyc_raw.csv files are saved in your Working Directory.
After you have the files, you are ready to start working with the file in R. Refer to Getting Started – R and RStudio Overview if you are new to R or need a refresher.
- Launch RStudio.
- When you open RStudio, you might see files and data frames open from the last time you used it. If this occurs, close all the files in the Source Pane by selecting File > Close All. Select Session > Clear Workspace to remove any Environment variables.
- Open the file in RStudio by selecting File > Open File > 08-bikeshare.R.
To set the Working Directory, select: Session > Set Working Directory > To Source File Location. Now our Working Directory folder is the default location for our current session in RStudio. R will look in this folder for files without a file path and save any new files we create in this location.
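The setup can also be checked from the Console. The sketch below is a minimal example, assuming the two file names used in this lesson; find_missing is a hypothetical helper and is not part of the 08-bikeshare.R script.

```r
# Hypothetical helper: report which expected files are absent from a folder
find_missing <- function(files, dir = getwd()) {
  files[!file.exists(file.path(dir, files))]
}

# File names as used in this lesson
required_files <- c("08-bikeshare.R", "bikeshare_nyc_raw.csv")

missing_files <- find_missing(required_files)
if (length(missing_files) > 0) {
  message("Not found in ", getwd(), ": ", paste(missing_files, collapse = ", "))
} else {
  message("All lesson files found in ", getwd())
}
```

If a file is reported as not found, the Working Directory is probably not the folder where you saved the lesson files.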
5. Data Wrangling
Place the cursor on the first line of code and click on the Run button to execute a line of code. Remember, R skips over comments designated by the # symbol.
- We install the following packages needed for this lesson. If you have not installed the lubridate, data.table, or ggplot2 R packages in previous modules, you can install them now. Otherwise, you can skip these lines.
# Install lubridate, data.table, ggplot2, and leaflet packages (this is only required once)
install.packages('lubridate', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
# leaflet is needed for the map later in this lesson
install.packages('leaflet', dependencies = TRUE)
Besides installing the packages, we also need to load them at the beginning of every script that uses them. Run the next lines to load data.table, lubridate, ggplot2, scales, and the mapping libraries into RStudio.
# Load data.table
library(data.table)
# Load lubridate
library(lubridate)
# Load ggplot and scales
library(ggplot2); library(scales)
# Load mapping libraries
library(leaflet)
Let’s load our bicycle sharing data into RStudio. The next line of R finds and opens the file bikeshare_nyc_raw.csv, and saves it into a data frame called rawbikedata. If R cannot find this file, it will display an error in the Console Pane. Then, we save a working copy of our data frame into bikedata. We perform our operations on bikedata, so that rawbikedata is preserved in case we ever need to refer back to it.
# Read in Citibike bicycle sharing csv data file
rawbikedata <- read.csv(file="./bikeshare_nyc_raw.csv", head=TRUE,sep="\t")
# Create a working data frame
bikedata <- rawbikedata
# View bikedata
View(bikedata)
- Examine the bikedata data frame. We see dock_name, which identifies the docking station of the record. The time stamp of the record is found across several columns: date, hour, minute, and pm. The pm column holds a 1 if the time is after noon and a 0 if the time is before noon.
- Occasionally records show that the total number of slots in a dock is zero. These records are likely an error in the data, and we will discard them. Run the next line of R to select the rows where bikedata$tot_docks is not equal to zero.
# Remove any rows in which the total number of docks is zero
bikedata <- bikedata[bikedata$tot_docks != 0 ,]
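To see why these rows matter, consider what happens to a fullness ratio when the dock total is zero. The following is a small sketch on made-up rows: the values are invented for illustration, and only the column names avail_bikes and tot_docks come from the lesson data.

```r
# Invented sample rows; stations B and C report zero total docks
toy <- data.frame(
  dock_name   = c("A", "B", "C"),
  avail_bikes = c(5, 0, 3),
  tot_docks   = c(10, 0, 0)
)

# Dividing by a zero dock total gives NaN (0/0) or Inf (3/0), not a usable ratio
ratios <- toy$avail_bikes / toy$tot_docks
ratios

# Count how many rows the filter would drop
sum(toy$tot_docks == 0)

# Keep only rows with at least one dock, as the lesson does
toy_clean <- toy[toy$tot_docks != 0, ]
toy_clean
```

Counting the dropped rows first is a cheap way to confirm that the filter removes only a small number of suspect records.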
- Run the next line of R code to narrow down the dates to a more manageable range, October 9 through October 22.
# Select data for October 9 - 22 (10/09 - 10/22)
bikedata <- with(bikedata, bikedata[mday(date) <= 22 & mday(date) >= 9, ])
- Run the next block of R. We transform the four columns of time data into one column with a POSIXct time object. This step makes using the time data easier. First, we convert the date column from a factor data type into a character data type.
- Then, we use the pm column to convert the hour column into 24-hour time (also known in the US as “military” time).
- Next, we use the sprintf function to add zeros in front of the hour and minute columns where necessary.
# Create a POSIXct date and time variable using available data
bikedata$date <- as.character(bikedata$date)
# Convert to 24-hour time; hour %% 12 maps 12 AM to 0 and keeps 12 PM at 12
bikedata$hour <- (bikedata$hour %% 12) + (bikedata$pm * 12)
bikedata$hour <- sprintf("%02d", bikedata$hour)
bikedata$minute <- sprintf("%02d", bikedata$minute)
- The last step is to combine the columns into a time object, which contains year, month, day, hour, and minute information.
bikedata$hour <- paste(bikedata$hour, bikedata$minute, sep=":")
bikedata$date <- paste(bikedata$date, bikedata$hour, sep=" ")
bikedata$date <- as.POSIXct(bikedata$date, format="%y-%m-%d %H:%M")
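The hour conversion is easiest to check on a handful of values. The sketch below is an illustration rather than the lesson's script: to_24_hour is a hypothetical helper, the sample rows are invented, and it assumes the date text uses a two-digit year, as the format string above suggests. It uses the modulo form (hour %% 12) + 12 * pm, which sends 12 AM to hour 0 and keeps 12 PM at hour 12.

```r
# Hypothetical helper: 12-hour clock plus a pm flag (0 or 1) to a 24-hour value
to_24_hour <- function(hour, pm) {
  (hour %% 12) + 12 * pm
}

# Invented sample rows covering midnight, morning, noon, and evening
toy <- data.frame(
  date   = c("17-10-09", "17-10-09", "17-10-09", "17-10-09"),
  hour   = c(12, 9, 12, 6),
  minute = c(5, 30, 0, 45),
  pm     = c(0, 0, 1, 1)
)

toy$hour24 <- to_24_hour(toy$hour, toy$pm)

# Zero-pad hour and minute, then parse the combined text as a POSIXct time
stamp <- sprintf("%s %02d:%02d", toy$date, toy$hour24, toy$minute)
toy$timestamp <- as.POSIXct(stamp, format = "%y-%m-%d %H:%M",
                            tz = "America/New_York")

format(toy$timestamp, "%Y-%m-%d %H:%M")
```

Setting tz explicitly is a design choice in this sketch. Without it, as.POSIXct uses the time zone of the computer running the code, which may not be Eastern Time.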
- Run the next line to create avail_ratio. This variable is the ratio of parked bikes in the station to the total number of docks. The ratio describes how full the dock is. A 0.0 means the station is empty, and 1.0 means the station is completely full.
This ratio is useful for riders because it helps them plan where they can find bicycles to use or an open dock to return a bicycle. Full stations can be inconvenient, especially if the docks are located at the outer boundaries of the bicycle sharing system with a limited number of nearby stations. Riders may need to search for several stations to find an empty slot.
# Create a variable which measures how 'full' a bicycle sharing dock is
# 0 = empty, 1.0 = full
bikedata$avail_ratio <- bikedata$avail_bikes / bikedata$tot_docks
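- Finally, we will discard the columns we will not use in our visualization. Run these two lines of R and view our bikedata data frame. We keep the X_lat and X_long coordinate columns because the map needs them later.
# Remove columns of data we don't need
bikedata <- bikedata[c("dock_id","dock_name","date","avail_bikes","avail_docks","tot_docks","avail_ratio","X_lat","X_long")]
View(bikedata)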
Let’s find the station which has the highest average availability ratio in our data. The higher the ratio, the more “full” the station is.
The next block of R code will help us identify the station with on average the most bikes in the evening when commuters are trying to park their bikes.
- First, we want to narrow our analysis to the evenings. Run the next line of R to create the data frame evening_bikedata, which contains only readings taken at or after 18:00 (6:00 PM ET).
# Select times after 18:00 / 6pm ET using the hour function from lubridate
evening_bikedata <- with(bikedata, bikedata[hour(bikedata$date) >= 18 , ] )
- Review and run the next section. We use R’s aggregate function to group records by their dock_name and then find the average avail_ratio of each dock_name subgroup. Then, we save these averages into a new data frame named evening_full. We also rename the evening_full columns into something more meaningful, as the default names R creates are vague. Finally, we order the evening_full data.frame on the avg_available column to easily identify the station with the highest ratio.
- View the evening_full data frame to examine our results.
# find the mean of the availability ratio and keep the location coordinates
evening_full <- aggregate(evening_bikedata$avail_ratio, by=list(evening_bikedata$dock_name,evening_bikedata$X_lat,evening_bikedata$X_long), FUN=mean)
# change column names
colnames(evening_full) <- c("dock_name", "latitude","longitude", "avg_available")
# sort by the availability ratio
evening_full <- evening_full[order(evening_full$avg_available), ]
# to retain the order in plot.
evening_full$dock_name <- factor(evening_full$dock_name, levels = evening_full$dock_name)
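The grouping step can be tried on a few invented rows before running it on the full data. This is a minimal sketch using the formula interface of aggregate, which keeps the column names readable. The sample values are made up, and the sort here is descending so the fullest station appears first, unlike the ascending sort in the lesson code.

```r
# Invented readings for three stations
toy <- data.frame(
  dock_name   = c("A", "A", "B", "B", "C"),
  avail_ratio = c(0.2, 0.4, 1.0, 0.8, 0.5)
)

# Mean avail_ratio for each dock_name
toy_avg <- aggregate(avail_ratio ~ dock_name, data = toy, FUN = mean)

# Rename the averaged column to match the lesson's naming
names(toy_avg)[names(toy_avg) == "avail_ratio"] <- "avg_available"

# Sort from fullest to emptiest
toy_avg <- toy_avg[order(toy_avg$avg_available, decreasing = TRUE), ]
toy_avg
```

Station B, with readings of 1.0 and 0.8, ends up at the top of the sorted result.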
We will visualize our data in two ways. First, we use Leaflet to map all the Citibike docks and their fullness ratio. Then, we graph one station’s fullness ratio over time.
- Click Run to make a color palette using the colorNumeric function, which maps the fullness ratio to shades of blue. The darker the blue, the fuller the dock is.
# Draw map of docking stations with fullness ratio
palette <- colorNumeric( palette = "Blues", domain = (evening_full$avg_available))
- Run the next two lines to create the Leaflet map and then render it in the Viewer Pane.
# Make Leaflet map
access_map <- leaflet(evening_full) %>% addTiles() %>%
  addCircles(lng = ~longitude, lat = ~latitude,
             radius = ~avg_available * 100,
             color = ~palette(avg_available),
             fillOpacity = 1,
             popup = ~dock_name)
# Render the map in the Viewer Pane
access_map
Each circle represents a docking station. A darker blue and a larger radius mean the dock is on average fuller in the evening. Clicking on a circle displays the name of the docking station.
Now, add a legend to the map.
- Once again, click Run twice. The first line adds a legend to the map. The second line renders it.
# Add Legend
access_map <- access_map %>% addLegend('bottomleft',
                                      pal = palette, values = ~avg_available,
                                      title = "Citibike Station Percent Full<br>Evenings 10/2017",
                                      opacity = 1)
# Render the map with the legend
access_map
Figure 9. Map of Citibike Station’s Average Availability
View the map. Notice how the “full” stations are often on the boundaries of the Citibike service area. Why might this be the case? Let’s take a closer look. Find the Dwight St and Van Dyke St Citibike station, which is in the Red Hook neighborhood of Brooklyn and south-east of Governor’s Island. It also has the highest ratio. Let’s graph how its availability fluctuates over time.
Figure 10. Dwight St and Van Dyke St Citibike station
In the evening_full data.frame, scroll down and find the station with the highest ratio, which is Dwight St and Van Dyke St. Let’s plot the availability of this station.
- Run the next three lines of R to plot our data.
- First, we save the data from the Dwight St and Van Dyke St station into its own data frame.
- Then, we use theme_set to format our graph.
- Create a ggplot, with the date variable as the x-axis and avail_ratio as the y-axis.
Dwight_VanDyke <- with(bikedata, bikedata[dock_name == "Dwight St & Van Dyke St" , ] )
# Set the plot theme (theme_bw() is an assumption; any ggplot2 theme works here)
theme_set(theme_bw())
# Allow Default X Axis Labels
ggplot(Dwight_VanDyke, aes(x=date, y=avail_ratio)) +
  geom_point(col="tomato2", size=1) +
  labs(title="Dwight St and Van Dyke St Citibike Station Availability",
       subtitle="For October 9-22")
# Redraw the last plot with one x-axis break per day
last_plot() + scale_x_datetime(breaks = date_breaks("1 day"))
In our graph, we see that the Dwight St and Van Dyke St station is often full in the evening. This condition makes sense when we look at the station’s location on the map. The station is at the edge of the service area and is close to the IKEA ferry dock. Citibike riders trying to return bikes will need to find alternative stations. This occurrence is inconvenient not only for the extra time the ride takes but also because rides over 45 minutes incur additional fees. Rebalancing the bikes across the system is an important task to help ensure bikes are where they are needed, and stations are not full when riders want to return their bikes.
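The claim that a station is "often full" can be turned into a number. The sketch below is one possible way to do that, not part of the lesson's script: share_full is a hypothetical helper, the threshold of 0.9 is an arbitrary choice, and the sample ratios are invented.

```r
# Hypothetical helper: fraction of readings at or above a fullness threshold
share_full <- function(avail_ratio, threshold = 1.0) {
  mean(avail_ratio >= threshold, na.rm = TRUE)
}

# Invented evening readings for one station
toy_ratio <- c(1.0, 1.0, 0.9, 0.5, 1.0, 0.75, 1.0, 0.25)

# Share of readings where the station is completely full
share_full(toy_ratio)

# Share of readings where at most 10% of docks are free
share_full(toy_ratio, threshold = 0.9)
```

With the lesson data loaded, the same helper could be applied to the evening readings of the station, for example `share_full(Dwight_VanDyke$avail_ratio[hour(Dwight_VanDyke$date) >= 18])`.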
In our previous discussion on ridership, we saw that frequency and reliability are important factors for successful mass transit systems. Similarly, the success of a bicycle sharing program has its own critical factors. Of course, the technology needs to work. However, the availability of bicycles and open docks is also important. Bicycles and docks need to be available when needed, especially if commuters want to utilize the bicycles as part of their daily commutes to work or for transportation to other important appointments.
The analysis performed in this lesson is a beginning step to understanding the distribution of bicycles and docks in the system over time. Insights from these kinds of analyses inform how the distribution of bikes can be improved. This introduction to the Citi Bike system and its usage brings up other questions to explore. Our analysis makes assumptions about the needs of commuter riders in the evening. Are these assumptions reasonable? Which other riders might be taking or returning bikes in the evening? How might we test for this? Further, how would you perform a similar analysis for morning commuters? We leave these questions for you to investigate, or even better, devise your own questions to try to answer.
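As a starting point for the morning question, the evening steps can be mirrored with a different time window. The following is a rough sketch under stated assumptions: the 6:00 to 9:59 window is a guess at the morning commute, the sample rows are invented, and base R is used to extract the hour so the example runs without extra packages.

```r
# Invented readings for two stations across one day
toy <- data.frame(
  dock_name   = c("A", "A", "A", "B", "B", "B"),
  date        = as.POSIXct(c("2017-10-09 07:00", "2017-10-09 08:30",
                             "2017-10-09 19:00", "2017-10-09 07:15",
                             "2017-10-09 09:00", "2017-10-09 18:30"),
                           tz = "America/New_York"),
  avail_ratio = c(0.1, 0.0, 0.9, 0.6, 0.8, 0.2)
)

# Hour of day (0-23) taken from the time stamp
reading_hour <- as.POSIXlt(toy$date)$hour

# Assumed morning commute window: 6:00 to 9:59
morning <- toy[reading_hour >= 6 & reading_hour < 10, ]

# Average fullness per station; the lowest values are the emptiest stations
morning_empty <- aggregate(avail_ratio ~ dock_name, data = morning, FUN = mean)
morning_empty <- morning_empty[order(morning_empty$avail_ratio), ]
morning_empty
```

In the morning the concern is reversed: a low ratio means commuters may not find a bike, so the sort is ascending to put the emptiest stations first.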