The boris-bikes dataset consists of route information from London’s public bicycle hire scheme (Boris Bikes, Santander Cycles).
- Copy and unzip `2012-journeys.csv.gz` from `S:\Training\DataEngineering\datasets\boris-bikes` to `boris-bikes\data`
At the command line type:

```
sbt package && spark-submit --class bikes.runner.<classname> target\scala-2.11\boris-bikes_2.11-0.1.jar
```

Together these commands build the jar and then execute the chosen main class from that jar via spark-submit.
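For reference, a runner class might look something like the following minimal sketch; `ExampleRunner` is a hypothetical name standing in for `<classname>` in the command above, and the master URL is left to spark-submit (e.g. `--master local[*]` for a local run).

```scala
package bikes.runner

import org.apache.spark.sql.SparkSession

// Hypothetical runner skeleton - an illustration, not part of the project
object ExampleRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("boris-bikes")
      .getOrCreate() // the master URL comes from spark-submit's --master flag

    // ... call into your analysis code here ...

    spark.stop()
  }
}
```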
Make the RouteAnalysisTest suite pass! To run the test suite, use:

```
> sbt
...
sbt:boris-bikes> test
```
- Use the sbt shell for quick feedback
- Use your IDE to explore the API
- `df.show` and `df.printSchema` are your friends
- Running `~testQuick` will rerun failing tests whenever a relevant code change is detected
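For quick exploration, a snippet along these lines can be run from the spark-shell or a throwaway test; the file path is an assumption about where the unzipped CSV ends up, and `spark` is the active SparkSession.

```scala
// Peek at the raw data before writing any transformations
val df = spark.read
  .option("header", "true")
  .csv("data/2012-journeys.csv") // assumed location of the unzipped file

df.printSchema() // column names and types (all strings unless inferSchema is enabled)
df.show(10)      // first ten rows, truncated for readability
```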
- `firstTenRows`
  - Look for an example of reading a file into a Dataset
- `tenMostPopularRoutes`
  - The file is a CSV - so read it as one!
  - Check the API options of a Dataset
- `mostPopularRoutesWithAverageJourneyTime`
  - `import org.apache.spark.sql.functions._` to gain access to useful functions for DataFrames
  - Look at the docs for `spark.sql.functions.lit`
  - A rough sketch of one possible approach to these two aggregation exercises follows this list
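As a rough sketch (not the reference solution), the two aggregation exercises could be approached like this; the column names (`StartStation Name`, `EndStation Name`, `Duration`) and the file path are assumptions about the CSV layout, so adjust them to match the real schema.

```scala
import org.apache.spark.sql.functions._

val journeys = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/2012-journeys.csv") // assumed location of the unzipped file

// tenMostPopularRoutes: count journeys per (start, end) pair and keep the top ten
val topRoutes = journeys
  .groupBy(col("StartStation Name"), col("EndStation Name"))
  .count()
  .orderBy(desc("count"))
  .limit(10)

// mostPopularRoutesWithAverageJourneyTime: also compute the mean duration per route
val topRoutesWithAvg = journeys
  .groupBy(col("StartStation Name"), col("EndStation Name"))
  .agg(count(lit(1)).as("journeys"), avg(col("Duration")).as("avgDuration"))
  .orderBy(desc("journeys"))
  .limit(10)

topRoutesWithAvg.show()
```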
Consider the following questions and think how you might transform the data to be able to answer them:
- Which day of the week is the busiest and which is the quietest?
- On which day of the week are the average journey times shortest/longest?
- Is the most popular route consistently the most popular regardless of the day of the week? And what about the 4th/5th most popular route?
- With regard to the route [startId=154, endId=112 "Waterloo Station" to "Stonecutter Street"] - on which day of the week do people complete the journey in the shortest/longest time on average?
After considering how you might tackle the above, obtain the answers using the suggested structure described below (the two resulting datasets, `weekdayMetrics` and `weekdayRoutes`, are needed for the follow-up questions):
- Include only journeys that have a duration < 12 hrs
- Add a column for the weekday derived from the `Start Date` - `journeysWithWeekday`
- Create a dataset `weekdayMetrics` from `journeysWithWeekday` by grouping by weekday
- Create a dataset `weekdayRoutes` from `journeysWithWeekday` by grouping by (weekday, start station, end station)
- Use windowing to add a rank by count to `weekdayRoutes` and filter to get the top ten most popular routes per weekday (a sketch of these steps follows this list). More on windowing.
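The following sketch shows one way the suggested structure could be expressed, reusing the `journeys` DataFrame read in the earlier sketch. The column names (`Duration` in seconds, `Start Date`, `StartStation Name`, `EndStation Name`) and the date format are assumptions about the CSV, not the reference answer.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Filter out implausibly long journeys and derive the weekday from the start date
val journeysWithWeekday = journeys
  .filter(col("Duration") < 12 * 60 * 60) // duration assumed to be in seconds
  .withColumn("weekday",
    date_format(to_timestamp(col("Start Date"), "dd/MM/yyyy HH:mm"), "EEEE"))

// weekdayMetrics: one row per weekday with journey counts and average duration
val weekdayMetrics = journeysWithWeekday
  .groupBy(col("weekday"))
  .agg(count(lit(1)).as("journeys"), avg(col("Duration")).as("avgDuration"))

// weekdayRoutes: journey counts per (weekday, start station, end station)
val weekdayRoutes = journeysWithWeekday
  .groupBy(col("weekday"), col("StartStation Name"), col("EndStation Name"))
  .agg(count(lit(1)).as("journeys"))

// Rank routes within each weekday by popularity and keep the top ten per weekday
val byWeekday = Window.partitionBy(col("weekday")).orderBy(desc("journeys"))
val topTenPerWeekday = weekdayRoutes
  .withColumn("rank", rank().over(byWeekday))
  .filter(col("rank") <= 10)
```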
The following exercises require you to focus on performance. In each case, first guess the execution times you expect to see. Then run the Spark job both locally and on the cluster, noting how long each takes to complete. If the timings are not what you expected, think about why.
- Execute your solution to the above.
- Join together the two resulting datasets from above (a join sketch follows this list).
- Try your solution to `mostPopularRoutesWithAverageJourneyTime` (from the earlier Route Analysis work) - both the Dataset and RDD versions.
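For the join step, a minimal sketch (reusing the hypothetical column names from the earlier sketches) might be:

```scala
// Join the per-weekday metrics onto the per-route counts, renaming to avoid a duplicate column
val joined = weekdayRoutes.join(
  weekdayMetrics.withColumnRenamed("journeys", "weekdayJourneys"),
  Seq("weekday"))

// Force execution with an action so the elapsed time reflects the actual work
val start = System.nanoTime()
joined.count()
println(s"Join completed in ${(System.nanoTime() - start) / 1e9} s")
```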