The boris-bikes dataset consists of route information from London’s public bicycle hire scheme (Boris Bikes, Santander Cycles).
- copy and unzip
At the command line type:
sbt package && spark-submit --class bikes.runner.<classname> target\scala-2.11\boris-bikes_2.11-0.1.jar
Together, these commands build the jar and then run the chosen main class from that jar through Spark.
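For orientation, a main class that `spark-submit --class` can launch might look like the sketch below. `ExampleRunner` and its placeholder job are hypothetical names, not classes from this repository; a real runner would be declared in the `bikes.runner` package.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical runner for illustration only; in this project it would be
// declared in package bikes.runner so that --class bikes.runner.ExampleRunner resolves.
object ExampleRunner {

  // Placeholder job: counts the rows of a generated range.
  def run(spark: SparkSession): Long =
    spark.range(0, 10).count()

  def main(args: Array[String]): Unit = {
    // No master is set here: spark-submit supplies it.
    val spark = SparkSession.builder()
      .appName("boris-bikes-example")
      .getOrCreate()
    try println(s"Row count: ${run(spark)}")
    finally spark.stop()
  }
}
```

Keeping the job in a method that takes a `SparkSession` makes it easy to call from a test with a local session.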
Make the RouteAnalysisTest suite pass! To run the test suite, use:
\> sbt
sbt:boris-bikes> test
- Use the sbt shell for quick feedback
- Use your IDE to explore the API
are your friends
- Running
will rerun failing tests when a relevant code change is detected
- Look for an example of reading a file into a Dataset
- The file is a CSV, so read it as one!
- Check the API options of a Dataset
- import org.apache.spark.sql.functions._ to gain access to useful functions for DataFrames
- look at the docs for
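As a rough illustration of the hints above, the sketch below reads a CSV file with a header row into a typed Dataset. The `Journey` case class, its field names, and `readJourneys` are assumptions made for the example; the real file's column names will differ and need mapping.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; field names are assumptions, not the dataset's real columns.
case class Journey(startId: Int, endId: Int, duration: Long)

object CsvReadSketch {

  // Reads a CSV whose header row matches the Journey field names.
  def readJourneys(spark: SparkSession, path: String): Dataset[Journey] = {
    import spark.implicits._
    spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // let Spark guess numeric column types
      .csv(path)
      .as[Journey]                   // convert the untyped rows to Journey
  }
}
```

Schema inference costs an extra pass over the file; supplying an explicit schema is an alternative worth comparing.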
Consider the following questions and think how you might transform the data to be able to answer them:
- Which day of the week is the busiest and which is the quietest?
- On which day of the week are the average journey times shortest/longest?
- Is the most popular route consistently the most popular regardless of the day of the week? And what about the 4th/5th most popular route?
- With regard to the route [startId=154, endId=112 "Waterloo Station" to "Stonecutter Street"] - on which day of the week do people complete the journey in the shortest/longest time on average?
After considering how you might tackle the above, obtain the answers using the suggested structure described below (the two resulting datasets, one of which is weekdayRoutes, are necessary for follow-up questions):
- Include only journeys that have a duration of less than 12 hours
- Add a column for the weekday derived from the Start Date
- Create the first dataset by grouping by weekday
- Create weekdayRoutes by grouping by (weekday, start station, end station)
- Use windowing to add a rank by count to weekdayRoutes and filter to get the top ten most popular routes per weekday. More on windowing.
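The steps above could be sketched roughly as follows. This is one possible shape, not the reference solution: it assumes a DataFrame whose columns have already been renamed to `startDate` (a timestamp), `startId`, `endId` and `duration` (in seconds), and every object, method and column name here is hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object WeekdayRoutesSketch {
  // 12 hours in seconds; assumes duration is recorded in seconds.
  val MaxDurationSeconds: Long = 12L * 60 * 60

  // Drop long journeys and add an abbreviated weekday name such as "Mon".
  def withWeekday(journeys: DataFrame): DataFrame =
    journeys
      .filter(col("duration") < MaxDurationSeconds)
      .withColumn("weekday", date_format(col("startDate"), "E"))

  // One row per weekday with journey count and mean duration.
  def byWeekday(df: DataFrame): DataFrame =
    df.groupBy("weekday")
      .agg(count(lit(1)).as("count"), avg("duration").as("avgDuration"))

  // One row per (weekday, start station, end station).
  def weekdayRoutes(df: DataFrame): DataFrame =
    df.groupBy("weekday", "startId", "endId")
      .agg(count(lit(1)).as("count"), avg("duration").as("avgDuration"))

  // Rank routes by count within each weekday and keep the top n ranks.
  def topRoutes(routes: DataFrame, n: Int = 10): DataFrame = {
    val perWeekday = Window.partitionBy("weekday").orderBy(col("count").desc)
    routes
      .withColumn("rank", rank().over(perWeekday))
      .filter(col("rank") <= n)
  }
}
```

Note that `rank()` gives tied routes the same rank, so a weekday can return more than n rows; `row_number()` would cap it at exactly n at the cost of breaking ties arbitrarily.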
The following exercises require you to focus on performance. In each case, first take a guess at the execution times you expect to see. Then run the Spark job both locally and on the cluster and note how long each takes to complete. If the timings are not what you expected, think about why.
- Execute your solution to the above.
- Join together the resulting 2 datasets from above.
- Try your solution from the earlier Route Analysis work, in both the Dataset and RDD versions.
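To record the timings for these exercises, a small helper along the following lines may be enough. `Timing.timed` is a hypothetical name for this sketch and measures wall-clock time only.

```scala
object Timing {

  // Runs the block once, prints the elapsed wall-clock time,
  // and returns the block's result together with the seconds taken.
  def timed[A](label: String)(block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label took $seconds%.3f s")
    (result, seconds)
  }
}
```

Because Spark transformations are lazy, the timed block should end in an action such as `count()` or `collect()`. For the join exercise, something like `Timing.timed("join")(weekdays.join(routes, "weekday").count())` would do, where `weekdays` and `routes` stand for your two resulting datasets. A single run is noisy, so repeating the measurement a few times gives a fairer comparison between local and cluster execution.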