Boris Bikes: About

The boris-bikes dataset consists of route information from London’s public bicycle hire scheme (Boris Bikes, Santander Cycles).

Route Analysis

Preliminary tasks

  • Copy 2012-journeys.csv.gz from S:\Training\DataEngineering\datasets\boris-bikes to boris-bikes\data and unzip it there

Running a Spark job

At the command line, type:

sbt package && spark-submit --class bikes.runner.<classname> target\scala-2.11\boris-bikes_2.11-0.1.jar

Together, these commands build the jar and then run the chosen main class from that jar through Spark. Replace <classname> with the name of the main class you want to run.
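
For orientation, a main class that spark-submit can launch might look like the sketch below. The object name ExampleRunner, the relative path data/2012-journeys.csv and the assumption that the csv has a header row are all hypothetical; only the bikes.runner package comes from the command above.

```scala
package bikes.runner

import org.apache.spark.sql.SparkSession

// Hypothetical runner: the object name and the default path are assumptions.
object ExampleRunner {
  def main(args: Array[String]): Unit = {
    // Optional first argument overrides the assumed location of the unzipped csv.
    val path = args.headOption.getOrElse("data/2012-journeys.csv")

    // No master is set here; spark-submit decides it (local[*] when none is given).
    val spark = SparkSession.builder().appName("boris-bikes-example").getOrCreate()
    try {
      val journeys = spark.read
        .option("header", "true") // assumes the first line holds the column names
        .csv(path)

      journeys.printSchema() // column names and types as Spark sees them
      journeys.show(10)      // first ten rows, truncated for display
    } finally {
      spark.stop()
    }
  }
}
```

With this in place, <classname> in the command above would be ExampleRunner.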

Completing the tasks

Make the RouteAnalysisTest suite pass! To run the test suite, use:

> sbt
...
sbt:boris-bikes> test

Hints and tips

  • Use the sbt shell for quick feedback
  • Use your IDE to explore the API
  • df.show and df.printSchema are your friends
  • Running ~testQuick will rerun failing tests when a relevant code change is detected

Exercises

  1. firstTenRows

    • Look for an example of reading a file into a Dataset
  2. tenMostPopularRoutes

    • The file is a CSV, so read it as one!
    • Check the Dataset API for methods that group, count and sort
  3. mostPopularRoutesWithAverageJourneyTime

    • Import org.apache.spark.sql.functions._ to gain access to useful functions for DataFrames.
    • Look at the docs for org.apache.spark.sql.functions.lit
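
As a hint rather than a solution, the grouping and aggregation these exercises revolve around can be sketched on a toy DataFrame. The column names start, end and duration are placeholders and not the headers of the real csv, and using lit(1) inside count is only one guess at how lit might be useful here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object RouteSketch {
  // Placeholder columns: "start", "end", "duration" (not the real csv headers).
  // Groups journeys by route, then counts them and averages their duration.
  def routesWithAverage(journeys: DataFrame): DataFrame =
    journeys
      .groupBy("start", "end")
      .agg(
        count(lit(1)).as("journeys"),      // one literal 1 counted per row
        avg("duration").as("avgDuration")
      )
      .orderBy(desc("journeys"))           // most popular routes first

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("route-sketch").getOrCreate()
    import spark.implicits._

    // Tiny in-memory stand-in for the journeys file.
    val toy = Seq(("A", "B", 600), ("A", "B", 900), ("B", "C", 300))
      .toDF("start", "end", "duration")

    routesWithAverage(toy).show()
    spark.stop()
  }
}
```

Calling limit(10) on the result would keep only the ten most popular routes.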

Day-of-week Analysis

Consider the following questions and think about how you might transform the data to be able to answer them:

  1. Which day of the week is the busiest and which is the quietest?
  2. On which day of the week are the average journey times shortest/longest?
  3. Is the most popular route consistently the most popular regardless of the day of the week? And what about the 4th/5th most popular route?
  4. For the route from "Waterloo Station" (startId=154) to "Stonecutter Street" (endId=112), on which day of the week do people complete the journey in the shortest/longest time on average?

After considering how you might tackle the above, obtain the answers using the suggested structure described below (the two resulting datasets weekdayMetrics and weekdayRoutes are necessary for follow-up questions):

  • Include only journeys with a duration of less than 12 hours
  • Add a weekday column derived from the Start Date, and call the result journeysWithWeekday
  • Create a dataset of weekdayMetrics from journeysWithWeekday by grouping by weekday
  • Create a dataset of weekdayRoutes from journeysWithWeekday by grouping by: (weekday, start station, end station)
  • Use windowing to add a rank by count to weekdayRoutes and filter to get the top ten most popular routes per weekday. More on windowing.
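
The steps above can be sketched roughly as follows. This is a minimal outline under several assumptions: the column names startDate, start, end and duration are placeholders, duration is taken to be in seconds, and the date pattern dd/MM/yyyy HH:mm is a guess at the format of the real file. It also relies on to_timestamp with a format argument, which needs Spark 2.2 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object WeekdaySketch {
  private val TwelveHoursInSeconds = 12 * 60 * 60

  // Placeholder columns: startDate (string), start, end, duration (assumed seconds).
  // Drops journeys of 12 hours or more and adds the weekday name, e.g. "Monday".
  def withWeekday(journeys: DataFrame, pattern: String = "dd/MM/yyyy HH:mm"): DataFrame =
    journeys
      .filter(col("duration") < TwelveHoursInSeconds)
      .withColumn("weekday", date_format(to_timestamp(col("startDate"), pattern), "EEEE"))

  // One row per weekday: journey count and average duration.
  def weekdayMetrics(journeysWithWeekday: DataFrame): DataFrame =
    journeysWithWeekday
      .groupBy("weekday")
      .agg(count(lit(1)).as("journeys"), avg("duration").as("avgDuration"))

  // One row per (weekday, start, end): journey count and average duration.
  def weekdayRoutes(journeysWithWeekday: DataFrame): DataFrame =
    journeysWithWeekday
      .groupBy("weekday", "start", "end")
      .agg(count(lit(1)).as("journeys"), avg("duration").as("avgDuration"))

  // Ranks routes by journey count within each weekday and keeps ranks 1..n.
  def topRoutesPerWeekday(routes: DataFrame, n: Int = 10): DataFrame = {
    val byWeekday = Window.partitionBy("weekday").orderBy(desc("journeys"))
    routes
      .withColumn("rank", rank().over(byWeekday))
      .filter(col("rank") <= n)
  }
}
```

Because rank gives tied routes the same rank, a weekday can return more than n rows; row_number in place of rank would cap each weekday at exactly n rows, at the cost of breaking ties arbitrarily.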

The following exercises require you to focus on performance. In each case, first guess the execution time you expect to see. Then run the Spark job both locally and on the cluster, and note how long each run takes to complete. If the timings are not what you expected, think about why.

  • Execute your solution to the above.
  • Join the two resulting datasets from above (weekdayMetrics and weekdayRoutes).
  • Try your solution to mostPopularRoutesWithAverageJourneyTime (from the earlier Route Analysis work) - both the Dataset and RDD versions.
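
One simple way to note how long each run takes is a small wall-clock helper like the sketch below. The Timing object and its timed method are hypothetical names introduced for illustration, not part of the project.

```scala
object Timing {
  // Runs `block` once and returns its result together with the elapsed
  // wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}
```

Spark transformations are lazy, so the timed block needs to end in an action for the measurement to mean anything. For example, `Timing.timed(weekdayRoutes.join(weekdayMetrics, "weekday").count())` would time the join from the second exercise, assuming both datasets have a weekday column.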