The chess dataset consists of about 1 million real games of chess taken from lichess.org. The data has already been parsed and reformatted into a flattened data structure: each move is on its own row.
- Copy and unzip `chess.zip` from `S:\Training\DataEngineering\datasets\chess` to `boris-bikes\data`.
- Note: the cluster has the same chess data saved to `/tmp/chess`.
- Use the functions from `performance.Benchmark`.
- Remember to clear the cache before executing a task when comparing performance (where caching is involved); otherwise caching from a previous run invalidates your performance comparison. One way to clear it is sketched below.
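For reference, a minimal sketch of clearing cached data between runs, using the standard Catalog API (the project's `performance.Benchmark` may already provide a helper for this):

```scala
// Drop everything Spark has cached so the next run starts cold.
spark.catalog.clearCache()

// Or unpersist one specific Dataset cached earlier:
// moves.unpersist(blocking = true)
```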
- In Zeppelin, copy over and execute `Evaluations` and `FlatGameData` from the `chess` package object.
We provide a function `Sequences.percentThatEndInMistake` that looks at move sequences during a game of chess. Based on the available data, for each sequence of moves seen it calculates the percentage of occurrences of that sequence that terminate in a mistake. This information could be used, for example, to power a chess AI. This function has to do a lot of work. Additionally, it executes for only one sequence length at a time; to calculate results for a number of different sequence lengths, it therefore needs to be re-executed once per length (illustrated below).
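A hypothetical call shape, purely to illustrate the one-length-per-call point; the actual signature of `Sequences.percentThatEndInMistake` and the data location may differ in the project:

```scala
// Assumed signature: percentThatEndInMistake(moves: DataFrame, seqLength: Int).
val moves = spark.read.parquet("data/chess")   // assumed location of the flattened move data

val pct5 = Sequences.percentThatEndInMistake(moves, 5)  // one sequence length per call...
val pct6 = Sequences.percentThatEndInMistake(moves, 6)  // ...so each length is a fresh execution
```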
- Create a runner and test that the function works for a chosen sequence length when submitted as a Spark job (a sketch follows).
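A minimal runner sketch; the object name, argument handling, and paths are assumptions to adapt to your project:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point for spark-submit; adjust names and paths to your repo.
object SequenceRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("percent-that-end-in-mistake")
      .getOrCreate()                          // master is supplied by spark-submit

    val moves  = spark.read.parquet(args(0))  // e.g. boris-bikes\data locally, /tmp/chess on the cluster
    val result = Sequences.percentThatEndInMistake(moves, args(1).toInt)
    result.show(truncate = false)

    spark.stop()
  }
}
```

Submitted in the usual way, e.g. `spark-submit --class SequenceRunner <your-jar> <data-path> 5`.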
- Run this function with benchmarking for sequences of length 5 to 10, both locally and on the cluster. Compare the time taken (see the loop sketch after the notes below).
- Note that on the cluster, the `spark` variable is already present.
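One way to drive the benchmark over all six lengths; the `time` helper below is a plain wall-clock stand-in in case `performance.Benchmark` exposes a different interface:

```scala
// Plain wall-clock timer; substitute the helper from performance.Benchmark if it differs.
def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

for (len <- 5 to 10) {
  spark.catalog.clearCache()   // cold start so the runs are comparable
  time(s"sequence length $len") {
    Sequences.percentThatEndInMistake(moves, len).collect()  // collect() forces full evaluation
  }
}
```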
- Improve the performance.
- A well-placed `.cache` can do the trick; remember to clear the cache for a valid comparison! See General hints. A sketch of one possible placement follows.
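A sketch of one plausible placement, assuming the expensive work re-reads the same input for every length (where the cache actually helps depends on what the function recomputes):

```scala
val moves = spark.read.parquet("/tmp/chess").cache()   // read once, reuse for every length

for (len <- 5 to 10) {
  time(s"cached, length $len") {
    Sequences.percentThatEndInMistake(moves, len).collect()
  }
}

moves.unpersist()   // release the cache (and clearCache() before re-running the baseline)
```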
The following should be completed in a Zeppelin notebook on the cluster.

Here we contrast the result of partitioning data. We look at two datasets that are almost exactly the same: (1) `/tmp/chessFlat` and (2) `/tmp/chess`.
- (1) and (2) are the same, with the exception that (2) has been partitioned by `move`: `chessDS.repartition($"move").write.parquet("/tmp/chess")`
- Load the data from (1). Call a function on the Dataset API to observe the number of partitions of the underlying RDD (sketched below). Compare this to that of (2).
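A minimal way to inspect this, via the Dataset-to-RDD bridge:

```scala
val flat        = spark.read.parquet("/tmp/chessFlat")
val partitioned = spark.read.parquet("/tmp/chess")

// getNumPartitions reports how many partitions make up the underlying RDD.
println(s"chessFlat: ${flat.rdd.getNumPartitions} partitions")
println(s"chess:     ${partitioned.rdd.getNumPartitions} partitions")
```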
- Run the following in a Zeppelin paragraph to observe the physical partitioning that has been done. Do the same for `/tmp/chess`:

  %sh
  hdfs dfs -ls /tmp/chessFlat
- Run the provided function `chess.MoveAnalysis.allMoveEvalPercents` on each version of the data and compare the run times (a timing sketch follows this list). Look at what the function is doing.
- Think about why there is a difference in performance between the two datasets.
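A sketch of the timing comparison, reusing the wall-clock `time` helper from earlier and assuming `allMoveEvalPercents` takes the loaded data and returns a Dataset (check the real signature):

```scala
spark.catalog.clearCache()
time("chessFlat (unpartitioned)") {
  chess.MoveAnalysis.allMoveEvalPercents(spark.read.parquet("/tmp/chessFlat")).collect()
}

spark.catalog.clearCache()
time("chess (partitioned by move)") {
  chess.MoveAnalysis.allMoveEvalPercents(spark.read.parquet("/tmp/chess")).collect()
}
```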
- Read:
  - Managing Spark Partitions with Coalesce and Repartition - explains how partitions are used in Spark and how to work with them efficiently.
  - Repartitioning for performance - a prime example. Note that you may find the example in this one only applies when run locally and not on the cluster: it demonstrates the effect of multiple cores but does not take distributed cores into account.
  - Partitioning