The chess dataset consists of about 1 million real games of chess taken from lichess.org. The data has already been parsed and reformatted into a flattened data structure: each move is on its own row.
- Copy and unzip `chess.zip` from `S:\Training\DataEngineering\datasets\chess` to `boris-bikes\data`.
- Note: the cluster has the same chess data saved to `/tmp/chess`.
- Use the functions from `performance.Benchmark`.
- Remember to clear the cache before executing a task when comparing performance (where caching is involved); otherwise caching from a previous run invalidates your performance comparison. One way to clear it is sketched below.
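For reference, a minimal sketch of clearing cached data between runs, using the standard Catalog API (the project's `performance.Benchmark` may already provide a helper for this):

```scala
// Drop everything Spark has cached so the next run starts cold.
spark.catalog.clearCache()

// Or unpersist one specific Dataset cached earlier:
// moves.unpersist(blocking = true)
```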
- In Zeppelin, copy over and execute `Evaluations` and `FlatGameData` from the `chess` package object.
We provide a function `Sequences.percentThatEndInMistake` that looks at move sequences during a game of chess. Based on the available data, for each sequence of moves seen it calculates the percentage of occurrences of that sequence that terminate in a mistake. This information could be used, for example, to power a chess AI. This function has to do a lot of work. Additionally, it executes for only one sequence length at a time; to calculate results for a number of different sequence lengths, it therefore needs to be re-executed once per length (illustrated below).
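A hypothetical call shape, purely to illustrate the one-length-per-call point; the actual signature of `Sequences.percentThatEndInMistake` and the data location may differ in the project:

```scala
// Assumed signature: percentThatEndInMistake(moves: DataFrame, seqLength: Int).
val moves = spark.read.parquet("data/chess")   // assumed location of the flattened move data

val pct5 = Sequences.percentThatEndInMistake(moves, 5)  // one sequence length per call...
val pct6 = Sequences.percentThatEndInMistake(moves, 6)  // ...so each length is a fresh execution
```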
- Create a runner and test that the function works for a chosen sequence length when submitted as a Spark job (a sketch follows).
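A minimal runner sketch; the object name, argument handling, and paths are assumptions to adapt to your project:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point for spark-submit; adjust names and paths to your repo.
object SequenceRunner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("percent-that-end-in-mistake")
      .getOrCreate()                          // master is supplied by spark-submit

    val moves  = spark.read.parquet(args(0))  // e.g. boris-bikes\data locally, /tmp/chess on the cluster
    val result = Sequences.percentThatEndInMistake(moves, args(1).toInt)
    result.show(truncate = false)

    spark.stop()
  }
}
```

Submitted in the usual way, e.g. `spark-submit --class SequenceRunner <your-jar> <data-path> 5`.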
- Run this function with benchmarking for sequences of length 5 to 10, both locally and on the cluster. Compare the time taken (see the loop sketch after the notes below).
- Note that on the cluster, the `spark` variable is already present.
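One way to drive the benchmark over all six lengths; the `time` helper below is a plain wall-clock stand-in in case `performance.Benchmark` exposes a different interface:

```scala
// Plain wall-clock timer; substitute the helper from performance.Benchmark if it differs.
def time[A](label: String)(body: => A): A = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

for (len <- 5 to 10) {
  spark.catalog.clearCache()   // cold start so the runs are comparable
  time(s"sequence length $len") {
    Sequences.percentThatEndInMistake(moves, len).collect()  // collect() forces full evaluation
  }
}
```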
- Improve the performance.
- A well-placed `.cache` can do the trick; remember to clear the cache for a valid comparison! See General hints. A sketch of one possible placement follows.
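A sketch of one plausible placement, assuming the expensive work re-reads the same input for every length (where the cache actually helps depends on what the function recomputes):

```scala
val moves = spark.read.parquet("/tmp/chess").cache()   // read once, reuse for every length

for (len <- 5 to 10) {
  time(s"cached, length $len") {
    Sequences.percentThatEndInMistake(moves, len).collect()
  }
}

moves.unpersist()   // release the cache (and clearCache() before re-running the baseline)
```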
The following should be completed in a Zeppelin notebook on the cluster.

Here we contrast the result of partitioning data. We look at two datasets that are almost exactly the same: (1) `/tmp/chessFlat` and (2) `/tmp/chess`.
- (1) and (2) are the same, with the exception that (2) has been partitioned by `move`: `chessDS.repartition($"move").write.parquet("/tmp/chess")`
- Load the data from (1). Call a function on the Dataset API to observe the number of partitions of the underlying RDD (sketched below). Compare this to that of (2).
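A minimal way to inspect this, via the Dataset-to-RDD bridge:

```scala
val flat        = spark.read.parquet("/tmp/chessFlat")
val partitioned = spark.read.parquet("/tmp/chess")

// getNumPartitions reports how many partitions make up the underlying RDD.
println(s"chessFlat: ${flat.rdd.getNumPartitions} partitions")
println(s"chess:     ${partitioned.rdd.getNumPartitions} partitions")
```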
- Run the following in a Zeppelin paragraph to observe the physical partitioning that has been done. Do the same for `/tmp/chess`:

  %sh
  hdfs dfs -ls /tmp/chessFlat
- Run the provided function `chess.MoveAnalysis.allMoveEvalPercents` on each version of the data and compare the run times (a timing sketch follows this list). Look at what the function is doing.
- Think about why there is a difference in performance between the two datasets.
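A sketch of the timing comparison, reusing the wall-clock `time` helper from earlier and assuming `allMoveEvalPercents` takes the loaded data and returns a Dataset (check the real signature):

```scala
spark.catalog.clearCache()
time("chessFlat (unpartitioned)") {
  chess.MoveAnalysis.allMoveEvalPercents(spark.read.parquet("/tmp/chessFlat")).collect()
}

spark.catalog.clearCache()
time("chess (partitioned by move)") {
  chess.MoveAnalysis.allMoveEvalPercents(spark.read.parquet("/tmp/chess")).collect()
}
```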
- Read:
  - Managing Spark Partitions with Coalesce and Repartition - explains how partitions are used in Spark and how to work with them efficiently.
  - Repartitioning for performance - a prime example. Note that you may find the example in this one only applies when run locally and not on the cluster: it demonstrates the effect of multiple cores but does not take distributed cores into account.
  - Partitioning