Notebooks are a good way to iteratively explore and visualize data.
Code in pyspark-cheatsheet can be run in any Spark notebook with little or no modification.
This how-to shows how to run the code in a Jupyter notebook running in a local Docker container. The same approach works with any Spark notebook.
Install Docker using the official installation instructions, then pull and run the Jupyter PySpark image:
```
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -p 4040:4040 -p 4041:4041 jupyter/pyspark-notebook
```
The second command prints a link that you need to open in a web browser to reach the Jupyter notebook. For example, you may see output like:
```
[C 20:48:13.082 NotebookApp]
To access the notebook, open this file in a browser:
    file:///home/jovyan/.local/share/jupyter/runtime/nbserver-7-open.html
Or copy and paste one of these URLs:
    http://21ea37496cd2:8888/?token=5713d34d0c7b3580d925e9dec04636319a982119e085aa40
 or http://127.0.0.1:8888/?token=5713d34d0c7b3580d925e9dec04636319a982119e085aa40
```
In this case, open the bottom link (the 127.0.0.1 URL) in a web browser.
In the notebook's file browser, click New -> Terminal.
The first step is to load the code and sample data into the notebook environment. In the terminal window, run:
```
git clone https://github.com/cartershanklin/pyspark-cheatsheet
```
Some samples use the Delta Lake library to manage data. In the same terminal window, run:
```
pip3 install --user delta-spark
```
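For reference, Delta needs to be wired into the Spark session before Delta tables can be read or written. The cheatsheet notebook handles this for you; the snippet below is only a minimal sketch of such a setup, and the app name and output path in it are illustrative assumptions, not the notebook's actual code.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Register the Delta SQL extension and catalog on the session builder.
builder = (
    SparkSession.builder.appName("delta-example")  # app name is arbitrary
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

# configure_spark_with_delta_pip adds the Delta jars that match the
# pip-installed delta-spark version.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Quick check: write and read back a tiny Delta table (path is just an example).
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-example")
spark.read.format("delta").load("/tmp/delta-example").show()
```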
Close the terminal tab and return to the Files tab. There's a new folder called `pyspark-cheatsheet`. Click into this folder.
Open the notebook by clicking on the file called `cheatsheet.ipynb`.
When the notebook loads, run the first code cell to start Spark and load the sample data. You may see some warnings while this runs.
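If you're curious what that first step amounts to, here is a minimal sketch of starting a Spark session and reading one of the bundled CSV files. It is illustrative only; the file path and read options shown are assumptions, not the notebook's exact cell.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("cheatsheet").getOrCreate()

# Load a sample CSV file from the cloned repository.
# The path and options here are assumptions for illustration.
auto_df = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("data/auto-mpg.csv")
)
auto_df.show(5)
```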
Once Spark has started and the data is loaded, you can run the other cells in the notebook using the Run button.