Set up Apache Spark on Amazon AWS
Open a connection to the server using the hadoop user:
ssh -i ~/your_amazon_key.pem hadoop@[public DNS or IP of your Amazon machine]
Once connected to the Amazon machine, type the following lines in the terminal:
# note: you may need the raw file URL (raw.githubusercontent.com) for wget to fetch the script itself
wget https://github.com/gterziysky/spark/blob/master/install_jupyter
chmod +x install_jupyter
./install_jupyter
The installation might take a while. After it is done, fire up a browser and connect to http://[IP address of your Amazon machine]:8192/. Now you are all set to run PySpark.
A possibly useful SO answer w.r.t. Spark environment variables: How to import pyspark in anaconda.
conda create --name spark --channel conda-forge -y python=3.8 py4j numpy pandas pyarrow openjdk=8 notebook
The current version of pyspark available on conda-forge for Linux x64 is 2.4.0.
So, to install the latest version (3.2.1 as of this writing), use:
pip install pyspark
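To verify which version actually ended up in the environment, you can check from Python (purely a sanity check):
import pyspark
print(pyspark.__version__)  # should print e.g. 3.2.1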
or download it from spark.apache.org and set the env variables in your ~/.bashrc:
export SPARK_HOME="/home/successful/spark-3.2.1-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$PATH"
# on Windows it may be a good idea to set also the following:
# JAVA_HOME, HADOOP_HOME and PYSPARK_PYTHON
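If a Python process does not pick up SPARK_HOME on its own (a common issue with Jupyter kernels), the optional findspark package can locate the installation and make pyspark importable. A minimal sketch, assuming findspark has been pip-installed:
import findspark

# reads SPARK_HOME (or an explicitly passed path) and adds pyspark to sys.path
findspark.init()

import pyspark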
To remove the conda env, simply run:
conda env remove -n spark
To run a Jupyter notebook on the server running Spark, do:
jupyter notebook --no-browser --port 1234
From your local machine, open a secure tunnel to the notebook server:
# if you do not wish to start the ssh in the background, simply remove the -f option
ssh -i ~/.ssh/id_rsa -fNL 1234:localhost:1234 user@notebook_server
Then, go to your browser of choice and open localhost:1234 to reach the Jupyter notebook running on the server (enter the session token you got when starting the jupyter notebook process on the server).
Finally, from within the Jupyter notebook, initiate a Spark context:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Obtain a SparkContext

# Method 1
# Create a SparkSession in Python
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyApp") \
    .getOrCreate()
# Obtain the SparkContext instance to communicate with Spark's lower-level APIs such as RDDs
sc = spark.sparkContext

# Method 2
# Alternatively, obtain a SparkContext instance directly, without explicitly creating a SparkSession first:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

# Method 3 (effectively the same as Method 2)
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
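As a quick sanity check that the context works, run a small local job (a minimal sketch; the numbers are arbitrary):
# sum the numbers 0..99 with the RDD API
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950

# the same with the DataFrame API
spark.range(100).selectExpr("sum(id)").show()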
For more information, see Installing Apache Spark and Python.
For more information on creating a SparkSession and SparkContext, see section "The Life Cycle of a Spark Application" in the Spark: The Definitive Guide: Big Data Processing Made Simple book.
Navigate to the Spark config folder:
# Note that the path to where you've installed spark may differ
cd ~/spark-3.2.1-bin-hadoop3.2/conf/
Make a copy of the log4j.properties.template file:
cp log4j.properties.template log4j.properties
Open the newly created log4j.properties file and set the logging level for the following properties from INFO or WARN to ERROR:
log4j.rootCategory
log4j.logger.org.apache.spark.repl.Main
log4j.logger.org.sparkproject.jetty
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
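Alternatively, if you only want to quiet the logs for the current session rather than edit log4j.properties, you can lower the log level at runtime from PySpark:
# applies to the current SparkContext only; no file changes needed
spark.sparkContext.setLogLevel("ERROR")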
The following code is from the "Running Production Applications" section of "Chapter 3. A tour of Spark's toolset" from the Spark: The Definitive Guide: Big Data Processing Made Simple book:
# submit Python application to cluster
./bin/spark-submit \
--master local \
./examples/src/main/python/pi.py 10
# the --master option specifies the master URL of the cluster
# local means run Spark locally on the machine you submit from
# use $SPARK_HOME instead of a hardcoded path
spark-submit --master local $SPARK_HOME/examples/src/main/python/pi.py 10
# "local[*]" means we want to use all cores
spark-submit --master "local[*]" $SPARK_HOME/examples/src/main/python/pi.py 10
Use Koalas, which is a distributed alternative to the pandas library: it provides the pandas API on Apache Spark.
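A minimal sketch of the Koalas API (the column names and values are arbitrary; since Spark 3.2 the same functionality ships with PySpark itself as pyspark.pandas, so only the import changes):
import pandas as pd
import databricks.koalas as ks  # with Spark >= 3.2: import pyspark.pandas as ps

# build a Koalas DataFrame the same way you would a pandas one
kdf = ks.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

# familiar pandas-style operations, executed by Spark under the hood
print(kdf.describe())
print(kdf.groupby("x").sum())

# convert to/from pandas when the data is small enough to fit in memory
pdf = kdf.to_pandas()
kdf2 = ks.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))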
Here is an article on how to do a distributed GridSearchCV, written by the team behind Spark: Auto-scaling scikit-learn with Apache Spark.
A great addition to sklearn that distributes the workload across a cluster of worker nodes is the spark-sklearn package.
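A minimal sketch of how spark-sklearn's GridSearchCV is typically used; the dataset and parameter grid below are just placeholders, and the package targets older Spark and scikit-learn releases, so check its README for the supported versions:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for scikit-learn's GridSearchCV

digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {
    "max_depth": [3, None],
    "n_estimators": [10, 50, 100],
}

# the Spark-backed GridSearchCV distributes the candidate fits across the cluster;
# sc is the SparkContext created earlier
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(X, y)
print(gs.best_params_)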