Set up Apache Spark on Amazon AWS
Open a connection to the server using the hadoop user:
ssh -i ~/your_amazon_key.pem hadoop@[public DNS or IP of your Amazon machine]
Once connected to the Amazon machine, type the following lines in the terminal:
# note: you may need the raw file URL (raw.githubusercontent.com) for wget to fetch the script itself
wget https://github.com/gterziysky/spark/blob/master/install_jupyter
chmod +x install_jupyter
./install_jupyter
The installation might take a while. After it is done, fire up a browser and connect to http://[IP address of your Amazon machine]:8192/. Now you are all set to run PySpark.
A possibly useful SO answer w.r.t. Spark environment variables: How to import pyspark in anaconda.
conda create --name spark --channel conda-forge -y python=3.8 py4j numpy pandas pyarrow openjdk=8 notebook
The current version of pyspark available on conda-forge for Linux x64 is 2.4.0.
So, to install the latest version (3.2.1 as of this writing), use:
pip install pyspark
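To verify which version actually ended up in the environment, you can check from Python (purely a sanity check):
import pyspark
print(pyspark.__version__)  # should print e.g. 3.2.1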
or download it from spark.apache.org and set the env variables in your ~/.bashrc:
export SPARK_HOME="/home/successful/spark-3.2.1-bin-hadoop3.2"
export PATH="$SPARK_HOME/bin:$PATH"
# on Windows it may be a good idea to set also the following:
# JAVA_HOME, HADOOP_HOME and PYSPARK_PYTHON
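If a Python process does not pick up SPARK_HOME on its own (a common issue with Jupyter kernels), the optional findspark package can locate the installation and make pyspark importable. A minimal sketch, assuming findspark has been pip-installed:
import findspark

# reads SPARK_HOME (or an explicitly passed path) and adds pyspark to sys.path
findspark.init()

import pyspark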
To remove the conda env, simply run:
conda env remove -n spark
To run a Jupyter notebook on the server running Spark, do:
jupyter notebook --no-browser --port 1234
From your local machine, open a secure tunnel to the notebook server:
# if you do not wish to start the ssh in the background, simply remove the -f option
ssh -i ~/.ssh/id_rsa -fNL 1234:localhost:1234 user@notebook_server
Then, go to your browser of choice and open localhost:1234 to reach the Jupyter notebook running on the server (enter the session token you got when starting the jupyter notebook process on the server).
Finally, from within the Jupyter notebook, initiate a Spark context:
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Obtain a SparkContext

# Method 1
# Create a SparkSession in Python
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("MyApp") \
    .getOrCreate()
# Obtain the SparkContext instance to communicate with Spark's lower-level APIs such as RDDs
sc = spark.sparkContext

# Method 2
# Alternatively, obtain a SparkContext instance directly, without explicitly creating a SparkSession first:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

# Method 3 (effectively the same as Method 2)
conf = SparkConf().setMaster("local").setAppName("MyApp")
sc = SparkContext(conf=conf)
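As a quick sanity check that the context works, run a small local job (a minimal sketch; the numbers are arbitrary):
# sum the numbers 0..99 with the RDD API
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950

# the same with the DataFrame API
spark.range(100).selectExpr("sum(id)").show()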
For more information, see Installing Apache Spark and Python.
For more information on creating a SparkSession and SparkContext, see section "The Life Cycle of a Spark Application" in the Spark: The Definitive Guide: Big Data Processing Made Simple book.
Navigate to the Spark config folder:
# Note that the path to where you've installed spark may differ
cd ~/spark-3.2.1-bin-hadoop3.2/conf/
Make a copy of the log4j.properties.template file:
cp log4j.properties.template log4j.properties
Open the newly created log4j.properties file and set the logging level for the following properties from INFO or WARN to ERROR:
log4j.rootCategory
log4j.logger.org.apache.spark.repl.Main
log4j.logger.org.sparkproject.jetty
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
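Alternatively, if you only want to quiet the logs for the current session rather than edit log4j.properties, you can lower the log level at runtime from PySpark:
# applies to the current SparkContext only; no file changes needed
spark.sparkContext.setLogLevel("ERROR")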
The following code is from the "Running Production Applications" section of "Chapter 3. A tour of Spark's toolset" from the Spark: The Definitive Guide: Big Data Processing Made Simple book:
# submit Python application to cluster
./bin/spark-submit \
--master local \
./examples/src/main/python/pi.py 10
# the --master option specifies the master URL of the cluster
# local means run Spark locally on the machine you submit from
# use $SPARK_HOME instead of a hardcoded path
spark-submit --master local $SPARK_HOME/examples/src/main/python/pi.py 10
# "local[*]" means we want to use all cores
spark-submit --master "local[*]" $SPARK_HOME/examples/src/main/python/pi.py 10
Use Koalas, which is a distributed alternative to the pandas library: it provides the pandas API on Apache Spark.
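A minimal sketch of the Koalas API (the column names and values are arbitrary; since Spark 3.2 the same functionality ships with PySpark itself as pyspark.pandas, so only the import changes):
import pandas as pd
import databricks.koalas as ks  # with Spark >= 3.2: import pyspark.pandas as ps

# build a Koalas DataFrame the same way you would a pandas one
kdf = ks.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

# familiar pandas-style operations, executed by Spark under the hood
print(kdf.describe())
print(kdf.groupby("x").sum())

# convert to/from pandas when the data is small enough to fit in memory
pdf = kdf.to_pandas()
kdf2 = ks.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))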
Here is an article on how to do a distributed GridSearchCV, written by the team behind Spark: Auto-scaling scikit-learn with Apache Spark.
A great addition to sklearn that distributes the workload across a cluster of worker nodes is the spark-sklearn package.
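A minimal sketch of how spark-sklearn's GridSearchCV is typically used; the dataset and parameter grid below are just placeholders, and the package targets older Spark and scikit-learn releases, so check its README for the supported versions:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for scikit-learn's GridSearchCV

digits = datasets.load_digits()
X, y = digits.data, digits.target

param_grid = {
    "max_depth": [3, None],
    "n_estimators": [10, 50, 100],
}

# the Spark-backed GridSearchCV distributes the candidate fits across the cluster;
# sc is the SparkContext created earlier
gs = GridSearchCV(sc, RandomForestClassifier(), param_grid)
gs.fit(X, y)
print(gs.best_params_)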