Data drives the world.
In this big data era, the need to analyse large volumes of data has become ever more challenging and complex. Several different ecosystems have been developed, each trying to solve a particular problem. One of the main tools in the Big Data ecosystem is Apache Spark.
Apache Spark makes the analysis of big data significantly easier. Spark ships with implementations of many useful algorithms for data mining, data analysis, machine learning, and graph processing. It takes on the challenge of implementing sophisticated algorithms with tricky optimisations, along with the ability to run your code on a distributed cluster. Spark handles problems like fault tolerance for you and provides a simple API for parallel computation.
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
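To make the property-graph abstraction concrete, here is a minimal sketch in plain Python (no Spark required; all names are ours, not GraphX's) of a directed multigraph with properties attached to each vertex and edge, mirroring the vertex/edge split that GraphX's `Graph[VD, ED]` uses:

```python
# A toy property graph: vertices carry an attribute, edges carry
# (src, dst, attribute) -- the same shape as GraphX's vertex and edge views.
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    (1, 2, {"relation": "follows"}),
    (2, 3, {"relation": "follows"}),
    (1, 2, {"relation": "likes"}),  # parallel edge: it is a *multi*graph
]

def out_degree(vid):
    """Number of outgoing edges for a vertex (parallel edges both count)."""
    return sum(1 for src, _, _ in edges if src == vid)

print(out_degree(1))  # 2
```

In GraphX the same structure lives in two distributed RDDs (one for vertices, one for edges) rather than local collections, which is what makes graph-parallel computation possible.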
This repository serves as a starting point for working with the Spark GraphX API. As part of our SDM lab, we will focus on getting a basic idea of how to work with Pregel and gaining hands-on experience with the distributed processing of large graphs.
Pregel, originally developed by Google, is essentially a message-passing model that facilitates the processing of large-scale graphs. Apache Spark's GraphX module provides a Pregel API which allows us to write distributed graph programs and algorithms. For more details, check out the original paper.
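To get a feel for the model before touching GraphX, here is a minimal single-machine sketch in plain Python (the function name and structure are ours, not GraphX's): in each superstep, every vertex that received messages merges them, updates its own value, and sends new messages along its out-edges; the computation halts when no messages remain. The example computes single-source shortest paths with unit edge weights.

```python
import math

def pregel_sssp(edges, num_vertices, source):
    """Pregel-style single-source shortest paths (unit edge weights).

    edges: list of (src, dst) pairs; vertices are numbered 0..num_vertices-1.
    """
    dist = [math.inf] * num_vertices
    # Initial superstep: only the source receives a message (distance 0).
    messages = {source: 0}
    while messages:
        next_messages = {}
        for v, d in messages.items():
            if d < dist[v]:                  # vertex program: keep the minimum
                dist[v] = d
                for src, dst in edges:       # send messages along out-edges
                    if src == v:
                        nd = d + 1
                        # merge function: keep the smaller message per target
                        if nd < next_messages.get(dst, math.inf):
                            next_messages[dst] = nd
        messages = next_messages             # next superstep
    return dist

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(pregel_sssp(edges, 4, 0))  # [0, 1, 1, 2]
```

GraphX's `Pregel` operator follows the same three-part recipe (vertex program, send-message function, merge function), but runs each superstep in parallel across the cluster instead of in a local loop.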
Before starting, you may need to set up your machine first. Please follow the guides below to set up Spark and Maven on your machine.
We have created a setup script which will set up Homebrew, apache-spark, maven and a conda environment. If you are on a Mac, you can run the following commands:
git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh
If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark, and you can install Maven via this guide.
We also recommend installing conda on your machine. You can set up conda from here.
Once you have conda, create a new environment via:
conda create -n spark_env python=3.8
Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).
Activate your environment:
conda activate spark_env
Now, you need to install pyspark:
pip install pyspark
If you are using bash:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc
And if you are using zsh:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc
Since this is a typical Maven project, you can run it however you'd normally run a Maven project. To make things easier, we provide two ways to run this project.
If you are using VS Code, change the args in the Launch Main configuration in the launch.json file located in the .vscode directory.
See the main class for the supported arguments.
Just run the following with the supported arguments:
sh scripts/build_n_run.sh exercise1
Note: exercise1 here is the argument you'd need to pass in order to run the first exercise.
Again, you can check the main class for the supported arguments.