Data drives the world.
In this big data era, the need to analyse large volumes of data has become ever more challenging and complex. Several different ecosystems have been developed, each trying to solve a particular problem. One of the main tools in the Big Data ecosystem is Apache Spark.
Apache Spark makes the analysis of big data significantly easier. Spark ships with implementations of many useful algorithms for data mining, data analysis, machine learning, and graph processing. It takes on the challenge of implementing sophisticated algorithms with tricky optimisations, along with the ability to run your code on a distributed cluster. Spark handles problems like fault tolerance for you and provides a simple API for parallel computation.
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
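To make the property-graph abstraction concrete, here is a minimal sketch in plain Python (no Spark required; all names are ours, not GraphX's) of a directed multigraph with properties attached to each vertex and edge, mirroring the vertex/edge split that GraphX's `Graph[VD, ED]` uses:

```python
# A toy property graph: vertices carry an attribute, edges carry
# (src, dst, attribute) -- the same shape as GraphX's vertex and edge views.
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    (1, 2, {"relation": "follows"}),
    (2, 3, {"relation": "follows"}),
    (1, 2, {"relation": "likes"}),  # parallel edge: it is a *multi*graph
]

def out_degree(vid):
    """Number of outgoing edges for a vertex (parallel edges both count)."""
    return sum(1 for src, _, _ in edges if src == vid)

print(out_degree(1))  # 2
```

In GraphX the same structure lives in two distributed RDDs (one for vertices, one for edges) rather than local collections, which is what makes graph-parallel computation possible.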
This repository serves as a starting point for working with the Spark GraphX API. As part of our SDM lab, we will focus on getting a basic idea of how to work with Pregel and gaining hands-on experience with the distributed processing of large graphs.
Pregel, originally developed by Google, is essentially a message-passing model that facilitates the processing of large-scale graphs. Apache Spark's GraphX module provides a Pregel API which allows us to write distributed graph programs and algorithms. For more details, check out the original paper.
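To get a feel for the model before touching GraphX, here is a minimal single-machine sketch in plain Python (the function name and structure are ours, not GraphX's): in each superstep, every vertex that received messages merges them, updates its own value, and sends new messages along its out-edges; the computation halts when no messages remain. The example computes single-source shortest paths with unit edge weights.

```python
import math

def pregel_sssp(edges, num_vertices, source):
    """Pregel-style single-source shortest paths (unit edge weights).

    edges: list of (src, dst) pairs; vertices are numbered 0..num_vertices-1.
    """
    dist = [math.inf] * num_vertices
    # Initial superstep: only the source receives a message (distance 0).
    messages = {source: 0}
    while messages:
        next_messages = {}
        for v, d in messages.items():
            if d < dist[v]:                  # vertex program: keep the minimum
                dist[v] = d
                for src, dst in edges:       # send messages along out-edges
                    if src == v:
                        nd = d + 1
                        # merge function: keep the smaller message per target
                        if nd < next_messages.get(dst, math.inf):
                            next_messages[dst] = nd
        messages = next_messages             # next superstep
    return dist

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(pregel_sssp(edges, 4, 0))  # [0, 1, 1, 2]
```

GraphX's `Pregel` operator follows the same three-part recipe (vertex program, send-message function, merge function), but runs each superstep in parallel across the cluster instead of in a local loop.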
Before starting, you may need to set up your machine first. Please follow the guides below to set up Spark and Maven on your machine.
We have created a setup script which will set up Homebrew, apache-spark, maven and a conda environment. If you are on a Mac, you can run the following commands:
git clone https://github.com/mohammadzainabbas/SDM-Lab-2.git
cd SDM-Lab-2 && sh scripts/setup.sh
If you are on Linux, you need to install Apache Spark yourself. You can follow this helpful guide to install Apache Spark, and you can install Maven via this guide.
We also recommend installing conda on your machine. You can set up conda from here.
Once you have conda, create a new environment via:
conda create -n spark_env python=3.8
Note: We are using Python 3.8 because Spark doesn't support Python 3.9 and above (at the time of writing).
Activate your environment:
conda activate spark_env
Now, you need to install pyspark:
pip install pyspark
If you are using bash:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.bashrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.bashrc
. ~/.bashrc
And if you are using zsh:
echo "export PYSPARK_DRIVER_PYTHON=$(which python)" >> ~/.zshrc
echo "export PYSPARK_DRIVER_PYTHON_OPTS=''" >> ~/.zshrc
. ~/.zshrc
Since this is a typical Maven project, you can run it however you'd normally run a Maven project. To make things easier, we provide two ways to run this project.
If you are using VS Code, change the args in the Launch Main configuration in the launch.json file located in the .vscode directory.
See the main class for the supported arguments.
Just run the following with the supported arguments:
sh scripts/build_n_run.sh exercise1
Note: exercise1 here is the argument you'd need to pass in order to run the first exercise.
Again, you can check the main class for the supported arguments.