GitHub - valvarl/python-spark-hadoop

Dataset

Forest Cover Type Prediction Dataset: The dataset uses features such as elevation, aspect, slope, and distance measures to predict forest cover type. It contains 581,012 instances and 54 features.
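
To make the 54-feature layout concrete, here is a minimal sketch of splitting one row of covtype.data into features and label. It assumes the standard UCI Covertype layout (10 quantitative columns, 4 wilderness-area flags, 40 soil-type flags, then the cover type label, all integers). The helper name `parse_row` and the sample row are hypothetical and not taken from this repository.

```python
# Column layout of one covtype.data row (assumed: standard UCI Covertype layout).
N_QUANTITATIVE = 10   # elevation, aspect, slope, distances, hillshade indices
N_WILDERNESS = 4      # one-hot wilderness area flags
N_SOIL = 40           # one-hot soil type flags
N_FEATURES = N_QUANTITATIVE + N_WILDERNESS + N_SOIL  # 54 features in total


def parse_row(line):
    """Split one comma-separated row into (features, label)."""
    values = [int(v) for v in line.strip().split(",")]
    if len(values) != N_FEATURES + 1:
        raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(values)}")
    # The last column is the cover type label; everything before it is a feature.
    return values[:N_FEATURES], values[-1]


# Synthetic row for illustration only (not copied from the dataset).
example = ",".join(["2596", "51", "3"] + ["0"] * 51 + ["5"])
features, label = parse_row(example)
print(len(features), label)  # prints "54 5"
```

The length check catches truncated or malformed rows early, before they reach a model.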

Model

Decision Tree Classifier: The model is trained in advance with the Decision Tree algorithm. Training details are in train.py.

Spark Applications

Two Spark applications are created for data processing:

single.py
parallel.py

How to Run

python train.py
docker-compose build
docker-compose up -d
docker cp covtype.data namenode:/
docker exec -it namenode bash
hdfs dfs -put /covtype.data /    # run inside the namenode container
docker exec -it -u 0 spark-master bash
chmod -R 777 /output    # run inside the spark-master container as root

Start Spark Applications

sh start.sh

Resulting Graph

The resulting graph shows the distribution of execution time and RAM usage for the parallel and non-parallel Spark applications across multiple iterations. It makes the differences in performance and resource utilization between the two approaches easy to compare.
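
As a rough illustration of how per-iteration time and memory figures like these could be collected, the sketch below times a callable and records its peak Python-level memory using only the standard library. This is an assumption about the general approach, not the repository's actual measurement code. `measure` and `workload` are hypothetical names, and `tracemalloc` only sees allocations made by the Python process itself, so it would not capture memory used by Spark's JVM.

```python
import time
import tracemalloc


def measure(fn, iterations=3):
    """Run fn several times; return a list of (seconds, peak_bytes) tuples."""
    results = []
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append((elapsed, peak))
    return results


def workload():
    # Hypothetical stand-in for one run of a Spark application.
    return sum(x * x for x in range(100_000))


for i, (seconds, peak) in enumerate(measure(workload), start=1):
    print(f"iteration {i}: {seconds:.4f} s, peak {peak / 1024:.1f} KiB")
```

Collecting one (time, memory) pair per iteration gives the raw samples needed to plot a distribution for each application.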

