This project is a data lake implementation for music streaming data using PySpark. The data resides in S3 as two types: song data and log data, both in JSON format.
The etl.py script reads that data from S3 (Extract), processes it into five dimensional tables (Transform), and writes them back to S3 in Parquet format (Load).
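As a rough sketch of what one of these steps looks like in PySpark (the bucket names, paths, and column list below are illustrative assumptions, not the project's actual configuration):

```python
# A minimal sketch of the Extract/Transform/Load flow for one table.
# Assumption: bucket names and the songs-table columns are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Extract: read the raw song JSON files from S3 (input path is a placeholder).
song_df = spark.read.json("s3a://my-input-bucket/song_data/*/*/*/*.json")

# Transform: select the columns for one dimensional table, dropping duplicates.
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# Load: write the table back to S3 as Parquet, partitioned for efficient reads.
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://my-output-bucket/songs/")
```

Partitioning the Parquet output (for example by year and artist) keeps individual files small and lets downstream queries skip partitions they do not need.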
The job can be run on AWS EMR, where a cluster can easily be created to perform distributed computing with Spark. Once the cluster is up and running, you can follow these steps to run etl.py through the EMR console:

![Step 1](https://github.com/erdemah/NanoDegree_DataLake/blob/master/images/addstep0.png)
![Step 2](https://github.com/erdemah/NanoDegree_DataLake/blob/master/images/add_step3.png)
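If you prefer the command line, the same Spark step can be added with the AWS CLI; the cluster ID and the S3 location of the script below are placeholders for your own setup:

```bash
# Add etl.py as a Spark step to a running EMR cluster.
# Assumption: j-XXXXXXXXXXXXX and the s3:// script path are placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="Sparkify ETL",ActionOnFailure=CONTINUE,Args=[s3://my-bucket/etl.py]
```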
You can also run the Spark job by connecting to the cluster's master node and running the following command in a terminal: `spark-submit etl.py`
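For example (the key-pair file and master-node DNS below are placeholders for your own cluster):

```bash
# SSH into the EMR master node (key pair and hostname are placeholders).
ssh -i ~/my-key.pem hadoop@ec2-XX-XXX-XX-XX.compute-1.amazonaws.com

# From the master node, submit the job to the cluster.
spark-submit etl.py
```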
The original data resides in S3; a copy is also included in this repository, inside the log-data and song_data folders.