erdemah/NanoDegree_DataLake

NanoDegree_DataLake

Summary

This project implements a data lake for music streaming data using PySpark. The source data resides in S3 as two datasets, song data and log data, both in JSON format.

The etl.py script reads this data from S3 (Extract), processes it into five dimensional tables (Transform), and writes them back to S3 in Parquet format (Load).
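The Extract, Transform, and Load stages described above can be sketched roughly as follows. This is a minimal illustration and not the actual etl.py: the bucket names, the `table_path` and `run_etl` helpers, the `songs` table, and the column names are all assumptions made for the example, and the `recursiveFileLookup` option assumes Spark 3.0 or later.

```python
"""Sketch of an Extract-Transform-Load flow in PySpark (not the actual etl.py)."""

# Hypothetical bucket names; substitute the locations etl.py really uses.
INPUT_ROOT = "s3a://example-input-bucket"
OUTPUT_ROOT = "s3a://example-output-bucket"


def table_path(output_root: str, table: str) -> str:
    """Build the Parquet output location for one table."""
    return f"{output_root.rstrip('/')}/{table}"


def run_etl(input_root: str = INPUT_ROOT, output_root: str = OUTPUT_ROOT) -> None:
    """Read song JSON, derive one table, and write it back as Parquet."""
    # Imported lazily so table_path() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datalake-etl-sketch").getOrCreate()

    # Extract: read the raw JSON song files, including nested folders.
    song_df = (
        spark.read.option("recursiveFileLookup", "true")
        .json(f"{input_root}/song_data")
    )

    # Transform: keep one row per song. The column names are assumed here.
    songs = song_df.select(
        "song_id", "title", "artist_id", "year", "duration"
    ).dropDuplicates(["song_id"])

    # Load: write the table back to S3 in Parquet format.
    songs.write.mode("overwrite").parquet(table_path(output_root, "songs"))

    spark.stop()
```

Calling `run_etl()` on a cluster with S3 access would process the single example table; the remaining tables would follow the same read, select, and write pattern.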

How to run the script

The job can be run on AWS EMR, where a cluster can be created to perform distributed computing with Spark. Once the cluster is running, follow these steps to run etl.py through the AWS EMR service:

![Step 1](https://github.com/erdemah/NanoDegree_DataLake/blob/master/images/addstep0.png)

![Step 2](https://github.com/erdemah/NanoDegree_DataLake/blob/master/images/add_step3.png)
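Adding a step through the console can also be scripted. The sketch below uses the boto3 EMR client's `add_job_flow_steps` call with `command-runner.jar`; it assumes etl.py has been uploaded to S3 and that AWS credentials and a default region are configured. The helper names, the script URI, and the cluster id are placeholders, not values from this repository.

```python
def build_spark_step(script_uri: str, name: str = "Run etl.py") -> dict:
    """Describe an EMR step that runs spark-submit on the given script."""
    return {
        "Name": name,
        # Keep the cluster alive if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_step(cluster_id: str, script_uri: str) -> str:
    """Add the step to a running cluster and return the new step id."""
    # Imported lazily; needs boto3 installed and AWS credentials configured.
    import boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_spark_step(script_uri)],
    )
    return response["StepIds"][0]
```

For example, `submit_step("j-EXAMPLECLUSTER", "s3://example-bucket/etl.py")` would queue the job on a hypothetical cluster with that id.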

You can also run the Spark job by connecting to the EMR cluster's master node and running the following command in a terminal: `spark-submit etl.py`

Notes

The data resides in S3. A copy of the data is included in the repository, in the log-data and song_data folders.
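Because a copy of the data ships with the repository, the job could be tried locally before running it on EMR. The sketch below shows one way to switch between the two sources. The `input_paths` and `local_session` helpers and the bucket name are hypothetical, and it assumes the S3 folders carry the same names as the repository folders, which may not hold.

```python
import os


def input_paths(
    use_local: bool,
    repo_root: str = ".",
    bucket: str = "s3a://example-input-bucket",
) -> dict:
    """Return the song and log data locations for a local or S3 run."""
    base = os.path.abspath(repo_root) if use_local else bucket.rstrip("/")
    # Folder names follow the repository copy: song_data and log-data.
    return {"song": f"{base}/song_data", "log": f"{base}/log-data"}


def local_session():
    """Create a Spark session that runs on the local machine only."""
    # Imported lazily so input_paths() works without Spark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.master("local[*]")
        .appName("datalake-local-sketch")
        .getOrCreate()
    )
```

With `input_paths(use_local=True)` and `local_session()`, the same read and transform logic can be exercised against the repository copy without touching S3.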

About

NanoDegree Datalake Project
