- Use a major Big Data system to perform a Data Engineering related task
- Example systems could be: AWS Athena, AWS Spark/EMR, AWS SageMaker, Databricks, Snowflake
This project aims to build a cloud-based ETL Data pipeline to support data dashboard analysis.
- Data Ingestion — Build a mechanism to ingest data from different sources
- Data lake — We will be getting data from multiple sources, so we need a centralized repository to store them
- ETL System — The data arrives in raw format, so we need to transform it into the proper format
- Scalability — As the size of our data increases, we need to make sure our system scales with it
- Cloud — We can’t process vast amounts of data on our local computer so we need to use the cloud, in this case, we will use AWS
- Reporting — Build a dashboard to get answers to the questions we asked earlier
- Amazon S3: Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.
- QuickSight: Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
- AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- AWS Lambda: Lambda is a computing service that allows programmers to run code without creating or managing servers.
- AWS Athena: Athena is an interactive query service for S3 that makes it easy to analyze data directly in S3 using standard SQL.
The dataset contains statistics (CSV files) on daily popular YouTube videos over the course of many months.
Dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new
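The CSV layout can be previewed with pandas before any AWS work; the columns below are an assumed subset of what the Kaggle files actually contain (the real files carry more columns, e.g. tags, likes, dislikes):

```python
import io

import pandas as pd

# Minimal stand-in for one of the Kaggle CSVs (e.g. USvideos.csv);
# only a few of the real columns are shown here.
sample = io.StringIO(
    "video_id,trending_date,title,category_id,views\n"
    "2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,22,748374\n"
)
df = pd.read_csv(sample)
print(df[["video_id", "views"]])
```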
- Download dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new
- AWS S3, to create a bucket:
    - Bucket name: de-youtube-raw
    - Keep other default settings, such as Block all public access
    - Can choose whether to enable encryption
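A common layout for the raw bucket is to partition the CSVs by region, so the Glue crawler later exposes region as a partition column. The key scheme below is a hypothetical sketch, not prescribed by these notes:

```python
def raw_key(region: str, filename: str) -> str:
    # Hypothetical S3 key layout for the de-youtube-raw bucket:
    # Hive-style "region=" prefixes let Glue infer a partition column.
    return f"youtube/raw_statistics/region={region}/{filename}"

# e.g. upload target for the Canadian stats file:
print(raw_key("ca", "CAvideos.csv"))
```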
- AWS Glue, to create a crawler and catalog:
    - Crawlers -> Add crawler
    - Crawler name: de-youtube
    - Keep other default settings
    - Add data source: Include path: choose the data bucket in S3
    - Add new IAM role: AWSGlueServiceRole-de-p3
    - Set output and scheduling: Target database: create a new database in AWS Glue: de-youtube-p3
    - Run crawler
    - After creation, choose Database -> Tables -> View data -> Preview data, and it will direct us to AWS Athena
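The console steps above can also be scripted. The helper below only assembles the arguments that Glue's `create_crawler` API expects (a sketch: the actual call, `boto3.client("glue").create_crawler(**cfg)`, is left out so nothing hits AWS here):

```python
def crawler_config(name: str, s3_path: str, role: str, database: str) -> dict:
    # Mirrors the console settings used above: crawler name, IAM role,
    # target Glue database, and the S3 include path as the data source.
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

cfg = crawler_config(
    name="de-youtube",
    s3_path="s3://de-youtube-raw/",
    role="AWSGlueServiceRole-de-p3",
    database="de-youtube-p3",
)
# In a real run: boto3.client("glue").create_crawler(**cfg)
```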
- AWS Athena, to build data dashboard:
    - In Query editor, click View settings to set the output location.
    - Need a light ETL here: data cleansing to convert JSON to Apache Parquet. See AWS Lambda.
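The light ETL is needed because the Kaggle category files are nested JSON that Athena cannot query column-by-column as-is. One way to flatten them (a sketch using pandas, assuming the `*_category_id.json` files nest records under an `"items"` key, as the Kaggle category JSONs do):

```python
import json

import pandas as pd

def flatten_category_json(raw: bytes) -> pd.DataFrame:
    # Each category record sits under "items"; json_normalize flattens
    # nested fields like snippet.title into flat columns, after which
    # the frame could be written out as Parquet (e.g. df.to_parquet(...)
    # inside the Lambda handler).
    data = json.loads(raw)
    return pd.json_normalize(data["items"])

# Tiny synthetic document in the assumed shape of the Kaggle files:
doc = {"items": [{"id": "1", "snippet": {"title": "Film & Animation"}}]}
flat = flatten_category_json(json.dumps(doc).encode())
```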
- AWS Lambda
    - Create a function: Author from scratch; choose how to create an IAM role for this function.
    - Write the lambda_function.py
    - Configuration
        - Edit environment variables: four variables
        - Timeout: 3 minutes
    - Test: create and configure a test event, then Test
- Back to AWS Athena, now we can use SQL query to perform data tasks.
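To illustrate the kind of query Athena can now answer, the aggregation below runs equivalent SQL against an in-memory SQLite table (table and column names are assumptions for illustration; in Athena the table would be the one the Glue crawler catalogued):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_statistics (region TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO raw_statistics VALUES (?, ?)",
    [("us", 1000), ("us", 500), ("ca", 700)],
)

# Athena accepts the same ANSI-SQL aggregation over the Glue table.
rows = conn.execute(
    "SELECT region, SUM(views) AS total_views "
    "FROM raw_statistics GROUP BY region ORDER BY total_views DESC"
).fetchall()
print(rows)  # [('us', 1500), ('ca', 700)]
```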
- Use services like Amazon QuickSight to build the data dashboard.
- Perform advanced streaming data transformations with Apache Spark and Kafka in Azure HDInsight (Microsoft Learn): https://learn.microsoft.com/en-us/training/modules/perform-advanced-streaming-data-transformations-with-spark-kafka/
- Hadoop Tutorial
- A very brief introduction to MapReduce: https://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf
- Reference material: https://github.com/darshilparmar/dataengineering-youtube-analysis-project