- Use a major Big Data system to perform a Data Engineering related task
- Example systems could be: AWS Athena, AWS Spark/EMR, AWS SageMaker, Databricks, Snowflake
This project aims to build a cloud-based ETL Data pipeline to support data dashboard analysis.
- Data Ingestion — Build a mechanism to ingest data from different sources
- Data lake — We will be getting data from multiple sources, so we need a centralized repository to store them
- ETL System — The data arrives in raw format, so we need to transform it into the proper format
- Scalability — As the size of our data increases, we need to make sure our system scales with it
- Cloud — We can’t process vast amounts of data on our local computer so we need to use the cloud, in this case, we will use AWS
- Reporting — Build a dashboard to get answers to the questions we asked earlier
- Amazon S3: Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.
- QuickSight: Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud.
- AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.
- AWS Lambda: Lambda is a computing service that allows programmers to run code without creating or managing servers.
- AWS Athena: Athena is an interactive query service for S3 that makes it easy to analyze data directly in S3 using standard SQL.
The dataset contains statistics (CSV files) on daily popular YouTube videos over the course of many months.
Dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new
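The CSV layout can be previewed with pandas before any AWS work; the columns below are an assumed subset of what the Kaggle files actually contain (the real files carry more columns, e.g. tags, likes, dislikes):

```python
import io

import pandas as pd

# Minimal stand-in for one of the Kaggle CSVs (e.g. USvideos.csv);
# only a few of the real columns are shown here.
sample = io.StringIO(
    "video_id,trending_date,title,category_id,views\n"
    "2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,22,748374\n"
)
df = pd.read_csv(sample)
print(df[["video_id", "views"]])
```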
- Download dataset: https://www.kaggle.com/datasets/datasnaek/youtube-new
- AWS S3, to create a bucket:
    - Bucket name: de-youtube-raw
    - Keep other default settings, such as Block all public access
    - Can choose whether to enable encryption
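A common layout for the raw bucket is to partition the CSVs by region, so the Glue crawler later exposes region as a partition column. The key scheme below is a hypothetical sketch, not prescribed by these notes:

```python
def raw_key(region: str, filename: str) -> str:
    # Hypothetical S3 key layout for the de-youtube-raw bucket:
    # Hive-style "region=" prefixes let Glue infer a partition column.
    return f"youtube/raw_statistics/region={region}/{filename}"

# e.g. upload target for the Canadian stats file:
print(raw_key("ca", "CAvideos.csv"))
```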
- AWS Glue, to create a crawler and catalog:
    - Crawlers -> Add crawler
    - Crawler name: de-youtube
    - Keep other default settings
    - Add data source: Include path: choose the data bucket in S3
    - Add new IAM role: AWSGlueServiceRole-de-p3
    - Set output and scheduling: Target database: create a new database in AWS Glue: de-youtube-p3
    - Run crawler
    - After creation, choose Database -> Tables -> View data -> Preview data, and it will direct us to AWS Athena
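The console steps above can also be scripted. The helper below only assembles the arguments that Glue's `create_crawler` API expects (a sketch: the actual call, `boto3.client("glue").create_crawler(**cfg)`, is left out so nothing hits AWS here):

```python
def crawler_config(name: str, s3_path: str, role: str, database: str) -> dict:
    # Mirrors the console settings used above: crawler name, IAM role,
    # target Glue database, and the S3 include path as the data source.
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

cfg = crawler_config(
    name="de-youtube",
    s3_path="s3://de-youtube-raw/",
    role="AWSGlueServiceRole-de-p3",
    database="de-youtube-p3",
)
# In a real run: boto3.client("glue").create_crawler(**cfg)
```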
- AWS Athena, to build data dashboard:
    - In Query editor, click View settings to set the output location.
    - Need a light ETL here: data cleansing to convert JSON to Apache Parquet. See AWS Lambda.
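The light ETL is needed because the Kaggle category files are nested JSON that Athena cannot query column-by-column as-is. One way to flatten them (a sketch using pandas, assuming the `*_category_id.json` files nest records under an `"items"` key, as the Kaggle category JSONs do):

```python
import json

import pandas as pd

def flatten_category_json(raw: bytes) -> pd.DataFrame:
    # Each category record sits under "items"; json_normalize flattens
    # nested fields like snippet.title into flat columns, after which
    # the frame could be written out as Parquet (e.g. df.to_parquet(...)
    # inside the Lambda handler).
    data = json.loads(raw)
    return pd.json_normalize(data["items"])

# Tiny synthetic document in the assumed shape of the Kaggle files:
doc = {"items": [{"id": "1", "snippet": {"title": "Film & Animation"}}]}
flat = flatten_category_json(json.dumps(doc).encode())
```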
- AWS Lambda
    - Create a function: Author from scratch; choose how to create an IAM role for this function.
    - Write the lambda_function.py
    - Configuration
        - Edit environment variables: four variables
        - Timeout: 3 minutes
    - Test: create and configure a test event, then Test
- Back to AWS Athena, now we can use SQL query to perform data tasks.
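To illustrate the kind of query Athena can now answer, the aggregation below runs equivalent SQL against an in-memory SQLite table (table and column names are assumptions for illustration; in Athena the table would be the one the Glue crawler catalogued):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_statistics (region TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO raw_statistics VALUES (?, ?)",
    [("us", 1000), ("us", 500), ("ca", 700)],
)

# Athena accepts the same ANSI-SQL aggregation over the Glue table.
rows = conn.execute(
    "SELECT region, SUM(views) AS total_views "
    "FROM raw_statistics GROUP BY region ORDER BY total_views DESC"
).fetchall()
print(rows)  # [('us', 1500), ('ca', 700)]
```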
- Use services like Amazon QuickSight to build the data dashboard.
- Perform advanced streaming data transformations with Apache Spark and Kafka in Azure HDInsight (Microsoft Learn): https://learn.microsoft.com/en-us/training/modules/perform-advanced-streaming-data-transformations-with-spark-kafka/
- Hadoop Tutorial
- A very brief introduction to MapReduce: https://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf
- Reference material: https://github.com/darshilparmar/dataengineering-youtube-analysis-project