-ETL-project-

ETL Data Pipeline Implementation Using AWS Glue, S3, and Athena.

ETL Data Pipeline Implementation Using AWS Glue

Project Overview

This project demonstrates the implementation of an ETL pipeline for processing airline, airport, and flight data using AWS Glue, S3, and Athena. It automates data ingestion, transformation, and querying to generate insights for operational analysis.

Tools & Technologies

AWS Glue: Used for data transformation and job orchestration.
Amazon S3: Served as the raw and curated data storage.
Athena: Enabled querying and visualization of the curated data.
SQL: For data transformations and aggregation in Athena.
IAM: To manage secure access to AWS resources.

Features

Data Sources:
- Airline, airport, and flight data stored in S3.
- Raw data ingested into AWS Glue for further processing.
ETL Workflow:
- Crawlers: Automatically discover raw data schemas and create Glue tables.
- ETL Jobs: Join and transform datasets with SQL queries.
- Output: Store curated data in S3 for reporting.
Athena Analysis:
- Created tables for curated data.
- Generated insights using SQL queries, such as flight counts by airline and airport.

Steps to Reproduce

1. Setup

Create AWS Glue Database: studentID_assignment5_db.
Set up S3 buckets:
- studentID-assignment5-raw-bucket
- studentID-assignment5-curated-bucket

2. Data Ingestion

Upload .csv files into respective folders (airlines, airports, flights) in the raw bucket.

3. Crawlers

Configure and run Glue Crawlers for each dataset.
Verify the creation of raw tables in the Glue database.

4. ETL Job

Use Glue Studio to:

Add source nodes for raw data.

Join tables using SQL:

SELECT 
    al.airline, 
    ap.airport, 
    fl.month, 
    COUNT(*) AS flt_cnt
FROM fl
JOIN ap ON (fl.origin_airport = ap.iata_code)
JOIN al ON (fl.airline = al.iata_code)
WHERE day_of_week IN (1, 2, 3, 4, 5)
GROUP BY al.airline, ap.airport, fl.month;

Output results to the curated bucket.

5. Athena Queries

Load the curated data into Athena.
Run SQL queries to generate insights (see SQL/Athena_Queries.sql).

Deliverables

ETL Pipeline: End-to-end automation from raw data ingestion to curated data output.
Insights: Flight counts by airlines and airports, trends, and analytics.
Screenshots: Include key configurations and execution results.

Results

Successfully processed and curated datasets for querying.
Improved efficiency in analyzing airline operations using AWS services.

Repository Contents

SQL/: Contains SQL scripts used in Athena.
Scripts/: Glue job scripts or configurations.
Resources/: Sample data files used for testing.
Screenshots/: Visuals showcasing the ETL process and results.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

-ETL-project-

ETL Data Pipeline Implementation Using AWS Glue

Project Overview

Tools & Technologies

Features

Steps to Reproduce

1. Setup

2. Data Ingestion

3. Crawlers

4. ETL Job

5. Athena Queries

Deliverables

Results

Repository Contents

About

Releases

Packages

shikhasingh96/-ETL-project-

Folders and files

Latest commit

History

Repository files navigation

-ETL-project-

ETL Data Pipeline Implementation Using AWS Glue

Project Overview

Tools & Technologies

Features

Steps to Reproduce

1. Setup

2. Data Ingestion

3. Crawlers

4. ETL Job

5. Athena Queries

Deliverables

Results

Repository Contents

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages