ETL Data Pipeline Implementation Using AWS Glue, S3, and Athena
This project demonstrates the implementation of an ETL pipeline for processing airline, airport, and flight data using AWS Glue, S3, and Athena. It automates data ingestion, transformation, and querying to generate insights for operational analysis.
- AWS Glue: Data transformation and job orchestration.
- Amazon S3: Storage for raw and curated data.
- Athena: Querying and visualization of the curated data.
- SQL: Data transformations and aggregation in Athena.
- IAM: Secure access management for AWS resources.
- Data Sources:
  - Airline, airport, and flight data stored in S3.
  - Raw data ingested into AWS Glue for further processing.
- ETL Workflow:
- Athena Analysis:
  - Created tables for curated data.
  - Generated insights using SQL queries, such as flight counts by airline and airport.
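
As an illustration of the Athena analysis described above, the following is a minimal sketch of how a "flight counts by airline" query might be submitted from Python. It assumes the curated table exposes `airline` and `flt_cnt` columns; the table name `curated_flights`, the helper names, and the result location are hypothetical and not taken from this project. The `boto3` calls need AWS credentials, so the import is deferred until the query is actually run.

```python
import time


def flights_by_airline_sql(table: str) -> str:
    """Total flight count per airline from a curated table (name is an assumption)."""
    return (
        f"SELECT airline, SUM(flt_cnt) AS total_flights "
        f"FROM {table} GROUP BY airline ORDER BY total_flights DESC"
    )


def athena_request(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict, poll_seconds: float = 1.0) -> list:
    """Submit the query, wait for it to finish, and return the first page of rows."""
    import boto3  # imported lazily so the helpers above work without boto3

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # The first row holds the column headers; each cell is {"VarCharValue": ...}.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call such as `run_query(athena_request(flights_by_airline_sql("curated_flights"), "studentID_assignment5_db", "s3://<results-bucket>/<prefix>/"))` would then return the header row followed by one row per airline, with the placeholders replaced by real values.
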
- Create AWS Glue Database: `studentID_assignment5_db`.
- Set up S3 buckets:
  - `studentID-assignment5-raw-bucket`
  - `studentID-assignment5-curated-bucket`
- Upload `.csv` files into respective folders (`airlines`, `airports`, `flights`) in the raw bucket.
- Configure and run Glue Crawlers for each dataset.
- Verify the creation of raw tables in the Glue database.
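
The setup steps above can also be scripted. The sketch below is one possible way to do it with `boto3`, assuming credentials and an IAM role for Glue already exist. The crawler naming pattern, the `csv_paths` mapping, and all function names are hypothetical. The ID is lowercased in bucket names because S3 bucket names must not contain uppercase letters.

```python
import os
from dataclasses import dataclass

DATASETS = ("airlines", "airports", "flights")  # folder names from the steps above


@dataclass(frozen=True)
class PipelineNames:
    database: str
    raw_bucket: str
    curated_bucket: str


def build_names(student_id: str) -> PipelineNames:
    """Fill the studentID placeholder in the naming scheme above."""
    sid = student_id.lower()  # S3 bucket names must be lowercase
    return PipelineNames(
        database=f"{student_id}_assignment5_db",
        raw_bucket=f"{sid}-assignment5-raw-bucket",
        curated_bucket=f"{sid}-assignment5-curated-bucket",
    )


def raw_key(dataset: str, filename: str) -> str:
    """S3 object key that places a CSV file in its dataset folder."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    return f"{dataset}/{filename}"


def crawler_request(names: PipelineNames, dataset: str, role_arn: str) -> dict:
    """Keyword arguments for glue.create_crawler, one crawler per dataset."""
    return {
        "Name": f"{dataset}-raw-crawler",  # hypothetical crawler name
        "Role": role_arn,
        "DatabaseName": names.database,
        "Targets": {"S3Targets": [{"Path": f"s3://{names.raw_bucket}/{dataset}/"}]},
    }


def provision(student_id: str, role_arn: str, csv_paths: dict) -> None:
    """Create the resources and start the crawlers (needs AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    names = build_names(student_id)
    s3, glue = boto3.client("s3"), boto3.client("glue")
    glue.create_database(DatabaseInput={"Name": names.database})
    for bucket in (names.raw_bucket, names.curated_bucket):
        # Outside us-east-1, create_bucket also needs CreateBucketConfiguration.
        s3.create_bucket(Bucket=bucket)
    for dataset, path in csv_paths.items():  # e.g. {"flights": "data/flights.csv"}
        s3.upload_file(path, names.raw_bucket, raw_key(dataset, os.path.basename(path)))
        request = crawler_request(names, dataset, role_arn)
        glue.create_crawler(**request)
        glue.start_crawler(Name=request["Name"])
```

Keeping the naming and request-building helpers free of AWS calls makes them easy to check before anything is created in the account.
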
- Use Glue Studio to:
  - Add source nodes for raw data.
  - Join tables using SQL:

        SELECT al.airline, ap.airport, fl.month, COUNT(*) AS flt_cnt
        FROM fl
        JOIN ap ON fl.origin_airport = ap.iata_code
        JOIN al ON fl.airline = al.iata_code
        WHERE fl.day_of_week IN (1, 2, 3, 4, 5)
        GROUP BY al.airline, ap.airport, fl.month;

  - Output results to the curated bucket.
- Load the curated data into Athena.
- Run SQL queries to generate insights (see `SQL/Athena_Queries.sql`).
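
The join used in the Glue Studio step can be sanity-checked locally before running the job. The sketch below runs the same query against an in-memory SQLite database as a stand-in for Glue. The sample rows are made up for illustration and are not taken from the project's data files, the table schemas contain only the columns the query references, and an `ORDER BY` is added so the output order is deterministic.

```python
import sqlite3

JOIN_SQL = """
SELECT al.airline, ap.airport, fl.month, COUNT(*) AS flt_cnt
FROM fl
JOIN ap ON fl.origin_airport = ap.iata_code
JOIN al ON fl.airline = al.iata_code
WHERE fl.day_of_week IN (1, 2, 3, 4, 5)
GROUP BY al.airline, ap.airport, fl.month
ORDER BY al.airline, ap.airport, fl.month
"""


def weekday_flight_counts(airlines, airports, flights):
    """Load sample rows into in-memory tables and run the join."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE al (iata_code TEXT, airline TEXT)")
    con.execute("CREATE TABLE ap (iata_code TEXT, airport TEXT)")
    con.execute(
        "CREATE TABLE fl (airline TEXT, origin_airport TEXT, "
        "month INTEGER, day_of_week INTEGER)"
    )
    con.executemany("INSERT INTO al VALUES (?, ?)", airlines)
    con.executemany("INSERT INTO ap VALUES (?, ?)", airports)
    con.executemany("INSERT INTO fl VALUES (?, ?, ?, ?)", flights)
    rows = con.execute(JOIN_SQL).fetchall()
    con.close()
    return rows


# Made-up sample rows, for illustration only.
airlines = [("AA", "Alpha Air"), ("BB", "Beta Air")]
airports = [("XXA", "Airport A"), ("XXB", "Airport B")]
flights = [
    ("AA", "XXA", 1, 1),
    ("AA", "XXA", 1, 3),
    ("AA", "XXA", 1, 6),  # day 6 is excluded by the WHERE clause
    ("BB", "XXB", 2, 5),
    ("BB", "XXA", 2, 7),  # day 7 is excluded by the WHERE clause
]

for row in weekday_flight_counts(airlines, airports, flights):
    print(row)
# ('Alpha Air', 'Airport A', 1, 2)
# ('Beta Air', 'Airport B', 2, 1)
```

Only flights with `day_of_week` between 1 and 5 are counted, so the two flights on days 6 and 7 do not appear in the result.
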
- ETL Pipeline: End-to-end automation from raw data ingestion to curated data output.
- Insights: Flight counts by airline and airport, trends, and analytics.
- Screenshots: Include key configurations and execution results.
- Successfully processed and curated datasets for querying.
- Improved efficiency in analyzing airline operations using AWS services.
- `SQL/`: Contains SQL scripts used in Athena.
- `Scripts/`: Glue job scripts or configurations.
- `Resources/`: Sample data files used for testing.
- `Screenshots/`: Visuals showcasing the ETL process and results.