AstroMrs

Scalable ETL data pipeline that extracts, transforms, and loads movie data from the TMDB API into a MongoDB database, leveraging Apache Spark for efficient distributed data processing and Apache Airflow for automating and orchestrating the workflow to ensure reliability and maintainability.

Overview

This repository contains a personal project aimed at enhancing my skills in Data Engineering. The project involves building a data pipeline that extracts data from an API, transforms it, and loads it into a MongoDB database. I use Spark to distribute this process and Airflow to automate it.

Important

This project was made for the sake of practicing and learning.

Table of Contents

  • Architecture diagram
  • How it works
  • Tech Stack
  • Prerequisites
  • Run the project
  • References
  • Contributions

Architecture diagram

Will add this later

How it works

1. Extract

The extract phase is handled by the fetch_movies function from the src/fetch_movies.py module.

  • API Requests: The fetch_movies function sends GET requests to TMDB API endpoints.
  • Spark Integration: For endpoints with multiple pages, the script uses Spark's RDD API to distribute the API calls across two Spark workers (see the sketch below).
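
A minimal sketch of what this might look like; the exact endpoint, pagination logic, and API-key handling in src/fetch_movies.py may differ, and the endpoint URL, fetch_page helper, and num_pages parameter here are illustrative assumptions.

import os
import requests

def fetch_page(page):
    # Fetch one page of movies from a TMDB endpoint (example endpoint; assumption)
    response = requests.get(
        "https://api.themoviedb.org/3/movie/popular",
        params={"api_key": os.environ["TMDB_API_KEY"], "page": page},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["results"]

def fetch_movies(spark, num_pages):
    # Distribute the per-page API calls across the Spark workers as an RDD
    pages = spark.sparkContext.parallelize(range(1, num_pages + 1))
    return pages.flatMap(fetch_page)  # one flat RDD of movie dicts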

2. Transform

The transform phase is handled by the validation_aka_transformation function from the src/transform_movies.py module.

  • Data Cleaning: The validation_aka_transformation function cleans the fetched movies by removing duplicates, unwanted fields, and null values.
  • Data Transformation: The function also converts release_date to a datetime and derives a year field (see the sketch below).
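
A hedged sketch of the transform step; the field names in WANTED_FIELDS are assumptions based on TMDB's movie schema, not a copy of src/transform_movies.py.

from datetime import datetime

WANTED_FIELDS = {"id", "title", "release_date", "vote_average"}  # assumed subset

def validation_aka_transformation(movies):
    seen_ids = set()
    cleaned = []
    for movie in movies:
        if movie.get("id") in seen_ids:
            continue  # remove duplicates by TMDB id
        seen_ids.add(movie.get("id"))
        # keep only wanted fields and drop null values
        record = {k: v for k, v in movie.items()
                  if k in WANTED_FIELDS and v is not None}
        if record.get("release_date"):
            # convert release_date to datetime and create a year field
            record["release_date"] = datetime.strptime(record["release_date"], "%Y-%m-%d")
            record["year"] = record["release_date"].year
        cleaned.append(record)
    return cleaned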

3. Load

The load phase is executed by the save_movies_mongo function from the src/store_movies.py module.

  • Storage in MongoDB: After validation, the transformed movies are loaded into a MongoDB collection (movies_collection), as sketched below.
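
A minimal sketch of the load step, assuming pymongo; the connection URI and database name are placeholders and may differ from src/store_movies.py.

from pymongo import MongoClient

def save_movies_mongo(movies, uri="mongodb://localhost:27017", db_name="movies_db"):
    # Insert the transformed movies into the movies_collection collection
    client = MongoClient(uri)
    try:
        if movies:
            client[db_name]["movies_collection"].insert_many(movies)
    finally:
        client.close()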

These three ETL functions are called within a single Spark job to speed up extracting, transforming, and loading large volumes of movies.
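
Put together, the job might look roughly like this; a sketch reusing the assumed signatures above, with the module import paths being guesses.

from pyspark.sql import SparkSession
from src.fetch_movies import fetch_movies
from src.transform_movies import validation_aka_transformation
from src.store_movies import save_movies_mongo

if __name__ == "__main__":
    spark = SparkSession.builder.appName("tmdb_etl").getOrCreate()
    raw_movies = fetch_movies(spark, num_pages=50).collect()  # extract
    movies = validation_aka_transformation(raw_movies)        # transform
    save_movies_mongo(movies)                                 # load
    spark.stop()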

Apache Airflow is used to automate and orchestrate the Spark job, enabling scheduled execution and efficient management of the ETL workflow.
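
For illustration, a DAG built around SparkSubmitOperator could look like the following; the DAG id, schedule, application path, and connection id are assumptions, not the repo's actual DAG.

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="movies_etl",          # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # assumed schedule
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_spark_etl_job",
        application="/opt/airflow/src/etl_job.py",  # hypothetical path
        conn_id="spark_default",
    )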

Tech Stack

  • Python: Main programming language used for building the ETL pipeline logic.
  • Docker & Docker-Compose: Containerize the application and manage services such as Spark, Airflow, and MongoDB.
  • Apache Airflow: Automates and schedules the ETL workflow.
  • PySpark: Handles distributed data processing for the ETL.
  • MongoDB: NoSQL database that stores the transformed movies.

Prerequisites

To run the project you need:

  • Docker and Docker Compose installed
  • A TMDB API key (the extract step calls the TMDB API)

Run the project

docker-compose up airflow-init

docker-compose up --build

In the Airflow webserver (Admin >> Connections), you'll need to create a Spark connection.
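
As a rough guide, a typical setup uses Conn Id spark_default with Conn Type Spark, pointed at the Spark master defined in docker-compose (for example, Host spark://spark-master and Port 7077); the exact values depend on this project's docker-compose service names.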

References

Contributions

Feel free to submit a pull request or report issues. Contributions are welcome to make this project even better!
