This repository contains a personal project aimed at enhancing my skills in Data Engineering. The project involves building a data pipeline that extracts movie data from the TMDB API, transforms it, and loads it into a MongoDB database. I use Spark to distribute the workload and Airflow to automate it.
Important
This project is made for practice and learning purposes.
Will add this later
The extract phase is handled by the fetch_movies function from the src/fetch_movies.py module.
- API Requests: The fetch_movies function sends GET requests to TMDB API endpoints.
- Spark Integration: For endpoints with multiple pages, the script uses a Spark RDD to distribute the API calls across two Spark workers (see the sketch below).
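A minimal sketch of how the distributed fetch could look. The endpoint, page range, and API key handling here are assumptions for illustration, not necessarily what src/fetch_movies.py does:

```python
# Illustrative sketch: distribute paginated TMDB API calls over a Spark RDD.
# The endpoint, page range, and API key handling are assumptions for this example.
import os
import requests
from pyspark.sql import SparkSession

TMDB_URL = "https://api.themoviedb.org/3/movie/popular"
API_KEY = os.environ["TMDB_API_KEY"]

def fetch_page(page):
    """Fetch a single page of movies from the TMDB API."""
    response = requests.get(TMDB_URL, params={"api_key": API_KEY, "page": page}, timeout=10)
    response.raise_for_status()
    return response.json().get("results", [])

spark = SparkSession.builder.appName("tmdb_fetch").getOrCreate()

# Spread the page numbers over the workers; each partition issues its own GET requests.
pages = spark.sparkContext.parallelize(range(1, 51), numSlices=2)
movies = pages.flatMap(fetch_page).collect()
```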
The transform phase is handled by the validation_aka_transformation function from the src/transform_movies.py module.
- Data Cleaning: The validation_aka_transformation function cleans the fetched movies by removing duplicates, unwanted fields, and null values.
- Data Transformation: The function also converts release_date to a datetime and derives a year field (see the sketch below).
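A minimal sketch of the cleaning and transformation step, assuming the fetched movies are plain Python dicts; the exact fields kept by the real validation_aka_transformation may differ:

```python
# Illustrative sketch of cleaning and transforming the fetched movies.
# The set of kept fields is an assumption for this example.
from datetime import datetime

KEEP_FIELDS = {"id", "title", "release_date", "vote_average"}  # assumed subset

def clean_and_transform(movies):
    seen_ids = set()
    cleaned = []
    for movie in movies:
        # Drop duplicates and records missing a release date.
        if movie.get("id") in seen_ids or not movie.get("release_date"):
            continue
        seen_ids.add(movie["id"])
        # Keep only the wanted fields and drop null values.
        record = {k: v for k, v in movie.items() if k in KEEP_FIELDS and v is not None}
        # Parse release_date and derive a year field.
        record["release_date"] = datetime.strptime(record["release_date"], "%Y-%m-%d")
        record["year"] = record["release_date"].year
        cleaned.append(record)
    return cleaned
```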
The load phase is executed by the save_movies_mongo function from the src/store_movies.py module.
- Storage in MongoDB: After validation, the transformed movies are loaded into a MongoDB collection (movies_collection), as sketched below.
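A minimal sketch of the load step using pymongo; the connection URI, database name, and function name are assumptions for this example:

```python
# Illustrative sketch: load the transformed movies into MongoDB with pymongo.
# The URI and database name are assumptions; only movies_collection comes from the project.
from pymongo import MongoClient

def save_movies_to_mongo(movies, uri="mongodb://mongo:27017"):
    client = MongoClient(uri)
    collection = client["movies_db"]["movies_collection"]
    if movies:
        collection.insert_many(movies)
    client.close()
```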
These three ETL functions are called within a Spark job to speed up the extraction, transformation, and loading of large volumes of movies.
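For example, the Spark job's entry point could chain the three steps roughly like this (argument handling is simplified and assumed for the sketch):

```python
# Illustrative sketch of the Spark job entry point chaining the three ETL steps.
# Module and function names mirror the ones described above; signatures are assumed.
from src.fetch_movies import fetch_movies
from src.transform_movies import validation_aka_transformation
from src.store_movies import save_movies_mongo

def main():
    raw_movies = fetch_movies()                                 # extract (distributed over Spark workers)
    clean_movies = validation_aka_transformation(raw_movies)    # transform
    save_movies_mongo(clean_movies)                             # load

if __name__ == "__main__":
    main()
```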
Apache Airflow is used to automate and orchestrate the Spark job, enabling scheduled execution and efficient management of the ETL workflow.
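A minimal sketch of an Airflow DAG that submits the Spark job on a schedule; the DAG id, schedule, connection id, and application path are assumptions for this example:

```python
# Illustrative Airflow DAG submitting the ETL Spark job on a daily schedule.
# dag_id, schedule, conn_id, and the application path are assumptions.
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="tmdb_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_spark_etl",
        application="/opt/airflow/src/etl_job.py",
        conn_id="spark_default",
        verbose=True,
    )
```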
- Python: Main programming language used for building the ETL pipeline logic.
- Docker & Docker-Compose: Containerize the application and manage services like Spark, Airflow, and MongoDB.
- Apache Airflow: Automates and schedules the ETL workflow.
- PySpark: Handles distributed data processing for the ETL.
- MongoDB: NoSQL database to store the transformed movies.
To run the project you need:
- Docker - you must allocate at least 8 GB of memory to Docker.
- Python 3.8+ (pip)
- docker-compose
- TMDB API key
```bash
docker-compose up airflow-init
docker-compose up --build
```
In the Airflow webserver (Admin >> Connections), you'll need to create a Spark connection.
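For reference, a connection along these lines is typical when Spark runs in docker-compose; the host and port depend on your own compose service names, so treat these values as assumptions:

```
Conn Id:   spark_default
Conn Type: Spark
Host:      spark://spark-master
Port:      7077
```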
- TMDB API Documentation
- RDD vs DataFrame: The Difference!
- Spark Documentation
- PySpark Documentation
- Running Airflow in Docker
- Apache Airflow Tutorial: Architecture, Concepts, and How to Run Airflow Locally With Docker
- How to Schedule and Automate Spark Jobs Using Apache Airflow
Feel free to submit a pull request or report issues. Contributions are welcome to make this project even better!