This project provides an ETL (Extract, Transform, Load) framework designed to handle data with nested JSON structures.
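This project requires Python 3.x and the following libraries:
- pandas
- pyspark
- psycopg2 (used in the loading step)
The json module is part of the Python standard library and does not need to be installed.
You can install the others using pip:
pip install pandas pyspark psycopg2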
Usage Instructions:
- The dataset is in nested JSON format, which cannot be read directly as a table, so it has to be converted into a structured, readable format.
- Data Source: nested JSON datasets are very common and easy to find online; Kaggle is one of the best sources for them.
- Transformation: columns are exploded and pivoted, and nested structures are flattened, so that real-world data can be processed as a table.
- Loading: the data is loaded into PostgreSQL using the psycopg2 library.
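
The transformation and loading steps above could look roughly like the sketch below. It is only an illustration, not this project's actual code: the sample records, the column names, the function names flatten_records and load_to_postgres, and the table name etl_output are all hypothetical, and pandas is used for the flattening (the project may do this in PySpark instead). The target table is assumed to already exist with matching columns.

```python
import pandas as pd

# Hypothetical sample records; the real dataset's fields are not described here.
records = [
    {"id": 1, "user": {"name": "Ann", "city": "Cairo"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Bob", "city": "Giza"}, "tags": ["c"]},
]


def flatten_records(rows):
    """Flatten nested dicts into columns, then explode the list column into rows."""
    # Nested keys become prefixed columns, e.g. user -> user_name, user_city.
    df = pd.json_normalize(rows, sep="_")
    # Each element of the "tags" list gets its own row.
    return df.explode("tags", ignore_index=True)


def load_to_postgres(df, dsn, table="etl_output"):
    """Insert every row of df into an existing PostgreSQL table (assumed schema)."""
    # Imported here so the flattening part runs without a database driver installed.
    import psycopg2
    from psycopg2 import sql

    columns = list(df.columns)
    # Build the statement with safely quoted identifiers and one placeholder per column.
    query = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier(table),
        sql.SQL(", ").join(map(sql.Identifier, columns)),
        sql.SQL(", ").join(sql.Placeholder() * len(columns)),
    )
    rows = list(df.itertuples(index=False, name=None))
    conn = psycopg2.connect(dsn)
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn:
            with conn.cursor() as cur:
                cur.executemany(query, rows)
    finally:
        conn.close()


flat = flatten_records(records)
print(flat)
```

To load the result, call load_to_postgres(flat, dsn) with a connection string for your own database, for example "dbname=mydb user=me" (placeholder values). For large datasets, executemany is slow; a bulk method such as COPY is usually a better fit.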