This project builds an end-to-end data engineering pipeline for processing and analyzing data from the MiBici public bicycle system in the Guadalajara Metropolitan Area. The pipeline, orchestrated with Mage, downloads open data from the MiBici website, stores it in a Google Cloud Storage (GCS) bucket, transforms the data using Apache Spark, and loads it into Google BigQuery for visualization and analysis.
The project was developed as the final assignment for the Data Engineering Zoomcamp in the 2024 Cohort.
The application fetches MiBici data from 2020 onward.
Objective:
Our project aims to streamline bike-sharing data analysis and visualization by offering a comprehensive solution. Through automated data ingestion, processing, and visualization, we empower stakeholders to extract actionable insights more efficiently, enhancing decision-making and operational efficiency.
Pipeline description:
The pipelines are orchestrated via Mage.
- Extraction: The pipeline retrieves data from the APIs and stores it in an optimized data format, Parquet. The data files are partitioned by year and month.
- Transformation: Once stored, the data is read back from the Parquet files on GCS and transformed. This includes filtering out outliers, such as trips exceeding a day in duration, and data cleansing tasks such as replacing missing values with mean values. Additionally, metrics such as trip duration are calculated to enrich the dataset (see the sketch after this list).
- Data warehouse: The processed data is then uploaded to BigQuery, Google's cloud-based data warehouse, where it is stored in partitioned tables. Partitioning by year and month facilitates efficient querying. The target tables consist of dimensional data such as the stations table and the rides_fact table, along with a curated dataset for visualization purposes, referred to as "rides-analytics".
- Visualization: From the BigQuery dataset, bicycle usage patterns and trends are analyzed and visualized using Tableau, Google Looker Studio, or other visualization tools.
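For illustration, the snippet below is a minimal PySpark sketch of the transformation step described above; the column names (trip_start, trip_end, age) and the bucket path are assumptions and may not match the project's actual schema.

```python
# Minimal sketch of the transformation step (assumed column names and paths).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mibici-transform").getOrCreate()

# Read the Parquet files written by the extraction step (placeholder bucket).
rides = spark.read.parquet("gs://YOUR_BUCKET/bikesdataset/raw/rides/")

# Enrich: trip duration in minutes from the start/end timestamps (assumed columns).
rides = rides.withColumn(
    "trip_duration_min",
    (F.unix_timestamp("trip_end") - F.unix_timestamp("trip_start")) / 60,
)

# Filter outliers: drop trips longer than one day.
rides = rides.filter(F.col("trip_duration_min") <= 24 * 60)

# Cleanse: replace missing values with the column mean (example: rider age).
mean_age = rides.select(F.mean("age")).first()[0]
rides = rides.fillna({"age": mean_age})
```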
Before running the data engineering pipeline, ensure you have:
- VS Code with the Remote - SSH extension installed
- A GCP account set up
To run the code, follow the steps below.
To facilitate seamless deployment and reproducibility, we use Terraform to provision a Google Cloud Platform (GCP) virtual machine (VM). The Terraform configuration pulls our Mage git repository containing the data pipeline, automating the setup process.
*Note: these instructions are for Linux/WSL.
git clone https://github.com/Javeed-Pasha/Bike-usage-analytics-project Bike-usage-analytics-project/
sudo apt update && sudo apt install terraform
terraform init
Create a Service Account key
Ensure the service account has the following permissions:
BigQuery Admin, Cloud Run Developer, Cloud SQL Admin, Compute Admin, Service Account Token Creator, Storage Admin
Download the service account key, save it as my-creds.json, and place it on your local machine under the ~/.gc/ directory. This key file will be used to authenticate requests to GCP services.
sudo apt-get update && sudo apt-get install google-cloud-sdk
gcloud init
export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
ssh-keygen -t rsa -f ~/.ssh/gcp_key -C gcp -b 2048
Note that in this context, "gcp_key" refers to the filename of the SSH key, while "gcp" is the username. This credential will be used to log in to the newly created GCP VM in the subsequent steps.
Edit the Terraform files cloned in the previous step:
- In variables.tf:
  - If the SSH key uses a different username, adjust VM_USER_HOME to /home/USERNAME, replacing "USERNAME" with your chosen username.
  - Set project to your GCP project ID.
  - Set bq_dataset to your BigQuery dataset name.
  - Set gcs_bucketname to your GCS bucket name.
- In main.tf:
  - If the username is "gcp", no changes are required.
  - Otherwise, replace ssh-keys = "gcp:${file("~/.ssh/gcp_key.pub")}" with ssh-keys = "<USERNAME>:${file("~/.ssh/KEYFILENAME.pub")}", replacing <USERNAME> with your chosen username and KEYFILENAME with the filename from step 6 of the SSH key generation process.
The Terraform script provisions a Google Cloud Platform (GCP) virtual machine (VM) and fetches this Mage git repository, which contains the Mage data pipeline that extracts the MiBici data into the GCS bucket and loads it into the BigQuery warehouse.
IMPORTANT: The only failures likely during terraform apply are GCS bucket name and BigQuery dataset name conflicts. If the creation of the GCS bucket or BigQuery dataset fails during the Terraform run because of a name conflict, choose new names and rerun the terraform plan and terraform apply steps.
terraform init
terraform plan
terraform apply
- Connect to your new GCP VM using VS Code with the Remote - SSH extension.
- Enter the SSH connection details for the GCP VM.
sudo chown -R $USER:$USER ~/mage
sudo usermod -a -G docker $USER
newgrp docker
Manually copy the contents of the service account key JSON created in step 3 to ~/mage/my-creds.json on the new GCP VM.
- To begin, navigate to the ~/mage directory in your terminal (cd ~/mage). Next, ensure that the Docker containers are running; if not, start them by executing docker-compose up -d
- Ensure that you configure port forwarding in VS Code for ports 6789 and 5432.
- Now you can access the Mage application at http://localhost:6789/.
- Modify the create_spark_session block within the DataPipeline_mibici pipeline. Replace the variables listed below with the values defined in step 7(1) (see the sketch below):
bucket_name = 'REPLACE_WITH_GCP_BUCKETNAME'
project_id = 'REPLACE_WITH_GCP_PROJECT_ID'
bigquery_dataset = 'REPLACE_WITH_BIGQUERY_DATASETNAME'
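For reference only, here is a hedged sketch of what a GCS- and BigQuery-aware Spark session can look like; the connector package version, key-file path, and configuration keys are assumptions and may differ from the actual create_spark_session block in this repository.

```python
# Hypothetical Spark session configured for GCS and BigQuery access;
# the actual create_spark_session block in the Mage pipeline may differ.
from pyspark.sql import SparkSession

bucket_name = 'REPLACE_WITH_GCP_BUCKETNAME'
project_id = 'REPLACE_WITH_GCP_PROJECT_ID'
bigquery_dataset = 'REPLACE_WITH_BIGQUERY_DATASETNAME'

spark = (
    SparkSession.builder
    .appName("DataPipeline_mibici")
    # spark-bigquery connector (version is an assumption).
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2")
    # Authenticate the GCS connector with the service account key copied earlier
    # (container path is an assumption).
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/home/src/my-creds.json")
    .getOrCreate()
)
```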
- Finally, execute the pipeline named DataPipeline_mibici. Navigate to the triggers section and click RUN@ONCE to run the pipeline once.
- Important: If the pipeline DataPipeline_mibici fails for any reason, edit the pipeline and manually rerun the block called Create_Spark_Session and any related failed blocks.
Your pipeline should look like this:
Once the process is complete, the raw CSV data for rides will be partitioned by year and month and stored as Parquet files in Google Cloud Storage under the directory structure shown below.
zoomcamp/
└── bikesdataset/
    └── raw/
        ├── rides/
        │   ├── year=2021/
        │   │   ├── month=1/
        │   │   │   └── ...
        │   │   ├── month=2/
        │   │   │   └── ...
        │   │   └── ...
        │   └── ...
        └── nomenclature/
            └── ...
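For reference, a Hive-style partitioned write such as the one sketched below produces that layout; the source path, bucket name, and the trip_start column are assumptions.

```python
# Sketch of writing rides partitioned by year and month (assumed paths/columns).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mibici-extract").getOrCreate()

# Placeholder source: the raw rides CSV downloaded from the MiBici site.
rides = spark.read.csv("gs://YOUR_BUCKET/landing/rides.csv",
                       header=True, inferSchema=True)

# Derive the partition columns from the trip start timestamp (assumed column).
rides = (
    rides
    .withColumn("trip_start", F.to_timestamp("trip_start"))
    .withColumn("year", F.year("trip_start"))
    .withColumn("month", F.month("trip_start"))
)

# partitionBy creates the year=YYYY/month=M/ subdirectories shown above.
(rides.write
      .mode("overwrite")
      .partitionBy("year", "month")
      .parquet("gs://YOUR_BUCKET/bikesdataset/raw/rides/"))
```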
In BigQuery, you will find a dimension table named Dim_Stations, a fact table called Rides_Fact, and an analytics table named rides-analytics. The analytics table contains metrics such as ride routes to identify popular routes.
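As an example of using the analytics table, the following sketch queries popular routes with the BigQuery Python client; the column names origin_station and destination_station are assumptions based on the description above.

```python
# Hedged example: count the most frequent routes in the rides-analytics table.
# Column names are assumptions; adjust them to the actual table schema.
from google.cloud import bigquery

client = bigquery.Client(project="REPLACE_WITH_GCP_PROJECT_ID")

query = """
    SELECT origin_station, destination_station, COUNT(*) AS trips
    FROM `REPLACE_WITH_GCP_PROJECT_ID.REPLACE_WITH_BIGQUERY_DATASETNAME.rides-analytics`
    GROUP BY origin_station, destination_station
    ORDER BY trips DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.origin_station, row.destination_station, row.trips)
```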
- Log in to Google Looker Studio.
- Connect your dataset using the BigQuery connector.
- Select your project name and the dataset. This launches a dashboard page.
- Create your visualizations and share.