Bike usage analytics project

Project Description

This project builds an end-to-end data engineering pipeline for processing and analyzing data from the MiBici public bicycle system in the Guadalajara Metropolitan Area. The pipeline, developed with Mage, downloads open data from the MiBici website, stores it in a Google Cloud Storage (GCS) bucket, transforms the data with Apache Spark, and loads it into Google BigQuery for visualization and analysis.

The project was developed as the final assignment for the 2024 cohort of the Data Engineering Zoomcamp.

The application fetches MiBici data from 2020 onward.

Objective:

The project streamlines bike-sharing data analysis and visualization end to end: automated data ingestion, processing, and visualization let stakeholders extract actionable insights more efficiently, improving decision-making and operational efficiency.

Pipeline description:

The pipelines are orchestrated with Mage.

  • Extraction: the pipeline retrieves the raw data and stores it in Parquet, a more optimized format. The data files are partitioned by year and month.

  • Transformation: once stored, the data is read back from the Parquet files on GCS and transformed. This includes filtering out outliers, such as trips longer than a day, data cleansing tasks such as replacing missing values with mean values, and calculating metrics like trip duration to enrich the dataset (see the sketch after this list).

  • Data warehouse: the processed data is then uploaded to BigQuery, Google's cloud-based data warehouse, where it is stored in partitioned tables. Partitioning by year and month makes querying more efficient. The target tables consist of dimensional data such as the stations and rides_fact tables, along with a curated dataset for visualization, referred to as "rides-analytics".

  • Visualization: from the BigQuery dataset, bicycle usage patterns and trends are analyzed and visualized with Tableau, Google Looker Studio, or other visualization tools.
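
The transformation and load steps above amount to a handful of Spark operations. The following is a minimal PySpark sketch of that logic, assuming illustrative column names (trip_start, trip_end, rider_age) and the placeholder bucket, project, and dataset names used later in this README; the actual Mage blocks may use a different schema and connector configuration.

    # Minimal PySpark sketch of the transformation step; column names and
    # paths are illustrative placeholders, not the exact schema of the repo.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mibici-transform").getOrCreate()

    # Read the raw rides Parquet files written by the extraction step,
    # partitioned by year and month in GCS.
    rides = spark.read.parquet("gs://REPLACE_WITH_GCP_BUCKETNAME/bikesdataset/raw/rides/")

    # Enrich: trip duration in seconds from start and end timestamps.
    rides = rides.withColumn(
        "trip_duration_sec",
        F.col("trip_end").cast("long") - F.col("trip_start").cast("long"),
    )

    # Filter outliers: drop trips longer than one day (86,400 seconds).
    rides = rides.filter(F.col("trip_duration_sec") <= 86400)

    # Cleanse: replace missing values in a numeric column with its mean.
    mean_age = rides.select(F.mean("rider_age")).first()[0]
    rides = rides.withColumn("rider_age", F.coalesce(F.col("rider_age"), F.lit(mean_age)))

    # Load: write the cleaned data to BigQuery (requires the spark-bigquery
    # connector); the real pipeline targets partitioned fact/dimension tables.
    (rides.write.format("bigquery")
        .option("table", "REPLACE_WITH_GCP_PROJECT_ID.REPLACE_WITH_BIGQUERY_DATASETNAME.rides_fact")
        .option("temporaryGcsBucket", "REPLACE_WITH_GCP_BUCKETNAME")
        .mode("append")
        .save())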


Prerequisites

Before running the data engineering pipeline, ensure you have a Google Cloud Platform account with a project you can administer and a Linux/WSL environment. The remaining tooling (Terraform, the Google Cloud SDK, an SSH key) is installed and configured in the steps below.


Setup and Reproducibility

To facilitate seamless deployment and reproducibility, we use Terraform to provision a Google Cloud Platform (GCP) virtual machine (VM). The Terraform configuration pulls our Mage git repository containing the data pipeline, automating the setup process.

Note: these instructions are for Linux/WSL.

1. Clone the Repository

	git clone https://github.com/Javeed-Pasha/Bike-usage-analytics-project  Bike-usage-analytics-project/

2. Install and Initialize Terraform

	sudo apt update && sudo apt install terraform 
	terraform init

3. Create and Download a Service Account Key

Create Service Account key

Ensure the service account has the following roles:

BigQuery Admin, Cloud Run Developer, Cloud SQL Admin, Compute Admin, Service Account Token Creator, Storage Admin

Download the service account key and save it as my-creds.json in the ~/.gc/ directory on your local machine. This key file will be used to authenticate requests to GCP services.
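
As an optional sanity check (not part of the repository's setup), a short Python snippet can confirm the key authenticates against GCP, assuming the google-cloud-storage client library is installed:

    # Optional sanity check: confirm the key can authenticate requests to GCP.
    # Assumes the google-cloud-storage client library is installed locally.
    import os
    from google.cloud import storage

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser("~/.gc/my-creds.json")
    client = storage.Client()  # picks the key up from the environment variable
    print([bucket.name for bucket in client.list_buckets()])  # buckets the account can see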

4. Install and Initialize the Google Cloud SDK

	sudo apt-get update && sudo apt-get install google-cloud-sdk
	gcloud init

5. Set Google Application Credentials

	export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json 
	gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS

6. Create SSH Key

	ssh-keygen -t rsa -f ~/.ssh/gcp_key -C gcp -b 2048

Note that in this context, "gcp_key" is the filename of the SSH key and "gcp" is the username. This credential will be used to log into the newly created GCP VM in the subsequent steps.

7. Update Terraform Configuration

In the Terraform files cloned in step 1, make the following changes:

  • In variables.tf:

    1. If the SSH key uses a different username, adjust VM_USER_HOME to /home/USERNAME in the Terraform file, replacing "USERNAME" with your chosen username.
    2. Set project to your GCP project ID.
    3. Set bq_dataset to your BigQuery dataset name.
    4. Set gcs_bucketname to your GCP bucket name.
  • In main.tf:

    1. If the username is "gcp", no changes are required.
    2. Replace ssh-keys = "gcp:${file("~/.ssh/gcp_key.pub")}" with ssh-keys = "<USERNAME>:${file("~/.ssh/KEYFILENAME.pub")}", replacing <USERNAME> with your chosen username and KEYFILENAME with the key filename from step 6.

8. Deploy Infrastructure

The Terraform scripts provision a Google Cloud Platform (GCP) virtual machine (VM) and fetch this Mage git repository, which contains the Mage data pipeline that extracts the MiBici data into the GCS bucket and loads it into the BigQuery warehouse.

IMPORTANT: terraform apply typically only fails because of GCS bucket or BigQuery dataset name conflicts. If creation of the bucket or dataset fails for that reason, choose new names in variables.tf and rerun terraform plan and terraform apply.

terraform init
terraform plan
terraform apply

9. Use VSCode Remote SSH Extension

  • Connect to your new GCP VM using VS Code with the Remote - SSH extension.
  • Enter the SSH connection details for the GCP VM.

10. Update File Ownership and Add the User to the Docker Group

sudo chown -R $USER:$USER ~/mage
sudo usermod -a -G docker $USER

11. Temporarily Switch to the Docker Group to Avoid Permission Issues

	newgrp docker

12. Copy Service Account Key

Manually copy the contents of the service account key JSON created in step 3 to ~/mage/my-creds.json on the new GCP VM.

Running the Code

  • To begin, navigate to the ~/mage directory in your terminal (cd ~/mage). Next, ensure the Docker containers are running; if not, start them by executing docker-compose up -d.

  • Ensure that you configure port forwarding in VS Code for ports 6789 and 5432.

  • Now, you can access the Mage application at http://localhost:6789/.

  • Modify the create_spark_session block within the DataPipeline_mibici pipeline, replacing the variables listed below with the values you set in step 7 (a sketch of such a block follows this list).

    bucket_name='REPLACE_WITH_GCP_BUCKETNAME'
    project_id = 'REPLACE_WITH_GCP_PROJECT_ID'
    bigquery_dataset = 'REPLACE_WITH_BIGQUERY_DATASETNAME'
    
  • Finally, execute the pipeline named DataPipeline_mibici. Navigate to the triggers section and click RUN@ONCE to run the pipeline once.

  • Important: if the pipeline DataPipeline_mibici fails for any reason, edit the pipeline and manually rerun the Create_Spark_Session block and any other failed blocks.
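
For orientation, the sketch below shows roughly what a Spark session wired up for GCS and BigQuery looks like, using the three variables above. The in-container key-file path and connector settings are assumptions, so defer to the actual create_spark_session block in the repository.

    # Minimal sketch of a Spark session configured for GCS and BigQuery; the
    # key-file path and connector settings are assumptions, and the actual
    # create_spark_session block in the repository may differ.
    from pyspark.sql import SparkSession

    bucket_name = 'REPLACE_WITH_GCP_BUCKETNAME'
    project_id = 'REPLACE_WITH_GCP_PROJECT_ID'
    bigquery_dataset = 'REPLACE_WITH_BIGQUERY_DATASETNAME'

    spark = (
        SparkSession.builder
        .appName("DataPipeline_mibici")
        # Authenticate the GCS connector with the service account key copied
        # into ~/mage in step 12 (the in-container path is an assumption).
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/home/src/my-creds.json")
        # Scratch bucket used by the spark-bigquery connector for indirect writes.
        .config("temporaryGcsBucket", bucket_name)
        .getOrCreate()
    )

    # project_id and bigquery_dataset are consumed by downstream blocks when
    # reading from or writing to the warehouse,
    # e.g. f"{project_id}.{bigquery_dataset}.rides_fact".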

Your pipeline should look like this:

Once the process is complete, the raw CSV ride data is partitioned by year and month and stored as Parquet files in Google Cloud Storage under the directory structure shown below.

zoomcamp/
└── bikesdataset/
    └── raw/
        ├── rides/
        │   ├── year=2021/
        │   │   ├── month=1/
        │   │   │   └── ...
        │   │   ├── month=2/
        │   │   │   └── ...
        │   │   └── ...
        │   └── ...
        └── nomenclature/
            └── ...

In BigQuery, you will find a dimension table named Dim_Stations, a fact table called Rides_Fact, and an analytics table named rides-analytics. The analytics table contains metrics such as ride routes, used to identify popular routes.
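
To explore the analytics table outside a BI tool, a query along the lines of the sketch below surfaces the most popular routes; the column names (start_station_name, end_station_name) are assumptions about the schema rather than taken from the repository.

    # Illustrative query for popular routes; the column names
    # (start_station_name, end_station_name) are assumed, not confirmed.
    from google.cloud import bigquery

    client = bigquery.Client(project="REPLACE_WITH_GCP_PROJECT_ID")
    query = """
        SELECT start_station_name, end_station_name, COUNT(*) AS trips
        FROM `REPLACE_WITH_GCP_PROJECT_ID.REPLACE_WITH_BIGQUERY_DATASETNAME.rides-analytics`
        GROUP BY start_station_name, end_station_name
        ORDER BY trips DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.start_station_name, row.end_station_name, row.trips)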



Creating Visualisations

  • Log in to Google Looker Studio.

  • Connect your dataset using the BigQuery connector.

  • Select your project name and dataset. This will launch a dashboard page.

  • Create your visualizations and share.


