- Project Philosophy
- Tech Stacks
- Data Warehouse Schema
- Streamlit Cloud Deployment
- Implementation
- Data Validations
- How To Run?
Kardía is a project designed to analyze the factors that contribute to heart attacks and predict the likelihood of someone experiencing one.
This project is divided into three parts:
- ETL: A streamlined pipeline for extracting, transforming, and loading health-related data.
- Analysis: Power BI, time series analysis, and network analysis were used to examine the key factors contributing to heart attacks.
- Prediction App: A machine learning Streamlit web application that uses a Random Forest Classifier to predict heart attack risk with 98.7% accuracy, based on user input.
- As a user, I want to input my personal health data, so I can receive a prediction on my likelihood of having a heart attack.
- As a user, I want to view the accuracy of the heart attack prediction model, so I can trust the results I'm given.
- As a user, I want to explore a Power BI report that visualizes heart attack data, so I can see how different health factors correlate with heart attacks.
- As a user, I want to filter the Power BI report by age, gender, and other health conditions, so I can focus on data relevant to me or my demographic.
- Python: This project utilizes Python for creating the ETL (Extract, Transform, Load) pipeline, enabling efficient data handling and preprocessing.
- Streamlit: Streamlit is used to create the user interface for the machine learning model. It provides an interactive platform where users can input their data and receive heart attack predictions in real-time, making the model easy to use and accessible.
- MySQL: MySQL is used to design and manage the schema in the database, enabling organized, scalable, and efficient data storage.
- DVC (Data Version Control): DVC is employed to version the data, ensuring that every change in the dataset is tracked and reproducible. This is especially important in projects dealing with evolving data sources.
- Random Forest Classifier: The machine learning model at the heart of this project is built using Python. The Random Forest algorithm is chosen for its effectiveness in handling binary classification tasks, like predicting the likelihood of a heart attack.
- MLflow: For model versioning, MLflow is used to manage the lifecycle of the Random Forest model, including tracking experiments, packaging code into reproducible runs, and deploying models.
- Power BI: Power BI is used to create interactive visualizations and dashboards that provide insights into heart attack trends, enabling data analysis and reporting for better understanding and decision-making.
- PowerShell Scripts: PowerShell scripts streamline and automate repetitive tasks such as running specific Python scripts or managing data workflows.
- Windows Task Scheduler: Windows Task Scheduler is used to schedule a batch script that runs the ETL process.
The machine learning model is stored in Google Drive. When the app starts, it checks whether the model is already available locally. If not, it downloads the model from Google Drive using `gdown`. The model is then loaded into the app using `joblib`.
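To ensure efficient use of resources, the model is cached using Streamlit's `@st.cache_resource` decorator, so it is loaded once and reused across reruns and sessions instead of being reloaded on every interaction. This helps reduce the memory load and avoid redundant downloads, especially when the app is reopened.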
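The download-then-cache flow described above can be sketched roughly as follows. This is a minimal sketch, not the app's actual code: `MODEL_PATH`, `DRIVE_URL`, `ensure_local`, `fetch_from_drive`, `load_model`, and `get_cached_model` are hypothetical names, and the Drive file ID is a placeholder. The third-party imports are kept inside the functions so the file-check helper can be tried on its own.

```python
from pathlib import Path
from typing import Callable

# Hypothetical values: the real path and Drive file ID are not given in this README.
MODEL_PATH = Path("models/model.joblib")
DRIVE_URL = "https://drive.google.com/uc?id=<FILE_ID>"


def ensure_local(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return `path`, calling `fetch(path)` first only if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


def fetch_from_drive(path: Path) -> None:
    """Download the model file from Google Drive with gdown."""
    import gdown  # third-party

    gdown.download(DRIVE_URL, str(path), quiet=False)


def load_model():
    """Download the model if needed, then deserialize it with joblib."""
    import joblib  # third-party

    return joblib.load(ensure_local(MODEL_PATH, fetch_from_drive))


def get_cached_model():
    """Return the model, loading it only on the first call in a Streamlit app."""
    import streamlit as st  # third-party

    # st.cache_resource(f) is the call form of the @st.cache_resource decorator.
    return st.cache_resource(load_model)()
```

In an app script it would be more usual to place `@st.cache_resource` directly above `load_model`; the call form is used here only to keep the Streamlit import local to one function.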
The app is designed with an intuitive interface that allows users to interact with the machine learning model. Streamlit simplifies deployment by automatically handling scalability and hosting, while the app remains responsive and user-friendly.
You can access the Streamlit app from here.
Home screen | Prediction Screen |
---|---|
Overview Screen | Line Chart Screen |
---|---|
Personal Analysis Screen | Scatter Screen |
---|---|
Disease Analysis Screen | Decomposition Tree Screen |
---|---|
This project employs a validation methodology to ensure the reliability and accuracy of data loading, which helps identify and address potential issues early in the development process.
Logs |
---|
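As an illustration of what a load check that produces a log record might look like, here is a minimal sketch. It is not the project's actual validation code: the function name `validate_load`, the two checks (row-count match, required columns present), the column names, and the fields of the returned record are all assumptions chosen for illustration. The real logs table is defined in `ETL/dwh/logs_table.sql`.

```python
from datetime import datetime, timezone


def validate_load(step, rows_in, rows_out, columns, required_columns):
    """Compare a load's input and output and return a log record (a plain dict)."""
    problems = []
    if rows_in != rows_out:
        problems.append(f"row count mismatch: {rows_in} extracted, {rows_out} loaded")
    missing = sorted(set(required_columns) - set(columns))
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    return {
        "step": step,
        "status": "FAILED" if problems else "OK",
        "message": "; ".join(problems),
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up numbers and column names, purely to show the shape of a record.
record = validate_load(
    step="extract",
    rows_in=100,
    rows_out=98,
    columns=["age", "sex"],
    required_columns=["age", "sex", "bmi"],
)
print(record["status"], "-", record["message"])
```

A record like this could then be inserted as one row into the logs table, so failed loads are visible without rerunning the pipeline.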
You can also see the data versioning with DVC.
To set up Kardía locally, follow these steps:
- Python: I prefer Miniconda, which offers several advantages over a standalone Python installation, especially for data science and scientific computing tasks. You can see how to install it here.
- MySQL and MySQL Workbench: Download them here. From the same list, also download the MySQL connector. If you're on Windows, download Connector/NET.
- Power BI: It runs on Windows only, and you can download it from the Microsoft Store or from here.
- Clone the repo:
git clone https://github.com/mostafa-fallaha/heart-disease-prediction.git
cd heart-disease-prediction
- Install the required Python packages:
pip install -r requirements.txt
- Create the DVC storage:
mkdir /tmp/dvc_heart
- Download the parquet file from here and put it in `ETL/docs`.
- Create the logs table. You can find the SQL script for it in `ETL/dwh/logs_table.sql`.
- Create a `.env` file in the root of the project containing the following:
DB_USER=your database username
DB_PASSWORD=your database password
DB_HOST=your host (usually localhost)
DB_PORT=the port where mysql is running (usually 3306)
LOGS_DB=the database where your logs table is.
DB_STAGING=the staging schema name (create the schema in mysql workbench, no need to create any table)
DB_DWH=the DWH schema name (its tables are created in a later step with final_dwh.sql)
VERSION=0.9 (the data version, incremented whenever you run the ETL process)
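To show how these variables might be consumed, here is a minimal sketch that builds a SQLAlchemy-style MySQL URL for one of the schemas. The helper name `mysql_url` and the `mysql+mysqlconnector` driver prefix are assumptions, and the values in `example_env` are placeholders. The ETL scripts may read the `.env` file and connect differently.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def mysql_url(schema_var: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a MySQL connection URL for the schema named by `schema_var`."""
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])  # escape characters such as '@' or '/'
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "3306")
    return f"mysql+mysqlconnector://{user}:{password}@{host}:{port}/{env[schema_var]}"


# Example with placeholder values instead of a real .env file.
example_env = {
    "DB_USER": "etl_user",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "3306",
    "DB_STAGING": "heart_staging",
}
print(mysql_url("DB_STAGING", example_env))
# prints mysql+mysqlconnector://etl_user:p%40ss%2Fword@localhost:3306/heart_staging
```

Passing `"DB_DWH"` or `"LOGS_DB"` instead selects the other schemas, provided those keys are set.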
- Run `extract.py` in the ETL folder to load the data into the staging schema.
- In MySQL Workbench, create a new schema (the DWH schema) and put its name in the `.env` file (as `DB_DWH`). Then run `final_dwh.sql` (you can find it in `ETL/dwh`) in the newly created schema to create the tables and relations.
- Run `transform.py` in the ETL folder to transform the data, load it into the DWH, and version the data via DVC.
- Train and version (via MLflow) the machine learning model, which reads the data from DVC via the DVC Python API:
cd DataScience
python3 model_versioning.py
- Run the MLflow UI from the root directory:
cd ..
mlflow ui
This command keeps running and occupies the terminal.
- Run the Streamlit app: open a new terminal in the project directory.
cd DataScience
streamlit run app.py
Now, you should be able to run the Streamlit app locally and explore its features.
- To access the Power BI report, you can download it from here.