This repository contains a comprehensive implementation of a Machine Learning Model Lifecycle Management System. It demonstrates the complete lifecycle of a machine learning model, including training, versioning, scheduling, monitoring, alerting, and serving predictions via an API.
- Introduction
- Architecture Overview
- Project Structure
- Environment Setup
- Running the Project
- Detailed Explanation of Components
- Testing and Edge Cases
- Conclusion
- Contact Information
- Additional Notes for Reviewers
This project showcases an end-to-end Machine Learning Model Lifecycle Management System, focusing on the following key aspects:
- Model Training and Versioning: Automate model training and store models with unique versions, hyperparameters, and accuracy metrics.
- Scheduled Training: Use Apache Airflow to schedule daily training with retries and service-level agreements (SLAs).
- Monitoring: Visualize model accuracy over time and compute the average accuracy over the last week using Prometheus and Grafana.
- Alerting: Set up alerts for scenarios such as the latest model being older than 36 hours or the model accuracy dropping below a certain threshold.
- Model Serving: Provide an API endpoint to serve predictions using the latest available model, which automatically updates when a new version is available.
The system integrates several components to achieve complete lifecycle management:
- AWS S3: Stores MLflow artifacts, including model versions and metrics.
- MLflow: Manages experiment tracking, model versioning, and artifact storage.
- Apache Airflow: Orchestrates scheduled training tasks with retries and SLAs.
- Prometheus: Collects and stores metrics for monitoring model performance.
- Grafana: Visualizes metrics and sets up alerting rules.
- Flask API: Serves the latest model for predictions via a RESTful interface.
- Docker and Docker Compose: Containerizes and orchestrates all services for consistent deployment.
- Model Training:
  - Airflow triggers the model training DAG daily.
  - The training script logs metrics and parameters to MLflow.
  - Trained models and artifacts are stored in the S3 bucket via MLflow.
- Model Monitoring:
  - Prometheus scrapes metrics from the MLflow exporter.
  - Grafana visualizes these metrics and calculates averages over time.
  - Alerts are configured in Grafana to monitor specific conditions.
- Model Serving:
  - The Flask API loads the latest model from MLflow.
  - Clients can request predictions via the API endpoint.
  - The API automatically updates the model if a new version is available.
```
.
├── README.md
├── docker-compose.yml
├── setup_aws_env.sh
├── requirements.txt
├── postgres-init.sql
├── mlflow/
│   └── Dockerfile
├── orchestration/
│   ├── dags/
│   │   ├── __init__.py
│   │   └── daily_model_training.py
│   └── python_scripts/
│       ├── __init__.py
│       ├── data_generation.py
│       ├── data_preprocessing.py
│       ├── model_training.py
│       └── model_evaluation.py
├── monitoring/
│   ├── prometheus/
│   │   └── prometheus.yml
│   ├── grafana/
│   │   ├── provisioning/
│   │   │   ├── datasources/
│   │   │   │   └── datasources.yaml
│   │   │   ├── dashboards/
│   │   │   │   └── dashboards.yaml
│   │   │   └── alerting/
│   │   │       └── alerting.yaml
│   │   └── dashboards/
│   │       └── ml_model_metrics.json
│   └── exporter/
│       ├── Dockerfile
│       └── exporter.py
├── serving/
│   ├── app.py
│   ├── requirements.txt
│   └── Dockerfile
└── data/
    └── (Data files generated during runtime)
```
Notes for Reviewers: Due to time constraints and for the sake of simplicity, I did not always follow best practices in this project. I explain my decisions in certain areas below, along with what would have been preferable in ideal circumstances.
- Operating System: Linux-based OS is recommended.
- Docker: Ensure Docker is installed and running (I developed on a MacBook M2 running Docker Desktop).
- Docker Compose: Version compatible with Docker.
- AWS Credentials: Access to AWS services (For this project, S3) with necessary permissions.
To run this project, you will need AWS credentials with the necessary permissions to access S3 and other AWS services used in this project.
- Please contact the project author at [email protected] to directly obtain the necessary credentials.
- Important: Handle these credentials securely and do not expose them publicly.
- Clone the Repository:

  ```
  git clone https://github.com/bilalimamoglu/ml-lifecycle-system.git
  cd ml-lifecycle-system
  ```

- Set Up AWS Credentials:

  - Copy the provided AWS credentials (`credentials` and `config` files) into the project's `.aws` directory:

    ```
    mkdir -p .aws
    cp path_to_provided_credentials .aws/credentials
    cp path_to_provided_config .aws/config
    ```

    Replace `path_to_provided_credentials` and `path_to_provided_config` with the actual paths to the files.

  - Alternatively, run the provided script to set up AWS environment variables:

    ```
    aws configure
    ./setup_aws_env.sh
    ```

    This script copies your AWS credentials from `~/.aws/credentials` and `~/.aws/config` to the project's `.aws` directory and sets the environment variables `AWS_SHARED_CREDENTIALS_FILE` and `AWS_CONFIG_FILE`.
- Initialize PostgreSQL Databases:
  - The `postgres-init.sql` script will automatically create the necessary databases and users when the PostgreSQL container starts.
- Build and Start Services:
  - Use Docker Compose to build and run all services:

    ```
    docker-compose up -d --build
    ```
- Access the Web Interfaces:
  - Airflow: http://localhost:8080 (Username: `airflow`, Password: `airflow`)
  - MLflow: http://localhost:5000
  - Grafana: http://localhost:3000 (Username: `admin`, Password: `admin`)
  - MailHog: http://localhost:8025
  - Model API: http://localhost:5001
- Trigger the Airflow DAG:
  - In the Airflow UI, manually trigger the `daily_model_training` DAG to start the initial training process.
- Verify Model Training:
  - Check the MLflow UI to verify that a new run has been logged.
  - Ensure that model artifacts are stored in the S3 bucket.
- Test the Model API:
  - Send a POST request to the `/predict` endpoint:

    ```
    curl -X POST -H "Content-Type: application/json" \
      -d '{"sepal length (cm)": 5.1, "sepal width (cm)": 3.5, "petal length (cm)": 1.4, "petal width (cm)": 0.2}' \
      http://localhost:5001/predict
    ```

  - Expected Response:

    ```
    { "prediction": ["setosa"] }
    ```
- Monitor Metrics and Alerts:
  - In Grafana, view the ML Model Metrics Dashboard.
  - Check for any alerts in the Alerting section.
  - Use MailHog to view any email notifications triggered by alerts.
- Bucket Name: `mlflow-artifacts-bilalimg`
- Purpose: Stores MLflow artifacts such as models, metrics, and parameters.
- Configuration:
  - The bucket is configured to allow access only from authorized IAM users and services.
  - Access policies are set to ensure that only specified IAM roles and users can access the bucket.
- IAM Users:
  - `cli-user`: Used for AWS CLI interactions, such as uploading artifacts to S3.
  - `reviewer-user`: Provides access for reviewers to inspect the project.
- IAM Group:
  - Name: `ml-lifecycle`
  - Purpose: Manages permissions for users involved in the ML lifecycle project.
  - Policy Attached: `ml-lifecycle-policy`
- IAM Policy (`ml-lifecycle-policy`):
  - Permissions:
    - `S3`: Access to the specified S3 bucket for reading and writing artifacts.
- AWS S3: Provides scalable, secure, and durable storage for artifacts. It integrates seamlessly with MLflow for artifact storage.
- IAM Users and Roles: Using IAM users and groups with appropriate policies ensures secure and organized access management, adhering to the principle of least privilege.
- MLflow: Provides a robust framework for tracking experiments, versioning models, and storing artifacts.
- Integration with S3: Allows for scalable and durable storage of models and artifacts.
- PostgreSQL Backend: Serves as a reliable backend for MLflow's tracking data.
- Simplicity: To keep the setup less complicated overall, I used only MLflow Experiment Tracking (rather than the full Model Registry) in this instance.
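
To make the "experiment tracking only" choice concrete, here is a minimal sketch of what a training script logging to MLflow could look like in this setup. The tracking URI, experiment name, and hyperparameters are illustrative assumptions, not the exact contents of `model_training.py`:

```python
# Illustrative sketch only -- tracking URI, experiment name, and hyperparameters are assumed.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5000")  # assumed MLflow server address
mlflow.set_experiment("daily_model_training")     # assumed experiment name

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_params(params)                      # hyperparameters for this model version
    mlflow.log_metric("model_accuracy", accuracy)  # metric later scraped by the exporter
    mlflow.sklearn.log_model(model, "model")       # artifact stored in S3 via MLflow
```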
- DAG File: `orchestration/dags/daily_model_training.py`
- Tasks:
  - generate_data:
    - Generates the Iris dataset and saves it to `data/iris.csv`.
    - Why: Guarantees that each training cycle (in the real world) uses new data.
  - preprocess_data:
    - Splits data into training and testing sets.
  - train_model:
    - Trains a Random Forest model with random hyperparameters.
    - Logs parameters and metrics to MLflow.
  - evaluate_model:
    - Evaluates the trained model on the test set.
    - Logs accuracy to MLflow.
- Retries and SLAs:
  - Configured in `default_args` of the DAG.
  - Retries set to `1` with a delay of `5` minutes (see the sketch below).
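
For reference, this is roughly how such a DAG could be declared. The imported task callables, schedule, start date, and SLA value are assumptions rather than the exact code in `daily_model_training.py`:

```python
# Sketch of a daily training DAG with retries and an SLA.
# The imported task callables and the SLA value are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from python_scripts import data_generation, data_preprocessing, model_training, model_evaluation

default_args = {
    "owner": "airflow",
    "retries": 1,                         # retry each failed task once
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
    "sla": timedelta(hours=1),            # assumed SLA per task
}

with DAG(
    dag_id="daily_model_training",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    generate = PythonOperator(task_id="generate_data", python_callable=data_generation.run)
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=data_preprocessing.run)
    train = PythonOperator(task_id="train_model", python_callable=model_training.run)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=model_evaluation.run)

    generate >> preprocess >> train >> evaluate
```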
- Apache Airflow: Provides a powerful platform for orchestrating complex workflows with scheduling, retries, and SLA management.
- PythonOperator: Allows for flexibility in defining tasks using Python functions.
- The orchestration folder contains the DAGs directly. In a real-world setting this is a bad habit, because every change to the DAGs requires redeploying (and briefly pausing) the entire Airflow system.
- Additionally, to keep things separate, modular, and more professional, I would prefer the DockerOperator or KubernetesPodOperator over the PythonOperator if I had more time.
- Configuration File: `monitoring/prometheus/prometheus.yml`
- Exporter:
  - Custom exporter located in `monitoring/exporter/`.
  - Exposes MLflow metrics such as `model_accuracy` and `training_accuracy_score`.
  - Dockerfile: `monitoring/exporter/Dockerfile`
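
A rough sketch of how such an exporter could be written with `prometheus_client` and the MLflow client follows. The metric names match this README, while the service address, port, experiment name, and polling interval are assumptions, not the repository's actual `exporter.py`:

```python
# Sketch of a custom MLflow-to-Prometheus exporter (not the repository's exact exporter.py).
# Service address, port, experiment name, and polling interval are assumptions.
import time

from mlflow.tracking import MlflowClient
from prometheus_client import Gauge, start_http_server

MODEL_ACCURACY = Gauge("model_accuracy", "Accuracy of the latest trained model")
MODEL_LAST_UPDATED = Gauge("model_last_updated", "Unix timestamp of the latest model run")

client = MlflowClient(tracking_uri="http://mlflow:5000")  # assumed service name inside Docker Compose

def collect_metrics():
    experiment = client.get_experiment_by_name("daily_model_training")  # assumed experiment name
    if experiment is None:
        return
    runs = client.search_runs(
        [experiment.experiment_id],
        order_by=["attributes.start_time DESC"],
        max_results=1,
    )
    if runs:
        latest = runs[0]
        accuracy = latest.data.metrics.get("model_accuracy")
        if accuracy is not None:
            MODEL_ACCURACY.set(accuracy)
        MODEL_LAST_UPDATED.set(latest.info.start_time / 1000)  # MLflow start_time is in milliseconds

if __name__ == "__main__":
    start_http_server(8000)  # port Prometheus is assumed to scrape
    while True:
        collect_metrics()
        time.sleep(30)
```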
- Provisioning:
  - Dashboards, datasources, and alerting rules are provisioned using configuration files in `monitoring/grafana/provisioning/`.
- Datasources:
  - Configured in `datasources.yaml` to point to the Prometheus instance.
- Dashboard:
  - JSON file `ml_model_metrics.json` defines panels for:
    - Model Accuracy Over Time: Displays the `model_accuracy` metric.
    - Training Accuracy Score: Shows `training_accuracy_score`.
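
The "average accuracy over the last week" mentioned in the introduction would typically be computed with a PromQL query along the lines of `avg_over_time(model_accuracy[7d])`; the exact query used in `ml_model_metrics.json` may differ, so treat this as an illustrative assumption.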
- Alerts:
  - Configured in `alerting.yaml`.
  - Alert Conditions:
    - Low Model Accuracy:
      - Triggers if `model_accuracy` falls below `0.99`.
      - Evaluated over the last `5` minutes.
    - Model Staleness:
      - Triggers if the model hasn't been updated in the last `36` hours.
      - Uses the `model_last_updated` metric from the exporter.
- Notifications:
  - Alerts are sent via email using MailHog, a local SMTP server for testing.
- Prometheus: Ideal for collecting and storing time-series data, enabling real-time monitoring.
- Grafana: Provides a user-friendly interface for visualizing metrics and setting up complex alerting rules.
- MailHog: Enables testing of email notifications without actually sending real emails. This was my first time using it, and I found it convenient for this purpose.
- Location: `serving/app.py`
- Functionality:
  - Provides a `/predict` endpoint that accepts input features in JSON format.
  - Returns predictions based on the latest model.
- Model Update Daemon:
  - A background thread checks for new model versions every `60` seconds (see the sketch below).
  - Loads the latest model from MLflow if a new version is available.
- Why Not Use Model Registry:
  - Chose simplicity over setting up the MLflow Model Registry.
  - Directly querying the latest run ensures the most recent model is used without additional configuration.
  - In real-world settings, different models would perform differently over varying periods of time; for instance, I would choose the model that performed best during the previous month.
- Flask: Lightweight and easy to set up for building a RESTful API.
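
To make this concrete, here is a minimal sketch of an app with the behaviour described above. The endpoint and the 60-second refresh interval follow this README, but the tracking URI, experiment name, and error handling are assumptions rather than the exact contents of `serving/app.py`:

```python
# Sketch of a serving app with a background model-refresh thread.
# Tracking URI, experiment name, and feature ordering are assumptions.
import threading
import time

import mlflow
import mlflow.sklearn
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
mlflow.set_tracking_uri("http://mlflow:5000")  # assumed MLflow service address

state = {"model": None, "run_id": None}

def refresh_model():
    """Poll MLflow for the latest run and reload the model when a new one appears."""
    while True:
        runs = mlflow.search_runs(
            experiment_names=["daily_model_training"],     # assumed experiment name
            order_by=["attributes.start_time DESC"],
            max_results=1,
        )
        if not runs.empty:
            run_id = runs.iloc[0]["run_id"]
            if run_id != state["run_id"]:
                state["model"] = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
                state["run_id"] = run_id
        time.sleep(60)  # check for a new version every 60 seconds

@app.route("/predict", methods=["POST"])
def predict():
    if state["model"] is None:
        return jsonify({"error": "no model loaded yet"}), 503
    features = pd.DataFrame([request.get_json()])  # one row of named features
    prediction = state["model"].predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    threading.Thread(target=refresh_model, daemon=True).start()
    app.run(host="0.0.0.0", port=5001)
```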
To ensure that the system works as expected, please perform the following test cases:
- Model Training Verification:
  - Objective: Verify that the model is trained successfully and all parameters and metrics are logged to MLflow.
  - Steps:
    - Trigger the Airflow DAG.
    - Check the MLflow UI for a new run.
    - Verify that the model artifacts are stored in the specified S3 bucket.
- Scheduled Training:
  - Objective: Confirm that the Airflow DAG runs daily as scheduled and handles retries.
  - Steps:
    - Check the Airflow scheduler to ensure the DAG is scheduled correctly.
    - Simulate a failure (e.g., by temporarily disabling network connectivity) and observe whether the DAG retries.
- Model Serving:
  - Objective: Test the prediction endpoint with valid and invalid input data.
  - Steps:
    - Send valid data to the `/predict` endpoint and verify the response.
    - Send invalid data (e.g., missing fields) and confirm that appropriate error messages are returned (a small automated sketch follows this list).
- Monitoring Metrics:
  - Objective: Verify that Prometheus is scraping metrics and Grafana is displaying them correctly.
  - Steps:
    - Access the Prometheus UI and check the metrics.
    - View the Grafana dashboard and confirm that metrics are displayed.
- Alerting Mechanisms:
  - Objective: Ensure that alerts are triggered under specified conditions.
  - Steps:
    - Modify `model_evaluation.py` to produce a low-accuracy model and trigger an alert.
    - Stop the Airflow scheduler and wait for 36 hours (or adjust the alert threshold for testing) to trigger a staleness alert.
    - Check MailHog to verify that alert emails are received.
- Data Integrity:
  - Objective: Ensure data preprocessing is functioning correctly.
  - Steps:
    - Verify that the data is correctly split into training and testing sets.
    - Check that the data files exist and contain expected values.
Note: Testing edge cases ensures that the system is robust and can handle unexpected scenarios gracefully.
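
As an example, the model-serving test case above could be automated with a couple of small checks like the following. The input fields follow the curl example earlier, while the exact status code returned for invalid input is an assumption:

```python
# Sketch of automated checks against the running /predict endpoint.
# The error-handling behaviour asserted for invalid input is an assumption.
import requests

BASE_URL = "http://localhost:5001"

VALID_PAYLOAD = {
    "sepal length (cm)": 5.1,
    "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4,
    "petal width (cm)": 0.2,
}

def test_valid_input():
    response = requests.post(f"{BASE_URL}/predict", json=VALID_PAYLOAD, timeout=10)
    assert response.status_code == 200
    assert "prediction" in response.json()

def test_missing_fields():
    response = requests.post(f"{BASE_URL}/predict", json={"sepal length (cm)": 5.1}, timeout=10)
    assert response.status_code != 200  # the API is expected to reject incomplete input
```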
For any questions or further assistance, please contact:
- Name: Bilal Imamoglu
- Email: [email protected]
- GitHub: github.com/bilalimamoglu
- AWS Resources:
  - All necessary AWS resources, including IAM users, policies, and the S3 bucket, have already been set up.
  - You do not need to create or modify any AWS configurations; just use the provided reviewer credentials with the AWS CLI.
- Credentials:
  - Please handle the provided AWS credentials securely and ensure they are not exposed in any public forums or repositories.
  - Contact the author to obtain the necessary credentials.
- Testing Environment:
  - The project is designed to run in a local development environment using Docker.
  - No deployment to AWS services beyond S3 is required.
- Data Privacy:
  - The project uses the Iris dataset, which is publicly available and does not contain sensitive information.
- Feedback:
  - Your feedback is valuable. Please feel free to provide any comments or suggestions regarding the implementation.
Thank you for taking the time to review this project!