Getting started with Azure MLOps (based on the enterprise-grade Azure MLOps (v2) Solution Accelerator)

Challenges when implementing V2 with Terraform, GitHub Actions, and Azure CLI v2

Azure Free Account and Azure for Students subscriptions do not work: they lack the subscription-level access needed for authorization, and they are not eligible for VM quota requests. You therefore need your own subscription, and you must be eligible to request VM quotas in the region where you are working.
V2 combines multiple repositories to run workflows, making it challenging to navigate the codebase.
V2 is an excellent reference for those with some experience who understand the concepts and can modify the setup as needed. Microsoft provides comprehensive documentation for its implementation.
And yet, there are still several challenges you may encounter during implementation. For instance, the training pipeline and online deployment pipeline may fail due to issues with building the container image on top of the provided base image, often caused by dependency conflicts.

This repository guides you through orchestrating an Azure ML workspace, performing classical ML training, and deploying an online endpoint, using a similar template and the same data as in V2.

Using this repo

Pre-requisites

Basic knowledge of Azure cloud, Azure ML, and the Azure CLI
What Terraform is
Basic understanding of GitHub Actions
Docker basics, to understand how containers are used inside Azure ML and for debugging
A pay-as-you-go subscription, Git installed, and enough Azure ML instance quota to deploy an endpoint
Tools: Azure CLI, Docker, and Terraform installed. These installations are optional, but they can be very useful for debugging

Limitations

Does not implement AzureML-Observability
There is a single main branch, and no separate production or development environments have been created. GitHub workflows need to be triggered manually.
No dynamic variable names are used, so rerunning the pipelines throws an error, e.g. during resource creation or online-endpoint creation.
- If you want to rerun all workflows, delete the resource groups in the Azure portal. Also make sure to permanently delete (purge) the soft-deleted resources, especially the ML workspace and the key vault.
- If you just want to rerun the online deployment, delete the existing endpoint and modify the endpoint name in the source code.
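
One way around the fixed-name limitation is to generate a fresh endpoint name for each run. The following is a minimal sketch and is not part of this repo: the function name, the base name "my-endpoint", and the suffix length are all hypothetical, and the naming rule (3 to 32 characters, starting with a letter, then letters, digits, or hyphens) is an assumption that should be checked against the current Azure ML documentation.

```python
import re
import uuid

# Assumed Azure ML online-endpoint naming rule: 3-32 characters, starts with a
# letter, followed by letters, digits, or hyphens. Verify against Azure docs.
ENDPOINT_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")


def unique_endpoint_name(base: str, suffix_len: int = 6) -> str:
    """Return `base` plus a random hex suffix, trimmed to at most 32 characters."""
    suffix = uuid.uuid4().hex[:suffix_len]
    max_base = 32 - suffix_len - 1  # leave room for "-" and the suffix
    name = f"{base[:max_base].rstrip('-')}-{suffix}"
    if not ENDPOINT_NAME_RE.match(name):
        raise ValueError(f"invalid endpoint name: {name!r}")
    return name


if __name__ == "__main__":
    # "my-endpoint" is a placeholder, not the name used in this repo.
    print(unique_endpoint_name("my-endpoint"))
```

The generated name would still have to be written into both endpoint YAML files (or passed to the CLI) before the workflow runs, which this sketch does not do.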

Instructions

1. Clone this repo.
2. Modify the variables in infrastructure/terraform.tfvars.
3. Push the code to your GitHub account.
4. Create a service principal using App registration and assign it the Contributor role at the subscription level.
5. Create a client secret and save its value.
6. Store the repository secrets in GitHub. Steps 4-6 are also described in this link.
7. You are now ready to experiment. Go to Actions in GitHub, where you will find 3 workflows.
- Select the Deploy Azure Resources workflow and click Run workflow. It takes some time to provision your ML workspace. Once the resources are deployed, verify them from the Azure portal or CLI.
- Select Deploy Model Training Pipeline and click Run workflow. It takes a while to provision the Docker environment and start the training pipeline. Once the workflow completes, go to your Azure ML workspace in the portal, select Pipelines, and navigate the components in the UI. To check the logs of a component, double-click it. A component changes from gray to blue to green, and green indicates that the run succeeded. Red indicates a problem; check the logs for details. Make sure the pipeline run is successful before continuing.
- Finally, select Deploy and Test Online Endpoint and Deployment and click Run workflow. If the run is successful, you will have a working endpoint. You can test the online endpoint from the portal as described in the same link as in step 6.
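
Besides the portal, a deployed endpoint can usually be exercised with a plain HTTPS request. The sketch below assumes key-based authentication (the key is sent as a Bearer token) and uses only the Python standard library. The scoring URI, the key, and the payload shape are placeholders: the real payload depends on what score.py in this repo expects, which is not shown here.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for an online endpoint that uses key authentication."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


def invoke(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder values: copy the real URI and key from the endpoint's Consume tab.
    req = build_scoring_request(
        "https://example.invalid/score", "<endpoint-key>", {"input_data": [[0, 0, 0]]}
    )
    print(req.get_method(), req.full_url)
```

Building the request is kept separate from sending it so the headers and body can be inspected without touching the network.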

Debugging

Debugging Deploy Azure Resources Workflow

If this workflow fails, read the error message in the GitHub workflow log carefully. Errors generally come from violating the naming conventions described in the Azure documentation, such as globally unique names and character limits. Modify the variables in infrastructure/terraform.tfvars accordingly.
You can also test it locally. Make sure you have Terraform installed and configured, then try running terraform init and terraform plan within the mlops/infrastructure folder. Make sure to delete the files generated by Terraform afterwards, or simply exclude them with .gitignore.
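
Naming-rule violations can be caught before the workflow runs. The sketch below checks candidate names against rules for three resource types that a setup like this commonly creates. It is an assumption that this repo creates all of them; the dictionary keys are hypothetical and do not correspond to variable names in terraform.tfvars. The rules reflect Azure's published naming restrictions as understood here and should be verified against the current documentation. Global uniqueness cannot be checked offline.

```python
import re

# Assumed Azure naming rules (verify against the Azure documentation):
# - storage account: 3-24 characters, lowercase letters and digits only
# - key vault: 3-24 characters, letters, digits, hyphens; starts with a letter,
#   ends with a letter or digit, no consecutive hyphens
# - container registry: 5-50 alphanumeric characters
RULES = {
    "storage_account": re.compile(r"^[a-z0-9]{3,24}$"),
    "key_vault": re.compile(r"^[A-Za-z](?!.*--)[A-Za-z0-9-]{1,22}[A-Za-z0-9]$"),
    "container_registry": re.compile(r"^[A-Za-z0-9]{5,50}$"),
}


def check_names(names: dict) -> list:
    """Return a list of human-readable problems; an empty list means all names pass."""
    problems = []
    for kind, value in names.items():
        rule = RULES.get(kind)
        if rule is None:
            problems.append(f"{kind}: no rule known for this resource type")
        elif not rule.match(value):
            problems.append(f"{kind}: {value!r} violates the assumed naming rule")
    return problems


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for problem in check_names({"storage_account": "My_Storage", "key_vault": "kv-mlops-dev"}):
        print(problem)
```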

Debugging Deploy Model Training Pipeline

Issues may arise when the Docker image for training fails to build. Check that the packages and versions in data-science/environment/train-conda.yml fit your needs. You can also check the logs in the Azure ML workspace under Pipelines --> Outputs + logs.
You can test locally whether the Docker build works with your requirements. Build a Docker image with your requirements on top of the base image mentioned in mlops/azureml/train/train-env.yml, and make sure the build succeeds (this requires some Docker skills). If the image build fails, check the build logs. Make sure you create a Python virtual environment while implementing this step.
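
Unpinned packages are a common source of dependency conflicts during image builds, so it can help to list them before building. The sketch below is a crude line scanner, not a YAML parser, and it only understands the simple layout typical of conda environment files. The sample content is hypothetical and is not the actual content of train-conda.yml.

```python
import re

# Any version operator (=, ==, >=, <=, ~=, !=) counts as "pinned or constrained".
PIN_RE = re.compile(r"[=<>~!]")


def unpinned_dependencies(conda_yaml_text: str) -> list:
    """Return dependency specs under `dependencies:` that carry no version constraint."""
    section = None
    unpinned = []
    for raw in conda_yaml_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing spaces
        if not line.strip():
            continue
        if not line[0].isspace() and not line.startswith("-"):
            section = line.split(":", 1)[0].strip()  # top-level key such as "channels"
            continue
        item = line.strip()
        if section != "dependencies" or not item.startswith("- "):
            continue
        spec = item[2:].strip()
        if spec.endswith(":"):  # nested block header such as "- pip:"
            continue
        if not PIN_RE.search(spec):
            unpinned.append(spec)
    return unpinned


# Hypothetical environment file used only to illustrate the scanner.
SAMPLE = """\
name: train-env
channels:
  - defaults
dependencies:
  - python=3.8
  - pip
  - pip:
      - azureml-mlflow
      - scikit-learn==1.0.2
"""

if __name__ == "__main__":
    print(unpinned_dependencies(SAMPLE))
```

An unpinned entry is not necessarily wrong, but it is a reasonable first place to look when a build that used to work starts failing.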

Debugging Deploy and Test Online Endpoint and Deployment

Again, the issue may be caused by an endpoint that already exists. If so, delete the existing endpoint and modify the endpoint name in mlops/azureml/deploy/online-deployment.yml and in mlops/azureml/deploy/online-endpoint.yml. Push the code changes to GitHub, then run the workflow again.
If the issue comes from the Docker image build or from score.py, Azure ML supports testing and deploying the endpoint locally. For that, you need to download the trained model and check the mlops/debug-online-deployment/test-local-deployment path for local testing, or follow this tutorial.
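
Before starting a full local deployment, a quick check of the scoring logic alone can rule out bugs in score.py. The sketch below assumes the usual Azure ML scoring-script convention of an init() function followed by run(raw_data), and that the script locates the model through the AZUREML_MODEL_DIR environment variable. Whether this repo's score.py follows both conventions is an assumption. The helper name and the stand-in module are hypothetical.

```python
import json
import os
import types


def smoke_test_scoring_script(score_module, model_dir: str, payload: dict):
    """Call init() and then run() roughly as the inference server would, outside Docker."""
    # Scoring scripts conventionally read the model location from this variable.
    os.environ["AZUREML_MODEL_DIR"] = model_dir
    score_module.init()
    return score_module.run(json.dumps(payload))


if __name__ == "__main__":
    # Stand-in for the real module; in practice you would `import score` instead.
    fake_score = types.SimpleNamespace(
        init=lambda: None,
        run=lambda raw: {"echo": json.loads(raw)},
    )
    print(smoke_test_scoring_script(fake_score, "./model", {"input_data": [[1, 2, 3]]}))
```

This only exercises the Python code path. It does not replace the local Docker-based deployment, which also validates the environment image.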

Final suggestion: If you are familiar with all of these technologies and want to build your own setup from scratch, carefully check how the files reference and depend on each other, and read more of the documentation.


Repository: sangamdeuja/mlops
