This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
This dataset contains information about a bank's customers. We seek to predict whether a client will subscribe to a fixed-term deposit or not.
The best performing model was a VotingEnsemble produced by the AutoML run, with an accuracy of 0.9169.
For this project we used the bankmarketing_train dataset. It contains information about customers such as age, job, marital status, education, housing, and loan. We want to predict whether the customer will subscribe to a fixed-term deposit; the target column is y.
For this part of the project, the chosen classification algorithm was Logistic Regression.
To develop this project I modified the udacity-project notebook and train.py. In train.py I load the data, clean it with the clean_data function, and then split it into train and test sets. The main function receives the hyperparameters used by the Logistic Regression algorithm: C and max_iter.
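A minimal sketch of the relevant parts of train.py, assuming this structure matches the project script (a synthetic placeholder dataset replaces the clean_data output so the sketch runs standalone):

```python
import argparse

from azureml.core.run import Run
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0,
                    help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100,
                    help="Maximum number of iterations for the solver to converge")
args = parser.parse_args()

# In the real script, x and y come from clean_data() applied to bankmarketing_train;
# a random placeholder keeps this sketch self-contained.
x, y = make_classification(n_samples=1000, n_features=20, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
accuracy = model.score(x_test, y_test)

# Log the primary metric so HyperDrive can compare runs.
Run.get_context().log("Accuracy", accuracy)
```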
In the udacity-project notebook I defined the parameter sampling; I ran a few tests to select the values. I also added an early stopping policy and created an SKLearn estimator for the training runs. The next step was to create a HyperDriveConfig with everything defined earlier: the hyperparameter sampler, the policy, and the estimator. The primary metric selected was 'Accuracy', with the goal of maximizing it. The primary metric is evaluated in each training run, and the early termination policy uses it to identify low-performing runs. After this setup I submitted the experiment, inspected the best run, and saved the best model.
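A minimal sketch of this setup, assuming an existing workspace ws and compute target compute_target; the sampler and policy are the ones discussed below, while the experiment name, slack_factor, evaluation_interval, and run counts are illustrative assumptions:

```python
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal, RandomParameterSampling,
                                      choice)

# Hyperparameter search space and early termination policy
ps = RandomParameterSampling({"--C": choice(1, 2, 4),
                              "--max_iter": choice(10, 50, 100)})
policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Estimator that runs train.py on the compute target
est = SKLearn(source_directory=".", entry_script="train.py",
              compute_target=compute_target)

hyperdrive_config = HyperDriveConfig(estimator=est,
                                     hyperparameter_sampling=ps,
                                     policy=policy,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=4)

run = Experiment(ws, "udacity-project").submit(hyperdrive_config)
run.wait_for_completion(show_output=True)

best_run = run.get_best_run_by_primary_metric()
```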
What are the benefits of the parameter sampler you chose? I chose RandomParameterSampling.
The benefits of this parameter sampler are:
- It supports discrete and continuous hyperparameters.
- It supports early termination of low-performing runs.
In this case I used discrete values for C and max_iter; the values are selected at random, and the objective is to find the best combination. Although the project uses only discrete choice expressions, the sampler also accepts continuous expressions, as sketched below.
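A hypothetical sketch showing both kinds of expressions; the discrete choices are the ones used in this project, while the loguniform range for --C is purely illustrative:

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, loguniform

# Discrete choices, as used in this project
discrete_ps = RandomParameterSampling({"--C": choice(1, 2, 4),
                                       "--max_iter": choice(10, 50, 100)})

# A continuous expression is also supported (illustrative range only);
# max_iter stays discrete because it must be an integer.
continuous_ps = RandomParameterSampling({"--C": loguniform(-3, 3),
                                         "--max_iter": choice(10, 50, 100)})
```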
What are the benefits of the early stopping policy you chose? I chose a BanditPolicy. It is a termination policy based on a slack factor (or slack amount) and an evaluation interval. The policy terminates runs whose primary metric is not within the specified slack factor or slack amount of the best performing run.
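A small illustration of how the slack factor terminates runs (the numbers here are hypothetical):

```python
# With slack_factor=0.1, a run is stopped at an evaluation interval if its
# accuracy falls below best_so_far / (1 + slack_factor).
slack_factor = 0.1
best_accuracy_so_far = 0.92
cutoff = best_accuracy_so_far / (1 + slack_factor)
print(f"Runs reporting accuracy below {cutoff:.4f} are terminated early.")  # ~0.8364
```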
AutoML is a good approach because it lets us see how the model performs across several algorithms with different parameters. I started by obtaining the data and splitting it into train and test sets. To use AutoML we need to configure an AutoMLConfig. In this case the task was 'classification', as in the other exercise, and the primary metric was accuracy. I defined some important parameters such as iterations, max_cores_per_iteration and max_concurrent_iterations. After that, I ran the experiment and watched each iteration try a new algorithm.
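A minimal sketch of the AutoML configuration described above, assuming train_data is the cleaned training dataset with target column y; the iteration, concurrency, and cross-validation values are illustrative, not the exact ones used:

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task="classification",
                             primary_metric="accuracy",
                             training_data=train_data,      # cleaned training dataset
                             label_column_name="y",
                             compute_target=compute_target,
                             iterations=30,                  # illustrative value
                             max_cores_per_iteration=-1,     # use all available cores
                             max_concurrent_iterations=4,    # illustrative value
                             n_cross_validations=5)          # illustrative value

automl_run = Experiment(ws, "udacity-project-automl").submit(automl_config, show_output=True)
```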
What are the differences in accuracy and architecture between the two models, and why? Of the two models trained in this project, the one that gave the better result was the AutoML model. With HyperDrive, the accuracy obtained was 0.9109.
With AutoML, the best result came from the VotingEnsemble, with an accuracy of 0.91688.
We can see all the algorithms that were run, along with their accuracy.
In the next image, for the VotingEnsemble algorithm, we can see which features have the greatest impact on the model.
Overall, I think AutoML is better than HyperDrive here because HyperDrive only tunes a single algorithm, while AutoML tests several algorithms and chooses the best one. In this case the accuracies are very similar, but the AutoML result is slightly better.
What are some areas of improvement for future experiments? Why might these improvements help the model?
One improvement would be to preprocess the data better for both Scikit-learn and AutoML and check whether the results improve. For Scikit-learn I could also try the following:
- Other combinations of values for the --C and --max_iter parameters.
- Different classification algorithms.
For AutoML I could try the following:
- Using more data, since with more data the model cannot memorize patterns as easily.
- Removing unimportant features from the dataset.
- Trying other values for cross-validation.
The goal is to find the combination that best improves the accuracy.
## Questions
Provide a description of the hyperparameters that are being tuned for the Logistic Regression model.
Logistic Regression is used when the dependent variable is categorical. In this case, we want to predict whether a customer will subscribe to a fixed-term deposit or not. In our project we tune two hyperparameters of this model: C and max_iter.
With max_iter we define the maximum number of iterations taken for the solver to converge. The default value is 100. I defined this parameter as a choice with 3 options: 10, 50, 100.
The parameter C is the inverse of the regularization strength: smaller values specify stronger regularization. The default value is 1.0. I defined this parameter as a choice with 3 options: 1, 2, 4.
Looking at the best_run details (line 6), we can see that the chosen parameters were:
- ['--C', '1', '--max_iter', '100']
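A sketch of how these arguments and the resulting accuracy can be read back from the best run (assuming the run and best_run objects from the HyperDrive sketch above):

```python
best_run = run.get_best_run_by_primary_metric()
print(best_run.get_details()["runDefinition"]["arguments"])  # e.g. ['--C', '1', '--max_iter', '100']
print(best_run.get_metrics()["Accuracy"])                    # e.g. 0.9109
```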
Add some information about the parameters of the best model produced by AutoML
In my case the best model generated by AutoML was a VotingEnsemble. Ensemble learning improves machine learning results and predictive performance by combining multiple models. For classification tasks, the Voting Ensemble predicts based on the weighted average of the predicted class probabilities. At code line 23 we can see the details of this model.
- The ensembled algorithms were: 'LightGBM', 'XGBoostClassifier', 'XGBoostClassifier', 'XGBoostClassifier', 'LightGBM', 'SGD'
- And the ensemble weights were: [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]
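A sketch of how the best run and the fitted ensemble can be inspected (assuming automl_run is the submitted AutoML run from the sketch above):

```python
best_automl_run, fitted_model = automl_run.get_output()
print(best_automl_run)
# The final pipeline step is the ensemble itself; printing it shows the
# ensembled estimators and their weights.
print(fitted_model.steps[-1])
```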
Does the model perform any form of additional preprocessing on the data?
In AutoML experiments, automatic scaling and normalization techniques are applied by default. We can also enable additional featurization if we want; in my case, I didn't define any additional featurization.
Some of the techniques that are applied automatically are:
- Drop high cardinality or no variance features.
- Impute missing values.
- Generate additional features.
- Transform and encode.
- Word embeddings
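The project keeps the default featurization; a hypothetical sketch of how it could be customized instead through FeaturizationConfig (the column hint and blocked transformer are illustrative only):

```python
from azureml.automl.core.featurization import FeaturizationConfig
from azureml.train.automl import AutoMLConfig

featurization_config = FeaturizationConfig()
featurization_config.add_column_purpose("age", "Numeric")     # hint a column's type
featurization_config.blocked_transformers = ["LabelEncoder"]  # skip a specific transformer

automl_config = AutoMLConfig(task="classification",
                             primary_metric="accuracy",
                             training_data=train_data,        # cleaned training dataset
                             label_column_name="y",
                             featurization=featurization_config)
```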