A bank has multiple banking products that it sells to customers, such as savings accounts, credit cards, and investments. It wants to know which customers will purchase its credit cards. For this purpose it has various kinds of information, such as the demographic details of the customers and their banking behaviour. Once it can predict the chances that a customer will purchase a product, it wants to use this to focus its marketing on those customers.
In this part I will demonstrate how to build a model to predict which clients will subscribe to a term deposit, with the help of machine learning. In the first part we will deal with the description and visualization of the analysed data, and in the second we will move on to the data classification models.
- Desired target
- Data Understanding
- Data Preprocessing
- Machine Learning Model
- Prediction
- Comparing Results
Predict if a client will subscribe (yes/no) to a term deposit — this is defined as a classification problem.
The dataset (Assignment-2_data.csv) used in this assignment contains bank customers’ data. File name: Assignment-2_Data. File format: .csv. Number of rows: 45212. Number of attributes: 17 non-empty conditional attributes and one decision attribute.
Data pre-processing is a key step in machine learning, as the quality of the information derived from the dataset directly affects the model quality. It is therefore extremely important to do at least the necessary preprocessing of our data before feeding it into our model.
In this assignment, we are going to use Python to develop a predictive machine learning model. First, we will import some necessary libraries.
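The exact import list is not shown in the text; a typical minimal set for this kind of analysis might look as follows:

```python
# Core libraries for data handling and plotting used throughout this assignment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```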
Below we can see that there are various numerical and categorical columns. The most important column here is y, the output variable (desired target): it tells us whether the client subscribed to a term deposit (binary: ‘yes’/’no’).
We must check whether our dataset has any missing values and whether it contains any duplicated rows.
We can see that 'age' has 9 missing values and 'balance' has 3 as well. Given that our dataset has around 45k rows, I will simply remove those rows from the dataset. Pic 1 and Pic 2 show the data before and after.
From the above analysis we can see that only 5289 people out of 45200 have subscribed, which is roughly 12%. Our dataset is therefore highly imbalanced; we need to keep this in mind.
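A quick way to measure the class balance is `value_counts` on the target. The series below is a hypothetical stand-in mimicking the roughly 12% positive rate reported above:

```python
import pandas as pd

# Hypothetical target column with ~12% subscribers
y = pd.Series(["yes"] * 12 + ["no"] * 88)

counts = y.value_counts()
share = counts["yes"] / len(y)
print(counts)
print(f"subscribed share: {share:.0%}")
```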
Our list of categorical variables.
Our list of numerical variables.
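The two lists above can be produced with `select_dtypes`; the columns below are a hypothetical subset of the real dataset:

```python
import pandas as pd

# Hypothetical slice of the bank dataset
df = pd.DataFrame({
    "age": [30, 45],
    "balance": [100, 2500],
    "job": ["admin.", "technician"],
    "y": ["no", "yes"],
})

categorical = df.select_dtypes(include="object").columns.tolist()
numerical = df.select_dtypes(include="number").columns.tolist()
print(categorical, numerical)
```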
In the above boxplot we can see some points at a very young age, as well as impossible ages, so we will remove these outliers.
After that, we no longer have issues with this feature, so we can use it.
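The age cleaning could be done with a simple range filter; the exact cut-offs are not stated in the text, so the bounds below are an assumption:

```python
import pandas as pd

# Hypothetical ages including one implausibly young and one impossible value
df = pd.DataFrame({"age": [3, 25, 40, 150, 60]})

# Keep only plausible ages (bounds are an assumption, not from the text)
df = df[(df["age"] >= 18) & (df["age"] <= 95)]
print(df["age"].tolist())
```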
This attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. In this case I will not remove it, since we have very few zero values. However, for a realistic model we would need to account for it when choosing our dependent and independent variables.
I don’t see any outliers in this feature, so we can use it without any preprocessing.
Number of days that passed after the client was last contacted in a previous campaign (-1 means the client was not previously contacted). We have to treat this feature with encoding, because -1 appears in 36940 values, meaning those clients were not previously contacted.
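One simple way to handle the -1 sentinel is to derive a contacted/not-contacted flag; the column name `pdays_contacted` below is a hypothetical choice, not from the text:

```python
import pandas as pd

# Hypothetical pdays values; -1 means "never previously contacted"
df = pd.DataFrame({"pdays": [-1, 10, -1, 200]})

# Encode whether the client was contacted before (1) or not (0)
df["pdays_contacted"] = (df["pdays"] != -1).astype(int)
print(df)
```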
Number of contacts performed before this campaign for a particular client. Here we can see some outliers; we will clean them all.
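The text does not name the outlier rule used; a common sketch is the 1.5×IQR fence, shown here on hypothetical values:

```python
import pandas as pd

# Hypothetical 'previous' values with one extreme outlier
df = pd.DataFrame({"previous": [0, 0, 1, 2, 3, 40]})

# Drop values above the upper 1.5*IQR fence (an assumed rule)
q1, q3 = df["previous"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
df = df[df["previous"] <= upper]
print(df["previous"].tolist())
```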
It looks perfect now.
All is clear here. We can proceed without any changes.
All is clear here. We can proceed without any changes.
Correlation shows the relationship between variables in the dataset.
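The correlation matrix over the numerical columns can be computed with `DataFrame.corr`; the columns below are a hypothetical subset of the dataset:

```python
import pandas as pd

# Hypothetical numerical slice of the bank data
df = pd.DataFrame({
    "age": [30, 45, 50, 23],
    "balance": [100, 2500, 1800, 300],
    "duration": [120, 300, 90, 60],
})

corr = df.corr()
print(corr.round(2))
# A heatmap of `corr` is a common way to visualize this matrix
```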
When building a machine learning model, it is important to preprocess the data to obtain an efficient model. We will need to change our 'pdays' to categorical data. ML models require all input and output values to be numerical, so if our dataset has categorical data, we must encode it into numbers before fitting and evaluating a model. There are several methods available; here I have used one-hot encoding. Another data preprocessing step is to rescale our numerical columns, which helps to normalize our data within a particular range. Sklearn’s preprocessing StandardScaler() was used here.
Output of the dataset after scaling.
Next, we will combine our tables: the frame with the numerical columns which we scaled and normalized, and our categorical frame without the original numerical data.
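The encode–scale–combine steps described above can be sketched as follows, again on a hypothetical slice of the data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical mixed-type slice of the bank data
df = pd.DataFrame({
    "age": [30, 45, 50],
    "balance": [100, 2500, 1800],
    "job": ["admin.", "technician", "admin."],
})

num_cols = ["age", "balance"]

# Scale the numerical columns to zero mean / unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                      columns=num_cols, index=df.index)

# One-hot encode the categorical columns
encoded = pd.get_dummies(df.drop(columns=num_cols))

# Recombine into a single numeric frame
combined = pd.concat([scaled, encoded], axis=1)
print(combined.head())
```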
To proceed with building our prediction model, we have to specify our dependent and independent variables. Here we could drop 'duration' for a more realistic model.
Using the code below, I have divided the dataset into 30% for testing and 70% for training, using train_test_split from sklearn.model_selection. It is reasonable to always split the dataset into a train and a test set when building a machine learning model, because it helps us evaluate the performance of the model.
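The 70/30 split can be sketched as follows; `X` and `y` here are small hypothetical stand-ins for the preprocessed features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and target vector
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

# 70% train / 30% test, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
```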
As you remember, our data is a bit imbalanced. This can affect our prediction, so I will do oversampling here.
It is applied on the training set only. Now we are finally ready to do modeling and prediction. It is always very important to preprocess the data properly before jumping to the next step, and now we can see the result of our work.
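The text does not name the oversampling method used; one minimal sketch is random oversampling of the minority class with `sklearn.utils.resample`, shown here on hypothetical training data:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced training set (8 negatives, 2 positives)
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)

# Randomly resample the minority class up to the majority count
minority = X_train[y_train == 1]
n_extra = (y_train == 0).sum() - (y_train == 1).sum()
extra = resample(minority, replace=True, n_samples=n_extra, random_state=42)

X_bal = np.vstack([X_train, extra])
y_bal = np.concatenate([y_train, np.ones(len(extra), dtype=int)])
print(np.bincount(y_bal))
```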
I will first compare the model performance of the following 3 machine learning models using default hyperparameters:
• Logistic Regression
• Decision Tree
• K Nearest Neighbors (KNN)
First, we will load the libraries which we will use for ML, plots, and reports.
Logistic regression is a traditional machine learning model that fits a linear decision boundary between the positive and negative samples. Logistic regression uses the sigmoid function to predict whether the dependent variable is true or false based on the independent variables. One advantage of logistic regression is that the model is interpretable: we know which features are important for predicting positive or negative. Note that the model is sensitive to the scaling of the features, which is why we scaled the features above. We can fit logistic regression using the following code from scikit-learn.
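A minimal sketch of the fit-and-report step, using synthetic data from `make_classification` (with roughly the 12% positive rate of the real dataset) in place of the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed bank data (~12% positives)
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.88], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f"accuracy: {acc:.2f}")
print(classification_report(y_test, pred))
```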
As you can see, our accuracy is 0.90. From the above code output we can see the overall prediction accuracy of the model. But we can’t evaluate the model by looking at the overall prediction accuracy only, so we also have to study the classification report.
This machine learning model belongs to the tree-based methods. The simplest tree-based method is known as a decision tree. The goal of using a decision tree is to create a training model that can predict the class or value of the target variable by learning simple decision rules derived from the training data. In decision trees, to predict a class label for a record, we start from the root of the tree. One advantage of tree-based methods is that they make no assumptions about the structure of the data and are able to pick up non-linear effects given sufficient tree depth. We can fit decision trees using the following code.
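The same pattern with a decision tree, again on synthetic stand-in data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed bank data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Default hyperparameters, as described in the text
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = tree.score(X_test, y_test)
print(f"accuracy: {acc:.2f}")
```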
Our accuracy with this model is 0.88. Let’s not evaluate the model by looking at the overall prediction accuracy only; we should compare it to the classification report as well.
KNN is one of the simplest machine learning models. KNN looks at the k closest data points and predicts the label by a majority vote among them. This model is very easy to understand and versatile, and you don’t need any assumptions about the data structure. KNN is also good for multivariate analysis. A caveat with this algorithm is its sensitivity to k, and it takes a long time to evaluate if the number of training samples is large. We can fit KNN using the following code from scikit-learn.
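The KNN fit with default hyperparameters (k=5 in scikit-learn), again on a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed bank data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# scikit-learn's default is n_neighbors=5
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print(f"accuracy: {acc:.2f}")
```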
Our accuracy with this model is 0.85, a bit lower than the previous one. Again, let’s not evaluate the model by looking at the overall prediction accuracy only; we should compare it to the classification report as well.
AUC (Area under the ROC Curve): it provides an aggregate measure of performance across all possible classification thresholds.
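AUC is computed from the true labels and the predicted probabilities; a minimal example with hypothetical values:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(auc)  # → 0.75: 3 of the 4 (negative, positive) pairs are ranked correctly
```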
We were able to analyse the bank marketing dataset. I built different models which help us to analyse the dataset properly, and I classified the dataset according to the data preparation description. I showed various plots for easy reading and understanding. The results of my classification are presented in the following table. The results of the models are mostly similar, but in my opinion the best one is the logistic regression model: it can predict the chances that a customer will purchase a product across all classification thresholds with an AUC score of 0.91.