
Insurance-Default-Prediction

Insurance is a form of risk management tool that allows the insured party to hedge the risk of an uncertain loss. It is vital for insurance companies to identify the key factors that influence the premium payment decision, so that they can maximise profits and retain their customers and business.

About the Project

The objective of this project is to predict the probability that a customer will default on the premium payment, so that the insurance agent can proactively reach out to the policyholder and follow up on the payment. It will also help identify the customer demographics that are more likely to default, so that the premium amount can be priced accordingly.

Data overview & Exploratory Data Analysis

The dataset consists of 17 variables and 79,853 customer observations. The data has a mix of indicator and continuous variables, mainly covering customers' demographic information, premium payment behaviour and risk profiling.
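A minimal sketch of the data-overview step, using a small synthetic stand-in for the real dataset (the actual 17 variable names are not given in this document, so the column names below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (79,853 rows x 17 columns);
# these column names are illustrative, not the actual variable names.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age_in_days": rng.integers(7000, 37000, n),
    "income": rng.lognormal(12, 1, n).round(),
    "perc_premium_paid_in_cash": rng.beta(0.5, 1.5, n) * 100,
    "count_late_payments": rng.poisson(1.2, n),
    "risk_score": rng.normal(99, 0.7, n),
    "default": rng.integers(0, 2, n),  # 1 = defaulted on premium payment
})

# Basic overview: shape, types, and summary statistics per variable
print(df.shape)
print(df.dtypes)
print(df.describe().T[["mean", "min", "max"]])
```

`describe()` gives the kind of per-variable summary (mean, range, skew hints) that the univariate analysis below reports.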

Data Analysis: Univariate

Key observations from the univariate plots of a percentage-type variable and a monetary variable:

• Percentage variable: values range from 0 to 100, with the majority of data points falling in the lower range of 0% to 5%; Mean = 31.43%; the data has outliers.
• Monetary variable: values have a wide range, from 24,030 to 90,262,600 (right-skewed); Mean = 208,847; the data has many outliers.

Data Analysis: Bivariate

• Marital status vs Income, Late Payments, Number of premiums paid & Risk score

No significant difference across parameters between Married and Unmarried customers.
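The marital-status comparison above can be sketched as a grouped summary; the column names here are assumptions standing in for the real variables:

```python
import numpy as np
import pandas as pd

# Illustrative bivariate check: compare key numeric variables across
# marital status on synthetic data (column names are assumptions).
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "marital_status": rng.choice(["Married", "Unmarried"], n),
    "income": rng.lognormal(12, 1, n),
    "count_late_payments": rng.poisson(1.2, n),
    "no_of_premiums_paid": rng.integers(2, 60, n),
    "risk_score": rng.normal(99, 0.7, n),
})

# Group means side by side; similar values across the two groups
# support the "no significant difference" conclusion.
summary = df.groupby("marital_status")[
    ["income", "count_late_payments", "no_of_premiums_paid", "risk_score"]
].mean()
print(summary)
```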

Data Analysis: Multivariate

In general, there is no high correlation among the variables. However, from the correlation matrix we can infer that:
• Customers making a higher percentage of cash payments are likely to make more delayed payments and tend to have a lower risk score.
• Older customers have paid more premiums, but a smaller premium amount in cash.
• Higher-income customers are likely to pay a higher premium.
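A short sketch of the multivariate step: a pairwise correlation matrix, here on synthetic data with the first relationship above (more cash, more delays, lower risk score) deliberately built in:

```python
import numpy as np
import pandas as pd

# Synthetic data encoding the described relationships, so the
# correlation matrix shows the expected signs.
rng = np.random.default_rng(2)
n = 1000
cash_pct = rng.beta(0.5, 1.5, n)
late = rng.poisson(1 + 3 * cash_pct)                # more cash -> more late payments
risk = 100 - 2 * cash_pct + rng.normal(0, 0.5, n)   # more cash -> lower risk score
df = pd.DataFrame({"perc_cash": cash_pct, "late_payments": late, "risk_score": risk})

corr = df.corr()
print(corr.round(2))
```

On the real dataset the same `df.corr()` call (typically visualised as a heatmap) is what supports these inferences.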

Modelling

Logistic Regression, Random Forest, K-Nearest Neighbours and Naive Bayes were used, along with a bagged ensemble.
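A minimal sketch of fitting these models with scikit-learn, on synthetic data standing in for the 17-variable dataset (the bagged ensemble's base estimator is not specified in this document, so the default bagged decision trees are used here as an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced data standing in for the real 17 features.
X, y = make_classification(n_samples=2000, n_features=17,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "With Bagging": BaggingClassifier(random_state=0),  # base estimator assumed
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: accuracy={scores[name]:.2f}, "
          f"classification error={1 - scores[name]:.2f}")
```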

| Parameter | Logistic Regression | Random Forest | K-Nearest Neighbours | Naive Bayes | With Bagging |
|---|---|---|---|---|---|
| Classification Error | 0.27 | 0.10 | 0.37 | 0.25 | 0.17 |
| Accuracy | 0.73 | 0.90 | 0.63 | 0.74 | 0.83 |
| Loss | 0.41 | 0.04 | 0.57 | 0.37 | 0.26 |
| Opportunity Loss | 0.11 | 0.15 | 0.14 | 0.13 | 0.07 |
The top two models per the above comparison are Logistic Regression and Random Forest.
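One plausible reading of how the comparison metrics derive from a confusion matrix, shown on a tiny hand-made example. Note that reading "Loss" as the false-negative rate (defaulters the model misses) and "Opportunity Loss" as the false-positive rate (good payers flagged unnecessarily) is an assumption, as the document does not define these terms:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny illustrative labels: 1 = defaulted, 0 = paid.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
classification_error = 1 - accuracy
loss = fn / (fn + tp)              # assumed: missed defaulters (FN rate)
opportunity_loss = fp / (fp + tn)  # assumed: unnecessary follow-ups (FP rate)

print(accuracy, classification_error, round(loss, 2), round(opportunity_loss, 2))
```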

Let’s further compare these 2 models:

ROC curves and Precision Recall curves were also plotted to compare performance measures.
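A sketch of how the two curves can be produced for the two finalist models with scikit-learn and matplotlib, again on synthetic stand-in data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

# Synthetic imbalanced data standing in for the insurance dataset.
X, y = make_classification(n_samples=2000, n_features=17,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
aucs = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, proba)
    prec, rec, _ = precision_recall_curve(y_te, proba)
    aucs[name] = roc_auc_score(y_te, proba)
    ax_roc.plot(fpr, tpr, label=f"{name} (AUC={aucs[name]:.2f})")
    ax_pr.plot(rec, prec, label=name)

ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC")
ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall")
ax_roc.legend(); ax_pr.legend()
fig.savefig("model_comparison.png")
```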

Interpretation of Model Measures:
• Random Forest has a lower classification error (CER)
• Random Forest has higher accuracy
• Logistic Regression has lower specificity
• Logistic Regression has lower sensitivity
• Logistic Regression has a lower AUC
• Random Forest has a higher K-S statistic
• The ROC curve for Random Forest is closer to the top-left corner, and its Precision-Recall curve is closer to the top-right corner
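The K-S statistic mentioned above is the maximum gap between the cumulative score distributions of defaulters and non-defaulters (a higher value means better separation). A minimal sketch of computing it, on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=2000, n_features=17,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
proba = (RandomForestClassifier(random_state=0)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# K-S: largest distance between the empirical CDFs of the predicted
# scores for the positive (default) and negative (paid) classes.
thresholds = np.sort(np.unique(proba))
cdf_pos = np.array([(proba[y_te == 1] <= t).mean() for t in thresholds])
cdf_neg = np.array([(proba[y_te == 0] <= t).mean() for t in thresholds])
ks_stat = np.max(np.abs(cdf_pos - cdf_neg))
print(f"K-S statistic: {ks_stat:.2f}")
```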

Therefore, from the above comparisons, Random Forest has the better overall performance indicators.
