Insurance-Default-Prediction

Insurance is a risk management tool that allows the insured party to hedge against the risk of an uncertain loss. It is vital for insurance companies to identify the key factors that influence the premium payment decision, so that they can maximise their profits and retain their customers and business.

About the Project

The objective of this project is to predict the probability that a customer will default on a premium payment, so that the insurance agent can proactively reach out to the policyholder and follow up on the payment. It also helps identify the customer demographics that are more likely to default, so that the premium amount can be priced accordingly.
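As a rough illustration of how predicted probabilities could drive agent follow-up, the Python sketch below ranks customers by their predicted default probability and returns the highest-risk ones first. The customer ids, the scores, and the `outreach_list` helper are all hypothetical; the project's own script is written in R and may organise this step differently.

```python
def outreach_list(predicted_default_prob, top_n):
    """Return the top_n customer ids, ordered from highest to lowest
    predicted default probability."""
    ranked = sorted(
        predicted_default_prob.items(),
        key=lambda item: item[1],
        reverse=True,
    )
    return [customer_id for customer_id, _ in ranked[:top_n]]


# Hypothetical model output: customer id -> predicted probability of default.
scores = {"C001": 0.08, "C002": 0.61, "C003": 0.35, "C004": 0.92}

priority = outreach_list(scores, top_n=2)
print(priority)  # ['C004', 'C002']
```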

Data overview & Exploratory Data Analysis

The dataset consists of 17 variables and 79,853 customer observations. It has a mix of indicator and continuous variables, which mainly cover the customer’s demographic information, premium payment behavior, and risk profile.

Data Analysis: Univariate

Variable measured as a percentage:
• Values range from 0 to 100, with the majority of data points falling in the lower range of 0% to 5%
• Mean = 31.43%
• Data has outliers

Variable with a wide numeric range:
• Data has a wide range of 24,030 to 90,262,600 (right skew)
• Mean = 208,847
• Data has too many outliers
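The kind of univariate check summarised above (right skew and outliers) can be sketched in Python as follows. This is a minimal illustration on made-up values, not the project's data or its R code; the 1.5 × IQR rule used here is one common convention for flagging outliers and is an assumption, since the section does not say how outliers were identified.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical, income-like values with one extreme observation.
incomes = [30_000, 45_000, 52_000, 60_000, 75_000, 90_000, 120_000, 5_000_000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
outliers = iqr_outliers(incomes)

# A mean far above the median is a simple sign of right skew.
print(mean > median)  # True
print(outliers)       # [5000000]
```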

Data Analysis: Bivariate

• Marital status vs Income, Late Payment, No. of premiums paid & Risk score

No significant difference across parameters between Married and Unmarried customers.

Data Analysis: Multivariate

In general, there is no high correlation among the variables. However, we can infer that:
• Customers who make a higher % of cash payments are likely to make more delayed payments and to have a lower Risk Score.
• Older customers have paid a greater number of premiums but a smaller premium amount in cash.
• Higher-income customers are likely to pay a higher premium.
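Pairwise relationships like these are typically read off a correlation matrix. The Python sketch below computes a Pearson correlation coefficient by hand for two pairs of variables. All values and variable names are hypothetical and chosen only to mirror the direction of the first finding; they are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical observations for five customers.
pct_cash = [0.05, 0.10, 0.40, 0.70, 0.95]       # share of premium paid in cash
late_payments = [0, 1, 1, 3, 4]                  # count of delayed payments
risk_score = [99.5, 99.2, 99.0, 98.1, 97.6]      # risk score

r_late = pearson(pct_cash, late_payments)  # positive: more cash, more delays
r_risk = pearson(pct_cash, risk_score)     # negative: more cash, lower score
print(round(r_late, 2), round(r_risk, 2))
```

A coefficient near 0 would correspond to the "no high correlation" observation, while values closer to +1 or -1 indicate a stronger linear relationship.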

Modelling

Logistic Regression, Random Forest, K-Nearest Neighbours and Naive Bayes were used.

Parameter	Logistic Regression	Random Forest	K-Nearest Neighbours	Naive Bayes	With Bagging
Classification Error	0.27	0.10	0.37	0.25	0.17
Accuracy	0.73	0.90	0.63	0.74	0.83
Loss	0.41	0.04	0.57	0.37	0.26
Opportunity Loss	0.11	0.15	0.14	0.13	0.07
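The first two rows of the table are complements of each other: classification error is 1 minus accuracy. The Python sketch below shows how both can be derived from confusion-matrix counts. The counts are hypothetical, and the "Loss" and "Opportunity Loss" rows are not reproduced because this section does not define how they were computed.

```python
def accuracy_and_error(tp, fp, tn, fn):
    """Return (accuracy, classification error) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return accuracy, 1 - accuracy


# Hypothetical counts for 200 customers.
acc, err = accuracy_and_error(tp=40, fp=10, tn=140, fn=10)
print(round(acc, 2), round(err, 2))  # 0.9 0.1
```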

The top two models per the above comparison are Logistic Regression and Random Forest.

Let’s further compare these 2 models:

ROC curves and Precision Recall curves were also plotted to compare performance measures.

Interpretation of Model Measures:
• Random Forest has a lower classification error rate (CER)
• Random Forest has a higher accuracy
• Logistic Regression has a lower specificity
• Logistic Regression has a lower sensitivity
• Logistic Regression has a lower AUC
• Random Forest has a higher K-S statistic
• The ROC curve for Random Forest is closer to the top-left corner and the precision-recall (PRC) curve is closer to the top-right corner
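For readers unfamiliar with the measures listed above, the following Python sketch computes sensitivity, specificity, AUC, and the K-S statistic from true labels and predicted scores. It is a plain-Python illustration on hypothetical data, not the project's R implementation. It assumes that label 1 marks the positive class and that a score at or above 0.5 counts as a positive prediction; neither choice is stated in this section.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """True positive rate and true negative rate at a given score threshold."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random positive scores above a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest gap between the cumulative score distributions of the classes."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= t) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= t) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Hypothetical labels (1 = positive class) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

sens, spec = sensitivity_specificity(labels, scores)
model_auc = auc(labels, scores)
model_ks = ks_statistic(labels, scores)
print(sens, spec, model_auc, model_ks)  # 0.75 0.75 0.9375 0.75
```

Higher values of all four measures indicate better separation between the two classes, which is the sense in which the comparison above favours one model over the other.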

Therefore, based on the above comparisons, Random Forest has the better overall performance indicators.

UjjAgarwal/Insurance-Default-Prediction
