Insurance is a risk management tool that allows the insured party to hedge against the risk of an uncertain loss. It is vital for companies to identify the key factors that influence the premium payment decision in order to maximise their profits and retain their customers and business.
The objective of this project is to predict the probability that a customer will default on a premium payment, so that the insurance agent can proactively reach out to the policy holder to follow up on the payment. It will also help identify the customer demographics that are more likely to default, so that premiums can be priced accordingly.
The dataset consists of 17 variables and 79,853 customer observations. The data has a mix of indicator and continuous variables, which mainly cover the customer’s demographic information, premium payment behavior and risk profile.
• Values range from 0 to 100, with the majority of data points falling in the lower range of 0% to 5%
• Mean = 31.43%
• Data has outliers

• Data has a wide range of 24,030 to 90,262,600 (right skew)
• Mean = 208847
• Data has too many outliers
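The skew and outlier observations above can be checked with a few lines of code. The sketch below is only an illustration: it assumes Tukey's 1.5 × IQR rule as the outlier criterion (the write-up does not say which rule was used) and compares mean and median as a crude skew check. The helper names and the toy sample are hypothetical and are not taken from the project dataset.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


def is_right_skewed(values):
    """Crude check: in a right-skewed sample the mean exceeds the median."""
    return statistics.mean(values) > statistics.median(values)


# Hypothetical toy sample, not taken from the project dataset.
sample = [30, 32, 35, 36, 38, 40, 41, 45, 48, 900]

outliers = iqr_outliers(sample)
right_skewed = is_right_skewed(sample)
print(outliers, right_skewed)
```

A single extreme value is enough to pull the mean far above the median, which is why a variable with a very wide range can report a mean that looks unrepresentative of most customers.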
• Marital status vs Income, Late Payment, No. of premiums paid & Risk score
There is no significant difference across these parameters between married and unmarried customers.
In general, the variables are not highly correlated. However, from the above we can infer that: customers who pay a higher % of their premium in cash are likely to make more delayed payments and to have a lower risk score; older customers have paid a larger number of premiums but have paid less of the premium amount in cash; and higher-income customers are likely to pay a higher premium.
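As a rough illustration of how such pairwise relationships can be quantified, the sketch below computes a Pearson correlation matrix by hand. It assumes Pearson correlation is the measure of interest (the write-up does not name the coefficient used), and the column names and values are hypothetical toy data, not the project dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns, not taken from the project dataset.
columns = {
    "income": [50, 80, 120, 200, 300],
    "premium": [3, 5, 9, 12, 20],
    "cash_pct": [60, 10, 40, 5, 30],
}

# Pairwise correlation matrix as a nested dictionary.
corr = {
    a: {b: pearson(columns[a], columns[b]) for b in columns}
    for a in columns
}

for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

In this toy data income and premium move together closely, which mirrors the kind of relationship described above; values near zero would indicate no linear relationship.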
Logistic Regression, Random Forest, K-Nearest Neighbours and Naive Bayes were used.
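A minimal sketch of fitting these four classifiers with scikit-learn is shown below. It is not the project's actual pipeline: the data is a synthetic stand-in generated with `make_classification`, class 1 is assumed to mean "default", and all hyperparameters (number of trees, number of neighbours, split ratio) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the real data (assumption: class 1 = default).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative hyperparameters, not the ones used in the project.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    # Predicted probability of the positive ("default") class per customer.
    default_prob = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy,
        "classification_error": 1 - accuracy,
        "mean_default_prob": float(default_prob.mean()),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The per-customer values in `default_prob` are what an agent would rank to decide whom to contact first; the accuracy and classification error are simply complements of each other.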
Parameter | Logistic Regression | Random Forest | K-Nearest Neighbours | Naive Bayes | With Bagging |
---|---|---|---|---|---|
Classification Error | 0.27 | 0.10 | 0.37 | 0.25 | 0.17 |
Accuracy | 0.73 | 0.90 | 0.63 | 0.74 | 0.83 |
Loss | 0.41 | 0.04 | 0.57 | 0.37 | 0.26 |
Opportunity Loss | 0.11 | 0.15 | 0.14 | 0.13 | 0.07 |
Let’s further compare two of these models, Logistic Regression and Random Forest:
ROC curves and Precision-Recall curves (PRC) were also plotted to compare performance measures.
Interpretation of Model Measures:
• Random Forest has a lower classification error rate (CER)
• Random Forest has a higher accuracy
• Logistic Regression has a lower specificity
• Logistic Regression has a lower sensitivity
• Logistic Regression has a lower AUC
• Random Forest has a higher K-S
• The ROC curve for Random Forest is closer to the top-left corner and the PRC curve is closer to the top-right corner.
Therefore, from the above comparisons it can be seen that Random Forest has better overall performance indicators.
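For reference, the measures named in this comparison (sensitivity, specificity, AUC and K-S) can be computed from true labels and predicted scores as in the sketch below. It assumes the standard definitions: AUC as the fraction of (defaulter, non-defaulter) pairs ranked correctly, and K-S as the largest gap between true positive rate and false positive rate over all thresholds. The function names, labels, scores and the 0.5 cut-off are hypothetical toy choices, not project results.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, scores):
    """K-S: maximum |TPR - FPR| over all candidate thresholds."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        fpr = sum(s >= threshold for s in neg) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best


# Hypothetical labels (1 = default) and predicted scores, not project results.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

sens, spec = sensitivity_specificity(y_true, y_pred)
model_auc = auc(y_true, scores)
ks = ks_statistic(y_true, scores)
print(sens, spec, model_auc, ks)
```

Sensitivity and specificity depend on the chosen cut-off, whereas AUC and K-S summarise how well the scores separate the two classes across all cut-offs, which is why they are useful for comparing models side by side.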