From 6c5d09a29583b63a89988bd637f95a441e8d0ec9 Mon Sep 17 00:00:00 2001
From: Hazel Jiang
Date: Sat, 28 Nov 2020 19:11:32 -0800
Subject: [PATCH 1/4] first draft report, finish introduction and part of eda

---
 doc/report.Rmd  |  49 ++++
 doc/report.html | 589 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 638 insertions(+)
 create mode 100644 doc/report.Rmd
 create mode 100644 doc/report.html

diff --git a/doc/report.Rmd b/doc/report.Rmd
new file mode 100644
index 0000000..d5d2505
--- /dev/null
+++ b/doc/report.Rmd
@@ -0,0 +1,49 @@
+---
+title: "Credit Card Default Predicting"
+date: 2020-11-28
+author: "Selma Duric, Lara Habashy, Hazel Jiang"
+output:
+  html_document:
+    toc: TRUE
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = FALSE)
+library(knitr)
+library(tidyverse)
+```
+
+## Summary
+
+## Introduction
+
+In recent years, credit cards have become more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there are more and more unqualified applicants who are not able to pay their credit card bills on time. This behaviour is very harmful to both banks and cardholders. (#reference) It is always better to prevent a problem than to solve one. By detecting patterns among people who tend to default on their credit card payments, banks are able to minimize the risk of issuing credit cards to people who may not be able to pay on time.
+
+Here we would like to use a machine learning algorithm to predict whether a person is going to default on his/her credit card payment. We are going to test different models and hyperparameters to find the best prediction score. With the model, banks could predict whether an applicant has the ability to pay on time and make a better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate predictions, banks are able to find reliable applicants and minimize their losses from default payments.
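As an aside on the setup this introduction describes: predicting default is a binary classification problem. A minimal sketch of the target/feature split with a stratified hold-out is below; the column names and values are hypothetical placeholders, not the dataset's actual headers.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: `default_payment` is 1 if the client defaulted,
# 0 otherwise. The real features come from the UCI credit card data.
data = pd.DataFrame({
    "limit_bal": [20000, 120000, 90000, 50000, 200000, 60000, 150000, 30000],
    "age": [24, 26, 34, 37, 57, 29, 41, 23],
    "default_payment": [1, 1, 0, 0, 0, 1, 0, 1],
})

X = data.drop(columns=["default_payment"])
y = data["default_payment"]

# A stratified hold-out keeps the default/non-default ratio intact, so
# final scores reflect unseen clients with a realistic class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=123
)
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```

With `stratify=y`, the two-row test set here contains one defaulter and one on-time client.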
+
+## Methods
+
+### Data
+
+The dataset we are using in the project is originally from the Department of Information Management at Chung Hua University, Taiwan, and the Department of Civil Engineering at Tamkang University, Taiwan. It was sourced from the UCI Machine Learning Repository (#references) and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). [This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005, and each individual has the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan. We will make our predictions based on the features given by the data.
+
+### Analysis
+
+There are 30,000 observations of distinct credit card clients in this data set, with each row representing a client. 25 different features are included with information on each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis that explains how we transformed and used each feature. There exists class imbalance in the data set, and one pattern we found is that people with a higher credit card limit are more likely to default on their payments.
+
+
+```{r limit plot, fig.cap='Figure 1. 
Density of Credit Limit Between Default Clients and On-time Clients', out.width='50%'} +knitr::include_graphics('../results/density_plot.png') +``` + +Both `LogisticRegression` and `RandomForest` model will be used to build this classification model to predict whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both model. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R and Python programing languages and the following R and Python packages were used to perform the analysis: ...ADD packages and ref. + +The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src) + +## Results & Discussion + +```{r results, message=FALSE} +result <- read_csv('../results/prediction_results.csv') +knitr::kable(result, caption = 'Table 1. This is a summary of the scores for LogisticRegression and RandomForest') +``` +## References \ No newline at end of file diff --git a/doc/report.html b/doc/report.html new file mode 100644 index 0000000..598c77e --- /dev/null +++ b/doc/report.html @@ -0,0 +1,589 @@ + + + + + + + + + + + + + + + +Credit Card Default Predicting + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + +
+

Summary

+
+
+

Introduction

+

In recent years, credit card becomes more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there exists more unqualified applicants who are not able to pay their credit card on time. This behaviour is very harmful to both banks and cardholders. (#reference) It is always better to prevent than to solve a problem. By detecting patterns of people who tend to default their credit card payment, banks are able to minimize the risk of issuing credict card to people who may not be able to pay on time.

+

Here we would like to use a machine learning algorithm to predict whether a person is going to defualt on his/her credit card payment. We are going to test on different model and hyperparameters to find the best score on prediction. With the model, banks could predict if the applicant has the ability to pay on time and make better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate prediction, banks are able to find reliable applicants and minimize their loss on default payment.

+
+
+

Methods

+
+

Data

+

The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI Machine Learning Repository (#references) and can be found here. This file is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005 and each individual have the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan.We will make our prediction based on the features given by the data.

+
+
+

Analysis

+

There are 30,000 observations of distinct credit card clients in this data set with each row represents a client. 25 different feature are included with information of each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. Here is a more detailed exploratory analysis that explained how we transform and use each feature.There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limit are more likely to default their payment.

+
+Figure 1. Density of Credit Limit Between Default Clients and On-time Clients +

+Figure 1. Density of Credit Limit Between Default Clients and On-time Clients +

+
+

Both LogisticRegression and RandomForest model will be used to build this classification model to predict whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both model. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R and Python programing languages and the following R and Python packages were used to perform the analysis:

+

The code used to perform the analysis and create this report can be found here

+
+
+
+

Results & Discussion

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Table 1. This is a summary of the scores for LogisticRegression and RandomForest
mean_test_scoreparamsmodel
0.5105468{‘C’: 382, ‘class_weight’: ‘balanced’}LogisticRegression
0.5103729{‘C’: 679, ‘class_weight’: ‘balanced’}LogisticRegression
0.5102955{‘C’: 559, ‘class_weight’: ‘balanced’}LogisticRegression
0.4771019{‘max_depth’: 946, ‘n_estimators’: 161}RandomForest
0.4733326{‘max_depth’: 1793, ‘n_estimators’: 168}RandomForest
0.4704956{‘max_depth’: 560, ‘n_estimators’: 94}RandomForest
0.4690769{‘max_depth’: 1408, ‘n_estimators’: 43}RandomForest
0.4432478{‘max_depth’: 736, ‘n_estimators’: 20}RandomForest
0.3958155{‘C’: 158, ‘class_weight’: ‘none’}LogisticRegression
0.3958155{‘C’: 596, ‘class_weight’: ‘none’}LogisticRegression
+
+
+

References

+
+ + + + +
+ + + + + + + + + + + + + + + From 770f4655a4d9625e97e8f7820af7e926c07b75ec Mon Sep 17 00:00:00 2001 From: Hazel Jiang Date: Sat, 28 Nov 2020 19:13:46 -0800 Subject: [PATCH 2/4] add md report file --- doc/report.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 doc/report.md diff --git a/doc/report.md b/doc/report.md new file mode 100644 index 0000000..353d1f2 --- /dev/null +++ b/doc/report.md @@ -0,0 +1,115 @@ +Credit Card Default Predicting +================ +Selma Duric, Lara Habashy, Hazel Jiang +2020-11-28 + + - [Summary](#summary) + - [Introduction](#introduction) + - [Methods](#methods) + - [Data](#data) + - [Analysis](#analysis) + - [Results & Discussion](#results-discussion) + - [References](#references) + +## Summary + +## Introduction + +In recent years, credit card becomes more and more popular in Taiwan. +Because card-issuing banks are all trying to increase market share, +there exists more unqualified applicants who are not able to pay their +credit card on time. This behaviour is very harmful to both banks and +cardholders. (\#reference) It is always better to prevent than to solve +a problem. By detecting patterns of people who tend to default their +credit card payment, banks are able to minimize the risk of issuing +credict card to people who may not be able to pay on time. + +Here we would like to use a machine learning algorithm to predict +whether a person is going to defualt on his/her credit card payment. We +are going to test on different model and hyperparameters to find the +best score on prediction. With the model, banks could predict if the +applicant has the ability to pay on time and make better decision on +whether to issue the person a credit card. Thus, if the machine learning +algorithm can make accurate prediction, banks are able to find reliable +applicants and minimize their loss on default payment. 
+
+## Methods
+
+### Data
+
+The dataset we are using in the project is originally from the
+Department of Information Management at Chung Hua University, Taiwan,
+and the Department of Civil Engineering at Tamkang University, Taiwan.
+It was sourced from the UCI Machine Learning Repository (\#references)
+and can be found
+[here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#).
+[This
+file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls)
+is what we used to build the model. The data set contains 30,000
+observations representing individual customers in Taiwan. Each row
+contains relevant information about the distinct individual as well as
+how timely they were with their bill payments and the corresponding
+bill amounts for each time period. The bill payment information
+contains records from April 2005 to September 2005, and each individual
+has the same number of time periods. The data was collected from an
+important cash and credit card issuing bank in Taiwan. We will make our
+predictions based on the features given by the data.
+
+### Analysis
+
+There are 30,000 observations of distinct credit card clients in this
+data set, with each row representing a client. 25 different features
+are included with information on each given client, such as gender,
+age, approved credit limit, education, marital status, their past
+payment history, bill statements, and previous payments for 6 months
+(April-Sept 2005). Feature transformations are applied to the given
+features so each observation has the same number of time periods.
+[Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md)
+is a more detailed exploratory analysis that explains how we
+transformed and used each feature. There exists class imbalance in the
+data set, and one pattern we found is that people with a higher credit
+card limit are more likely to default on their payments.
+
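The class imbalance flagged in this Analysis section can be checked in one step before any modelling. The counts below are made up for illustration (in the real data roughly one in five clients defaults), so accuracy alone would look deceptively good.

```python
import pandas as pd

# Hypothetical stand-in for the target column; the real counts come from
# the downloaded `.xls` file, where roughly one in five clients defaults.
y = pd.Series([0] * 78 + [1] * 22, name="default_payment")

counts = y.value_counts()
default_rate = counts[1] / counts.sum()
print(counts.to_dict())        # {0: 78, 1: 22}
print(round(default_rate, 2))  # 0.22
```

A majority-class baseline would score 0.78 accuracy on these counts while catching zero defaulters, which is why the report also tracks f1.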
+ +Figure 1. Density of Credit Limit Between Default Clients and On-time Clients + +

+ +Figure 1. Density of Credit Limit Between Default Clients and On-time +Clients + +

+ +
+
+Both `LogisticRegression` and `RandomForest` models will be used to
+build this classification model to predict whether a client will
+default on the credit card payment. Because of the class imbalance we
+have, we will look at test accuracy as well as f1 scores for both
+models. For each model, the appropriate hyperparameters were chosen
+using 5-fold cross validation. The R and Python programming languages
+and the following R and Python packages were used to perform the
+analysis: …ADD packages and ref.
+
+The code used to perform the analysis and create this report can be
+found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src)
+
+## Results & Discussion
+
+| mean\_test\_score | params | model |
+| ----------------: | :----------------------------------------- | :----------------- |
+| 0.5105468 | {‘C’: 382, ‘class\_weight’: ‘balanced’} | LogisticRegression |
+| 0.5103729 | {‘C’: 679, ‘class\_weight’: ‘balanced’} | LogisticRegression |
+| 0.5102955 | {‘C’: 559, ‘class\_weight’: ‘balanced’} | LogisticRegression |
+| 0.4771019 | {‘max\_depth’: 946, ‘n\_estimators’: 161} | RandomForest |
+| 0.4733326 | {‘max\_depth’: 1793, ‘n\_estimators’: 168} | RandomForest |
+| 0.4704956 | {‘max\_depth’: 560, ‘n\_estimators’: 94} | RandomForest |
+| 0.4690769 | {‘max\_depth’: 1408, ‘n\_estimators’: 43} | RandomForest |
+| 0.4432478 | {‘max\_depth’: 736, ‘n\_estimators’: 20} | RandomForest |
+| 0.3958155 | {‘C’: 158, ‘class\_weight’: ‘none’} | LogisticRegression |
+| 0.3958155 | {‘C’: 596, ‘class\_weight’: ‘none’} | LogisticRegression |
+
+Table 1. This is a summary of the scores for LogisticRegression and
+RandomForest
+
+## References

From 5dcc4ab4ebc7e3dea1c18287e6fcc088b58a0451 Mon Sep 17 00:00:00 2001
From: Hazel Jiang
Date: Sat, 28 Nov 2020 21:42:15 -0800
Subject: [PATCH 3/4] update doc folder. add reference to it. finish first draft of report.
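The tuning procedure the paragraph and table above describe — 5-fold cross-validation scored with f1, searching `C`/`class_weight` for `LogisticRegression` and `n_estimators`/`max_depth` for `RandomForest` — could be sketched as below. The synthetic data and the `RandomizedSearchCV` setup are illustrative assumptions (the search ranges echo the report's tables), not the project's actual script.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data standing in for the real credit card features.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.78, 0.22], random_state=123
)

# Search spaces mirror the ranges reported: C up to 1000, class_weight
# balanced/None, n_estimators 10-300, max_depth 1-5000.
searches = {
    "LogisticRegression": RandomizedSearchCV(
        LogisticRegression(max_iter=2000),
        {"C": uniform(0.01, 1000), "class_weight": ["balanced", None]},
        n_iter=5, cv=5, scoring="f1", random_state=123,
    ),
    "RandomForest": RandomizedSearchCV(
        RandomForestClassifier(random_state=123),
        {"n_estimators": randint(10, 300), "max_depth": randint(1, 5000)},
        n_iter=5, cv=5, scoring="f1", random_state=123,
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, round(search.best_score_, 3), search.best_params_)
```

In the project itself the search would be fit on the training split only, with the winning model then evaluated once on the held-out test set.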
--- doc/reference.bib | 85 +++++++ doc/report.Rmd | 41 +++- doc/report.html | 589 ---------------------------------------------- doc/report.md | 196 ++++++++++++--- 4 files changed, 281 insertions(+), 630 deletions(-) create mode 100644 doc/reference.bib delete mode 100644 doc/report.html diff --git a/doc/reference.bib b/doc/reference.bib new file mode 100644 index 0000000..891346a --- /dev/null +++ b/doc/reference.bib @@ -0,0 +1,85 @@ +@article{yeh2009comparisons, + title={The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients}, + author={Yeh, I-Cheng and Lien, Che-hui}, + journal={Expert Systems with Applications}, + volume={36}, + number={2}, + pages={2473--2480}, + year={2009}, + publisher={Elsevier} +} + +@Manual{R, + title = {R: A Language and Environment for Statistical Computing}, + author = {{R Core Team}}, + organization = {R Foundation for Statistical Computing}, + address = {Vienna, Austria}, + year = {2020}, + url = {https://www.R-project.org/}, + } + +@Manual{tidyverse, + title = {tidyverse: Easily Install and Load the 'Tidyverse'}, + author = {Hadley Wickham}, + year = {2017}, + note = {R package version 1.2.1}, + url = {https://CRAN.R-project.org/package=tidyverse}, +} + +@Manual{knitr, + title = {knitr: A General-Purpose Package for Dynamic Report Generation in R}, + author = {Yihui Xie}, + year = {2020}, + note = {R package version 1.29}, + url = {https://yihui.org/knitr/}, + } + +@Manual{docopt, + title = {docopt: Command-Line Interface Specification Language}, + author = {Edwin {de Jonge}}, + year = {2018}, + note = {R package version 0.6.1}, + url = {https://CRAN.R-project.org/package=docopt}, +} + +@Manual{featherr, + title = {feather: R Bindings to the Feather 'API'}, + author = {Hadley Wickham}, + year = {2019}, + note = {R package version 0.3.5}, + url = {https://CRAN.R-project.org/package=feather} +} + +@article{scikit-learn, + title={Scikit-learn: Machine Learning in 
{P}ython}, + author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. + and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. + and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and + Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, + journal={Journal of Machine Learning Research}, + volume={12}, + pages={2825--2830}, + year={2011} +} + +@book{Python, + author = {Van Rossum, Guido and Drake, Fred L.}, + title = {Python 3 Reference Manual}, + year = {2009}, + isbn = {1441412697}, + publisher = {CreateSpace}, + address = {Scotts Valley, CA} +} + +@software{reback2020pandas, + author = {The pandas development team}, + title = {pandas-dev/pandas: Pandas}, + month = Aug, + year = 2020, + publisher = {Zenodo}, + version = {1.1.1}, + doi = {10.5281/zenodo.3993412}, + url = {https://doi.org/10.5281/zenodo.3993412} +} + + diff --git a/doc/report.Rmd b/doc/report.Rmd index d5d2505..6938725 100644 --- a/doc/report.Rmd +++ b/doc/report.Rmd @@ -1,9 +1,10 @@ --- title: "Credit Card Default Predicting" -date: 2020-11-28 +date: "2020-11-28" author: "Selma Duric, Lara Habashy, Hazel Jiang" +bibliography: reference.bib output: - html_document: + github_document: toc: TRUE --- @@ -17,33 +18,51 @@ library(tidyverse) ## Introduction -In recent years, credit card becomes more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there exists more unqualified applicants who are not able to pay their credit card on time. This behaviour is very harmful to both banks and cardholders. (#reference) It is always better to prevent than to solve a problem. By detecting patterns of people who tend to default their credit card payment, banks are able to minimize the risk of issuing credict card to people who may not be able to pay on time. +In recent years, credit card becomes more and more popular in Taiwan. 
Because card-issuing banks are all trying to increase market share, there are more and more unqualified applicants who are not able to pay their credit card bills on time. This behavior is very harmful to both banks and cardholders.[@yeh2009comparisons] It is always better to prevent a problem than to solve one. By detecting patterns of people who tend to default on their credit card payments, banks are able to minimize the risk of issuing credit cards to people who may not be able to pay on time.
 
-Here we would like to use a machine learning algorithm to predict whether a person is going to defualt on his/her credit card payment. We are going to test on different model and hyperparameters to find the best score on prediction. With the model, banks could predict if the applicant has the ability to pay on time and make better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate prediction, banks are able to find reliable applicants and minimize their loss on default payment.
+Here we would like to use a machine learning algorithm to predict whether a person is going to default on his/her credit card payment. We are going to test different models and hyperparameters to find the best prediction score. With the model, banks could predict whether an applicant has the ability to pay on time and make a better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate predictions, banks are able to find reliable applicants and minimize their losses from default payments.
 
 ## Methods
 
 ### Data
 
-The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI Machine Learning Repository (#references) and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#).
[This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005 and each individual have the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan.We will make our prediction based on the features given by the data.
+The dataset we are using in the project is originally from the Department of Information Management at Chung Hua University, Taiwan, and the Department of Civil Engineering at Tamkang University, Taiwan. It was sourced from the UCI Machine Learning Repository and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). [This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005, and each individual has the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan. We will make our predictions based on the features given by the data.
 
 ### Analysis
 
-There are 30,000 observations of distinct credit card clients in this data set with each row represents a client.
25 different feature are included with information of each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis that explained how we transform and use each feature.There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limit are more likely to default their payment.
+There are 30,000 observations of distinct credit card clients in this data set, with each row representing a client. 25 different features are included with information on each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis that explains how we transformed and used each feature. There exists class imbalance in the data set, and one pattern we found is that people with a higher credit card limit are more likely to default on their payments.
 
 
-```{r limit plot, fig.cap='Figure 1. Density of Credit Limit Between Default Clients and On-time Clients', out.width='50%'}
+```{r limit plot, fig.cap='Figure 1. Density of Credit Limit Between Default Clients and On-time Clients', out.width='45%'}
 knitr::include_graphics('../results/density_plot.png')
 ```
 
-Both `LogisticRegression` and `RandomForest` model will be used to build this classification model to predict whether a client will default on the credit card payment.
Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both model. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R and Python programing languages and the following R and Python packages were used to perform the analysis: ...ADD packages and ref.
+Another pattern we found is that there exists a correlation between education level and default payment. We will analyze this feature further in our machine learning model.
+
+```{r education level, fig.cap='Figure 2. Correlation Between Educational level and Default Payment', out.width='45%'}
+knitr::include_graphics('../results/correlation_plot.png')
+```
+
+Both `LogisticRegression` and `RandomForest` models from scikit-learn[@scikit-learn] will be used to build this classification model to predict whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores for both models. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R[@R] and Python[@Python] programming languages and the following R and Python packages were used to perform the analysis: docopt[@docopt], feather[@featherr], knitr[@knitr], tidyverse[@tidyverse], and pandas[@reback2020pandas].
 
 The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src)
 
 ## Results & Discussion
 
-```{r results, message=FALSE}
-result <- read_csv('../results/prediction_results.csv')
-knitr::kable(result, caption = 'Table 1. This is a summary of the scores for LogisticRegression and RandomForest')
+To look at which model is better for the prediction, we first compare the two models with default hyperparameters. We used `DummyClassifier` with `strategy='prior'` as our baseline.
Although it has an accuracy score of 0.78, it is not very reliable because we have class imbalance in the data set and the f1 score is more important for our prediction. Our baseline has an f1 score of 0, which is not good. On the other hand, both `RandomForest` and `LogisticRegression` have better f1 scores. `RandomForest` has a very high f1 on the training set, but the score is low on the validation set, and there exists a huge gap between the two scores, which means we have an overfitting problem. In contrast, `LogisticRegression` has very similar training and validation f1 scores, and it has a higher f1 score compared to the `RandomForest` model. Therefore, we believe `LogisticRegression` is a better model to use for prediction.
+
+```{r compare results, message=FALSE, warning=FALSE}
+result_compare <- read_csv('../results/prediction_prelim_results.csv')
+knitr::kable(result_compare, caption = 'Table 1. This is a comparison between accuracy and f1 with default hyperparameters for each model')
+```
+
+Then we did hyperparameter tuning for both models and compared the results with the previous table. The hyperparameters we chose for `RandomForest` are `n_estimators` (low=10, high=300) and `max_depth` (low=1, high=5000). The hyperparameters for `LogisticRegression` are `class_weight` ("balanced" vs "none") and `C` (low=0, high=1000). We only focus on the f1 score in this comparison since it is more relevant to the issue we care about. We ranked the f1 scores from high to low. As indicated in the table, our best f1 score is 0.51 with hyperparameters *C=382* and *class_weight='balanced'*. The results also show that the top 3 f1 scores all come from `LogisticRegression`. This finding further confirms our results from the previous table: `LogisticRegression` is a better model to use than `RandomForest`.
+
+```{r f1 results, message=FALSE, warning=FALSE}
+result_f1 <- read_csv('../results/prediction_hp_results.csv')
+knitr::kable(result_f1, caption='Table 2. 
This is the result of the f1 scores with optimized hyperparameters for each model')
+```
+
+Based on the results above, we find that although `LogisticRegression` is the better model to use, its f1 score is only around 0.5. This is not a very good score, which means the predictions from this model are not very reliable. To further improve this model in the future, it would be a good idea to take other hyperparameters into consideration or to apply feature engineering to add more useful features to help with prediction. Furthermore, we may also want to look at the confusion matrix of model performance and try to minimize false negatives in the prediction by changing the threshold of the model.
+
+
+## References
\ No newline at end of file
diff --git a/doc/report.html b/doc/report.html
deleted file mode 100644
index 598c77e..0000000
--- a/doc/report.html
+++ /dev/null
@@ -1,589 +0,0 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Credit Card Default Predicting
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- - - - - - - - -
-

Summary

-
-
-

Introduction

-

In recent years, credit card becomes more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there exists more unqualified applicants who are not able to pay their credit card on time. This behaviour is very harmful to both banks and cardholders. (#reference) It is always better to prevent than to solve a problem. By detecting patterns of people who tend to default their credit card payment, banks are able to minimize the risk of issuing credict card to people who may not be able to pay on time.

-

Here we would like to use a machine learning algorithm to predict whether a person is going to defualt on his/her credit card payment. We are going to test on different model and hyperparameters to find the best score on prediction. With the model, banks could predict if the applicant has the ability to pay on time and make better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate prediction, banks are able to find reliable applicants and minimize their loss on default payment.

-
-
-

Methods

-
-

Data

-

The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI Machine Learning Repository (#references) and can be found here. This file is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005 and each individual have the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan.We will make our prediction based on the features given by the data.


Analysis


There are 30,000 observations of distinct credit card clients in this data set, with each row representing a client. 25 different features are included with information on each given client, such as gender, age, approved credit limit, education, marital status, past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. Here is a more detailed exploratory analysis that explains how we transform and use each feature. There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limits are more likely to default on their payment.
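The class imbalance mentioned above can be checked directly from the target column. Below is a minimal sketch with pandas on a tiny synthetic sample, not the real file; the column name `default` is a stand-in for the UCI file's `default payment next month` column.

```python
import pandas as pd

# Tiny synthetic stand-in for the real data; "default" is a placeholder
# for the UCI file's "default payment next month" target column.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000, 500000, 140000],
    "default":   [1, 0, 0, 1, 0, 0],
})

# Share of clients who defaulted: a skewed split signals class imbalance.
default_rate = df["default"].mean()
print(f"default rate: {default_rate:.2f}")
```

On the real data this rate is roughly 0.22, which is why accuracy alone is a misleading score here.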

Figure 1. Density of Credit Limit Between Default Clients and On-time Clients

Both LogisticRegression and RandomForest models will be used to build this classification model to predict whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores for both models. For each model, the appropriate hyperparameters were chosen using 5-fold cross-validation. The R and Python programming languages and the following R and Python packages were used to perform the analysis: docopt (de Jonge 2018), feather (Wickham 2019), knitr (Xie 2020), tidyverse (Wickham 2017), and Pandas (team 2020).


The code used to perform the analysis and create this report can be found here.


Results & Discussion

| mean_test_score | params | model |
| --------------: | :------------------------------------------ | :----------------- |
| 0.5105468 | {‘C’: 382, ‘class_weight’: ‘balanced’} | LogisticRegression |
| 0.5103729 | {‘C’: 679, ‘class_weight’: ‘balanced’} | LogisticRegression |
| 0.5102955 | {‘C’: 559, ‘class_weight’: ‘balanced’} | LogisticRegression |
| 0.4771019 | {‘max_depth’: 946, ‘n_estimators’: 161} | RandomForest |
| 0.4733326 | {‘max_depth’: 1793, ‘n_estimators’: 168} | RandomForest |
| 0.4704956 | {‘max_depth’: 560, ‘n_estimators’: 94} | RandomForest |
| 0.4690769 | {‘max_depth’: 1408, ‘n_estimators’: 43} | RandomForest |
| 0.4432478 | {‘max_depth’: 736, ‘n_estimators’: 20} | RandomForest |
| 0.3958155 | {‘C’: 158, ‘class_weight’: ‘none’} | LogisticRegression |
| 0.3958155 | {‘C’: 596, ‘class_weight’: ‘none’} | LogisticRegression |

Table 1. This is a summary of the scores for LogisticRegression and RandomForest

References

- - - - - - - - - - - - - - - diff --git a/doc/report.md b/doc/report.md index 353d1f2..c70a2e7 100644 --- a/doc/report.md +++ b/doc/report.md @@ -18,14 +18,14 @@ Selma Duric, Lara Habashy, Hazel Jiang In recent years, credit card becomes more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there exists more unqualified applicants who are not able to pay their -credit card on time. This behaviour is very harmful to both banks and -cardholders. (\#reference) It is always better to prevent than to solve -a problem. By detecting patterns of people who tend to default their -credit card payment, banks are able to minimize the risk of issuing -credict card to people who may not be able to pay on time. +credit card on time. This behavior is very harmful to both banks and +cardholders.(Yeh and Lien 2009) It is always better to prevent than to +solve a problem. By detecting patterns of people who tend to default +their credit card payment, banks are able to minimize the risk of +issuing credit card to people who may not be able to pay on time. Here we would like to use a machine learning algorithm to predict -whether a person is going to defualt on his/her credit card payment. We +whether a person is going to default on his/her credit card payment. We are going to test on different model and hyperparameters to find the best score on prediction. With the model, banks could predict if the applicant has the ability to pay on time and make better decision on @@ -40,7 +40,7 @@ applicants and minimize their loss on default payment. The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI -Machine Learning Repository (\#references) and can be found +Machine Learning Repository and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). 
[This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) @@ -57,7 +57,7 @@ based on the features given by the data. ### Analysis There are 30,000 observations of distinct credit card clients in this -data set with each row represents a client. 25 different feature are +data set with each row represents a client. 25 different features are included with information of each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept @@ -71,7 +71,7 @@ more likely to default their payment.
-Figure 1. Density of Credit Limit Between Default Clients and On-time Clients +Figure 1. Density of Credit Limit Between Default Clients and On-time Clients

@@ -82,34 +82,170 @@ Clients

-Both `LogisticRegression` and `RandomForest` model will be used to build -this classification model to predict whether a client will default on -the credit card payment. Because of the class imbalance we have, we will +Another pattern we found is that there exists a correlation between +education level are default payment. Will analyzing this feature further +in our machine learning model. + +
+ +Figure 2. Correlation Between Educational level and Default Payment + +

+ +Figure 2. Correlation Between Educational level and Default Payment + +

+ +
+ +Both `LogisticRegression` and `RandomForest` model from +scikit-learn(Pedregosa et al. 2011) will be used to build this +classification model to predict whether a client will default on the +credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both model. For each model, the appropriate hyperparameters were chosen using 5-fold cross -validation. The R and Python programing languages and the following R -and Python packages were used to perform the analysis: …ADD packages and -ref. +validation. The R(R Core Team 2020) and Python(Van Rossum and Drake +2009) programming languages and the following R and Python packages were +used to perform the analysis: docopt(de Jonge 2018), feather(Wickham +2019), knitr(Xie 2020), tidyverse(Wickham 2017)and Pandas(team 2020) The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src) ## Results & Discussion -| mean\_test\_score | params | model | -| ----------------: | :----------------------------------------- | :----------------- | -| 0.5105468 | {‘C’: 382, ‘class\_weight’: ‘balanced’} | LogisticRegression | -| 0.5103729 | {‘C’: 679, ‘class\_weight’: ‘balanced’} | LogisticRegression | -| 0.5102955 | {‘C’: 559, ‘class\_weight’: ‘balanced’} | LogisticRegression | -| 0.4771019 | {‘max\_depth’: 946, ‘n\_estimators’: 161} | RandomForest | -| 0.4733326 | {‘max\_depth’: 1793, ‘n\_estimators’: 168} | RandomForest | -| 0.4704956 | {‘max\_depth’: 560, ‘n\_estimators’: 94} | RandomForest | -| 0.4690769 | {‘max\_depth’: 1408, ‘n\_estimators’: 43} | RandomForest | -| 0.4432478 | {‘max\_depth’: 736, ‘n\_estimators’: 20} | RandomForest | -| 0.3958155 | {‘C’: 158, ‘class\_weight’: ‘none’} | LogisticRegression | -| 0.3958155 | {‘C’: 596, ‘class\_weight’: ‘none’} | LogisticRegression | - -Table 1. 
This is a summary of the scores for LogisticRegression and -RandomForest +To look at which model is better for the prediction, we first compare +the two models with default hyperparameters. We used `DummyRegression` +with `strategy='prior'` as our baseline. Although it has an accuracy +score of 0.78, it is not very reliable because we have class imbalance +in the data set and f1 score is more important in our prediction. Our +baseline has f1 score of 0, which is not good. On the other hand, both +`RandomForest` and `LogisticRegression` has better score on f1. +`RandomForest` has a very high f1 on the training set, but the score is +low on the validation set, and there exists a huge gap between the two +scores, which means we have an overfitting problem. On the other hand, +`LogisticRegression` has very similar training and validation f1 scores, +it has a higher f1 score compare to `RandomForest` model. Therefore, we +believe `LogisticRegression` is a better model to use for prediction. + +| X1 | Baseline | Random Forest | Logistic Regression | +| :------------------------- | -------: | ------------: | ------------------: | +| mean\_accuracy\_train | 0.7788 | 0.9995 | 0.7448 | +| mean\_accuracy\_validation | 0.7788 | 0.8168 | 0.7440 | +| mean\_f1\_train | 0.0000 | 0.9988 | 0.5125 | +| mean\_f1\_validation | 0.0000 | 0.4756 | 0.5109 | + +Table 1. This is a comparison betweeen accuracy and f1 with default +hyperparameters for each model + +Then we did hyperparameter tuning for both models and compare the +results with previous table. The hyperparameters we chose for +`RandomForest` is `n_estimators` (low=10, high=300) and `max_depth` +(low=1, high=5000). The hyperparameters for `LogisticRegression` is +`class_weight` (“balanced” vs “none”) and `C` (low=0, high=1000). We +only focus on f1 score in this comparasion since it is more relavant to +the issue we care about. We ranked the f1 score from high to low. 
As +indicated in the table, our best f1 score is 0.51 with hyperparameter +*C=382* and *class\_weight=‘balanced’*. The results also show that the +top 3 f1 scores are all come from `LogisricRegression`. This finding +further confirmed our results from previous table that +`LogisticRegression` is a better model to use than `RandomForest`. + +| mean f1 score | params | model | +| ------------: | :----------------------------------------- | :----------------- | +| 0.5105468 | {‘C’: 382, ‘class\_weight’: ‘balanced’} | LogisticRegression | +| 0.5103729 | {‘C’: 679, ‘class\_weight’: ‘balanced’} | LogisticRegression | +| 0.5102955 | {‘C’: 559, ‘class\_weight’: ‘balanced’} | LogisticRegression | +| 0.4770697 | {‘max\_depth’: 946, ‘n\_estimators’: 161} | RandomForest | +| 0.4732701 | {‘max\_depth’: 1793, ‘n\_estimators’: 168} | RandomForest | +| 0.4712085 | {‘max\_depth’: 560, ‘n\_estimators’: 94} | RandomForest | +| 0.4665116 | {‘max\_depth’: 1408, ‘n\_estimators’: 43} | RandomForest | +| 0.4423804 | {‘max\_depth’: 736, ‘n\_estimators’: 20} | RandomForest | +| 0.3958155 | {‘C’: 158, ‘class\_weight’: ‘none’} | LogisticRegression | +| 0.3958155 | {‘C’: 596, ‘class\_weight’: ‘none’} | LogisticRegression | + +Table 2. This is the result of f1 score with optimized hyperpamaters for +each model + +Based on the result above, we find that although `LogisticRegression` is +a better model to use, the f1 score is only around 0.5. It is not a very +good score, which means the prediction from this model is not as +reliable. To further improve this model in future, it is a good idea to +take consideration of other hyperparameters or apply feature engineering +to add more useful features to help with prediction. Furthermore, we may +also want to look at the confusion matrix of model performance and try +to minimize the false negative in the prediction by changing the +threshold of the model. ## References + +
+ +
+ +de Jonge, Edwin. 2018. *Docopt: Command-Line Interface Specification +Language*. . + +
+ +
+ +Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. +Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in +Python.” *Journal of Machine Learning Research* 12: 2825–30. + +
+ +
+ +R Core Team. 2020. *R: A Language and Environment for Statistical +Computing*. Vienna, Austria: R Foundation for Statistical Computing. +. + +
+ +
+ +team, The pandas development. 2020. *Pandas-Dev/Pandas: Pandas* (version +1.1.1). Zenodo. . + +
+ +
+ +Van Rossum, Guido, and Fred L. Drake. 2009. *Python 3 Reference Manual*. +Scotts Valley, CA: CreateSpace. + +
+ +
+ +Wickham, Hadley. 2017. *Tidyverse: Easily Install and Load the +’Tidyverse’*. . + +
+ +
+ +———. 2019. *Feather: R Bindings to the Feather ’Api’*. +. + +
+ +
+ +Xie, Yihui. 2020. *Knitr: A General-Purpose Package for Dynamic Report +Generation in R*. . + +
+ +
+ +Yeh, I-Cheng, and Che-hui Lien. 2009. “The Comparisons of Data Mining +Techniques for the Predictive Accuracy of Probability of Default of +Credit Card Clients.” *Expert Systems with Applications* 36 (2): +2473–80. + +
+ +
From 688e27bdaee9a6ee6231299facd9fda714b626ad Mon Sep 17 00:00:00 2001 From: Hazel Jiang Date: Sat, 28 Nov 2020 22:26:16 -0800 Subject: [PATCH 4/4] add summary to report, fixed gramma issue, rendered html --- doc/report.Rmd | 26 +- doc/report.html | 663 ++++++++++++++++++++++++++++++++++++++++++++++++ doc/report.md | 103 ++++---- 3 files changed, 730 insertions(+), 62 deletions(-) create mode 100644 doc/report.html diff --git a/doc/report.Rmd b/doc/report.Rmd index 6938725..7bdcb41 100644 --- a/doc/report.Rmd +++ b/doc/report.Rmd @@ -1,10 +1,10 @@ --- -title: "Credit Card Default Predicting" -date: "2020-11-28" -author: "Selma Duric, Lara Habashy, Hazel Jiang" +title: "Credit Card Default Prediction" +author: "Selma Duric, Lara Habashy, Hazel Jiang
" +date: "11/28/2020" bibliography: reference.bib output: - github_document: + html_document: toc: TRUE --- @@ -16,6 +16,8 @@ library(tidyverse) ## Summary +Here we attempt to apply two machine learning models `LogisticRegression` and `RandomForest` on a credit card default data set and find the better model with optimized hyperparameter to predict if a client is likely to default payment on the credit card in order to lower the risk for banks to issue credit card to more reliable clients. `LogisticRegression` performed better compared to `RandomForest`. Our best prediction has f1 score of 0.51 with optimzed hyperpameter of *C=382* and *class_weight='balanced'*. + ## Introduction In recent years, credit card becomes more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, there exists more unqualified applicants who are not able to pay their credit card on time. This behavior is very harmful to both banks and cardholders.[@yeh2009comparisons] It is always better to prevent than to solve a problem. By detecting patterns of people who tend to default their credit card payment, banks are able to minimize the risk of issuing credit card to people who may not be able to pay on time. @@ -26,41 +28,41 @@ Here we would like to use a machine learning algorithm to predict whether a pers ### Data -The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI Machine Learning Repository and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). [This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. 
Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005 and each individual have the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan.We will make our prediction based on the features given by the data. +The dataset we are using in the project is originally from Department of Information Management in Chun Hua University, Taiwan and Department of Civil Engineering in Tamkang University, Taiwan. It was sourced from UCI Machine Learning Repository and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). [This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005 and each individual have the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan. We will make our prediction based on the features given by the data. ### Analysis -There are 30,000 observations of distinct credit card clients in this data set with each row represents a client. 25 different features are included with information of each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). 
Feature transformations are applied to the given features so each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis that explained how we transform and use each feature.There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limit are more likely to default their payment. +There are 30,000 observations of distinct credit card clients in this data set with each row representing a client. 25 different features are included with information of each given client, such as gender, age, approved credit limit, education, marital status, their past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis that explained how we transform and use each feature. There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limit are more likely to default their payment. ```{r limit plot, fig.cap='Figure 1. Density of Credit Limit Between Default Clients and On-time Clients', out.width='45%'} knitr::include_graphics('../results/density_plot.png') ``` -Another pattern we found is that there exists a correlation between education level are default payment. Will analyzing this feature further in our machine learning model. +Another pattern we found is that there exists a correlation between education level and default payment. We will analyze this feature further in our machine learning model. ```{r education level, fig.cap='Figure 2. 
Correlation Between Educational level and Default Payment', out.width='45%'} knitr::include_graphics('../results/correlation_plot.png') ``` -Both `LogisticRegression` and `RandomForest` model from scikit-learn[@scikit-learn] will be used to build this classification model to predict whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both model. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R[@R] and Python[@Python] programming languages and the following R and Python packages were used to perform the analysis: docopt[@docopt], feather[@featherr], knitr[@knitr], tidyverse[@tidyverse]and Pandas[@reback2020pandas] +Both a linear classification model `LogisticRegression` and an ensemble decision tree classification model `RandomForest` from scikit-learn(Pedregosa et al. 2011) will be used to build this classification model to see which better predicts whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores on both models. For each model, the appropriate hyperparameters were chosen using 5-fold cross validation. The R[@R] and Python[@Python] programming languages and the following R and Python packages were used to perform the analysis: docopt[@docopt], feather[@featherr], knitr[@knitr], tidyverse[@tidyverse]and Pandas[@reback2020pandas]. The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src) ## Results & Discussion -To look at which model is better for the prediction, we first compare the two models with default hyperparameters. We used `DummyRegression` with `strategy='prior'` as our baseline. 
Although it has an accuracy score of 0.78, it is not very reliable because we have class imbalance in the data set and f1 score is more important in our prediction. Our baseline has f1 score of 0, which is not good. On the other hand, both `RandomForest` and `LogisticRegression` has better score on f1. `RandomForest` has a very high f1 on the training set, but the score is low on the validation set, and there exists a huge gap between the two scores, which means we have an overfitting problem. On the other hand, `LogisticRegression` has very similar training and validation f1 scores, it has a higher f1 score compare to `RandomForest` model. Therefore, we believe `LogisticRegression` is a better model to use for prediction. +To look at which model is better for prediction, we first compare the two models with default hyperparameters. We used `DummyRegression` with `strategy='prior'` as our baseline. Although it has an accuracy score of 0.78, it is not very reliable because we have class imbalance in the data set and f1 score is more important in our prediction. Our baseline has f1 score of 0, which is not good. On the other hand, both `RandomForest` and `LogisticRegression` has better score on f1. `RandomForest` has a very high f1 on the training set, but the score is low on the validation set, and there exists a huge gap between the two scores, which means we have an overfitting problem. On the other hand, `LogisticRegression` has very similar training and validation f1 scores, it has a higher f1 score compared to `RandomForest` model. Therefore, we believe `LogisticRegression` is a better model to use for prediction. ```{r compare results, message=FALSE, warning=FALSE} result_compare <- read_csv('../results/prediction_prelim_results.csv') -knitr::kable(result_compare, caption = 'Table 1. 
This is a comparison betweeen accuracy and f1 with default hyperparameters for each model ') +knitr::kable(result_compare, caption = 'Table 1.Comparison between accuracy and f1 with default hyperparameters for each model ') ``` -Then we did hyperparameter tuning for both models and compare the results with previous table. The hyperparameters we chose for `RandomForest` is `n_estimators` (low=10, high=300) and `max_depth` (low=1, high=5000). The hyperparameters for `LogisticRegression` is `class_weight` ("balanced" vs "none") and `C` (low=0, high=1000). We only focus on f1 score in this comparasion since it is more relavant to the issue we care about. We ranked the f1 score from high to low. As indicated in the table, our best f1 score is 0.51 with hyperparameter *C=382* and *class_weight='balanced'*. The results also show that the top 3 f1 scores are all come from `LogisricRegression`. This finding further confirmed our results from previous table that `LogisticRegression` is a better model to use than `RandomForest`. +Since the validation scores were comparable, we decided to tune hyperparameters for both models and compare the results with the previous table. The hyperparameters we chose for `RandomForest` is `n_estimators` (low=10, high=300) and `max_depth` (low=1, high=5000). The hyperparameters for `LogisticRegression` is `class_weight` ("balanced" vs "none") and `C` (low=0, high=1000). We only focus on f1 score in this comparasion since it is more relavant to the issue we care about. We ranked the f1 score from high to low. As indicated in the table, our best f1 score is 0.51 with hyperparameter *C=382* and *class_weight='balanced'*. The results also show that the top 3 f1 scores are all come from `LogisricRegression`. This finding further confirmed our results from previous table that `LogisticRegression` is a better model to use than `RandomForest`. 
```{r f1 results, message=FALSE, warning=FALSE} result_f1 <- read_csv('../results/prediction_hp_results.csv') -knitr::kable(result_f1, caption='Table 2. This is the result of f1 score with optimized hyperpamaters for each model') +knitr::kable(result_f1, caption='Table 2. F1 score with optimized hyperpamaters for each model') ``` Based on the result above, we find that although `LogisticRegression` is a better model to use, the f1 score is only around 0.5. It is not a very good score, which means the prediction from this model is not as reliable. To further improve this model in future, it is a good idea to take consideration of other hyperparameters or apply feature engineering to add more useful features to help with prediction. Furthermore, we may also want to look at the confusion matrix of model performance and try to minimize the false negative in the prediction by changing the threshold of the model. diff --git a/doc/report.html b/doc/report.html new file mode 100644 index 0000000..6d493ad --- /dev/null +++ b/doc/report.html @@ -0,0 +1,663 @@ + + + + + + + + + + + + + + + +Credit Card Default Prediction + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Summary


Here we apply two machine learning models, LogisticRegression and RandomForest, to a credit card default data set and find the better model, with optimized hyperparameters, for predicting whether a client is likely to default on their credit card payment, in order to lower the risk for banks by issuing credit cards to more reliable clients. LogisticRegression performed better than RandomForest. Our best model has an f1 score of 0.51 with the optimized hyperparameters C=382 and class_weight=‘balanced’.


Introduction


In recent years, credit cards have become increasingly popular in Taiwan. Because card-issuing banks are all trying to increase market share, there are more unqualified applicants who are unable to pay their credit card bills on time. This behavior is harmful to both banks and cardholders (Yeh and Lien 2009). It is always better to prevent a problem than to solve one. By detecting patterns among people who tend to default on their credit card payments, banks can minimize the risk of issuing credit cards to people who may not be able to pay on time.


Here we would like to use a machine learning algorithm to predict whether a person is going to default on his/her credit card payment. We are going to test different models and hyperparameters to find the best prediction score. With the model, banks could predict whether an applicant has the ability to pay on time and make a better decision on whether to issue the person a credit card. Thus, if the machine learning algorithm can make accurate predictions, banks are able to find reliable applicants and minimize their losses from default payments.


Methods


Data


The dataset we are using in the project is originally from the Department of Information Management at Chung Hua University, Taiwan and the Department of Civil Engineering at Tamkang University, Taiwan. It was sourced from the UCI Machine Learning Repository and can be found here. This file is what we used to build the model. The data set contains 30,000 observations representing individual customers in Taiwan. Each row contains relevant information about the distinct individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information contains records from April 2005 to September 2005, and each individual has the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan. We will make our prediction based on the features given by the data.


Analysis


There are 30,000 observations of distinct credit card clients in this data set, with each row representing a client. 25 different features are included with information on each given client, such as gender, age, approved credit limit, education, marital status, past payment history, bill statements, and previous payments for 6 months (April-Sept 2005). Feature transformations are applied to the given features so each observation has the same number of time periods. Here is a more detailed exploratory analysis that explains how we transform and use each feature. There exists class imbalance in the data set, and one pattern we found is that people with higher credit card limits are more likely to default on their payment.

Figure 1. Density of Credit Limit Between Default Clients and On-time Clients

Another pattern we found is that there exists a correlation between education level and default payment. We will analyze this feature further in our machine learning model.

Figure 2. Correlation Between Educational Level and Default Payment

Both a linear classification model, LogisticRegression, and an ensemble decision tree classification model, RandomForest, from scikit-learn (Pedregosa et al. 2011) will be used to build this classification model to see which better predicts whether a client will default on the credit card payment. Because of the class imbalance we have, we will look at test accuracy as well as f1 scores for both models. For each model, the appropriate hyperparameters were chosen using 5-fold cross-validation. The R (R Core Team 2020) and Python (Van Rossum and Drake 2009) programming languages and the following R and Python packages were used to perform the analysis: docopt (de Jonge 2018), feather (Wickham 2019), knitr (Xie 2020), tidyverse (Wickham 2017), and Pandas (team 2020).
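The model comparison described above can be sketched with scikit-learn's `cross_validate`, scoring on both accuracy and f1 across 5 folds. This is an illustrative sketch, not the project's actual script: a synthetic data set with roughly the same 78%/22% class split stands in for the real features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed features, mirroring the
# roughly 78%/22% class imbalance of the real data.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.78], random_state=123)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=123),
}

# 5-fold cross-validation with two metrics per model.
results = {}
for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
    results[name] = {"accuracy": cv["test_accuracy"].mean(),
                     "f1": cv["test_f1"].mean()}
```

With class imbalance, comparing `results[...]["f1"]` rather than accuracy is what distinguishes the two models meaningfully.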


The code used to perform the analysis and create this report can be found here.


Results & Discussion


To determine which model is better for prediction, we first compare the two models with default hyperparameters. We used `DummyClassifier` with `strategy='prior'` as our baseline. Although it has an accuracy score of 0.78, that score is not very reliable because we have class imbalance in the data set, and the f1 score is more important for our prediction. Our baseline has an f1 score of 0, which is not good. In contrast, both RandomForest and LogisticRegression have better f1 scores. RandomForest has a very high f1 on the training set, but the score is low on the validation set, and there is a huge gap between the two scores, which means we have an overfitting problem. LogisticRegression, on the other hand, has very similar training and validation f1 scores, and a higher validation f1 score than the RandomForest model. Therefore, we believe LogisticRegression is a better model to use for prediction.
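The baseline behavior described above can be reproduced in a few lines. This is an illustrative sketch on synthetic data, not the project's script: `DummyClassifier(strategy="prior")` always predicts the majority class, so its accuracy equals the majority-class share while its f1 score for the minority (default) class collapses to 0.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 78% of clients in the "on time" class.
X, y = make_classification(n_samples=1000, weights=[0.78], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# The prior strategy always predicts the most frequent class, so it can
# never identify a defaulter: accuracy looks fine, f1 is 0.
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
pred = baseline.predict(X_valid)
acc = accuracy_score(y_valid, pred)
f1 = f1_score(y_valid, pred)
```

This is exactly why accuracy alone overstates the baseline's usefulness on imbalanced data.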

| Metric | Baseline | Random Forest | Logistic Regression |
| :------------------------ | -----: | ------: | ------: |
| mean_accuracy_train | 0.7788 | 0.9995 | 0.7448 |
| mean_accuracy_validation | 0.7788 | 0.8168 | 0.7440 |
| mean_f1_train | 0.0000 | 0.9988 | 0.5125 |
| mean_f1_validation | 0.0000 | 0.4756 | 0.5109 |

Table 1. Comparison between accuracy and f1 with default hyperparameters for each model

Since the validation scores were comparable, we decided to tune hyperparameters for both models and compare the results with the previous table. The hyperparameters we chose for RandomForest are n_estimators (low=10, high=300) and max_depth (low=1, high=5000). The hyperparameters for LogisticRegression are class_weight (“balanced” vs “none”) and C (low=0, high=1000). We only focus on the f1 score in this comparison since it is more relevant to the issue we care about. We ranked the f1 scores from high to low. As indicated in the table, our best f1 score is 0.51 with hyperparameters C=382 and class_weight=‘balanced’. The results also show that the top 3 f1 scores all come from LogisticRegression. This finding further confirms our result from the previous table that LogisticRegression is a better model to use than RandomForest.
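A hedged sketch of how such a search over the LogisticRegression space might be set up with `RandomizedSearchCV` (the project's exact tuning code may differ; note scikit-learn expects `class_weight=None` rather than the string "none"):

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data with the same rough class imbalance.
X, y = make_classification(n_samples=800, weights=[0.78], random_state=1)

# Search space as described above: C sampled from (0, 1000],
# class_weight "balanced" vs None (scikit-learn's spelling of "none").
param_dist = {"C": uniform(0.001, 1000),
              "class_weight": ["balanced", None]}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=10, cv=5, scoring="f1", random_state=1)
search.fit(X, y)
best = search.best_params_
```

`scoring="f1"` makes the search rank candidates by mean cross-validated f1, matching how the table above is sorted.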

Table 2. F1 score with optimized hyperparameters for each model

| mean f1 score | params                                   | model              |
| ------------: | :--------------------------------------- | :----------------- |
|     0.5105468 | {‘C’: 382, ‘class_weight’: ‘balanced’}   | LogisticRegression |
|     0.5103729 | {‘C’: 679, ‘class_weight’: ‘balanced’}   | LogisticRegression |
|     0.5102955 | {‘C’: 559, ‘class_weight’: ‘balanced’}   | LogisticRegression |
|     0.4770697 | {‘max_depth’: 946, ‘n_estimators’: 161}  | RandomForest       |
|     0.4732701 | {‘max_depth’: 1793, ‘n_estimators’: 168} | RandomForest       |
|     0.4712085 | {‘max_depth’: 560, ‘n_estimators’: 94}   | RandomForest       |
|     0.4665116 | {‘max_depth’: 1408, ‘n_estimators’: 43}  | RandomForest       |
|     0.4423804 | {‘max_depth’: 736, ‘n_estimators’: 20}   | RandomForest       |
|     0.3958155 | {‘C’: 158, ‘class_weight’: ‘none’}       | LogisticRegression |
|     0.3958155 | {‘C’: 596, ‘class_weight’: ‘none’}       | LogisticRegression |
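A search over these ranges can be sketched as follows. The report does not name its search tool, so RandomizedSearchCV, the number of candidates, and the synthetic stand-in data are all assumptions here; only the parameter ranges come from the text (note scikit-learn expects `None` rather than the string "none" for `class_weight`).

```python
# Sketch of the hyperparameter search: the ranges mirror the ones
# quoted in the text; RandomizedSearchCV, n_iter, and the data are
# illustrative assumptions, not the project's actual setup.
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, weights=[0.78, 0.22],
                           random_state=123)

searches = {
    "RandomForest": RandomizedSearchCV(
        RandomForestClassifier(random_state=123),
        {"n_estimators": randint(10, 300), "max_depth": randint(1, 5000)},
        n_iter=5, cv=5, scoring="f1", random_state=123),
    "LogisticRegression": RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        # scikit-learn takes None, not the string "none"
        {"C": uniform(0, 1000), "class_weight": ["balanced", None]},
        n_iter=5, cv=5, scoring="f1", random_state=123),
}

rows = []
for name, search in searches.items():
    search.fit(X, y)
    for f1, params in zip(search.cv_results_["mean_test_score"],
                          search.cv_results_["params"]):
        rows.append({"mean f1 score": f1, "params": params, "model": name})

# Rank all candidates by cross-validated f1, best first, as in Table 2.
table = pd.DataFrame(rows).sort_values("mean f1 score", ascending=False)
print(table.to_string(index=False))
```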

Based on the results above, we find that although LogisticRegression is the better model, its f1 score is only around 0.5. That is not a very good score, which means predictions from this model are not very reliable. To further improve the model in future work, we could tune additional hyperparameters or apply feature engineering to add more useful features. We may also want to examine the confusion matrix of the model's performance and try to minimize false negatives by changing the model's decision threshold.
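The threshold idea works because lowering the cutoff applied to predicted probabilities can only turn predicted negatives into predicted positives, so false negatives fall while false positives rise. A minimal sketch on synthetic stand-in data (the 0.3 threshold is illustrative, not a value from the report):

```python
# Sketch of threshold tuning: trade false negatives for false
# positives by lowering the decision threshold below the default 0.5.
# Data and the 0.3 threshold are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.78, 0.22],
                           random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=123)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(default)

counts = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    counts[threshold] = {"false_negatives": fn, "false_positives": fp}
    print(f"threshold={threshold}: FN={fn}, FP={fp}")
```

For a bank that cares most about missed defaulters, the acceptable false-positive cost would determine how far below 0.5 the threshold should go.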


References


de Jonge, Edwin. 2018. Docopt: Command-Line Interface Specification Language. https://CRAN.R-project.org/package=docopt.


R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.


The pandas development team. 2020. Pandas-Dev/Pandas: Pandas (version 1.1.1). Zenodo. https://doi.org/10.5281/zenodo.3993412.


Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.


Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.


———. 2019. Feather: R Bindings to the Feather ’Api’. https://CRAN.R-project.org/package=feather.


Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.


Yeh, I-Cheng, and Che-hui Lien. 2009. “The Comparisons of Data Mining Techniques for the Predictive Accuracy of Probability of Default of Credit Card Clients.” Expert Systems with Applications 36 (2): 2473–80.

diff --git a/doc/report.md b/doc/report.md
index c70a2e7..f441107 100644
--- a/doc/report.md
+++ b/doc/report.md
@@ -1,7 +1,7 @@
-Credit Card Default Predicting
+Credit Card Default Prediction
 ================
-Selma Duric, Lara Habashy, Hazel Jiang
-2020-11-28
+Selma Duric, Lara Habashy, Hazel Jiang
+11/28/2020
 
 - [Summary](#summary)
 - [Introduction](#introduction)
@@ -13,6 +13,15 @@ Selma Duric, Lara Habashy, Hazel Jiang
 
 ## Summary
 
+Here we attempt to apply two machine learning models,
+`LogisticRegression` and `RandomForest`, to a credit card default data
+set and find the better model, with optimized hyperparameters, to
+predict whether a client is likely to default on their credit card
+payment, so that banks can lower their risk by issuing credit cards to
+more reliable clients. `LogisticRegression` performed better than
+`RandomForest`. Our best prediction has an f1 score of 0.51 with
+optimized hyperparameters *C=382* and *class\_weight=‘balanced’*.
+
 ## Introduction
 
 In recent years, credit card becomes more and more popular in Taiwan.
@@ -51,13 +60,13 @@ how timely they were with their bill payments and the corresponding
 bill amounts for each time period. The bill payment information contains
 records from April 2005 to September 2005 and each individual have the
 same number of time periods. The data was collected from an important
-cash and credit card issuing bank in Taiwan.We will make our prediction
+cash and credit card issuing bank in Taiwan. We will make our prediction
 based on the features given by the data.
 
 ### Analysis
 
 There are 30,000 observations of distinct credit card clients in this
-data set with each row represents a client. 25 different features are
+data set with each row representing a client. 25 different features are
 included with information of each given client, such as gender, age,
 approved credit limit, education, marital status, their past payment
 history, bill statements, and previous payments for 6 months (April-Sept
@@ -65,7 +74,7 @@ history, bill statements, and previous payments for 6 months (April-Sept
 observation has the same number of time periods.
 [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md)
 is a more detailed exploratory analysis that explained how we transform
-and use each feature.There exists class imbalance in the data set, and
+and use each feature. There exists class imbalance in the data set, and
 one pattern we found is that people with higher credit card limit are
 more likely to default their payment.
 
@@ -83,8 +92,8 @@ Clients
 
 Another pattern we found is that there exists a correlation between
-education level are default payment. Will analyzing this feature further
-in our machine learning model.
+education level and default payment. We will analyze this feature
+further in our machine learning model.
@@ -98,34 +107,36 @@ Figure 2. Correlation Between Educational level and Default Payment
-Both `LogisticRegression` and `RandomForest` model from
-scikit-learn(Pedregosa et al. 2011) will be used to build this
-classification model to predict whether a client will default on the
-credit card payment. Because of the class imbalance we have, we will
-look at test accuracy as well as f1 scores on both model. For each
-model, the appropriate hyperparameters were chosen using 5-fold cross
-validation. The R(R Core Team 2020) and Python(Van Rossum and Drake
-2009) programming languages and the following R and Python packages were
-used to perform the analysis: docopt(de Jonge 2018), feather(Wickham
-2019), knitr(Xie 2020), tidyverse(Wickham 2017)and Pandas(team 2020)
+Both a linear classification model `LogisticRegression` and an ensemble
+decision tree classification model `RandomForest` from
+scikit-learn(Pedregosa et al. 2011) will be used to build this
+classification model to see which better predicts whether a client will
+default on the credit card payment. Because of the class imbalance we
+have, we will look at test accuracy as well as f1 scores on both models.
+For each model, the appropriate hyperparameters were chosen using 5-fold
+cross validation. The R(R Core Team 2020) and Python(Van Rossum and
+Drake 2009) programming languages and the following R and Python
+packages were used to perform the analysis: docopt(de Jonge 2018),
+feather(Wickham 2019), knitr(Xie 2020), tidyverse(Wickham 2017) and
+Pandas(team 2020).
 
 The code used to perform the analysis and create this report can be
 found
 [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src)
 
 ## Results & Discussion
 
-To look at which model is better for the prediction, we first compare
-the two models with default hyperparameters. We used `DummyRegression`
-with `strategy='prior'` as our baseline. Although it has an accuracy
-score of 0.78, it is not very reliable because we have class imbalance
-in the data set and f1 score is more important in our prediction. Our
-baseline has f1 score of 0, which is not good. On the other hand, both
+To look at which model is better for prediction, we first compare the
+two models with default hyperparameters. We used `DummyClassifier` with
+`strategy='prior'` as our baseline. Although it has an accuracy score of
+0.78, it is not very reliable because we have class imbalance in the
+data set and f1 score is more important in our prediction. Our baseline
+has an f1 score of 0, which is not good. On the other hand, both
 `RandomForest` and `LogisticRegression` has better score on f1.
 `RandomForest` has a very high f1 on the training set, but the score is
 low on the validation set, and there exists a huge gap between the two
 scores, which means we have an overfitting problem. On the other hand,
 `LogisticRegression` has very similar training and validation f1 scores,
-it has a higher f1 score compare to `RandomForest` model. Therefore, we
+it has a higher f1 score compared to `RandomForest` model. Therefore, we
 believe `LogisticRegression` is a better model to use for prediction.
 
 | X1 | Baseline | Random Forest | Logistic Regression |
@@ -135,21 +146,22 @@ believe `LogisticRegression` is a better model to use for prediction.
 | mean\_f1\_train | 0.0000 | 0.9988 | 0.5125 |
 | mean\_f1\_validation | 0.0000 | 0.4756 | 0.5109 |
 
-Table 1. This is a comparison betweeen accuracy and f1 with default
-hyperparameters for each model
-
-Then we did hyperparameter tuning for both models and compare the
-results with previous table. The hyperparameters we chose for
-`RandomForest` is `n_estimators` (low=10, high=300) and `max_depth`
-(low=1, high=5000). The hyperparameters for `LogisticRegression` is
-`class_weight` (“balanced” vs “none”) and `C` (low=0, high=1000). We
-only focus on f1 score in this comparasion since it is more relavant to
-the issue we care about. We ranked the f1 score from high to low. As
-indicated in the table, our best f1 score is 0.51 with hyperparameter
-*C=382* and *class\_weight=‘balanced’*. The results also show that the
-top 3 f1 scores are all come from `LogisricRegression`. This finding
-further confirmed our results from previous table that
-`LogisticRegression` is a better model to use than `RandomForest`.
+Table 1. Comparison between accuracy and f1 with default hyperparameters
+for each model
+
+Since the validation scores were comparable, we decided to tune
+hyperparameters for both models and compare the results with the
+previous table. The hyperparameters we chose for `RandomForest` are
+`n_estimators` (low=10, high=300) and `max_depth` (low=1, high=5000).
+The hyperparameters for `LogisticRegression` are `class_weight`
+(“balanced” vs “none”) and `C` (low=0, high=1000). We only focus on f1
+score in this comparison since it is more relevant to the issue we care
+about. We ranked the f1 scores from high to low. As indicated in the
+table, our best f1 score is 0.51 with hyperparameters *C=382* and
+*class\_weight=‘balanced’*. The results also show that the top 3 f1
+scores all come from `LogisticRegression`. This finding further
+confirms our result from the previous table that `LogisticRegression`
+is a better model to use than `RandomForest`.
 
 | mean f1 score | params | model |
 | ------------: | :----------------------------------------- | :----------------- |
 | 0.5105468 | {‘C’: 382, ‘class\_weight’: ‘balanced’} | LogisticRegression |
 | 0.5103729 | {‘C’: 679, ‘class\_weight’: ‘balanced’} | LogisticRegression |
 | 0.5102955 | {‘C’: 559, ‘class\_weight’: ‘balanced’} | LogisticRegression |
 | 0.4770697 | {‘max\_depth’: 946, ‘n\_estimators’: 161} | RandomForest |
 | 0.4732701 | {‘max\_depth’: 1793, ‘n\_estimators’: 168} | RandomForest |
 | 0.4712085 | {‘max\_depth’: 560, ‘n\_estimators’: 94} | RandomForest |
 | 0.4665116 | {‘max\_depth’: 1408, ‘n\_estimators’: 43} | RandomForest |
 | 0.4423804 | {‘max\_depth’: 736, ‘n\_estimators’: 20} | RandomForest |
 | 0.3958155 | {‘C’: 158, ‘class\_weight’: ‘none’} | LogisticRegression |
 | 0.3958155 | {‘C’: 596, ‘class\_weight’: ‘none’} | LogisticRegression |
 
-Table 2. This is the result of f1 score with optimized hyperpamaters for
-each model
+Table 2. F1 score with optimized hyperparameters for each model
 
 Based on the result above, we find that although `LogisticRegression` is
 a better model to use, the f1 score is only around 0.5. It is not a very
@@ -188,14 +199,6 @@ Language*.
 <https://CRAN.R-project.org/package=docopt>.
-
-Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O.
-Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in
-Python.” *Journal of Machine Learning Research* 12: 2825–30.
-
-
-
R Core Team. 2020. *R: A Language and Environment for Statistical