Merge pull request #39 from HazelJJJ/main
Finalized report
HazelJJJ authored Nov 29, 2020
2 parents ef35b01 + 688e27b commit f52870f
Showing 4 changed files with 1,072 additions and 0 deletions.
85 changes: 85 additions & 0 deletions doc/reference.bib
@@ -0,0 +1,85 @@
@article{yeh2009comparisons,
title={The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients},
author={Yeh, I-Cheng and Lien, Che-hui},
journal={Expert Systems with Applications},
volume={36},
number={2},
pages={2473--2480},
year={2009},
publisher={Elsevier}
}

@Manual{R,
title = {R: A Language and Environment for Statistical Computing},
author = {{R Core Team}},
organization = {R Foundation for Statistical Computing},
address = {Vienna, Austria},
year = {2020},
url = {https://www.R-project.org/},
}

@Manual{tidyverse,
title = {tidyverse: Easily Install and Load the 'Tidyverse'},
author = {Hadley Wickham},
year = {2017},
note = {R package version 1.2.1},
url = {https://CRAN.R-project.org/package=tidyverse},
}

@Manual{knitr,
title = {knitr: A General-Purpose Package for Dynamic Report Generation in R},
author = {Yihui Xie},
year = {2020},
note = {R package version 1.29},
url = {https://yihui.org/knitr/},
}

@Manual{docopt,
title = {docopt: Command-Line Interface Specification Language},
author = {Edwin {de Jonge}},
year = {2018},
note = {R package version 0.6.1},
url = {https://CRAN.R-project.org/package=docopt},
}

@Manual{featherr,
title = {feather: R Bindings to the Feather 'API'},
author = {Hadley Wickham},
year = {2019},
note = {R package version 0.3.5},
url = {https://CRAN.R-project.org/package=feather}
}

@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}

@book{Python,
author = {Van Rossum, Guido and Drake, Fred L.},
title = {Python 3 Reference Manual},
year = {2009},
isbn = {1441412697},
publisher = {CreateSpace},
address = {Scotts Valley, CA}
}

@software{reback2020pandas,
author = {The pandas development team},
title = {pandas-dev/pandas: Pandas},
  month        = aug,
year = 2020,
publisher = {Zenodo},
version = {1.1.1},
doi = {10.5281/zenodo.3993412},
url = {https://doi.org/10.5281/zenodo.3993412}
}


70 changes: 70 additions & 0 deletions doc/report.Rmd
@@ -0,0 +1,70 @@
---
title: "Credit Card Default Prediction"
author: "Selma Duric, Lara Habashy, Hazel Jiang"
date: "11/28/2020"
bibliography: reference.bib
output:
html_document:
toc: TRUE
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
```

## Summary

Here we apply two machine learning models, `LogisticRegression` and `RandomForest`, to a credit card default data set and search for the better model, with optimized hyperparameters, for predicting whether a client is likely to default on their credit card payment. Such a model would lower banks' risk by helping them issue credit cards to more reliable clients. `LogisticRegression` performed better than `RandomForest`. Our best model achieved an f1 score of 0.51 with the optimized hyperparameters *C=382* and *class_weight='balanced'*.

## Introduction

In recent years, credit cards have become more and more popular in Taiwan. Because card-issuing banks are all trying to increase market share, more cards are issued to unqualified applicants who are not able to pay their credit card bills on time, a behavior that is harmful to both banks and cardholders [@yeh2009comparisons]. It is always better to prevent a problem than to solve it. By detecting patterns among people who tend to default on their credit card payments, banks can minimize the risk of issuing credit cards to people who may not be able to pay on time.

Here we would like to use a machine learning algorithm to predict whether a person is going to default on his/her credit card payment. We test different models and hyperparameters to find the best prediction score. With such a model, banks can predict whether an applicant has the ability to pay on time and make a better decision on whether to issue that person a credit card. Thus, if the machine learning algorithm can make accurate predictions, banks are able to find reliable applicants and minimize their losses from default payments.

## Methods

### Data

The dataset we are using in this project originates from the Department of Information Management at Chung Hua University, Taiwan and the Department of Civil Engineering at Tamkang University, Taiwan. It was sourced from the UCI Machine Learning Repository and can be found [here](http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#). [This file](http://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls) is what we used to build the model. The data set contains 30,000 observations, each representing an individual credit card client in Taiwan. Each row contains relevant information about the individual as well as how timely they were with their bill payments and the corresponding bill amounts for each time period. The bill payment information covers April 2005 to September 2005, and each individual has the same number of time periods. The data was collected from an important cash and credit card issuing bank in Taiwan. We make our predictions based on the features given in the data.

### Analysis

There are 30,000 observations of distinct credit card clients in this data set, with each row representing a client. 25 different features describe each client, such as gender, age, approved credit limit, education, marital status, past payment history, bill statements, and previous payments over 6 months (April–September 2005). Feature transformations were applied to the given features so that each observation has the same number of time periods. [Here](https://github.com/UBC-MDS/DSCI522_group_12/blob/main/src/project_eda.md) is a more detailed exploratory analysis explaining how we transformed and used each feature. There is class imbalance in the data set, and one pattern we found is that people with a higher credit limit are more likely to default on their payment.


```{r limit plot, fig.cap='Figure 1. Density of Credit Limit Between Default Clients and On-time Clients', out.width='45%'}
knitr::include_graphics('../results/density_plot.png')
```

Another pattern we found is that there is a correlation between education level and default payment. We analyze this feature further in our machine learning model.

```{r education level, fig.cap='Figure 2. Correlation Between Educational level and Default Payment', out.width='45%'}
knitr::include_graphics('../results/correlation_plot.png')
```

Both a linear classification model, `LogisticRegression`, and an ensemble decision tree classification model, `RandomForest`, from scikit-learn [@scikit-learn] were used to build the classifier, to see which better predicts whether a client will default on the credit card payment. Because of the class imbalance, we look at test accuracy as well as f1 scores for both models. For each model, the appropriate hyperparameters were chosen using 5-fold cross-validation. The R [@R] and Python [@Python] programming languages and the following R and Python packages were used to perform the analysis: docopt [@docopt], feather [@featherr], knitr [@knitr], tidyverse [@tidyverse], and pandas [@reback2020pandas].
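The comparison above can be sketched as follows. This is a minimal illustration, not the project's actual script: it stands in for the credit card data with a synthetic imbalanced data set from `make_classification`, and the preprocessing steps are omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Stand-in for the credit card data: an imbalanced binary target
# (roughly 78% non-default vs 22% default, as in the real data set).
X, y = make_classification(n_samples=3000, n_features=23,
                           weights=[0.78, 0.22], random_state=123)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=123),
}
for name, model in models.items():
    # 5-fold cross-validation, tracking both accuracy and f1
    scores = cross_validate(model, X, y, cv=5,
                            scoring=("accuracy", "f1"),
                            return_train_score=True)
    print(name,
          "train f1:", round(float(np.mean(scores["train_f1"])), 2),
          "valid f1:", round(float(np.mean(scores["test_f1"])), 2))
```

A large gap between the train and validation f1 scores for one model but not the other is the overfitting signal discussed in the Results section.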

The code used to perform the analysis and create this report can be found [here](https://github.com/UBC-MDS/DSCI522_group_12/tree/main/src).

## Results & Discussion

To determine which model is better for prediction, we first compare the two models with default hyperparameters. We used `DummyClassifier` with `strategy='prior'` as our baseline. Although it has an accuracy score of 0.78, that score is not very reliable because of the class imbalance in the data set, and the f1 score matters more for our prediction; our baseline has an f1 score of 0. Both `RandomForest` and `LogisticRegression` have better f1 scores. `RandomForest` has a very high f1 score on the training set but a low score on the validation set, and the large gap between the two indicates an overfitting problem. `LogisticRegression`, on the other hand, has very similar training and validation f1 scores, and its validation f1 score is higher than that of the `RandomForest` model. Therefore, we believe `LogisticRegression` is the better model for prediction.
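The baseline behavior can be reproduced with a small sketch (again on synthetic stand-in data, with hypothetical variable names): with `strategy='prior'` the dummy model always predicts the majority (non-default) class, so accuracy matches the majority-class rate while f1 for the default class is 0.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card data set
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22],
                           random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=123)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
pred = baseline.predict(X_valid)

print("accuracy:", round(accuracy_score(y_valid, pred), 2))  # near the majority-class rate
print("f1:", f1_score(y_valid, pred))  # 0.0 — no defaults are ever predicted
```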

```{r compare results, message=FALSE, warning=FALSE}
result_compare <- read_csv('../results/prediction_prelim_results.csv')
knitr::kable(result_compare, caption = 'Table 1. Comparison between accuracy and f1 scores with default hyperparameters for each model')
```

Since the validation scores were comparable, we tuned hyperparameters for both models and compared the results with the previous table. The hyperparameters we searched for `RandomForest` were `n_estimators` (low=10, high=300) and `max_depth` (low=1, high=5000). The hyperparameters for `LogisticRegression` were `class_weight` (`'balanced'` vs `None`) and `C` (low=0, high=1000). We focus only on the f1 score in this comparison since it is more relevant to the issue we care about, and we ranked the f1 scores from high to low. As indicated in the table, our best f1 score is 0.51 with hyperparameters *C=382* and *class_weight='balanced'*. The results also show that the top 3 f1 scores all come from `LogisticRegression`, which further confirms the finding from the previous table that `LogisticRegression` is a better model to use than `RandomForest`.
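A search over the ranges described above could be sketched as follows. The exact search procedure used in the project is not shown here; this illustration assumes a randomized search (`RandomizedSearchCV`) on synthetic stand-in data.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, weights=[0.78, 0.22],
                           random_state=123)

searches = {
    # n_estimators in [10, 300), max_depth in [1, 5000), as described above
    "RandomForest": RandomizedSearchCV(
        RandomForestClassifier(random_state=123),
        {"n_estimators": randint(10, 300), "max_depth": randint(1, 5000)},
        n_iter=5, cv=5, scoring="f1", random_state=123),
    # class_weight 'balanced' vs None, C roughly in (0, 1000]
    "LogisticRegression": RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        {"class_weight": ["balanced", None], "C": uniform(0.01, 1000)},
        n_iter=5, cv=5, scoring="f1", random_state=123),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 2))
```

Scoring with `scoring="f1"` makes the search rank candidates by the metric the report cares about, rather than by accuracy.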

```{r f1 results, message=FALSE, warning=FALSE}
result_f1 <- read_csv('../results/prediction_hp_results.csv')
knitr::kable(result_f1, caption='Table 2. F1 scores with optimized hyperparameters for each model')
```

Based on the results above, we find that although `LogisticRegression` is the better model to use, its f1 score is only around 0.5. This is not a very good score, which means predictions from this model are not especially reliable. To further improve the model in the future, it would be worthwhile to consider other hyperparameters or apply feature engineering to add more useful features to help with prediction. Furthermore, we may also want to look at the confusion matrix of model performance and try to minimize the false negatives in the predictions by changing the model's decision threshold.
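The threshold idea mentioned above can be sketched like this (on synthetic stand-in data): lowering the default 0.5 threshold on the predicted default probability flags more clients as likely defaulters, trading false negatives for false positives.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.78, 0.22],
                           random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=123)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = model.predict_proba(X_valid)[:, 1]  # predicted probability of default
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_valid, pred).ravel()
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

Lowering the threshold can only keep or shrink the false-negative count, since every client flagged at 0.5 is still flagged at 0.3; the cost is more false positives, i.e. reliable clients denied a card.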


## References
663 changes: 663 additions & 0 deletions doc/report.html

Large diffs are not rendered by default.

