commit 3c507e4 (1 parent: 526fe5d)
Showing 3 changed files with 41 additions and 38 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,11 @@ | ||
# WOE | ||
|
||
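Traditional credit scorecard models commonly use WOE binning to apply a nonlinear mapping to the raw features. Common binning methods include equal-width, equal-frequency, and optimal binning. Here we adopt the idea of decision-tree-based optimal binning, finding the optimal bins based on IV; it handles both continuous and discrete variables. | ||
WOE Transformation is commonly used in Credit Risk Scorecard models, where binning applies a nonlinear mapping to the raw features. Common binning methods include equal-width, equal-frequency, and optimal binning. This project designs a decision-tree-based binning algorithm whose core idea is to find the optimal bins by maximizing IV; it handles both continuous and discrete variables. | ||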
|
||
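1. Binning continuous variables: build a decision tree for each feature-label pair. When choosing the best split point, maximize the sum of the left-subtree IV and the right-subtree IV; split if that sum exceeds the IV without splitting, otherwise do not split. Each leaf node must also contain more samples than the given minimum sample count. In the end, each parent node stores the split point used for binning, and each leaf node stores the WOE, IV, and positive and negative sample counts of its bin; | ||
1. Continuous variables: build a decision tree for each feature-label pair. When choosing the best split point, maximize the sum of the left-subtree IV and the right-subtree IV; split if that sum exceeds the IV without splitting, otherwise do not split. Each leaf node must also contain more samples than the given minimum sample count. In the end, each parent node stores the split point used for binning, and each leaf node stores the WOE, IV, and positive and negative sample counts of its bin; | ||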
|
||
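2. Binning discrete variables: compute the WOE of each discrete value of the feature, replace each value with its WOE, and build a decision tree on the resulting samples in the same way as for continuous variables. Note that at every split of the tree, the original feature values involved in the split must be recorded. In the end, each leaf node stores the original feature values, WOE, IV, and positive and negative sample counts of its bin; | ||
2. Discrete variables: compute the WOE of each discrete value of the feature, replace each value with its WOE, and build a decision tree on the resulting samples in the same way as for continuous variables. Note that at every split of the tree, the original feature values involved in the split must be recorded. In the end, each leaf node stores the original feature values, WOE, IV, and positive and negative sample counts of its bin; | ||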
|
||
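3. Extract the split points, the original feature values in each bin, and the WOE, IV, and positive and negative sample counts to form the binning rules, then apply the WOE transformation to the raw data. | ||
3. Extract the split points stored in the tree structure, the original feature values in each bin, and the WOE, IV, and positive and negative sample counts to form the binning rules. In the generated binning rules, bin_value_list gives the original feature values assigned to each bin of a discrete feature; split_left is the left bound of a continuous-feature bin (>), and split_right is the right bound (<=); iv_sum is the sum of the IV over all bins of the feature. | ||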
|
||
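On the UCI credit card dataset of client defaults and payments, we compared the binning results of model builder with those of this method. The decision-tree-based optimal binning outperformed model builder: the number of bins is reasonable, the sample counts per bin are even, and the IV is larger than what model builder produces. |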
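The split criterion described in the README steps can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names `woe_iv` and `best_split`, the WOE sign convention (positive rate over negative rate), and the `eps` floor for empty counts are all assumptions made for the example. Thresholds follow the README's bound convention, so the left bin is `x <= threshold` and the right bin is `x > threshold`.

```python
import math


def woe_iv(pos, neg, pos_total, neg_total, eps=0.5):
    """WOE and IV of one bin, relative to the totals of the whole dataset.

    Assumption: WOE = ln(pos_rate / neg_rate); eps floors empty counts
    so the logarithm stays finite.
    """
    pos_rate = max(pos, eps) / pos_total
    neg_rate = max(neg, eps) / neg_total
    woe = math.log(pos_rate / neg_rate)
    return woe, (pos_rate - neg_rate) * woe


def best_split(values, labels, pos_total, neg_total, min_samples=0):
    """Return (threshold, iv_left + iv_right) for the best split of one node,
    or None when no allowed split beats the IV of the unsplit node.

    labels are 0/1 with 1 as the positive class (an assumption).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    node_pos = sum(y for _, y in pairs)
    _, base_iv = woe_iv(node_pos, n - node_pos, pos_total, neg_total)

    best = None
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal feature values
        if i <= min_samples or n - i <= min_samples:
            continue  # each child must hold more than min_samples rows
        right_pos = node_pos - left_pos
        _, iv_left = woe_iv(left_pos, i - left_pos, pos_total, neg_total)
        _, iv_right = woe_iv(right_pos, (n - i) - right_pos, pos_total, neg_total)
        total = iv_left + iv_right
        if total > base_iv and (best is None or total > best[1]):
            best = (pairs[i - 1][0], total)
    return best
```

Applying `best_split` recursively to the left and right halves, and stopping when it returns `None`, would give a tree of the kind the README describes; for a discrete feature the same search would run on the per-value WOE instead of the raw values.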
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,26 +1,26 @@ | ||
is_continous,is_identify,var_dtype,var_name | ||
0,1,object,ID | ||
1,0,int64,LIMIT_BAL | ||
0,0,object,SEX | ||
0,0,object,EDUCATION | ||
0,0,object,MARRIAGE | ||
1,0,int64,AGE | ||
1,0,int64,PAY_0 | ||
1,0,int64,PAY_2 | ||
1,0,int64,PAY_3 | ||
1,0,int64,PAY_4 | ||
1,0,int64,PAY_5 | ||
1,0,int64,PAY_6 | ||
1,0,int64,BILL_AMT1 | ||
1,0,int64,BILL_AMT2 | ||
1,0,int64,BILL_AMT3 | ||
1,0,int64,BILL_AMT4 | ||
1,0,int64,BILL_AMT5 | ||
1,0,int64,BILL_AMT6 | ||
1,0,int64,PAY_AMT1 | ||
1,0,int64,PAY_AMT2 | ||
1,0,int64,PAY_AMT3 | ||
1,0,int64,PAY_AMT4 | ||
1,0,int64,PAY_AMT5 | ||
1,0,int64,PAY_AMT6 | ||
0,1,int64,target | ||
is_continous,var_dtype,var_name | ||
1,int64,LIMIT_BAL | ||
0,int64,SEX | ||
0,int64,EDUCATION | ||
0,int64,MARRIAGE | ||
1,int64,AGE | ||
1,int64,PAY_0 | ||
1,int64,PAY_2 | ||
1,int64,PAY_3 | ||
1,int64,PAY_4 | ||
1,int64,PAY_5 | ||
1,int64,PAY_6 | ||
1,int64,BILL_AMT1 | ||
1,int64,BILL_AMT2 | ||
1,int64,BILL_AMT3 | ||
1,int64,BILL_AMT4 | ||
1,int64,BILL_AMT5 | ||
1,int64,BILL_AMT6 | ||
1,int64,PAY_AMT1 | ||
1,int64,PAY_AMT2 | ||
1,int64,PAY_AMT3 | ||
1,int64,PAY_AMT4 | ||
1,int64,PAY_AMT5 | ||
1,int64,PAY_AMT6 | ||
-1,int64,ID | ||
-1,int64,label |
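A metadata file in the new layout could be consumed along the following lines. This is only a sketch: the function name `split_variables` is hypothetical, the sample rows are a subset copied from the diff, and reading `-1` as "not a feature" (identifier or label column) is an assumption inferred from the rows for `ID` and `label`, not something the commit states. The column name `is_continous` is spelled as it appears in the file.

```python
import csv
import io

# A few rows copied from the new version of the metadata file.
META_CSV = """is_continous,var_dtype,var_name
1,int64,LIMIT_BAL
0,int64,SEX
1,int64,AGE
-1,int64,ID
-1,int64,label
"""


def split_variables(text):
    """Group variable names by the is_continous flag.

    Assumed meaning of the flag: 1 = continuous, 0 = discrete,
    -1 = excluded from binning (identifier or label).
    """
    continuous, discrete, excluded = [], [], []
    for row in csv.DictReader(io.StringIO(text)):
        flag = int(row["is_continous"])
        if flag == 1:
            continuous.append(row["var_name"])
        elif flag == 0:
            discrete.append(row["var_name"])
        else:
            excluded.append(row["var_name"])
    return continuous, discrete, excluded
```

The continuous and discrete lists would then decide which of the two binning procedures from the README each feature goes through.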