Changed the conf format
zhaoxingfeng committed Dec 16, 2018
1 parent 526fe5d commit 3c507e4
Showing 3 changed files with 41 additions and 38 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -1,11 +1,11 @@
# WOE

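-WOE binning is commonly used in traditional credit scorecard models to map raw features nonlinearly. Common binning methods include equal-width, equal-frequency, and optimal binning. This project applies the idea of decision-tree-based optimal binning, choosing the optimal bins by iv value, and can handle both continuous and discrete variables
+WOE Transformation is commonly used in Credit Risk Scorecard models, where binning maps raw features nonlinearly. Common binning methods include equal-width, equal-frequency, and optimal binning. This project implements a decision-tree-based binning algorithm whose core idea is to choose the bins that maximize iv; it can handle both continuous and discrete variables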

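-1. Binning continuous variables: build a decision tree on a single feature-label pair. The optimal split point is the one that maximizes left-subtree iv + right-subtree iv; the node is split only if that sum exceeds the iv without splitting. Every leaf node must also contain more samples than the given minimum sample count. In the end, each parent node stores the split point used for binning, and each leaf node stores the bin's woe, iv, and positive/negative sample counts;
+1. Continuous variables: build a decision tree on a single feature-label pair. The optimal split point is the one that maximizes left-subtree iv + right-subtree iv; the node is split only if that sum exceeds the iv without splitting. Every leaf node must also contain more samples than the given minimum sample count. In the end, each parent node stores the split point used for binning, and each leaf node stores the bin's woe, iv, and positive/negative sample counts;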

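-2. Binning discrete variables: compute the woe of each discrete value of the feature, replace the values with their woe, and build a decision tree on the result in the same way as for continuous variables. Note that every split of the tree must record the original feature values involved. In the end, each leaf node stores the bin's original feature values, woe, iv, and positive/negative sample counts;
+2. Discrete variables: compute the woe of each discrete value of the feature, replace the values with their woe, and build a decision tree on the result in the same way as for continuous variables. Note that every split of the tree must record the original feature values involved. In the end, each leaf node stores the bin's original feature values, woe, iv, and positive/negative sample counts;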

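-3. Extract the split points and each bin's original feature values, woe, iv, and positive/negative sample counts to form the binning rules, then use them to apply the woe transformation to the raw data
+3. Extract the split points and each bin's original feature values, woe, iv, and positive/negative sample counts stored in the tree structure to form the binning rules. In the generated rules, bin_value_list lists the original feature values in each bin of a discrete feature; split_left is the left bound of a continuous-feature bin (>), and split_right is its right bound (<=); iv_sum is the sum of iv over all bins of the feature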

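On the UCI credit card dataset (default and payment records of credit card clients), the binning results of model builder and of this method were compared. The decision-tree-based optimal binning did better than model builder: the number of bins is reasonable, samples are spread evenly across bins, and the iv values are larger than those model builder produced.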
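
The split criterion described in the README (split only when left iv + right iv exceeds the unsplit iv, and only when both children are larger than the minimum sample count) can be sketched in plain Python. This is a minimal illustration, not the code in woe.py: the names `bin_iv` and `best_split`, the `eps` smoothing for empty classes, and the woe sign convention ln(bad rate / good rate) are all assumptions made for the example.

```python
import math


def bin_iv(bad, good, total_bad, total_good, eps=1e-6):
    """IV contribution of one bin.

    Assumed convention: woe = ln(bad_rate / good_rate).
    eps (an assumption) avoids log(0) when a class is absent from the bin.
    """
    bad_rate = max(bad, eps) / total_bad
    good_rate = max(good, eps) / total_good
    woe = math.log(bad_rate / good_rate)
    return (bad_rate - good_rate) * woe


def best_split(values, labels, total_bad, total_good, min_samples):
    """Return (split_point, iv_sum) maximising left iv + right iv, or None.

    Samples with value <= split_point go left. A split is accepted only if
    both children hold more than min_samples rows and the summed iv
    exceeds the iv of the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    n_bad = sum(label for _, label in pairs)  # label 1 = bad, 0 = good
    n_good = n - n_bad
    parent_iv = bin_iv(n_bad, n_good, total_bad, total_good)

    best = None
    left_bad = left_good = 0
    for i in range(n - 1):
        value, label = pairs[i]
        left_bad += label
        left_good += 1 - label
        if value == pairs[i + 1][0]:
            continue  # a cut is only possible between distinct values
        left_n = i + 1
        if left_n <= min_samples or n - left_n <= min_samples:
            continue  # a child would be too small
        iv_sum = (bin_iv(left_bad, left_good, total_bad, total_good)
                  + bin_iv(n_bad - left_bad, n_good - left_good,
                           total_bad, total_good))
        if iv_sum > parent_iv and (best is None or iv_sum > best[1]):
            best = (value, iv_sum)
    return best


# Toy data: the label flips from 0 to 1 between the values 4 and 5.
toy_values = [1, 2, 3, 4, 5, 6, 7, 8]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_split(toy_values, toy_labels, total_bad=4, total_good=4, min_samples=1))
```

Applying `best_split` recursively to the left and right subsets would yield the tree the README describes; the recursion and the per-leaf bookkeeping are left out to keep the sketch short.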
52 changes: 26 additions & 26 deletions f_conf/credit_card.conf
@@ -1,26 +1,26 @@
-is_continous,is_identify,var_dtype,var_name
-0,1,object,ID
-1,0,int64,LIMIT_BAL
-0,0,object,SEX
-0,0,object,EDUCATION
-0,0,object,MARRIAGE
-1,0,int64,AGE
-1,0,int64,PAY_0
-1,0,int64,PAY_2
-1,0,int64,PAY_3
-1,0,int64,PAY_4
-1,0,int64,PAY_5
-1,0,int64,PAY_6
-1,0,int64,BILL_AMT1
-1,0,int64,BILL_AMT2
-1,0,int64,BILL_AMT3
-1,0,int64,BILL_AMT4
-1,0,int64,BILL_AMT5
-1,0,int64,BILL_AMT6
-1,0,int64,PAY_AMT1
-1,0,int64,PAY_AMT2
-1,0,int64,PAY_AMT3
-1,0,int64,PAY_AMT4
-1,0,int64,PAY_AMT5
-1,0,int64,PAY_AMT6
-0,1,int64,target
+is_continous,var_dtype,var_name
+1,int64,LIMIT_BAL
+0,int64,SEX
+0,int64,EDUCATION
+0,int64,MARRIAGE
+1,int64,AGE
+1,int64,PAY_0
+1,int64,PAY_2
+1,int64,PAY_3
+1,int64,PAY_4
+1,int64,PAY_5
+1,int64,PAY_6
+1,int64,BILL_AMT1
+1,int64,BILL_AMT2
+1,int64,BILL_AMT3
+1,int64,BILL_AMT4
+1,int64,BILL_AMT5
+1,int64,BILL_AMT6
+1,int64,PAY_AMT1
+1,int64,PAY_AMT2
+1,int64,PAY_AMT3
+1,int64,PAY_AMT4
+1,int64,PAY_AMT5
+1,int64,PAY_AMT6
+-1,int64,ID
+-1,int64,label
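
The new conf format folds the old `is_identify` column into `is_continous`: 1 means continuous, 0 means discrete, and -1 means the column takes no part in binning. The sketch below shows one way to read such a file with only the standard library; woe.py itself uses `pd.read_csv`, and the helper name `split_variables` and the shortened `CONF_TEXT` sample are assumptions made for illustration.

```python
import csv
import io

# Shortened sample in the new three-column format (illustrative subset).
CONF_TEXT = """is_continous,var_dtype,var_name
1,int64,LIMIT_BAL
0,int64,SEX
1,int64,AGE
-1,int64,ID
-1,int64,label
"""


def split_variables(conf_file):
    """Return (continuous, discrete) variable names from a conf file object."""
    continuous, discrete = [], []
    for row in csv.DictReader(conf_file):
        flag = int(row["is_continous"])
        if flag == 1:
            continuous.append(row["var_name"])
        elif flag == 0:
            discrete.append(row["var_name"])
        # flag == -1: the column (for example ID or label) is not binned
    return continuous, discrete


continuous_vars, discrete_vars = split_variables(io.StringIO(CONF_TEXT))
print(continuous_vars)
print(discrete_vars)
```

Because rows flagged -1 fall into neither list, the identifier and label columns need no separate exclusion flag, which appears to be the motivation for dropping `is_identify`.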
19 changes: 11 additions & 8 deletions woe.py
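@@ -75,8 +75,7 @@ class WoeFeatureProcess(object):
     def __init__(self, path_conf, path_woe_rule, min_sample_rate=0.1, min_iv=0.0005):
         """
         :param path_conf: describes each feature
-            is_continous: 1 for a continuous variable, 0 for a discrete variable
-            is_identify: set to 1 to exclude the feature from the woe transformation
+            is_continous: 1 for a continuous variable, 0 for a discrete variable, -1 to exclude the feature from binning
             var_dtype: data type of the feature
             var_name: feature name
         :param path_woe_rule: path where the feature bins are stored in csv format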
@@ -85,8 +84,8 @@ def __init__(self, path_conf, path_woe_rule, min_sample_rate=0.1, min_iv=0.0005)
         """
         self.dataset = None
         self.conf = pd.read_csv(path_conf)
-        self.continous_var_list = self.conf[(self.conf['is_continous'] == 1) & (self.conf['is_identify'] == 0)]['var_name']
-        self.discrete_var_list = self.conf[(self.conf['is_continous'] == 0) & (self.conf['is_identify'] == 0)]['var_name']
+        self.continous_var_list = self.conf[self.conf['is_continous'] == 1]['var_name']
+        self.discrete_var_list = self.conf[self.conf['is_continous'] == 0]['var_name']
         self.woe_rule_dict = dict()
         self.woe_rule_df = pd.DataFrame()
         self.path_woe_rule = path_woe_rule
@@ -97,6 +96,8 @@ def __init__(self, path_conf, path_woe_rule, min_sample_rate=0.1, min_iv=0.0005)
         self.min_iv = min_iv

     def fit(self, dataset):
+        if 'label' not in dataset.columns:
+            raise ValueError("The dataset must contains label(0&1)!")
         self.dataset = dataset
         self.total_bad_cnt = dataset[dataset['label'] == 1].__len__()
         self.total_good_cnt = dataset[dataset['label'] == 0].__len__()
@@ -106,21 +107,23 @@ def fit(self, dataset):
         for var in self.continous_var_list:
             if var in self.dataset.columns:
                 print(var.center(80, '='))
+                self.dataset[var] = self.dataset[var].astype(self.conf.loc[self.conf['var_name'] == var, 'var_dtype'].values[0])
                 var_df = self.fit_continous(self.dataset[[var, 'label']], var)
                 self.woe_rule_df = var_df if self.woe_rule_df.empty else pd.concat([self.woe_rule_df, var_df], ignore_index=1)

         print("PROCESS DISCRETE VARIABLES".center(80, '='))
         for var in self.discrete_var_list:
             if var in self.dataset.columns:
                 print(var.center(80, '='))
+                self.dataset[var] = self.dataset[var].astype(self.conf.loc[self.conf['var_name'] == var, 'var_dtype'].values[0])
                 var_df = self.fit_discrete(self.dataset[[var, 'label']], var)
                 self.woe_rule_df = var_df if self.woe_rule_df.empty else pd.concat([self.woe_rule_df, var_df], ignore_index=1)

         cols = ['var_name', 'bin_value_list', 'split_left', 'split_right', 'sub_sample_cnt', 'sub_sample_bad_cnt',
                 'sub_sample_good_cnt', 'woe', 'iv', 'iv_sum']
-        self.woe_rule_df = self.woe_rule_df.sort_values(by=['var_name', 'split_left']).reset_index(drop=True)
-        self.woe_rule_df = self.woe_rule_df[cols]
-        self.woe_rule_df = self.woe_rule_df.sort_values(by=['iv_sum', 'var_name'], ascending=False).reset_index(drop=True)
+        self.woe_rule_df = self.woe_rule_df.sort_values(by=['var_name', 'split_left'], ascending=True)
+        self.woe_rule_df = self.woe_rule_df.sort_values(by=['iv_sum', 'var_name'], ascending=False)
+        self.woe_rule_df = self.woe_rule_df[cols].reset_index(drop=True)
         self.woe_rule_df.to_csv(self.path_woe_rule, index=None, float_format="%.4f")

         for var, grp in self.woe_rule_df.groupby(['var_name']):
@@ -194,7 +197,7 @@ def fit_discrete(self, dataset, var):
         temp = sorted(value_woe_dict.iteritems(), key=lambda x: x[1])
         bin_woe_list, bin_value_list = [x[1] for x in temp], [x[0] for x in temp]
         var_tree = self._fit_discrete(dataset, var, bin_value_list, bin_woe_list)
-        # print(var_tree.describe_tree())
+        print(var_tree.describe_tree())
         woe_iv_list, split_value_list = var_tree.format_tree(var_tree, [], [])

         var_df = pd.DataFrame({"var_name": var,
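
The README in this commit defines the rule columns: a continuous bin covers `split_left < x <= split_right`, and a discrete bin lists its members in `bin_value_list`. A hedged sketch of how such rules could be applied to a single value follows. The rule rows and woe numbers are invented for illustration, and the functions `woe_continuous` and `woe_discrete` are hypothetical helpers, not part of woe.py.

```python
import math

# Hypothetical rule rows using the column names from the README.
# The bounds and woe values are made up for this example.
CONTINUOUS_RULES = [
    {"split_left": -math.inf, "split_right": 30.0, "woe": -0.25},
    {"split_left": 30.0, "split_right": 50.0, "woe": 0.10},
    {"split_left": 50.0, "split_right": math.inf, "woe": 0.40},
]
DISCRETE_RULES = [
    {"bin_value_list": [1], "woe": 0.15},
    {"bin_value_list": [2, 3], "woe": -0.05},
]


def woe_continuous(value, rules):
    """Return the woe of the bin with split_left < value <= split_right."""
    for rule in rules:
        if rule["split_left"] < value <= rule["split_right"]:
            return rule["woe"]
    return None  # value matched no bin


def woe_discrete(value, rules):
    """Return the woe of the bin whose bin_value_list contains value."""
    for rule in rules:
        if value in rule["bin_value_list"]:
            return rule["woe"]
    return None  # value not seen when the rules were fitted


print(woe_continuous(30, CONTINUOUS_RULES))  # right bound is inclusive
print(woe_discrete(3, DISCRETE_RULES))
```

Returning `None` for an unmatched value is a choice made for this sketch; how unseen values should be handled is not stated in the commit.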