Credit Risk Modeling in Practice (4): Scorecard Modeling with XGBoost
1. Score mapping logic for XGBoost
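In the previous article, Credit Risk Modeling in Practice (3): Scorecard Modeling with Logistic Regression, we introduced the score mapping that is standard in industry:
$$\text{score} = 650 + 50 \log_2\left(\frac{P_{\text{pos}}}{P_{\text{neg}}}\right)$$
Here score is the mapped score output, 650 is the base score, and 50 is the score factor; both the base score and the score factor can be adjusted to fit the actual business scenario. $P_{\text{neg}}$ is the probability that the sample is unexpected (a negative sample), and $P_{\text{pos}}$ is the probability that the sample is as expected (a positive sample). In XGBoost, we can map the score from the model's predicted probability for the negative class (label 1 means fraud, which is defined as the negative sample in business terms). Writing that predicted probability as pred gives the following score mapping formula:
$$\text{score} = 650 + 50 \log_2\left(\frac{1 - \text{pred}}{\text{pred}}\right)$$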
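As a minimal sketch of this mapping (the helper name `prob_to_score` and its default arguments are illustrative and do not come from the original code), the snippet below converts a predicted bad probability into a score. Because the logarithm is base 2, each doubling of the good-to-bad odds adds one score factor, which is 50 points here.

```python
import math

def prob_to_score(p_bad, base=650.0, factor=50.0):
    """Map a predicted bad probability to a score: base + factor * log2(good/bad odds)."""
    if not 0.0 < p_bad < 1.0:
        raise ValueError("p_bad must be strictly between 0 and 1")
    return base + factor * math.log2((1.0 - p_bad) / p_bad)

# Even odds (p_bad = 0.5) give the base score.
even_odds_score = prob_to_score(0.5)
# Odds of 4:1 in favour of good (p_bad = 0.2) are two doublings above even odds.
four_to_one_score = prob_to_score(0.2)
print(even_odds_score, four_to_one_score)
```

A lower predicted bad probability always yields a higher score, so the mapping only rescales the model output and reverses its direction.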
2. Training the XGBoost model
Importing dependencies
# Import dependencies
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn import metrics
import math
Loading the data
# Load the data
data = pd.read_csv('Acard.txt')
data.head()
      obs_mth  bad_ind        uid  td_score  jxl_score  mj_score  rh_score  zzc_score  zcx_score  person_info  finance_info  credit_info  act_info
0  2018-10-31      0.0  A10000005  0.675349   0.144072  0.186899  0.483640   0.928328   0.369644    -0.322581      0.023810         0.00  0.217949
1  2018-07-31      0.0   A1000002  0.825269   0.398688  0.139396  0.843725   0.605194   0.406122    -0.128677      0.023810         0.00  0.423077
2  2018-09-30      0.0   A1000011  0.315406   0.629745  0.535854  0.197392   0.614416   0.320731     0.062660      0.023810         0.10  0.448718
3  2018-07-31      0.0  A10000481  0.002386   0.609360  0.366081  0.342243   0.870006   0.288692     0.078853      0.071429         0.05  0.179487
4  2018-07-31      0.0   A1000069  0.406310   0.405352  0.783015  0.563953   0.715454   0.512554    -0.261014      0.023810         0.00  0.423077
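Selecting the training sample and the out-of-time (OOT) sample
# Use every month except '2018-11-30' as the in-time sample for model training
train = data[data.obs_mth != '2018-11-30'].reset_index(drop=True).copy()
# Use the '2018-11-30' month as the out-of-time (OOT) sample for model validation
valid = data[data.obs_mth == '2018-11-30'].reset_index(drop=True).copy()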
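The split above hard-codes the OOT month. The sketch below shows one way to hold out the most recent month instead, assuming the OOT month is the latest one in the data and that `obs_mth` holds ISO-formatted dates as in the preview. The helper `split_oot` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_oot(df, date_col="obs_mth"):
    """Hold out the most recent observation month as the out-of-time sample."""
    months = pd.to_datetime(df[date_col])
    is_oot = months == months.max()
    train_part = df[~is_oot].reset_index(drop=True)
    oot_part = df[is_oot].reset_index(drop=True)
    return train_part, oot_part

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "obs_mth": ["2018-07-31", "2018-09-30", "2018-11-30", "2018-11-30"],
    "bad_ind": [0.0, 1.0, 0.0, 0.0],
})
toy_train, toy_oot = split_oot(toy)
print(len(toy_train), len(toy_oot))
```

Splitting by time rather than at random keeps the validation set strictly later than the training data, which is closer to how the model will be used after deployment.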
Building the training and validation sets
# Select the features
feature_list = ['td_score', 'jxl_score', 'mj_score', 'rh_score', 'zzc_score', 'zcx_score',
                'person_info', 'finance_info', 'credit_info', 'act_info']
x_train = train[feature_list].copy()
y_train = train['bad_ind'].copy()
x_valid = valid[feature_list].copy()
y_valid = valid['bad_ind'].copy()
Training the XGBoost classifier
from multiprocessing import cpu_count

# Define the XGBoost training function
def xgb_test(x_train, y_train, x_valid, y_valid):
    clf = xgb.XGBClassifier(
        objective='binary:logistic',
        max_depth=2,
        n_estimators=800,            # maximum number of boosting rounds
        learning_rate=0.05,
        subsample=0.7,
        colsample_bytree=0.7,
        min_child_weight=50,
        reg_alpha=0.0,
        reg_lambda=1,
        random_state=4,
        n_jobs=max(cpu_count() - 1, 1),
        # XGBoost >= 1.6 accepts these in the constructor; newer releases no longer take them in fit()
        eval_metric='auc',
        early_stopping_rounds=100,
    )
    # Early stopping monitors the last entry of eval_set, i.e. the OOT sample
    clf.fit(x_train, y_train,
            eval_set=[(x_train, y_train), (x_valid, y_valid)])
    return clf

model = xgb_test(x_train, y_train, x_valid, y_valid)
[0] validation_0-auc:0.71269 validation_1-auc:0.68213
[1] validation_0-auc:0.71689 validation_1-auc:0.68692
[2] validation_0-auc:0.75801 validation_1-auc:0.73044
[3] validation_0-auc:0.79231 validation_1-auc:0.77025
[4] validation_0-auc:0.79175 validation_1-auc:0.76880
[5] validation_0-auc:0.79369 validation_1-auc:0.77189
[6] validation_0-auc:0.79850 validation_1-auc:0.77900
[7] validation_0-auc:0.79800 validation_1-auc:0.77741
[8] validation_0-auc:0.79903 validation_1-auc:0.77547
[9] validation_0-auc:0.79819 validation_1-auc:0.77658
[10] validation_0-auc:0.79744 validation_1-auc:0.77398
[11] validation_0-auc:0.80187 validation_1-auc:0.77610
[12] validation_0-auc:0.80160 validation_1-auc:0.77537
[13] validation_0-auc:0.80052 validation_1-auc:0.77682
[14] validation_0-auc:0.80034 validation_1-auc:0.77583
[15] validation_0-auc:0.79969 validation_1-auc:0.77641
[16] validation_0-auc:0.79943 validation_1-auc:0.77730
[17] validation_0-auc:0.79956 validation_1-auc:0.77821
[18] validation_0-auc:0.79996 validation_1-auc:0.77583
[19] validation_0-auc:0.80000 validation_1-auc:0.77647
[20] validation_0-auc:0.80101 validation_1-auc:0.77837
[21] validation_0-auc:0.80067 validation_1-auc:0.77950
[22] validation_0-auc:0.80015 validation_1-auc:0.77966
[23] validation_0-auc:0.80037 validation_1-auc:0.77948
[24] validation_0-auc:0.80045 validation_1-auc:0.77862
[25] validation_0-auc:0.79997 validation_1-auc:0.77620
[26] validation_0-auc:0.80046 validation_1-auc:0.77856
[27] validation_0-auc:0.80049 validation_1-auc:0.77795
[28] validation_0-auc:0.80113 validation_1-auc:0.77869
[29] validation_0-auc:0.80094 validation_1-auc:0.77912
[30] validation_0-auc:0.80053 validation_1-auc:0.77948
[31] validation_0-auc:0.80063 validation_1-auc:0.77918
[32] validation_0-auc:0.80079 validation_1-auc:0.77857
[33] validation_0-auc:0.80199 validation_1-auc:0.78021
[34] validation_0-auc:0.80190 validation_1-auc:0.77989
[35] validation_0-auc:0.80261 validation_1-auc:0.77975
[36] validation_0-auc:0.80239 validation_1-auc:0.77918
[37] validation_0-auc:0.80235 validation_1-auc:0.77880
[38] validation_0-auc:0.80220 validation_1-auc:0.77835
[39] validation_0-auc:0.80261 validation_1-auc:0.77880
[40] validation_0-auc:0.80249 validation_1-auc:0.77852
[41] validation_0-auc:0.80258 validation_1-auc:0.77886
[42] validation_0-auc:0.80235 validation_1-auc:0.77876
[43] validation_0-auc:0.80250 validation_1-auc:0.77817
[44] validation_0-auc:0.80224 validation_1-auc:0.77793
[45] validation_0-auc:0.80273 validation_1-auc:0.77764
[46] validation_0-auc:0.80275 validation_1-auc:0.77791
[47] validation_0-auc:0.80291 validation_1-auc:0.77818
[48] validation_0-auc:0.80264 validation_1-auc:0.77805
[49] validation_0-auc:0.80293 validation_1-auc:0.77784
[50] validation_0-auc:0.80288 validation_1-auc:0.77775
[51] validation_0-auc:0.80271 validation_1-auc:0.77761
[52] validation_0-auc:0.80297 validation_1-auc:0.77805
[53] validation_0-auc:0.80381 validation_1-auc:0.77869
[54] validation_0-auc:0.80328 validation_1-auc:0.77790
[55] validation_0-auc:0.80465 validation_1-auc:0.77887
[56] validation_0-auc:0.80441 validation_1-auc:0.77925
[57] validation_0-auc:0.80501 validation_1-auc:0.77941
[58] validation_0-auc:0.80558 validation_1-auc:0.77916
[59] validation_0-auc:0.80507 validation_1-auc:0.77942
[60] validation_0-auc:0.80538 validation_1-auc:0.77917
[61] validation_0-auc:0.80541 validation_1-auc:0.77905
[62] validation_0-auc:0.80593 validation_1-auc:0.77877
[63] validation_0-auc:0.80561 validation_1-auc:0.77884
[64] validation_0-auc:0.80583 validation_1-auc:0.77842
[65] validation_0-auc:0.80612 validation_1-auc:0.77815
[66] validation_0-auc:0.80658 validation_1-auc:0.77767
[67] validation_0-auc:0.80670 validation_1-auc:0.77849
[68] validation_0-auc:0.80662 validation_1-auc:0.77765
[69] validation_0-auc:0.80706 validation_1-auc:0.77709
[70] validation_0-auc:0.80740 validation_1-auc:0.77745
[71] validation_0-auc:0.80821 validation_1-auc:0.77758
[72] validation_0-auc:0.80789 validation_1-auc:0.77791
[73] validation_0-auc:0.80803 validation_1-auc:0.77879
[74] validation_0-auc:0.80796 validation_1-auc:0.77929
[75] validation_0-auc:0.80797 validation_1-auc:0.77950
[76] validation_0-auc:0.80784 validation_1-auc:0.77853
[77] validation_0-auc:0.80798 validation_1-auc:0.77813
[78] validation_0-auc:0.80790 validation_1-auc:0.77778
[79] validation_0-auc:0.80768 validation_1-auc:0.77611
[80] validation_0-auc:0.80787 validation_1-auc:0.77626
[81] validation_0-auc:0.80865 validation_1-auc:0.77714
[82] validation_0-auc:0.80878 validation_1-auc:0.77750
[83] validation_0-auc:0.80885 validation_1-auc:0.77748
[84] validation_0-auc:0.80870 validation_1-auc:0.77733
[85] validation_0-auc:0.80856 validation_1-auc:0.77750
[86] validation_0-auc:0.80928 validation_1-auc:0.77817
[87] validation_0-auc:0.80943 validation_1-auc:0.77807
[88] validation_0-auc:0.80965 validation_1-auc:0.77780
[89] validation_0-auc:0.80959 validation_1-auc:0.77790
[90] validation_0-auc:0.80967 validation_1-auc:0.77780
[91] validation_0-auc:0.81014 validation_1-auc:0.77744
[92] validation_0-auc:0.81021 validation_1-auc:0.77819
[93] validation_0-auc:0.81038 validation_1-auc:0.77775
[94] validation_0-auc:0.81035 validation_1-auc:0.77774
[95] validation_0-auc:0.81040 validation_1-auc:0.77738
[96] validation_0-auc:0.81041 validation_1-auc:0.77759
[97] validation_0-auc:0.81028 validation_1-auc:0.77719
[98] validation_0-auc:0.81072 validation_1-auc:0.77752
[99] validation_0-auc:0.81068 validation_1-auc:0.77751
[100] validation_0-auc:0.81089 validation_1-auc:0.77785
[101] validation_0-auc:0.81074 validation_1-auc:0.77835
[102] validation_0-auc:0.81116 validation_1-auc:0.77851
[103] validation_0-auc:0.81132 validation_1-auc:0.77825
[104] validation_0-auc:0.81150 validation_1-auc:0.77846
[105] validation_0-auc:0.81157 validation_1-auc:0.77834
[106] validation_0-auc:0.81197 validation_1-auc:0.77765
[107] validation_0-auc:0.81196 validation_1-auc:0.77702
[108] validation_0-auc:0.81210 validation_1-auc:0.77686
[109] validation_0-auc:0.81222 validation_1-auc:0.77709
[110] validation_0-auc:0.81218 validation_1-auc:0.77749
[111] validation_0-auc:0.81248 validation_1-auc:0.77750
[112] validation_0-auc:0.81271 validation_1-auc:0.77731
[113] validation_0-auc:0.81285 validation_1-auc:0.77742
[114] validation_0-auc:0.81304 validation_1-auc:0.77759
[115] validation_0-auc:0.81334 validation_1-auc:0.77785
[116] validation_0-auc:0.81328 validation_1-auc:0.77798
[117] validation_0-auc:0.81337 validation_1-auc:0.77820
[118] validation_0-auc:0.81351 validation_1-auc:0.77828
[119] validation_0-auc:0.81356 validation_1-auc:0.77819
[120] validation_0-auc:0.81362 validation_1-auc:0.77832
[121] validation_0-auc:0.81371 validation_1-auc:0.77844
[122] validation_0-auc:0.81385 validation_1-auc:0.77858
[123] validation_0-auc:0.81400 validation_1-auc:0.77858
[124] validation_0-auc:0.81404 validation_1-auc:0.77835
[125] validation_0-auc:0.81416 validation_1-auc:0.77865
[126] validation_0-auc:0.81430 validation_1-auc:0.77861
[127] validation_0-auc:0.81429 validation_1-auc:0.77851
[128] validation_0-auc:0.81440 validation_1-auc:0.77867
[129] validation_0-auc:0.81452 validation_1-auc:0.77859
[130] validation_0-auc:0.81464 validation_1-auc:0.77863
[131] validation_0-auc:0.81459 validation_1-auc:0.77861
[132] validation_0-auc:0.81471 validation_1-auc:0.77888
[133] validation_0-auc:0.81488 validation_1-auc:0.77869
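The log stops at round 133 because the best OOT AUC (0.78021) was reached at round 33 and early_stopping_rounds=100 allows 100 further rounds without improvement. The sketch below reproduces that patience rule on a made-up list of scores. `early_stop_round` is a hypothetical helper written for illustration, not part of XGBoost.

```python
def early_stop_round(scores, patience):
    """Return (best_round, stop_round) for a metric that should be maximized.

    Training stops at the first round that lies `patience` rounds past the
    best round without a strict improvement; stop_round is None otherwise.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, None

# Toy validation AUCs (made up): the best value first appears at round 2.
toy_auc = [0.70, 0.72, 0.75, 0.74, 0.75, 0.73, 0.74]
best_rnd, stop_rnd = early_stop_round(toy_auc, patience=3)
print(best_rnd, stop_rnd)
```

Applied to the log above with patience 100, the same rule gives a best round of 33 and a stop at round 133.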
KS on the training set
# Predict on the training set
y_pred = model.predict_proba(x_train)[:,1]
fpr_xgb_train, tpr_xgb_train,_ = metrics.roc_curve(y_train, y_pred)
train_ks = abs(fpr_xgb_train - tpr_xgb_train).max()
print('ks on train set is: ',train_ks)
ks on train set is: 0.4582934972077165
KS on the validation set
# Predict on the out-of-time validation set
y_pred = model.predict_proba(x_valid)[:,1]
fpr_xgb,tpr_xgb,_ = metrics.roc_curve(y_valid, y_pred)
evl_ks = abs(fpr_xgb - tpr_xgb).max()
print('ks on valid set is: ',evl_ks)
ks on valid set is: 0.43979696100086196
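KS is the largest gap between the cumulative share of bad samples and the cumulative share of good samples as the threshold moves through the predictions, which is what `abs(fpr - tpr).max()` computes from the ROC points. As a rough sketch of the same quantity without scikit-learn, the hypothetical helper `ks_statistic` below uses only NumPy on made-up labels and predictions.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Largest |cumulative bad share - cumulative good share| over all thresholds."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="mergesort")    # highest prediction first
    y_sorted, s_sorted = y_true[order], y_score[order]
    cum_bad = np.cumsum(y_sorted) / y_sorted.sum()                # true positive rate
    cum_good = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()     # false positive rate
    # Tied predictions share one threshold, so only keep the last row of each tie.
    last_of_tie = np.r_[s_sorted[1:] != s_sorted[:-1], True]
    return float(np.max(np.abs(cum_bad - cum_good)[last_of_tie]))

# Toy labels (1 = bad) and predicted bad probabilities, made up for illustration.
toy_y = [0, 0, 1, 0, 1, 1, 0, 0]
toy_p = [0.1, 0.2, 0.8, 0.3, 0.7, 0.4, 0.5, 0.2]
toy_ks = ks_statistic(toy_y, toy_p)
print(round(toy_ks, 3))
```

On the same inputs this should agree with the ROC-based calculation used in the article, since both evaluate the gap at every distinct threshold.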
ROC curves on the training and validation sets
# Compare the ROC curves on the training and validation sets
from matplotlib import pyplot as plt
plt.plot(fpr_xgb_train, tpr_xgb_train, label='XGB_train')
plt.plot(fpr_xgb, tpr_xgb, label='XGB_valid')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()

Model report
# Build the model report: 20 bins ordered by descending predicted bad probability
bins = 20
Y_predict = y_pred
Y = y_valid
rows = Y.shape[0]
# Pair each prediction with its label and sort by prediction, highest first
ks_lis = sorted(zip(Y_predict, Y), key=lambda x: x[0], reverse=True)
bin_num = int(rows / bins + 1)                    # number of samples per bin
good = sum(1 for (p, y) in ks_lis if y <= 0.5)    # total number of good samples
bad = sum(1 for (p, y) in ks_lis if y > 0.5)      # total number of bad samples
good_cnt, bad_cnt = 0, 0
KS, GOOD, GOOD_CNT, BAD, BAD_CNT, BAD_PCTG, BAD_RATE = [], [], [], [], [], [], []
report = {}
for j in range(bins):
    ds = ks_lis[j * bin_num : min(bin_num * (j + 1), rows)]
    good1 = sum(1 for (p, y) in ds if y <= 0.5)
    bad1 = sum(1 for (p, y) in ds if y > 0.5)
    bad_cnt += bad1
    good_cnt += good1
    bad_pctg = round(bad_cnt / bad, 3)            # cumulative share of bad samples captured
    bad_rate = round(bad1 / (bad1 + good1), 3)    # bad rate within this bin
    ks = round(math.fabs(bad_cnt / bad - good_cnt / good), 3)
    KS.append(ks)
    GOOD.append(good1)
    GOOD_CNT.append(good_cnt)
    BAD.append(bad1)
    BAD_CNT.append(bad_cnt)
    BAD_PCTG.append(bad_pctg)
    BAD_RATE.append(bad_rate)
report['KS'] = KS
report['Good count'] = GOOD
report['Cumulative good count'] = GOOD_CNT
report['Bad count'] = BAD
report['Cumulative bad count'] = BAD_CNT
report['Capture rate'] = BAD_PCTG
report['Bad rate'] = BAD_RATE
pd.DataFrame(report)
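The same kind of report can also be built with pandas group operations instead of an explicit loop. The sketch below is one possible version under the same binning rule (fixed-size bins ordered by descending predicted probability). The function `ks_report`, its column names, and the toy inputs are assumptions made for illustration, and unlike the loop it only returns bins that contain samples.

```python
import numpy as np
import pandas as pd

def ks_report(y_true, y_pred, bins=20):
    """Bin samples by descending predicted probability and summarize each bin."""
    df = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_pred)})
    df = df.sort_values("p", ascending=False, kind="mergesort").reset_index(drop=True)
    bin_size = int(len(df) / bins + 1)             # same bin size as the loop version
    df["bin"] = np.arange(len(df)) // bin_size
    df["bad"] = (df["y"] > 0.5).astype(int)
    df["good"] = 1 - df["bad"]
    rep = df.groupby("bin")[["good", "bad"]].sum()
    rep["cum_good"] = rep["good"].cumsum()
    rep["cum_bad"] = rep["bad"].cumsum()
    rep["capture_rate"] = rep["cum_bad"] / rep["bad"].sum()
    rep["bad_rate"] = rep["bad"] / (rep["good"] + rep["bad"])
    rep["ks"] = (rep["cum_bad"] / rep["bad"].sum()
                 - rep["cum_good"] / rep["good"].sum()).abs()
    return rep.round(3)

# Toy labels and predictions, made up for illustration.
toy_y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_report = ks_report(toy_y, toy_p, bins=5)
print(toy_report)
```

The largest value in the `ks` column is a binned approximation of the model KS, so it can be slightly lower than the threshold-level KS printed earlier.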

3. Score mapping
# Score mapping
def score(pred):
    # pred is the predicted probability of the bad class (label 1)
    return 650 + 50 * math.log2((1 - pred) / pred)

valid['xbeta'] = model.predict_proba(x_valid)[:, 1]
valid['score'] = valid['xbeta'].map(score)
# The score falls as pred rises, so the ROC curve is mirrored; abs() keeps the KS comparable
fpr_score, tpr_score, _ = metrics.roc_curve(y_valid, valid['score'])
evl_ks = abs(fpr_score - tpr_score).max()
print('ks on valid set after scoring: ', evl_ks)
ks on valid set after scoring: 0.43979696100086196
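The KS printed here equals the value obtained from the raw probabilities because the mapping is strictly decreasing in the predicted probability: it reverses the ranking but never reorders samples relative to each other. The sketch below checks this on synthetic data. The data generation and the helper `ks` are made up for illustration and are not part of the article's pipeline.

```python
import numpy as np

def ks(y, s):
    """KS as the largest |TPR - FPR| over all distinct thresholds of s."""
    y, s = np.asarray(y), np.asarray(s)
    thresholds = np.unique(s)
    tpr = np.array([(s[y == 1] >= t).mean() for t in thresholds])
    fpr = np.array([(s[y == 0] >= t).mean() for t in thresholds])
    return float(np.abs(tpr - fpr).max())

# Synthetic labels and predicted bad probabilities (bad samples score higher on average).
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=500)
toy_pred = 0.3 * toy_y + rng.uniform(0.05, 0.65, size=500)

# Same mapping as in the article: 650 + 50 * log2((1 - pred) / pred).
toy_score = 650 + 50 * np.log2((1 - toy_pred) / toy_pred)

ks_pred, ks_score = ks(toy_y, toy_pred), ks(toy_y, toy_score)
print(round(ks_pred, 4), round(ks_score, 4))
```

If the two numbers differed, the mapping would not be monotonic, which usually points to a bug such as mixing up the good and bad probabilities for only part of the data.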
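After mapping the model output to scores, the KS on the out-of-time sample is identical to the KS computed directly from the XGBoost model's predicted probabilities (0.4398 in both cases), which gives us a way to verify that the mapping logic is correct. A correct mapping does not change the model's ranking ability, so the KS after mapping should equal the KS of the model output before mapping. Once we have the user scores, we can go on to assign rating levels based on them: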
# Assign rating levels
def level(score):
    level = 0
    if score <= 780:
        level = "D"
    elif score <= 790:
        level = "C"
    elif score <= 800:
        level = "B"
    elif score > 800:
        level = "A"
    return level

valid['level'] = valid.score.map(level)
valid['level'].groupby(valid['level']).count()/len(valid)
level
A 0.705853
B 0.167449
C 0.069171
D 0.057527
Name: level, dtype: float64
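The same grading can be written with `pd.cut`, which keeps the cut points in one place. This is a minimal sketch that reuses the thresholds 780, 790, and 800 from the `level` function. The helper name `assign_levels` and the toy scores are made up for illustration.

```python
import numpy as np
import pandas as pd

def assign_levels(scores):
    """Grade scores with right-closed intervals: D <= 780 < C <= 790 < B <= 800 < A."""
    return pd.cut(
        scores,
        bins=[-np.inf, 780, 790, 800, np.inf],
        labels=["D", "C", "B", "A"],
        right=True,   # each interval includes its upper bound, e.g. (780, 790]
    )

# Toy scores chosen to sit on and around the cut points.
toy_scores = pd.Series([775.0, 780.0, 785.5, 790.0, 800.0, 800.1])
toy_levels = assign_levels(toy_scores)
print(toy_levels.tolist())
```

Keeping the boundaries in a single list makes it easier to adjust them when the score distribution shifts.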
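4. Feature importance
Because this is a tree model, once training is complete we can easily obtain the importance of each feature in the training data and use it to check how effective the raw indicators are in actual modeling.
# Rank the features by importance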
temp = pd.DataFrame()
temp['feature_name'] = model.feature_names_in_
temp['importance'] = model.feature_importances_
sorted_features = temp.sort_values('importance', ascending=False)
sorted_features
   feature_name  importance
7  finance_info    0.371015
8   credit_info    0.199961
6   person_info    0.119167
9      act_info    0.083946
5     zcx_score    0.048099
2      mj_score    0.047659
3      rh_score    0.038957
1     jxl_score    0.032673
4     zzc_score    0.032648
0      td_score    0.025874
sorted_features.set_index('feature_name').plot.bar(figsize=(12,6),rot=0)

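The ranking also reveals an interesting pattern: the company's internal user information, such as financial attributes and credit information, is more important than the credit scores supplied by external vendors. Feature importance will get a dedicated article later in the series; this article only gives a brief walkthrough of building a scorecard with XGBoost.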
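To put a number on that comparison, the sketch below sums the importance values from the table by data source. It assumes that the columns ending in `_score` are the external vendor scores and that the `_info` columns are internal user information; that grouping is inferred from the column names and is not stated explicitly in the data.

```python
import pandas as pd

# Importance values copied from the table above.
importance = pd.Series({
    "finance_info": 0.371015, "credit_info": 0.199961, "person_info": 0.119167,
    "act_info": 0.083946, "zcx_score": 0.048099, "mj_score": 0.047659,
    "rh_score": 0.038957, "jxl_score": 0.032673, "zzc_score": 0.032648,
    "td_score": 0.025874,
})

def source_of(feature_name):
    # Assumption: "*_score" columns come from external vendors, the rest are internal.
    return "external" if feature_name.endswith("_score") else "internal"

# Passing a function to groupby applies it to each index label (the feature name).
share_by_source = importance.groupby(source_of).sum()
print(share_by_source.round(3))
```

Under that assumption, the four internal features carry roughly three quarters of the total importance, while the six external scores share the rest.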
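Credit Risk Modeling in Practice series
Credit Risk Modeling in Practice (1): Overview of the Modeling Workflow
Credit Risk Modeling in Practice (2): Strategy Generation and Rule Mining
Credit Risk Modeling in Practice (3): Scorecard Modeling with Logistic Regression
Credit Risk Modeling in Practice (4): Scorecard Modeling with XGBoost
Credit Risk Modeling in Practice (5): Feature Engineering
Credit Risk Modeling in Practice (6): Anomaly Detection
Credit Risk Modeling in Practice (7): Group Segmentation or Clustering
Credit Risk Modeling in Practice (8): Basic Risk Control Concepts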