Alibaba Cloud Tianchi: Financial Risk Control / Loan Default Prediction (Part 1)

2020-09-16  kaka22

Understanding the Problem

The competition is set against the backdrop of personal credit in financial risk control. Contestants must use a loan applicant's data to predict whether the applicant is likely to default, and hence whether the loan should be granted; this is a classic classification problem. The competition is designed to introduce some of the business background of financial risk control, solve a realistic problem, and give newcomers a chance to practice and improve.

The data here is already prepared, but in real risk-control work the definition of the features X and of the label y is a discipline in itself.
In short, time has to be split into an observation window, an observation point, and a performance window.
Observation window: determines X, built from telecom data, e-commerce data, financial-institution data, third-party data, etc., plus derived features.
Observation point: usually the credit-granting date.
Performance window (determined with roll-rate analysis and vintage analysis): defines the label y.

Competition Overview

Contestants are required to build a model on the given dataset and predict financial risk. The dataset, visible and downloadable after registration, comes from the loan records of a credit platform; it contains more than 1.2 million records with 47 columns of variables, 15 of which are anonymized. To keep the competition fair, 800,000 records are drawn as the training set, 200,000 as test set A and 200,000 as test set B, and fields such as employmentTitle, purpose, postCode and title are desensitized.

Data Overview

Generally speaking, the competition page gives a description of each column in the data (except for the anonymized features), explaining what the column is. Knowing what each column means helps with understanding the data and with the later analysis. Tip: an anonymized feature is a feature column whose meaning has not been disclosed.
train.csv

  1. id: unique identifier assigned to the loan record
  2. loanAmnt: loan amount
  3. term: loan term (years)
  4. interestRate: loan interest rate
  5. installment: installment payment amount
  6. grade: loan grade
  7. subGrade: loan sub-grade
  8. employmentTitle: employment title
  9. employmentLength: employment length (years)
  10. homeOwnership: home-ownership status provided by the borrower at registration
  11. annualIncome: annual income
  12. verificationStatus: verification status
  13. issueDate: month the loan was issued
  14. purpose: loan purpose category given by the borrower in the application
  15. postCode: first 3 digits of the postal code provided by the borrower in the application
  16. regionCode: region code
  17. dti: debt-to-income ratio
  18. delinquency_2years: number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
  19. ficoRangeLow: lower bound of the borrower's FICO range at loan origination
  20. ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
  21. openAcc: number of open credit lines in the borrower's credit file
  22. pubRec: number of derogatory public records
  23. pubRecBankruptcies: number of public record bankruptcies
  24. revolBal: total revolving credit balance
  25. revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
  26. totalAcc: total number of credit lines currently in the borrower's credit file
  27. initialListStatus: initial listing status of the loan
  28. applicationType: whether the loan is an individual application or a joint application with two co-borrowers
  29. earliesCreditLine: month the borrower's earliest reported credit line was opened
  30. title: loan title provided by the borrower
  31. policyCode: publicly available policy code = 1; new products not publicly available policy code = 2
  32. n-series anonymous features: n0 to n14, processed counts of borrower behaviour

Evaluation Metric

The competition uses AUC as the evaluation metric. AUC (Area Under Curve) is defined as the area under the ROC curve.

Why AUC? (When I started my internship I built a model with 90%+ accuracy and excitedly showed it to my manager, only to be told that the model's input variables were problematic and needed further screening.)
Later I also learned that AUC alone is not enough; it should be evaluated together with the KS statistic.

Why don't risk models use accuracy? Why AUC and KS instead?

Because a risk model is not like a cat-vs-dog classifier: credit risk control is about balancing risk against return, so the definition of good and bad is often fuzzy. Bad customers cause charge-off losses, but they also bring in interest and penalty income. How bad a customer base can we accept? That depends on our risk tolerance. In risk control, the label y is not black and white (discrete); measuring it as a probability distribution (continuous) is arguably more reasonable.

Another issue is that class imbalance is very common in risk-control scenarios; positive-to-negative ratios of 1:100 or even more extreme are typical. In that setting accuracy is unreliable, because predicting every sample as negative already yields a very high accuracy. For example, with 95 cats and 5 dogs in the dataset, a classifier that simply labels everything as a cat reaches 95% accuracy, so accuracy tells us very little.
(References: 客户层申请评分卡(A卡)模型;
风控模型—区分度评估指标(KS)深入理解应用)

AUC

True positive rate: TPR = \frac{TP}{TP+FN}
False positive rate: FPR = \frac{FP}{FP+TN}

The business goal: a higher TPR ("caught the right ones") together with a lower FPR ("caught the wrong ones").

  1. For a given threshold T, predict bad below the threshold and good above it, then compute TPR and FPR.
  2. Repeat with different thresholds T to obtain a series of (TPR, FPR) pairs.
  3. Plot FPR on the x-axis and TPR on the y-axis to get the ROC curve; the area under the curve is the AUC.

What we want is a higher TPR and a lower FPR, so we can define the following objective (which is exactly the KS statistic):

KS = \max_T |TPR(T) - FPR(T)|

Since TPR ≥ FPR for any useful scorecard, TPR = KS + FPR at the optimal threshold; geometrically, KS is the maximum vertical distance between the ROC curve and the 45° diagonal (the intercept of the slope-1 line through that ROC point). KS therefore measures the separation between the cumulative bad rate (TPR = predicted bad & actually bad to the left of the threshold / all actually bad) and the cumulative good rate (FPR = predicted bad & actually good to the left of the threshold / all actually good).

For details see the articles by 求是旺在路上; it was only after reading several of his posts that I developed a deeper understanding of risk control.

  1. To make KS as large as possible, the optimal point should be as close to (0, 1) as possible, in which case AUC generally increases as well.
  2. For the same KS value there are two candidate points on the KS curve, with TPR and FPR either both large or both small. The usual goal is to catch more bad customers (TPR up) while mis-flagging fewer good customers (FPR down), but the two have to be traded off. Which threshold to pick depends on the business goal: higher recall on bad customers, or less collateral damage on good ones?
  3. KS is measured only at the single point of maximum separation, so it is not comprehensive on its own; in practice we look at KS together with AUC (or Gini). A short code sketch of both metrics follows.
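
As a quick illustration (not from the original post), both metrics can be read off the ROC curve with sklearn; KS is just the maximum gap between TPR and FPR over all thresholds:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_ks(y_true, y_score):
    # one (FPR, TPR) pair per threshold
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    ks = np.max(tpr - fpr)                 # max vertical gap between ROC curve and diagonal
    return auc, ks

# illustrative call with random scores on an imbalanced label
rng = np.random.RandomState(0)
y_demo = rng.binomial(1, 0.2, size=1000)
score_demo = rng.rand(1000)
print(auc_ks(y_demo, score_demo))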

A Basic Scorecard Model (Logistic Regression)

This part uses two excellent scorecard-modeling packages: toad and scorecardpy.

If things run slowly locally, you can upload the data to Kaggle or Tianchi's online notebook environment.

pip install toad
pip install scorecardpy

References:
https://github.com/ShichenXie/scorecardpy comes with examples (the examples contain a few small mistakes, but nothing that affects usage)
https://toad.readthedocs.io/en/latest/ also has a Chinese tutorial, but the English docs are easier to follow

Load packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
import toad
import scorecardpy as sc

Read the data

data_train =pd.read_csv('../input/fengkong/train.csv', index_col='id')
data_test_a = pd.read_csv('../input/fengkong/testA.csv', index_col='id')

Data cleaning

Separate the numeric columns from the non-numeric ones (object dtype: dates, strings, etc.)

'''
# non-numeric (object) columns
s = data_train.apply(lambda x:x.dtype)
tecols = s[s=='object'].index.tolist()
'''
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
label = 'isDefault'
numerical_fea.remove(label)

Encode the non-numeric columns

category_fea
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

The initial idea was to label-encode 'grade' and 'subGrade', extract the number from the 'employmentLength' string, and convert 'issueDate' and 'earliesCreditLine' into offsets from the earliest date.

'''
for data in [data_train, data_test_a]:
    data['grade'] = data['grade'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7})
    data['subGrade'] = data['subGrade'].map({'A1':1.0,'A2':1.2,'A3':1.4,'A4':1.6,'A5':1.8,
                                       'B1':2.0,'B2':2.2,'B3':2.4,'B4':2.6,'B5':2.8,
                                       'C1':3.0,'C2':3.2,'C3':3.4,'C4':3.6,'C5':3.8,
                                       'D1':4.0,'D2':4.2,'D3':4.4,'D4':4.6,'D5':4.8,
                                       'E1':5.0,'E2':5.2,'E3':5.4,'E4':5.6,'E5':5.8,
                                       'F1':6.0,'F2':6.2,'F3':6.4,'F4':6.6,'F5':6.8,
                                       # G1-G5 added to cover grade 'G' (mapped to 7 above)
                                       'G1':7.0,'G2':7.2,'G3':7.4,'G4':7.6,'G5':7.8,
                                       })

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])
    
for data in [data_train, data_test_a]:
    data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
    data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
    
data_train['employmentLength'].value_counts(dropna=False).sort_index()
'''
'''
# convert to datetime format
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # build a time feature: days since 2007-06-01
    data['issueDate'] = data['issueDate'].apply(lambda x: x-startdate).dt.days
data_train['issueDate'].sample(5)
'''
'''
for data in [data_train, data_test_a]:
    data['earliesCreditLine'] = pd.to_datetime(data['earliesCreditLine'])
    startdate = np.min(data['earliesCreditLine'])
    # build a time feature: days since the earliest credit-line date
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda x: x-startdate).dt.days
data_train['earliesCreditLine'].sample(5)
'''

Later, however, I found a simpler option: TargetEncoder
https://zhuanlan.zhihu.com/p/40231966
https://blog.csdn.net/SHU15121856/article/details/102100689

In short: for each value of a categorical feature c, replace it with the frequency with which the target equals 1:
\frac{\text{number of rows with this value where target}=1}{\text{total number of rows with this value}}
which is simply the bad rate associated with that feature value.

To avoid overfitting, K-fold target encoding splits the samples into K folds; samples in each fold are encoded with the frequencies computed on the same categories in the other K-1 folds. A hand-rolled sketch follows the encoder snippet below.

from category_encoders.target_encoder import TargetEncoder
te = TargetEncoder(cols=category_fea)
train = te.fit_transform(data_train, target)
test = te.transform(data_test_a)
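
For reference, a hand-rolled K-fold target encoding might look like this (a minimal sketch, not the category_encoders implementation; the helper name kfold_target_encode and the fold count are illustrative):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(X, y, col, n_splits=5, seed=42):
    # each fold is encoded with the bad rate computed on the other K-1 folds,
    # which reduces target leakage compared with a plain groupby-mean
    encoded = pd.Series(np.nan, index=X.index, name=col + '_te')
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        bad_rate = y.iloc[fit_idx].groupby(X[col].iloc[fit_idx]).mean()
        encoded.iloc[enc_idx] = X[col].iloc[enc_idx].map(bad_rate).values
    # categories unseen in the fitting folds fall back to the overall bad rate
    return encoded.fillna(y.mean())

(category_encoders' TargetEncoder additionally smooths each category's estimate toward the global mean.)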

Data exploration

Similar to describe(), but toad's detect function is more thorough: it produces statistics not only for numeric variables but also for categorical ones.
It shows the dtype, size, missing rate, number of unique values, mean, standard deviation, and quantiles (or the most frequent categories for categorical variables).

toad.detect(train)

You can see that policyCode is always 1, so this variable contributes nothing to classification.
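
This is easy to confirm directly (a quick check on the encoded training frame):

print(train['policyCode'].nunique())                   # expected: 1
print(train['policyCode'].value_counts(dropna=False))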

Feature screening

  1. Compute IV values
    In scorecardpy: sc.iv(dt, y, x=None, positive='bad|1', order=True)
    In toad: toad.quality(dataframe, target='target', iv_only=False) returns IV (information value), Gini, entropy and the number of unique values for each feature, sorted by IV in descending order; 'target' is the target variable and 'iv_only' controls whether only IV is computed (a scorecardpy sketch follows the toad.quality call below).
toad.quality(train, target=target, iv_only=True)
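
For comparison, the scorecardpy route mentioned above would look roughly like this (a sketch; it assumes `target` is the isDefault label Series used earlier and attaches it to the frame, mirroring the concat used later before woebin):

# IV per variable with scorecardpy; the input frame must contain the label column
iv_df = sc.iv(pd.concat([train, target.rename('isDefault')], axis=1), y='isDefault')
print(iv_df.head(10))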

Variable selection

Screening is usually based on: missing rate, single-value rate, coefficient of variation, stability (PSI), information value (IV), RF/XGBoost feature importance, linear correlation, multicollinearity, stepwise regression, and p-value significance tests.

In scorecardpy: filter by IV, missing rate, and identical-value rate

var_filter(dt, y, x=None, iv_limit=0.02, missing_limit=0.95,
identical_limit=0.95, var_rm=None, var_kp=None,
return_rm_reason=False, positive='bad|1'):

 Params
    ------
    dt: A data frame with both x (predictor/feature) and y 
      (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is NULL. If x is NULL, then all 
      variables except y are counted as x variables.
    iv_limit: The information value of kept variables should>=iv_limit. 
      The default is 0.02.
    missing_limit: The missing rate of kept variables should<=missing_limit. 
      The default is 0.95.
    identical_limit: The identical value rate (excluding NAs) of kept 
      variables should <= identical_limit. The default is 0.95.
    var_rm: Name of force removed variables, default is NULL.
    var_kp: Name of force kept variables, default is NULL.
    return_rm_reason: Logical, default is FALSE.
    positive: Value of positive class, default is "bad|1".
    
    Returns
    ------
    DataFrame
        A data.table with y and selected x variables
    Dict(if return_rm_reason == TRUE)
        A DataFrame with y and selected x variables and 
          a DataFrame with the reason of removed x variable.

In toad:

toad.selection.select(dataframe, target='target', empty=0.9, iv=0.02, corr=0.7, return_drop=False, exclude=None):

Performs a first-pass feature selection based on missing rate, IV, and correlation with other features. The parameters:
empty=0.9: features with more than 90% missing values are dropped;
iv=0.02: features with IV below 0.02 are dropped;
corr=0.7: if the Pearson correlation between two or more features is greater than 0.7, the ones with lower IV are dropped;
return_drop=False: if set to True, the function also returns the list of dropped columns;
exclude=None: a list of features to exclude from the algorithm, typically the ID column and month/date columns.

Here I use the toad function to screen the variables:

train_selected, dropped = toad.selection.select(train, target=target, empty=0.9, iv=0.02, corr=0.9, return_drop=True)

print("keep:",train_selected.shape[1],
      "drop empty:",len(dropped['empty']),
      "drop iv:",len(dropped['iv']),
      "drop corr:",len(dropped['corr']))

Output:

keep: 15 drop empty: 0 drop iv: 24 drop corr: 6
{'empty': array([], dtype=float64),
 'iv': array(['employmentLength', 'purpose', 'postCode', 'regionCode',
        'delinquency_2years', 'openAcc', 'pubRec', 'pubRecBankruptcies',
        'revolBal', 'totalAcc', 'initialListStatus', 'applicationType',
        'policyCode', 'n0', 'n1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n10',
        'n11', 'n12', 'n13'], dtype=object),
 'corr': array(['n9', 'grade', 'n2.1', 'installment', 'ficoRangeHigh',
        'interestRate'], dtype=object)}

Binning

In scorecardpy (decision-tree binning by default):

woebin(dt, y, x=None, 
           var_skip=None, breaks_list=None, special_values=None, 
           stop_limit=0.1, count_distr_limit=0.05, bin_num_limit=8, 
           # min_perc_fine_bin=0.02, min_perc_coarse_bin=0.05, max_num_bin=8, 
           positive="bad|1", no_cores=None, print_step=0, method="tree",
           ignore_const_cols=True, ignore_datetime_cols=True, 
           check_cate_num=True, replace_blank=True, 
           save_breaks_list=None, **kwargs):
 WOE Binning
    ------
    `woebin` generates optimal binning for numerical, factor and categorical 
    variables using methods including tree-like segmentation or chi-square 
    merge. woebin can also customizing breakpoints if the breaks_list or 
    special_values was provided.
    
    The default woe is defined as ln(Distr_Bad_i/Distr_Good_i). If you 
    prefer ln(Distr_Good_i/Distr_Bad_i), please set the argument `positive` 
    as negative value, such as '0' or 'good'. If there is a zero frequency 
    class when calculating woe, the zero will replaced by 0.99 to make the 
    woe calculable.
    
    Params
    ------
    dt: A data frame with both x (predictor/feature) and y (response/label) variables.
    y: Name of y variable.
    x: Name of x variables. Default is None. If x is None, 
      then all variables except y are counted as x variables.
    var_skip: Name of variables that will skip for binning. Defaults to None.
    breaks_list: List of break points, default is None. 
      If it is not None, variable binning will based on the 
      provided breaks.
    special_values: the values specified in special_values 
      will be in separate bins. Default is None.
    count_distr_limit: The minimum percentage of final binning 
      class number over total. Accepted range: 0.01-0.2; default 
      is 0.05.
    stop_limit: Stop binning segmentation when information value 
      gain ratio less than the stop_limit, or stop binning merge 
      when the minimum of chi-square less than 'qchisq(1-stoplimit, 1)'. 
      Accepted range: 0-0.5; default is 0.1.
    bin_num_limit: Integer. The maximum number of binning.
    positive: Value of positive class, default "bad|1".
    no_cores: Number of CPU cores for parallel computation. 
      Defaults None. If no_cores is None, the no_cores will 
      set as 1 if length of x variables less than 10, and will 
      set as the number of all CPU cores if the length of x variables 
      greater than or equal to 10.
    print_step: A non-negative integer. Default is 1. If print_step>0, 
      print variable names by each print_step-th iteration. 
      If print_step=0 or no_cores>1, no message is print.
    method: Optimal binning method, it should be "tree" or "chimerge". 
      Default is "tree".
    ignore_const_cols: Logical. Ignore constant columns. Defaults to True.
    ignore_datetime_cols: Logical. Ignore datetime columns. Defaults to True.
    check_cate_num: Logical. Check whether the number of unique values in 
      categorical columns larger than 50. It might make the binning process slow 
      if there are too many unique categories. Defaults to True.
    replace_blank: Logical. Replace blank values with None. Defaults to True.
    save_breaks_list: The file name to save breaks_list. Default is None.
    
    Returns
    ------
    dictionary
        Optimal or customized binning dataframe.

toad defaults to chi-square binning:

toad's binning supports both categorical and numeric variables.
toad.transform.Combiner() is used to train the bins:
1. Initialize: c = toad.transform.Combiner()
2. Fit the bins: c.fit(dataframe, y = 'target', method = 'chi', min_samples = None, n_bins = None, empty_separate = False)
§ y: target variable;
§ method: the binning method. Supports 'chi' (chi-squared), 'dt' (decision tree), 'kmeans' (k-means), 'quantile' (equal percentiles), and 'step' (equal step size);
§ min_samples: a count or a proportion; the minimum number/proportion of samples required in each bucket;
§ n_bins: the number of buckets; if the number is too large, the algorithm returns the maximum number of buckets it can produce;
§ empty_separate: whether to put missing values in a separate bucket. If False, missing values are placed in the bucket with the closest bad rate.
3. Export the binning result: c.export()
4. Adjust the bins: c.set_rules(dict)
5. Apply the bins and convert to discrete values: c.transform(dataframe, labels=False):
§ labels: whether to convert the data to explanatory labels. Returns 0, 1, 2, ... when False (categorical features are sorted in descending order of proportion); returns (-inf, 0], (0, 10], (10, inf) when True.

Note: 1. remember to exclude unwanted columns, especially the ID column and timestamp column. 2. Columns with a large number of unique values may take a long time to train.
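
Putting those steps together, the toad workflow would look roughly like this (a sketch based on the API described above; this post ultimately bins with scorecardpy instead, and the label is attached the same way as before woebin below):

c = toad.transform.Combiner()
df_for_binning = pd.concat([train_selected, target.rename('isDefault')], axis=1)
c.fit(df_for_binning, y='isDefault', method='chi', min_samples=0.05)   # chi-square binning
bin_rules = c.export()                                     # inspect / save the cut points
train_binned = c.transform(df_for_binning, labels=True)    # interval labels, for review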

Both packages offer a way to adjust the bins:
scorecardpy: sc.woebin(dt_s, y="creditability", breaks_list=breaks_adj)
toad: c.set_rules(dict)

Here I use scorecardpy for the WOE binning.

train_selected = pd.concat([train_selected, target.rename('isDefault')], axis=1) 
bins = sc.woebin(train_selected, y="isDefault")

Visualize the bins and check their monotonicity:

sc.woebin_plot(bins)

A few of the resulting bin plots (images not reproduced here):

The plots should be checked for monotonicity and the bins adjusted accordingly (not done here); a sketch of what such an adjustment would look like follows.
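
If an adjustment were needed, scorecardpy's breaks_list is the hook for it (the variables and cut points below are purely illustrative):

# manually override the cut points for a few variables, then re-bin and re-plot
breaks_adj = {
    'dti': [10, 20, 30],
    'loanAmnt': [5000, 10000, 20000],
}
bins_adj = sc.woebin(train_selected, y='isDefault', breaks_list=breaks_adj)
sc.woebin_plot(bins_adj)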

Then convert both the training set and the test set to WOE encoding

# test_a_selected is assumed to be the test set restricted to the same columns kept in train_selected (that subsetting step is not shown in the post)
train_woe = sc.woebin_ply(train_selected, bins)
test_a_woe = sc.woebin_ply(test_a_selected, bins)

Model training

# breaking dt into train and val
train, val = sc.split_df(train_woe, 'isDefault').values()

y_train = train.loc[:,'isDefault']
X_train = train.loc[:,train.columns != 'isDefault']
y_val = val.loc[:,'isDefault']
X_val = val.loc[:,val.columns != 'isDefault']

# logistic regression ------
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr.fit(X_train, y_train)
# lr.coef_
# lr.intercept_

# predicted probability
train_pred = lr.predict_proba(X_train)[:,1]
val_pred = lr.predict_proba(X_val)[:,1]

Check AUC and KS on the training and validation sets

train_perf = sc.perf_eva(y_train, train_pred, title = "train")
val_perf = sc.perf_eva(y_val, val_pred, title = "val")

Both KS and AUC are within a reasonable range, and the validation-set performance is very close to the training-set performance, which suggests the model is quite robust.

Scorecard

Convert the variable bins into points:

card = sc.scorecard(bins, lr, xcolumns = X_train.columns)
{'basepoints':      variable  bin  points
 0  basepoints  NaN   488.0,
 'n14':    variable         bin  points
 26      n14  [-inf,1.0)     9.0
 27      n14   [1.0,3.0)     3.0
 28      n14   [3.0,5.0)    -6.0
 29      n14   [5.0,inf)   -12.0,
 'employmentTitle':            variable                  bin  points
 30  employmentTitle      [-inf,200000.0)    -0.0
 31  employmentTitle  [200000.0,240000.0)     1.0
 32  employmentTitle  [240000.0,310000.0)     1.0
 33  employmentTitle       [310000.0,inf)     0.0,
 'earliesCreditLine':             variable                                        bin  points
 0  earliesCreditLine                 [-inf,0.17999999999999997)     6.0
 1  earliesCreditLine  [0.17999999999999997,0.19999999999999996)     2.0
 2  earliesCreditLine  [0.19999999999999996,0.20999999999999996)    -1.0
 3  earliesCreditLine  [0.20999999999999996,0.22999999999999995)    -4.0
 4  earliesCreditLine                  [0.22999999999999995,inf)    -8.0,
 'homeOwnership':         variable         bin  points
 5  homeOwnership  [-inf,1.0)    12.0
 6  homeOwnership   [1.0,2.0)   -13.0
 7  homeOwnership   [2.0,inf)    -3.0,
 'verificationStatus':               variable         bin  points
 8   verificationStatus  [-inf,1.0)     7.0
 9   verificationStatus   [1.0,2.0)    -1.0
 10  verificationStatus   [2.0,inf)    -5.0,
 'revolUtil':      variable          bin  points
 34  revolUtil  [-inf,20.0)     1.0
 35  revolUtil  [20.0,35.0)     0.0
 36  revolUtil  [35.0,55.0)     0.0
 37  revolUtil  [55.0,75.0)    -0.0
 38  revolUtil   [75.0,inf)    -0.0,
 'annualIncome':         variable                 bin  points
 39  annualIncome      [-inf,45000.0)   -14.0
 40  annualIncome   [45000.0,65000.0)    -6.0
 41  annualIncome   [65000.0,75000.0)     0.0
 42  annualIncome  [75000.0,105000.0)     8.0
 43  annualIncome      [105000.0,inf)    20.0,
 'title':    variable         bin  points
 11    title  [-inf,4.0)     0.0
 12    title   [4.0,5.0)    -0.0
 13    title   [5.0,6.0)    -0.0
 14    title  [6.0,20.0)     0.0
 15    title  [20.0,inf)    -0.0,
 'loanAmnt':     variable                bin  points
 44  loanAmnt      [-inf,4000.0)    17.0
 45  loanAmnt   [4000.0,10000.0)    11.0
 46  loanAmnt  [10000.0,16000.0)    -2.0
 47  loanAmnt      [16000.0,inf)    -9.0,
 'n2':    variable         bin  points
 52       n2  [-inf,4.0)     7.0
 53       n2   [4.0,6.0)     3.0
 54       n2   [6.0,9.0)    -3.0
 55       n2   [9.0,inf)   -11.0,
 'issueDate':      variable                                        bin  points
 48  issueDate                 [-inf,0.17999999999999994)    24.0
 49  issueDate  [0.17999999999999994,0.19999999999999996)     3.0
 50  issueDate  [0.19999999999999996,0.21999999999999995)    -4.0
 51  issueDate                  [0.21999999999999995,inf)   -18.0,
 'subGrade':     variable                        bin  points
 16  subGrade                 [-inf,0.1)    64.0
 17  subGrade                  [0.1,0.2)    19.0
 18  subGrade  [0.2,0.30000000000000004)   -13.0
 19  subGrade  [0.30000000000000004,0.4)   -34.0
 20  subGrade                  [0.4,inf)   -54.0,
 'dti':    variable          bin  points
 21      dti  [-inf,14.0)     8.0
 22      dti  [14.0,21.0)     2.0
 23      dti  [21.0,25.0)    -3.0
 24      dti  [25.0,30.0)    -8.0
 25      dti   [30.0,inf)   -14.0,
 'term':    variable         bin  points
 56     term  [-inf,5.0)    11.0
 57     term   [5.0,inf)   -26.0,
 'ficoRangeLow':         variable            bin  points
 58  ficoRangeLow   [-inf,685.0)    -7.0
 59  ficoRangeLow  [685.0,710.0)     0.0
 60  ficoRangeLow  [710.0,740.0)     9.0
 61  ficoRangeLow  [740.0,760.0)    18.0
 62  ficoRangeLow    [760.0,inf)    25.0}

Model validation

Compute each sample's score and check the stability of the score distribution between the training and validation sets (the PSI metric):

train_data = train_selected.loc[train.index].drop(columns=['isDefault'])
val_data = train_selected.loc[val.index].drop(columns=['isDefault'])
# credit score
train_score = sc.scorecard_ply(train_data, card, print_step=0)
val_score = sc.scorecard_ply(val_data, card, print_step=0)
# psi
sc.perf_psi(
  score = {'train':train_score, 'test':val_score},
  label = {'train':y_train, 'test':y_val}
)

These results look a bit off...

Predict on the test set

lr2 = LogisticRegression(penalty='l1', C=0.9, solver='saga', n_jobs=-1)
lr2.fit(X_train, y_train)

# predicted probability
test_pred = lr2.predict_proba(test_a_woe)[:,1]
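
For completeness, the submission file can then be assembled along these lines (a sketch; the column names id and isDefault are assumed from the competition's sample submission):

submission = pd.DataFrame({'id': data_test_a.index, 'isDefault': test_pred})
submission.to_csv('submission.csv', index=False)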

The online score is 0.7113, which is fairly close to the 0.7141 obtained on the training set.
