Tianchi: Loan Default Prediction

2020-11-22 andyham

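The task is to predict whether a borrower will default on a loan. The submission is a binary-classification result with two fields: the id and the probability of default. At the time of writing, the top score on the leaderboard is 0.749. The competition uses AUC as its evaluation metric; AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes.
Competition page: https://tianchi.aliyun.com/competition/entrance/531830/forum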
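
As a rough illustration of the metric (not part of the competition code), AUC can also be read as the probability that a randomly chosen defaulter gets a higher score than a randomly chosen non-defaulter. The sketch below computes it by brute-force pair counting; the function name `pairwise_auc` and the toy labels and scores are made up for this example.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data: 1 = default, 0 = repaid
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This quadratic version is only practical for small samples; on the full data a library routine such as `sklearn.metrics.roc_auc_score` computes the same quantity.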

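1. Data processing

Fields in the training set train.csv

id: unique identifier assigned to the loan listing
loanAmnt: loan amount
term: loan term (years)
interestRate: loan interest rate
installment: installment payment amount
grade: loan grade
subGrade: loan sub-grade
employmentTitle: employment title
employmentLength: length of employment (years)
homeOwnership: home ownership status provided by the borrower at registration
annualIncome: annual income
verificationStatus: verification status
issueDate: month in which the loan was issued
purpose: loan purpose category given by the borrower in the application
postCode: first 3 digits of the postal code provided by the borrower in the application
regionCode: region code
dti: debt-to-income ratio
delinquency_2years: number of delinquency events 30 or more days past due in the borrower's credit file over the past 2 years
ficoRangeLow: lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh: upper bound of the borrower's FICO range at loan issuance
openAcc: number of open credit lines in the borrower's credit file
pubRec: number of derogatory public records
pubRecBankruptcies: number of public-record bankruptcies
revolBal: total revolving credit balance
revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
initialListStatus: initial listing status of the loan
applicationType: whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine: month in which the borrower's earliest reported credit line was opened
title: loan title provided by the borrower
policyCode: policy code; 1 = publicly available, 2 = new products not publicly available
n: a series of anonymized features n0-n14, processed from counts of certain borrower behaviors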

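For a quicker overview of the whole dataset, the Python library pandas_profiling is recommended: it generates an EDA report of the data with a single line of code. A link to a walkthrough is given at the end of this article.

Data overview generated by pandas_profiling

Constant means the column holds a single value; High cardinality means a categorical feature with a large number of distinct categories; High correlation means highly similar (strongly correlated) features.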
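
To make these three warnings concrete, here is a minimal sketch that reproduces similar flags with plain pandas. The function name `flag_columns` and both thresholds (more than 50 categories, absolute correlation of at least 0.9) are assumptions for illustration; they are not the rules pandas_profiling itself applies.

```python
import pandas as pd

def flag_columns(df, max_categories=50, corr_threshold=0.9):
    """Return column lists for three profiling-style warnings."""
    nunique = df.nunique(dropna=False)
    # Constant: a single distinct value in the whole column
    constant = nunique[nunique <= 1].index.tolist()
    # High cardinality: non-numeric column with many distinct categories
    cat_cols = df.select_dtypes(exclude="number").columns
    high_cardinality = [c for c in cat_cols if nunique[c] > max_categories]
    # High correlation: numeric column pairs with |Pearson r| above the threshold
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= corr_threshold:  # NaN compares as False
                pairs.append((cols[i], cols[j]))
    return {"constant": constant,
            "high_cardinality": high_cardinality,
            "high_correlation": pairs}

# Toy frame reusing a few column names from train.csv
toy = pd.DataFrame({
    "policyCode": [1, 1, 1, 1, 1, 1],
    "ficoRangeLow": [660, 700, 720, 680, 750, 690],
    "ficoRangeHigh": [664, 704, 724, 684, 754, 694],
    "loanAmnt": [5000, 12000, 8000, 20000, 3000, 15000],
    "subGrade": ["A1", "B3", "C2", "D5", "A4", "E1"],
})
flags = flag_columns(toy, max_categories=5)
print(flags)
```

On this toy frame the constant column is policyCode, the high-cardinality column is subGrade, and the only highly correlated pair is ficoRangeLow with ficoRangeHigh.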

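Relationships between features

The report also covers missing values, minimum and maximum, mean, median, standard deviation, and so on, and it lists the Common values and Extreme values of each column.
The features above are classified as follows:

numerrical: numerical features
nominal: categorical features without order
ordinal: categorical features with order
y: the prediction target.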

    numerrical = ['loanAmnt','interestRate','installment','annualIncome','dti',
                  'delinquency_2years','ficoRangeHigh','ficoRangeLow','openAcc',
                  'pubRec','pubRecBankruptcies','revolBal','revolUtil','totalAcc']
    nominal = ['term','employmentTitle','homeOwnership','verificationStatus',
               'purpose','postCode','regionCode','initialListStatus','applicationType',
               'title','n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10','n11','n12',
               'n13','n14','id']
    ordinal = ['grade','subGrade','employmentLength','earliesCreditLine','issueDate']
    y = ['isDefault']

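2. Feature engineering

After the exploratory data analysis (EDA) with pandas_profiling above, a simple baseline already works: drop some redundant columns (several look like duplicates, e.g. n2), introduce no business knowledge,
target-encode every non-numeric field directly, and use LGBMRegressor with a few casually chosen parameters. This gives a local 10-fold mean AUC of 0.7317 and an online score of 0.7291. See the official shared notebooks listed at the end for details.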

import pandas as pd
import numpy as np
from category_encoders.target_encoder import TargetEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import auc, roc_curve
from lightgbm import LGBMRegressor

# Load data
train = pd.read_csv('train.csv', index_col='id')
test = pd.read_csv('testA.csv', index_col='id')
target = train.pop('isDefault')
test = test[train.columns]

# Non-numeric columns
s = train.apply(lambda x:x.dtype)
tecols = s[s=='object'].index.tolist()

# Model
def makelgb():
    lgbr = LGBMRegressor(num_leaves=30
                        ,max_depth=5
                        ,learning_rate=.02
                        ,n_estimators=1000
                        ,subsample_for_bin=5000
                        ,min_child_samples=200
                        ,colsample_bytree=.2
                        ,reg_alpha=.1
                        ,reg_lambda=.1
                        )
    return lgbr

# Local validation
kf = KFold(n_splits=10, shuffle=True, random_state=100)
devscore = []
for tidx, didx in kf.split(train.index):
    tf = train.iloc[tidx]
    df = train.iloc[didx]
    tt = target.iloc[tidx]
    dt = target.iloc[didx]
    te = TargetEncoder(cols=tecols)
    tf = te.fit_transform(tf, tt)
    df = te.transform(df)
    lgbr = makelgb()
    lgbr.fit(tf, tt)
    pre = lgbr.predict(df)
    fpr, tpr, thresholds = roc_curve(dt, pre)
    score = auc(fpr, tpr)
    devscore.append(score)
print(np.mean(devscore))

# Retrain on the full training set, predict on test, and write the result
lgbr = makelgb()
te = TargetEncoder(cols=tecols)
tf = te.fit_transform(train, target)
df = te.transform(test)
lgbr.fit(tf, target)
pre = lgbr.predict(df)
pd.Series(pre, name='isDefault', index=test.index).reset_index().to_csv('submit.csv', index=False)

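Building on this baseline, we keep LGB as the starting point and optimize from three angles: feature engineering, encoding, and business-knowledge-driven optimization (which overlaps with the first).

2.1 Feature engineering

① Debt-related features: combine the financial fields annualIncome (annual income), installment (installment amount), loanAmnt (loan amount), and dti (debt-to-income ratio) to derive new features, for example:
Annual income / installment: with an annual income of 100k and an installment of 10k, the ratio is 10.
Loan amount / installment: with a loan of 300k and an installment of 10k, the ratio is 30.
Income * debt ratio gives the debt amount: with an income of 100k and a debt ratio of 0.2, the debt is 20k. This value serves the ratio below.
Loan amount / debt gives the ratio of the bank loan to outside debt: with a loan of 300k and a debt of 20k, the ratio is 15. Clearly the larger this value the better (except when the debt is 0, i.e. there is no debt).
Compute the time from the opening of the earliest credit line to this loan (CreditLine), i.e. the age of the credit account in years.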

    # Ratios between the financial fields
    x['Income_installment']=round(x.loc[:,'annualIncome']/x.loc[:,'installment'],2)
    x['loanAmnt_installment']=round(x.loc[:,'loanAmnt']/x.loc[:,'installment'],2)
    x['debt']=round(x.loc[:,'annualIncome']*x.loc[:,'dti'],2)
    # Loan amount relative to outside debt (numerator is loanAmnt, not annualIncome)
    x['loanAmnt_debt']=round(x.loc[:,'loanAmnt']/x.loc[:,'debt'],2)
    #----------------------------CreditLine--------------------------------
    # Keep only the year: issueDate starts with it, earliesCreditLine ends with it
    x['issueDate'] = x.loc[:,"issueDate"].apply(lambda s: int(s[:4]))
    x['earliesCreditLine'] = x.loc[:,'earliesCreditLine'].apply(lambda s: int(s[-4:]))
    # Years from the earliest credit line to this loan (issue year minus opening year)
    x['CreditLine'] = x.loc[:,'issueDate'] - x.loc[:,'earliesCreditLine']
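
The zero-debt case mentioned above needs some care, because dividing by a debt of 0 yields infinity in pandas. The sketch below shows one possible treatment, which is to turn those values into NaN so that tree models handle them as missing. The helper name `add_debt_features` and the toy rows are hypothetical, and dti is treated as a fraction, as in the worked example in the text.

```python
import numpy as np
import pandas as pd

def add_debt_features(x):
    """Add debt and loanAmnt_debt, mapping division-by-zero results to NaN."""
    out = x.copy()
    out["debt"] = (out["annualIncome"] * out["dti"]).round(2)
    ratio = out["loanAmnt"] / out["debt"]  # debt == 0 gives inf here
    out["loanAmnt_debt"] = ratio.replace([np.inf, -np.inf], np.nan).round(2)
    return out

toy = pd.DataFrame({
    "annualIncome": [100000.0, 80000.0],
    "dti": [0.2, 0.0],  # second borrower has no outside debt
    "loanAmnt": [300000.0, 50000.0],
})
result = add_debt_features(toy)
print(result)
```

The first row reproduces the arithmetic of the example (debt 20k, ratio 15); the second row ends up with a missing ratio instead of infinity. Whether missing is the right encoding for "no debt" is a modeling choice; a large sentinel value would be another option.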

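③ Variable processing
For example, employmentLength: turn the employment-length labels into integer levels (i.e. convert the field to a numeric variable).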

    def employmentLength_to_int(s):
        # Keep missing values; otherwise take the leading number of years
        if pd.isnull(s):
            return s
        else:
            return np.int8(s.split()[0])
    # Map the two non-numeric buckets onto plain year counts before parsing
    # (assigning the result back avoids chained inplace replacement)
    x['employmentLength'] = x['employmentLength'].replace(
        {'10+ years': '10 years', '< 1 year': '0 years'})
    x['employmentLength'] = x['employmentLength'].apply(employmentLength_to_int)

The data provides both a high and a low bound for the FICO score; take their average:

    x['fico']=(x.loc[:,'ficoRangeHigh']+x.loc[:,'ficoRangeLow'])*0.5

To sum up, add the new features; then, following the earlier Warnings, drop the highly correlated features (High correlation), the all-unique feature (Unique), the single-valued feature (Constant), and the old features that were used to generate the new ones.

    numerrical = (list(set(numerrical) - {'ficoRangeHigh', 'ficoRangeLow'})
                  + ['Income_installment', 'loanAmnt_installment', 'loanAmnt_debt', 'fico'])
    nominal = list(set(nominal) - {'id', 'n10', 'n2'})
    ordinal = list(set(ordinal) - {'grade', 'earliesCreditLine', 'issueDate'}) + ['CreditLine']

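2.2 Feature encoding

We chose an XGB model, so we looked up the encoding methods suited to that model (and settled on CatBoost encoding).
The summary below draws on two write-ups, one on handling categorical features in XGBoost and one on encoding categorical features on Kaggle.
https://blog.csdn.net/m0_37870649/article/details/104550054
https://zhuanlan.zhihu.com/p/40231966
In short, the encoding summary:
label encoding: the feature has an intrinsic order (ordinal feature)
one hot encoding: the feature has no intrinsic order, number of categories < 4
target encoding (mean encoding, likelihood encoding, impact encoding): the feature has no intrinsic order, number of categories > 4
beta target encoding: the feature has no intrinsic order, number of categories > 4, with K-fold cross validation
no processing (the model encodes automatically): CatBoost, lightgbm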
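
The K-fold idea in the summary can be sketched as follows: each row is encoded using only the targets of the other folds, which limits target leakage. This is a plain smoothed mean encoding written with pandas and numpy, not the beta target encoding from the referenced article; the function name `oof_target_encode`, the `prior_weight` smoothing, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat, target, n_splits=5, prior_weight=10.0, seed=0):
    """Out-of-fold smoothed mean encoding of one categorical Series."""
    cat = cat.reset_index(drop=True)
    target = target.reset_index(drop=True).astype(float)
    rng = np.random.default_rng(seed)
    # Random but balanced assignment of rows to folds
    fold_of_row = rng.permutation(len(cat)) % n_splits
    encoded = pd.Series(np.nan, index=cat.index)
    for k in range(n_splits):
        held_out = fold_of_row == k
        fold_mean = target[~held_out].mean()  # prior from the other folds only
        stats = target[~held_out].groupby(cat[~held_out]).agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare categories shrink more
        smooth = (stats["sum"] + prior_weight * fold_mean) / (stats["count"] + prior_weight)
        # Categories unseen in the other folds fall back to the prior
        encoded[held_out] = cat[held_out].map(smooth).fillna(fold_mean).to_numpy()
    return encoded

purpose = pd.Series(["car", "car", "home", "home", "car",
                     "home", "car", "home", "car", "home"])
is_default = pd.Series([1, 0, 0, 0, 1, 0, 1, 1, 0, 0])
enc = oof_target_encode(purpose, is_default, n_splits=5)
print(enc.round(3).tolist())
```

Because a row's own label never enters its encoding, changing that label leaves its encoded value unchanged. For the test set, the usual choice is to fit the category means on the whole training set.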

    from category_encoders import WOEEncoder, OneHotEncoder, CatBoostEncoder, TargetEncoder

    def Category_Encoders(train_x, train_y, test_x, val_x):
        frames = [train_x, test_x, val_x]
        for col in nominal:
            distinct = frames[0][col].nunique()
            if distinct <= 2:
                continue  # binary or constant column: leave it unencoded
            elif distinct < 4:
                # cols=[col] forces encoding even if the column is stored as numbers
                enc = OneHotEncoder(cols=[col], handle_missing='indicator')
            else:
                # alternatives that were tried: WOEEncoder(), TargetEncoder()
                enc = CatBoostEncoder(cols=[col])
            enc.fit(frames[0][[col]], train_y)
            # Swap the raw column for its encoded column(s); one-hot yields several
            frames = [pd.concat([f.drop(columns=col), enc.transform(f[[col]])], axis=1)
                      for f in frames]
        return frames[0], frames[1], frames[2]

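3. Model tuning

Tuning is usually done with GridSearchCV, but even on a few MB of data a somewhat wider search range can take more than ten hours; with 800,000 rows here, grid search is essentially off the table.
When many parameters need tuning or the dataset is large, Bayesian optimization is a good alternative.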
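
A back-of-the-envelope count shows why the grid becomes impractical. The numbers below are assumed for illustration only: 5 candidate values for each of the six parameters tuned later, 5-fold cross validation, and a Bayesian budget of 5 random starting points plus 20 guided iterations.

```python
from math import prod

# Hypothetical grid: 5 candidate values for each of six parameters
grid_sizes = {"max_depth": 5, "gamma": 5, "min_child_weight": 5,
              "max_delta_step": 5, "subsample": 5, "colsample_bytree": 5}
cv_folds = 5

# Grid search fits one model per parameter combination per fold
grid_fits = prod(grid_sizes.values()) * cv_folds

# Assumed Bayesian budget: 5 random starting points plus 20 guided iterations
bo_evaluations = 5 + 20
bo_fits = bo_evaluations * cv_folds

print(grid_fits, bo_fits)  # 78125 125
```

Under these assumptions the grid needs 78,125 model fits against 125 for the Bayesian search, at the cost of exploring the space less exhaustively.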

3.1 Bayesian optimization tuning

https://blog.csdn.net/ssswill/article/details/85274097

import time
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization

def BO_xgb(x,y):
    # time.clock() was removed in Python 3.8; perf_counter() is the replacement
    t1=time.perf_counter()

    # Objective: mean 5-fold AUC for one sampled parameter set
    def xgb_cv(max_depth,gamma,min_child_weight,max_delta_step,subsample,colsample_bytree):
        paramt={'booster': 'gbtree',
                'max_depth': int(max_depth),
                'gamma': gamma,
                'eta': 0.1,
                'objective': 'binary:logistic',
                'nthread': 4,
                'eval_metric': 'auc',
                'subsample': max(min(subsample, 1), 0),
                'colsample_bytree': max(min(colsample_bytree, 1), 0),
                'min_child_weight': min_child_weight,
                'max_delta_step': int(max_delta_step),
                'seed': 1001}
        model=XGBClassifier(**paramt)
        res = cross_val_score(model,x, y, scoring='roc_auc', cv=5).mean()
        return res
    # Search range of each tuned parameter
    cv_params ={'max_depth': (5, 12),
                'gamma': (0.001, 10.0),
                'min_child_weight': (0, 20),
                'max_delta_step': (0, 10),
                'subsample': (0.4, 1.0),
                'colsample_bytree': (0.4, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv,cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)

    t2=time.perf_counter()
    print('Elapsed (s):',(t2-t1))

    return xgb_op.max

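We tune six parameters ('max_depth', 'gamma', 'min_child_weight', 'max_delta_step', 'subsample', 'colsample_bytree') and finally set 'n_estimators': 1000 and 'learning_rate': 0.02. Note that 'eta' is XGBoost's alias for 'learning_rate', so the two values in the list below conflict and only one of them should be passed to the model.
The final best parameters are: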

'booster': 'gbtree','eta': 0.1,'nthread': 4,'eval_metric': 'auc','objective': 'binary:logistic',
                    'colsample_bytree': 0.4354, 'gamma': 9.888, 'max_delta_step': 4,'n_estimators':1000,'learning_rate':0.02,
                    'max_depth': 10, 'min_child_weight': 3.268, 'subsample': 0.7157

3.2 Compare the ROC on the prediction set and the training set

from sklearn import metrics
import matplotlib.pyplot as plt

def roc(m,x,y,name):
    """Predict, compute the ROC metrics, and plot the ROC curve."""
    y_pred = m.predict_proba(x)[:,1]
    fpr, tpr, threshold = metrics.roc_curve(y, y_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print(name+' AUC: {}'.format(roc_auc))
    # Plot the ROC curve
    plt.figure(figsize=(8, 8))
    plt.title(name + ' ROC')
    plt.plot(fpr, tpr, 'b', label = name + ' AUC = %0.4f' % roc_auc)
    plt.ylim(0,1)
    plt.xlim(0,1)
    plt.legend(loc='best')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # Draw the diagonal (random-guess baseline)
    plt.plot([0,1],[0,1],'r--')
    plt.show()

3.3 Submit the results

def prediction(m,x):
    submit=pd.read_csv('sample_submit.csv')
    y_pred = m.predict_proba(x)[:,1]
    submit['isDefault'] = y_pred
    submit.to_csv('prediction.csv', index=False)

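4. References

Official shared notebooks:
[Zero-basics introduction to financial risk control: loan default prediction] Competition description
https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.18.3b3022fa38mjJF&postId=129318
[Zero-basics introduction to financial risk control: loan default prediction] Baseline-LGBM (score around 0.730)
https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.21.3b3022fa38mjJF&postId=128654
[Multiple encoding methods in sklearn]
https://mattzheng.blog.csdn.net/article/details/107851162

pandas_profiling: generate a data analysis report with one line of code:
https://zhuanlan.zhihu.com/p/85967505

Lecture videos
Lecture 1: Understanding the task and baseline walkthrough
Speaker: 鱼佬 (Wang He), M.S. in Computer Science from Wuhan University, Tianchi data scientist, champion of the 2019 and 2020 Tencent Advertising Algorithm competitions
Link: https://tianchi.aliyun.com/course/video?liveId=41203
Lecture 2: Exploratory data analysis and feature engineering
Speaker: 言溪 (Tao Xudong), M.S. from Beijing Normal University, algorithm engineer
Link: https://tianchi.aliyun.com/course/live?liveId=41204
Lecture 3: Modeling, tuning, and model ensembling
Speaker: 小一 (Wu Zhengguang), Datawhale member, financial risk control enthusiast, data analysis engineer
Link: https://tianchi.aliyun.com/course/live?liveId=41206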

https://www.jianshu.com/p/bc96824e1ca8