[Risk Modeling] Decision Trees: Bad-Customer Analysis

2019-02-22 MichalLiu

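Project background

According to the business team, some customers borrow several times and then stop repaying altogether (a pattern the business calls 借死, referred to below as "hard default"), and the provinces report this pattern to different degrees. We therefore sample customers from two provinces, Hebei and Jiangxi, for a decision tree analysis.

Definition of hard default: after borrowing, the customer's principal is more than 60 days overdue.

Analysis goal

Identify the features most strongly related to hard default, then use the customer's earlier, normally repaid loans to predict the probability that the current withdrawal ends in hard default. Customers with a high predicted probability are refused the withdrawal.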

Definition of y

1: the customer is more than 60 days overdue

0: the customer is 60 days overdue or less
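The label definition above can be sketched in pandas as follows. This is only an illustration: the column name `OVERDUE_DAYS` and the sample values are assumptions, since the article does not show the raw data behind `TARGET`.

```python
import pandas as pd

# Hypothetical input: OVERDUE_DAYS is an assumed column name, not from the article's data.
loans = pd.DataFrame({"OVERDUE_DAYS": [0, 15, 60, 61, 120]})

# 1 = more than 60 days overdue (hard default), 0 = 60 days overdue or less
loans["TARGET"] = (loans["OVERDUE_DAYS"] > 60).astype(int)

print(loans["TARGET"].tolist())  # [0, 0, 0, 1, 1]
```

Note that exactly 60 days overdue falls into class 0, matching the "more than 60 days" wording.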

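Computing WOE and IV for the feature variables

[Figure: WOE and IV formulas]

P.S. The larger a feature's IV, the more strongly it separates bad customers from good ones, and the more it can contribute to an accurate model.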
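For reference, a commonly used form is WOE_i = ln((bad_i / bad_total) / (good_i / good_total)) for each bin i, and IV = the sum over bins of (bad_i / bad_total - good_i / good_total) × WOE_i. The sketch below computes both for a categorical or pre-binned feature. It is a minimal illustration, not the article's own calculation: the function name `woe_iv`, the sign convention (bad over good), and the `eps` smoothing that avoids log(0) in empty cells are all assumptions.

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series, eps: float = 0.5):
    """Return (per-bin table, total IV) for a categorical or pre-binned feature.

    eps is added to every good/bad count so that empty cells do not
    produce log(0); this smoothing is an assumption of this sketch.
    """
    counts = pd.crosstab(feature, target).reindex(columns=[0, 1], fill_value=0)
    good = counts[0] + eps
    bad = counts[1] + eps
    good_share = good / good.sum()
    bad_share = bad / bad.sum()
    woe = np.log(bad_share / good_share)
    iv = float(((bad_share - good_share) * woe).sum())
    table = pd.DataFrame({"good": counts[0], "bad": counts[1], "woe": woe})
    return table, iv

# Toy data for illustration only
demo = pd.DataFrame({
    "SEX":    ["M", "M", "M", "F", "F", "F", "F", "M"],
    "TARGET": [1,   1,   0,   0,   0,   0,   1,   1],
})
table, iv = woe_iv(demo["SEX"], demo["TARGET"])
print(table)
print("IV:", round(iv, 4))
```

A continuous feature such as `LAMOUNT` would need to be binned first, for example with `pd.qcut`, before being passed to a function like this.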

Decision tree model: Python code

# Decision tree
import pandas as pd
import numpy as np
from itertools import product
import matplotlib.pyplot as plt
from sklearn import datasets # load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from IPython.display import Image
from sklearn import tree
import pydotplus
import os
os.environ["PATH"] += os.pathsep + 'D:/Program Files (x86)/Graphviz2.38/bin/'

data = pd.read_csv('treedata.csv')

from sklearn.preprocessing import LabelEncoder
class_le1 = LabelEncoder()
class_le2 = LabelEncoder()
data['INDUSTRYNAME'] = class_le1.fit_transform(data['INDUSTRYNAME'].values)
data['COMPANYDUTYNAME'] = class_le2.fit_transform(data['COMPANYDUTYNAME'].values)

y = np.array(data['TARGET'])
X = np.array(data[['DAYOFCREDIT','LAMOUNT','INDUSTRYNAME','SEX','COMPANYDUTYNAME']])


# Split into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.8, test_size=0.2)

# Build the model: limit the tree depth to 5 and require at least 200 samples per leaf
clf = DecisionTreeClassifier(criterion="entropy",max_depth=5,min_samples_leaf=200)
# Fit the model
clf.fit(X_train, y_train)

print("Model accuracy","="*60)
print('Training accuracy: ', clf.score(X_train, y_train))
print('Testing accuracy: ', clf.score(X_test, y_test))
# Accuracy on the training set
answer_train = clf.predict(X_train)
print('Training accuracy (manual): ',np.mean(answer_train == y_train)) # mean of correct predictions; same value as clf.score(X_train, y_train)
# Accuracy on the test set
answer_test = clf.predict(X_test)
print('Testing accuracy (manual): ',np.mean(answer_test == y_test)) # mean of correct predictions; same value as clf.score(X_test, y_test)

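# Evaluate the model on the test set
y_prob = clf.predict_proba(X_test)[:,1] # predicted probability of class 1 (hard default)
print(y_prob)
y_pred = np.where(y_prob > 0.5, 1, 0)
# Compare the predictions with the true labels; clf.score(X_test, y_pred) would only compare the model with itself
print(classification_report(y_test, y_pred))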

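# Feature importances: the larger the value, the more the feature contributes to the classification
print("\nFeature importances","="*60)
print(clf.feature_importances_)
# Feature names in the same order as the columns of X
print(['DAYOFCREDIT','LAMOUNT','INDUSTRYNAME','SEX','COMPANYDUTYNAME'])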

# Visualization
dot_data = tree.export_graphviz(clf, out_file=None,
                         # must match the column order of X
                         feature_names=['DAYOFCREDIT','LAMOUNT','INDUSTRYNAME','SEX','COMPANYDUTYNAME'],
                         #class_names=cl,
                         filled=True, rounded=True,
                         special_characters=True #,proportion=True
                         )
graph = pydotplus.graph_from_dot_data(dot_data)
# Display inline when running in a Jupyter notebook.
Image(graph.create_png())
# Without a Jupyter notebook, write the tree to a PDF file and view it there.
graph.write_pdf("tree1.pdf")

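Visualization result

[Figure: model visualization]

Extracting decision rules

[Figure: feature conditions extracted for the customer segments with a high class share]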
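One way to read rules off a fitted tree without the picture is sketched below. It is an assumption-laden illustration rather than the procedure used in the article: the data is synthetic (treedata.csv is not available here), only two of the feature names are borrowed, and the tree settings are chosen for the toy data. `export_text` prints the split conditions, and `clf.apply` assigns each customer to a leaf so that the bad rate per leaf can be ranked.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): bad customers are those with
# a short credit history and a large loan amount.
rng = np.random.default_rng(0)
n = 2000
feature_names = ["DAYOFCREDIT", "LAMOUNT"]
X = np.column_stack([rng.integers(1, 365, n), rng.integers(500, 20000, n)])
y = ((X[:, 0] < 90) & (X[:, 1] > 10000)).astype(int)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                             min_samples_leaf=50, random_state=0)
clf.fit(X, y)

# Text version of the tree: one line per split condition or leaf
rules = export_text(clf, feature_names=feature_names)
print(rules)

# Leaf id for every customer, then size and bad rate of each leaf
leaf_ids = clf.apply(X)
summary = (
    pd.DataFrame({"leaf": leaf_ids, "bad": y})
    .groupby("leaf")["bad"]
    .agg(["count", "mean"])
    .rename(columns={"count": "customers", "mean": "bad_rate"})
    .sort_values("bad_rate", ascending=False)
)
print(summary)
```

The leaves at the top of `summary` are the segments with the highest share of bad customers; the matching path in `rules` gives the feature conditions that define each segment.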

Model performance

Use new samples to check the model's generalization error.
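A possible shape for such a check is sketched below, comparing AUC and the KS statistic on the training data with the same metrics on samples the model has not seen. The data comes from `make_classification` and the helper `auc_ks` is hypothetical; the choice of AUC and KS is an assumption of this sketch, not a metric named in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption): about 10% bad customers
X, y = make_classification(n_samples=5000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                             min_samples_leaf=200, random_state=0)
clf.fit(X_train, y_train)

def auc_ks(model, X, y):
    """Return (AUC, KS) where KS is the largest gap between TPR and FPR."""
    prob = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y, prob)
    return float(roc_auc_score(y, prob)), float(np.max(tpr - fpr))

auc_train, ks_train = auc_ks(clf, X_train, y_train)
auc_new, ks_new = auc_ks(clf, X_new, y_new)
print(f"train: AUC={auc_train:.3f} KS={ks_train:.3f}")
print(f"new:   AUC={auc_new:.3f} KS={ks_new:.3f}")
```

A large drop from the training metrics to the new-sample metrics would suggest that the tree is overfitting and that `max_depth` or `min_samples_leaf` should be tightened.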

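Applying the model

For each province, apply the rules of the decision model and refuse withdrawals for the customer groups that are highly likely to go into hard default. This can raise the repayment rate and reduce losses.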
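The refusal rule could be wired up roughly as follows. This is a sketch only: the function name `withdrawal_decision` is hypothetical, and the 0.5 cutoff simply mirrors the threshold used in the evaluation code above; in practice the cutoff would be chosen per province from the trade-off between avoided losses and refused good customers.

```python
import numpy as np

def withdrawal_decision(bad_prob, threshold=0.5):
    """Map predicted hard-default probabilities to 'reject' or 'approve'.

    threshold is an assumed cutoff, not a value recommended by the article.
    """
    bad_prob = np.asarray(bad_prob, dtype=float)
    return np.where(bad_prob > threshold, "reject", "approve")

# Probabilities as they might come from clf.predict_proba(X)[:, 1]
decisions = withdrawal_decision([0.05, 0.48, 0.62, 0.91])
print(decisions.tolist())  # ['approve', 'approve', 'reject', 'reject']
```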