[Risk Modeling] The Risk-Scoring A-Card
1. Types of Credit Risk Rating Models
A credit risk measurement system has two parts: obligor rating models and facility (debt-item) rating models, each made up of a series of models. Obligor rating models are commonly summarized as "four cards": the A-card, B-card, C-card, and F-card. Facility rating models are usually divided by the obligor's financing purpose into corporate financing models, cash-flow financing models, project financing models, and so on.
The A-card, also called the application scoring model, rates new customers applying for financing-related business; it applies to both individual and institutional obligors.
The B-card, also called the behavioral scoring model, manages existing customers over the life of their facilities, for example by predicting overdue or deferred payments; it applies to individual obligors only.
The C-card, also called the collection scoring model, predicts which existing customers will need collection; it applies to individual obligors only.
The F-card, also called the fraud scoring model, predicts possible fraud by new customers; it applies to both individual and institutional obligors.
This article focuses on the development process of one obligor rating model: the A-card.
2. The Credit Risk Rating Model Development Process
A typical development process is shown in Figure 2.1. The order of the steps can be adjusted to the situation at hand, and steps can be repeated as needed.
The main steps are:
(1) Data acquisition: collect data on existing and prospective customers. Existing customers already hold financing-related business at the securities firm, both individual and institutional. Prospective customers are those expected to do so in the future, mainly institutions; including them is a common way to work around the small sample sizes in the securities industry. Prospective institutional customers include listed companies, issuers of publicly traded bonds, NEEQ-listed companies, companies listed on regional equity exchanges, non-standard financing institutions, and so on.
(2) EDA (exploratory data analysis) and data description: get an overall picture of the sample so that a preprocessing plan can be drawn up. Typical summary statistics include missing values, outliers, mean, median, maximum, minimum, and distributions.
(3) Data preprocessing: data cleaning plus missing-value and outlier handling, turning the raw data into a formatted dataset usable for model development.
(4) Variable selection: use statistical methods to screen out the indicators with the most significant effect on default status.
(5) Model development: variable binning, the WOE (weight of evidence) transformation, and logistic regression estimation (the WOE and IV formulas used throughout are sketched right after this list).
(6) Master scale and model validation: build the master scale for the obligor class and validate and calibrate the model.
(7) Model evaluation: based on the validation and master-scale results, assess the model's discriminatory power, predictive power, and stability, write an evaluation report, and conclude whether the model can be used.
(8) Model implementation: deploy and apply the model.
(9) Monitoring and reporting: regularly monitor the model in use, track changes in its discriminatory power, predictive power, and stability, and when the model may no longer meet business needs, feed this back to the development team for a timely update or rebuild.
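To make step (5) concrete, here is a minimal sketch of the WOE and IV computations that the code below implements. The table and its numbers are illustrative, not from the real dataset.
import numpy as np
import pandas as pd

# Toy binned variable: good/bad counts per bin (made-up numbers)
bins = pd.DataFrame({'good_i': [100, 300, 600], 'bad_i': [60, 30, 10]})
good_total, bad_total = bins['good_i'].sum(), bins['bad_i'].sum()
# WOE of a bin: log of its share of all goods over its share of all bads
bins['woe'] = np.log((bins['good_i'] / good_total) / (bins['bad_i'] / bad_total))
# IV: sum over bins of (good share - bad share) * WOE
iv = ((bins['good_i'] / good_total - bins['bad_i'] / bad_total) * bins['woe']).sum()
print(bins)
print('IV =', round(iv, 4))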
3. A-Card Code (Python 3)
Data layout: the first column is the label y; the second column onward holds the features X.
Part 1: Data cleaning
Handle missing values and outliers.
# 1. Data cleaning: handle missing values and outliers
import pandas as pd
import matplotlib.pyplot as plt  # plotting library
from sklearn.ensemble import RandomForestRegressor

# Fill missing values by predicting them with a random forest
def set_missing(df):
    # Take the numeric features, reordered so MonthlyIncome comes first
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X: feature values (everything after the first column)
    X = known[:, 1:]
    # y: target values (the first column, i.e. MonthlyIncome)
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Fill the original missing entries with the predictions
    df.loc[df.MonthlyIncome.isnull(), 'MonthlyIncome'] = predicted
    return df

if __name__ == '__main__':
    # Load the data
    data = pd.read_csv('cs-training.csv')
    # Inspect missing values and the overall distribution of the dataset
    data.describe().to_csv('DataDescribe.csv')
    data = set_missing(data)       # random-forest fill for the heavily missing column
    data = data.dropna()           # drop the few remaining rows with missing values
    data = data.drop_duplicates()  # drop duplicate rows
    data.to_csv('MissingData.csv', index=False)
    data.describe().to_csv('MissingDataDescribe.csv')
    # Outlier handling: drop records with age equal to 0
    data = data[data['age'] > 0]
    data.to_csv('PretreatmentData.csv', index=False)  # the input Part 2 expects
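Whether to model-fill a column or simply drop its rows is easier to justify after measuring how much of each column is actually missing. A small sketch on the same cs-training.csv:
import pandas as pd

df = pd.read_csv('cs-training.csv')
# Fraction of missing values per column: model-fill the heavily missing column
# (MonthlyIncome here), drop the few rows missing elsewhere
print(df.isnull().mean().sort_values(ascending=False))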
Part 2: Data splitting
To validate the model's fit, we split the dataset into a training set and a test set.
import pandas as pd
import matplotlib.pyplot as plt  # plotting library
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    data = pd.read_csv('PretreatmentData.csv')
    # In the raw data good customers are 0 and defaulters are 1. To match the
    # usual convention (customers who perform and pay interest are 1), we
    # invert SeriousDlqin2yrs.
    data['SeriousDlqin2yrs'] = 1 - data['SeriousDlqin2yrs']
    Y = data['SeriousDlqin2yrs']
    X = data.iloc[:, 1:]
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
    train = pd.concat([Y_train, X_train], axis=1)
    test = pd.concat([Y_test, X_test], axis=1)
    train.to_csv('TrainData.csv', index=False)
    test.to_csv('TestData.csv', index=False)
    print(data.shape)
    print(train.shape)
    print(test.shape)
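One judgment call worth noting: defaulters are a small minority, so passing stratify=Y keeps the good/bad ratio identical in the two halves. A one-line variant of the split above (our suggestion, not in the original):
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0, stratify=Y)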
Part 3: Variable selection
Binning, correlation analysis, and IV-based screening.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
import seaborn as sns

def calc_var_median(sample_set):
    # Midpoints between adjacent distinct values of the variable, used as the
    # candidate binary split points for the decision tree
    var_list = list(np.unique(sample_set.loc[:, 'var']))  # distinct values
    var_list.sort()  # ascending order
    var_median_list = []
    for i in range(len(var_list) - 1):
        var_median = (var_list[i] + var_list[i + 1]) / 2  # midpoint of each adjacent pair
        var_median_list.append(var_median)
    return var_median_list
# Choose the best split point. min_sample is the minimum leaf size; a candidate
# split that would leave fewer samples than this on either side is skipped
# (set to 5% of the total sample, as described above).
def choose_best_split(sample_set, min_sample):
    # Use a CART-style Gini criterion to pick the best binary split point
    var_median_list = calc_var_median(sample_set)
    sample_cnt = sample_set.shape[0]  # number of rows
    sample1_cnt = sum(sample_set['target'])
    sample0_cnt = sample_cnt - sample1_cnt
    # Gini impurity of the parent node
    Gini = 1 - np.square(sample1_cnt / sample_cnt) - np.square(sample0_cnt / sample_cnt)
    ifSplit = 0
    bestGini = 0.0
    bestSplit_point = 0.0  # best split point found so far
    for split in var_median_list:
        left = sample_set[sample_set['var'] < split][['var', 'target']]
        right = sample_set[sample_set['var'] > split][['var', 'target']]
        left_cnt = left.shape[0]
        right_cnt = right.shape[0]
        left1_cnt = sum(left['target'])
        right1_cnt = sum(right['target'])
        left0_cnt = left_cnt - left1_cnt
        right0_cnt = right_cnt - right1_cnt
        left_ratio = left_cnt / sample_cnt
        right_ratio = right_cnt / sample_cnt
        if left_cnt < min_sample or right_cnt < min_sample:
            continue
        # Gini impurity of each child and the gain of this split
        Gini_left = 1 - np.square(left1_cnt / left_cnt) - np.square(left0_cnt / left_cnt)
        Gini_right = 1 - np.square(right1_cnt / right_cnt) - np.square(right0_cnt / right_cnt)
        Gini_temp = Gini - (left_ratio * Gini_left + right_ratio * Gini_right)  # gain over the weighted child impurity
        if Gini_temp > bestGini:
            bestGini = Gini_temp
            bestSplit_point = split
            ifSplit = 1  # a valid split was found
    return bestSplit_point, ifSplit  # best split point, and whether one exists
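# Worked check of the Gini gain above (illustrative numbers): with 10 samples,
# 4 of class 1 and 6 of class 0, the parent impurity is
#     Gini = 1 - 0.4**2 - 0.6**2 = 0.48
# A split into a pure left child of 4 and a pure right child of 6 has
# Gini_left = Gini_right = 0, so the gain Gini_temp = 0.48 - 0 = 0.48.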
# split_list collects the split points; each accepted split is appended to it.
# A subtree is split further only if it still holds at least twice the minimum
# leaf size (so both children can satisfy the leaf-size constraint), in which
# case the function recurses.
def bining_data_split(sample_set, min_sample, split_list):
    # Find the best split for this subset
    split, ifSplit = choose_best_split(sample_set, min_sample)
    if ifSplit > 0:
        split_list.append(split)
        # Partition the subset at the split point and recurse where possible
        sample_set_left = sample_set[sample_set['var'] < split][['var', 'target']]
        sample_set_right = sample_set[sample_set['var'] > split][['var', 'target']]
        if len(sample_set_left) >= min_sample * 2:
            bining_data_split(sample_set_left, min_sample, split_list)
        if len(sample_set_right) >= min_sample * 2:
            bining_data_split(sample_set_right, min_sample, split_list)
# Gini-based binning
def gini_bin(df, target, item, rate):
    sample_set = pd.DataFrame()  # working frame: the target plus the variable to bin
    sample_set['target'] = df[target]
    sample_set['var'] = df[item]
    # Minimum leaf size (the stopping threshold) as the fraction `rate` of the sample
    min_sample = sample_set.shape[0] * rate
    split_list = []
    bining_data_split(sample_set, min_sample, split_list)
    split_list.sort()
    # Bucket boundaries: start just below the minimum so the first bucket keeps it
    # (the original started at 0.0, which silently dropped values <= 0)
    least = df[item].min() - 1.0
    upper = df[item].max() + 1.0
    buckets = []
    for i in range(len(split_list) + 1):
        hi = upper if i == len(split_list) else split_list[i]
        # split points are midpoints between observed values, so no observation
        # can fall exactly on a boundary
        df_var_temp = df[(df[item] > least) & (df[item] < hi)][[target, item]].copy()
        df_var_temp['Bucket'] = '(' + str(least) + ',' + str(hi) + ')'
        buckets.append(df_var_temp)
        least = hi
    d1 = pd.concat(buckets)  # DataFrame.append is deprecated; concat is equivalent here
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame()
    d3['min'] = d2[item].min()
    d3['max'] = d2[item].max()
    d3['sum'] = d2[target].sum()
    d3['total'] = d2[target].count()
    d3['rate'] = d2[target].mean()
    good = d1[target].sum()
    bad = d1[target].count() - good
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    woe = list(d4['woe'].round(3))  # WOE rounded to 3 decimals
    cut = [float('-inf')] + split_list + [float('inf')]
    return d4, iv, cut, woe
# gini_bin above ties the two helper functions together and returns the list of split points.
# As a check, sklearn's decision tree on the single variable reproduces the result when
# the splitting criterion and pruning conditions match; a sketch follows.
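# A minimal cross-check of the claim above (our sketch; the helper name
# sklearn_split_points is ours). Assuming a sample_set frame with 'var' and
# 'target' columns, fit sklearn's CART classifier with the same Gini criterion
# and the same minimum-leaf fraction, then read the learned thresholds back.
from sklearn.tree import DecisionTreeClassifier

def sklearn_split_points(sample_set, rate=0.05):
    tree = DecisionTreeClassifier(criterion='gini', min_samples_leaf=rate)
    tree.fit(sample_set[['var']], sample_set['target'])
    mask = tree.tree_.feature >= 0  # internal nodes; leaves are marked -2
    return sorted(tree.tree_.threshold[mask])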
## Equal-frequency binning ###########################################
# Automatic binning: split X into quantile buckets, then reduce the bucket
# count until the bucket means of X and Y are perfectly monotone.
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    while np.abs(r) < 1:
        # qcut picks bucket edges from the frequency of X (equal-sized buckets);
        # cut would space the edges evenly over the values themselves
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        # With no tied values, Spearman's correlation is +1 or -1 exactly when
        # the two series are perfectly monotone in each other (on correlation
        # coefficients, see http://blog.sina.com.cn/s/blog_69e75efd0102wmd2.html)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame()
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    cut = [float('-inf')]
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))  # the n interior edges of the final n+1 buckets
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))  # WOE rounded to 3 decimals
    return d4, iv, cut, woe
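# A hedged caveat (our note, not in the original): on heavily skewed variables
# pd.qcut can raise "Bin edges must be unique". Passing duplicates='drop', as in
#     pd.qcut(X, n, duplicates='drop')
# merges the duplicate edges, though the quantile-based cut list at the end of
# mono_bin would then need to be rebuilt from the merged edges.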
# Binning for variables with hand-picked cut points
def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})  # cut on the supplied edges
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame()
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min')
    woe = list(d4['woe'].round(3))
    return d4, iv, woe
# Replace raw values with their bin's WOE
def replace_woe(series, cut, woe):
    woe_list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        # Walk the cut points from the top down; m ends at the index of the
        # bin whose lower edge is the largest cut point <= value
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        woe_list.append(woe[m])
        i += 1
    return woe_list
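# A vectorized sketch of the same mapping (our addition, assuming the edges in
# `cut` are strictly increasing): pd.cut labels each value with its bin index,
# which then indexes into the woe list. Note pd.cut treats bins as right-closed
# by default, so a value exactly on an edge may land one bin lower than in the
# loop above, which uses value >= lower edge.
def replace_woe_vectorized(series, cut, woe):
    codes = pd.cut(series, bins=cut, labels=False)
    return [woe[int(c)] for c in codes]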
# Score for each bin: coefficient * WOE * factor, rounded
def get_score(coe, woe, factor):
    scores = []
    for w in woe:
        score = round(coe * w * factor, 0)
        scores.append(score)
    return scores

# Map raw variable values to their bin scores (same bin search as replace_woe)
def compute_score(series, cut, score):
    score_list = []
    i = 0
    while i < len(series):
        value = series[i]
        j = len(cut) - 2
        m = len(cut) - 2
        while j >= 0:
            if value >= cut[j]:
                j = -1
            else:
                j -= 1
                m -= 1
        score_list.append(score[m])
        i += 1
    return score_list
# Distribution plots (histogram / KDE variants) for a variable
def drawdistplot(item):
    d = np.array(data[item].iloc[0:500])  # first 500 rows
    sns.set(palette="muted", color_codes=True)
    f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
    sns.distplot(d, kde=False, color="b", ax=axes[0, 0])
    sns.distplot(d, hist=False, rug=True, color="r", ax=axes[0, 1])
    sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=axes[1, 0])
    sns.distplot(d, color="m", ax=axes[1, 1])
    plt.show()

# Box plot of a variable grouped by the label
def drawboxplot(item):
    sns.boxplot(x=data['SeriousDlqin2yrs'], y=data[item])
    plt.show()
# KS statistic over score bands
def Score_KS(Y, X, n=10):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.cut(X, n)})  # equal-width score bands
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame()
    d3['min'] = d2.min().X
    d3['max'] = d2.max().X
    d3['total'] = d2.count().Y
    d3['good'] = d2.sum().Y
    d3['bad'] = d2.count().Y - d2.sum().Y
    d3['goodrate'] = d2.sum().Y / good
    d3['badrate'] = (d2.count().Y - d2.sum().Y) / bad
    # Cumulative good/bad proportions and their gap per band
    d3['totalgoodrate'] = d3['goodrate'].cumsum()
    d3['totalbadrate'] = d3['badrate'].cumsum()
    d3['KS_Range'] = d3['totalbadrate'] - d3['totalgoodrate']
    print(d3)
    print("KS =", round(d3['KS_Range'].max(), 4))  # the KS statistic is the maximum gap
if __name__ == '__main__':
    data = pd.read_csv('TrainData.csv')
    """
    SeriousDlqin2yrs: good/bad customer label
    RevolvingUtilizationOfUnsecuredLines: revolving utilization of unsecured lines
    age: age of the borrower
    NumberOfTime30-59DaysPastDueNotWorse: times 30-59 days past due but no worse
    DebtRatio: debt ratio
    MonthlyIncome: monthly income
    NumberOfOpenCreditLinesAndLoans: number of open credit lines and loans
    NumberOfTimes90DaysLate: number of times 90+ days late
    NumberRealEstateLoansOrLines: number of real-estate loans or lines
    NumberOfTime60-89DaysPastDueNotWorse: times 60-89 days past due but no worse
    NumberOfDependents: number of dependents
    """
    ## Distribution and box plots for age ###############################
    ## plotting reference: https://blog.csdn.net/qq_34264472/article/details/53814653
    # drawdistplot('age')            # distribution plots
    # drawboxplot('age')             # box plot
    # drawdistplot('MonthlyIncome')
    x1 = data['RevolvingUtilizationOfUnsecuredLines']
    x2 = data['DebtRatio']
    fig = plt.figure(1)
    ax = fig.add_subplot(111)
    ax.boxplot([x1, x2])
    ax.set_xticklabels(['RevolvingUtilizationOfUnsecuredLines', 'DebtRatio'])
    plt.show()
    grouped = data['SeriousDlqin2yrs'].groupby(data['SeriousDlqin2yrs']).count()
    # After the inversion in Part 2, index 0 is bad and index 1 is good
    print("bad-to-good ratio: {:.2%}".format(grouped[0] / grouped[1]))
    grouped.plot(kind='bar')
    plt.show()
    # Age vs. default rate: the default rate falls steadily as age increases.
    # Monthly income vs. default rate, with income banded as [0,5000], [5000,10000],
    # [10000,15000], [15000,20000], [20000,100000]: the default rate falls as
    # income rises up to 20000, then rises again once monthly income exceeds 20000.
    # NumberOfDependents vs. the label: the default rate rises as household size grows.
    # Times 30-59 days past due vs. default rate: the default rate rises with the
    # number of past-due occurrences.
    # The univariate analysis stops here; the remaining variables can be analyzed
    # in the same way.
    ######################################################################
    pinf = float('inf')   # positive infinity
    ninf = float('-inf')  # negative infinity
    dfx1, ivx1, cutx1, woex1 = mono_bin(data.SeriousDlqin2yrs, data.RevolvingUtilizationOfUnsecuredLines, n=10)
    dfx2, ivx2, cutx2, woex2 = mono_bin(data.SeriousDlqin2yrs, data.age, n=10)
    # dfx2, ivx2, cutx2, woex2 = gini_bin(data, 'SeriousDlqin2yrs', 'age', 0.1)  # Gini-based binning
    dfx4, ivx4, cutx4, woex4 = mono_bin(data.SeriousDlqin2yrs, data.DebtRatio, n=20)
    dfx5, ivx5, cutx5, woex5 = mono_bin(data.SeriousDlqin2yrs, data.MonthlyIncome, n=10)
    # Variables that could not be optimally auto-binned get hand-picked cut points:
    cutx3 = [ninf, 0, 1, 3, 5, pinf]
    cutx6 = [ninf, 1, 2, 3, 5, pinf]
    cutx7 = [ninf, 0, 1, 3, 5, pinf]
    cutx8 = [ninf, 0, 1, 2, 3, pinf]
    cutx9 = [ninf, 0, 1, 3, pinf]
    cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]
    dfx3, ivx3, woex3 = self_bin(data.SeriousDlqin2yrs, data['NumberOfTime30-59DaysPastDueNotWorse'], cutx3)
    dfx6, ivx6, woex6 = self_bin(data.SeriousDlqin2yrs, data['NumberOfOpenCreditLinesAndLoans'], cutx6)
    dfx7, ivx7, woex7 = self_bin(data.SeriousDlqin2yrs, data['NumberOfTimes90DaysLate'], cutx7)
    dfx8, ivx8, woex8 = self_bin(data.SeriousDlqin2yrs, data['NumberRealEstateLoansOrLines'], cutx8)
    dfx9, ivx9, woex9 = self_bin(data.SeriousDlqin2yrs, data['NumberOfTime60-89DaysPastDueNotWorse'], cutx9)
    dfx10, ivx10, woex10 = self_bin(data.SeriousDlqin2yrs, data['NumberOfDependents'], cutx10)
"""多变量分析:变量间的相关性分析,相关性图我们通过Python里面的seaborn包,调用heatmap()绘图函数进行绘制"""
corr = data.corr() # 计算各变量的相关性系数
xticks = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10'] # x轴标签
yticks = list(corr.index) # y轴标签
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
sns.heatmap(corr, annot=True, cmap='rainbow', ax=ax1,
annot_kws={'size': 9, 'weight': 'bold', 'color': 'blue'}) # 绘制相关性系数热力图
ax1.set_xticklabels(xticks, rotation=0, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)
plt.show()
"""
由上图可以看出,各变量之间的相关性是非常小的。NumberOfOpenCreditLinesAndLoans和NumberRealEstateLoansOrLines的相关性系数为0.43。
接下来,我进一步计算每个变量的Infomation Value(IV)。IV指标是一般用来确定自变量的预测能力。 其公式为:
IV=sum((goodattribute-badattribute)*ln(goodattribute/badattribute))
通过IV值判断变量预测能力的标准是:
< 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 to 0.5: strong
> 0.5: suspicious
"""
    # IV bar chart:
    ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]
    index = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
    fig1 = plt.figure(1)
    ax1 = fig1.add_subplot(1, 1, 1)
    x = np.arange(len(index)) + 1
    ax1.bar(x, ivlist, width=0.4)
    ax1.set_xticks(x)
    ax1.set_xticklabels(index, rotation=0, fontsize=12)
    ax1.set_ylabel('IV (Information Value)', fontsize=14)
    for a, b in zip(x, ivlist):
        plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
    # Show each variable's IV
    plt.show()
    # DebtRatio, MonthlyIncome, NumberOfOpenCreditLinesAndLoans,
    # NumberRealEstateLoansOrLines and NumberOfDependents have clearly low IV,
    # so they are dropped.
    # Replace raw values with WOE on the training set
    data['RevolvingUtilizationOfUnsecuredLines'] = Series(replace_woe(data['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1))
    data['age'] = Series(replace_woe(data['age'], cutx2, woex2))
    data['NumberOfTime30-59DaysPastDueNotWorse'] = Series(replace_woe(data['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3))
    data['DebtRatio'] = Series(replace_woe(data['DebtRatio'], cutx4, woex4))
    data['MonthlyIncome'] = Series(replace_woe(data['MonthlyIncome'], cutx5, woex5))
    data['NumberOfOpenCreditLinesAndLoans'] = Series(replace_woe(data['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6))
    data['NumberOfTimes90DaysLate'] = Series(replace_woe(data['NumberOfTimes90DaysLate'], cutx7, woex7))
    data['NumberRealEstateLoansOrLines'] = Series(replace_woe(data['NumberRealEstateLoansOrLines'], cutx8, woex8))
    data['NumberOfTime60-89DaysPastDueNotWorse'] = Series(replace_woe(data['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9))
    data['NumberOfDependents'] = Series(replace_woe(data['NumberOfDependents'], cutx10, woex10))
    data.to_csv('WoeData.csv', index=False)
    # Replace raw values with WOE on the test set
    test = pd.read_csv('TestData.csv')
    test['RevolvingUtilizationOfUnsecuredLines'] = Series(replace_woe(test['RevolvingUtilizationOfUnsecuredLines'], cutx1, woex1))
    test['age'] = Series(replace_woe(test['age'], cutx2, woex2))
    test['NumberOfTime30-59DaysPastDueNotWorse'] = Series(replace_woe(test['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, woex3))
    test['DebtRatio'] = Series(replace_woe(test['DebtRatio'], cutx4, woex4))
    test['MonthlyIncome'] = Series(replace_woe(test['MonthlyIncome'], cutx5, woex5))
    test['NumberOfOpenCreditLinesAndLoans'] = Series(replace_woe(test['NumberOfOpenCreditLinesAndLoans'], cutx6, woex6))
    test['NumberOfTimes90DaysLate'] = Series(replace_woe(test['NumberOfTimes90DaysLate'], cutx7, woex7))
    test['NumberRealEstateLoansOrLines'] = Series(replace_woe(test['NumberRealEstateLoansOrLines'], cutx8, woex8))
    test['NumberOfTime60-89DaysPastDueNotWorse'] = Series(replace_woe(test['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, woex9))
    test['NumberOfDependents'] = Series(replace_woe(test['NumberOfDependents'], cutx10, woex10))
    test.to_csv('TestWoeData.csv', index=False)
Part 4: Producing the score formula
# Compute the scores (continues the Part 3 script, using its cut/woe variables)
# coe holds the logistic-regression coefficients (estimated in Part 5)
coe = [9.738849, 0.638002, 0.505995, 1.032246, 1.790041, 1.131956]   # coefficients under equal-frequency binning
# coe = [9.751109, 0.637494, 0.524291, 1.030989, 1.790220, 1.130356]  # coefficients under Gini binning
# We take 600 as the base score with a PDO of 20 (the good:bad odds double every
# 20 points) and base odds of 20:1.
# The credit scorecard rejects applicants below 520; the new scorecard rejects below 100.
p = 20 / math.log(2)
q = 600 - 20 * math.log(20) / math.log(2)
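# Worked check of the two lines above (our arithmetic): the scaling is
#     Score = q + p * ln(odds), with p = PDO / ln 2 = 20 / ln 2, roughly 28.85,
#     and q = 600 - p * ln(20), roughly 600 - 86.44 = 513.56.
# At odds of 20:1 the score is q + p * ln(20) = 600, and doubling the odds to
# 40:1 adds p * ln 2 = 20 points, giving 620: exactly base 600 with PDO 20.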
baseScore = round(q + p * coe[0], 0)
# Per-bin scores for each retained variable
x1 = get_score(coe[1], woex1, p)
x2 = get_score(coe[2], woex2, p)
x3 = get_score(coe[3], woex3, p)
x7 = get_score(coe[4], woex7, p)
x9 = get_score(coe[5], woex9, p)
print("== base score =====================")
print(baseScore)
print("== variable cut points ============")
print(cutx1, cutx2, cutx3, cutx7, cutx9)
print("== per-bin scores =================")
print(x1, x2, x3, x7, x9)
test1 = pd.read_csv('TestData.csv')
test1['BaseScore'] = Series(np.zeros(len(test1))) + baseScore
test1['x1'] = Series(compute_score(test1['RevolvingUtilizationOfUnsecuredLines'], cutx1, x1))
test1['x2'] = Series(compute_score(test1['age'], cutx2, x2))
test1['x3'] = Series(compute_score(test1['NumberOfTime30-59DaysPastDueNotWorse'], cutx3, x3))
test1['x7'] = Series(compute_score(test1['NumberOfTimes90DaysLate'], cutx7, x7))
test1['x9'] = Series(compute_score(test1['NumberOfTime60-89DaysPastDueNotWorse'], cutx9, x9))
test1['Score'] = test1['x1'] + test1['x2'] + test1['x3'] + test1['x7'] + test1['x9'] + baseScore
test1.to_csv('ScoreData.csv', index=False)
# KS table for the resulting scores
print("== KS =============================")
Scoredf = pd.read_csv('ScoreData.csv')
Score_KS(Scoredf['SeriousDlqin2yrs'], Scoredf['Score'], 10)
Part 5: Fitting the logistic regression and reporting its coefficients
import pandas as pd
import matplotlib.pyplot as plt  # plotting library
import matplotlib
import seaborn as sns
import statsmodels.api as sm
from sklearn.metrics import roc_curve, auc
from scipy.stats import ks_2samp

if __name__ == '__main__':
    matplotlib.rcParams['axes.unicode_minus'] = False
    data = pd.read_csv('WoeData.csv')
    # Dependent variable
    Y = data['SeriousDlqin2yrs']
    # Independent variables, dropping those with little effect on the label
    X = data.drop(['SeriousDlqin2yrs', 'DebtRatio', 'MonthlyIncome',
                   'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
                   'NumberOfDependents'], axis=1)
    X1 = sm.add_constant(X)
    logit = sm.Logit(Y, X1)
    result = logit.fit()
    print("=" * 60)
    print(result.summary())
    print(result.params)
    # Model validation: ROC and AUC #####################################
    test = pd.read_csv('TestWoeData.csv')
    # Dependent variable
    Y_test = test['SeriousDlqin2yrs']
    # Independent variables, dropped to match the variables in the fitted model
    X_test = test.drop(['SeriousDlqin2yrs', 'DebtRatio', 'MonthlyIncome',
                        'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
                        'NumberOfDependents'], axis=1)
    X3 = sm.add_constant(X_test)
    resu = result.predict(X3)  # predicted probabilities on the test set
    fpr, tpr, threshold = roc_curve(Y_test, resu)
    rocauc = auc(fpr, tpr)  # compute the AUC
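    # A small sketch (our addition) putting the ks_2samp import above to use:
    # the KS statistic is the largest gap between the predicted-probability
    # distributions of the good and bad classes.
    ks = ks_2samp(resu[Y_test == 1], resu[Y_test == 0]).statistic
    print('KS = %0.4f' % ks)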
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % rocauc)  # ROC curve
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('TPR')  # true positive rate
    plt.xlabel('FPR')  # false positive rate
    plt.show()