Plotting the KS curve and computing the KS value in Python
2020-04-16
xiaogp
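KS formula
KS measures how well a binary classifier separates its two classes. For a given threshold on the predicted score, compute for each class the share of its own samples that fall on one side of the threshold (the hit rate within that group), then take the difference between the two shares. The threshold at which this difference is largest gives the KS value. A larger KS means the model captures more of one class while misjudging as few samples of the other class as possible. In other words, KS is the maximum, over all thresholds, of the gap between the two groups' hit rates.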
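One way to write this down, matching how the code further below computes it, is the following. The symbols $p_i$ (predicted score), $y_i$ (label), $t$ (threshold), $F_{\text{good}}$ and $F_{\text{bad}}$ are notation assumed here for illustration:

$$
F_{\text{good}}(t) = \frac{\#\{i : y_i = 0,\ p_i \le t\}}{\#\{i : y_i = 0\}}, \qquad
F_{\text{bad}}(t) = \frac{\#\{i : y_i = 1,\ p_i \le t\}}{\#\{i : y_i = 1\}}
$$

$$
\mathrm{KS} = \max_{t}\,\bigl(F_{\text{good}}(t) - F_{\text{bad}}(t)\bigr)
$$

If samples with $p_i > t$ are flagged as bad, then the share of bad samples caught is $1 - F_{\text{bad}}(t)$ and the share of good samples wrongly flagged is $1 - F_{\text{good}}(t)$, so the same quantity can be read as the largest gap between those two rates.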
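Data input
Inputs: predictions, labels, cut_point
predictions: the predicted value for each sample, a probability between 0 and 1
labels: the true value (0 or 1) for each sample; in this example 1 marks a bad customer
cut_point: the number of evenly spaced threshold points between 0 and 1 at which the difference is evaluated
Data preview: labels in the left column, predictions in the right column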
head -4 test_predict_res.txt
0.0 0.831193
0.0 0.088209815
1.0 0.93411493
0.0 0.022157196
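To experiment without the full file, the four preview rows above can be parsed into the two inputs. This is only a minimal sketch: the name `sample_text` is introduced here for illustration, and `np.loadtxt` is used as one convenient alternative to line-by-line parsing.

```python
import io

import numpy as np

# The four preview rows shown above: label first, predicted probability second
sample_text = """0.0 0.831193
0.0 0.088209815
1.0 0.93411493
0.0 0.022157196
"""

# np.loadtxt splits on whitespace by default and returns a (4, 2) float array
data = np.loadtxt(io.StringIO(sample_text))
labels = data[:, 0]       # left column: true labels
predictions = data[:, 1]  # right column: predicted probabilities

print(labels.tolist())
print(predictions.tolist())
```

Four rows are far too few for a meaningful KS value; the snippet only shows the expected shape of the inputs.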
Python implementation
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Font for Chinese glyphs; only needed if the plot labels contain Chinese text
matplotlib.rcParams["font.sans-serif"] = ["SimHei"]


def ks_plot(predictions, labels, cut_point=100):
    good_len = len([x for x in labels if x == 0])  # total number of good customers
    bad_len = len([x for x in labels if x == 1])  # total number of bad customers
    predictions_labels = list(zip(predictions, labels))
    good_point = []
    bad_point = []
    diff_point = []  # difference between the two rates at each threshold
    x_axis_range = np.linspace(0, 1, cut_point)
    for i in x_axis_range:
        # labels of the samples whose score is at or below the current threshold
        hit_data = [x[1] for x in predictions_labels if x[0] <= i]
        good_hit = len([x for x in hit_data if x == 0])  # good customers at or below the threshold
        bad_hit = len([x for x in hit_data if x == 1])  # bad customers at or below the threshold
        good_rate = good_hit / good_len  # share of all good customers
        bad_rate = bad_hit / bad_len  # share of all bad customers
        diff = good_rate - bad_rate  # gap at this threshold (KS candidate)
        good_point.append(good_rate)
        bad_point.append(bad_rate)
        diff_point.append(diff)
    ks_value = max(diff_point)  # the largest gap is the KS value
    ks_x_axis = diff_point.index(ks_value)  # index of the threshold where KS is reached
    # share of good and bad customers within their own groups at that threshold
    ks_good_point, ks_bad_point = good_point[ks_x_axis], bad_point[ks_x_axis]
    threshold = x_axis_range[ks_x_axis]  # threshold value at KS
    plt.plot(x_axis_range, good_point, color="green", label="good enterprise rate")
    plt.plot(x_axis_range, bad_point, color="red", label="bad enterprise rate")
    plt.plot(x_axis_range, diff_point, color="darkorange", alpha=0.5)
    plt.plot([threshold, threshold], [0, 1], linestyle="--", color="black", alpha=0.3, linewidth=2)
    plt.scatter([threshold], [ks_good_point], color="white", edgecolors="green", s=15)
    plt.scatter([threshold], [ks_bad_point], color="white", edgecolors="red", s=15)
    plt.scatter([threshold], [ks_value], color="white", edgecolors="darkorange", s=15)
    plt.title("KS={:.3f} threshold={:.3f}".format(ks_value, threshold))
    plt.text(threshold + 0.02, ks_good_point + 0.05, round(ks_good_point, 2))
    plt.text(threshold + 0.02, ks_bad_point + 0.05, round(ks_bad_point, 2))
    plt.text(threshold + 0.02, ks_value + 0.05, round(ks_value, 2))
    plt.legend(loc=4)
    plt.grid()
    plt.show()


if __name__ == "__main__":
    # read the true labels and the predicted values
    labels = []
    predictions = []
    with open("test_predict_res.txt", "r", encoding="utf8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:  # skip blank or malformed lines
                continue
            labels.append(float(fields[0]))
            predictions.append(float(fields[1]))
    ks_plot(predictions, labels)
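The loop above evaluates the gap only on a fixed grid of `cut_point` thresholds, so the reported KS can fall slightly short of the true maximum when the best threshold lies between grid points. As a cross-check, here is a hedged sketch that evaluates the same difference at every observed score using sorting and `np.searchsorted`. The function name `ks_statistic` and the toy data are assumptions made for illustration, not part of the original script.

```python
import numpy as np


def ks_statistic(predictions, labels):
    """Return (ks_value, threshold) using every observed score as a threshold."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels)
    good_scores = np.sort(predictions[labels == 0])
    bad_scores = np.sort(predictions[labels == 1])
    # Candidate thresholds: each distinct predicted score, in ascending order
    thresholds = np.unique(predictions)
    # Share of each group scored at or below each threshold
    good_rate = np.searchsorted(good_scores, thresholds, side="right") / good_scores.size
    bad_rate = np.searchsorted(bad_scores, thresholds, side="right") / bad_scores.size
    diff = good_rate - bad_rate  # same signed gap as in ks_plot
    best = int(np.argmax(diff))
    return float(diff[best]), float(thresholds[best])


# Made-up toy data: four good samples (label 0) and two bad samples (label 1)
toy_predictions = [0.1, 0.2, 0.3, 0.4, 0.35, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1]
ks_value, threshold = ks_statistic(toy_predictions, toy_labels)
print(ks_value, threshold)  # 0.75 0.3
```

Like `ks_plot`, this sketch keeps the signed difference (good minus bad), which assumes that bad samples tend to receive higher scores than good ones.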

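Interpreting the KS plot
Take enterprise-risk prediction as an example, where a predicted probability closer to 1 indicates a higher-risk enterprise. Choosing 0.121 as the classifier's probability threshold gives the maximum KS of 0.526. In other words, if every enterprise with a predicted score above 0.121 is flagged as bad, the model catches 70% of the bad enterprises while wrongly flagging 17% of the good ones. The gap between the two rates, 0.70 - 0.17 = 0.53, agrees with the KS value up to rounding.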
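The reading above can be reproduced for any threshold by computing the two rates directly. The following is a minimal sketch under the same convention (scores above the threshold are flagged as bad). The function name `rates_above_threshold` and the toy data are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def rates_above_threshold(predictions, labels, threshold):
    """Share of bad samples caught and share of good samples wrongly flagged."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels)
    flagged = predictions > threshold  # True where a sample is flagged as bad
    bad_hit_rate = float(flagged[labels == 1].mean())         # bad samples caught
    good_misjudged_rate = float(flagged[labels == 0].mean())  # good samples wrongly flagged
    return bad_hit_rate, good_misjudged_rate, bad_hit_rate - good_misjudged_rate


# Made-up toy data: four good samples (label 0) and two bad samples (label 1)
toy_predictions = [0.1, 0.2, 0.3, 0.4, 0.35, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1]
bad_hit, good_miss, gap = rates_above_threshold(toy_predictions, toy_labels, 0.3)
print(bad_hit, good_miss, gap)  # 1.0 0.25 0.75
```

Applied to the real data at threshold 0.121, the three returned values would correspond to the 70%, 17% and KS figures quoted above.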