Plotting the KS Curve and Computing the KS Value in Python

2020-04-16  xiaogp

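KS formula

KS measures how well a binary classifier separates its two classes. At a given score threshold, take the share of each class that the threshold selects, measured against that class's own total, and subtract one share from the other. The threshold at which this difference is largest gives the KS value. A larger KS means the model captures more of one class while misjudging as few samples of the other class as possible. The formula is:
ks = \max\left(\frac{\text{Cum. B}}{\text{Bad total}} - \frac{\text{Cum. G}}{\text{Good total}}\right)
Here Cum. B and Cum. G are the cumulative numbers of bad and good samples selected at a threshold, and the maximum is taken over all thresholds.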
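To make the formula concrete, here is a minimal sketch that evaluates it by hand on a tiny made-up dataset. The data and the helper name `toy_ks` are illustrative assumptions, not part of the original post. Samples are walked from the highest score downward, so Cum. B and Cum. G are the bad and good samples selected so far.

```python
def toy_ks(predictions, labels):
    """Return max(Cum.B / Bad total - Cum.G / Good total) over all cut positions.

    Assumes labels are 0 (good) or 1 (bad) and that scores have no ties.
    """
    bad_total = sum(1 for y in labels if y == 1)
    good_total = sum(1 for y in labels if y == 0)
    cum_bad = 0
    cum_good = 0
    ks = 0.0
    # Walk from the highest score to the lowest, accumulating each class.
    for score, y in sorted(zip(predictions, labels), reverse=True):
        if y == 1:
            cum_bad += 1
        else:
            cum_good += 1
        ks = max(ks, cum_bad / bad_total - cum_good / good_total)
    return ks


# Hypothetical toy data: 1 = bad, 0 = good
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0]
toy_predictions = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
print(toy_ks(toy_predictions, toy_labels))  # prints 0.5
```

In this toy case the maximum is first reached after the top two samples, both bad: 2/4 of the bad samples and 0/4 of the good samples have been selected.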

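Data input

Input: predictions, labels, cut_point

predictions: the predicted value of each sample, a probability between 0 and 1
labels: the true value (0 or 1) of each sample; in this example 1 marks a bad customer
cut_point: the number of evenly spaced thresholds in [0, 1] at which KS is evaluated

Data preview: the left column is labels, the right column is predictions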

head -4 test_predict_res.txt
0.0 0.831193
0.0 0.088209815
1.0 0.93411493
0.0 0.022157196
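A file in this two-column layout can also be read in one call with `numpy.loadtxt`, which splits on whitespace by default. The sketch below is an alternative to line-by-line parsing, not the approach used in this post. The helper name `load_predictions` is hypothetical, and a temporary file holding the four preview rows stands in for test_predict_res.txt so the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_predictions(path):
    """Read a whitespace-separated file: column 0 = label, column 1 = prediction."""
    data = np.loadtxt(path, ndmin=2)  # ndmin=2 keeps a 2-D shape even for one row
    labels = data[:, 0]
    predictions = data[:, 1]
    return predictions, labels


# Stand-in for test_predict_res.txt, using the rows shown in the preview
sample = "0.0 0.831193\n0.0 0.088209815\n1.0 0.93411493\n0.0 0.022157196\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    tmp_path = f.name

predictions, labels = load_predictions(tmp_path)
os.remove(tmp_path)
print(labels)
print(predictions)
```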

Python implementation

import numpy as np
import matplotlib.pyplot as plt


def ks_plot(predictions, labels, cut_point=100):
    good_len = len([x for x in labels if x == 0])  # total number of good customers
    bad_len = len([x for x in labels if x == 1])  # total number of bad customers
    if good_len == 0 or bad_len == 0:
        raise ValueError("labels must contain both classes (0 and 1)")
    predictions_labels = list(zip(predictions, labels))
    good_point = []  # cumulative share of good customers at each threshold
    bad_point = []  # cumulative share of bad customers at each threshold
    diff_point = []  # difference between the two shares at each threshold

    x_axis_range = np.linspace(0, 1, cut_point)
    for i in x_axis_range:
        hit_data = [x[1] for x in predictions_labels if x[0] <= i]  # labels of samples scored at or below the threshold
        good_hit = len([x for x in hit_data if x == 0])  # good customers at or below the threshold
        bad_hit = len([x for x in hit_data if x == 1])  # bad customers at or below the threshold
        good_rate = good_hit / good_len  # share of all good customers at or below the threshold
        bad_rate = bad_hit / bad_len  # share of all bad customers at or below the threshold
        diff = good_rate - bad_rate  # separation at this threshold
        good_point.append(good_rate)
        bad_point.append(bad_rate)
        diff_point.append(diff)

    ks_value = max(diff_point)  # the largest difference is the KS value
    ks_x_axis = diff_point.index(ks_value)  # index of the threshold where KS is reached
    ks_good_point, ks_bad_point = good_point[ks_x_axis], bad_point[ks_x_axis]  # class shares at that threshold
    threshold = x_axis_range[ks_x_axis]  # the threshold itself

    # cumulative curves, the difference curve, and a vertical line at the KS threshold
    plt.plot(x_axis_range, good_point, color="green", label="Good companies (cumulative share)")
    plt.plot(x_axis_range, bad_point, color="red", label="Bad companies (cumulative share)")
    plt.plot(x_axis_range, diff_point, color="darkorange", alpha=0.5)
    plt.plot([threshold, threshold], [0, 1], linestyle="--", color="black", alpha=0.3, linewidth=2)

    # mark the three curves at the KS threshold
    plt.scatter([threshold], [ks_good_point], color="white", edgecolors="green", s=15)
    plt.scatter([threshold], [ks_bad_point], color="white", edgecolors="red", s=15)
    plt.scatter([threshold], [ks_value], color="white", edgecolors="darkorange", s=15)
    plt.title("KS={:.3f} threshold={:.3f}".format(ks_value, threshold))

    # annotate the marked points with their values
    plt.text(threshold + 0.02, ks_good_point + 0.05, "{:.2f}".format(ks_good_point))
    plt.text(threshold + 0.02, ks_bad_point + 0.05, "{:.2f}".format(ks_bad_point))
    plt.text(threshold + 0.02, ks_value + 0.05, "{:.2f}".format(ks_value))

    plt.legend(loc=4)
    plt.grid()
    plt.show()


if __name__ == "__main__":
    # read the true labels (first column) and the predicted values (second column)
    labels = []
    predictions = []
    with open("test_predict_res.txt", "r", encoding="utf8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            labels.append(float(fields[0]))
            predictions.append(float(fields[1]))

    ks_plot(predictions, labels)
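The loop in ks_plot rescans every sample for each threshold, which can become slow on large files. As a rough sketch of one alternative, the same cumulative shares can be computed with NumPy by sorting each class's scores once and counting with `numpy.searchsorted`. The function name `ks_curve` and its return layout are assumptions made for illustration, and plotting is left out.

```python
import numpy as np


def ks_curve(predictions, labels, cut_point=100):
    """Return (ks_value, threshold, good_rate, bad_rate) on an even threshold grid.

    Assumes labels contain both 0 (good) and 1 (bad).
    """
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels)
    thresholds = np.linspace(0, 1, cut_point)

    good_scores = np.sort(predictions[labels == 0])
    bad_scores = np.sort(predictions[labels == 1])

    # With side="right", searchsorted returns how many scores are <= each threshold.
    good_rate = np.searchsorted(good_scores, thresholds, side="right") / good_scores.size
    bad_rate = np.searchsorted(bad_scores, thresholds, side="right") / bad_scores.size

    diff = good_rate - bad_rate
    idx = int(np.argmax(diff))  # first threshold at which the difference peaks
    return diff[idx], thresholds[idx], good_rate, bad_rate
```

Like `max` followed by `list.index` in the loop version, `numpy.argmax` returns the first position of the maximum, so both approaches should pick the same threshold on the same grid.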

Figure: ks_plot.png (the KS plot produced by the code above)

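Interpreting the KS plot

Take corporate risk prediction as an example, where a predicted probability closer to 1 indicates a higher-risk company. With 0.121 as the classifier's probability threshold, KS reaches its maximum of 0.526. In other words, if every company with a predicted value above 0.121 is classified as bad, the model catches 70% of the bad companies while misjudging 17% of the good companies as bad.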
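The two percentages in this reading can be recomputed directly from a threshold. The sketch below is illustrative only, and the helper name `rates_at_threshold` is an assumption. It counts the samples scored above the threshold in each class. Because ks_plot accumulates samples at or below the threshold, its difference good_rate - bad_rate equals (1 - good share flagged) - (1 - bad share flagged), which is the bad share flagged minus the good share flagged. With the rounded figures above, 0.70 - 0.17 = 0.53, which agrees with the reported KS of 0.526.

```python
def rates_at_threshold(predictions, labels, threshold):
    """Return (bad_hit_rate, good_misjudged_rate) when scores above threshold are flagged bad.

    Assumes labels are 0 (good) or 1 (bad) and that both classes are present.
    """
    bad_scores = [p for p, y in zip(predictions, labels) if y == 1]
    good_scores = [p for p, y in zip(predictions, labels) if y == 0]
    # Share of bad samples correctly flagged as bad
    bad_hit_rate = sum(p > threshold for p in bad_scores) / len(bad_scores)
    # Share of good samples wrongly flagged as bad
    good_misjudged_rate = sum(p > threshold for p in good_scores) / len(good_scores)
    return bad_hit_rate, good_misjudged_rate
```

Calling `rates_at_threshold(predictions, labels, 0.121)` on the data from this post would be expected to return values close to 0.70 and 0.17.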
