Brief notes on StratifiedKFold

2019-05-24  madeirak

Reference: A comparison of StratifiedKFold and KFold

# coding:utf-8
import numpy as np
from sklearn.model_selection import KFold,StratifiedKFold

X=np.array([
    ['a', 2, 122.21, 4],
    ['b', 3, 132.12, 14],
    ['c', 31, 155.33, 24],
    ['d', 12, 143.93, 34],
    ['c', 32, 124.31, 44],
    ['a', 1, 151.11, 54],
    ['b', 11, 112.33, 64],
    ['b', 21, 137.82, 74]
])

y=np.array([1,1,0,0,1,1,0,0])
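# random_state only matters when shuffle=True; recent scikit-learn
# versions raise an error if it is set together with shuffle=False.
sfolder = StratifiedKFold(n_splits=3, shuffle=False)
folder = KFold(n_splits=3, shuffle=False)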

for train_idx,val_idx in sfolder.split(X,y):
    print('Train: %s | Val: %s' % (train_idx, val_idx))
    print(" ")
    # Uncomment the lines below to print the data instead of only the indices.
    #print("train_x: ",X[train_idx])
    #print("train_y: ",y[train_idx])
    #print("val_x:", X[val_idx])
    #print("val_y:", y[val_idx])

    #print("\n\n")

print("\n\n  --------  \n\n")

for train_idx, val_idx in folder.split(X,y):
    print('Train: %s | Val: %s' % (train_idx, val_idx))
    print(" ")            

The output contains only indices. To see the data itself, uncomment the commented-out print statements in the code.
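
To make the step from indices to data concrete, here is a minimal sketch of selecting rows with the index arrays that `split` returns. The helper name `collect_folds` and the numeric-only `X` are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Numeric columns only, loosely mirroring the example data (an assumption).
X = np.array([[2, 4], [3, 14], [31, 24], [12, 34],
              [32, 44], [1, 54], [11, 64], [21, 74]])
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])


def collect_folds(X, y, n_splits=3):
    """Return one (train_x, train_y, val_x, val_y) tuple per fold."""
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=False)
    folds = []
    for train_idx, val_idx in splitter.split(X, y):
        # Integer index arrays select rows of X and entries of y directly.
        folds.append((X[train_idx], y[train_idx], X[val_idx], y[val_idx]))
    return folds


folds = collect_folds(X, y)
for train_x, train_y, val_x, val_y in folds:
    print("val_x:", val_x.tolist(), "| val_y:", val_y.tolist())
```

Which rows land in which fold can differ between scikit-learn versions, so the printed values are best read as an illustration rather than a fixed result.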

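As the output shows, StratifiedKFold performs stratified cross-validation splitting: it keeps the proportion of each class in the training and test sets the same as in the original dataset, or as close to it as the fold sizes allow. For example, classes 0 and 1 occur in a 1:1 ratio in the original data, and inspecting the test sets produced by StratifiedKFold shows that the two classes stay at about 1:1 in each of them. With 4 samples per class spread over 3 folds, an individual fold can deviate slightly from exactly 1:1, and the exact assignment depends on the scikit-learn version. This is stratified sampling. Once a test set is fixed, the training set is simply its complement.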
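
The idea behind stratified splitting can be sketched in plain Python. This is only an illustration of the principle, not scikit-learn's actual algorithm: the function `stratified_folds` is hypothetical, and it simply deals the indices of each class round-robin into the folds. The label list uses 6 samples per class (an assumption chosen so that 3 folds divide evenly).

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to n_splits test folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin, like cards, so every
        # fold receives (almost) the same number of samples from that class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


labels = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]  # 6 samples per class
folds = stratified_folds(labels, n_splits=3)
for test_idx in folds:
    counts = Counter(labels[i] for i in test_idx)
    # The training set is the complement of the test set.
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    print("Test:", test_idx, "class counts:", dict(counts), "| Train:", train_idx)
```

Because each class is distributed separately, every test fold inherits the 1:1 class ratio of the full label list, and so does its complement, the training set.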

Note also that the test sets are mutually exclusive: no sample appears in more than one test set.
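
This property can be checked directly. The sketch below, with a hypothetical helper `folds_partition_data` and placeholder features, collects the validation indices from every fold and verifies that none repeats and none is missing.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((8, 2))  # placeholder features; only the number of rows matters
y = np.array([1, 1, 0, 0, 1, 1, 0, 0])


def folds_partition_data(splitter, X, y):
    """Return True if the test folds are pairwise disjoint and cover every index."""
    seen = []
    for _, val_idx in splitter.split(X, y):
        seen.extend(val_idx.tolist())
    # No index repeated (disjoint) and all indices present (complete cover).
    return len(seen) == len(set(seen)) and sorted(seen) == list(range(len(y)))


for splitter in (KFold(n_splits=3), StratifiedKFold(n_splits=3)):
    print(type(splitter).__name__, folds_partition_data(splitter, X, y))
```

Both KFold and StratifiedKFold are expected to pass this check, since each sample is used for validation exactly once across the folds.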

