Data Mining and Machine Learning

Dataset Splitting Methods

2020-02-24, by 清梦载星河

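Question: how should the data be split into a training set and a validation set so that performance on the validation set reflects the model's ability to generalize?

1. The basic principle of splitting

Basic principle: keep the training set and the validation set mutually exclusive, i.e. no sample used for validation may also appear among the training samples.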
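The mutual-exclusion rule can be checked mechanically. The sketch below is only an illustration: the toy DataFrame and its column names are made up, and it assumes the row index uniquely identifies each sample.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"col1": range(20), "col4": [i % 2 for i in range(20)]})

train_set, valid_set = train_test_split(df, test_size=0.25, random_state=42)

# The two subsets should share no row labels ...
overlap = train_set.index.intersection(valid_set.index)
# ... and together they should account for every row of df.
covers_all = len(train_set) + len(valid_set) == len(df)

print(len(overlap), covers_all)
```

Note that an index check does not catch duplicated rows: if the same record appears twice in df, one copy can land in each subset. Calling df.drop_duplicates() before splitting is one way to guard against that.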

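2. Splitting methods

Hold-out method

Split the dataset directly into two mutually exclusive subsets: one serves as the training set, the other as the validation set.
Common split ratios (training : validation): 7:3, 7.5:2.5, 8:2.
Drawback: the split is a single random sample, so the validation set may not be representative of the whole dataset, and the resulting score can change noticeably from one split to another.
Related function: from sklearn.model_selection import train_test_split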

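Example code:

from sklearn.model_selection import train_test_split

# df is assumed to be an existing pandas DataFrame with columns col1..col4
train_set, test_set = train_test_split(df, test_size=0.25, random_state=42)

# Extract the features and the target from the training set
x_train = train_set[['col1', 'col2', 'col3']].copy()
y_train = train_set['col4'].copy()

# Extract the features and the target from the validation set
x_test = test_set[['col1', 'col2', 'col3']].copy()
y_test = test_set['col4'].copy()

# Train the model, etc. ...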
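One common way to reduce the representativeness problem of a purely random hold-out split is stratified sampling, which train_test_split supports through its stratify parameter. The following is a minimal sketch on made-up imbalanced data (80 negatives, 20 positives); the column names simply mirror the example above and are not from a real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 negatives and 20 positives.
df = pd.DataFrame({"col1": range(100), "col4": [0] * 80 + [1] * 20})
x, y = df[["col1"]], df["col4"]

# stratify=y asks for the class proportions of y to be preserved
# (as closely as possible) in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=42
)

# Share of positives overall, in the training set, and in the validation set.
print(y.mean(), y_train.mean(), y_test.mean())
```

Stratification only balances the label column passed to stratify; other features can still be unevenly distributed between the two subsets.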

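Cross-validation (CV)

The dataset is divided into k mutually exclusive folds. Each fold serves once as the validation set while the remaining k-1 folds form the training set, and the k scores are averaged.

Related functions:
- from sklearn.model_selection import KFold
- or from sklearn.model_selection import cross_val_score

Example code: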

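from sklearn.model_selection import KFold

# x and y are assumed to be the features and the target taken from df,
# e.g. x = df[['col1', 'col2', 'col3']] and y = df['col4']

# n_splits: number of folds
# shuffle: whether to shuffle the data once before dividing it into folds
kf = KFold(n_splits=10, shuffle=True)

for train_index, test_index in kf.split(df):
    # kf.split yields positional indices, so select rows with iloc, not loc
    x_traincv, x_testcv = x.iloc[train_index], x.iloc[test_index]
    y_traincv, y_testcv = y.iloc[train_index], y.iloc[test_index]

    # Train and evaluate on this fold ...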
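The cross_val_score function listed above wraps the split, fit and score loop in a single call. Here is a minimal sketch of how it might be used; the iris dataset and the LogisticRegression model are stand-ins chosen for illustration, not part of the original post.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data and model, chosen only so the example is self-contained.
x, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Same splitter as in the manual loop, with a fixed seed for reproducibility.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# Returns one score per fold (accuracy by default for a classifier).
scores = cross_val_score(model, x, y, cv=kf)

print(scores.mean(), scores.std())
```

The mean of the per-fold scores is the usual summary, and the standard deviation gives a rough sense of how much the estimate depends on the particular split.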
