Dataset splitting: training, validation, and test sets

2021-10-30  小贝学生信

1. Dataset splitting

1.1 Training set

1.2 Test set

1.3 Splitting methods

(1) K-fold cross-validation

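K-fold cross-validation is a commonly used way to split the data into training and held-out portions for model training and validation. The procedure is as follows: the training data are randomly divided into k folds of roughly equal size; the model is fit on k-1 folds and evaluated on the remaining fold; this is repeated k times so that each fold is held out exactly once; the k error estimates are then averaged.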
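The k-fold procedure can be sketched in base R as follows. This is only an illustrative sketch, not the post's original code: the built-in `mtcars` data stand in for a training set, and `k = 5` and the linear model `mpg ~ wt + hp` are assumptions chosen for the example.

```r
# Minimal k-fold cross-validation sketch in base R (no extra packages).
set.seed(123)                      # for reproducibility
k <- 5                             # number of folds (assumed value)
train <- mtcars                    # stand-in for the training set (assumption)

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(train)))

rmse <- numeric(k)
for (i in seq_len(k)) {
  analysis   <- train[fold_id != i, ]   # k - 1 folds used for fitting
  assessment <- train[fold_id == i, ]   # held-out fold used for validation
  fit  <- lm(mpg ~ wt + hp, data = analysis)
  pred <- predict(fit, newdata = assessment)
  rmse[i] <- sqrt(mean((assessment$mpg - pred)^2))
}

cv_rmse <- mean(rmse)              # average error across the k folds
print(cv_rmse)
```

Each row is held out exactly once, so every observation contributes to both fitting and validation.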

(2) Bootstrap sampling

The core idea of bootstrap sampling is sampling with replacement.
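A base R sketch of sampling with replacement is shown below. It is an illustration under stated assumptions rather than the post's own code: `mtcars` stands in for the training set, `times = 10` is an arbitrary number of resamples, and the rows that are never drawn (often called "out-of-bag" rows) are used here as the held-out portion.

```r
# Minimal bootstrap resampling sketch in base R (no extra packages).
set.seed(123)                      # for reproducibility
train <- mtcars                    # stand-in for the training set (assumption)
n     <- nrow(train)
times <- 10                        # number of bootstrap resamples (assumed value)

oob_rmse <- numeric(times)
for (b in seq_len(times)) {
  idx <- sample(n, size = n, replace = TRUE)  # draw n row indices with replacement
  oob <- setdiff(seq_len(n), idx)             # rows never drawn ("out-of-bag")
  fit  <- lm(mpg ~ wt + hp, data = train[idx, ])
  pred <- predict(fit, newdata = train[oob, ])
  oob_rmse[b] <- sqrt(mean((train$mpg[oob] - pred)^2))
}

print(mean(oob_rmse))              # average out-of-bag error over the resamples
```

Because rows are drawn with replacement, a bootstrap sample of size n contains duplicates, and on average about 63% of the original rows appear in it at least once.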

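One point worth stressing: both methods further split the original training set into a training portion and a held-out (validation) portion. Throughout this process the test set is set aside and left untouched; it is used only once the final model has been chosen.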

2. Hands-on with data

# Ames housing data
ames <- AmesHousing::make_ames()

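2.1 Splitting the original data into training and test sets

Several approaches are available: base R, the caret package, or the rsample package.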

# Option 1: base R
set.seed(123)  # for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.7))
train_1 <- ames[index_1, ]
test_1  <- ames[-index_1, ]

# Option 2: caret package
library(caret)
set.seed(123)  # for reproducibility
index_2 <- createDataPartition(ames$Sale_Price, p = 0.7,
                               list = FALSE)
train_2 <- ames[index_2, ]
test_2  <- ames[-index_2, ]

# Option 3: rsample package
library(rsample)
set.seed(123)  # for reproducibility
split_1  <- initial_split(ames, prop = 0.7)
train_3  <- training(split_1)
test_3   <- testing(split_1)

Supplement: splitting imbalanced samples

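In supervised learning for classification (for example, predicting whether a tumor is malignant or benign), a highly imbalanced class distribution in the collected samples calls for extra care in both sampling and training. Stratified sampling keeps the class proportions roughly the same in the training and test sets.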

# Job attrition data
library(tidyverse)
library(modeldata)
churn <- attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)

# The ratio of employees who stayed to those who left is about 84:16
table(churn$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8387755 0.1612245
# stratified sampling with the rsample package
set.seed(123)
split_strat  <- initial_split(churn, prop = 0.7, 
                              strata = "Attrition")
train_strat  <- training(split_strat)
test_strat   <- testing(split_strat)

table(train_strat$Attrition) %>% prop.table()
## 
##       No      Yes 
## 0.838835 0.161165
table(test_strat$Attrition) %>% prop.table()
## 
##        No       Yes 
## 0.8386364 0.1613636

Balancing class proportions by sampling
Down-sampling balances the dataset by reducing the size of the abundant class(es) to match the frequency of the least prevalent class. This method is used when the quantity of data is sufficient.
Conversely, up-sampling is used when the quantity of data is insufficient. It balances the dataset by increasing the number of rare samples: rather than discarding abundant samples, new rare samples are generated by repetition or bootstrapping.
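The two strategies can be sketched in base R as follows. This is a minimal illustration, not the post's own code: the `toy` data frame and its 84:16 class split are made up for the example. The caret package used above also offers `downSample()` and `upSample()` helpers for the same purpose.

```r
# Minimal down-sampling / up-sampling sketch in base R (no extra packages).
set.seed(123)                      # for reproducibility

# Toy imbalanced data: 84 "No" rows and 16 "Yes" rows (made-up values)
toy <- data.frame(
  x     = rnorm(100),
  class = factor(rep(c("No", "Yes"), times = c(84, 16)))
)

rows_by_class <- split(seq_len(nrow(toy)), toy$class)  # row indices per class
n_min <- min(lengths(rows_by_class))                   # size of the rarest class
n_max <- max(lengths(rows_by_class))                   # size of the largest class

# Down-sampling: keep n_min randomly chosen rows of every class
down_idx <- unlist(lapply(rows_by_class, function(r) {
  r[sample.int(length(r), n_min)]
}), use.names = FALSE)
down <- toy[down_idx, ]

# Up-sampling: keep all original rows, then repeat rows of the smaller
# classes (drawn with replacement) until every class has n_max rows
up_idx <- unlist(lapply(rows_by_class, function(r) {
  extra <- r[sample.int(length(r), n_max - length(r), replace = TRUE)]
  c(r, extra)
}), use.names = FALSE)
up <- toy[up_idx, ]

print(table(down$class))   # 16 rows per class
print(table(up$class))     # 84 rows per class
```

Balancing of this kind is normally applied to the training portion only, so that the test set keeps the class proportions seen in the original data.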

# K-fold cross-validation
rsample::vfold_cv(ames, v = 10)

# Bootstrap
rsample::bootstraps(ames, times = 10)