Preprocessing the Input Dataset

2021-10-30  小贝学生信

Leo Breiman, the distinguished statistician at UC Berkeley, once said, "live with your data before you plunge into modeling."
The importance of input data quality goes without saying. Between finishing data collection and feeding the data into model training, the data needs some preliminary cleaning and processing, so that it meets the requirements of the model and makes modeling more efficient. This section gives an overview of data preprocessing methods (known more formally as feature engineering) from two broad angles.

0. Example data and main R packages

ames <- AmesHousing::make_ames()  # the processed Ames housing data
dim(ames)
# [1] 2930   81

set.seed(123)
library(rsample)
split <- initial_split(ames, prop = 0.7, 
                       strata = "Sale_Price")
ames_train  <- training(split)
dim(ames_train)
# [1] 2049   81
ames_test   <- testing(split)
dim(ames_test)
# [1] 881  81
library(recipes)
# declare which variables are predictors and which one is the outcome (target)
ames_recipe <- recipe(Sale_Price ~ ., data = ames_train)
# Recipe
# 
# Inputs:
# 
#       role #variables
#    outcome          1
#  predictor         80

I. Data preprocessing methods

Part 1: Low-quality data

Low-quality data covers two situations: (1) for some samples, certain variables were simply never measured, i.e. missing values (NA); (2) even when values were collected, some features show very little variation across samples, contribute little to the model, and should be considered for removal.

1.1 Missing values

Missing values fall into two categories: informative missingness and missingness at random.

(1) informative missingness: the fact that a value is missing itself carries some information;
(2) missingness at random: the value was simply not collected because of problems in the data-collection process, and the missing entries are randomly distributed across the feature.
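
Before deciding on an imputation step, it can help to count how much is actually missing per column. A minimal sketch; note that the data returned by AmesHousing::make_ames() is already cleaned, so these counts are typically zero and the imputation steps below are mainly illustrative:
# number of missing values per column, largest first
na_counts <- sort(colSums(is.na(ames_train)), decreasing = TRUE)
head(na_counts)
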
# Impute missing values in the Gr_Liv_Area column with its median
ames_recipe %>%
  step_medianimpute(Gr_Liv_Area)

# Impute missing values in all predictors with KNN
ames_recipe %>%
  step_knnimpute(all_predictors(), neighbors = 6)

# Impute missing values in all predictors with a tree-based (bagged trees) model
ames_recipe %>%
  step_bagimpute(all_predictors())

1.2 Low-variance features

ames_recipe %>%
  step_nzv(all_nominal()) 
# step_nzv() defaults: freq_cut = 95/5, unique_cut = 10
# Nominal variables include both character and factor.
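
As a quick cross-check (a sketch assuming the caret package is installed), caret::nearZeroVar() uses the same default cut-offs (freqCut = 95/5, uniqueCut = 10) and can list the columns that would be flagged:
# per-column frequency-ratio / percent-unique diagnostics
nzv_metrics <- caret::nearZeroVar(ames_train, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])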

Part 2: Variable transformation

2.1 Target engineering

When the target (here Sale_Price) is strongly skewed, transforming it toward a more symmetric distribution often helps: a log transform, Box-Cox, or Yeo-Johnson (the latter also handles zero and negative values).

ames_recipe %>%
  step_log(all_outcomes())
ames_recipe %>%
  step_BoxCox(all_outcomes())
ames_recipe %>%
  step_YeoJohnson(all_outcomes())
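
To see why this matters, a simple histogram comparison makes the effect of a log transform on the right-skewed Sale_Price visible (a minimal sketch with base R graphics):
par(mfrow = c(1, 2))
hist(ames_train$Sale_Price, main = "raw Sale_Price", xlab = "Sale_Price")
hist(log(ames_train$Sale_Price), main = "log(Sale_Price)", xlab = "log(Sale_Price)")
par(mfrow = c(1, 1))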

2.2 Feature engineering

(1) Numeric features

Normalization: analogous to the log transform on the target in 2.1 above, but applied to the numeric predictors to reduce skewness.
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_YeoJohnson(all_numeric())
Standardization: bring all numeric variables to a common scale (zero mean and unit variance).

ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>% # centering: a mean of zero
  step_scale(all_numeric(), -all_outcomes())      # scaling: a standard deviation of one
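
As a sanity check (a minimal sketch; Gr_Liv_Area is just one example column), prep() and bake() the recipe and confirm that a standardized predictor now has mean ≈ 0 and sd ≈ 1:
standardized <- ames_recipe %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_train)
round(c(mean = mean(standardized$Gr_Liv_Area),
        sd   = sd(standardized$Gr_Liv_Area)), 3)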

(2) Categorical features

A: Lump infrequent levels into a single category
count(ames_train, Neighborhood) %>% arrange(n)
## # A tibble: 28 x 2
##    Neighborhood                                n
##    <fct>                                   <int>
##  1 Landmark                                    1
##  2 Green_Hills                                 2
##  3 Greens                                      7
##  4 Blueste                                     9
##  5 Northpark_Villa                            17
##  6 Briardale                                  18
##  7 Veenker                                    20
##  8 Bloomington_Heights                        21
##  9 South_and_West_of_Iowa_State_University    30
## 10 Meadow_Village                             30
## # … with 18 more rows

# Lump all levels whose frequency is below 0.01 into an "other" category
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(Neighborhood, threshold = 0.01, 
             other = "other") 
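
To see the effect (a minimal sketch), prep() and bake() this recipe and recount Neighborhood; the rare levels listed above should now be collapsed into "other":
lumped <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_train)
count(lumped, Neighborhood) %>% arrange(n)
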
B: Convert categorical variables to numeric columns; the two common schemes are one-hot encoding and dummy encoding (with step_dummy(), one_hot = FALSE gives dummy encoding, which drops one reference level per factor).
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), one_hot = TRUE)
# Nominal variables include both character and factor.
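
The practical difference between the two encodings is the number of columns created per factor; a minimal sketch comparing them (the object names oh and dm are just for illustration):
oh <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%   # one column per level
  prep(training = ames_train) %>%
  bake(new_data = ames_train)
dm <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_dummy(all_nominal(), one_hot = FALSE) %>%  # drops one reference level per factor
  prep(training = ames_train) %>%
  bake(new_data = ames_train)
c(one_hot = ncol(oh), dummy = ncol(dm))
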
C: Label encoding
count(ames_train, MS_SubClass) %>% head()
# # A tibble: 6 x 2
# MS_SubClass                                n
#   <fct>                                  <int>
# 1 One_Story_1946_and_Newer_All_Styles      756
# 2 One_Story_1945_and_Older                  94
# 3 One_Story_with_Finished_Attic_All_Ages     4
# 4 One_and_Half_Story_Unfinished_All_Ages    13
# 5 One_and_Half_Story_Finished_All_Ages     203
# 6 Two_Story_1946_and_Newer                 404

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(MS_SubClass)
# After the conversion, the 16 ordered levels of MS_SubClass become the integers 1-16, still in a single column
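
A minimal sketch to inspect the converted column after prep() and bake():
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_integer(MS_SubClass) %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_train) %>%
  count(MS_SubClass) %>%
  head()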

Part 3: Dimension reduction

PCA expects standardized numeric inputs, hence the center/scale steps before step_pca().

recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = .95)
  # threshold: the fraction of the total variance that should be covered by the retained components
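
To see how many principal components were kept at threshold = .95 (a minimal sketch):
pca_baked <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  step_pca(all_numeric(), threshold = .95) %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_train)
grep("^PC", names(pca_baked), value = TRUE)  # names of the retained components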

II. Putting the methods together

1. Recommended processing order

A commonly recommended order, which the blueprint below roughly follows:
(1) filter out zero- or near-zero-variance features;
(2) perform imputation if needed;
(3) normalize to resolve skewness in numeric features;
(4) standardize (center and scale) numeric features;
(5) apply dimension reduction (e.g. PCA) to numeric features;
(6) one-hot or dummy encode the remaining categorical features.

2. A complete processing example

# Step 1: choose a suitable combination of processing steps
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal())  %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_pca(all_numeric(), -all_outcomes())
  
blueprint

# Step 2: train (prep) the blueprint on the training data
prepare <- prep(blueprint, training = ames_train)
prepare

# Step 3: apply (bake) the blueprint to new data (the training data or future test data)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
baked_train
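
A quick check of the resulting dimensions (a minimal sketch; the exact column count depends on the steps chosen above):
dim(baked_train)
dim(baked_test)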

3. Using the blueprint with caret for model training


blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(all_nominal()) %>%
  step_integer(matches("Qual|Cond|QC|Qu")) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)


library(caret)
# Specify resampling plan
cv <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 5
)

# Construct grid of hyperparameter values
hyper_grid <- expand.grid(k = seq(2, 25, by = 1))

# Tune a knn model using grid search
knn_fit2 <- train(
  blueprint, 
  data = ames_train, 
  method = "knn", 
  trControl = cv, 
  tuneGrid = hyper_grid,
  metric = "RMSE"
)

knn_fit2
# plot cross validation results
ggplot(knn_fit2)
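
After training, the best k chosen by cross-validation can be read from the fitted object (a minimal sketch):
# hyperparameter value with the lowest cross-validated RMSE
knn_fit2$bestTune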

It suddenly strikes me that many of the preprocessing steps I learned earlier for single-cell sequencing analysis workflows are quite similar to the points covered here.
