45 - Machine Learning with R: Neural Networks and Deep Learning
Study notes for Mastering Machine Learning with R, Second Edition
1. Introduction to Neural Networks
"Neural network" is a broad term covering many related methods. We focus on feed-forward networks trained with backpropagation.
The strength of neural network models is that they can model highly complex relationships between the input variables (features) and the response, especially when those relationships are strongly nonlinear. Building and evaluating them requires no underlying assumptions, and they work with both quantitative and qualitative responses.
The result of a neural network, however, is a black box: there is no equation with coefficients to inspect and share with business partners, so the results are nearly impossible to interpret. Another criticism is that it is unclear how the results change when the initial random inputs change. Finally, training a neural network is expensive in both time and computation.
Commonly used activation functions: sigmoid, rectifier (ReLU), maxout, and the hyperbolic tangent (tanh).
Plotting the sigmoid function in R:
> library(pacman)
> p_load(ggplot2, dplyr, hrbrthemes)
> sigmoid <- function(x) {
+ 1/(1 + exp(-x))
+ }
>
> x <- seq(-5, 5, 0.1)
> df <- tibble(sigmoid.x = sigmoid(x), index = 1:length(x))
> ggplot(df, aes(index, sigmoid.x)) + geom_point() + theme_ft_rc()
Figure: the sigmoid function
The tanh() function (hyperbolic tangent) is a variant of the sigmoid whose output ranges from -1 to 1.
> tibble(x = x, sigmoid.x = sigmoid(x), tanh.x = tanh(x)) %>% ggplot(aes(x)) +
+ geom_line(aes(y = sigmoid.x, color = "sigmoid"), size = 1) +
+ geom_line(aes(y = tanh.x, color = "tanh"), size = 1) +
+ theme_ft_rc() + theme(legend.position = "top", legend.title = element_blank())
Figure: the sigmoid and tanh functions
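The rectifier mentioned above is not plotted in these notes, but it is easy to add to the same picture. A minimal sketch; relu is a helper defined just for this example, not a base R function:
> relu <- function(x) {
+ pmax(0, x)  # ReLU returns max(0, x) elementwise
+ }
> tibble(x = x, relu.x = relu(x)) %>% ggplot(aes(x, relu.x)) +
+ geom_line(size = 1) + theme_ft_rc()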
2. A Brief Introduction to Deep Learning
Deep learning is a branch of machine learning built on neural networks; its distinguishing feature is the use of machine learning techniques (usually unsupervised) to construct new features on top of the input variables.
3. Understanding and Preparing the Data
> library(pacman)
> p_load(MASS)
>
> data("shuttle")
> str(shuttle)
## 'data.frame': 256 obs. of 7 variables:
## $ stability: Factor w/ 2 levels "stab","xstab": 2 2 2 2 2 2 2 2 2 2 ...
## $ error : Factor w/ 4 levels "LX","MM","SS",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sign : Factor w/ 2 levels "nn","pp": 2 2 2 2 2 2 1 1 1 1 ...
## $ wind : Factor w/ 2 levels "head","tail": 1 1 1 2 2 2 1 1 1 2 ...
## $ magn : Factor w/ 4 levels "Light","Medium",..: 1 2 4 1 2 4 1 2 4 1 ...
## $ vis : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ use : Factor w/ 2 levels "auto","noauto": 1 1 1 1 1 1 1 1 1 1 ...
The data set has 256 observations and 7 variables. All of them are categorical, and the response use has two levels, auto and noauto:
stability: stable positioning or not (stab / xstab)
error: size of the error (LX / MM / SS / XL)
sign: sign of the error, positive or negative (pp / nn)
wind: wind direction (head / tail)
magn: wind strength (Light / Medium / Strong / Out of Range)
vis: visibility (yes / no)
> table(shuttle$use)
##
## auto noauto
## 145 111
Automatic landing was chosen in 57% of the cases (145 of 256). The table() function works perfectly for comparing two variables, but with more variables the structable() function from the vcd package is a better choice:
> p_load(vcd)
> tab1 <- structable(wind + magn ~ use, shuttle)
> print(tab1)
## wind head tail
## magn Light Medium Out Strong Light Medium Out Strong
## use
## auto 19 19 16 18 19 19 16 19
## noauto 13 13 16 14 13 13 16 13
The table shows that with a head wind of Light strength, automatic landing (auto) occurred 19 times and non-automatic landing (noauto) 13 times.
The mosaic() function draws the table produced by structable() as a plot and also reports the p-value of a chi-squared test:
> mosaic(tab1, shade = T)
Figure: mosaic plot
The tiles represent the proportions of the corresponding table cells. The p-value is not significant, suggesting these features are unrelated to the response; in other words, wind strength (magn) does not help predict whether automatic landing is used.
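The p-value behind the shading comes from a chi-squared test of independence, which can also be run directly. A minimal sketch using base R:
> # a large p-value = no evidence that magn and use are associated
> chisq.test(table(shuttle$magn, shuttle$use))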
> mosaic(use ~ error + vis, shuttle)
Figure: mosaic plot of use by error and vis
Data preparation matters a great deal for neural networks, because every covariate and the response must be numeric. Here all variables are categorical, so we use the caret package to quickly create dummy variables as input features:
> p_load(caret)
> dummies <- dummyVars(use ~ ., shuttle, fullRank = T)
> dummies
## Dummy Variable Object
##
## Formula: use ~ .
## <environment: 0x000002699e075128>
## 7 variables, 7 factors
## Variables and levels will be separated by '.'
## A full rank encoding is used
Convert to a data frame:
> shuttle.2 <- as.data.frame(predict(dummies, newdata = shuttle))
> names(shuttle.2)
## [1] "stability.xstab" "error.MM" "error.SS" "error.XL"
## [5] "sign.pp" "wind.tail" "magn.Medium" "magn.Out"
## [9] "magn.Strong" "vis.yes"
> str(shuttle.2)
## 'data.frame': 256 obs. of 10 variables:
## $ stability.xstab: num 1 1 1 1 1 1 1 1 1 1 ...
## $ error.MM : num 0 0 0 0 0 0 0 0 0 0 ...
## $ error.SS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ error.XL : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sign.pp : num 1 1 1 1 1 1 0 0 0 0 ...
## $ wind.tail : num 0 0 0 1 1 1 0 0 0 1 ...
## $ magn.Medium : num 0 1 0 0 1 0 0 1 0 0 ...
## $ magn.Out : num 0 0 0 0 0 0 0 0 0 0 ...
## $ magn.Strong : num 0 0 1 0 0 1 0 0 1 0 ...
## $ vis.yes : num 0 0 0 0 0 0 0 0 0 0 ...
We now have an input feature space of 10 variables. For stability, 0 means stab and 1 means xstab. The baseline for error is LX, and three variables encode the remaining categories.
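To confirm the full-rank encoding, the rows where error is LX should be exactly the rows where all three error dummies are zero. A quick check:
> # each LX row sums to 0 across the error dummies; every other level sums to 1
> table(shuttle$error, rowSums(shuttle.2[, c("error.MM", "error.SS", "error.XL")]))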
The response variable can be built with ifelse():
> shuttle.2$use <- ifelse(shuttle$use == "auto", 1, 0)
> table(shuttle.2$use)
##
## 0 1
## 111 145
Split into training and test sets:
> set.seed(123)
> train.index <- createDataPartition(shuttle.2$use, p = 0.7, list = F)
> str(train.index)
## int [1:180, 1] 1 4 5 6 7 8 9 11 13 14 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
> shuttle.train <- shuttle.2[train.index, ]
> shuttle.test <- shuttle.2[-train.index, ]
>
> dim(shuttle.train)
## [1] 180 11
> dim(shuttle.test)
## [1] 76 11
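It is worth checking that the split roughly preserved the class balance; a quick sketch:
> # proportions of noauto (0) and auto (1) in each split
> prop.table(table(shuttle.train$use))
> prop.table(table(shuttle.test$use))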
4. Building and Evaluating the Model
Previously we used y ~ . to specify every variable in the data set other than the response as an input, but neuralnet does not allow that notation. The way around this restriction is as.formula(): first create an object holding the variable names, then use it to paste the names into the right-hand side of the formula.
> p_load(neuralnet)
>
> n <- names(shuttle.train)
> form <- as.formula(paste("use ~", paste(n[!n %in% "use"], collapse = "+")))
> print(form)
## use ~ stability.xstab + error.MM + error.SS + error.XL + sign.pp +
## wind.tail + magn.Medium + magn.Out + magn.Strong + vis.yes
## <environment: 0x000002699e075128>
Build the model:
> fit <- neuralnet(form, data = shuttle.train, err.fct = "ce", linear.output = F)
Parameter notes (a sketch with non-default settings follows this list):
hidden: the number of hidden neurons in each layer; up to three hidden layers can be specified, and the default is 1
act.fct: the activation function, logistic by default; tanh is also available
err.fct: the error function, "sse" by default; since our response is binary we set it to "ce" for cross-entropy
linear.output: a logical controlling whether act.fct is skipped at the output layer; the default is TRUE, but for our data it must be FALSE
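For illustration only, a hypothetical call that overrides the defaults, here with two hidden layers of three and two neurons (fit2 is not used below):
> # sketch: deeper architecture, same cross-entropy error and sigmoid output
> fit2 <- neuralnet(form, data = shuttle.train, hidden = c(3, 2),
+ err.fct = "ce", linear.output = FALSE)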
> fit$result.matrix
## [,1]
## error 0.013651024
## reached.threshold 0.009868817
## steps 670.000000000
## Intercept.to.1layhid1 5.136942014
## stability.xstab.to.1layhid1 -2.485264957
## error.MM.to.1layhid1 1.032588807
## error.SS.to.1layhid1 2.543705586
## error.XL.to.1layhid1 0.030906433
## sign.pp.to.1layhid1 0.840732458
## wind.tail.to.1layhid1 0.721638821
## magn.Medium.to.1layhid1 0.034567106
## magn.Out.to.1layhid1 -2.436662220
## magn.Strong.to.1layhid1 -0.099174792
## vis.yes.to.1layhid1 -7.556133035
## Intercept.to.use -28.580429411
## 1layhid1.to.use 66.014874838
The error is 0.013651024. The steps value is the number of training iterations the algorithm needed to converge, i.e., for the absolute partial derivatives of the error function to fall below the threshold (0.01 by default). Among the covariates, the largest positive weight into the hidden neuron belongs to error.SS.to.1layhid1, at 2.543705586.
Examine the generalized weights (the contribution of the i-th covariate to the log-odds):
> head(fit$generalized.weights[[1]])
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1 -4.701598 1.9534405 4.812155 0.05846846 1.5904887 1.3651886 0.06539368
## 4 -2.355740 0.9787731 2.411135 0.02929567 0.7969158 0.6840290 0.03276556
## 5 -2.277955 0.9464547 2.331521 0.02832835 0.7706022 0.6614428 0.03168367
## 6 -2.593462 1.0775429 2.654447 0.03225195 0.8773340 0.7530556 0.03607199
## 7 -10.097313 4.1952759 10.334750 0.12556887 3.4157882 2.9319260 0.14044172
## 8 -9.798060 4.0709409 10.028460 0.12184740 3.3145548 2.8450328 0.13627946
## [,8] [,9] [,10]
## 1 -4.609652 -0.18761781 -14.294612
## 4 -2.309670 -0.09400607 -7.162328
## 5 -2.233406 -0.09090206 -6.925833
## 6 -2.542743 -0.10349240 -7.885092
## 7 -9.899846 -0.40293446 -30.699599
## 8 -9.606445 -0.39099273 -29.789758
Visualize the network:
> plot(fit)
Figure: network weight plot
This plot shows the intercepts and the weight for each variable.
Plot the generalized weights:
> par(mfrow = c(1, 2))
> # kept failing with: Error in plot.window(...): 'ylim' values cannot be infinite
> # gwplot(fit, selected.covariate = "vis.yes")
> # gwplot(fit, selected.covariate = "wind.tail")
> # work around it a different way
> fit.covariate <- fit$covariate %>% as.data.frame()
> plot(fit.covariate$vis.yes, main = "vis.yes", xlab = "", ylab = "")
> plot(fit.covariate$wind.tail, main = "wind.tail", xlab = "", ylab = "")
Figure: generalized weight plots
The connection weight for wind.tail sits low overall. The generalized weights for vis.yes are highly skewed, while those for wind.tail are distributed very evenly, suggesting that this variable has essentially no predictive power.
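Note that the workaround above plots the covariate values rather than the generalized weights themselves. A sketch closer to what gwplot() shows, assuming the columns of fit$generalized.weights[[1]] follow the covariate order (wind.tail is column 6 and vis.yes column 10), plotted here against observation index:
> gw <- as.data.frame(fit$generalized.weights[[1]])
> par(mfrow = c(1, 2))
> plot(gw[[10]], main = "vis.yes", xlab = "", ylab = "generalized weight")
> plot(gw[[6]], main = "wind.tail", xlab = "", ylab = "generalized weight")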
First, see how the model does on the training set:
> results.train <- compute(fit, shuttle.train[, 1:10])
> pred.train <- results.train$net.result
> print(pred.train)
## [,1]
## 1 1.000000e+00
## 4 1.000000e+00
## 5 1.000000e+00
## 6 1.000000e+00
## ---------------------
## 251 6.119581e-04
## 252 1.484097e-10
## 253 2.765947e-08
## 255 3.022915e-07
## 256 7.566422e-08
> pred.train <- ifelse(pred.train < 0.5, 0, 1)
> table(pred.train, shuttle.train$use)
##
## pred.train 0 1
## 0 73 0
## 1 0 107
The neural network is 100% accurate on the training data! Now for the test set:
> result.test <- compute(fit, shuttle.test[, 1:10])
> pred.test <- result.test$net.result
> pred.test <- ifelse(pred.test < 0.5, 0, 1)
> table(pred.test, shuttle.test$use)
##
## pred.test 0 1
## 0 38 2
## 1 0 36
There are two errors on the test set; find out which ones:
> which(pred.test == 0 & shuttle.test$use == 1)
## [1] 58 59
Rows 58 and 59 of the test set were misclassified.
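Those observations can be pulled out directly for inspection:
> # the two misclassified test observations
> shuttle.test[c(58, 59), ]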
5. A Deep Learning Example
5.1 Installing H2O
1. If the h2o package was installed previously, remove it first:
> if ("package:h2o" %in% search()) {
+ detach("package:h2o", unload = TRUE)
+ }
## [1] "A shutdown has been triggered. "
> if ("package:h2o" %in% rownames(installed.packages())) {
+ remove.packages("h2o")
+ }
2. Download and install the packages that h2o depends on:
> library(pacman)
> p_load(methods, statmod, stats, graphics, RCurl, jsonlite, tools, utils)
3. Install and load the h2o package:
> p_load(h2o)
5.2 Uploading Data to H2O
> path <- "./data_set/data-master/bank_DL.csv"
Connect to H2O and start an instance on the cluster:
> # nthreads = -1 lets the instance use every CPU on the cluster
> local.h2o <- h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 5 hours 57 minutes
## H2O cluster timezone: Asia/Shanghai
## H2O data parsing timezone: UTC
## H2O cluster version: 3.28.0.4
## H2O cluster version age: 15 days
## H2O cluster name: H2O_started_from_R_Admin_wzk082
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.57 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.2 (2019-12-12)
The service is running; it can also be seen from a browser:
Figure: the H2O web interface
Upload the data file to H2O:
> # alternatives: h2o.importFolder, h2o.importURL, h2o.importHDFS
> bank <- h2o.uploadFile(path = path)
> class(bank)
## [1] "H2OFrame"
> # many R functions produce different output on H2O objects than their usual counterparts
> str(bank)
## Class 'H2OFrame' <environment: 0x00000269a00bfc50>
## - attr(*, "op")= chr "Parse"
## - attr(*, "id")= chr "bank_DL_sid_8edb_2"
## - attr(*, "eval")= logi FALSE
## - attr(*, "nrow")= int 4521
## - attr(*, "ncol")= int 64
## - attr(*, "types")=List of 64
## ..$ : chr "real"
## --------------------------------------------------------
## ..$ previous_2 : num 0 0 0 0 0 0 1 0 0 1
## ..$ previous_3 : num 0 0 0 0 0 1 0 0 0 0
## ..$ previous_4 : num 0 1 0 0 0 0 0 0 0 0
## ..$ previous_5 : num 0 0 0 0 0 0 0 0 0 0
## ..$ poutcome_failure : num 0 1 1 0 0 1 0 0 0 1
## ..$ poutcome_other : num 0 0 0 0 0 0 1 0 0 0
## ..$ poutcome_success : num 0 0 0 0 0 0 0 0 0 0
## ..$ poutcome_unknown : num 1 0 0 1 1 0 0 1 1 0
## ..$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
Check the distribution of the response variable:
> h2o.table(bank$y)
## y Count
## 1 no 4000
## 2 yes 521
521 customers responded "yes" to the bank's marketing campaign and the other 4,000 responded "no", so the response variable is somewhat imbalanced (about 11.5% yes).
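To quantify the imbalance, the H2O table can be pulled back into R; a small sketch using base R:
> y.tab <- as.data.frame(h2o.table(bank$y))
> y.tab$prop <- y.tab$Count / sum(y.tab$Count)  # class proportions
> y.tab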
5.3 Splitting into Training and Test Sets
> # create a uniform random vector for the split
> rand <- h2o.runif(bank, seed = 123)
>
> train <- bank[rand <= 0.7, ] %>% h2o.assign(key = "train")
> test <- bank[rand > 0.7, ] %>% h2o.assign(key = "test")
> # check whether the split is balanced
> h2o.table(train[, 64])
## y Count
## 1 no 2783
## 2 yes 396
> h2o.table(test[, 64])
## y Count
## 1 no 1217
## 2 yes 125
5.4 Building the Model
We tune the hyperparameters by random search, which saves time compared with a full grid search. The hyperparameters to explore: the tanh activation with and without dropout, three different hidden-layer configurations (neuron combinations), two dropout ratios, and two learning rates.
> # list of hyperparameters for the random search
> hyper.params <- list(activation = c("Tanh", "TanhWithDropout"),
+ hidden = list(c(20, 20), c(30, 30), c(30, 30, 30)),
+ input_dropout_ratio = c(0, 0.05), rate = c(0.01, 0.25))
> # list of search criteria; strategy = "RandomDiscrete" gives random search,
> # while "Cartesian" would run a full grid search
> search.criteria <- list(
+ strategy = "RandomDiscrete", max_runtime_secs = 420,
+ max_models = 100, seed = 123, stopping_rounds = 5,
+ # stop once the best models improve by less than 1%
+ stopping_tolerance = 0.01
+ )
> random.search <- h2o.grid(
+ # the deep learning algorithm
+ algorithm = "deeplearning",
+ grid_id = "random.search",
+ # training data
+ training_frame = train,
+ # validation data
+ validation_frame = test,
+ # input features
+ x = 1:63,
+ # response variable
+ y = 64,
+ epochs = 1,
+ stopping_metric = "misclassification",
+ hyper_params = hyper.params,
+ search_criteria = search.criteria
+ )
Check the results for the five best models:
> grid <- h2o.getGrid("random.search", sort_by = "auc", decreasing = T)
> grid
## H2O Grid Details
## ================
##
## Grid ID: random.search
## Used hyper parameters:
## - activation
## - hidden
## - input_dropout_ratio
## - rate
## Number of models: 24
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by decreasing auc
## activation hidden input_dropout_ratio rate model_ids
## 1 TanhWithDropout [30, 30] 0.0 0.01 random.search_model_17
## 2 TanhWithDropout [30, 30] 0.05 0.01 random.search_model_16
## 3 TanhWithDropout [30, 30] 0.0 0.25 random.search_model_6
## 4 TanhWithDropout [30, 30, 30] 0.05 0.01 random.search_model_23
## 5 Tanh [30, 30, 30] 0.05 0.01 random.search_model_14
## auc
## 1 0.8635497124075596
## 2 0.8588824979457683
## 3 0.8580049301561217
## 4 0.85473459326212
## 5 0.8506162695152013
##
## ---
## activation hidden input_dropout_ratio rate
## 19 Tanh [20, 20] 0.05 0.25
## 20 TanhWithDropout [20, 20] 0.0 0.25
## 21 Tanh [30, 30, 30] 0.05 0.25
## 22 Tanh [30, 30] 0.05 0.01
## 23 TanhWithDropout [30, 30, 30] 0.0 0.01
## 24 TanhWithDropout [30, 30, 30] 0.05 0.25
## model_ids auc
## 19 random.search_model_18 0.8081840591618734
## 20 random.search_model_8 0.802872637633525
## 21 random.search_model_3 0.798695152013147
## 22 random.search_model_2 0.7974330320460148
## 23 random.search_model_22 0.7543467543138865
## 24 random.search_model_11 0.716936729663106
So model 17 wins: the tanh activation with dropout, two hidden layers of 30 neurons each, an input dropout ratio of 0.0, and a learning rate of 0.01, giving an AUC of roughly 0.864.
Check its performance on the test set through the confusion matrix:
> best.model <- h2o.getModel(grid@model_ids[[1]])
> h2o.confusionMatrix(best.model, valid = T)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.123302996428301:
## no yes Error Rate
## no 1133 84 0.069022 =84/1217
## yes 60 65 0.480000 =60/125
## Totals 1193 149 0.107303 =144/1342
Although the overall error rate is under 11%, there are far too many errors on the yes label: the model misses 48% of the actual yes cases. This suggests the class imbalance may be a problem.
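The matrix above is computed at the F1-optimal threshold (about 0.123); re-cutting at a different threshold shifts the trade-off between the two error types. A sketch, assuming the thresholds argument of h2o.confusionMatrix() is available in this h2o version:
> m <- h2o.performance(best.model, valid = TRUE)
> # confusion matrix at a plain 0.5 cutoff instead of the F1-optimal one
> h2o.confusionMatrix(m, thresholds = 0.5)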
5.5 Building a Model with Cross-Validation
> dlmodel <- h2o.deeplearning(x = 1:63, y = 64, training_frame = train,
+ hidden = c(30, 30),
+ epochs = 3, nfolds = 5, fold_assignment = "Stratified", balance_classes = T,
+ activation = "TanhWithDropout", seed = 123, adaptive_rate = F, input_dropout_ratio = 0,
+ stopping_metric = "misclassification", variable_importances = T)
> dlmodel
## Model Details:
## ==============
##
## H2OBinomialModel: deeplearning
## Model ID: DeepLearning_model_R_1583821038674_163
## Status of Neuron Layers: predicting y, 2-class classification, bernoulli distribution, CrossEntropy loss,
## 2,912 weights/biases, 22.9 KB, 18,957 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 63 Input 0.00 % NA NA NA NA
## 2 2 30 TanhDropout 50.00 % 0.000000 0.000000 0.004907 0.000000
## 3 3 30 TanhDropout 50.00 % 0.000000 0.000000 0.004907 0.000000
## 4 4 2 Softmax NA 0.000000 0.000000 0.004907 0.000000
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 0.025206 0.814655 0.147166 0.735583
## 3 0.000000 -0.005851 0.337015 -0.084604 0.411204
## 4 0.000000 0.053744 0.769235 -0.004516 0.303754
##
##
## H2OBinomialMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on full training frame **
##
## MSE: 0.1826312
## RMSE: 0.4273537
## LogLoss: 0.5439753
## Mean Per-Class Error: 0.1502497
## AUC: 0.9159916
## AUCPR: 0.8807761
## Gini: 0.8319832
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## no yes Error Rate
## no 2108 675 0.242544 =675/2783
## yes 161 2617 0.057955 =161/2778
## Totals 2269 3292 0.150333 =836/5561
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.105880 0.862273 298
## 2 max f2 0.034591 0.918540 352
## 3 max f0point5 0.300881 0.845340 190
## 4 max accuracy 0.125034 0.850207 286
## 5 max precision 0.595273 0.942308 4
## 6 max recall 0.004675 1.000000 393
## 7 max specificity 0.600539 0.999641 0
## 8 max absolute_mcc 0.105880 0.711645 298
## 9 max min_per_class_accuracy 0.245084 0.838733 218
## 10 max mean_per_class_accuracy 0.125034 0.850278 286
## 11 max tns 0.600539 2782.000000 0
## 12 max fns 0.600539 2764.000000 0
## 13 max fps 0.003311 2783.000000 399
## 14 max tps 0.004675 2778.000000 393
## 15 max tnr 0.600539 0.999641 0
## 16 max fnr 0.600539 0.994960 0
## 17 max fpr 0.003311 1.000000 399
## 18 max tpr 0.004675 1.000000 393
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.08705731
## RMSE: 0.2950548
## LogLoss: 0.2875898
## Mean Per-Class Error: 0.2193127
## AUC: 0.872492
## AUCPR: 0.4780927
## Gini: 0.744984
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## no yes Error Rate
## no 2497 286 0.102767 =286/2783
## yes 133 263 0.335859 =133/396
## Totals 2630 549 0.131802 =419/3179
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.424191 0.556614 172
## 2 max f2 0.122184 0.657076 295
## 3 max f0point5 0.424191 0.507330 172
## 4 max accuracy 0.717398 0.884869 63
## 5 max precision 0.957903 0.800000 3
## 6 max recall 0.000684 1.000000 398
## 7 max specificity 0.972282 0.999281 0
## 8 max absolute_mcc 0.424191 0.490448 172
## 9 max min_per_class_accuracy 0.188988 0.804887 266
## 10 max mean_per_class_accuracy 0.122184 0.809987 295
## 11 max tns 0.972282 2781.000000 0
## 12 max fns 0.972282 395.000000 0
## 13 max fps 0.000346 2783.000000 399
## 14 max tps 0.000684 396.000000 398
## 15 max tnr 0.972282 0.999281 0
## 16 max fnr 0.972282 0.997475 0
## 17 max fpr 0.000346 1.000000 399
## 18 max tpr 0.000684 1.000000 398
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.8826488 0.019553455 0.9073783 0.8575949 0.88621795 0.89269054
## auc 0.8777314 0.022875382 0.8861917 0.89000493 0.8754107 0.89764535
## aucpr 0.47863057 0.055181786 0.52034676 0.4284161 0.52908653 0.50504553
## err 0.11735118 0.019553455 0.09262166 0.14240506 0.11378205 0.10730949
## err_count 74.6 12.381437 59.0 90.0 71.0 69.0
## cv_5_valid
## accuracy 0.86936235
## auc 0.83940434
## aucpr 0.41025785
## err 0.13063763
## err_count 84.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## pr_auc 0.47863057 0.055181786 0.52034676 0.4284161 0.52908653
## precision 0.5355106 0.077891 0.5882353 0.42741936 0.5903614
## r2 0.2003508 0.08083034 0.18793476 0.22176513 0.086192444
## recall 0.6443926 0.075244226 0.5633803 0.7361111 0.5697674
## rmse 0.29464537 0.02022303 0.28359163 0.28028414 0.32952103
## specificity 0.9167673 0.031265505 0.95053005 0.8732143 0.936803
## cv_4_valid cv_5_valid
## pr_auc 0.50504553 0.41025785
## precision 0.5940594 0.4774775
## r2 0.31192234 0.19393937
## recall 0.6818182 0.6708861
## rmse 0.28509894 0.2947311
## specificity 0.9261261 0.8971631
Now the performance on the test set:
> perf <- h2o.performance(dlmodel, test)
> perf
## H2OBinomialMetrics: deeplearning
##
## MSE: 0.07173268
## RMSE: 0.2678296
## LogLoss: 0.2341893
## Mean Per-Class Error: 0.2004174
## AUC: 0.8735283
## AUCPR: 0.3951697
## Gini: 0.7470567
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## no yes Error Rate
## no 1031 186 0.152835 =186/1217
## yes 31 94 0.248000 =31/125
## Totals 1062 280 0.161699 =217/1342
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.257779 0.464198 175
## 2 max f2 0.183122 0.609168 210
## 3 max f0point5 0.575237 0.496183 25
## 4 max accuracy 0.576611 0.915052 23
## 5 max precision 0.604630 1.000000 0
## 6 max recall 0.003262 1.000000 399
## 7 max specificity 0.604630 1.000000 0
## 8 max absolute_mcc 0.257779 0.428554 175
## 9 max min_per_class_accuracy 0.183122 0.808000 210
## 10 max mean_per_class_accuracy 0.117068 0.816250 242
## 11 max tns 0.604630 1217.000000 0
## 12 max fns 0.604630 124.000000 0
## 13 max fps 0.003262 1217.000000 399
## 14 max tps 0.003262 125.000000 399
## 15 max tnr 0.604630 1.000000 0
## 16 max fnr 0.604630 0.992000 0
## 17 max fpr 0.003262 1.000000 399
## 18 max tpr 0.003262 1.000000 399
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
A side-by-side comparison:
Confusion matrix from cross-validation on the training data:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
no yes Error Rate
no 2497 286 0.102767 =286/2783
yes 133 263 0.335859 =133/396
Totals 2630 549 0.131802 =419/3179
Confusion matrix on the test set:
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
no yes Error Rate
no 1031 186 0.152835 =186/1217
yes 31 94 0.248000 =31/125
Totals 1062 280 0.161699 =217/1342
The overall error rate went up while the false negative rate came down, so more tuning work is needed.
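One inexpensive knob is the decision threshold: h2o.predict() returns class probabilities that can be re-cut in R. A sketch; the probability columns are named after the response levels, and 0.3 is an arbitrary cutoff:
> preds <- as.data.frame(h2o.predict(dlmodel, test))
> actual <- as.data.frame(test)$y
> # trade false positives for false negatives by lowering the cutoff
> table(pred = ifelse(preds$yes > 0.3, "yes", "no"), actual = actual)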
Finally, we can compute variable importance. The table lists variables in order of importance, but importance is subject to sampling variation: change the random seed and the ordering may well change. Here are the five most important and the six least important variables:
> dlmodel@model$variable_importances
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 duration 1.000000 1.000000 0.105319
## 2 poutcome_success 0.738116 0.738116 0.077738
## 3 month_oct 0.415810 0.415810 0.043793
## 4 month_mar 0.282554 0.282554 0.029758
## 5 poutcome_unknown 0.263573 0.263573 0.027759
##
## ---
## variable relative_importance scaled_importance percentage
## 58 contact_telephone 0.072925 0.072925 0.007680
## 59 job_unemployed 0.071896 0.071896 0.007572
## 60 campaign_3 0.070856 0.070856 0.007463
## 61 job_housemaid 0.066491 0.066491 0.007003
## 62 campaign_6 0.065220 0.065220 0.006869
## 63 campaign_10 0.065091 0.065091 0.006855
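For a graphical view, h2o ships a helper that draws the same table as a bar chart; a one-line sketch:
> # plot the ten most important variables
> h2o.varimp_plot(dlmodel, num_of_features = 10)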