80 - Predictive Analytics in R: Neural Networks and Ordinal Logistic Regression
2020-10-05 · wonphen
> library(pacman)
> p_load(dplyr, readr, caret, DataExplorer)
1. Reading in the data
> glass <- read_csv("data_set/glass.data", col_names = F)
>
> profile_missing(glass)
## # A tibble: 11 x 3
## feature num_missing pct_missing
## <fct> <int> <dbl>
## 1 X1 0 0
## 2 X2 0 0
## 3 X3 0 0
## 4 X4 0 0
## 5 X5 0 0
## 6 X6 0 0
## 7 X7 0 0
## 8 X8 0 0
## 9 X9 0 0
## 10 X10 0 0
## 11 X11 0 0
2. Data preprocessing
> names(glass) <- c("id", "ri", "na", "mg", "ai", "si", "k",
+ "ca", "ba", "fe", "type")
> glass <- select(glass, -c("id"))
>
> str(glass)
## tibble [214 × 10] (S3: tbl_df/tbl/data.frame)
## $ ri : num [1:214] 1.52 1.52 1.52 1.52 1.52 ...
## $ na : num [1:214] 13.6 13.9 13.5 13.2 13.3 ...
## $ mg : num [1:214] 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ ai : num [1:214] 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ si : num [1:214] 71.8 72.7 73 72.6 73.1 ...
## $ k : num [1:214] 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ ca : num [1:214] 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ ba : num [1:214] 0 0 0 0 0 0 0 0 0 0 ...
## $ fe : num [1:214] 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ type: num [1:214] 1 1 1 1 1 1 1 1 1 1 ...
> # Check the class distribution of the outcome
> table(glass$type)
##
## 1 2 3 5 6 7
## 70 76 17 13 9 29
> # Convert the outcome variable to a factor
> glass$type <- factor(glass$type)
>
> set.seed(123)
> ind <- createDataPartition(glass$type, p = 0.8, list = F)
> dtrain <- glass[ind, ]
> dtest <- glass[-ind, ]
>
> dim(dtrain)
## [1] 174 10
> dim(dtest)
## [1] 40 10
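Since createDataPartition draws a stratified sample, the class proportions should be roughly preserved in both subsets. A quick sanity check (a sketch using the objects created above):

```r
# Compare class proportions in the training and test sets against the
# full data; stratified sampling should keep them close
round(prop.table(table(dtrain$type)), 3)
round(prop.table(table(dtest$type)), 3)
```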
3. Neural network model
> set.seed(123)
> fit.nnet <- train(type ~ ., data = dtrain, method = "nnet", trace = F,
+ preProc = c("center", "scale"))
>
> # Training-set accuracy
> confusionMatrix(predict(fit.nnet, newdata = dtrain, type = "raw"), dtrain$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 52 11 10 0 0 0
## 2 4 50 4 1 0 2
## 3 0 0 0 0 0 0
## 5 0 0 0 10 0 1
## 6 0 0 0 0 7 0
## 7 0 0 0 0 1 21
##
## Overall Statistics
##
## Accuracy : 0.8046
## 95% CI : (0.7378, 0.8607)
## No Information Rate : 0.3506
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7281
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.9286 0.8197 0.00000 0.90909 0.87500 0.8750
## Specificity 0.8220 0.9027 1.00000 0.99387 1.00000 0.9933
## Pos Pred Value 0.7123 0.8197 NaN 0.90909 1.00000 0.9545
## Neg Pred Value 0.9604 0.9027 0.91954 0.99387 0.99401 0.9803
## Prevalence 0.3218 0.3506 0.08046 0.06322 0.04598 0.1379
## Detection Rate 0.2989 0.2874 0.00000 0.05747 0.04023 0.1207
## Detection Prevalence 0.4195 0.3506 0.00000 0.06322 0.04023 0.1264
## Balanced Accuracy 0.8753 0.8612 0.50000 0.95148 0.93750 0.9342
The model's training-set accuracy is 80.46%. The per-class statistics show that it does not separate classes 1 and 2 cleanly, has no discriminating power at all for class 3 (sensitivity 0), and does reasonably well on classes 5, 6, and 7.
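By default, train tunes nnet over a small built-in grid with bootstrap resampling. One possible refinement is to tune size and decay explicitly under repeated cross-validation (a sketch; the grid values and CV settings here are assumptions, not from the original analysis):

```r
# Assumed tuning setup: repeated 10-fold CV over a wider size/decay grid,
# which may help the network fit the rare classes (3, 5, 6) better
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid <- expand.grid(size = c(3, 5, 7), decay = c(0, 0.01, 0.1))
set.seed(123)
fit.nnet2 <- train(type ~ ., data = dtrain, method = "nnet", trace = FALSE,
                   preProc = c("center", "scale"),
                   trControl = ctrl, tuneGrid = grid)
```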
> # Test-set accuracy
> confusionMatrix(predict(fit.nnet, newdata = dtest, type = "raw"), dtest$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 11 7 1 0 0 0
## 2 3 7 2 1 0 1
## 3 0 0 0 0 0 0
## 5 0 1 0 0 0 0
## 6 0 0 0 0 1 0
## 7 0 0 0 1 0 4
##
## Overall Statistics
##
## Accuracy : 0.575
## 95% CI : (0.4089, 0.7296)
## No Information Rate : 0.375
## P-Value [Acc > NIR] : 0.008001
##
## Kappa : 0.3796
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.7857 0.4667 0.000 0.0000 1.000 0.8000
## Specificity 0.6923 0.7200 1.000 0.9737 1.000 0.9714
## Pos Pred Value 0.5789 0.5000 NaN 0.0000 1.000 0.8000
## Neg Pred Value 0.8571 0.6923 0.925 0.9487 1.000 0.9714
## Prevalence 0.3500 0.3750 0.075 0.0500 0.025 0.1250
## Detection Rate 0.2750 0.1750 0.000 0.0000 0.025 0.1000
## Detection Prevalence 0.4750 0.3500 0.000 0.0250 0.025 0.1250
## Balanced Accuracy 0.7390 0.5933 0.500 0.4868 1.000 0.8857
The test set shows the same pattern: classes 1 and 2 are confused with each other, and classes 3 and 5 are never predicted correctly. The test-set accuracy is only 57.5%, far below the training accuracy. With just 40 test observations, however, the variance of the test-set estimate is high, so the gap is not necessarily evidence of overfitting.
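The wide 95% CI in the output above reflects that small-sample variance: confusionMatrix derives it from an exact binomial test on the 23 correct predictions out of 40, which can be reproduced directly:

```r
# Exact (Clopper-Pearson) 95% CI for 23 correct out of 40 predictions;
# matches the (0.4089, 0.7296) interval reported by confusionMatrix()
binom.test(23, 40)$conf.int
```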
4. Ordinal logistic regression
Multinomial logistic regression assumes that the output classes (here, type) have no natural ordering. When the outcome variable is ordinal (an ordered factor, such as a wine quality grade), an ordinal logistic regression model should be trained instead.
We'll reuse the same data for this experiment (too lazy to switch datasets — though note that glass type is not genuinely ordinal).
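Under the hood, caret's method = "polr" wraps MASS::polr, with the link function as the single tuning parameter (polr's own method argument). A direct call would look like this (a sketch, assuming the loglog link):

```r
# Equivalent direct fit with MASS::polr; Hess = TRUE keeps the Hessian
# so summary() can report standard errors.
# Caution: loading MASS masks dplyr::select used earlier.
library(MASS)
fit.direct <- polr(type ~ ., data = dtrain, method = "loglog", Hess = TRUE)
summary(fit.direct)
```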
> fit.polr <- train(type ~ ., data = dtrain, method = "polr")
>
> fit.polr$bestTune
## method
## 4 loglog
The final model uses the loglog link, since it achieved the highest resampled accuracy; the cauchit and cloglog links returned NaN, meaning those fits failed.
> fit.polr$results[, 1:3]
## method Accuracy Kappa
## 1 cauchit NaN NaN
## 2 cloglog NaN NaN
## 3 logistic 0.5018365 0.3101245
## 4 loglog 0.5430999 0.3693555
## 5 probit 0.4939978 0.3006526
> summary(fit.polr$finalModel)
##
## Coefficients:
## Value Std. Error t value
## ri 481.7045 0.01191 40434.3736
## na 3.5906 0.16151 22.2308
## mg 1.9825 0.15275 12.9788
## ai 5.9433 0.38663 15.3720
## si 4.0588 0.04386 92.5436
## k 2.6755 0.18499 14.4627
## ca 2.1278 0.11959 17.7925
## ba 2.7789 0.28710 9.6791
## fe -0.7725 1.34948 -0.5724
##
## Intercepts:
## Value Std. Error t value
## 1|2 1108.3530 0.0068 163512.8967
## 2|3 1110.2082 0.2276 4877.0734
## 3|5 1111.0442 0.3053 3639.1542
## 5|6 1112.1334 0.4042 2751.6627
## 6|7 1113.1950 0.5130 2169.9858
##
## Residual Deviance: 378.2646
## AIC: 406.2646
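The enormous coefficient on ri and the huge intercepts are an artifact of variable scale: ri varies only around 1.52 while si is near 72, and unlike the nnet fit, this model was trained on raw predictors. Re-fitting with standardization (an assumption for illustration; not done in the original) would put the coefficients on a comparable footing:

```r
# Sketch: re-fit the ordinal model with centered and scaled predictors,
# mirroring the preprocessing used for the neural network
set.seed(123)
fit.polr.scaled <- train(type ~ ., data = dtrain, method = "polr",
                         preProc = c("center", "scale"))
summary(fit.polr.scaled$finalModel)
```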
> confusionMatrix(predict(fit.polr, newdata = dtrain, type = "raw"), dtrain$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 37 21 12 0 0 0
## 2 19 37 2 4 3 1
## 3 0 3 0 1 0 0
## 5 0 0 0 5 2 4
## 6 0 0 0 1 0 0
## 7 0 0 0 0 3 19
##
## Overall Statistics
##
## Accuracy : 0.5632
## 95% CI : (0.4861, 0.6381)
## No Information Rate : 0.3506
## P-Value [Acc > NIR] : 8.444e-09
##
## Kappa : 0.3883
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.6607 0.6066 0.00000 0.45455 0.000000 0.7917
## Specificity 0.7203 0.7434 0.97500 0.96319 0.993976 0.9800
## Pos Pred Value 0.5286 0.5606 0.00000 0.45455 0.000000 0.8636
## Neg Pred Value 0.8173 0.7778 0.91765 0.96319 0.953757 0.9671
## Prevalence 0.3218 0.3506 0.08046 0.06322 0.045977 0.1379
## Detection Rate 0.2126 0.2126 0.00000 0.02874 0.000000 0.1092
## Detection Prevalence 0.4023 0.3793 0.02299 0.06322 0.005747 0.1264
## Balanced Accuracy 0.6905 0.6750 0.48750 0.70887 0.496988 0.8858
> confusionMatrix(predict(fit.polr, newdata = dtest, type = "raw"), dtest$type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 10 9 2 0 0 0
## 2 4 6 1 0 0 1
## 3 0 0 0 0 0 0
## 5 0 0 0 1 1 0
## 6 0 0 0 0 0 0
## 7 0 0 0 1 0 4
##
## Overall Statistics
##
## Accuracy : 0.525
## 95% CI : (0.3613, 0.6849)
## No Information Rate : 0.375
## P-Value [Acc > NIR] : 0.03792
##
## Kappa : 0.3072
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.7143 0.4000 0.000 0.5000 0.000 0.8000
## Specificity 0.5769 0.7600 1.000 0.9737 1.000 0.9714
## Pos Pred Value 0.4762 0.5000 NaN 0.5000 NaN 0.8000
## Neg Pred Value 0.7895 0.6786 0.925 0.9737 0.975 0.9714
## Prevalence 0.3500 0.3750 0.075 0.0500 0.025 0.1250
## Detection Rate 0.2500 0.1500 0.000 0.0250 0.000 0.1000
## Detection Prevalence 0.5250 0.3000 0.000 0.0500 0.000 0.1250
## Balanced Accuracy 0.6456 0.5800 0.500 0.7368 0.500 0.8857
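To wrap up, the two models' held-out performance can be put side by side with caret::postResample (a sketch using the fitted objects above):

```r
# Test-set Accuracy and Kappa for both models in one table
rbind(nnet = postResample(predict(fit.nnet, dtest), dtest$type),
      polr = postResample(predict(fit.polr, dtest), dtest$type))
```

Consistent with the confusion matrices, the neural network (57.5% accuracy) edges out the ordinal model (52.5%) here, which is unsurprising given that glass type is not an ordinal outcome.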