
80 - Predictive Analytics in R: Neural Networks and Ordinal Logistic Regression

2020-10-05  wonphen
> library(pacman)
> p_load(dplyr, readr, caret, DataExplorer)

1. Load the data

> glass <- read_csv("data_set/glass.data", col_names = F)
> 
> profile_missing(glass)
## # A tibble: 11 x 3
##    feature num_missing pct_missing
##    <fct>         <int>       <dbl>
##  1 X1                0           0
##  2 X2                0           0
##  3 X3                0           0
##  4 X4                0           0
##  5 X5                0           0
##  6 X6                0           0
##  7 X7                0           0
##  8 X8                0           0
##  9 X9                0           0
## 10 X10               0           0
## 11 X11               0           0

2. Data preprocessing

> names(glass) <- c("id", "ri", "na", "mg", "ai", "si", "k",
+                   "ca", "ba", "fe", "type")
> glass <- select(glass, -c("id"))
> 
> str(glass)
## tibble [214 × 10] (S3: tbl_df/tbl/data.frame)
##  $ ri  : num [1:214] 1.52 1.52 1.52 1.52 1.52 ...
##  $ na  : num [1:214] 13.6 13.9 13.5 13.2 13.3 ...
##  $ mg  : num [1:214] 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ ai  : num [1:214] 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ si  : num [1:214] 71.8 72.7 73 72.6 73.1 ...
##  $ k   : num [1:214] 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ ca  : num [1:214] 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ ba  : num [1:214] 0 0 0 0 0 0 0 0 0 0 ...
##  $ fe  : num [1:214] 0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ type: num [1:214] 1 1 1 1 1 1 1 1 1 1 ...
> # Check the class distribution
> table(glass$type)
## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29
> # Convert the outcome variable to a factor
> glass$type <- factor(glass$type)
> 
> set.seed(123)
> ind <- createDataPartition(glass$type, p = 0.8, list = F)
> dtrain <- glass[ind, ]
> dtest <- glass[-ind, ]
> 
> dim(dtrain)
## [1] 174  10
> dim(dtest)
## [1] 40 10

3. Neural network model

> set.seed(123)
> fit.nnet <- train(type ~ ., data = dtrain, method = "nnet", trace = F,
+                   preProc = c("center", "scale"))
> 
> # Training-set accuracy
> confusionMatrix(predict(fit.nnet, newdata = dtrain, type = "raw"), dtrain$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 52 11 10  0  0  0
##          2  4 50  4  1  0  2
##          3  0  0  0  0  0  0
##          5  0  0  0 10  0  1
##          6  0  0  0  0  7  0
##          7  0  0  0  0  1 21
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8046          
##                  95% CI : (0.7378, 0.8607)
##     No Information Rate : 0.3506          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7281          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.9286   0.8197  0.00000  0.90909  0.87500   0.8750
## Specificity            0.8220   0.9027  1.00000  0.99387  1.00000   0.9933
## Pos Pred Value         0.7123   0.8197      NaN  0.90909  1.00000   0.9545
## Neg Pred Value         0.9604   0.9027  0.91954  0.99387  0.99401   0.9803
## Prevalence             0.3218   0.3506  0.08046  0.06322  0.04598   0.1379
## Detection Rate         0.2989   0.2874  0.00000  0.05747  0.04023   0.1207
## Detection Prevalence   0.4195   0.3506  0.00000  0.06322  0.04023   0.1264
## Balanced Accuracy      0.8753   0.8612  0.50000  0.95148  0.93750   0.9342

The model's accuracy on the training set is 80.46%. The confusion matrix also shows that it does not separate classes 1 and 2 well, has no ability at all to identify class 3, and does reasonably well on classes 5, 6, and 7.
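One likely contributor to the complete failure on class 3 is class imbalance: class 3 has only 17 samples, against 76 for class 2. A minimal sketch of rebalancing the training set with caret's `upSample()` before refitting (assuming the `dtrain` object created above; the refit's results will differ from the run shown here):

```r
library(caret)

# upSample() replicates minority-class rows until every class matches
# the largest one; yname sets the name of the outcome column.
set.seed(123)
up_train <- upSample(x = dtrain[, setdiff(names(dtrain), "type")],
                     y = dtrain$type, yname = "type")
table(up_train$type)  # all six classes now have equal counts

# Refit the neural network on the balanced data.
set.seed(123)
fit.nnet.up <- train(type ~ ., data = up_train, method = "nnet",
                     trace = FALSE, preProc = c("center", "scale"))
```

Upsampling only duplicates existing rows, so it can encourage overfitting; `downSample()` or case weights are alternatives.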

> # Test-set accuracy
> confusionMatrix(predict(fit.nnet, newdata = dtest, type = "raw"), dtest$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 11  7  1  0  0  0
##          2  3  7  2  1  0  1
##          3  0  0  0  0  0  0
##          5  0  1  0  0  0  0
##          6  0  0  0  0  1  0
##          7  0  0  0  1  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.575           
##                  95% CI : (0.4089, 0.7296)
##     No Information Rate : 0.375           
##     P-Value [Acc > NIR] : 0.008001        
##                                           
##                   Kappa : 0.3796          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.7857   0.4667    0.000   0.0000    1.000   0.8000
## Specificity            0.6923   0.7200    1.000   0.9737    1.000   0.9714
## Pos Pred Value         0.5789   0.5000      NaN   0.0000    1.000   0.8000
## Neg Pred Value         0.8571   0.6923    0.925   0.9487    1.000   0.9714
## Prevalence             0.3500   0.3750    0.075   0.0500    0.025   0.1250
## Detection Rate         0.2750   0.1750    0.000   0.0000    0.025   0.1000
## Detection Prevalence   0.4750   0.3500    0.000   0.0250    0.025   0.1250
## Balanced Accuracy      0.7390   0.5933    0.500   0.4868    1.000   0.8857

The test set shows the same pattern: the model separates classes 1 and 2 poorly and gets classes 3 and 5 completely wrong. Test accuracy is only 57.5%, much lower than on the training set; but given the small sample size, the variance of the test-set estimate is high, so this does not necessarily mean the model has overfit the data.
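With only 40 test observations, a single hold-out estimate is noisy. A sketch of obtaining a more stable accuracy estimate via repeated 10-fold cross-validation (assuming the `glass` data frame built above):

```r
library(caret)

# 10-fold CV repeated 5 times: each tuning setting is evaluated on
# 50 resamples and the accuracies are averaged.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(123)
fit.nnet.cv <- train(type ~ ., data = glass, method = "nnet",
                     trace = FALSE, preProc = c("center", "scale"),
                     trControl = ctrl)
fit.nnet.cv$results[, c("size", "decay", "Accuracy", "Kappa")]
```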

4. Ordinal logistic regression

Multinomial logistic regression assumes that the outcome classes (type in this example) have no natural ordering. But when the outcome variable is ordinal, also called an ordered factor, such as a wine quality grade, an ordinal logistic regression model should be trained instead.
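Under the hood, caret's `method = "polr"` wraps `MASS::polr()`, which expects the ranking to be encoded as an ordered factor. A minimal sketch with a hypothetical wine-quality score (the `quality`, `acidity`, `sugar`, and `wine` names are made up for illustration):

```r
library(MASS)

# An ordered factor records the ranking of its levels.
quality <- factor(c(5, 6, 5, 7, 8, 3), levels = 3:8, ordered = TRUE)
print(quality)
## [1] 5 6 5 7 8 3
## Levels: 3 < 4 < 5 < 6 < 7 < 8

# polr() then fits a cumulative-link model; the link is chosen with the
# method argument: "logistic", "probit", "loglog", "cloglog" or "cauchit".
# fit <- polr(quality ~ acidity + sugar, data = wine, method = "logistic")
```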

We run the experiment on the same data anyway (too lazy to switch datasets), even though the glass type codes have no real natural order, so this is purely for demonstration.

> fit.polr <- train(type ~ ., data = dtrain, method = "polr")
> 
> fit.polr$bestTune
##   method
## 4 loglog

The model settled on the loglog link because it gave the highest accuracy; note in the results that the cauchit and cloglog fits failed outright (NaN accuracy).

> fit.polr$results[, 1:3]
##     method  Accuracy     Kappa
## 1  cauchit       NaN       NaN
## 2  cloglog       NaN       NaN
## 3 logistic 0.5018365 0.3101245
## 4   loglog 0.5430999 0.3693555
## 5   probit 0.4939978 0.3006526
> summary(fit.polr$finalModel)
## 
## Coefficients:
##       Value Std. Error    t value
## ri 481.7045    0.01191 40434.3736
## na   3.5906    0.16151    22.2308
## mg   1.9825    0.15275    12.9788
## ai   5.9433    0.38663    15.3720
## si   4.0588    0.04386    92.5436
## k    2.6755    0.18499    14.4627
## ca   2.1278    0.11959    17.7925
## ba   2.7789    0.28710     9.6791
## fe  -0.7725    1.34948    -0.5724
## 
## Intercepts:
##     Value       Std. Error  t value    
## 1|2   1108.3530      0.0068 163512.8967
## 2|3   1110.2082      0.2276   4877.0734
## 3|5   1111.0442      0.3053   3639.1542
## 5|6   1112.1334      0.4042   2751.6627
## 6|7   1113.1950      0.5130   2169.9858
## 
## Residual Deviance: 378.2646 
## AIC: 406.2646
> confusionMatrix(predict(fit.polr, newdata = dtrain, type = "raw"), dtrain$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 37 21 12  0  0  0
##          2 19 37  2  4  3  1
##          3  0  3  0  1  0  0
##          5  0  0  0  5  2  4
##          6  0  0  0  1  0  0
##          7  0  0  0  0  3 19
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5632          
##                  95% CI : (0.4861, 0.6381)
##     No Information Rate : 0.3506          
##     P-Value [Acc > NIR] : 8.444e-09       
##                                           
##                   Kappa : 0.3883          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.6607   0.6066  0.00000  0.45455 0.000000   0.7917
## Specificity            0.7203   0.7434  0.97500  0.96319 0.993976   0.9800
## Pos Pred Value         0.5286   0.5606  0.00000  0.45455 0.000000   0.8636
## Neg Pred Value         0.8173   0.7778  0.91765  0.96319 0.953757   0.9671
## Prevalence             0.3218   0.3506  0.08046  0.06322 0.045977   0.1379
## Detection Rate         0.2126   0.2126  0.00000  0.02874 0.000000   0.1092
## Detection Prevalence   0.4023   0.3793  0.02299  0.06322 0.005747   0.1264
## Balanced Accuracy      0.6905   0.6750  0.48750  0.70887 0.496988   0.8858
> confusionMatrix(predict(fit.polr, newdata = dtest, type = "raw"), dtest$type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 10  9  2  0  0  0
##          2  4  6  1  0  0  1
##          3  0  0  0  0  0  0
##          5  0  0  0  1  1  0
##          6  0  0  0  0  0  0
##          7  0  0  0  1  0  4
## 
## Overall Statistics
##                                           
##                Accuracy : 0.525           
##                  95% CI : (0.3613, 0.6849)
##     No Information Rate : 0.375           
##     P-Value [Acc > NIR] : 0.03792         
##                                           
##                   Kappa : 0.3072          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.7143   0.4000    0.000   0.5000    0.000   0.8000
## Specificity            0.5769   0.7600    1.000   0.9737    1.000   0.9714
## Pos Pred Value         0.4762   0.5000      NaN   0.5000      NaN   0.8000
## Neg Pred Value         0.7895   0.6786    0.925   0.9737    0.975   0.9714
## Prevalence             0.3500   0.3750    0.075   0.0500    0.025   0.1250
## Detection Rate         0.2500   0.1500    0.000   0.0250    0.000   0.1000
## Detection Prevalence   0.5250   0.3000    0.000   0.0500    0.000   0.1250
## Balanced Accuracy      0.6456   0.5800    0.500   0.7368    0.500   0.8857
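Beyond hard class labels, the fitted ordinal model can also return per-class probabilities, obtained from successive differences of the cumulative link probabilities. A sketch (assuming the `fit.polr` and `dtest` objects above; `type = "prob"` is caret's interface for this):

```r
# Per-class probabilities for the first test observations; each row sums to 1.
head(predict(fit.polr, newdata = dtest, type = "prob"))
```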