Statistics: Possibly the Most Comprehensive Random Forest Guide, with Code

2018-07-01  PriscillaBai

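What you need to know about random forests:

How a forest is built:

Step 1: Create a bootstrapped dataset (sample the rows with replacement)

Step 2: Build a decision tree on it, considering only a random subset of the variables at each split

Step 3: Repeat this process many times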
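The sampling in steps 1 and 2 can be sketched in a few lines of base R. This is only an illustration under assumptions: `n` is a made-up number of samples, and the names `boot.idx`, `oob.idx` and `vars.tried` are hypothetical, not something the randomForest package exposes.

```r
set.seed(1)   # only so this sketch is reproducible
n <- 10       # hypothetical number of samples
p <- 13       # number of predictors, as in the heart-disease data used later

# Step 1: draw n row indices WITH replacement (the bootstrapped dataset)
boot.idx <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" (OOB) for this tree
oob.idx <- setdiff(seq_len(n), boot.idx)

# Step 2: at a split, only a random subset of the predictors is considered
vars.tried <- sample(p, size = floor(sqrt(p)))

boot.idx
oob.idx
vars.tried
```

Each tree gets its own bootstrap sample, so each tree also has its own set of OOB rows. Those rows are what the OOB error is later computed on.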

Questions:

Hands-on code

1. Get the data. We use the heart disease dataset from the UCI Machine Learning Repository.

library(ggplot2)
library(cowplot)
library(randomForest)
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
data <- read.csv(url, header=FALSE)
colnames(data) <- c(
  "age",
  "sex",# 0 = female, 1 = male
  "cp", # chest pain
  # 1 = typical angina,
  # 2 = atypical angina,
  # 3 = non-anginal pain,
  # 4 = asymptomatic
  "trestbps", # resting blood pressure (in mm Hg)
  "chol", # serum cholesterol in mg/dl
  "fbs",  # fasting blood sugar greater than 120 mg/dl, 1 = TRUE, 0 = FALSE
  "restecg", # resting electrocardiographic results
  # 0 = normal
  # 1 = having ST-T wave abnormality
  # 2 = showing probable or definite left ventricular hypertrophy
  "thalach", # maximum heart rate achieved
  "exang",   # exercise induced angina, 1 = yes, 0 = no
  "oldpeak", # ST depression induced by exercise relative to rest
  "slope", # the slope of the peak exercise ST segment
  # 1 = upsloping
  # 2 = flat
  # 3 = downsloping
  "ca", # number of major vessels (0-3) colored by fluoroscopy
  "thal", # this is short for thallium heart scan
  # 3 = normal (no cold spots)
  # 6 = fixed defect (cold spots during rest and exercise)
  # 7 = reversible defect (when cold spots only appear during exercise)
  "hd" # (the predicted attribute) - diagnosis of heart disease
  # 0 if less than or equal to 50% diameter narrowing
  # 1 if greater than 50% diameter narrowing
)
head(data)

2. Clean the data

str(data)
str(data) shows that sex, cp, fbs, restecg and so on are really factors but were read in as num. Some entries are also filled with "?" and need to be changed to NA.
## First, replace "?"s with NAs.
data[data == "?"] <- NA

## Now add factors for variables that are factors and clean up the factors
## that had missing data...
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"
data$sex <- as.factor(data$sex)

data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)

data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)  # ...then convert the integers to factor levels

data$thal <- as.integer(data$thal) # "thal" also had "?"s in it.
data$thal <- as.factor(data$thal)

## This next line replaces 0 and 1 with "Healthy" and "Unhealthy"
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
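The cleaning steps above are easier to follow on a tiny data frame. The sketch below uses made-up values in a hypothetical `toy` data frame. It also uses `factor(..., labels=)` for the sex column, which is an alternative idiom and not what the code above does.

```r
# Hypothetical toy data with the same problems as the real file
toy <- data.frame(sex = c(0, 1, 1, 0),
                  ca  = c("0.0", "3.0", "?", "1.0"),
                  stringsAsFactors = FALSE)

# Replace "?" with NA everywhere in the data frame
toy[toy == "?"] <- NA

# Recode 0/1 as labelled factor levels in one step
toy$sex <- factor(toy$sex, levels = c(0, 1), labels = c("F", "M"))

# Strings like "3.0" become integers first, then factor levels
toy$ca <- as.factor(as.integer(toy$ca))

str(toy)
```

The NA in `ca` survives both conversions, which is why an imputation step is still needed before fitting the model.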

3. Build the model

a) First, impute the missing values using random forest proximities

set.seed(43)
data.imputed<-rfImpute(hd~.,data=data,iter=6)

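iter: the number of imputation iterations. Breiman says 4 to 6 are enough; more iterations will not make the OOB error any smaller
set.seed: fixes the random number generator so that the random sampling is reproducible
hd ~ .: we want to predict hd from all the other variables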
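What set.seed does is easiest to see on a plain random draw. A minimal base R sketch (the names `first.draw` and `second.draw` are just for illustration):

```r
set.seed(43)
first.draw <- sample(1:100, 5)

set.seed(43)                         # reset the generator to the same state
second.draw <- sample(1:100, 5)

identical(first.draw, second.draw)   # TRUE: same seed, same "random" numbers
```

The draws are still random in the sense that a different seed gives different numbers; the seed only makes the result repeatable from run to run.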


In the printed output, the OOB column (boxed in red in the screenshot) is the OOB error of each iteration


b) Build the random forest model

model <- randomForest(hd ~ ., data=data.imputed, proximity=TRUE)

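mtry: the number of variables considered at each split
If the response is a continuous variable, the default is the number of predictors divided by 3
If the response is a factor, the default is the square root of the number of predictors
In this example hd is a factor, so the default mtry is sqrt(13) = 3.6, rounded down to 3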
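As a quick check of the arithmetic, the two defaults can be computed directly. This is a sketch of the rule as described in the randomForest documentation, with hypothetical variable names:

```r
p <- 13   # number of predictors in the heart-disease data

# Classification (factor response): floor of the square root
mtry.classification <- floor(sqrt(p))

# Regression (continuous response): floor of p/3, but at least 1
mtry.regression <- max(floor(p / 3), 1)

mtry.classification   # 3
mtry.regression       # 4
```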


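Reading the printed model summary:
Number of trees: 500 (the default)
No. of variables tried at each split: 3 (this is mtry)
OOB estimate of error rate: 17.82% (this is the number to watch)
What the confusion matrix means:
22 Unhealthy samples were classified as Healthy
32 Healthy samples were classified as Unhealthy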
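The OOB error rate is just the misclassified samples divided by the total. The check below assumes the dataset has 303 rows, which is the size of the processed Cleveland file; the two counts come from the confusion matrix above.

```r
misclassified <- 22 + 32   # off-diagonal cells of the confusion matrix
n.samples <- 303           # assumption: rows in processed.cleveland.data

oob.error <- misclassified / n.samples
round(100 * oob.error, 2)  # 17.82
```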


c) Vary mtry and the number of trees to get the best forest
Core idea: bring the OOB, Healthy and Unhealthy error rates as low as possible

model$err.rate

Rows: row i holds the error rates of the forest built from the first i trees, i = 1 to 500
Columns: OOB error rate; Healthy error rate; Unhealthy error rate

oob.error.data <- data.frame(
  Trees=rep(1:nrow(model$err.rate), times=3),
  Type=rep(c("OOB", "Healthy", "Unhealthy"), each=nrow(model$err.rate)),
  Error=c(model$err.rate[,"OOB"],
          model$err.rate[,"Healthy"],
          model$err.rate[,"Unhealthy"]))

ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
  geom_line(aes(color=Type))


The plot shows that beyond roughly 400 trees all three error rates are essentially flat. To confirm, refit with 1000 trees:

model <- randomForest(hd ~ ., data=data.imputed, ntree=1000, proximity=TRUE)
model

oob.error.data <- data.frame(
  Trees=rep(1:nrow(model$err.rate), times=3),
  Type=rep(c("OOB", "Healthy", "Unhealthy"), each=nrow(model$err.rate)),
  Error=c(model$err.rate[,"OOB"],
          model$err.rate[,"Healthy"],
          model$err.rate[,"Unhealthy"]))

ggplot(data=oob.error.data, aes(x=Trees, y=Error)) +
  geom_line(aes(color=Type))

Between 500 and 1000 trees the error barely changes, so 500 trees are enough. Next, compare different values of mtry:

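## Try mtry = 1..10 and record the final OOB error of each forest
oob.values <- numeric(10)
for(i in 1:10) {
  temp.model <- randomForest(hd ~ ., data=data.imputed, mtry=i, ntree=1000)
  ## the last row of err.rate uses all the trees; column 1 is the OOB error
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate),1]
}
oob.values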

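The OOB error is about lowest with mtry around 3. Much smaller values give each split too few variables to choose from, and much larger values make the trees more alike, which weakens the benefit of averaging them.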
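Rather than reading the best value off the printout, which.min can pick it. The numbers in `oob.demo` below are invented purely to make the sketch self-contained; they are not results from this dataset.

```r
# Hypothetical OOB errors for mtry = 1..5 (made-up numbers, NOT real results)
oob.demo <- c(0.21, 0.18, 0.17, 0.18, 0.19)

# Index of the smallest error; with ties, which.min returns the first one
best.mtry <- which.min(oob.demo)
best.mtry   # 3
```

In the walkthrough itself the equivalent call would be which.min(oob.values), and the result could be passed as mtry when refitting the final model.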

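4. Use multidimensional scaling (MDS) to view the distances between samples, that is, how well the forest separates them

See my earlier post for the underlying theory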

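## 1 - proximity is already a dissimilarity, so use as.dist();
## dist() would instead compute Euclidean distances between the matrix rows
distance.matrix <- as.dist(1-model$proximity)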
mds.stuff <- cmdscale(distance.matrix, eig=TRUE, x.ret=TRUE)
mds.var.per <- round(mds.stuff$eig/sum(mds.stuff$eig)*100, 1)
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample=rownames(mds.values),
                       X=mds.values[,1],
                       Y=mds.values[,2],
                       Status=data.imputed$hd)

ggplot(data=mds.data, aes(x=X, y=Y, label=Sample)) +
  geom_text(aes(color=Status)) +
  theme_bw() +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) +
  ggtitle("MDS plot using (1 - Random Forest Proximities)")

The random forest does well here: the Healthy samples fall into one group and the Unhealthy samples into another
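To see what MDS does with proximities, here is a sketch on a hypothetical 4-sample proximity matrix (the values are made up). It also shows why 1 - proximity is normally passed through as.dist: the matrix already holds pairwise dissimilarities, whereas dist() would compute Euclidean distances between its rows.

```r
# Hypothetical proximities: samples 1 and 2 are alike, samples 3 and 4 are alike
prox <- matrix(c(1.0, 0.9, 0.1, 0.1,
                 0.9, 1.0, 0.1, 0.1,
                 0.1, 0.1, 1.0, 0.8,
                 0.1, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)

d.direct <- as.dist(1 - prox)   # treats 1 - proximity as the distances themselves
d.rows   <- dist(1 - prox)      # Euclidean distance between ROWS of the matrix

as.matrix(d.direct)[1, 2]       # 0.1
as.matrix(d.rows)[1, 2]         # sqrt(0.02), about 0.141

# Classical MDS on the direct distances, keeping 2 dimensions
mds <- cmdscale(d.direct, k = 2, eig = TRUE)
mds$points[, 1]                 # samples 1, 2 share one sign; 3, 4 the other
```

On the first MDS axis the two similar pairs land on opposite sides, which is the same pattern the plot above shows for Healthy and Unhealthy samples.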

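5. Look at the importance score of each variable

Note: importance=TRUE must be set, otherwise the permutation-based score (type=1, mean decrease in accuracy) is not computed and only the Gini-based score (type=2) is available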

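model <- randomForest(hd ~ ., data=data.imputed, proximity=TRUE, importance=TRUE)
importance(model, type=1)  # mean decrease in accuracy (permutation-based)
importance(model, type=2)  # mean decrease in node impurity (Gini)
varImpPlot(model)          # dot charts of both measures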
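The idea behind the type=1 score is permutation: shuffle one variable, predict again, and see how much worse the model gets. The sketch below illustrates that idea with a stand-in model, a linear regression on the built-in mtcars data, and measures the increase in mean squared error on the training rows. It is not how randomForest computes the score internally (the package uses each tree's OOB samples), and `mse` and `perm.importance` are hypothetical names.

```r
set.seed(43)  # only so the permutations are reproducible

# Stand-in model (an assumption for illustration, not a random forest)
fit <- lm(mpg ~ wt + hp, data = mtcars)
mse <- function(d) mean((d$mpg - predict(fit, newdata = d))^2)
baseline <- mse(mtcars)

# Shuffle one predictor at a time and average the increase in error
perm.importance <- sapply(c("wt", "hp"), function(v) {
  increases <- replicate(50, {
    shuffled <- mtcars
    shuffled[[v]] <- sample(shuffled[[v]])   # break the link between v and mpg
    mse(shuffled) - baseline
  })
  mean(increases)
})

perm.importance
```

A variable the model relies on heavily produces a large increase in error when shuffled; a variable the model ignores produces an increase near zero.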

