【Kaggle】泰坦尼克号生存训练
就是这篇文章,知乎称“您的帐号由于存在异常行为暂时被知乎反作弊系统限制使用”,然后任凭我申诉多久,都恢复不了了!!!最可恶的是,在你发布文章的时候一点儿提示都没有,显示发布成功,但就是擅自删除,辛苦写的文章就找不到了!!!
1.引言
本文数据来自Kaggle中知名的数据集Titanic Machine Learning from Disaster,是利用训练集训练模型,来预测测试集中的乘客能否在沉船事件存活。
说明:本文借鉴了 Megan L. Risdal 的文章,在细节处略有改动。
2.数据导入与观察
加载需要用到的包:
> pkgs <- c("dplyr","ggplot2","ggthemes","scales","mice","randomForest")
> install.packages(pkgs,dependencies = TRUE)
> library(dplyr) # 操作数据的包
> library(ggplot2) # 绘图包
> library(ggthemes) # ggplot2的主题修改包
> library(scales) # 可视化的包
> library(VIM) # 查看缺失数据的包
> library(mice) # 插补数据的包
> library(randomForest) # 随机森林
导入并初步观察数据:
> train <- read.csv(file.choose(),stringsAsFactors = FALSE)
> test <- read.csv(file.choose(),stringsAsFactors = FALSE)
> str(train)
略
> str(test)
略
两个数据集除Survived字段不同外,其他字段均相同。为了方便数据清洗,我们合并训练集与测试集。
> full <- bind_rows(train, test)
> str(full)
'data.frame': 1309 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
> summary(full)
PassengerId Survived Pclass Name Sex Age
Min. : 1 Min. :0.0000 Min. :1.000 Length:1309 Length:1309 Min. : 0.17
1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character 1st Qu.:21.00
Median : 655 Median :0.0000 Median :3.000 Mode :character Mode :character Median :28.00
Mean : 655 Mean :0.3838 Mean :2.295 Mean :29.88
3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:39.00
Max. :1309 Max. :1.0000 Max. :3.000 Max. :80.00
NA's :418 NA's :263
SibSp Parch Ticket Fare Cabin Embarked
Min. :0.0000 Min. :0.000 Length:1309 Min. : 0.000 Length:1309 Length:1309
1st Qu.:0.0000 1st Qu.:0.000 Class :character 1st Qu.: 7.896 Class :character Class :character
Median :0.0000 Median :0.000 Mode :character Median : 14.454 Mode :character Mode :character
Mean :0.4989 Mean :0.385 Mean : 33.295
3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
Max. :8.0000 Max. :9.000 Max. :512.329
NA's :1
合并后的数据中,共有1309个观测,其中训练集891个,测试集418个,生存情况(Survived)中缺失值418个(正常,需要预测的部分),年龄(Age)中缺失值263个,船票费用(Fare)中缺失值1个。
解释一下各个变量对应的含义:
3.数据清洗
3.1 观察姓名变量
我们注意到乘客名字(Name)有一个非常显著的特点:每个名字当中都包含了具体的称谓或者头衔。我们可以将这部分信息提取出来,或许可以作为非常有用的新变量。
> # 乘客姓名提取头衔
> full$Title <- gsub("(.*, )|(\\..*)", "", full$Name)
> # 按照性别划分头衔数量
> table(full$Sex, full$Title)
Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir the Countess
female 0 0 0 1 1 0 1 0 0 260 2 1 0 197 2 0 0 1
male 1 4 1 0 7 1 0 2 61 0 0 0 757 0 0 8 1 0
> # 对于那些出现频率较低的头衔合并为一类
> rare_title <- c("Capt","Col","Don","Dona","Dr", "Jonkheer", "Lady","Major", "Rev", "Sir","the Countess")
> # 数据量不多,我想试试手工调整头衔
> filter(full,full$Title %in% rare_title)%>%
+ select(Name,Sex,Age,SibSp,Parch,Title)%>%
+ arrange(Title)
Name Sex Age SibSp Parch Title
1 Crosby, Capt. Edward Gifford male 70 1 1 Capt
2 Simonius-Blumer, Col. Oberst Alfons male 56 0 0 Col
3 Weir, Col. John male 60 0 0 Col
4 Gracie, Col. Archibald IV male 53 0 0 Col
5 Astor, Col. John Jacob male 47 1 0 Col
6 Uruchurtu, Don. Manuel E male 40 0 0 Don
7 Oliva y Ocana, Dona. Fermina female 39 0 0 Dona
8 Minahan, Dr. William Edward male 44 2 0 Dr
9 Moraweck, Dr. Ernest male 54 0 0 Dr
10 Pain, Dr. Alfred male 23 0 0 Dr
11 Stahelin-Maeglin, Dr. Max male 32 0 0 Dr
12 Frauenthal, Dr. Henry William male 50 2 0 Dr
13 Brewe, Dr. Arthur Jackson male NA 0 0 Dr
14 Leader, Dr. Alice (Farnham) female 49 0 0 Dr
15 Dodge, Dr. Washington male 53 1 1 Dr
16 Reuchlin, Jonkheer. John George male 38 0 0 Jonkheer
17 Duff Gordon, Lady. (Lucille Christiana Sutherland) ("Mrs Morgan") female 48 1 0 Lady
18 Peuchen, Major. Arthur Godfrey male 52 0 0 Major
19 Butt, Major. Archibald Willingham male 45 0 0 Major
20 Byles, Rev. Thomas Roussel Davids male 42 0 0 Rev
21 Bateman, Rev. Robert James male 51 0 0 Rev
22 Carter, Rev. Ernest Courtenay male 54 1 0 Rev
23 Kirkland, Rev. Charles Leonard male 57 0 0 Rev
24 Harper, Rev. John male 28 0 1 Rev
25 Montvila, Rev. Juozas male 27 0 0 Rev
26 Lahtinen, Rev. William male 30 1 1 Rev
27 Peruschitz, Rev. Joseph Maria male 41 0 0 Rev
28 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49 1 0 Sir
29 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female 33 0 0 the Countess
> # 我将Lady调整为Mrs,Sir调整为Mr
> rare_title2 <- c("Capt","Col","Don","Dona","Dr", "Jonkheer","Major", "Rev","the Countess")
> # 对于一些称呼进行重新指定(按含义) 如mlle, ms指小姐, mme 指女士
> full$Title[full$Title %in% c("Mlle","Ms")]<- "Miss"
> full$Title[full$Title %in% c("Mme","Lady")]<- "Mrs"
> full$Title[full$Title =="Sir"]<- "Mr"
> full$Title[full$Title %in% rare_title2] <- "Rare Title"
> # 查看重新调整后的情况
> table(full$Sex, full$Title)
Master Miss Mr Mrs Rare Title
female 0 264 0 199 3
male 61 0 758 0 24
> # 从乘客姓名中,提取姓氏,作为家庭变量
> full$Surname <- sapply(full$Name, function(x) strsplit(x, split = '[,.]')[[1]][1])
> length(unique(full$Surname))
[1] 875
Megan L. Risdal 的文章中,将乘客姓氏也提取出来,提示可以发掘乘客姓氏之间的联系,但没有进行进一步操,我们这里也就不探讨了。
3.2 观察家庭情况
处理完乘客姓名这一变量,我们再考虑衍生一些家庭相关的变量。基于已有变量SubSp和Parch生成家庭人数family size 这一变量。
> # 生成家庭人数变量,包括自己在内
> full$Fsize <- full$SibSp + full$Parch + 1
> # 生成一个家庭变量:格式为姓_家庭人数
> full$Family <- paste(full$Surname, full$Fsize, sep="_")
> # 使用 ggplot2 绘制家庭人数与生存情况之间的关系
> ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +
+ geom_bar(stat="count", position="dodge") +
+ scale_x_continuous(breaks=c(1:11)) +
+ labs(x = "Family Size") +
+ theme_few()
通过图形我们可以明显发现以下特点:
- 个人和家庭人数>4的家庭,存活人数小于死亡人数
- 家庭人数在[2:4]的家庭,存活人数大于死亡人数
因此,我们可以将家庭人数变量进行分段合并,明显的可以分为3段:个人,小家庭,大家庭,由此生成新变量。
> # 离散化
> full$FsizeD[full$Fsize == 1] <- "single"
> full$FsizeD[full$Fsize < 5 & full$Fsize > 1]<- "small"
> full$FsizeD[full$Fsize > 4] <- "large"
> # 通过马赛克图查看家庭规模与生存情况之间的关系
> mosaicplot(table(full$FsizeD,full$Survived), main="Family Size by Survival", shade=TRUE)
显而易见,个人与大家庭不利于在沉船事故中生存,而小家庭当中生存率相对较高。
3.3 尝试生成更多变量
在乘客客舱变量Cabin中,也存在一些有价值的信息,如客舱层数deck。
> # 可以看出这一变量有很多缺失值,有单个客户多个客舱,格式基本为字母+数字
> head(full$Cabin,30)
[1] "" "C85" "" "C123" "" "" "E46"
[8] "" "" "" "G6" "C103" "" ""
[15] "" "" "" "" "" "" ""
[22] "D56" "" "A6" "" "" "" "C23 C25 C27"
[29] "" ""
> # 假设第一个字母即为客舱层数,建立一个层数变量(Deck),取值范围A - z:
> full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
> summary(full$Deck)
A B C D E F G T NA's
22 65 94 46 41 21 5 1 1014
上面只是对变量进行了基本处理,还有很多可以进一步操作的地方,如有些乘客名下包含很多间房 (如28行, "C23 C25 C27"), 但是这一变量有1014 个缺失值,占比太高了。 后面就不再进一步考虑。
4.处理缺失值
我们先找到所有的缺失数据看看情况:
- 数值型数据:age缺失263个,fare缺失1个
- 字符型数据:cabin缺失1014个,embarked缺失2个
- 因子型数据:deck缺失1014个
> sapply(full,function(x) {sum(is.na(x))})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 418 0 0 0 263 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
1 0 0 0 0 0 0 0 1014
> sapply(full,function(x) {sum(x == "",na.rm=TRUE)})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 0 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 1014 2 0 0 0 0 0 0
处理缺失值的方法有很多种,考虑到数据集本身较小,样本数也不多,因而不能直接整行或者整列删除缺失值样本。我们考虑对于缺失值较少的,用均值或中位数填补,缺失值较多的通过现有数据和变量进行预估填补。
4.1 登船港口缺失
找到缺失数据在哪一行,观察情况。
> filter(full,full$Embarked=="")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
1 62 1 1 Icard, Miss. Amelie female 38 0 0 113572 80
2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62 0 0 113572 80
Cabin Embarked Title Surname Fsize Family FsizeD Deck
1 B28 Miss Icard 1 Icard_1 single B
2 B28 Mrs Stone 1 Stone_1 single B
我们可以看到他们支付的票价都是$ 80,舱位等级都是1,我们可以大胆推论有相同舱位等级(passenger class)和票价(Fare)的乘客,也许有着相同的登船港口?
> # 去除缺失值乘客的ID
> embark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)
> # 用 ggplot2 绘制embarkment, passenger class, & median fare 三者关系图
> ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
+ geom_boxplot() +
+ geom_hline(aes(yintercept=80),
+ colour="red", linetype="dashed", lwd=2) +
+ scale_y_continuous(labels=dollar_format()) +
+ theme_few()
很明显,港口C出发的头等舱支付的票价中位数为80,因此我们可以把两个缺失值替换为C。
> full$Embarked[c(62, 830)] <- "C"
4.1 票价缺失
找到缺失数据在哪一行,观察情况。
> filter(full,is.na(full$Fare))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title Surname
1 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0 3701 NA S Mr Storey
Fsize Family FsizeD Deck
1 1 Storey_1 single <NA>
这是从港口S出发的三等舱乘客,根据上面登船港口缺失值填补的推论,我们可以作图看看:
> ggplot(full[full$Pclass == "3" & full$Embarked == "S", ],
+ aes(x = Fare)) +
+ geom_density(fill = "#99d6ff", alpha=0.4) +
+ geom_vline(aes(xintercept=median(Fare, na.rm=T)),colour="red", linetype="dashed", lwd=1) +
+ scale_x_continuous(labels=dollar_format()) +
+ theme_few()
从得到的图形上看,将缺失值用中位数进行替换是合理的。替换数值为$8.05。
> full$Fare[1044] <- median(full[full$Pclass==3&full$Embarked=="S",]$Fare,na.rm=TRUE)
> full$Fare[1044]
[1] 8.05
4.3 年龄缺失
用户年龄(Age) 中有大量缺失存在,简单用中位数或均值肯定不妥,这里我们用mice包的随机插补,将基于其他变量构建一个预测模型对年龄缺失值进行填补。
> # 使变量因子化
> factor_vars <- c("PassengerId","Pclass","Sex","Embarked",
+ "Title","Surname","Family","FsizeD")
> full[factor_vars] <- lapply(full[factor_vars],function(x) as.factor(x))
> # 设置随机种子
> set.seed(123)
> # 执行多重插补法,剔除一些没什么用的变量
> mice_mod <- mice(full[, !names(full) %in%
+ c("PassengerId","Name","Ticket","Cabin",
+ "Family","Surname","Survived")],
+ method="rf")
iter imp variable
1 1 Age Deck
1 2 Age Deck
1 3 Age Deck
1 4 Age Deck
1 5 Age Deck
2 1 Age Deck
2 2 Age Deck
2 3 Age Deck
2 4 Age Deck
2 5 Age Deck
3 1 Age Deck
3 2 Age Deck
3 3 Age Deck
3 4 Age Deck
3 5 Age Deck
4 1 Age Deck
4 2 Age Deck
4 3 Age Deck
4 4 Age Deck
4 5 Age Deck
5 1 Age Deck
5 2 Age Deck
5 3 Age Deck
5 4 Age Deck
5 5 Age Deck
> # 保存完整输出
> mice_output <- complete(mice_mod)
> # 绘制年龄分布图
> par(mfrow=c(1,2))
> hist(full$Age, freq=F, main="Age: Original Data",
+ col="darkgreen", ylim=c(0,0.04))
> hist(mice_output$Age, freq=F, main="Age: MICE Output",
+ col="lightgreen", ylim=c(0,0.04))
从上面的图来看,数据填补前后,并没有发生明显的偏移,随机插补应该是有效的,那么下面可以用mice模型的结果对原年龄数据进行替换。
> full$Age <- mice_output$Age
> # 检查缺失值是否被完全替换了
> sum(is.na(full$Age))
[1] 0
5.特征工程
现在我们知道每一位乘客的年龄,那么基于“妇女与儿童优先”的原则,我们可以生成一些变量,如儿童(Child)和 母亲(Mother)。
划分标准:
- 儿童 : 年龄Age < 18
- 母亲 : 女性+年龄 > 18+拥有超过1个子女+头衔不是'Miss'
> # 首先我们来看年龄与生存情况之间的关系
> ggplot(full[1:891,], aes(Age, fill = factor(Survived))) +
+ geom_histogram() +
+ # 分性别来看,因为我们知道性别对于生存情况有重要影响
+ facet_grid(.~Sex) +
+ theme_few()
生成child变量, 并且基于此划分儿童child与成人adult。
> full$child[full$Age < 18] <- "Child"
> full$child[full$Age >= 18] <- "Adult"
> # 展示对应人数
> table(full$child, full$Survived)
0 1
Adult 479 270
Child 70 72
下面来生成母亲这个变量。
> #增加mother变量
> full$Mother <- "NotMother"
> full$Mother[full$Sex == "female" & full$Parch > 0 & full$Age > 18 & full$Title != "Miss"] <- "Mother"
> table(full$Mother, full$Survived)
0 1
Mother 15 39
NotMother 534 303
将child和mother变量转化成因子类型
> full$child <- factor(full$child)
> full$Mother <- factor(full$Mother)
至此,所有我们需要的变量都已经生成,并且其中没有缺失值。 为了保险起见,我们进行二次确认。
> sapply(full,function(x){sum(is.na(x))})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 418 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 0 0 0 0 0 0 0 1014
child Mother
0 0
> sapply(full,function(x){sum(x=="",na.rm=T)})
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket
0 0 0 0 0 0 0 0 0
Fare Cabin Embarked Title Surname Fsize Family FsizeD Deck
0 1014 0 0 0 0 0 0 0
child Mother
0 0
6.模型设定与预测
在完成上面的工作之后,我们进入到最后一步:预测泰坦尼克号上乘客的生存状况。 在这里我们使用随机森林分类算法(The RandomForest Classification Algorithm) 。
第一步,拆分训练集与测试集
> train <- full[1:891,]
> test <- full[892:1309,]
第二步, 建立模型
> # 设置随机种子
> set.seed(1234)
> # 建立模型 (注意: 不是所有可用变量全部加入)
> rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp +Parch +
+ Fare + Embarked + Title + FsizeD + child + Mother,
+ data = train)
> # 显示模型误差
> plot(rf_model, ylim=c(0,0.36))
> legend("topright", colnames(rf_model$err.rate), col=1:3, fill=1:3)
黑色那条线表示:整体误差率(the overall error rate)保持在20% 左右
红色和绿色线分别表示:遇难与生还的误差率,红线保持在10%,绿线保持在30%左右。
我们的模型,在预测死亡情况时更准确一些。
第三步, 查看变量重要性
> importance <- importance(rf_model)
> varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,"MeanDecreaseGini"],2))
> rankImportance <- varImportance %>%
+ mutate(Rank = paste0("#",dense_rank(desc(Importance))))
> # 作图
> ggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) +
+ geom_bar(stat="identity") +
+ geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = "red") + labs(x = "Variables") +
+ coord_flip()
我们从图上可以看出,头衔和票价对于生存情况影响最大,其次是性别和年龄,而乘客舱位排第五。 而母亲和孩子对于生存与否的影响最小,真是有点出乎意料。
第四步, 预测
> # 将模型带入测试集
> prediction <- predict(rf_model, test)
> # 保存结果
> solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)
> # 输出结果到CSV文件格式
> write.csv(solution, file = "rf_mod_Solution.csv", row.names = F)
7.总结
本次数据分析是基于Megan L. Risdal 的文章,算是一次深度模仿,而且对于其中的一些操作还有不太理解的部分,还应该多看多学,争取早日独立完成一项数据分析工作。