R语言知识干货R & Python的笔记

R语言数据分析实例一:离职率分析与建模预测

2018-05-17  本文已影响0人  上火了

一、背景说明

本文分析利用IBM离职员工数据进行分析。在对离职率的影响因素进行观察的基础至上,建立模型并预测哪些员工更易离职。

一般而言,数据分析分为三个步骤:数据收集与清洗、探索性分析和建模预测。本文的数据集是IBM用于研究员工预测的模拟数据,数据十分完整,无需清洗。因此,本文主要分为三个部分:

通过对IBM离职员工数据实践,本文希望发掘出影响员工流失的因素,并对利用R语言进行数据分析过程进行复习,深化对数据分析工作意义的理解。


二、数据集说明

IBM离职员工数据集共有35个变量,1470个观测个案。部分需要重点关注的变量如下:


重点变量信息

上述变量可以分为三个部分:


三、探索性数据分析

载入分析包和数据集

library(tidyverse)
library(rpart)
library(rpart.plot)
library(randomForest)
library(gbm)
library(ggplot2)
library(ggthemes)
#提供describe函数
library(Hmisc)
#提供grid.arrange()函数,用于排列图片
library(gridExtra)
#提供roc函数
library(pROC)

Attr_data <- read.csv(file = "D:/MY/数据分析/RWorld/实验室/IBM_Employee_Attrition.csv")

(一)描述性统计信息

str(Attr_data)
describe(Attr_data)
描述性统计信息一 描述性统计信息二 描述性统计信息三

通过描述性统计可以初步观测到:


(二)可视化探索

1.基本身份信息

p_Gender <- ggplot(data = Attr_data, aes(x = Gender)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized_2() +
  labs(title = "性别 VS 离职", x = "性别", y = "比例")

p_Age <- ggplot(data = Attr_data, aes(x = Age)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized_2() +
  labs(title = "年龄 VS 离职", x = "年龄", y = "") +
  scale_x_continuous(breaks = seq(18, 60, 5))

p_Education  <- ggplot(data = Attr_data, aes(x = Education)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized_2() +
  labs(title = "教育程度 VS 离职", x = "教育程度", y = "比例")

p_MaritalStatus <- ggplot(data = Attr_data, aes(x = MaritalStatus)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized_2() +
  labs(title = "婚姻状况 VS 离职", x = "婚姻状况", y = "比例")

p_NumCompaniesWorked <- ggplot(data = Attr_data, aes(x = NumCompaniesWorked)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized_2() +
  labs(title = "工作过的企业数量 VS 离职", x = "工作过的企业数量", y = "")+
  scale_x_continuous(breaks = seq(0, 9, 1)) 
性别与离职率 年龄与离职率 受教育程度与离职率 婚姻状况与离职率 工作企业数与离职率
分析结果:

2.员工公司身份信息

p_TotalWorkingYears <- ggplot(data = Attr_data, aes(x = TotalWorkingYears)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "总工龄 VS 离职", x = "总工龄", y = "") +
  scale_x_continuous(breaks = seq(0, 40, 5))
  
p_YearsAtCompany <- ggplot(data = Attr_data, aes(x = YearsAtCompany)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "本公司工龄 VS 离职", x = "本公司工龄", y = "") +
  scale_x_continuous(breaks = seq(0, 40, 5))

p_JobRole <- ggplot(data = Attr_data, aes(x = JobRole)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "职位 VS 离职", x = "职位", y = "比例") +
  theme(axis.text.x = element_text(angle = 90))

p_JobLevel <- ggplot(data = Attr_data, aes(x = JobLevel)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "职级 VS 离职", x = "职级", y = "比例")
总工龄与离职 本公司工龄与离职 职级与离职 职位与离职
分析结果:

3.薪资与福利信息

(1)月薪、工作投入和绩效评分

p_MonthlyIncome <- ggplot(data = Attr_data, aes(x = MonthlyIncome)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "月薪 VS 离职", x = "月薪", y = "") +
  scale_x_continuous(breaks = seq(0, 20000, 3000)) +
  theme(axis.text.x =  element_text(angle = 15))

p_JobInvolvement <- ggplot(data = Attr_data, aes(x = JobInvolvement)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "工作投入 VS 离职", x = "工作投入", y = "比例")
  
p_PerformanceRating <- ggplot(data = Attr_data, aes(x = PerformanceRating)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "绩效评分 VS 离职", x = "绩效评分", y = "比例") +
  scale_x_continuous(breaks = seq(3, 4, 1))

Attr_data$JobInvolvement1 <- as.character(Attr_data$JobInvolvement)
p_JobInvolvement_MonthlyIncome <- ggplot(data = Attr_data, aes(x = JobInvolvement1, y = MonthlyIncome)) +
  geom_boxplot(aes(fill = Attrition)) +
  theme_solarized_2() +
  labs(title = "工作投入与月薪", x = "工作投入", y = "月薪")
工作投入与离职 绩效评分与离职 月薪与离职 投入产出与离职

分析结果:


(2)福利相关变量

p_StockOptionLevel <- ggplot(data = Attr_data, aes(x = StockOptionLevel)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "股权认购优先级别 VS 离职", x = "股权认购优先级别", y = "比列") 
  
p_PercentSalaryHike <- ggplot(data = Attr_data, aes(x = PercentSalaryHike)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "涨薪比例 VS 离职", x = "涨薪比例", y = "") +
  scale_x_continuous(breaks = seq(0, 26, 2))

p_TrainingTimesLastYear <- ggplot(data = Attr_data, aes(x = TrainingTimesLastYear)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "上年度培训次数 VS 离职", x = "上年度培训次数", y = "比例") +
  scale_x_continuous(breaks = seq(0, 6, 1)) 

p_YearsSinceLastPromotion <- ggplot(data = Attr_data, aes(x = YearsSinceLastPromotion)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "距上次升职间隔 VS 离职", x = "距上次升职间隔", y = "") +
  scale_x_continuous(breaks = seq(0, 15, 1)) 
优先股认购权与离职 涨薪比列与离职 培训与离职 升职与离职
分析结果:

4.生活质量相关

(1)主观满意度调查

p_EnvirnomentSatisfaction <- ggplot(data = Attr_data, aes(x = EnvironmentSatisfaction)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "工作环境满意度 VS 离职", x = "工作环境满意度", y = "比列")

p_JobSatisfication <- ggplot(data = Attr_data, aes(x = JobSatisfaction)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "工作满意度 VS 离职", x = "工作满意度", y = "比列")

p_RelationshipSatisfaction <- ggplot(data = Attr_data, aes(x = RelationshipSatisfaction)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "关系满意度 VS 离职", x = "关系满意度", y = "比列")

p_WorkLifeBalance <- ggplot(data = Attr_data, aes(x = WorkLifeBalance)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "工作与生活平衡 VS 离职", x = "工作与生活平衡", y = "比列")
工作环境满意度与离职 工作满意度与离职 工作关系满意度与离职 工作生活平衡与离职
分析结果:

(2)客观工作生活冲突

p_DistanceFromHome <- ggplot(data = Attr_data, aes(x = DistanceFromHome)) +
  geom_density(aes(fill = Attrition), alpha = 0.7) +
  theme_solarized() +
  labs(title = "上班距离 VS 离职", x = "上班距离", y = "") +
  scale_x_continuous(breaks = seq(1, 29, 1))

p_OverTime <- ggplot(data = Attr_data, aes(x = OverTime)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "加班情况 VS 离职", x = "加班情况", y = "比列")

p_BusinessTravel <- ggplot(data = Attr_data, aes(x = BusinessTravel)) +
  geom_bar(aes(fill = Attrition), position = "fill") +
  theme_solarized() +
  labs(title = "出差 VS 离职", x = "出差", y = "比列") +
  theme(axis.text.x = element_text(angle = 90))
加班与离职 上班距离与离职 出差与离职
分析结果:

(三)探索性分析结论

基于对数据的探索性分析,员工离职有多方面因素的影响,主要有:

1.工作与生活的不平衡——加班、离家远和出差等;
2.工作投入如果不能获得相匹配的回报,员工更倾向离职;
3.优先股认购等福利是员工较为关注的回报形式;
4.年龄、任职过的公司数量的因素也会影响员工离职率;


四、训练模型

(一)决策树模型

1.变量整理

删除需要的变量:EmployeeCount, EmployeeNumber, Over18, StandardHours
变量重新编码:JobRole, EducationFiled

Attr_data_predicted <- Attr_data %>%
  select(- EmployeeCount, -EmployeeNumber, -Over18, -StandardHours)
levels(Attr_data$JobRole) <- c("HC", "HR", "Lab", "Man", "MDir", "RsD", "RsSci", "SlEx", "SlRep")
levels(Attr_data$EducationField) <- c("HR", "LS", "MRK", "MED", "NA", "TD")

2.分割数据

set.seed(3221)
n <- nrow(Attr_data_predicted)
rnd <- sample(n, n * 0.7)
train <- Attr_data_predicted[rnd, ]
test <- Attr_data_predicted[- rnd, ]

3.建模预测

dtree <- rpart(Attrition ~ ., data = trian)
preds <- predict(dtree, test, type = "class")
rcov <- roc(as.numeric(test$Attrition), as.numeric(preds))

#观察模型可行性
rcov$auc
prop.table(table(test$Attrition, preds, dnn = c("Actual", "Predicted")),1)

#绘制决策树图
dtreepr <- prune(dtree, cp = 0.01666667)
predspr <- predict(dtreepr, test, type = "class")
rocvpr <- roc(as.numeric(test$Attrition), as.numeric(predspr))

rocvpr$auc
rpart.plot(dtreepr, 
           type = 4,
           tweak = 0.9,
           fallen.leaves = F,
           cex = 0.7)
Area under the curve: 0.6417
      Predicted
Actual         No        Yes
   No  0.93548387 0.06451613
   Yes 0.65217391 0.34782609
Area under the curve: 0.6334
决策树模型

分析结果表明:


(二)随机森林模型

set.seed(2343)
fit_forest <- randomForest(Attrition ~ ., data = train)
rfpreds <- predict(fit_forest, test, type = "class")
# 计算AUC面积
rocrf <- roc(as.numeric(test$Attrition), as.numeric(rfpreds))
rocrf$auc
Area under the curve: 0.5612

随机森林所得的AUC值为0.5612,小于决策树模型。


(三)GBM模型

set.seed(3433)
# 定义10折交叉检验的控制器用于下面所有GBM模型的训练
ctrl <- trainControl(method = "cv",
                     number = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE) 
fit_gbm <- train(Attrition ~.,
                data = train,
                method = "gbm",
                verbose = FALSE,
                metric = "ROC",
                trControl = ctrl)
gbmpreds <- predict(fit_gbm, test)
rocgbm <- roc(as.numeric(test$Attrition), as.numeric(gbmpreds))
rocgbm$auc
Area under the curve: 0.5915

GBM模型得到的AUC值为0.5915


(四)对GBM模型进行优化

对于对于随机森林和GBM的方法,AUC值小于单一决策树模型的AUC值的情况较少见,这显然说明单一的树拟合得更好或者更稳定的情况。(一般需要得到AUC值大于0.75的模型)

当结果分类变量之间的比列是1:10或者更高的时候,通常需要考虑优化模型。本例中,离职变量的比列是1:5左右,但仍然可能是合理的,因为在决策树中看到的主要问题是预测那些实际离开的人(敏感度)。

# 设置与之前GBM建模控制器一致的种子
ctrl$seeds <- fit_gbm$control$seeds

1.通过加权的方式优化GBM模型

加权旨在降低少数群体中的错误,这里是离职群体。

#设置权重参数,提高离职群体的样本权重
model_weights <- ifelse(train$Attrition == "No", 
                        (1/table(train$Attrition)[1]),
                        (1/table(train$Attrition)[2])) 
weightedfit <- train(Attrition ~ .,
                     data = train,
                     method = "gbm",
                     verbose = FALSE,
                     weights = model_weights,
                     metric = "ROC", 
                     trControl = ctrl)
weightedpreds <- predict(weightedfit, test)
rocweight <- roc(as.numeric(test$Attrition), as.numeric(weightedpreds))
rocweight$auc
Area under the curve: 0.7803

2.向上采样

向上采样(up-sampling)指从多数类中随机删除实例。

ctrl$sampling <- "up" 
set.seed(3433)
upfit <- train(Attrition ~.,
               data = train,
               method = "gbm",
               verbose = FALSE,
               metric = "ROC",
               trControl = ctrl)
uppreds <- predict(upfit, test)
rocup <- roc(as.numeric(test$Attrition), as.numeric(uppreds))
rocup$auc
Area under the curve: 0.7387

3.向下采样

向下采样(down-sampling)指从少数类中复制实例。

ctrl$sampling <- "down"
set.seed(3433)
downfit <- train(Attrition ~.,
                 data = train,
                 method = "gbm",
                 verbose = FALSE,
                 metric = "ROC",
                 trControl = ctrl)
downpreds <- predict(downfit, test)
rocdown <- roc(as.numeric(test$Attrition), as.numeric(downpreds))
rocdown$auc
Area under the curve: 0.7491

分析结果表明:
加权调整的模型表现最好,相比较于单纯的随机森林和GBM模型,AUC值从0.5612上升至0.7803,灵敏度也达到了0.7276。据此,后续将采用加权调整后的模型进行预测。

prop.table(table(test$Attrition, weightedpreds, dnn = c("Actual", "Predicted")),1)
      Predicted
Actual        No       Yes
   No  0.8360215 0.1639785
   Yes 0.2753623 0.7246377

五、模型应用

已经训练出一个表现较好的模型。将其应用于实践时,需要注意以下几个方面:

(一)变量重要性列表

varImp(weightedfit)
gbm variable importance

  only 20 most important variables shown (out of 47)

                                Overall
OverTimeYes                      100.00
MonthlyIncome                     57.94
JobLevel                          56.12
Age                               41.13
NumCompaniesWorked                34.17
JobSatisfaction                   33.12
YearsAtCompany                    25.67
DistanceFromHome                  24.09
EnvironmentSatisfaction           23.28
StockOptionLevel                  22.58
YearsWithCurrManager              22.55
DailyRate                         21.87
JobInvolvement                    17.62
RelationshipSatisfaction          16.20
YearsSinceLastPromotion           16.11
BusinessTravelTravel_Frequently   15.20
WorkLifeBalance                   14.90
PercentSalaryHike                 13.58
MonthlyRate                       13.23
HourlyRate                        12.25

可以观察到影响员工流失的前5个因素是:

因此,在实践中就需要注意:


(二)利用模型预测员工离职可能性

本例中对工作投入高、收入低的员工进行预测。

weightedprobs <- predict(weightedfit, test, type = "prob")
test$Prediction <- weightedprobs$Yes
ggplot(data = test, aes(x = MonthlyIncome, y = Prediction)) +
  geom_point(aes(color = JobInvolvement1), alpha = 0.7) +
  geom_smooth(method = "lm") +
  facet_wrap(~ JobInvolvement1) +
  theme_solarized_2() +
  theme(legend.position = "none") +
  labs(title = "工作投入情况", x = "月收入", y = "离职可能性")
工作投入情况与员工离职可能性
结果表明:
投入高的,随着收入增加,离职曲线反而比较平稳。因此,在可能在IBM中,该类员工第一回报可能并不是金钱利益,需要进一步探索。

利用模型预测哪些部门和工作角色离职率高

ggplot(data = test, aes(x = JobRole, y = Prediction)) +
  geom_boxplot(aes(fill = JobRole, alpha = 0.5)) +
  theme_solarized_2() +
  labs(title = "各部门离职可能性比较", x = "部门", y = "离职可能性")
各部门离职可能性
结果表明:
销售人员离职可能性最高,平均超过50%。

六、余论

本例分析仍有需要足够完善的地方,还可以往更多更有意义的地方探索:

上一篇 下一篇

猜你喜欢

热点阅读