Xiaojie Explains "R for Data Science": Chapter 11, Working with Factors Using forcats
2018-11-03
小洁忘了怎么分身
1. Preparation
library(tidyverse)
library(forcats)
2. Creating factors
# Create a character vector
x1 <- c("Dec", "Apr", "Jan", "Mar")
# Define the valid levels
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
# Create the factor
y1 <- factor(x1, levels = month_levels)
y1
# Values that are not in the levels vector are silently converted to NA
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
y2
# To get a warning instead of a silent NA, use readr::parse_factor():
y2 <- parse_factor(x2, levels = month_levels)
y2
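The difference between the two calls is easier to see side by side. The sketch below first finds the values that fall outside the allowed levels using base R, then compares factor() with parse_factor(). The variable names `bad` and `y2p` are made up for illustration, and the example assumes readr is installed (it is loaded by the tidyverse).

```r
library(readr)

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that are not valid levels (hypothetical helper variable)
bad <- setdiff(x2, month_levels)
bad

# factor() turns invalid values into NA without any message
y2 <- factor(x2, levels = month_levels)
is.na(y2)

# parse_factor() also returns NA, but raises a warning and
# records the failures, which problems() can list
y2p <- parse_factor(x2, levels = month_levels)
problems(y2p)
```

Checking with setdiff() before conversion is one way to catch typos such as "Jam" early; it is a convenience, not something the original workflow requires.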
Level order
# Default: alphabetical order
factor(x1)
# Custom: set the order with the levels argument
f1<-factor(x1, levels = month_levels)
# Match the order in which the values first appear in the data
f2<-factor(x1, levels = unique(x1))
Viewing the levels
levels(f2)
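As a small sketch of the ordering options above: the order of first appearance can be set when the factor is created (with unique()) or afterwards with forcats' fct_inorder(). The name `f3` is only for illustration.

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# Levels in order of first appearance, set at creation time
f2 <- factor(x1, levels = unique(x1))

# Same order, applied after the factor already exists
f3 <- fct_inorder(factor(x1))

levels(f2)
levels(f3)
identical(levels(f2), levels(f3))

# sort() follows the level order, not the alphabet
sort(f2)
```

The level order matters in practice because sorting, tables, and plot axes all follow it.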
3. Example
Dataset: forcats::gss_cat
Viewing the levels of a factor column in a data frame
# Method 1: count
gss_cat %>%
count(race)
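# Method 2: a bar chart with geom_bar
## A bar chart only needs one mapped column, which becomes the x axis, because the y axis is the count. In effect, geom_bar is a visualization of count.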
ggplot(gss_cat, aes(race)) +
geom_bar()
## By default, ggplot2 drops levels that have no data; force them to display with drop = FALSE:
ggplot(gss_cat, aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
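The same "declared but empty" levels can be inspected without a plot. The following is a minimal sketch on a toy factor (the factor `f` is invented for illustration and is not part of gss_cat):

```r
library(forcats)

# Toy factor: "c" is a declared level but never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

levels(f)      # all declared levels, including "c"
table(f)       # base R count; "c" appears with a count of 0
fct_count(f)   # forcats equivalent, returned as a tibble

# fct_drop() removes levels that have no observations
levels(fct_drop(f))
```

Whether to keep or drop empty levels depends on the analysis: keeping them makes it visible that a category was possible but never observed.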
4. Modifying factor levels
Recoding: fct_recode
# View the levels of a factor column in the data frame
gss_cat %>% count(partyid)
# Recode: new = old, written like an assignment (new name on the left, old name on the right)
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
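# Here mutate overwrites the partyid column in the piped result; gss_cat itself is unchanged unless the result is assigned back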
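To make the new = old rule concrete, here is a minimal sketch on a toy factor (`f` and `g` are invented names). It also illustrates that levels not mentioned in the call are left as they are.

```r
library(forcats)

f <- factor(c("lo", "hi", "mid", "lo"))
levels(f)  # alphabetical: "hi" "lo" "mid"

# new = old; "mid" is not mentioned, so it is left unchanged
g <- fct_recode(f, "Low" = "lo", "High" = "hi")
levels(g)

# The values are renamed in place; the number of observations is unchanged
as.character(g)
```

Note that fct_recode keeps each level in its original position, so only the labels change, not the order.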
Assigning several old levels to the same new level
# (see the last three arguments of fct_recode)
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
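There seems to be no simple way to reverse this merge: once several levels share one name, the column no longer records which original level each row had.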
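One way around this is to write the merged levels to a new column rather than overwriting the original. The sketch below assumes a new column name, `partyid_grouped`, chosen purely for illustration.

```r
library(dplyr)
library(forcats)

grouped <- gss_cat %>%
  mutate(partyid_grouped = fct_recode(partyid,
    "Other" = "No answer",
    "Other" = "Don't know",
    "Other" = "Other party"
  ))

# Original and merged levels side by side
grouped %>% count(partyid, partyid_grouped)
```

Because partyid is still present, the mapping from original to merged levels stays visible and the merge can be redone differently later.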
A more specialized function for merging levels: fct_collapse. For each new level, supply a character vector of the old levels to merge.
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
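As a quick check of how the two functions relate, this sketch merges the same levels of a toy factor with both fct_collapse and fct_recode and compares the results (`f`, `collapsed`, and `recoded` are invented names):

```r
library(forcats)

f <- factor(c("a", "b", "c", "a"))

# One new level per argument, old levels given as a vector
collapsed <- fct_collapse(f, ab = c("a", "b"))

# The same merge with fct_recode needs one argument per old level
recoded <- fct_recode(f, ab = "a", ab = "b")

levels(collapsed)
identical(as.character(collapsed), as.character(recoded))
```

With many levels to merge, the vector form of fct_collapse is shorter and keeps each group's members together in one place.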