
Xiaojie Explains "R for Data Science": Chapter 11, Handling Factors with forcats

2018-11-03  小洁忘了怎么分身

1. Preparation

library(tidyverse)
library(forcats)

2. Creating factors

# Create a character vector
x1 <- c("Dec", "Apr", "Jan", "Mar")
# Define the valid levels
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
# Create the factor
y1 <- factor(x1, levels = month_levels)
y1
# Values that are not in the levels vector are silently converted to NA
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
y2
# parse_factor() (from readr) also yields NA, but warns about the invalid value
y2 <- parse_factor(x2, levels = month_levels)
y2
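
Because factor() converts unknown values to NA without any message, it can help to check for values outside the level set before converting. The following is a minimal sketch in base R that reuses the names from above; the variable `bad_values` is introduced here purely for illustration.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that are not valid levels (bad_values is a hypothetical helper name)
bad_values <- setdiff(x2, month_levels)
bad_values

# factor() turns those values into NA without a warning
y2 <- factor(x2, levels = month_levels)
is.na(y2)
```

Running a check like this first makes typos such as "Jam" visible before they disappear into NA.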

Level order

# Default: levels in alphabetical order
factor(x1)
# Custom order: set it with the levels argument
f1 <- factor(x1, levels = month_levels)
# Order of first appearance in the data
f2 <- factor(x1, levels = unique(x1))
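
The order of first appearance can also be applied after the factor has been created. A small sketch using forcats::fct_inorder(), with `f3` as an illustrative name that does not appear in the original post:

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# Levels in order of first appearance, set at creation time
f2 <- factor(x1, levels = unique(x1))

# The same ordering applied after creation with fct_inorder()
f3 <- fct_inorder(factor(x1))

levels(f3)
```

Both approaches should give the same levels here; fct_inorder() is convenient when the factor already exists and only its level order needs to change.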

Inspecting levels

levels(f2)

3. Example

Dataset: forcats::gss_cat
Viewing the levels of a factor column in a data frame

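# Method 1: count()
gss_cat %>%
  count(race)
# Method 2: bar chart with geom_bar()
## A bar chart needs only one mapped column, used for the x axis, because the y axis is the count. In effect, geom_bar() is a visualization of count().
ggplot(gss_cat, aes(race)) +
  geom_bar()
## Levels with no data are dropped by default; force them to display with drop = FALSE:
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)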
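
To see which levels have no observations without drawing a plot, base R's table() is one option, since it reports every level including those with a count of zero. The sketch below uses a toy factor rather than gss_cat; the names `f` and `counts` are illustrative only.

```r
# Toy factor with a level ("c") that never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# table() reports every level, including those with a count of zero
counts <- table(f)
counts

# Names of the levels that have no observations
names(counts)[as.vector(counts) == 0]
```

The same call on a real column, for example table(gss_cat$race), would list any empty levels alongside the populated ones.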

4. Modifying factor levels

Recoding: fct_recode

# View the levels of a factor column in the data frame
gss_cat %>% count(partyid)
# Recode with new = old, in the same direction as an assignment
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat"
  )) %>%
  count(partyid)
# mutate() overwrites the partyid column in the pipeline result; gss_cat itself is not changed
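
The new = old direction is easy to get backwards, so a tiny example may help. This sketch uses a toy factor (the values and the names `f` and `g` are made up for illustration) to show that levels not mentioned in fct_recode() are left as they are, and that the input factor is not modified.

```r
library(forcats)

f <- factor(c("apple", "bear", "banana"))

# new = old: rename "bear" to "animal"; levels not mentioned stay unchanged
g <- fct_recode(f, "animal" = "bear")

levels(f)  # the original factor is untouched
levels(g)  # "bear" has been replaced in place by "animal"
```

The renamed level keeps its position in the level order, which is why the recoded output of count() appears in the same order as before.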

Assigning several old levels to the same new level

# (see the last three lines inside fct_recode)
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong" = "Strong republican",
    "Republican, weak" = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak" = "Not str democrat",
    "Democrat, strong" = "Strong democrat",
    "Other" = "No answer",
    "Other" = "Don't know",
    "Other" = "Other party"
  )) %>%
  count(partyid)

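There appears to be no simple way to reverse this merge: once several levels share one name, the resulting factor no longer records which original level each value came from.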
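
One way to keep the option of going back is to write the merged factor to a new column instead of overwriting the original. The sketch below uses a toy data frame standing in for gss_cat; the column names `party` and `party_merged` and the values are hypothetical.

```r
library(dplyr)
library(forcats)

# Toy data frame standing in for gss_cat (hypothetical values)
df <- data.frame(party = factor(c("No answer", "Don't know", "Democrat")))

# Write the merged factor to a new column so the original column survives
df2 <- df %>%
  mutate(party_merged = fct_recode(party,
    "Other" = "No answer",
    "Other" = "Don't know"
  ))

# The merged column has fewer levels; the original column still has all three
levels(df2$party_merged)
levels(df2$party)
```

With both columns present, the original answers remain available even though the merged column cannot be split apart again.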
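A more specialized function for merging levels is fct_collapse: for each new level, supply a character vector of the old level names to merge.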

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
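
The same pattern on a toy factor makes the argument shape easier to see. This is only an illustrative sketch; the factor `f` and the new level names `ab` and `cd` are invented for the example.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d", "a"))

# Each new level name is followed by a character vector of old level names
g <- fct_collapse(f, ab = c("a", "b"), cd = c("c", "d"))

levels(g)
table(g)
```

Compared with repeating the same new name in fct_recode(), fct_collapse() groups the old levels under each new name in a single argument, which tends to be easier to read when many levels are merged.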