58. gss_cat: a useful dataset for learning about factors
2021-09-13
心惊梦醒
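forcats::gss_cat is a sample of data from the General Social Survey, a long-running US survey conducted by NORC, an independent research organization at the University of Chicago. The survey contains thousands of questions; the categorical variables included in gss_cat are well suited to demonstrating the common challenges that come up when working with factors.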
> gss_cat
# A tibble: 21,483 x 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never ma~ 26 White $8000 to~ Ind,near r~ Protest~ Souther~ 12
2 2000 Divorced 48 White $8000 to~ Not str re~ Protest~ Baptist~ NA
3 2000 Widowed 67 White Not appl~ Independent Protest~ No deno~ 2
4 2000 Never ma~ 39 White Not appl~ Ind,near r~ Orthodo~ Not app~ 4
5 2000 Divorced 25 White Not appl~ Not str de~ None Not app~ 1
6 2000 Married 25 White $20000 -~ Strong dem~ Protest~ Souther~ NA
7 2000 Never ma~ 36 White $25000 o~ Not str re~ Christi~ Not app~ 3
8 2000 Divorced 44 White $7000 to~ Ind,near d~ Protest~ Luthera~ NA
9 2000 Married 44 White $25000 o~ Not str de~ Protest~ Other 0
10 2000 Married 47 White $25000 o~ Strong rep~ Protest~ Souther~ 3
# ... with 21,473 more rows
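year: year of the survey, 2000-2014
age: age; the maximum age is truncated to 89
marital: marital status
race: race
rincome: reported income
partyid: party affiliation
relig: religion, for example Taoism, Christianity, Buddhism, or Islam
denom: denomination, i.e. a branch within a religion; for example, Buddhism is divided into Han Chinese Buddhism and Tibetan Buddhism, and Taoism into the Maoshan, Tianshi Dao, and Quanzhen Dao schools
tvhours: hours per day spent watching TV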
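The ranges stated for year and age can be checked directly against the data. The following is a minimal sketch, assuming the forcats package is installed; the variable names year_range and max_age are introduced here only for illustration.

```r
library(forcats)

# Range of survey years present in the data
year_range <- range(gss_cat$year)
year_range

# Largest recorded age; na.rm = TRUE because age contains missing values
max_age <- max(gss_cat$age, na.rm = TRUE)
max_age
```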
Six of the columns in gss_cat are factors. levels() shows every level defined for a factor, while count() shows only the values that actually occur in the current data.
> gss_cat %>% count(race)
# A tibble: 3 x 2
race n
<fct> <int>
1 Other 1959
2 Black 3129
3 White 16395
> levels(gss_cat$race)
[1] "Other" "Black" "White" "Not applicable"
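The two outputs above differ because "Not applicable" is a level of race that no row uses. As a small sketch (assuming dplyr and forcats are installed; the names factor_cols, unused_race, and race_counts are illustrative only), the factor columns and the unused level can be found programmatically, and count() can be asked to keep empty levels through its .drop argument:

```r
library(dplyr)
library(forcats)

# Names of the columns that are factors
factor_cols <- names(gss_cat)[sapply(gss_cat, is.factor)]
factor_cols

# Levels that are defined on race but never appear in the data
unused_race <- setdiff(levels(gss_cat$race),
                       unique(as.character(gss_cat$race)))
unused_race

# .drop = FALSE makes count() keep empty levels, reported with n = 0
race_counts <- gss_cat %>% count(race, .drop = FALSE)
race_counts
```

With .drop = FALSE the result has one row per level rather than one row per observed value, which is the tabular counterpart of drop = FALSE on a ggplot2 discrete scale.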
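Use geom_bar() to plot the number of respondents of each race:
> library(ggpubr)
# ggpubr attaches ggplot2 and provides ggarrange()
# By default, ggplot2 drops levels that have no values
> p1<-ggplot(gss_cat, aes(race)) + geom_bar()
# scale_x_discrete(drop = FALSE) turns off this default behavior
> p2<-ggplot(gss_cat, aes(race)) + geom_bar() + scale_x_discrete(drop = FALSE)
> ggarrange(p1,p2)
Figure: by default, ggplot2 drops levels that have no values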
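How do relig (religion) and denom (denomination) relate to each other in gss_cat? All combinations are listed below. They show that Protestant has far more denominations than any other religion (my own interpretation).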
> gss_cat %>% count(relig,denom) %>% print(n=Inf)
# A tibble: 47 x 3
relig denom n
<fct> <fct> <int>
1 No answer No answer 93
2 Don't know Not applicable 15
3 Inter-nondenominational Not applicable 109
4 Native american Not applicable 23
5 Christian No answer 2
6 Christian Don't know 11
7 Christian No denomination 452
8 Christian Not applicable 224
9 Orthodox-christian Not applicable 95
10 Moslem/islam Not applicable 104
11 Other eastern Not applicable 32
12 Hinduism Not applicable 71
13 Buddhism Not applicable 147
14 Other No denomination 7
15 Other Not applicable 217
16 None Not applicable 3523
17 Jewish Not applicable 388
18 Catholic Not applicable 5124
19 Protestant No answer 22
20 Protestant Don't know 41
21 Protestant No denomination 1224
22 Protestant Other 2534
23 Protestant Episcopal 397
24 Protestant Presbyterian-dk wh 244
25 Protestant Presbyterian, merged 67
26 Protestant Other presbyterian 47
27 Protestant United pres ch in us 110
28 Protestant Presbyterian c in us 104
29 Protestant Lutheran-dk which 267
30 Protestant Evangelical luth 122
31 Protestant Other lutheran 30
32 Protestant Wi evan luth synod 71
33 Protestant Lutheran-mo synod 212
34 Protestant Luth ch in america 71
35 Protestant Am lutheran 146
36 Protestant Methodist-dk which 239
37 Protestant Other methodist 33
38 Protestant United methodist 1067
39 Protestant Afr meth ep zion 32
40 Protestant Afr meth episcopal 77
41 Protestant Baptist-dk which 1457
42 Protestant Other baptists 213
43 Protestant Southern baptist 1536
44 Protestant Nat bapt conv usa 40
45 Protestant Nat bapt conv of am 76
46 Protestant Am bapt ch in usa 130
47 Protestant Am baptist asso 237
> gss_cat %>% count(relig,denom) %>% ggplot(aes(x=relig,y=n,fill=denom))+geom_bar(stat = "identity",position = "dodge")
Figure: visualization of religions and their denominations
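The observation that Protestant has the most denominations can also be checked numerically rather than read off the table or the plot. The following is a rough sketch, assuming dplyr and forcats are installed; denom_per_relig and n_denom are names chosen here for illustration. Note that it counts every distinct denom value, including placeholder levels such as "No answer" and "Not applicable".

```r
library(dplyr)
library(forcats)

# Number of distinct denom values observed for each relig,
# sorted so that the religion with the most denominations comes first
denom_per_relig <- gss_cat %>%
  distinct(relig, denom) %>%
  count(relig, name = "n_denom", sort = TRUE)

denom_per_relig
```

The first row should correspond to Protestant, which accounts for 29 of the 47 relig/denom combinations listed above.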