Saving pipes (%>%) in the tidyverse
2021-03-16
R语言数据分析指南
Sharing some lesser-known features of {dplyr}. If you find it useful, you can follow my WeChat official account, R语言数据分析指南.
library(tidyverse)
data("penguins", package = "palmerpenguins")
penguins <- na.omit(penguins)
1. rename() inside select()
# Long form: select, then rename
penguins %>%
  select(species, island) %>%
  rename(penguin_species = species)
# Short form: rename while selecting
penguins %>%
  select(penguin_species = species,
         island)
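To check that the two forms are interchangeable, here is a minimal sketch on a small made-up tibble. The `toy` data and the names `long_form` and `short_form` are assumptions for illustration only; they are not part of palmerpenguins.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
toy <- tibble(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  island      = c("Dream", "Biscoe", "Dream"),
  body_mass_g = c(3700, 5000, 3750)
)

# Two verbs: select the columns, then rename one of them
long_form <- toy %>%
  select(species, island) %>%
  rename(penguin_species = species)

# One verb: select() accepts new_name = old_name pairs
short_form <- toy %>%
  select(penguin_species = species, island)

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(long_form, short_form)))
names(short_form)
```

Columns come out in the order they are listed inside select(), so the renamed column can also be repositioned in the same call.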
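2. rename() inside count()
# Long form: count, then rename the n column
penguins %>%
  count(species) %>%
  rename(total = n)
# Short form: name the count column directly
penguins %>%
  count(species, name = "total")
You can also rename the counted column at the same time, just as in the select() example above:
# Long form
penguins %>%
  count(species) %>%
  rename(total = n, penguin_species = species)
# Short form
penguins %>%
  count(penguin_species = species, name = "total")
Note that the new name passed to the name argument must be quoted, whereas the new name given to the counted column does not need quotes.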
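The quoting difference is easier to see side by side. The sketch below uses a made-up `toy` tibble (an assumption, not the penguins data) and checks that the one-call form matches the count() + rename() form.

```r
library(dplyr)

# Hypothetical data: two Adelie rows and one Gentoo row
toy <- tibble(species = c("Adelie", "Adelie", "Gentoo"))

# count(), then rename both the grouping column and the count column
long_form <- toy %>%
  count(species) %>%
  rename(total = n, penguin_species = species)

# penguin_species is a bare name; "total" is a quoted string passed to `name`
short_form <- toy %>%
  count(penguin_species = species, name = "total")

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(long_form, short_form)))
short_form
```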
3. mutate() inside count()
This one is simple: write the expression you would have passed to mutate() directly inside count().
# Long form
penguins %>%
  mutate(long_beak = bill_length_mm > 50) %>%
  count(long_beak)
# Short form
penguins %>%
  count(long_beak = bill_length_mm > 50)
Of course, this also works when counting by several variables:
# Long form
penguins %>%
  mutate(long_beak = bill_length_mm > 50,
         is_adelie = species == "Adelie") %>%
  count(is_adelie, long_beak)
# Short form
penguins %>%
  count(long_beak = bill_length_mm > 50,
        is_adelie = species == "Adelie")
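As a quick check of the single-variable case, the following sketch computes the flag inside count() on a made-up `toy` tibble (the data and variable names are assumptions for illustration).

```r
library(dplyr)

# Hypothetical bill lengths: two at or below 50 mm, two above
toy <- tibble(bill_length_mm = c(39.1, 51.3, 46.5, 52.0))

# Create the logical column first, then count it
long_form <- toy %>%
  mutate(long_beak = bill_length_mm > 50) %>%
  count(long_beak)

# Let count() create the logical column on the fly
short_form <- toy %>%
  count(long_beak = bill_length_mm > 50)

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(long_form, short_form)))
short_form
```

In the multi-variable example above, the two forms list the variables in a different order, so the output columns (and row ordering) differ between them even though the counts are the same.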
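4. transmute() + select()
# Long form
penguins %>%
  mutate(body_mass_kg = body_mass_g/1000) %>%
  select(body_mass_kg)
# Short form
penguins %>%
  transmute(body_mass_kg = body_mass_g/1000)
I rarely used transmute() in the past because I thought it could only return the modified columns, which would be very limiting (in the example above, what good is a lone column of penguin body mass in kilograms?).
In fact, you can simply name the columns you want to keep, and transmute() carries those unmodified columns through just as select() would. You can of course rename them in the same step:
# Long form
penguins %>%
  mutate(body_mass_kg = body_mass_g/1000) %>%
  select(species, island, body_mass_kg) %>%
  rename(penguin_species = species)
# Short form
penguins %>%
  transmute(penguin_species = species,
            island,
            body_mass_kg = body_mass_g/1000)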
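The three-verb and one-verb versions can be compared on a small made-up tibble. In this sketch, `toy`, `long_form`, and `short_form` are illustrative names, not anything from the original data.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
toy <- tibble(
  species     = c("Adelie", "Gentoo"),
  island      = c("Dream", "Biscoe"),
  body_mass_g = c(3700, 5000),
  year        = c(2007L, 2008L)
)

# mutate() + select() + rename()
long_form <- toy %>%
  mutate(body_mass_kg = body_mass_g / 1000) %>%
  select(species, island, body_mass_kg) %>%
  rename(penguin_species = species)

# transmute() keeps only what is listed: renamed, untouched, and computed columns
short_form <- toy %>%
  transmute(penguin_species = species,
            island,
            body_mass_kg = body_mass_g / 1000)

# Stops with an error if the two results differ
stopifnot(isTRUE(all.equal(long_form, short_form)))
names(short_form)
```

Recent dplyr documentation marks transmute() as superseded in favor of mutate(.keep = "none"), so it is worth checking the help page of your installed version; the pipe-saving idea is the same either way.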
5. ungroup() inside summarize()
penguins %>%
  group_by(island, species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  ungroup()
By default, summarize() drops only the last grouping variable, so if ungroup() is not called, the output is still grouped by island:
# Without ungroup(): one grouping variable (island) remains
penguins %>%
  group_by(island, species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  group_vars()
# With ungroup(): no grouping variables remain
penguins %>%
  group_by(island, species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  ungroup() %>%
  group_vars()
You can get the same result by setting the .groups argument of summarize() to 'drop':
penguins %>%
  group_by(island, species) %>%
  summarize(mean_mass = mean(body_mass_g, na.rm = TRUE), .groups = 'drop')
# A tibble: 5 x 3
  island    species   mean_mass
  <fct>     <fct>         <dbl>
1 Biscoe    Adelie        3710.
2 Biscoe    Gentoo        5092.
3 Dream     Adelie        3701.
4 Dream     Chinstrap     3733.
5 Torgersen Adelie        3709.
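To make the effect of .groups concrete, the sketch below inspects the grouping variables left over under three settings. The `toy` data is made up, and the example assumes dplyr 1.0.0 or later, where the .groups argument is available.

```r
library(dplyr)

# Hypothetical data, grouped by two variables
grouped <- tibble(
  island      = c("Biscoe", "Biscoe", "Dream", "Dream"),
  species     = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  body_mass_g = c(3700, 5000, 3650, 3750)
) %>%
  group_by(island, species)

# "drop_last" peels off only the last grouping variable (species)
after_drop_last <- grouped %>%
  summarize(mean_mass = mean(body_mass_g), .groups = "drop_last") %>%
  group_vars()

# "drop" removes all grouping, like a trailing ungroup()
after_drop <- grouped %>%
  summarize(mean_mass = mean(body_mass_g), .groups = "drop") %>%
  group_vars()

# "keep" preserves the original grouping
after_keep <- grouped %>%
  summarize(mean_mass = mean(body_mass_g), .groups = "keep") %>%
  group_vars()

after_drop_last  # "island"
after_drop       # character(0)
after_keep       # "island" "species"
```

Setting .groups explicitly also silences the informational message that summarize() prints about the resulting grouping.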
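6. arrange() + other features inside slice()
If you want the top n rows ordered by a column, you can use top_n(), a simpler alternative to arrange() + slice():
# Long form
penguins %>%
  arrange(desc(body_mass_g)) %>%
  slice(1:5)
# Short form
penguins %>%
  top_n(5, wt = body_mass_g)
As of dplyr 1.0.0, top_n() is superseded by the slice_*() family of functions, for example slice_max():
# Superseded
penguins %>%
  top_n(5, wt = body_mass_g)
# Current
penguins %>%
  slice_max(order_by = body_mass_g, n = 5)
The most significant change in the new slice_*() functions is proper behavior on grouped data frames.
For example, the code below returns the heaviest 5% of penguins within each species:
penguins %>%
  group_by(species) %>%
  slice_max(body_mass_g, prop = .05)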
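One caveat when swapping arrange() + slice() for slice_max(): the two can return different row counts when values are tied, because slice_max() keeps ties by default. A minimal sketch, using a made-up `toy` tibble with a deliberate tie (all names here are illustrative assumptions):

```r
library(dplyr)

# Hypothetical data: rows 2 and 3 are tied at 4800 g
toy <- tibble(
  id          = 1:4,
  body_mass_g = c(5000, 4800, 4800, 4000)
)

# arrange() + slice() always returns exactly the requested rows
exact <- toy %>%
  arrange(desc(body_mass_g)) %>%
  slice(1:2)

# slice_max() keeps every row tied with the n-th value (with_ties = TRUE by default)
with_ties <- toy %>%
  slice_max(body_mass_g, n = 2)

# with_ties = FALSE caps the result at n rows
no_ties <- toy %>%
  slice_max(body_mass_g, n = 2, with_ties = FALSE)

c(exact = nrow(exact), with_ties = nrow(with_ties), no_ties = nrow(no_ties))
```

If an exact row count matters, pass with_ties = FALSE explicitly.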
7. Counting and summing by group with add_count()
add_count() adds a column containing the count for each group (or combination of groups).
##### Long Form #####
# penguins %>%
#   group_by(species) %>%
#   mutate(count_by_species = n()) %>%
#   ungroup()
penguins %>%
  add_count(species, name = "count_by_species") %>%
  select(-contains("mm"))
# A tibble: 333 x 6
   species island    body_mass_g sex     year count_by_species
   <fct>   <fct>           <int> <fct>  <int>            <int>
 1 Adelie  Torgersen        3750 male    2007              146
 2 Adelie  Torgersen        3800 female  2007              146
 3 Adelie  Torgersen        3250 female  2007              146
 4 Adelie  Torgersen        3450 female  2007              146
 5 Adelie  Torgersen        3650 male    2007              146
 6 Adelie  Torgersen        3625 female  2007              146
 7 Adelie  Torgersen        4675 male    2007              146
 8 Adelie  Torgersen        3200 female  2007              146
 9 Adelie  Torgersen        3800 male    2007              146
10 Adelie  Torgersen        4400 male    2007              146
# ... with 323 more rows
You can also use the wt argument to get sums by group (perhaps a bit hacky, but very useful):
##### Long Form #####
# penguins %>%
#   group_by(species) %>%
#   mutate(total_weight_by_species = sum(body_mass_g)) %>%
#   ungroup()
penguins %>%
  add_count(species, wt = body_mass_g,
            name = "total_weight_by_species") %>%
  select(-contains("mm"))
# A tibble: 333 x 6
   species island    body_mass_g sex     year total_weight_by_species
   <fct>   <fct>           <int> <fct>  <int>                   <int>
 1 Adelie  Torgersen        3750 male    2007                  541100
 2 Adelie  Torgersen        3800 female  2007                  541100
 3 Adelie  Torgersen        3250 female  2007                  541100
 4 Adelie  Torgersen        3450 female  2007                  541100
 5 Adelie  Torgersen        3650 male    2007                  541100
 6 Adelie  Torgersen        3625 female  2007                  541100
 7 Adelie  Torgersen        4675 male    2007                  541100
 8 Adelie  Torgersen        3200 female  2007                  541100
 9 Adelie  Torgersen        3800 male    2007                  541100
10 Adelie  Torgersen        4400 male    2007                  541100
# ... with 323 more rows
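The commented-out long form and the add_count() call can be checked against each other on a tiny made-up tibble. In this sketch the `toy` data and the variable names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: two Adelie rows and one Gentoo row
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3800, 5000)
)

# group_by() + mutate(sum()) + ungroup()
long_form <- toy %>%
  group_by(species) %>%
  mutate(total_weight_by_species = sum(body_mass_g)) %>%
  ungroup()

# add_count() with wt sums the weights within each species instead of counting rows
short_form <- toy %>%
  add_count(species, wt = body_mass_g, name = "total_weight_by_species")

# Stops with an error if the per-group sums differ
stopifnot(isTRUE(all.equal(long_form$total_weight_by_species,
                           short_form$total_weight_by_species)))
short_form
```

Unlike count(), add_count() keeps every original row, so the per-group total is repeated on each row of its group.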
By default, add_tally() adds the number of rows, which you could already get with mutate(n = n()). With the wt argument it instead adds the sum over all rows, shown here next to the per-species totals from add_count():
penguins %>%
  add_count(species, wt = body_mass_g,
            name = "total_weight_by_species") %>%
  add_tally(wt = body_mass_g,
            name = "total_weight_of_all_species") %>%
  select(1:2, last_col(0):last_col(1))
# A tibble: 333 x 4
   species island    total_weight_of_all_species total_weight_by_species
   <fct>   <fct>                           <int>                   <int>
 1 Adelie  Torgersen                     1400950                  541100
 2 Adelie  Torgersen                     1400950                  541100
 3 Adelie  Torgersen                     1400950                  541100
 4 Adelie  Torgersen                     1400950                  541100
 5 Adelie  Torgersen                     1400950                  541100
 6 Adelie  Torgersen                     1400950                  541100
 7 Adelie  Torgersen                     1400950                  541100
 8 Adelie  Torgersen                     1400950                  541100
 9 Adelie  Torgersen                     1400950                  541100
10 Adelie  Torgersen                     1400950                  541100
# ... with 323 more rows
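For comparison, a weighted add_tally() on ungrouped data should match a plain mutate() with sum(). The sketch below checks this on a made-up `toy` tibble (data and variable names are illustrative assumptions).

```r
library(dplyr)

# Hypothetical data
toy <- tibble(
  species     = c("Adelie", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 3800, 5000)
)

# Plain mutate(): the grand total is recycled onto every row
long_form <- toy %>%
  mutate(total_weight_of_all_species = sum(body_mass_g))

# add_tally() with wt: sums the weights over the whole (ungrouped) data frame
short_form <- toy %>%
  add_tally(wt = body_mass_g, name = "total_weight_of_all_species")

# Stops with an error if the totals differ
stopifnot(isTRUE(all.equal(long_form$total_weight_of_all_species,
                           short_form$total_weight_of_all_species)))
short_form
```

On a grouped data frame, add_tally() would compute the total within each group instead, which is what add_count() does after grouping by the variables it is given.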