R Daily Notes (2): the distinct function
2019-07-16
柳叶刀与小鼠标
> library(dplyr)
> library(tidyverse)
> starwars %>%
+ head()
# A tibble: 6 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Luke~ 172 77 blond fair blue 19 male Tatooine Human <chr~
2 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid <chr~
3 R2-D2 96 32 NA white, bl~ red 33 NA Naboo Droid <chr~
4 Dart~ 202 136 none white yellow 41.9 male Tatooine Human <chr~
5 Leia~ 150 49 brown light brown 19 female Alderaan Human <chr~
6 Owen~ 178 120 brown, gr~ light blue 52 male Tatooine Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
>
>
> # Observations in starwars whose mass and height are both greater than 0 (a quick way to drop rows with NA in these columns)
> mass <- 0
> height <- 0
> filter(starwars, mass > !!mass, height > !!height) %>%
+ head()
# A tibble: 6 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Luke~ 172 77 blond fair blue 19 male Tatooine Human <chr~
2 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid <chr~
3 R2-D2 96 32 NA white, bl~ red 33 NA Naboo Droid <chr~
4 Dart~ 202 136 none white yellow 41.9 male Tatooine Human <chr~
5 Leia~ 150 49 brown light brown 19 female Alderaan Human <chr~
6 Owen~ 178 120 brown, gr~ light blue 52 male Tatooine Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
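The filter above removes missing values because a comparison with NA evaluates to NA, and filter() keeps only rows where every condition is TRUE. A minimal sketch of that behaviour, using a made-up data frame (`toy` and its values are hypothetical, not taken from starwars):

```r
library(dplyr)

# Hypothetical toy data with missing values in both numeric columns
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  height = c(172, NA, 96, 150),
  mass   = c(77, 75, NA, 49),
  stringsAsFactors = FALSE
)

# NA > 0 is NA, so rows with a missing height or mass are dropped
kept <- filter(toy, mass > 0, height > 0)

# Stating the intent explicitly; the same rows here only because
# every non-missing value in toy is positive
kept_explicit <- filter(toy, !is.na(mass), !is.na(height))

print(kept)
```

If the goal is only to drop NA values, the `!is.na()` form says so directly and does not also discard zero or negative values.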
>
>
>
> # Take the fifth row of the starwars dataset
> slice(starwars, 5)
# A tibble: 1 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Leia~ 150 49 brown light brown 19 female Alderaan Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Take the fifth row of the starwars dataset (same result, using filter)
> filter(starwars, row_number() == 5)
# A tibble: 1 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Leia~ 150 49 brown light brown 19 female Alderaan Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Take the first five rows of the starwars dataset
> slice(starwars, 1:5)
# A tibble: 5 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Luke~ 172 77 blond fair blue 19 male Tatooine Human <chr~
2 C-3PO 167 75 NA gold yellow 112 NA Tatooine Droid <chr~
3 R2-D2 96 32 NA white, bl~ red 33 NA Naboo Droid <chr~
4 Dart~ 202 136 none white yellow 41.9 male Tatooine Human <chr~
5 Leia~ 150 49 brown light brown 19 female Alderaan Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Last six rows of the starwars dataset
> tail(starwars)
# A tibble: 6 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Finn NA NA black dark dark NA male NA Human <chr~
2 Rey NA NA brown light hazel NA female NA Human <chr~
3 Poe ~ NA NA brown light brown NA male NA Human <chr~
4 BB8 NA NA none none black NA none NA Droid <chr~
5 Capt~ NA NA unknown unknown unknown NA female NA NA <chr~
6 Padm~ 165 45 brown light brown 46 female Naboo Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Last row of the starwars dataset (n() is the row count, so this returns one row)
> slice(starwars, n())
# A tibble: 1 x 13
name height mass hair_color skin_color eye_color birth_year gender homeworld species films
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <lis>
1 Padm~ 165 45 brown light brown 46 female Naboo Human <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
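Since n() is the number of rows, slice(starwars, n()) returns only the final row. To take the last several rows, one option is to pass a range that ends at n(). A sketch on a hypothetical toy data frame (`toy` is made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with ten rows
toy <- data.frame(id = 1:10, value = letters[1:10], stringsAsFactors = FALSE)

# n() is the number of rows, so this selects only the last row
last_one <- slice(toy, n())

# A range ending at n() selects the last five rows
last_five <- slice(toy, (n() - 4):n())

print(last_five)
```

Base R's `tail(toy, 5)` selects the same five rows.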
The dataset used here is the starwars dataset.
A tibble with 87 rows and 13 variables:
name
Name of the character
height
Height (cm)
mass
Weight (kg)
hair_color,skin_color,eye_color
Hair, skin, and eye colors
birth_year
Year born (BBY = Before Battle of Yavin)
gender
male, female, hermaphrodite, or none.
homeworld
Name of homeworld
species
Name of species
films
List of films the character appeared in
vehicles
List of vehicles the character has piloted
starships
List of starships the character has piloted
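This post covers a common data-frame task: removing duplicate values.
The goal is to keep only the first observation for each gender, that is, to drop every row whose gender value has already appeared.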
-
Method 1: the match function
> k <- match(unique(starwars$gender), starwars$gender)
> starwars[k,c('name','gender','skin_color', 'height', 'mass')]
# A tibble: 5 x 5
name gender skin_color height mass
<chr> <chr> <chr> <int> <dbl>
1 Luke Skywalker male fair 172 77
2 C-3PO NA gold 167 75
3 Leia Organa female light 150 49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown 175 1358
5 IG-88 none metal 200 140
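match finds the position of the first row for each unique gender in the dataset; those positions are then used to extract the rows, together with the columns of interest.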
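How match and unique combine can be seen on a small vector. The sketch below uses a made-up vector `x` (hypothetical values, chosen to include NA):

```r
# Hypothetical vector with repeated values, including NA
x <- c("male", NA, "female", "male", NA, "none", "female")

# unique() lists each value once, in order of first appearance
u <- unique(x)

# match() returns, for each unique value, the index of its
# first occurrence in x (NA is matched to NA)
k <- match(u, x)

# Base R's duplicated() marks repeats, so negating it gives the same positions
k_alt <- which(!duplicated(x))

print(k)
```

Here `k` is 1, 2, 3, 6, and indexing a data frame with it keeps one row per distinct value.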
-
Method 2: group_by and ungroup
> starwars %>%
+ as_tibble %>%
+ select(name,gender, skin_color, height, mass) %>%
+ group_by(gender) %>%
+ filter(row_number(gender)==1) %>%
+ ungroup
# A tibble: 4 x 5
name gender skin_color height mass
<chr> <chr> <chr> <int> <dbl>
1 Luke Skywalker male fair 172 77
2 Leia Organa female light 150 49
3 Jabba Desilijic Tiure hermaphrodite green-tan, brown 175 1358
4 IG-88 none metal 200 140
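as_tibble first makes sure the data is a tibble (starwars already is one, so it changes nothing here), select keeps the columns of interest, group_by groups the data by gender, filter takes the first row of each gender, and ungroup removes the grouping. Note that the result has only four rows: row_number(gender) returns NA where gender is NA, so filter drops the C-3PO row that the other methods keep. Writing row_number() == 1 avoids this.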
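The difference between row_number(gender) and row_number() when a group key is NA can be checked on a small example. This is a sketch on a hypothetical data frame (`toy` is made up), not a rerun of the starwars output:

```r
library(dplyr)

# Hypothetical toy data; gender is missing for two rows
toy <- data.frame(
  name   = c("a", "b", "c", "d", "e"),
  gender = c("male", NA, "female", "male", NA),
  stringsAsFactors = FALSE
)

# row_number(gender) ranks the values of gender; NA inputs give NA,
# and filter() drops rows where the condition is NA
ranked <- toy %>%
  group_by(gender) %>%
  filter(row_number(gender) == 1) %>%
  ungroup()

# row_number() with no argument numbers the rows within each group,
# so the first row of the NA group is kept as well
positional <- toy %>%
  group_by(gender) %>%
  filter(row_number() == 1) %>%
  ungroup()

print(positional)
```

With the positional form, the NA group contributes its first row just like any other group.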
-
Method 3: the summarize function
> starwars %>%
+ as_tibble %>%
+ select(name,gender, skin_color, height, mass) %>%
+ group_by(gender) %>%
+ summarize(name = first(name), skin_color=first(skin_color),
+ height=first( height), mass=first(mass))
# A tibble: 5 x 5
gender name skin_color height mass
<chr> <chr> <chr> <int> <dbl>
1 female Leia Organa light 150 49
2 hermaphrodite Jabba Desilijic Tiure green-tan, brown 175 1358
3 male Luke Skywalker fair 172 77
4 none IG-88 metal 200 140
5 NA C-3PO gold 167 75
>
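summarize makes the ungroup step unnecessary, but it requires the user to spell out every non-grouping variable. As the output shows, the result is also sorted by gender rather than by order of first appearance.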
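Listing every non-grouping variable by hand can be avoided with across(), assuming dplyr 1.0.0 or later is installed (across() is not available in earlier versions). A sketch on a hypothetical toy data frame:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  name   = c("a", "b", "c", "d"),
  gender = c("male", "female", "male", "female"),
  height = c(172, 150, 202, 165),
  stringsAsFactors = FALSE
)

# Inside a grouped summarize, everything() covers all non-grouping
# columns, so first() is applied to each of them
first_rows <- toy %>%
  group_by(gender) %>%
  summarize(across(everything(), first))

print(first_rows)
```

As with the explicit version, the rows come back ordered by the grouping column.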
-
Method 4: distinct
> starwars %>%
+ as_tibble %>%
+ select(name,gender, skin_color, height, mass) %>%
+ group_by(gender) %>%
+ distinct(gender, .keep_all = TRUE)
# A tibble: 5 x 5
# Groups: gender [5]
name gender skin_color height mass
<chr> <chr> <chr> <int> <dbl>
1 Luke Skywalker male fair 172 77
2 C-3PO NA gold 167 75
3 Leia Organa female light 150 49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown 175 1358
5 IG-88 none metal 200 140
>
> # Remove duplicate rows of the dataframe using skin_color and gender
> starwars %>%
+ as_tibble %>%
+ select(name,gender, skin_color, height, mass) %>%
+ group_by(gender) %>%
+ distinct(skin_color, gender, .keep_all = TRUE)
# A tibble: 39 x 5
# Groups: gender [5]
name gender skin_color height mass
<chr> <chr> <chr> <int> <dbl>
1 Luke Skywalker male fair 172 77
2 C-3PO NA gold 167 75
3 R2-D2 NA white, blue 96 32
4 Darth Vader male white 202 136
5 Leia Organa female light 150 49
6 Owen Lars male light 178 120
7 R5-D4 NA white, red 97 32
8 Chewbacca male unknown 228 112
9 Greedo male green 173 74
10 Jabba Desilijic Tiure hermaphrodite green-tan, brown 175 1358
# ... with 29 more rows
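distinct looks much better: clean, short, and easy to understand. Rather than grabbing the first row of each group, it searches for duplicates and drops them. The .keep_all argument keeps all the other variables in the output data frame. The group_by step is not required by distinct itself; it is the reason the output is still marked as grouped.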
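The effect of .keep_all, and of naming more than one column, can be sketched on a small data frame without any group_by step. The data frame `toy` below is hypothetical:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  name       = c("a", "b", "c", "d", "e"),
  gender     = c("male", NA, "male", "female", NA),
  skin_color = c("fair", "gold", "fair", "light", "white"),
  stringsAsFactors = FALSE
)

# Without .keep_all, only the columns named in distinct() are returned
only_gender <- distinct(toy, gender)

# With .keep_all = TRUE, the first row for each gender keeps all columns
first_per_gender <- distinct(toy, gender, .keep_all = TRUE)

# Uniqueness can also be judged on a combination of columns
first_per_pair <- distinct(toy, gender, skin_color, .keep_all = TRUE)

print(first_per_gender)
```

Row "c" is dropped in both .keep_all calls because it repeats the (male, fair) combination of row "a", while row "e" survives only when skin_color is part of the key.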
Comparing the speed of the different methods
library(tidyverse)
d1 <- function()
{
k <- match(unique(starwars$gender), starwars$gender)
starwars[k,c('name','gender','skin_color', 'height', 'mass')]
}
d2 <- function()
{
starwars %>%
as_tibble %>%
select(name,gender, skin_color, height, mass) %>%
group_by(gender) %>%
filter(row_number(gender)==1) %>%
ungroup
}
d3 <- function()
{
starwars %>%
as_tibble %>%
select(name,gender, skin_color, height, mass) %>%
group_by(gender) %>%
summarize(name = first(name), skin_color=first(skin_color),
height=first( height), mass=first(mass))
}
d4 <- function()
{
starwars %>%
as_tibble %>%
select(name,gender, skin_color, height, mass) %>%
group_by(gender) %>%
distinct(gender,.keep_all = T)
}
library(microbenchmark)
set.seed(1234)
microbenchmark(d1(), d2(), d3(), d4(), times=9)
Unit: microseconds
expr min lq mean median uq max neval
d1() 74.668 84.870 105.6366 88.580 131.710 140.522 9
d2() 5478.496 5563.829 5808.0292 5735.888 5974.264 6379.598 9
d3() 4710.960 4761.510 5062.5474 4856.583 4876.989 7026.091 9
d4() 6099.018 6241.395 9503.2321 6422.265 6641.627 32286.160 9
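The results show that d1, the match approach, is by far the fastest, with a median of about 89 microseconds against roughly 4,900 to 6,400 microseconds for the tidyverse approaches. Among the tidyverse methods, d3 (summarize) has the lowest median in this run. Keep in mind that d2 returns four rows rather than five because it drops the NA gender, so it is not doing exactly the same job as the others.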
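Timings are only comparable when the functions select the same rows, so it can be worth checking that before benchmarking. A minimal sketch on a hypothetical toy data frame (`toy`, `by_match` and `by_distinct` are names made up for this example):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  name   = c("a", "b", "c", "d", "e"),
  gender = c("male", NA, "male", "female", NA),
  stringsAsFactors = FALSE
)

# Base R approach: positions of the first occurrence of each gender
by_match <- function(df) {
  k <- match(unique(df$gender), df$gender)
  df[k, c("name", "gender")]
}

# dplyr approach: distinct on gender, keeping the other columns
by_distinct <- function(df) {
  df %>%
    select(name, gender) %>%
    distinct(gender, .keep_all = TRUE)
}

# Compare the selected rows by name before timing the two functions
same_rows <- identical(by_match(toy)$name, by_distinct(toy)$name)
print(same_rows)
```

Once the check passes, the two functions can be passed to microbenchmark in the same way as d1() to d4() above.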