R Daily Notes (2): the distinct function

2019-07-16 柳叶刀与小鼠标

> library(dplyr)
> library(tidyverse)
> starwars %>%
+   head()
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> 
> 
> # Observations in starwars where both mass and height are greater than 0 (a quick way to drop NA values)
> mass <- 0
> height <- 0
>  filter(starwars, mass > !!mass, height > !!height)%>%
+    head()
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
6 Owen~    178   120 brown, gr~ light      blue            52   male   Tatooine  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
>  
>  
>  
> # Row 5 of the starwars dataset
> slice(starwars, 5)
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Leia~    150    49 brown      light      brown             19 female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Row 5 again, this time with filter() and row_number()
> filter(starwars, row_number() == 5)
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Leia~    150    49 brown      light      brown             19 female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # First five rows of the starwars dataset
> slice(starwars, 1:5)
# A tibble: 5 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Luke~    172    77 blond      fair       blue            19   male   Tatooine  Human   <chr~
2 C-3PO    167    75 NA         gold       yellow         112   NA     Tatooine  Droid   <chr~
3 R2-D2     96    32 NA         white, bl~ red             33   NA     Naboo     Droid   <chr~
4 Dart~    202   136 none       white      yellow          41.9 male   Tatooine  Human   <chr~
5 Leia~    150    49 brown      light      brown           19   female Alderaan  Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Last six rows of the starwars dataset
> tail(starwars)
# A tibble: 6 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Finn      NA    NA black      dark       dark              NA male   NA        Human   <chr~
2 Rey       NA    NA brown      light      hazel             NA female NA        Human   <chr~
3 Poe ~     NA    NA brown      light      brown             NA male   NA        Human   <chr~
4 BB8       NA    NA none       none       black             NA none   NA        Droid   <chr~
5 Capt~     NA    NA unknown    unknown    unknown           NA female NA        NA      <chr~
6 Padm~    165    45 brown      light      brown             46 female Naboo     Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
> # Last row of the starwars dataset (n() is the number of rows)
> slice(starwars, n())
# A tibble: 1 x 13
  name  height  mass hair_color skin_color eye_color birth_year gender homeworld species films
  <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>     <chr>   <lis>
1 Padm~    165    45 brown      light      brown             46 female Naboo     Human   <chr~
# ... with 2 more variables: vehicles <list>, starships <list>
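
The transcript above uses slice(starwars, n()) to take the last row only. As a minimal sketch of how the same positional helpers extend to the last five rows, and of a more explicit way to drop NA values, the example below uses a small made-up data frame (df and its values are hypothetical, not part of starwars):

```r
library(dplyr)

# Toy data frame standing in for starwars (hypothetical values)
df <- data.frame(id = 1:8, mass = c(77, 75, NA, 136, 49, NA, 84, 45))

# Last row only: n() is the number of rows
last_one <- slice(df, n())

# Last five rows, selected by position
last_five <- slice(df, (n() - 4):n())

# Base R's tail() selects the same rows
last_five_base <- tail(df, 5)

# Drop rows whose mass is NA, stating the intent explicitly
no_na <- filter(df, !is.na(mass))
```

Filtering on mass > 0 also removes NA rows, because filter() drops rows where the condition evaluates to NA; !is.na(mass) simply makes that intent visible.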

The dataset used throughout is starwars.

A tibble with 87 rows and 13 variables:

name
Name of the character

height
Height (cm)

mass
Weight (kg)

hair_color,skin_color,eye_color
Hair, skin, and eye colors

birth_year
Year born (BBY = Before Battle of Yavin)

gender
male, female, hermaphrodite, or none.

homeworld
Name of homeworld

species
Name of species

films
List of films the character appeared in

vehicles
List of vehicles the character has piloted

starships
List of starships the character has piloted

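This post covers a common data frame task: how to remove duplicate values.

The goal is to keep only the first observation that appears for each gender (that is, to drop rows whose gender has already been seen).

Method 1: the match function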

> k <- match(unique(starwars$gender), starwars$gender)
> starwars[k,c('name','gender','skin_color', 'height', 'mass')]
# A tibble: 5 x 5
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 C-3PO                 NA            gold                167    75
3 Leia Organa           female        light               150    49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
5 IG-88                 none          metal               200   140

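match finds the position of the first row for each unique gender in the dataset; those rows and the desired columns are then extracted by position.

Method 2: group_by and ungroup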
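
To make the indexing step concrete, here is a small sketch on a made-up vector (the values are hypothetical). It shows that match(unique(x), x) returns the position of the first occurrence of each value, NA included, and that base R's duplicated() yields the same positions:

```r
# Toy vector with repeats and missing values (hypothetical data)
gender <- c("male", NA, NA, "male", "female", "male", "female")

# unique() keeps values in order of first appearance, NA included
u <- unique(gender)

# match() returns, for each unique value, the index of its first occurrence
k <- match(u, gender)

# duplicated() flags every repeat, so negating it gives the same positions
k2 <- which(!duplicated(gender))
```

Here k and k2 are both 1, 2, 5, so either vector can be used to index the rows of a data frame.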

> starwars %>%
+   as_tibble %>%
+   select(name,gender, skin_color, height, mass) %>%
+   group_by(gender) %>%
+   filter(row_number(gender)==1) %>%
+   ungroup
# A tibble: 4 x 5
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 Leia Organa           female        light               150    49
3 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
4 IG-88                 none          metal               200   140

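as_tibble first converts the data frame to a tibble, select picks the columns of interest, group_by groups the data by gender, filter grabs the first row of each gender, and ungroup then removes the grouping. Note that row_number(gender) ranks the values of gender and returns NA wherever gender is NA, so the NA group is dropped here and the result has 4 rows instead of 5; calling row_number() with no argument would keep it.

Method 3: the summarize function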
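
The difference between row_number(gender) and row_number() is easy to miss, so the sketch below reproduces it on a made-up data frame (df and its values are hypothetical). It relies on two documented behaviours: row_number(x) gives NA ranks to NA inputs, and filter() drops rows whose condition is NA.

```r
library(dplyr)

# Toy data with a missing gender (hypothetical values)
df <- data.frame(
  name   = c("a", "b", "c", "d", "e"),
  gender = c("male", NA, "female", "male", NA),
  stringsAsFactors = FALSE
)

# Ranking gender: rows in the NA group get an NA rank and are filtered out
with_arg <- df %>%
  group_by(gender) %>%
  filter(row_number(gender) == 1) %>%
  ungroup()

# No argument: simply counts rows within each group, so the NA group survives
no_arg <- df %>%
  group_by(gender) %>%
  filter(row_number() == 1) %>%
  ungroup()
```

with_arg keeps rows a and c, while no_arg keeps a, b and c.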

> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     summarize(name = first(name), skin_color=first(skin_color), 
+               height=first( height), mass=first(mass))
# A tibble: 5 x 5
  gender        name                  skin_color       height  mass
  <chr>         <chr>                 <chr>             <int> <dbl>
1 female        Leia Organa           light               150    49
2 hermaphrodite Jabba Desilijic Tiure green-tan, brown    175  1358
3 male          Luke Skywalker        fair                172    77
4 none          IG-88                 metal               200   140
5 NA            C-3PO                 gold                167    75
>

summarize avoids the ungroup step, but it requires the user to spell out every variable that is not a group_by variable.

Method 4: distinct
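
One way to avoid listing every column by hand is sketched below on a made-up data frame (df and its values are hypothetical). It assumes dplyr 1.0.0 or later, where across() is available; slice(1) on a grouped data frame is shown as an alternative that keeps all columns without naming any of them.

```r
library(dplyr)

# Toy data (hypothetical values)
df <- data.frame(
  name   = c("a", "b", "c", "d"),
  gender = c("male", "female", "male", "female"),
  height = c(172, 150, 202, 165),
  stringsAsFactors = FALSE
)

# Apply first() to every non-grouping column
# (assumes dplyr >= 1.0.0 for across())
firsts <- df %>%
  group_by(gender) %>%
  summarise(across(everything(), first))

# Keep the first row of each group, all columns intact
firsts_slice <- df %>%
  group_by(gender) %>%
  slice(1) %>%
  ungroup()
```

As with the summarize call above, the summarise result is sorted by the grouping variable rather than by order of first appearance.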

> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     distinct(gender,.keep_all = T)
# A tibble: 5 x 5
# Groups:   gender [5]
  name                  gender        skin_color       height  mass
  <chr>                 <chr>         <chr>             <int> <dbl>
1 Luke Skywalker        male          fair                172    77
2 C-3PO                 NA            gold                167    75
3 Leia Organa           female        light               150    49
4 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
5 IG-88                 none          metal               200   140
>
> # Remove duplicate rows of the dataframe using skin_color and gender
> starwars %>%
+     as_tibble %>%
+     select(name,gender, skin_color, height, mass) %>%
+     group_by(gender) %>%
+     distinct(skin_color, gender, .keep_all = T)
# A tibble: 39 x 5
# Groups:   gender [5]
   name                  gender        skin_color       height  mass
   <chr>                 <chr>         <chr>             <int> <dbl>
 1 Luke Skywalker        male          fair                172    77
 2 C-3PO                 NA            gold                167    75
 3 R2-D2                 NA            white, blue          96    32
 4 Darth Vader           male          white               202   136
 5 Leia Organa           female        light               150    49
 6 Owen Lars             male          light               178   120
 7 R5-D4                 NA            white, red           97    32
 8 Chewbacca             male          unknown             228   112
 9 Greedo                male          green               173    74
10 Jabba Desilijic Tiure hermaphrodite green-tan, brown    175  1358
# ... with 29 more rows

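distinct looks much better: clean, short, and easy to understand. Rather than grabbing the first row of each group, it has to search for and exclude duplicates. The .keep_all argument keeps all the other variables in the output data frame.

Comparing the speed of the different methods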
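
The calls above pipe through group_by(gender) first, which is why their output is still marked as grouped. A minimal sketch on a made-up data frame (df and its values are hypothetical) suggests that distinct can be called directly, since the deduplication columns are passed to it as arguments:

```r
library(dplyr)

# Toy data with repeated and missing genders (hypothetical values)
df <- data.frame(
  name       = c("a", "b", "c", "d", "e"),
  gender     = c("male", NA, "male", "female", NA),
  skin_color = c("fair", "gold", "fair", "light", "white"),
  stringsAsFactors = FALSE
)

# One row per gender; .keep_all = TRUE retains the other columns
by_gender <- distinct(df, gender, .keep_all = TRUE)

# One row per (gender, skin_color) combination
by_both <- distinct(df, gender, skin_color, .keep_all = TRUE)

# Without .keep_all, only the named columns are returned
only_gender <- distinct(df, gender)
```

by_gender keeps rows a, b and d (the first row seen for each gender, NA included), and by_both additionally keeps e because its skin_color differs from b's.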


library(tidyverse)

d1 <- function()
{
  k <- match(unique(starwars$gender), starwars$gender)
  starwars[k,c('name','gender','skin_color', 'height', 'mass')]
}


d2 <- function()
{
  
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    filter(row_number(gender)==1) %>%
    ungroup
  
}


d3 <- function()
{
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    summarize(name = first(name), skin_color=first(skin_color), 
              height=first( height), mass=first(mass))
  
}


d4 <- function()
{
  
  
  starwars %>%
    as_tibble %>%
    select(name,gender, skin_color, height, mass) %>%
    group_by(gender) %>%
    distinct(gender,.keep_all = T)
  
}

library(microbenchmark)
set.seed(1234)
microbenchmark(d1(), d2(), d3(), d4(), times=9)
Unit: microseconds
 expr      min       lq      mean   median       uq       max neval
 d1()   74.668   84.870  105.6366   88.580  131.710   140.522     9
 d2() 5478.496 5563.829 5808.0292 5735.888 5974.264  6379.598     9
 d3() 4710.960 4761.510 5062.5474 4856.583 4876.989  7026.091     9
 d4() 6099.018 6241.395 9503.2321 6422.265 6641.627 32286.160     9

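The results show that d1, the match approach, is very fast: its median of about 89 microseconds is roughly 55 to 70 times lower than any of the tidyverse versions. Among the tidyverse approaches, d3 with summarize has the lowest median (about 4.9 ms, against 5.7 ms for d2 and 6.4 ms for d4). Keep in mind that d2 returns one row fewer than the others because it drops the NA group, and that 9 repetitions is a small sample.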
