R语言R for Data Science

[R语言] tidyr包 数据整理《R for data sci

2020-04-18  本文已影响0人  半为花间酒

《R for Data Science》第十二章 Tidy data 啃书知识点积累
参考链接:R for Data Science

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Tidy data

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
  1. Put each dataset in a tibble.
  2. Put each variable in a column.
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

- Exercises

(两种类型table均复原为table1)

table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

t2_cases <- table2 %>% 
  filter(type == "cases") %>%
  rename(cases = count) %>%
  arrange(country, year)

t2_population <- table2 %>% 
  filter(type == "population") %>%
  rename(population = count) %>%
  arrange(country, year)

(t2_cases_per_cap <- tibble(
  country = t2_cases$country,
  year = t2_cases$year,
  cases = t2_cases$cases,
  population = t2_population$population) %>%
    mutate(cases_per_cap = (cases / population) * 10000))
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

(table4c <-
  tibble(
    country = table4a$country,
    `1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
    `2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
  ))
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan  0.373   1.29
#> 2 Brazil       2.19    4.61
#> 3 China        1.67    1.67

# 也可以将table4a和table4b还原成table1再处理
# 先利用比较传统的方法,下文有更好的工具:pivot_longer()

t4a_1 <- table4a %>% 
  select(1:2) %>% 
  rename(cases = `1999`) %>% 
  mutate(year = 1999)

t4a_2 <- table4a %>% 
  select(c(1,3)) %>% 
  rename(cases = `2000`) %>% 
  mutate(year = 2000)

t4a <- rbind(t4a_1,t4a_2) %>% 
  arrange(country,year)

t4b_1 <- table4b %>% 
  select(1:2) %>% 
  rename(population = `1999`) %>% 
  mutate(year = 1999)

t4b_2 <- table4b %>% 
  select(c(1,3)) %>% 
  rename(population = `2000`) %>% 
  mutate(year = 2000)

t4b <- rbind(t4b_1,t4b_2) %>% 
  arrange(country,year)

(t4_cases_per <- tibble(
  country = t4a$country,
  year = t4a$year,
  cases = t4a$cases,
  population = t4b$population) %>%
    mutate(cases_per_cap = (cases / population) * 10000))

Pivoting

- pivot_longer()

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

简而言之就是一行有多个观测值

# 适用类型
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

Take table4a:
the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.

解决策略:

  1. The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
  2. The name of the variable to move the column names to. Here it is year.
  3. The name of the variable to move the column values to. Here it’s cases.
table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

year and cases do not exist in table4a so we put their names in quotes.

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# 左连接两表
dplyr::left_join(tidy4a, tidy4b)
#> Joining, by = c("country", "year")
#> # A tibble: 6 x 4
#>   country     year   cases population
#>   <chr>       <chr>  <int>      <int>
#> 1 Afghanistan 1999     745   19987071
#> 2 Afghanistan 2000    2666   20595360
#> 3 Brazil      1999   37737  172006362
#> 4 Brazil      2000   80488  174504898
#> 5 China       1999  212258 1272915272
#> 6 China       2000  213766 1280428583
# 附另一个练习
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

preg %>% 
  pivot_longer(c(male,female),names_to = 'sex',values_to = 'count')

- pivot_wider()

Take table2:
an observation is a country in a year, but each observation is spread across two rows.

简而言之就是一列有多个变量

解决策略:

  1. The column to take variable names from. Here, it’s type.
  2. The column to take values from. Here it’s count.
# 适用类型
table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table2 %>%
    pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
  1. pivot_longer() makes wide tables narrower and longer
  2. pivot_wider() makes long tables shorter and wider

- 和gather spread的对比

table4a
## A tibble: 3 x 3
#    country     `1999` `2000`
# * <chr>        <int>  <int>
# 1 Afghanistan    745   2666
# 2 Brazil       37737  80488
# 3 China       212258 213766

table4a %>% 
  pivot_longer(c(`1999`,`2000`),names_to = 'year',values_to = 'value')
# 等价于
table4a %>% 
  gather(c(`1999`,`2000`),key = 'year',value = 'value')
table2
# # A tibble: 12 x 4
#   country      year type            count
#   <chr>       <int> <chr>           <int>
# 1 Afghanistan  1999 cases             745
# 2 Afghanistan  1999 population   19987071
# 3 Afghanistan  2000 cases            2666
# 4 Afghanistan  2000 population   20595360
# 5 Brazil       1999 cases           37737
# 6 Brazil       1999 population  172006362
# 7 Brazil       2000 cases           80488
# 8 Brazil       2000 population  174504898
# 9 China        1999 cases          212258
# 10 China        1999 population 1272915272
# 11 China        2000 cases          213766
# 12 China        2000 population 1280428583

table2 %>% 
  pivot_wider(names_from = type,values_from = count)
# 等价于
table2 %>% 
  spread(type,count)
# # A tibble: 6 x 4
#   country      year  cases population
#   <chr>       <int>  <int>      <int>
# 1 Afghanistan  1999    745   19987071
# 2 Afghanistan  2000   2666   20595360
# 3 Brazil       1999  37737  172006362
# 4 Brazil       2000  80488  174504898
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583

- Exercises

问题的关键是列属性发生了丢失

可以再gather或者pivot_longer中指定参数

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(c(`2015`,`2016`), names_to = "year", values_to = "return",
               names_ptype = list(year = double()))

stocks %>%
  spread(key = "year", value = "return") %>%
  gather(c(`2015`,`2016`), key = "year", value = "return", convert = TRUE)
people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

# 添加一列以区别各行
people <- people %>% 
  group_by(name,names) %>% 
  mutate(num = row_number())

# 即可区分
people %>% 
  pivot_wider(names_from = names, values_from = values)

Separating and uniting

- Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.
By default, separate() will split values wherever it sees a non-alphanumeric character

table3
# # A tibble: 6 x 3
#   country      year rate             
# * <chr>       <int> <chr>            
# 1 Afghanistan  1999 745/19987071     
# 2 Afghanistan  2000 2666/20595360    
# 3 Brazil       1999 37737/172006362  
# 4 Brazil       2000 80488/174504898  
# 5 China        1999 212258/1272915272
# 6 China        2000 213766/1280428583

table3 %>% 
  separate(rate, into = c("cases", "population"))
# # A tibble: 6 x 4
#   country      year cases  population
#   <chr>       <int> <chr>  <chr>     
# 1 Afghanistan  1999 745    19987071  
# 2 Afghanistan  2000 2666   20595360  
# 3 Brazil       1999 37737  172006362 
# 4 Brazil       2000 80488  174504898 
# 5 China        1999 212258 1272915272
# 6 China        2000 213766 1280428583

(1) sep 指定分隔符

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

若sep后跟的是数字则表示从第几个位置切开,右一是-1

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
#> # A tibble: 6 x 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

(2) convert 分割后解析成合适的类型
不设定则会保留chr类型

table5 <- table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

(3) extra 少数记录分隔符较多的情况

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'warn') # 默认值是drop
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f    
# 3 h     i     j

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'drop')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f    
# 3 h     i     j

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"),extra = 'merge')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     f,g  
# 3 h     i     j

(4) fill 少数记录分隔符较少的情况

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'warn') # 默认值是right
# # A tibble: 3 x 3
#   one   two   three
#  <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     NA   
# 3 f     g     i 

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'left')
# # A tibble: 3 x 3
#   one   two   three
#   <chr> <chr> <chr>
# 1 a     b     c    
# 2 NA    d     e    
# 3 f     g     i 

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = 'right')
# # A tibble: 3 x 3
#   one   two   three
#  <chr> <chr> <chr>
# 1 a     b     c    
# 2 d     e     NA   
# 3 f     g     i 

(5) remove 选择是否保留原列,unite中也可以设置

table3 %>% 
  separate(rate, into = c("cases", "population"),remove = F)
# # A tibble: 6 x 5
#   country      year rate              cases  population
#   <chr>       <int> <chr>             <chr>  <chr>     
# 1 Afghanistan  1999 745/19987071      745    19987071  
# 2 Afghanistan  2000 2666/20595360     2666   20595360  
# 3 Brazil       1999 37737/172006362   37737  172006362 
# 4 Brazil       2000 80488/174504898   80488  174504898 
# 5 China        1999 212258/1272915272 212258 1272915272
# 6 China        2000 213766/1280428583 213766 1280428583

table5 %>% 
  unite(new, century, year,remove = F)
#> # A tibble: 6 x 3
#     country     new   century year  rate             
#     <chr>       <chr> <chr>   <chr> <chr>            
# 1 Afghanistan 19_99 19      99    745/19987071     
# 2 Afghanistan 20_00 20      00    2666/20595360    
# 3 Brazil      19_99 19      99    37737/172006362  
# 4 Brazil      20_00 20      00    80488/174504898  
# 5 China       19_99 19      99    212258/1272915272
# 6 China       20_00 20      00    213766/1280428583

- unite

unite() is the inverse of separate(): it combines multiple columns into a single column.

# 默认的连接符是'_'
table5 %>% 
  unite(new, century, year)
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 19_99 745/19987071     
#> 2 Afghanistan 20_00 2666/20595360    
#> 3 Brazil      19_99 37737/172006362  
#> 4 Brazil      20_00 80488/174504898  
#> 5 China       19_99 212258/1272915272
#> 6 China       20_00 213766/1280428583

默认的连接符是'_',也可用sep指定

table5 %>% 
  unite(new, century, year, sep = "")
#> # A tibble: 6 x 3
#>   country     new   rate             
#>   <chr>       <chr> <chr>            
#> 1 Afghanistan 1999  745/19987071     
#> 2 Afghanistan 2000  2666/20595360    
#> 3 Brazil      1999  37737/172006362  
#> 4 Brazil      2000  80488/174504898  
#> 5 China       1999  212258/1272915272
#> 6 China       2000  213766/1280428583

- extract

支持使用正则表达式,比separate更灵活

# example with separators
tibble(x = c("X_1", "X_2", "AA_1", "AA_2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])_([0-9])")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        2    
#> 3 A        1    
#> 4 A        2

# example with position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        2    
#> 3 Y        1    
#> 4 Y        2

# example that separate could not parse
# 尤其是这个例子,separate只能按分隔符或者按数字位置切割
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
#> # A tibble: 4 x 2
#>   variable id   
#>   <chr>    <chr>
#> 1 X        1    
#> 2 X        20   
#> 3 AA       11   
#> 4 AA       2

Missing values

A value can be missing in one of two possible ways:

  • Explicitly, i.e. flagged with NA.
  • Implicitly, i.e. simply not present in the data.

下面这句解释非常哲学又令人豁然开朗:

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
#> # A tibble: 6 x 3
#>     qtr year  return
#>   <dbl> <chr>  <dbl>
#> 1     1 2015    1.88
#> 2     2 2015    0.59
#> 3     2 2016    0.92
#> 4     3 2015    0.35
#> 5     3 2016    0.17
#> 6     4 2016    2.66
stocks %>% 
  complete(year, qtr)
# # A tibble: 8 x 3
#   year   qtr return
#   <dbl> <dbl>  <dbl>
# 1  2015     1   1.88
# 2  2015     2   0.59
# 3  2015     3   0.35
# 4  2015     4  NA   
# 5  2016     1  NA   
# 6  2016     2   0.92
# 7  2016     3   0.17
# 8  2016     4   2.66

可以用nesting确定组合便于暴露NA

df <- tibble(
  group = c(1:2, 1),
  item_id = c(1:2, 2),
  item_name = c("a", "b", "b"),
  value1 = 1:3,
  value2 = 4:6

df %>% 
  complete(group)
# # A tibble: 3 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       2 b              2      5

df %>% 
  complete(group,nesting(item_id, item_name))
# # A tibble: 4 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       1 a             NA     NA
# 4     2       2 b              2      5
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

treatment %>% 
  fill(person)  # 默认是向下填充,可以用.direction指定方向
#> # A tibble: 4 x 3
#>   person           treatment response
#>   <chr>                <dbl>    <dbl>
#> 1 Derrick Whitmore         1        7
#> 2 Derrick Whitmore         2       10
#> 3 Derrick Whitmore         3        9
#> 4 Katherine Burke          1        4

treatment %>% 
  fill(person,.direction = 'down') # 向下填充

treatment %>% 
  fill(person,.direction = 'up') # 向上填充

complete中也有fill参数,是list类型可以指定不同列的参数填补

df %>% 
  complete(group,nesting(item_id, item_name), fill=list(value1=2, value2=3))
# # A tibble: 4 x 5
#    group item_id item_name value1 value2
#    <dbl>   <dbl> <chr>      <int>  <int>
# 1     1       1 a              1      4
# 2     1       2 b              3      6
# 3     2       1 a              2      3
# 4     2       2 b              2      5

Case Study

用的是tidyr::who数据集

It contains redundant columns, odd variable codes, and many missing values.

  1. The best place to start is almost always to gather together the columns that are not variables.
who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
#> # A tibble: 76,046 x 6
#>   country     iso2  iso3   year key          cases
#>   <chr>       <chr> <chr> <int> <chr>        <int>
#> 1 Afghanistan AF    AFG    1997 new_sp_m014      0
#> 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
#> 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
#> 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
#> 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
#> 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
#> # … with 7.604e+04 more rows

含义解释如下:

who2 <- who1 %>% 
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))

# 这个地方书上写错了,应该拆分的是新形成的列否则who2没有意义
who3 <- who2 %>% 
  separate(names_from, c("new", "type", "sexage"), sep = "_")

who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)

# 完整步骤如下
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Non-tidy data

There are good reasons to use other structures;
Tidy data is not the only way.

上一篇下一篇

猜你喜欢

热点阅读