[R语言] dpylr

2020-04-02 本文已影响0人半为花间酒

《R for Data Science》第五章 Data transformation 啃书知识点积累

参考书籍

《R for data science》

《R数据科学》

核心包

library(nycflights13)
library(tidyverse)

分析nycflights13::flights的缺失情况

dim(flights %>% 
  filter(!is.na(dep_time)))
# [1] 328521     19
dim(flights %>% 
  filter(!is.na(arr_time)))
# [1] 328063     19
dim(flights %>% 
  filter(!is.na(arr_delay)))
# [1] 327346     19

结论：
成功出发的航班 328521个
有到达延误数据的航班 327346个
成功到达的航班 328063个

分析：
有 458个航班出发但未到达
有 717个航班的到达延误数据丢失

操作：
重新修复717个数据，形成新数据框

# 根据多个条件获取arr_time和sched_arr_time相对于出发当天0点的绝对分钟数
new_flights <- flights %>% 
  mutate(
    arr_time_real = case_when(
      (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100 + dep_delay > 1439) | 
        (arr_time < dep_time) ~ arr_time %/% 100 * 60 + arr_time %% 100 + 1440,
      !((sched_dep_time %/% 100 * 60 + sched_dep_time %% 100 + dep_delay > 1439) | 
          (arr_time < dep_time)) ~ arr_time %/% 100 * 60 + arr_time %% 100),
    arr_time_sched = case_when(
      sched_arr_time < sched_dep_time ~ 
        sched_arr_time %/% 100 * 60 + sched_arr_time %% 100 + 1440,
      sched_arr_time >= sched_dep_time ~
        sched_arr_time %/% 100 * 60 + sched_arr_time %% 100),
    arr_delay_real = arr_time_real - arr_time_sched)

# 用两个代码检验新的数据集
# 检查修复后存在到达延误数据的航班数
new_flights %>% 
  filter(!is.na(arr_delay_real))
# A tibble: 328,063 x 22

# 检验新求出的差值和原本存在的数据是否有差异，如果有差异说明该计算方法有错
new_flights %>% 
  filter(arr_delay_real == arr_delay)
# A tibble: 327,346 x 22

dplyr五个重要方法：

Pick observations by their values (filter()).

Reorder the rows (arrange()).

Pick variables by their names (select()).

Create new variables with functions of existing variables (mutate()).

Collapse many values down to a single summary (summarise()).

1. filter()

涉及无理小数的二进制存储方式，导致不精确

1 / 49 * 49 == 1
#> [1] FALSE

# 可以用near()解决
near(1 / 49 * 49, 1)
#> [1] TRUE

异或 xor(a, b)

异或 xor(a, b) 相异为真

Q: Departed in summer (July, August, and September)

flights %>%
  filter(month %in% c(7:9))
# A tibble: 86,326 x 19

De Morgan’s law

!(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y

filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

缺失值

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values

df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     3

其他缺失值相关的补充

NA ^ 0
#> 1
NA | TRUE
#> TRUE
FALSE & NA
#> FALSE

if the the number is Inf, then the result of the multiplication will be NaN

NA * 0
#> NA

缺失值NA的默认类型是逻辑型

Q: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

# between()表示一个变量落在两数之间，两边闭合区间

dplyr::filter(flights, dep_time >= 0 & dep_time <= 600)
# 等价于
dplyr::filter(flights, between(dep_time, 0, 600))

2. arrange()

缺失值

Missing values are always sorted at the end

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     2
#> 2     5
#> 3    NA
arrange(df, desc(x))
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     5
#> 2     2
#> 3    NA

Q: How could you use arrange() to sort all missing values to the start?

dplyr::arrange(flights, desc(is.na(x), x))

# 如下
dplyr::arrange(flights, desc(is.na(dep_time), dep_time))

关于索引重置

dplyr::arrange(iris,iris$Sepal.Length) # 结果行索引重置
iris[order(iris$Sepal.Length),] # 结果行索引不重置

3. select()

selec中可用的实用函数

starts_with("abc"): matches names that begin with “abc”.

ends_with("xyz"): matches names that end with “xyz”.

contains("ijk"): matches names that contain “ijk”.

num_range("x", 1:3): matches x1, x2 and x3.

matches("(.)\1"): selects variables that match a regular expression.

rename函数可不删除其他未提及变量

rename(flights, tail_num = tailnum)
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 3.368e+05 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

everything()移动列

select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 x 19
#>   time_hour           air_time  year month   day dep_time sched_dep_time
#>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
#> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
#> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
#> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
#> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
#> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
#> # … with 3.368e+05 more rows, and 12 more variables: dep_delay <dbl>,
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>

很实用，如果要把某一列移动到中间位置：

new_flights <-select(new_flights,year:arr_delay,
                     arr_delay_real,
                     everything())

Q: What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c(
    "year", "month", "day", "dep_delay", "arr_delay",'no_exist')
dplyr::select(flights, one_of(vars))

# one_of()可以筛选向量中存在的列，如果全部存在则等价于
dplyr::select(flights, vars)

# 但如果有一列不存在则上述写法会报错，one_of()不会，只会出现警告

Q: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

contains()本身不区分大小写

# 如果需要区分大小写可以在辅助函数内部加上ignore.case = FALSE
dplyr::select(flights, contains("TIME", ignore.case = FALSE))

4. mutate()

Note that you can refer to columns that you’ve just created

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
#> # A tibble: 336,776 x 10
#>    year month   day dep_delay arr_delay distance air_time  gain hours
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1  2013     1     1         2        11     1400      227    -9  3.78
#> 2  2013     1     1         4        20     1416      227   -16  3.78
#> 3  2013     1     1         2        33     1089      160   -31  2.67
#> 4  2013     1     1        -1       -18     1576      183    17  3.05
#> 5  2013     1     1        -6       -25      762      116    19  1.93
#> 6  2013     1     1        -4        12      719      150   -16  2.5 
#> # … with 3.368e+05 more rows, and 1 more variable: gain_per_hour <dbl>

If you only want to keep the new variables, use transmute()

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
#> # A tibble: 336,776 x 3
#>    gain hours gain_per_hour
#>   <dbl> <dbl>         <dbl>
#> 1    -9  3.78         -2.38
#> 2   -16  3.78         -4.23
#> 3   -31  2.67        -11.6 
#> 4    17  3.05          5.57
#> 5    19  1.93          9.83
#> 6   -16  2.5          -6.4 
#> # … with 3.368e+05 more rows

Modular arithmetic

%/% (integer division) and %% (remainder)
where x == y * (x %/% y) + (x %% y)

Offsets 函数

compute running differences (e.g. x - lag(x))

find when values change (x != lag(x))

(x <- 1:10)
#>  [1]  1  2  3  4  5  6  7  8  9 10
lag(x)
#>  [1] NA  1  2  3  4  5  6  7  8  9
lead(x)
#>  [1]  2  3  4  5  6  7  8  9 10 NA

Cumulative and rolling aggregates 累积和窗口函数

If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

Ranking 排秩

y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1]  1  2  2 NA  4  5
min_rank(desc(y))
#> [1]  5  3  3 NA  2  1

row_number 通用排名，并列的名次结果按先后顺序不一样，靠前出现的元素排名在前

min_rank 通用排名，并列的名次结果一样，占用下一名次。

dense_rank 中国式排名，并列排名不占用名次

percent_rank 按百分比的排名

cume_dist 累计分布区间的排名

ntile 粗略地把向量按堆排名，n即是堆的数量

row_number(y)
#> [1]  1  2  3 NA  4  5
dense_rank(y)
#> [1]  1  2  2 NA  3  4
percent_rank(y)
#> [1] 0.00 0.25 0.25   NA 0.75 1.00
cume_dist(y)
#> [1] 0.2 0.6 0.6  NA 0.8 1.0

5. summarize() + group_by()

忽略NA值
NA具有传染性，summarize需要忽略NA

flights %>% 
  group_by( year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))

Fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation

Summary functions

Measures of spread:sd(x), IQR(x), mad(x)
The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread.
The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

Measures of rank: min(x), quantile(x, 0.25), max(x)
Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

y <- c(1, 3, 4, 3, 77, 100)
quantile(y, 0.25)
#> 25% 
#> 3 
quantile(y, 0.5)
#> 50% 
#> 3.5

Measures of position: first(x), nth(x, 2), last(x)
These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist.

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )
#> # A tibble: 365 x 5
#> # Groups:   year, month [12]
#>    year month   day first_dep last_dep
#>   <int> <int> <int>     <int>    <int>
#> 1  2013     1     1       517     2356
#> 2  2013     1     2        42     2354
#> 3  2013     1     3        32     2349
#> 4  2013     1     4        25     2358
#> 5  2013     1     5        14     2357
#> 6  2013     1     6        16     2355
#> # … with 359 more rows

nth可以查找，也可以设置默认值

y <- c(1, 3, 4, 3, 77, 100)
nth(y, 3)
#> 4
nth(y, 8)
#> NA
nth(y, 8, default=10)
#> 10

range(x) 一定要和python迭代器区别
range(x) 返回两个极值

> y <- c(33, 65, 22, 34, 44, 77, 14)
> range(y)
#> [1] 14 77

排秩后取极值跟直接取极值的方法比较

not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r))
#> # A tibble: 770 x 20
#> # Groups:   year, month, day [365]
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1     2356           2359        -3      425            437
#> 3  2013     1     2       42           2359        43      518            442
#> 4  2013     1     2     2354           2359        -5      413            437
#> 5  2013     1     3       32           2359        33      504            442
#> 6  2013     1     3     2349           2359       -10      434            445
#> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>


not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )
#> # A tibble: 365 x 5
#> # Groups:   year, month [12]
#>    year month   day first_dep last_dep
#>   <int> <int> <int>     <int>    <int>
#> 1  2013     1     1       517     2356
#> 2  2013     1     2        42     2354
#> 3  2013     1     3        32     2349
#> 4  2013     1     4        25     2358
#> 5  2013     1     5        14     2357
#> 6  2013     1     6        16     2355
#> # … with 359 more rows

Counts

To count the number of distinct (unique) values, use n_distinct(x)

# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))
#> # A tibble: 104 x 2
#>   dest  carriers
#>   <chr>    <int>
#> 1 ATL          7
#> 2 BOS          7
#> 3 CLT          7
#> 4 ORD          7
#> 5 TPA          7
#> 6 AUS          6
#> # … with 98 more rows

count()中可以加上权重wt，等价于group_by+summariz(sum)

new_flights %>% 
  filter(!is.na(arr_delay_real)) %>% 
  count(tailnum, wt=distance)

new_flights %>% 
  filter(!is.na(arr_delay_real)) %>% 
  group_by(tailnum) %>% 
  summarise(sum_distance = sum(distance))

# 二者结果一致，但count+权重的形成列名是n

Counts and proportions of logical values: sum(x > 10), mean(y == 0).
When used with numeric functions, TRUE is converted to 1 and FALSE to 0.
sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

Ungrouping

daily <- group_by(flights, year, month, day)

daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights
#> # A tibble: 1 x 1
#>   flights
#>     <int>
#> 1  336776

Q: What does the sort argument to count() do

# count()中sort参数默认是FALSE，即不排序，如果设置为TRUE则会按计数从大到小排序

new_flights %>% 
  filter(!is.na(dep_time)) %>% 
  count(dest, sort=F)

Q: For each plane, count the number of flights before the first delay of greater than 1 hour.

Tailnum <- new_flights %>% 
  group_by(tailnum) %>% 
  count() %>% 
  select(tailnum)

results <- new_flights %>% 
  filter(!is.na(arr_delay_real)) %>% 
  group_by(tailnum) %>% 
  arrange(year, month, day, hour, minute) %>% 
  mutate(cum_delay = arr_delay_real > 60,
         cumsum_delay = cumsum(cum_delay)) %>% 
  # 各组出现延误超过1h后累加就会变成1
  filter(cumsum_delay == 0) %>%
  count(tailnum) %>% 
  right_join(Tailnum, by='tailnum') 
  
# 将NA转换为0
results[is.na(results$n),2] <-0 ;results

6. Grouped mutates

分组过滤器filter()内排秩

# Find the worst members of each group

flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 x 7
#> # Groups:   year, month, day [365]
#>    year month   day dep_delay arr_delay distance air_time
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  2013     1     1       853       851      184       41
#> 2  2013     1     1       290       338     1134      213
#> 3  2013     1     1       260       263      266       46
#> 4  2013     1     1       157       174      213       60
#> 5  2013     1     1       216       222      708      121
#> 6  2013     1     1       255       250      589      115
#> # … with 3,300 more rows

7. 其他用法

join() 连表

b <- data.frame(x1=c("A","B","D"),x3=c(T,F,T))
a <- data.frame(x1=c("A","B","C"),x2=c(1,2,3))
dplyr::inner_join(a,b,by="x1") # 内连接，交集
dplyr::full_join(a,b,by="x1") # 全连接，并集
dplyr::left_join(a,b,by="x1")
dplyr::right_join(a,b,by="x1")

> b
#>   x1    x3
#> 1  A  TRUE
#> 2  B FALSE
#> 3  D  TRUE
> a
#>   x1 x2
#> 1  A  1
#> 2  B  2
#> 3  C  3

# 半连接，交集后保留左边的列
dplyr::semi_join(a,b,by="x1") 
#>   x1 x2
#> 1  A  1
#> 2  B  2
#> Warning message:
#> Column `x1` joining factors with different levels, coercing to character vector 

# 补连接，左表扣除半连接
dplyr::anti_join(a,b,by="x1")
#>   x1 x2
#> 1  C  3
#> Warning message:
#> Column `x1` joining factors with different levels, coercing to character vector

去重和抽取

# 类似unique的用法
dplyr::distinct(rbind(iris[1:10,],iris[1:15,])) 

# 取出10-15行
dplyr::slice(iris,10:15) 
# 随机抽取10行
dplyr::sample_n(iris,10) 
# 按比例随机抽取
dplyr::sample_frac(iris,0.2)