[r<-Data Analysis] Using dplyr (2): arrange and s

2018-03-23 王诗翔

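Arranging rows with arrange()

arrange() works similarly to filter(), except that instead of selecting rows it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns.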

arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
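
To see how extra columns behave when sorting on several columns at once, here is a minimal sketch on a tiny made-up tibble. The names toy, g and v are illustrative assumptions and are not part of flights.

```r
library(dplyr)

# Hypothetical toy data, not taken from the flights dataset
toy <- tibble(
  g = c(2, 1, 2, 1),
  v = c(1, 4, 3, 2)
)

# Rows are ordered by g first; v only decides the order
# among rows that share the same value of g
sorted <- arrange(toy, g, v)

sorted$g  # 1 1 2 2
sorted$v  # 2 4 1 3
```

Swapping the arguments to arrange(toy, v, g) would sort by v first, so the order of the column names matters.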

Use desc() to sort by a column in descending order:

arrange(flights, desc(arr_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     7    22     2257            759       898      121
##  9  2013    12     5      756           1700       896     1058
## 10  2013     5     3     1133           2055       878     1250
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
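
desc() applies only to the column it wraps, so ascending and descending keys can be mixed in one call. The following is a small sketch on made-up data; the tibble toy and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy data: two months, two delays each
toy <- tibble(
  month = c(1, 2, 1, 2),
  delay = c(10, 5, 30, 20)
)

# Months in ascending order; within each month, largest delay first
worst_first <- arrange(toy, month, desc(delay))

worst_first$month  # 1 1 2 2
worst_first$delay  # 30 10 20 5
```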

Missing values are always sorted at the end, even with desc():

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1  2.00
## 2  5.00
## 3 NA
arrange(df, desc(x))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1  5.00
## 2  2.00
## 3 NA

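Exercises

How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)
Sort flights to find the most delayed flights. Find the flights that left earliest.
Sort flights to find the fastest flights.
Which flights travelled the longest? Which travelled the shortest?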

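Selecting columns with select()

Raw datasets often come with a great many variables (columns), so the first challenge is usually narrowing them down to the ones you actually need. select() lets you quickly take a subset of a dataset by variable name.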

# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515      2.00      830            819     11.0 
##  2      533            529      4.00      850            830     20.0 
##  3      542            540      2.00      923            850     33.0 
##  4      544            545     -1.00     1004           1022    -18.0 
##  5      554            600     -6.00      812            837    -25.0 
##  6      554            558     -4.00      740            728     12.0 
##  7      555            600     -5.00      913            854     19.0 
##  8      557            600     -3.00      709            723    -14.0 
##  9      557            600     -3.00      838            846    - 8.00
## 10      558            600     -2.00      753            745      8.00
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

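There are a number of helper functions you can use within select():

starts_with("abc") matches names that begin with "abc".
ends_with("xyz") matches names that end with "xyz".
contains("ijk") matches names that contain "ijk".
matches("(.)\\1") selects variables that match a regular expression. This one matches any variable whose name contains a repeated character.
num_range("x", 1:3) matches x1, x2 and x3.

Run ?select for more details.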
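
The helpers above can be tried out on a small tibble. This is only a sketch: the one-row tibble toy and its column names are assumptions chosen so that every helper matches something.

```r
library(dplyr)

# Hypothetical one-row tibble; only the column names matter here
toy <- tibble(
  dep_time = 517, dep_delay = 2, arr_time = 830,
  x1 = 1, x2 = 2, x3 = 3, hour = 5
)

by_prefix <- names(select(toy, starts_with("dep")))   # "dep_time" "dep_delay"
by_suffix <- names(select(toy, ends_with("time")))    # "dep_time" "arr_time"
by_substr <- names(select(toy, contains("del")))      # "dep_delay"
by_regex  <- names(select(toy, matches("(.)\\1")))    # "arr_time" (the doubled "r")
by_range  <- names(select(toy, num_range("x", 1:3)))  # "x1" "x2" "x3"
```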

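select() can also be used to rename variables, but it is rarely useful for that because it drops every variable that is not explicitly mentioned. Instead, use rename(), a variant of select() that keeps all the variables that are not explicitly mentioned: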

rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515      2.00      830
##  2  2013     1     1      533            529      4.00      850
##  3  2013     1     1      542            540      2.00      923
##  4  2013     1     1      544            545     -1.00     1004
##  5  2013     1     1      554            600     -6.00      812
##  6  2013     1     1      554            558     -4.00      740
##  7  2013     1     1      555            600     -5.00      913
##  8  2013     1     1      557            600     -3.00      709
##  9  2013     1     1      557            600     -3.00      838
## 10  2013     1     1      558            600     -2.00      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
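
The difference between renaming with select() and with rename() is easiest to see by comparing the resulting column names. A minimal sketch on a made-up one-row tibble (toy is an assumption, with column names borrowed from flights):

```r
library(dplyr)

# Hypothetical one-row tibble with a few flights-style columns
toy <- tibble(tailnum = "N14228", carrier = "UA", flight = 1545)

# select() renames the column but keeps only what was mentioned
via_select <- names(select(toy, tail_num = tailnum))  # "tail_num"

# rename() renames the column and keeps everything else
via_rename <- names(rename(toy, tail_num = tailnum))  # "tail_num" "carrier" "flight"
```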

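Another option is to use select() together with the everything() helper. This is useful when you have a handful of variables you would like to move to the start (the left side) of the data frame.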

select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00    227    2013     1     1      517            515
##  2 2013-01-01 05:00:00    227    2013     1     1      533            529
##  3 2013-01-01 05:00:00    160    2013     1     1      542            540
##  4 2013-01-01 05:00:00    183    2013     1     1      544            545
##  5 2013-01-01 06:00:00    116    2013     1     1      554            600
##  6 2013-01-01 05:00:00    150    2013     1     1      554            558
##  7 2013-01-01 06:00:00    158    2013     1     1      555            600
##  8 2013-01-01 06:00:00     53.0  2013     1     1      557            600
##  9 2013-01-01 06:00:00    140    2013     1     1      557            600
## 10 2013-01-01 06:00:00    138    2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>
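
The same reordering can be checked on a small example where every column fits on screen. This is a sketch with a made-up tibble toy; everything() fills in the remaining columns in their original order without duplicating the ones already named.

```r
library(dplyr)

# Hypothetical one-row tibble
toy <- tibble(year = 2013, month = 1, day = 1, air_time = 227)

# air_time moves to the front; the other columns keep their order
reordered <- names(select(toy, air_time, everything()))
reordered  # "air_time" "year" "month" "day"
```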

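Exercises

Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time and arr_delay from flights.
What happens if you include the name of a variable multiple times in a select() call?
What does the one_of() function do? Why might it be helpful in conjunction with this vector?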

    var <- c(
    "year", "month", "day", "dep_delay", "arr_delay"
    )

Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

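I publish my notes and answers to the exercises only on my blog (if you want to learn, it is still best to work through them and think them over yourself).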
