[16] R for Data Science: Selecting Columns with select()

2020-11-03 灰常不错

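Narrowing a wide dataset down to the variables you actually need is often the first challenge. The select() function lets you quickly pull out a useful subset of columns by name.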

Select columns by name

library(dplyr)
library(nycflights13)
select(flights, year, month, day)
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows

Select all columns from "year" to "day", inclusive

select(flights, year:day)
# A tibble: 336,776 x 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ... with 336,766 more rows

Select every column except those from "year" to "day" ("year" and "day" themselves are dropped as well)

select(flights, -(year:day))
# A tibble: 336,776 x 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int>
 1      517            515         2      830            819        11 UA        1545
 2      533            529         4      850            830        20 UA        1714
 3      542            540         2      923            850        33 AA        1141
 4      544            545        -1     1004           1022       -18 B6         725
 5      554            600        -6      812            837       -25 DL         461
 6      554            558        -4      740            728        12 UA        1696
 7      555            600        -5      913            854        19 B6         507
 8      557            600        -3      709            723       -14 EV        5708
 9      557            600        -3      838            846        -8 B6          79
10      558            600        -2      753            745         8 AA         301
# ... with 336,766 more rows, and 8 more variables: tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>

Several helper functions can also be used inside select()

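starts_with("abc"): matches names that begin with "abc".
ends_with("xyz"): matches names that end with "xyz".
contains("ijk"): matches names that contain "ijk".
matches("(.)\\1"): selects variables whose names match a regular expression. This particular pattern matches any name containing a repeated character.
num_range("x", 1:3): matches x1, x2 and x3.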
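
The helpers above are easier to compare side by side on a small table. The following is a minimal sketch on a hypothetical toy tibble (the data frame `df` and its column names are made up for illustration and are not part of nycflights13). It only shows which column names each helper picks up.

```r
library(dplyr)

# Hypothetical toy data; the column names exist only to exercise each helper.
df <- tibble(
  x1 = 1, x2 = 2, x3 = 3,
  dep_time = 517, arr_time = 830,
  dep_delay = 2, arr_delay = 11,
  carrier = "UA"
)

by_prefix <- names(select(df, starts_with("dep")))   # dep_time, dep_delay
by_suffix <- names(select(df, ends_with("delay")))   # dep_delay, arr_delay
by_substr <- names(select(df, contains("time")))     # dep_time, arr_time
by_regex  <- names(select(df, matches("(.)\\1")))    # arr_time, arr_delay, carrier (all contain "rr")
by_range  <- names(select(df, num_range("x", 1:3)))  # x1, x2, x3
```

Note that the regular expression also catches arr_time and arr_delay because of the doubled "r", which is an easy thing to overlook when reading the pattern.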

Renaming variables

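select() can rename variables, but it is rarely used for that purpose because it drops every variable that is not explicitly mentioned. To rename a variable, use rename() instead, a variant of select() that keeps all the variables not explicitly mentioned.
For example: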

rename(flights, tail_num = tailnum)
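
To make the difference concrete, here is a small sketch that applies the same renaming with both functions. The three-column tibble `df` is a hypothetical stand-in for flights, used only to keep the output short.

```r
library(dplyr)

# Hypothetical three-column stand-in for flights.
df <- tibble(year = 2013, tailnum = "N14228", carrier = "UA")

# select() renames the column but keeps only what is mentioned.
renamed_with_select <- select(df, tail_num = tailnum)

# rename() renames the column in place and keeps everything else.
renamed_with_rename <- rename(df, tail_num = tailnum)

names(renamed_with_select)  # tail_num
names(renamed_with_rename)  # year, tail_num, carrier
```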

Reordering columns

To move a few columns to the front while keeping all the others, add the everything() helper.
The following call moves time_hour and air_time to the front and keeps every other column after them:

select(flights, time_hour, air_time, everything())
# A tibble: 336,776 x 19
   time_hour           air_time  year month   day dep_time sched_dep_time dep_delay
   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>     <dbl>
 1 2013-01-01 05:00:00      227  2013     1     1      517            515         2
 2 2013-01-01 05:00:00      227  2013     1     1      533            529         4
 3 2013-01-01 05:00:00      160  2013     1     1      542            540         2
 4 2013-01-01 05:00:00      183  2013     1     1      544            545        -1
 5 2013-01-01 06:00:00      116  2013     1     1      554            600        -6
 6 2013-01-01 05:00:00      150  2013     1     1      554            558        -4
 7 2013-01-01 06:00:00      158  2013     1     1      555            600        -5
 8 2013-01-01 06:00:00       53  2013     1     1      557            600        -3
 9 2013-01-01 06:00:00      140  2013     1     1      557            600        -3
10 2013-01-01 06:00:00      138  2013     1     1      558            600        -2
# ... with 336,766 more rows, and 11 more variables: arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>