R for Data Science (Notes) --- Data Transformation (s

2021-07-03 生信小鹏

R for Data Science

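This post continues from my earlier notes on the basic usage of select().

4. Extension 2

select() becomes much more powerful when it is combined with other helper functions.

Combining with last_col()

last_col() selects a column by counting back from the end. The default offset is 0, which means the last column.

First, a look at the iris dataset: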

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

It has 5 columns in total.

iris %>% select(last_col())
#> # A tibble: 150 x 1
#>   Species
#>   <fct>  
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows

As you can see, with no argument it selects the last column.

iris %>% select(3:last_col(1)) %>% head()
  Petal.Length Petal.Width
1          1.4         0.2
2          1.4         0.2
3          1.3         0.2
4          1.5         0.2
5          1.4         0.2
6          1.7         0.4

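select(3:last_col(1)) selects from the third column through the second-to-last column. The same approach extends to other ranges counted from the end, such as from the second-to-last column to the fourth-to-last column.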
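To make the offset counting concrete, here is a minimal sketch on iris. The variable names second_to_last and middle_cols are mine, chosen only for illustration.

```r
library(dplyr)

# last_col(offset) counts back from the end: offset 0 is the last column,
# offset 1 is the second-to-last, and so on.
second_to_last <- iris %>% select(last_col(1))

# From the fourth-to-last column through the second-to-last column.
# iris has 5 columns, so these are columns 2 to 4.
middle_cols <- iris %>% select(last_col(3):last_col(1))

names(second_to_last)
names(middle_cols)
```

Writing the range from the larger offset to the smaller one keeps the columns in their original left-to-right order.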

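Combining with everything()

I usually use everything() to reorder columns.

For example, I might want to move a few columns of a data frame (say the 3rd, 6th, and 8th) to the front so they are easier to inspect, leaving the rest in their original order.
With the flights data from before, moving time_hour and air_time to the front looks like this: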

select(flights, time_hour, air_time, everything())
#> # A tibble: 336,776 x 19
#>   time_hour           air_time  year month   day dep_time sched_dep_time
#>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
#> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
#> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
#> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
#> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
#> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
#> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
#> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>
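The "columns 3, 6, and 8" case can also be written with positions directly. This is a small sketch using the built-in mtcars data (my choice, since it has 11 columns and needs no extra package). It also shows relocate(), which should give the same ordering in dplyr 1.0.0 or later.

```r
library(dplyr)

# Move columns 3, 6 and 8 to the front; everything() fills in the
# remaining columns in their original order.
by_position <- mtcars %>% select(3, 6, 8, everything())

# relocate() moves the chosen columns to the front by default
# and keeps all the others.
by_relocate <- mtcars %>% relocate(3, 6, 8)

names(by_position)
```

Selecting by name is usually safer than by position, because positions shift whenever columns are added or removed upstream.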

Combining with starts_with()

iris %>% select(starts_with("Sepal"))
#> # A tibble: 150 x 2
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1          5.1         3.5
#> 2          4.9         3  
#> 3          4.7         3.2
#> 4          4.6         3.1
#> # ... with 146 more rows

Combining with ends_with()

iris %>% select(ends_with("Width"))
#> # A tibble: 150 x 2
#>   Sepal.Width Petal.Width
#>         <dbl>       <dbl>
#> 1         3.5         0.2
#> 2         3           0.2
#> 3         3.2         0.2
#> 4         3.1         0.2
#> # ... with 146 more rows

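A tip for these two functions: the argument must be a character string, so it has to be quoted. Without the quotes the call will not run. Here is an example from my own data.

The column names are TCGA barcodes, 15 characters long.

The goal is to keep only the columns whose 14th and 15th characters form a code below 11.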

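# Correct: the suffixes are passed as character strings
RCC_test <- expr_RCC %>% select(ends_with(c("01","05")))

# Wrong: c(01, 05) is the numeric vector c(1, 5), not character strings
RCC_test <- expr_RCC %>% select(ends_with(c(01,05)))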

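Of course, when I first started, I used base R indexing:

# str_sub() is from stringr; `< 11` compares the codes as text, which works because they are zero-padded
RCC_cancer <- expr_RCC[,str_sub(colnames(expr_RCC),14,15) < 11]

Both versions seem reasonably concise.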
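The two approaches can be compared on a toy data frame. This is only a sketch: expr_toy and its column names are made-up stand-ins for expr_RCC, and it assumes that 01 and 05 are the only codes below 11 in the data. The base R version here converts the code to an integer first, so the comparison is numeric rather than text-based.

```r
library(dplyr)

# Hypothetical stand-in for expr_RCC: the column names are made-up
# 15-character TCGA-style barcodes ending in a two-digit code.
expr_toy <- data.frame(
  `TCGA-AA-0001-01` = c(1.2, 3.4),
  `TCGA-AA-0002-05` = c(5.6, 7.8),
  `TCGA-AA-0003-11` = c(9.0, 1.1),
  check.names = FALSE
)

# tidyselect: the suffixes are given as character strings.
by_suffix <- expr_toy %>% select(ends_with(c("01", "05")))

# Base R: turn characters 14-15 into an integer before comparing.
code <- as.integer(substr(colnames(expr_toy), 14, 15))
by_code <- expr_toy[, code < 11, drop = FALSE]

identical(names(by_suffix), names(by_code))
```

The ends_with() version needs every wanted suffix listed explicitly, while the comparison version picks up any code below 11 automatically.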

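Combining with contains()

contains() selects columns whose names contain a given piece of text. It feels a bit like a wildcard, but the text is matched literally and is not treated as a regular expression.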

iris %>% select(contains("al"))
#> # A tibble: 150 x 4
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#>          <dbl>       <dbl>        <dbl>       <dbl>
#> 1          5.1         3.5          1.4         0.2
#> 2          4.9         3            1.4         0.2
#> 3          4.7         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> # ... with 146 more rows

Combining with matches()

This one does use regular expressions.

iris %>% select(matches("[pt]al")) 
#> # A tibble: 150 x 4
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#>          <dbl>       <dbl>        <dbl>       <dbl>
#> 1          5.1         3.5          1.4         0.2
#> 2          4.9         3            1.4         0.2
#> 3          4.7         3.2          1.3         0.2
#> 4          4.6         3.1          1.5         0.2
#> # ... with 146 more rows
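The difference between contains() and matches() shows up when the same pattern is passed to both. A minimal sketch on iris (the variable names are mine):

```r
library(dplyr)

# contains() looks for the literal text "[pt]al", so nothing matches.
literal <- iris %>% select(contains("[pt]al"))

# matches() treats the same pattern as a regular expression.
regex <- iris %>% select(matches("[pt]al"))

# Anchors work in matches(): column names ending in ".Width".
widths <- iris %>% select(matches("\\.Width$"))

ncol(literal)
names(regex)
names(widths)
```

Both helpers ignore case by default, which is why "[pt]al" still matches "Petal" and "Sepal".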

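Combining with where()

where() takes a function as its argument, which makes it very powerful: you can supply any test, and the columns for which it returns TRUE are selected.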

iris %>% select(where(is.factor))
#> # A tibble: 150 x 1
#>   Species
#>   <fct>  
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows
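The test does not have to be a built-in function like is.factor. As a sketch, a custom function works too; the threshold 3.5 below is arbitrary and chosen only for illustration.

```r
library(dplyr)

# Any function that returns a single TRUE or FALSE per column can be used.
numeric_cols <- iris %>% select(where(is.numeric))

# A custom test: numeric columns whose mean is above 3.5.
# is.numeric() is checked first so mean() never sees the factor column.
large_mean <- iris %>%
  select(where(function(x) is.numeric(x) && mean(x) > 3.5))

names(numeric_cols)
names(large_mean)
```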

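Combining with which()

Since where() works, which() can be used as well. which() is a base R function rather than a tidyselect helper: it returns the integer positions of the TRUE values, and select() accepts those positions.

This is an approach I found myself, again using my own data from above.

# which() converts the logical test on the column names into column positions
RCC_test <- expr_RCC %>% select(which(str_sub(colnames(expr_RCC),14,15) < 11))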

The result is the same. Nice.
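For comparison, the same filter could also be written with matches() and a regular expression on the column names. This is only a sketch: expr_toy is a made-up stand-in for expr_RCC, and the pattern assumes the code is the final two characters after a hyphen.

```r
library(dplyr)

# Hypothetical stand-in for expr_RCC with made-up TCGA-style column names.
expr_toy <- data.frame(
  `TCGA-AA-0001-01` = c(1.2, 3.4),
  `TCGA-AA-0002-05` = c(5.6, 7.8),
  `TCGA-AA-0003-11` = c(9.0, 1.1),
  check.names = FALSE
)

# which() gives the positions of the columns whose code is below 11.
by_which <- expr_toy %>%
  select(which(as.integer(substr(colnames(expr_toy), 14, 15)) < 11))

# The same rule as a regex: names ending in -00 to -09, or -10.
by_regex <- expr_toy %>% select(matches("-(0[0-9]|10)$"))

identical(names(by_which), names(by_regex))
```

The which() version reads closer to the stated goal ("code below 11"), while the matches() version avoids referring to the data frame by name inside the pipe.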