R | Thoughts on Getting Started with R

2021-10-31 shwzhao

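For a long time I did not know how to go about learning R, so I kept using awk and similar commands for data processing.

I started with R in Action (《R语言实战》), which covers the working directory, data types, data structures, and so on;

then I read R for Data Science (《R数据科学》; a Chinese PDF can be found online, and the English version is available as a web page);

for exercises, I found Zhang Jingxin's "玩转数据处理120题之P1-P20（R语言tidyverse版本）" (120 data-processing exercises, P1-P20, R tidyverse edition); there is an updated version, as well as a roundup post about his new book 《R语言编程—基于tidyverse》 (R Programming: Based on tidyverse);

tutorials? Search online for tidyverse, readr, tibble, dplyr, stringr, tidyr, purrr, forcats, ... and you will find everything you need!

For data processing, learning the core functions of the core tidyverse packages well should be just about enough.

Addendum: quick-R, which my boss asked me to learn

Addendum: the main functions of the core tidyverse packages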

> library(tidyverse)
-- Attaching packages ----------------------- tidyverse 1.3.1 --
√ ggplot2 3.3.5     √ purrr   0.3.4
√ tibble  3.1.5     √ dplyr   1.0.7
√ tidyr   1.1.4     √ stringr 1.4.0
√ readr   2.0.2     √ forcats 0.5.1
-- Conflicts -------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

1. tibble

data.frame, data.table, tibble

library(tibble)
# library(tidyverse)

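tibble(): creates a tibble; it does not change the types of the inputs or the names of the variables, and it does not create row names
tribble(): short for "transposed tibble"; another way to create a tibble, with the data laid out row by row
as_tibble(): converts a data frame to a tibble
df$x, df[["x"]], df %>% .$x, df %>% .[["x"]]: extract column x by name as a vector
df[[1]]: extract the first column by position as a vector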

df <- tibble(
  `>` = 1:5,
  `100` = 1,
  a = c("one", "two", "three", "four", "five"),
  `:)` = `>` * `100` + str_length(a),
)
df

# # A tibble: 5 x 4
#     `>` `100` a      `:)`
#   <int> <dbl> <chr> <dbl>
# 1     1     1 one       4
# 2     2     1 two       5
# 3     3     1 three     8
# 4     4     1 four      8
# 5     5     1 five      9

> tribble(
  ~`<`, ~`100`,
  "a", 1,
  "b", 1,
)

# # A tibble: 2 x 2
#   `<`   `100`
#   <chr> <dbl>
# 1 a         1
# 2 b         1

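> df$`100` # non-syntactic name, so wrap it in backticks
[1] 1 1 1 1 1
> df[[">"]] # also a non-syntactic name, but inside [[ ]] it is an ordinary quoted string
[1] 1 2 3 4 5
> mpg %>% colnames() # a tibble does not wrap wide output, so some columns may be hidden; to print all column names, use the base function colnames()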
 [1] "manufacturer" "model"        "displ"        "year"         "cyl"
 [6] "trans"        "drv"          "cty"          "hwy"          "fl"
[11] "class"

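rownames_to_column(): converts row names to a column; I find this very useful, since I do not like the first column being automatically turned into row names when importing data
rowid_to_column(): converts row numbers to a column
column_to_rownames(): converts a column to row names
has_rownames(): checks whether the data frame has row names
remove_rownames(): removes row names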

> mtcars %>% has_rownames()
[1] TRUE
> mtcars %>% remove_rownames() %>% has_rownames()
[1] FALSE
> mtcars %>% rownames_to_column("car") %>% head()
                car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

http://blog.fens.me/r-tibble/

2. readr

Base R: read.csv(), read.delim(), read.table()
The readr package is faster, returns tibbles, does not use row names, and so on

library(readr)
# library(tidyverse)

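read_csv(): reads comma-delimited files
  skip: takes a number; with skip = 2, the first 2 lines are skipped
  comment: takes a string; with comment = "#", lines starting with # are dropped (more generally, any text after the comment string is ignored)
  col_names: 1. a logical: whether to use the first row as column names; when FALSE, the columns are named X1 to Xn in order; 2. a character vector to use as the column names
  na: sets which value or values are treated as missing
read_tsv(): reads tab-delimited files
read_delim(): reads files with an arbitrary delimiter
read_fwf(): reads fixed-width files
readxl (a separate package, via read_excel()): reads Excel files (both .xls and .xlsx)
write_csv(), write_tsv(): write files
write_excel_csv(): writes a CSV file that Excel opens correctly (it adds a UTF-8 byte order mark); the output is still a CSV file, not an .xlsx file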
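
The read_csv() arguments listed above can be tried without any files on disk. The following is a minimal sketch, assuming readr 2.0.0 or later (where I() marks a string as literal data and show_col_types is available); the inline CSV text is made up for illustration.

```r
library(readr)

# skip: drop the first 2 lines of metadata before the header
d1 <- read_csv(I("meta line 1\nmeta line 2\nx,y,z\n1,2,3\n"),
               skip = 2, show_col_types = FALSE)

# comment: ignore the line that starts with "#"
d2 <- read_csv(I("# a comment\nx,y,z\n1,2,3\n"),
               comment = "#", show_col_types = FALSE)

# col_names = FALSE: no header row, columns are named X1..Xn
d3 <- read_csv(I("1,2,3\n4,5,6\n"),
               col_names = FALSE, show_col_types = FALSE)

# col_names as a character vector, plus na: treat "." as missing
d4 <- read_csv(I("1,2,.\n4,5,6\n"),
               col_names = c("x", "y", "z"), na = ".",
               show_col_types = FALSE)

names(d3)   # "X1" "X2" "X3"
d4$z        # NA 6
```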

3. dplyr

3.1 Summarising

count(): similar to printing a few columns in the shell and piping them through sort | uniq -c
group_by(): groups the data by one or more columns

> mpg %>% group_by(manufacturer,model) %>% count()
# A tibble: 38 × 3
# Groups:   manufacturer, model [38]
   manufacturer model                  n
   <chr>        <chr>              <int>
 1 audi         a4                     7
 2 audi         a4 quattro             8
 3 audi         a6 quattro             3
 4 chevrolet    c1500 suburban 2wd     5
 5 chevrolet    corvette               5
 6 chevrolet    k1500 tahoe 4wd        4
 7 chevrolet    malibu                 5
 8 dodge        caravan 2wd           11
 9 dodge        dakota pickup 4wd      9
10 dodge        durango 4wd            7
# … with 28 more rows

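summarise(): computes summary statistics; without grouping, the whole data frame is treated as a single group and collapsed to one row
group_by() %>% summarise(): commonly used together to get one summary row per group

> mtcars %>% select(cyl, vs) %>% add_count(cyl, sort=TRUE)  # sort = TRUE sorts by count in descending order; mtcars %>% add_count(cyl) simply adds a count column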
# A tibble: 32 x 3
     cyl    vs     n
   <dbl> <dbl> <int>
 1     8     0    14
 2     8     0    14
 3     8     0    14
 4     8     0    14
 5     8     0    14
 6     8     0    14
 7     8     0    14
 8     8     0    14
 9     8     0    14
10     8     0    14
# ... with 22 more rows
> mtcars %>% group_by(cyl) %>% summarise(count=n(),mean_disp=mean(disp))
# A tibble: 3 x 3
    cyl count mean_disp
  <dbl> <int>     <dbl>
1     4    11      105.
2     6     7      183.
3     8    14      353.

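Base R has several cumulative functions, and dplyr adds a few more.

cummax(): cumulative maximum
cummin(): cumulative minimum
cumsum(): cumulative sum
cumprod(): cumulative product
dplyr::cummean(): cumulative mean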

> tibble(a=1:5, b=6:10) %>%
  pivot_longer(everything(), names_to="Type",values_to="Num") %>%
  arrange(Type) %>%
  group_by(Type) %>%
  mutate(Cumsum=cumsum(Num))
# A tibble: 10 × 3
# Groups:   Type [2]
   Type    Num Cumsum
   <chr> <int>  <int>
 1 a         1      1
 2 a         2      3
 3 a         3      6
 4 a         4     10
 5 a         5     15
 6 b         6      6
 7 b         7     13
 8 b         8     21
 9 b         9     30
10 b        10     40

3.2 Row operations

filter(): filters rows
Comparison: <, <=, >, >=, ==, !=, is.na(), !is.na()
Logical: &, |, !, xor()
Also worth knowing: between(), %in%, near(), ...

> a=c("audi","dodge","honda")
> mpg %>% filter(manufacturer %in% a) %>% group_by(manufacturer) %>% count()
# A tibble: 3 × 2
# Groups:   manufacturer [3]
  manufacturer     n
  <chr>        <int>
1 audi            18
2 dodge           37
3 honda            9
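
The other helpers listed for filter() work along the same lines. A minimal sketch, assuming the mpg dataset from ggplot2 is available; the thresholds are arbitrary and chosen only for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# between(x, left, right) is shorthand for x >= left & x <= right
mid_cty <- mpg %>% filter(between(cty, 20, 25))

# near() compares floating-point numbers with a tolerance,
# which avoids the usual surprises of ==
exact_equal <- sqrt(2)^2 == 2       # FALSE
close_enough <- near(sqrt(2)^2, 2)  # TRUE

# xor(): keep rows where exactly one of the two conditions holds
one_only <- mpg %>% filter(xor(cyl == 4, drv == "f"))
```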

distinct(): removes duplicate rows

> mpg %>% select(manufacturer) %>% distinct()
# A tibble: 15 × 1
   manufacturer
   <chr>
 1 audi
 2 chevrolet
 3 dodge
 4 ford
 5 honda
 6 hyundai
 7 jeep
 8 land rover
 9 lincoln
10 mercury
11 nissan
12 pontiac
13 subaru
14 toyota
15 volkswagen

slice(): selects rows by position

> mpg %>% select(manufacturer) %>% distinct() %>% slice(c(1,1,3,5)) %>% .[[1]]
[1] "audi"  "audi"  "dodge" "honda"

sample_n(): randomly samples n rows (superseded by slice_sample() since dplyr 1.0.0)

> mpg %>% select(manufacturer) %>% distinct() %>% sample_n(3,replace = F) %>% .[[1]]
[1] "audi"   "toyota" "dodge"

slice_sample(): randomly samples rows; n defaults to 1, so with no arguments it returns a single row

> mpg %>% slice_sample()
# A tibble: 1 × 11
  manufacturer model      displ  year   cyl trans  drv     cty   hwy fl    class
  <chr>        <chr>      <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
1 ford         f150 pick…   4.2  1999     6 auto(… 4        14    17 r     pick…

arrange(): sorts rows
add_row(): appends rows to a data frame (from the tibble package)
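
A minimal sketch of arrange() and add_row(), using a small made-up tibble (the column names and values are illustrative only):

```r
library(dplyr)
library(tibble)

df <- tibble(name = c("b", "a", "c"), score = c(2, 3, 1))

# arrange() sorts ascending by default; desc() reverses the order
sorted <- df %>% arrange(desc(score))

# add_row() appends a row, with values given by column name
extended <- df %>% add_row(name = "d", score = 4)

sorted$name     # "a" "b" "c"
nrow(extended)  # 4
```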

3.3 Column operations

pull(): extracts a single column as a vector; it cannot extract several columns at once

> mpg %>% pull(year) %>% unique() # the base function unique() removes duplicates from a vector
[1] 1999 2008

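select(): selects columns
starts_with(), ends_with(), contains(): match column names literally, as the names suggest
matches(): matches column names against a regular expression
everything(): all columns; after putting specific columns first, use everything() to place the remaining ones
: (colon): selects a range of consecutive columns
- (minus): drops a column
The operators !, &, and | can also be used to combine selections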

> mpg %>% select((!starts_with(c("m","d","y")) | matches("m.*er")))
# A tibble: 234 × 7
     cyl trans        cty   hwy fl    class   manufacturer
   <int> <chr>      <int> <int> <chr> <chr>   <chr>
 1     4 auto(l5)      18    29 p     compact audi
 2     4 manual(m5)    21    29 p     compact audi
 3     4 manual(m6)    20    31 p     compact audi
 4     4 auto(av)      21    30 p     compact audi
 5     6 auto(l5)      16    26 p     compact audi
 6     6 manual(m5)    18    26 p     compact audi
 7     6 auto(av)      18    27 p     compact audi
 8     4 manual(m5)    18    26 p     compact audi
 9     4 auto(l5)      16    25 p     compact audi
10     4 manual(m6)    20    28 p     compact audi
# … with 224 more rows

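relocate(): moves columns to a new position
across(): summarises or transforms multiple columns in the same way
c_across(): combines values from several columns, for use with rowwise()
mutate(): adds new columns
Often used together with case_when(), coalesce(), if_else(), and na_if()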

> tibble(a=1:5, b=6:10) %>% mutate(If2=if_else(b/a==2,"T","F"))
# A tibble: 5 × 3
      a     b If2
  <int> <int> <chr>
1     1     6 F
2     2     7 F
3     3     8 F
4     4     9 F
5     5    10 T
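
The other helpers mentioned for mutate(), along with across() and c_across(), can be sketched on a small made-up tibble. The column names, labels, and thresholds below are assumptions for illustration only.

```r
library(dplyr)

df <- tibble(a = c(1, 2, NA), b = c(4, NA, 6))

# across(): apply the same function to several columns
means <- df %>% summarise(across(c(a, b), ~ mean(.x, na.rm = TRUE)))

out <- df %>%
  mutate(
    # coalesce(): first non-missing value, taken from a and then b
    first_non_na = coalesce(a, b),
    # case_when(): vectorised if / else if / else; the first match wins
    size = case_when(
      is.na(a) ~ "missing",
      a >= 2   ~ "large",
      TRUE     ~ "small"
    ),
    # na_if(): turn a specific value into NA
    a_no_one = na_if(a, 1)
  )

# c_across() combines columns within each row of a rowwise() data frame
totals <- df %>%
  rowwise() %>%
  mutate(total = sum(c_across(c(a, b)), na.rm = TRUE)) %>%
  ungroup()

out$first_non_na  # 1 2 6
out$size          # "small" "large" "missing"
totals$total      # 5 2 6
```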

transmute(): keeps only the newly created columns

> mpg %>% transmute(mm=str_c(manufacturer, "__", model))
# A tibble: 234 × 1
   mm
   <chr>
 1 audi__a4
 2 audi__a4
 3 audi__a4
 4 audi__a4
 5 audi__a4
 6 audi__a4
 7 audi__a4
 8 audi__a4 quattro
 9 audi__a4 quattro
10 audi__a4 quattro
# … with 224 more rows

rename(), rename_with(): rename columns
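
A minimal sketch of rename(), rename_with(), and relocate() on a made-up tibble (the column names are hypothetical):

```r
library(dplyr)

df <- tibble(first_col = 1:2, second_col = c("a", "b"), third_col = c(TRUE, FALSE))

# rename() uses new_name = old_name
renamed <- df %>% rename(id = first_col)

# rename_with() applies a function to the selected column names
upper <- df %>% rename_with(toupper, ends_with("col"))

# relocate() moves columns; .before / .after set the target position
moved <- df %>% relocate(third_col, .before = first_col)

names(renamed)  # "id" "second_col" "third_col"
names(upper)    # "FIRST_COL" "SECOND_COL" "THIRD_COL"
names(moved)    # "third_col" "first_col" "second_col"
```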

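3.4 Operations on multiple data frames

left_join()
Also right_join(), inner_join(), and full_join()

by = c(): specifies the columns to join on

bind_cols(): binds data frames side by side, matching rows by position
bind_rows(): stacks data frames on top of each other

intersect(): rows that appear in both data frames
setdiff(): rows that appear in the first data frame but not in the second
union(): combines the data frames and removes duplicate rows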
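
A minimal sketch of the joins and set operations, using small made-up tibbles. The key columns deliberately have different names (id and key, both hypothetical) to show the by = c("id" = "key") form.

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(key = c(2, 3, 4), y_val = c("B", "C", "D"))

# by = c("id" = "key") matches x$id against y$key
left  <- left_join(x, y, by = c("id" = "key"))   # all rows of x
inner <- inner_join(x, y, by = c("id" = "key"))  # only matching rows
full  <- full_join(x, y, by = c("id" = "key"))   # all rows of both

# Set operations need data frames with the same columns
p <- tibble(v = c(1, 2, 3))
q <- tibble(v = c(2, 3, 4))

both    <- dplyr::intersect(p, q)  # rows in both: 2, 3
only_p  <- dplyr::setdiff(p, q)    # rows in p but not in q: 1
either  <- dplyr::union(p, q)      # unique rows from both: 1, 2, 3, 4
stacked <- bind_rows(p, q)         # simple stacking, duplicates kept
```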

4. tidyr

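In my opinion, pivot_wider() and pivot_longer() are the two most important functions in tidyr; what the other functions do can largely be reproduced by combining dplyr and stringr.

pivot_wider(): long to wide
pivot_longer(): wide to long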

> mpg %>% group_by(model, year) %>% count() %>% 
  pivot_wider(id_cols=model, names_from=year, values_from=n, names_prefix = "year_")
# A tibble: 38 × 3
# Groups:   model [38]
   model              year_1999 year_2008
   <chr>                  <int>     <int>
 1 4runner 4wd                4         2
 2 a4                         4         3
 3 a4 quattro                 4         4
 4 a6 quattro                 1         2
 5 altima                     2         4
 6 c1500 suburban 2wd         1         4
 7 camry                      4         3
 8 camry solara               4         3
 9 caravan 2wd                6         5
10 civic                      5         4
# … with 28 more rows
> mpg %>% group_by(model, year) %>% count() %>% 
  pivot_wider(id_cols=model, names_from=year, values_from=n, names_prefix = "year_") %>% 
  pivot_longer(-model, names_to="year",values_to="count")
# A tibble: 76 × 3
# Groups:   model [38]
   model       year      count
   <chr>       <chr>     <int>
 1 4runner 4wd year_1999     4
 2 4runner 4wd year_2008     2
 3 a4          year_1999     4
 4 a4          year_2008     3
 5 a4 quattro  year_1999     4
 6 a4 quattro  year_2008     4
 7 a6 quattro  year_1999     1
 8 a6 quattro  year_2008     2
 9 altima      year_1999     2
10 altima      year_2008     4
# … with 66 more rows

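expand(): takes the given columns and returns all possible combinations of their values
complete(): adds rows for the combinations that are missing from the data, filling the other columns with NA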

> mtcars %>% expand(cyl)
# A tibble: 3 × 1
    cyl
  <dbl>
1     4
2     6
3     8
> mtcars %>% expand(cyl, vs)
# A tibble: 6 × 2
    cyl    vs
  <dbl> <dbl>
1     4     0
2     4     1
3     6     0
4     6     1
5     8     0
6     8     1
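
complete() builds on the same idea as expand(), but keeps the original data and adds rows for the combinations that are missing. A minimal sketch on a made-up tibble (column names and values are illustrative):

```r
library(tidyr)
library(tibble)

df <- tibble(
  g    = c("a", "a", "b"),
  year = c(1999, 2008, 1999),
  n    = c(1, 2, 3)
)

# The combination (b, 2008) is missing; complete() adds it with n = NA
completed <- df %>% complete(g, year)

# fill = list(...) supplies a value to use instead of NA
completed0 <- df %>% complete(g, year, fill = list(n = 0))

nrow(completed)  # 4
```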

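unite(): merges several columns into one
separate(): splits one column into several
separate_rows(): splits a column's values into multiple rows

drop_na(): drops rows with missing values in the specified columns
fill(): fills missing values with the previous or next value
replace_na(): replaces missing values in the specified columns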
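
A minimal sketch of these helpers on a made-up tibble; the separators and column names are assumptions chosen for illustration.

```r
library(tidyr)
library(tibble)

df <- tibble(id = c("a_1", "b_2", "c_3"), value = c(10, NA, 30))

# separate(): split id at "_" into two columns
sep <- df %>% separate(id, into = c("letter", "number"), sep = "_")

# unite(): glue the two columns back together with a different separator
uni <- sep %>% unite("id", letter, number, sep = "-")

# fill(): carry the previous value down into the NA (default direction)
filled <- df %>% fill(value)

# replace_na(): replace NA with a given value, per column
replaced <- df %>% replace_na(list(value = 0))

# drop_na(): drop rows where value is NA
dropped <- df %>% drop_na(value)

# separate_rows(): one row per comma-separated tag
rows <- tibble(id = 1, tags = "x,y,z") %>% separate_rows(tags, sep = ",")

uni$id        # "a-1" "b-2" "c-3"
filled$value  # 10 10 30
```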

5. stringr

5.1 Detecting matches

str_detect()
str_starts()
str_which()
str_locate()
str_count()

> mpg %>%
  mutate(newcol=if_else(str_detect(trans, "auto"), "A", "M")) %>%
  group_by(newcol) %>%
  count()
# A tibble: 2 × 2
# Groups:   newcol [2]
  newcol     n
  <chr>  <int>
1 A        157
2 M         77
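
The remaining functions in this group can be compared on a small character vector. A minimal sketch; the vector is made up for illustration.

```r
library(stringr)

x <- c("apple", "banana", "pear")

detected <- str_detect(x, "an")  # FALSE TRUE FALSE: does each string match?
starts   <- str_starts(x, "p")   # FALSE FALSE TRUE: does it start with "p"?
which_an <- str_which(x, "an")   # 2: indices of the matching strings
counts   <- str_count(x, "a")    # 1 3 1: number of matches per string
loc      <- str_locate(x, "a")   # matrix with start/end of the first match
```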

5.2 Extracting strings

str_sub(): extracts a substring by position
str_subset(): returns the strings that contain a match
str_extract()
str_extract_all()
str_match()
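
A minimal sketch of the extraction functions, using a few values in the style of the trans column of mpg; the regular expressions are illustrative assumptions.

```r
library(stringr)

x <- c("auto(l5)", "manual(m6)", "auto(av)")

first4  <- str_sub(x, 1, 4)          # characters 1 to 4 of each string
autos   <- str_subset(x, "^auto")    # keep only strings starting with "auto"
digit   <- str_extract(x, "[0-9]")   # first match per string, NA if none
letters_all <- str_extract_all(x, "[a-z]+")  # list with all matches per string

# str_match() also returns the capture groups, one column per group
m <- str_match(x, "(\\w+)\\((\\w+)\\)")

first4  # "auto" "manu" "auto"
digit   # "5" "6" NA
m[, 2]  # "auto" "manual" "auto"
m[, 3]  # "l5" "m6" "av"
```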

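5.3 String length

str_length(): returns the length of each string
str_pad(): pads strings to a given width
str_trunc(): truncates strings to a given width
str_trim(): removes whitespace from the start and end of a string
str_squish(): collapses repeated whitespace inside a string and trims both ends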
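
A minimal sketch of the length-related functions; the input strings are made up.

```r
library(stringr)

len      <- str_length("tibble")            # 6
padded   <- str_pad("7", width = 3, pad = "0")  # "007" (pads on the left by default)
short    <- str_trunc("tidyverse", 6)       # "tid..." (the ellipsis counts towards the width)
trimmed  <- str_trim("  a  b  ")            # "a  b" (inner spaces are kept)
squished <- str_squish("  a   b  ")         # "a b"
```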

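5.4 Replacing strings

str_replace(): replaces the first match in each string
str_replace_all(): replaces all matches in each string
str_to_lower(): converts all letters to lower case
str_to_upper(): converts all letters to upper case
str_to_title(): capitalises the first letter of each word and lower-cases the rest
str_to_sentence(): capitalises only the first letter of the string and lower-cases the rest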
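
A minimal sketch of the replacement and case functions; the input strings are made up.

```r
library(stringr)

first_only <- str_replace("a-b-c", "-", "_")      # "a_b-c"
all_of     <- str_replace_all("a-b-c", "-", "_")  # "a_b_c"

upper    <- str_to_upper("hello world")     # "HELLO WORLD"
title    <- str_to_title("hello world")     # "Hello World"
sentence <- str_to_sentence("hello WORLD")  # "Hello world"
```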

5.5 Joining and splitting

str_c()

> mpg %>%
  mutate(a=str_c(manufacturer, model, sep="-")) %>%
  select(a,everything())
# A tibble: 234 × 12
   a        manufacturer model  displ  year   cyl trans  drv     cty   hwy fl
   <chr>    <chr>        <chr>  <dbl> <int> <int> <chr>  <chr> <int> <int> <chr>
 1 audi-a4  audi         a4       1.8  1999     4 auto(… f        18    29 p
 2 audi-a4  audi         a4       1.8  1999     4 manua… f        21    29 p
 3 audi-a4  audi         a4       2    2008     4 manua… f        20    31 p
 4 audi-a4  audi         a4       2    2008     4 auto(… f        21    30 p
 5 audi-a4  audi         a4       2.8  1999     6 auto(… f        16    26 p
 6 audi-a4  audi         a4       2.8  1999     6 manua… f        18    26 p
 7 audi-a4  audi         a4       3.1  2008     6 auto(… f        18    27 p
 8 audi-a4… audi         a4 qu…   1.8  1999     4 manua… 4        18    26 p
 9 audi-a4… audi         a4 qu…   1.8  1999     4 auto(… 4        16    25 p
10 audi-a4… audi         a4 qu…   2    2008     4 manua… 4        20    28 p
# … with 224 more rows, and 1 more variable: class <chr>

The following functions do not seem very helpful for working with data frames

str_flatten()
str_dup()
str_split_fixed()
str_split()
str_split_n()
str_glue()
str_glue_data()

5.6 Sorting strings

str_order()
str_sort()

6. forcats

6.1 Factor functions in the `base` package

factor()
levels()
as.factor() (forcats provides the counterpart as_factor())

> qq <- c("d", "g", "a", "e", "f", "a", "e")
> qq <- factor(qq)

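6.2 Inspecting

fct_count(): counts the number of entries in each level

6.3 Reordering

fct_relevel(): manually
fct_infreq(): by the frequency of each level
fct_inorder(): by the order in which the levels first appear
fct_rev(): reverses the order
fct_shift(): shifts the levels left or right
fct_shuffle(): random order
fct_reorder(): reorders by another variable; this should be very useful
fct_reorder2()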
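
A minimal sketch of a few reordering functions, reusing the qq factor from section 6.1; the second factor and its values are made up for illustration.

```r
library(forcats)

qq <- factor(c("d", "g", "a", "e", "f", "a", "e"))

default_lv <- levels(qq)               # alphabetical: "a" "d" "e" "f" "g"
inorder_lv <- levels(fct_inorder(qq))  # order of first appearance
infreq_lv  <- levels(fct_infreq(qq))   # most frequent levels ("a", "e") first
rev_lv     <- levels(fct_rev(qq))      # reversed
front_lv   <- levels(fct_relevel(qq, "g"))  # move "g" to the front by hand

counts <- fct_count(qq)                # tibble with columns f and n

# fct_reorder(): order the levels of g by the values of v (ascending)
g <- factor(c("x", "y", "z"))
v <- c(3, 1, 2)
reordered_lv <- levels(fct_reorder(g, v))  # "y" "z" "x"
```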

6.4 Changing level values

fct_recode(): manually
fct_anon()
fct_collapse()
fct_lump(): lumps levels together based on their frequency; very useful when filtering data
fct_other()
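
A minimal sketch of how the level values can be changed, assuming forcats 0.5.0 or later (where fct_lump_n() is available as a more specific variant of fct_lump()); the factor and the new level names are made up.

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# fct_recode(): new_name = "old_name"
recoded_lv <- levels(fct_recode(f, A = "a"))

# fct_collapse(): merge several levels into one
collapsed_lv <- levels(fct_collapse(f, rare = c("c", "d")))

# fct_lump_n(): keep the n most frequent levels, lump the rest into "Other"
lumped_lv <- levels(fct_lump_n(f, n = 2))

# fct_other(): keep the named levels, replace everything else with "Other"
other_lv <- levels(fct_other(f, keep = "a"))

recoded_lv    # "A" "b" "c" "d"
collapsed_lv  # "a" "b" "rare"
lumped_lv     # "a" "b" "Other"
other_lv      # "a" "Other"
```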

6.5 Adding or dropping levels

fct_drop()
fct_expand()
fct_explicit_na()

7. purrr

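I just watched Zhang Jinlong's R video course (2022-03-02), and he said he has not used this package so far.
I mostly just tidy up tabular data, and purrr is better suited to lists, so I probably will not need it either... I will learn it when I actually run into it.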
