R Language Learning Guide (3): Basic Usage of tidyverse

2020-12-14 R语言数据分析指南

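tidyverse is a collection of R packages designed for data science. It bundles a number of popular packages, including ggplot2, dplyr, tidyr, stringr, magrittr, and tibble. Learning tidyverse well lets you process data far more efficiently, so this document does more than walk through a few examples: it aims to introduce R through a sound learning approach, so that you can avoid detours and get up to speed quickly.

1. Installing `tidyverse`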

install.packages("tidyverse")
library(tidyverse)

> library(tidyverse)
─ Attaching packages ─────────── tidyverse 1.3.0 ─
✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.4     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
─ Conflicts ──────────── tidyverse_conflicts() ─
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

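2. The iris dataset

We will use the iris dataset throughout, so take a moment to get familiar with it. iris is a built-in dataset that ships with R's datasets package, so it is available as soon as R starts; no package, ggplot2 included, needs to be loaded first. Loading tidyverse does attach ggplot2 automatically, which we will use for plotting.

View(iris)  #browse the data in a spreadsheet-like viewer, much like Excel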

iris.png

attributes(iris) #show the object's attributes

> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     

$class
[1] "data.frame"

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
 [17]  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32
 [33]  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48
 [49]  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64
 [65]  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 [81]  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96
 [97]  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112
[113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
[129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150

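The data has 5 columns and 150 rows, and its class is data.frame. The four numeric columns Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width hold the sepal and petal measurements, and Species records the species of each flower, which is one of setosa, versicolor, or virginica.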
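The same facts can be checked without scrolling through the row names printed by attributes(). The following is a minimal sketch using only base R; the variable names dims, col_classes, and species are chosen here for illustration.

```r
# iris ships with base R (package "datasets"), so no library() call is needed
dims <- dim(iris)                    # number of rows, then number of columns
col_classes <- sapply(iris, class)   # class of each column
species <- levels(iris$Species)      # the distinct species labels

print(dims)
print(col_classes)
print(species)
```

str(iris) gives a similar overview in a single call.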

With the iris dataset introduced, let's move on to some basic data operations.

3. Manipulating data with dplyr

3.1 select (pick columns by name)

select(iris,Sepal.Length,Petal.Length,Species)
#for easier viewing, show only the first 6 rows
head(select(iris,Sepal.Length,Petal.Length,Species))

Assign the selected result to a variable with the assignment operator <-, as shown below:

p <- select(iris,Sepal.Length,Petal.Length,Species)

Next, use this data for a very basic visualization:
For the principles behind ggplot2, see: https://www.jianshu.com/p/4da5a941e8b5

ggplot(p,aes(Sepal.Length,Petal.Length))+
  geom_point(aes(color=Species),size=2)

plot1.png

select: pick all columns between two columns (inclusive)

select(iris,Sepal.Length:Petal.Length)

select: pick all columns except those between two columns

select(iris,-(Sepal.Length:Petal.Length))

select: change the column order

#combining select() with everything() lets you reorder columns

select(iris,Species,Petal.Width,Sepal.Width,
       Sepal.Length,Petal.Length,everything())
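To see what these select() variants actually return, it helps to look at the resulting column names only. This is a small sketch, assuming dplyr is installed; reordered, between, and outside are illustrative variable names.

```r
library(dplyr)

# Move Species to the front; everything() fills in the remaining columns
# in their original order, without repeating Species
reordered <- select(iris, Species, everything())
print(names(reordered))

# A column range and its negation split the columns into two complementary sets
between <- select(iris, Sepal.Length:Petal.Length)
outside <- select(iris, -(Sepal.Length:Petal.Length))
print(names(between))
print(names(outside))
```

Since everything() skips columns that were already named, only the columns you want at the front need to be listed explicitly.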

3.2 filter (pick rows by value)

filter(iris,Sepal.Length >=5,Petal.Length >=2)

p1 <- filter(iris,Sepal.Length >=5,Petal.Length >=2)
ggplot(p1,aes(Sepal.Length,Petal.Length))+
  geom_point(aes(color=Species),size=2)

plot2.png

Comparison operators in R: >, >=, <, <=, != (not equal), == (equal)
Logical operators in R: & means "and", | means "or", ! means "not"
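The comparison and logical operators can be combined freely inside filter(). The sketch below, assuming dplyr is installed, shows that comma-separated conditions behave like &, and gives one example each of | and !; the thresholds and variable names are arbitrary choices for illustration.

```r
library(dplyr)

# Comma-separated conditions are combined with "and" ...
both_comma <- filter(iris, Sepal.Length >= 5, Petal.Length >= 2)
# ... so this is the same as writing & explicitly
both_amp <- filter(iris, Sepal.Length >= 5 & Petal.Length >= 2)
print(nrow(both_comma) == nrow(both_amp))

# | keeps rows that satisfy at least one of the conditions
either <- filter(iris, Species == "setosa" | Petal.Width > 2)

# ! negates a condition; %in% tests membership in a set of values
not_setosa <- filter(iris, !(Species %in% c("setosa")))
print(nrow(not_setosa))
```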

3.3 arrange (reorder rows)

#sort by the Petal.Width column; ascending order is the default
arrange(iris,Petal.Width)

#wrap a column in desc() to sort it in descending order:
arrange(iris,desc(Petal.Width))
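arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. A minimal sketch, assuming dplyr is installed; the choice of keys is only an example.

```r
library(dplyr)

# Sort by species first (factor level order), then by descending
# petal width within each species
sorted <- arrange(iris, Species, desc(Petal.Width))
print(head(sorted, 3))
```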

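3.5 rename (change column names)

#new name first, original name second
rename(iris,length=Sepal.Length)

#the replace=c("old"="new") form is plyr::rename() syntax and fails under dplyr;
#with dplyr, pass a named character vector (new="old") through all_of()
rename(iris,all_of(c(length="Sepal.Length")))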

3.6 mutate (add new columns)

mutate(iris,group ="A",Length=10)
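Besides constant columns, mutate() is most often used to derive a column from existing ones. The following sketch assumes dplyr is installed; Sepal.Ratio and Long.Sepal are hypothetical column names invented for this example.

```r
library(dplyr)

# Sepal.Ratio and Long.Sepal are made-up names for illustration
iris_ratio <- mutate(iris,
                     Sepal.Ratio = Sepal.Length / Sepal.Width,
                     Long.Sepal = Sepal.Ratio > 2)  # may use the column created just above
print(head(iris_ratio, 3))
```

Note that mutate() returns a new data frame; iris itself is unchanged unless the result is assigned back.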

3.7 summarize (compute summaries)

summarize collapses a data frame into a single row

summarize(iris,mean(Sepal.Length),
          sd(Sepal.Length))
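Naming each summary makes the result easier to use afterwards. A small sketch, assuming dplyr is installed; mean_sepal, sd_sepal, and n are names picked for this example.

```r
library(dplyr)

# Each name = expression pair becomes one column of the one-row result
stats <- summarize(iris,
                   mean_sepal = mean(Sepal.Length),
                   sd_sepal = sd(Sepal.Length),
                   n = n())   # n() counts the rows being summarized
print(stats)
```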

3.8 group_by()

group_by changes the unit of analysis from the whole dataset to individual groups

iris %>% group_by(Species) %>% 
  summarize(m = mean(Sepal.Length,na.rm=TRUE))
# na.rm=TRUE drops missing values before computing the mean

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
  Species        m
  <fct>      <dbl>
1 setosa      5.01
2 versicolor  5.94
3 virginica   6.59
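Several summaries can be computed per group in one call, and the `.groups` argument mentioned in the message above controls the grouping of the result. This is a sketch assuming dplyr 1.0.0 or later; the summary column names are illustrative.

```r
library(dplyr)

by_species <- iris %>%
  group_by(Species) %>%
  summarize(n = n(),                        # rows per species
            mean_sepal = mean(Sepal.Length),
            max_petal = max(Petal.Length),
            .groups = "drop")               # return an ungrouped result, no message
print(by_species)
```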

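3.9 %>% (the pipe)

Pipes simplify code and make it easier to read. Compare a version that uses intermediate variables with the same steps written as a pipeline: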

p1 <- filter(iris,Sepal.Length >=5,Petal.Length >=2)
p2 <- group_by(p1,Species)
p3 <- filter(p2,Species=="virginica")
ggplot(p3,aes(Sepal.Length,Petal.Length))+
  geom_point(aes(color=Species),size=2)

iris %>% filter(Sepal.Length >=5,Petal.Length >=2) %>%
  group_by(Species) %>% filter(Species=="virginica") %>%
  ggplot(aes(Sepal.Length,Petal.Length))+
  geom_point(aes(color=Species),size=2)

Both snippets produce the same result; the version using %>% clearly needs no intermediate variables and is easier to read.

iris %>% filter(.,Sepal.Length >=5,Petal.Length >=2)

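The pipe works by passing the object on the left of %>% to the . placeholder on the right. Because the left-hand side goes to the first argument by default, the . is usually omitted in practice.

3.10 count(): count the number of rows in each group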
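The placeholder matters when the data is not the first argument of the function on the right. The sketch below assumes dplyr is installed (it re-exports %>% from magrittr); the lm() call is just one convenient example of such a function.

```r
library(dplyr)

# These two calls are equivalent: the left-hand side becomes the first argument
a <- iris %>% head(3)
b <- iris %>% head(., 3)
print(identical(a, b))

# An explicit . is needed when the data is not the first argument,
# as with lm(), whose first argument is the formula
fit <- iris %>% lm(Petal.Length ~ Sepal.Length, data = .)
print(coef(fit))
```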

iris %>% count(Species)
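count() can be thought of as a shortcut for grouping and then counting rows. A minimal sketch, assuming dplyr is installed; the Petal.Width threshold is arbitrary.

```r
library(dplyr)

counted <- iris %>% count(Species)

# The same counts written out with group_by() and summarize()
manual <- iris %>%
  group_by(Species) %>%
  summarize(n = n(), .groups = "drop")
print(identical(counted$n, manual$n))

# sort = TRUE orders the groups from most to least frequent
wide_petals <- iris %>%
  filter(Petal.Width > 1.5) %>%
  count(Species, sort = TRUE)
print(wide_petals)
```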

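tidyverse has many more useful functions, but the ones covered above are the most frequently used in data processing. Having now covered the plotting principles and a set of data-processing functions, we can deepen our understanding through a series of visualization examples.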
