Learning the R tidyverse

2019-08-06 生信小白2018

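The tidyverse is a collection of excellent R packages. The 8 most commonly used (core) packages are ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr and forcats.
What each package does:

readr: reading data
tibble: an enhanced data frame
tidyr: converting between long and wide tables, tidying and cleaning data
dplyr: manipulating and wrangling data
stringr: working with strings
forcats: working with factors
purrr: functional programming over lists and vectors
ggplot2: data visualization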

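Most data analysis tasks involve a fairly fixed set of operations, and the commands and R packages that go with each operation are fairly fixed too. The figure below summarizes them.

[Figure: image.png]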

Install, load, and get an overview of tidyverse

install.packages("tidyverse")
library(tidyverse)

tidyverse_conflicts()     # conflicts between tidyverse and other packages
tidyverse_deps()          # list all packages tidyverse depends on
tidyverse_logo()          # print the tidyverse logo
tidyverse_packages()      # list all tidyverse packages
tidyverse_update()        # update the tidyverse packages

Load the data and take a first look

library(datasets)
install.packages("gapminder")
library(gapminder)
attach(iris)

head(iris)
str(iris)
glimpse(iris)
typeof(iris)
dim(iris)

The readr package

The main functions in readr are:
read_csv,
read_delim,
read_table,
write_delim,
write_csv,
write_excel_csv.
read_table always splits columns on whitespace and its separator cannot be changed; read_delim lets you specify the delimiter.
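
To make the read/write pairs concrete, here is a minimal round-trip sketch. The data frame df, the temporary file paths and the explicit column types are all made up for this illustration and are not part of the original post.

```r
library(readr)

# Hypothetical example data, not from the original post
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")

write_csv(df, csv_path)                   # comma-separated
write_delim(df, tsv_path, delim = "\t")   # tab-separated

# Column types are given explicitly so readr does not have to guess them
types <- cols(id = col_integer(), name = col_character())
from_csv <- read_csv(csv_path, col_types = types)
from_tsv <- read_delim(tsv_path, delim = "\t", col_types = types)
```

Both reads should give back the same three rows; the only thing that changes between them is the delimiter passed to read_delim.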

The pipe operator: %>%

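It passes the object on the left of %>% to the function on the right. This cuts down on the intermediate objects you would otherwise have to create and keep in memory.
f(x) becomes x %>% f, and a nested call such as h(g(f(x))) becomes x %>% f %>% g %>% h.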

x %>% f   is equivalent to   f(x)
x %>% f(y)   is equivalent to   f(x, y)
x %>% f %>% g %>% h   is equivalent to   h(g(f(x)))

The argument placeholder

x %>% f(y, .)   is equivalent to   f(y, x)
x %>% f(y, z = .)   is equivalent to   f(y, z = x)
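
The rules above can be checked on a few small values. This is only a sketch: the vector x and the result names r1 to r4 are made up for the illustration.

```r
library(magrittr)

x <- c(1, 4, 9)

r1 <- x %>% sqrt                 # same as sqrt(x)
r2 <- x %>% sqrt %>% sum         # same as sum(sqrt(x))
r3 <- 3.14159 %>% round(2)       # same as round(3.14159, 2)

# With the placeholder, the left-hand side goes where the dot is
# instead of becoming the first argument.
r4 <- "-" %>% paste("a", "b", sep = .)   # same as paste("a", "b", sep = "-")
```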

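Reusing the placeholder for attributes
The placeholder can be used several times in the right-hand expression. However, when the placeholder appears only inside nested expressions, magrittr still applies the first-argument rule. The reason is that in most cases this gives cleaner code.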

x %>% f(y = nrow(.), z = ncol(.))     is equivalent to     f(x, y = nrow(x), z = ncol(x))

This behavior can be overruled by enclosing the right-hand side in braces:

x %>% {f(y = nrow(.), z = ncol(.))}     is equivalent to     f(y = nrow(x), z = ncol(x))
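
A small sketch of the difference, using c() and length() as stand-ins for f, nrow and ncol. The variable names are chosen only for this illustration.

```r
library(magrittr)

x <- 1:3

# The dot appears only inside a nested call (length(.)), so x is still
# inserted as the first argument: c(x, length(x))
with_first_arg <- x %>% c(length(.))

# Braces switch the first-argument rule off: c(length(x))
without_first_arg <- x %>% { c(length(.)) }
```

The first result has four elements (the three values of x followed by its length), while the second holds only the length.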

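Piping with exposed variables
Many functions accept a data argument, for example lm and aggregate, which is very handy in a pipeline that processes data. Other functions have no data argument, and for those it is useful to expose the variables inside the data. That is what the %$% operator does: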

library(tidyverse)
library(magrittr)
iris %>%
  subset(Sepal.Length> mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

data.frame(z= rnorm(100)) %$% ts.plot(z)

The compound assignment pipe
There is also a pipe operator that works as shorthand when the left-hand side is being overwritten:

iris$Sepal.Length<- 
  iris$Sepal.Length %>%
  sqrt()

To avoid repeating the left-hand side right after the assignment operator, use the %<>% operator:

iris$Sepal.Length %<>% sqrt

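This operator works exactly like %>%, except that the pipeline assigns the result instead of returning it. It must be the first pipe operator in a longer chain.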

Besides the handy %>%, magrittr provides three other useful operators: %$%, %<>% and %T>%.

%>%        forward-pipe operator.
%T>%     tee operator.
%<>%     compound assignment pipe-operator. (The experts advise against using it, so better to listen.)
%$%        exposition pipe-operator.
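
The tee operator is the only one of the four without an example so far, so here is a minimal sketch of it next to %<>%. The variables total and v are throwaway names used only for this illustration.

```r
library(magrittr)

# %T>% runs the right-hand side for its side effect (here: printing)
# and then passes the ORIGINAL left-hand value on to the next step.
total <- c(1, 2, 3) %T>% print() %>% sum()

# %<>% pipes and assigns the result back to the left-hand side.
v <- c(4, 9, 16)
v %<>% sqrt
```

Without the tee, sum() would receive whatever print() returns; with it, the step is guaranteed to pass the input through untouched, which is why it suits plotting or logging in the middle of a chain.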

tidyr: the successor to reshape2, with a more focused feature set

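tidyr makes data tidy.
Tidy data follows three principles:

1 Variables form the columns
2 Observations form the rows
3 Values go in the cells

Properties of tidy data: every column is a variable; every row is an observation
The four most commonly used tidyr functions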

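gather() turns "wide" data into long data
spread() turns "long" data into wide data
separate() splits a single column into several columns
unite() combines several columns into one column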

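We use gather() to collect data that was originally spread over three columns into two columns: key and value.
This is how the tidyverse defines gather: "Gather takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. You use gather() when you notice that you have columns that are not variables."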

kv_gathered <- key_value %>% 
  gather(key, # this will be the new column for the 3 key columns
         value, # this will contain the 9 distinct values
         key1:key3, # this is the range of columns we want gathered
         na.rm = TRUE # handles missing
  )
kv_gathered
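
The snippet above relies on a key_value data frame that the post never shows. The sketch below fills that gap with a made-up one (three key columns, nine cells, one of them missing) so the call can actually be run; the column names and values are assumptions, not the author's data.

```r
library(tidyr)

# Hypothetical input: the original post does not show key_value
key_value <- data.frame(
  row  = c("r1", "r2", "r3"),
  key1 = c(1, 2, 3),
  key2 = c(4, NA, 6),
  key3 = c(7, 8, 9)
)

# Collapse key1..key3 into a key column and a value column,
# dropping the row whose value is NA
kv_gathered <- gather(key_value, key, value, key1:key3, na.rm = TRUE)
```

With this input the nine cells become eight rows, because na.rm = TRUE removes the single missing value.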

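The four main arguments of gather()
data: the data set
key: the name of the new column that will hold the original column names
value: the name of the new column that will hold the original values
...: the columns to gather; put a - in front of a column to exclude it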

gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
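     The first argument, data, is the original data and must be a data frame;
     next comes a key-value pair with names of your own choosing; these two become the headers of the new table, i.e. the two variable names;
     the fourth argument selects the columns to gather; if it is left out, all columns are gathered;
     the optional argument na.rm can follow; with na.rm = TRUE, missing values (NA) from the original table are dropped from the new table.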

A gather() example
First build a data frame stu:

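stu<-data.frame(grade=c("A","B","C","D","E"), female=c(5, 4, 1, 2, 3), male=c(1, 2, 3, 4, 5))  # head count by grade and gender
gather(stu,gender,count, -grade)
The first argument is the original data stu; the second and third are the key-value pair (gender, count); the fourth is an exclusion (leave out the grade column and gather only the remaining two).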
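
For comparison, tidyr 1.0.0 and later also offer pivot_longer(), which covers the same use case as gather(). The sketch below assumes such a tidyr version is installed and rebuilds stu so it can run on its own; the names long_gather and long_pivot are made up here.

```r
library(tidyr)

stu <- data.frame(grade  = c("A", "B", "C", "D", "E"),
                  female = c(5, 4, 1, 2, 3),
                  male   = c(1, 2, 3, 4, 5))

# The call from the text
long_gather <- gather(stu, gender, count, -grade)

# The same reshaping with pivot_longer() (assumes tidyr >= 1.0.0)
long_pivot <- pivot_longer(stu, cols = -grade,
                           names_to = "gender", values_to = "count")
```

Both results hold the same ten grade/gender/count combinations. The row order differs (gather stacks column by column, pivot_longer goes row by row), and pivot_longer returns a tibble.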

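separate() splits data: it pulls apart a column that actually contains two variables (in the gather example above the column names were themselves a variable, one value per column name).
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
The first argument is the data frame to work on;
the second argument is the column to split;
the third argument gives the names of the resulting columns (always more than one) as a character vector;
the fourth argument is the separator, given either as a regular expression or as a number, meaning the position at which to split.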

stu2<-data.frame(grade=c("A","B","C","D","E"), 
                 female_1=c(5, 4, 1, 2, 3), male_1=c(1, 2, 3, 4, 5),
                 female_2=c(4, 5, 1, 2, 3), male_2=c(0, 2, 3, 4, 6))
stu2
stu2_new<-gather(stu2,gender_class,count,-grade)
stu2_new

Now all we need to do is split the gender_class column:

separate(stu2_new,gender_class,c("gender","class"))
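
The call above leans on the default separator. As a sketch of the two ways of setting sep explicitly, here is one split on a regular expression and one on a character position. The small data frames are made up for the illustration.

```r
library(tidyr)

# Hypothetical inputs, not from the original post
df <- data.frame(gender_class = c("female_1", "male_2"),
                 count = c(5, 6),
                 stringsAsFactors = FALSE)
codes <- data.frame(code = c("A1", "B2"), stringsAsFactors = FALSE)

# sep as a regular expression: split at the underscore
by_regex <- separate(df, gender_class, c("gender", "class"), sep = "_")

# sep as a number: split after the first character
by_position <- separate(codes, code, c("letter", "digit"), sep = 1)
```

The new columns are character by default; adding convert = TRUE would ask separate() to turn class and digit into numbers.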

spread() widens a table: it takes the values of one column (key-value pairs) and spreads them out over several columns.

spread(data, key, value, fill = NA, convert = FALSE, drop =TRUE, sep = NULL)

key is the name of the column to spread (the variable name); value says which column of the original table supplies the values for the new columns.

name<-rep(c("Sally","Jeff","Roger","Karen","Brain"),c(2,2,2,2,2))
test<-rep(c("midterm","final"),5)
class1<-c("A","C",NA,NA,NA,NA,NA,NA,"B","B")
class2<-c(NA,NA,"D","E","C","A",NA,NA,NA,NA)
class3<-c("B","C",NA,NA,NA,NA,"C","C",NA,NA)
class4<-c(NA,NA,"A","C",NA,NA,"A","A",NA,NA)
class5<-c(NA,NA,NA,NA,"B","A",NA,NA,"A","C")
stu3<-data.frame(name,test,class1,class2,class3,class4,class5)
stu3
gather(stu3,class,grade, class1:class5, na.rm=TRUE)

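Use spread() to split the test column into two columns, midterm and final, whose values are the grades in the two classes each student took.
To repeat: the second argument is the name of the column to spread, and the third is the name of the column in the original table that supplies the values of the new columns.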

stu3_new<-gather(stu3, class, grade, class1:class5, na.rm = TRUE)
spread(stu3_new,test,grade)

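One last addition: the class column now looks a bit redundant, and plain numbers would be more concise. Use parse_number() from readr to pull out the number (this also uses mutate() from dplyr).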

library(readr)
library(dplyr)
mutate(spread(stu3_new,test,grade),class=parse_number(class))
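
As with gather(), tidyr 1.0.0 and later provide a newer counterpart to spread(), called pivot_wider(). The sketch below assumes such a version and uses a cut-down, made-up table (two students, one grade per test) rather than stu3.

```r
library(tidyr)

# Hypothetical long table, not from the original post
long <- data.frame(name  = rep(c("Sally", "Jeff"), each = 2),
                   test  = rep(c("midterm", "final"), 2),
                   grade = c("A", "C", "D", "E"),
                   stringsAsFactors = FALSE)

# The approach from the text
wide_spread <- spread(long, test, grade)

# The same reshaping with pivot_wider() (assumes tidyr >= 1.0.0)
wide_pivot <- pivot_wider(long, names_from = test, values_from = grade)
```

Both give one row per student with a midterm and a final column; only the row and column order differ.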

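unite: combine several columns into one

unite(data, col, ..., sep = "_", remove = TRUE)
data: the data frame
col: the name of the new combined column
...: the columns to combine
sep: the string placed between the combined values, an underscore by default
remove: whether to drop the columns that were combined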

First make up a data frame

set.seed(1)
date <- as.Date('2016-11-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)  # data.frame, so the data.table package is not needed

Combine the date, hour, min and second columns into a new column datetime
The date-time format in R is "Year-Month-Day Hour:Min:Second"

dataNew <- data %>%
  unite(datehour, date, hour, sep = ' ') %>%
  unite(datetime, datehour, min, second, sep = ':')
dataNew
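
To see what the two unite() calls produce, here is the same idea on two fixed rows instead of random samples. The values are made up so that the result is predictable.

```r
library(tidyr)

# Hypothetical fixed values instead of the random samples above
df <- data.frame(date   = as.Date("2016-11-01") + 0:1,
                 hour   = c(9, 17),
                 min    = c(5, 30),
                 second = c(0, 45))

step1 <- unite(df, datehour, date, hour, sep = " ")
out   <- unite(step1, datetime, datehour, min, second, sep = ":")
```

Note that unite() simply pastes the values together as text, so single-digit parts are not zero-padded; the first row here comes out as "2016-11-01 9:5:0".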

dplyr

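Main functions:

1. Select rows of a table: filter
2. Sort: arrange
3. Change columns of a table: mutate, transmute
    mutate keeps the columns from before the change as well as the changed ones
    transmute keeps only the changed columns and throws away the originals
4. Select columns of a table: select, rename
    select keeps only the columns you specify
    rename changes column names and keeps all other columns
5. group_by and summarize let you split the data into groups and analyze each group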

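Filtering: filter() takes a subset of the data, extracting the rows that meet a given logical condition.

For example, iris %>% filter(Sepal.Length > 6)

iris %>% filter(Species == "virginica") # rows that meet the condition
iris %>% filter(Species == 'virginica',Sepal.Length > 6)  # separate multiple conditions with commas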

Select the rows where Sepal.Length > 6.7 and Species is either "versicolor" or "virginica"

iris %>% filter(
  Sepal.Length > 6.7, 
  Species %in% c("versicolor", "virginica" )
)
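
As a quick sanity check on how comma-separated conditions combine, the sketch below writes the same filter once with commas and %in% and once with explicit & and |. The result names are made up for the illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
picked <- iris %>%
  filter(Sepal.Length > 6.7,
         Species %in% c("versicolor", "virginica"))

# The same condition spelled out with & and |
spelled_out <- iris %>%
  filter(Sepal.Length > 6.7 &
           (Species == "versicolor" | Species == "virginica"))
```

Both calls should return exactly the same rows, and none of them belong to setosa.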

Sorting: arrange() sorts the observations, in ascending order by default.

iris %>% arrange(Sepal.Length)  # ascending
iris %>% arrange(desc(Sepal.Length))  # descending
arrange(iris, -Sepal.Length)  # sort by Sepal.Length (descending)

Adding variables: mutate() updates an existing column of a data frame or adds a new one.

iris %>% mutate(Sepal.Length = Sepal.Length*10) # convert the column to mm
iris %>% mutate(SLMm = Sepal.Length * 10) # create a new column
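
transmute() is listed among the main functions but never shown, so here is a minimal side-by-side sketch. The result names kept and only_new are made up for the illustration.

```r
library(dplyr)

# mutate() adds the new column and keeps the five original ones
kept <- iris %>% mutate(SLMm = Sepal.Length * 10)

# transmute() returns only the columns it was asked to compute
only_new <- iris %>% transmute(SLMm = Sepal.Length * 10)
```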

select: choose specific columns

iris %>% select(1:3) # columns 1 through 3
iris %>% select(1,3) # columns 1 and 3
 
iris %>% select(Sepal.Length, Petal.Length)
iris %>% select(Sepal.Length:Petal.Length)
iris %>% select(starts_with("Petal"))  # Select column whose name starts with "Petal"
iris %>% select(ends_with("Width"))  # Select column whose name ends with "Width"
iris %>% select(contains("etal"))  # Select columns whose names contains "etal"
iris %>% select(matches(".t."))  # Select columns whose name matches a regular expression
iris %>% select(one_of(c("Sepal.Length", "Petal.Length")))  # selects variables provided in a character vector.

iris %>% select_if(is.numeric)  # select the columns that are numeric

# Drop columns (by column name)
iris %>% select(-Sepal.Length, -Petal.Length)  # Removing Sepal.Length and Petal.Length columns
iris %>% select(-(Sepal.Length:Petal.Length))  # Removing all columns from Sepal.Length to Petal.Length
iris %>% select(-starts_with("Petal"))  # Removing all columns whose name starts with "Petal"

# Drop columns by position
iris %>% select(-1)  # drop column 1
iris %>% select(-(1:3))   # drop columns 1 through 3
iris %>% select(-1, -3)   # drop columns 1 and 3
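
Since select() only changes which columns come back, an easy way to check a selection is to look at the resulting names. A small sketch, with variable names made up for the illustration:

```r
library(dplyr)

petal_cols          <- iris %>% select(starts_with("Petal")) %>% names()
without_first_three <- iris %>% select(-(1:3)) %>% names()
numeric_cols        <- iris %>% select_if(is.numeric) %>% names()
```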

rename(): rename columns

# Rename Sepal.Length to sepal_length and Sepal.Width to sepal_width:
iris %>% 
  rename(
    sepal_length = Sepal.Length,
    sepal_width = Sepal.Width
  )
# The same in base R, done on a copy so that iris keeps its original names for the later examples
iris2 <- iris
# Rename the column whose name is "Sepal.Length"
names(iris2)[names(iris2) == "Sepal.Length"] <- "sepal_length"
names(iris2)[names(iris2) == "Sepal.Width"] <- "sepal_width"
names(iris2)  # use names() or colnames() to get the column names
# Rename by column position
names(iris2)[1] <- "sepal_length"
names(iris2)[2] <- "sepal_width"

Chaining the functions together:

iris %>%
  filter(Species == "virginica") %>%   # species names in iris are lowercase
  mutate(SLMm = Sepal.Length * 10) %>% # sepal length in mm
  arrange(desc(SLMm))

summarize() collapses many values into a single data point.

iris %>% summarize(medianSL = median(Sepal.Length))

iris %>% 
  filter(Species == "virginica") %>%
  summarize(medianSL=median(Sepal.Length))
# Several summaries can be computed at once, separated by commas
iris %>% 
  filter(Species == "virginica") %>% 
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))

group_by() lets us summarize the data by the groups we specify instead of over the whole data frame

iris %>% 
  group_by(Species) %>% 
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length))

iris %>% 
  filter(Sepal.Length>6) %>% 
  group_by(Species) %>% 
  summarize(medianPL = median(Petal.Length), 
            maxPL = max(Petal.Length))
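
To make it clear what group_by() adds, the sketch below compares a grouped summary with the same medians computed in base R via tapply(). The names grouped and base_medians are made up, and n() is added only to show the group sizes.

```r
library(dplyr)

# One row per species instead of one row for the whole data frame
grouped <- iris %>%
  group_by(Species) %>%
  summarize(medianSL = median(Sepal.Length),
            n = n())

# The same medians computed with base R, for comparison
base_medians <- as.numeric(tapply(iris$Sepal.Length, iris$Species, median))
```

The grouped result has three rows, one per species, and its medianSL column should match the base R medians.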

ggplot2

# Scatter plot
# A scatter plot helps us understand how two variables relate; geom_point() draws one:
iris_small<- iris %>% 
  filter(Sepal.Length > 5)
ggplot(iris_small, aes(x=Petal.Length, y= Petal.Width)) + 
  geom_point()  

# Color
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species)) + 
  geom_point()

# Size
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width,
                       color = Species,
                       size = Sepal.Length)) + 
  geom_point()

# Facets
ggplot(iris_small, aes(x = Petal.Length,
                       y = Petal.Width)) + 
  geom_point() + 
  facet_wrap(~Species)


# Line chart
by_year <- gapminder %>% 
  group_by(year) %>% 
  summarize(medianGdpPerCap = median(gdpPercap))

ggplot(by_year, aes(x = year,
                    y = medianGdpPerCap)) +
  geom_line() + 
  expand_limits(y=0)

# Bar chart
by_species <- iris %>%  
  filter(Sepal.Length > 6) %>% 
  group_by(Species) %>% 
  summarize(medianPL=median(Petal.Length))

ggplot(by_species, aes(x = Species, y=medianPL)) + 
  geom_col()

# Histogram
ggplot(iris_small, aes(x = Petal.Length)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Box plot
ggplot(iris_small, aes(x=Species, y=Sepal.Length)) + 
  geom_boxplot()
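
The histogram call above triggers the stat_bin() message because no bin width is given. A minimal sketch of one way to address that follows; the binwidth of 0.25 and the axis labels are arbitrary choices made for this illustration, not values from the original post.

```r
library(ggplot2)

# Build the plot as an object; binwidth = 0.25 is an arbitrary choice
p <- ggplot(iris, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.25) +
  labs(x = "Petal length (cm)", y = "Count")
```

Printing p (or just typing its name at the console) draws the plot. Storing it in a variable first makes it easy to add further layers later.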

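Other packages for data import and modeling

DBI: connecting to databases
haven: reading SPSS, SAS and Stata data
httr: talking to web APIs
jsonlite: reading JSON data
readxl: reading Excel files
rvest: web scraping
xml2: reading XML data
modelr: modeling inside pipelines
broom: tidying the output of statistical models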