Study Group Day 6: Learning the R Package dplyr

2020-04-01 Pingouin

Learning R packages

Overview

Installing and loading

Search Google to find out which repository hosts the package

install.packages("packagename") # download from CRAN
BiocManager::install("packagename") # download from Bioconductor

Loading: either of the two functions works

library(packagename)
require(packagename)
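The two functions differ mainly when the package is missing: library() stops with an error, while require() returns FALSE and issues a warning. A minimal sketch, using a made-up package name (notARealPackage123 is hypothetical and assumed not to be installed):

```r
# "notARealPackage123" is a hypothetical name, assumed not to be installed
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) instead of stopping
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# library() raises an error, which is caught here so the script keeps running
lib_failed <- tryCatch({
  library(pkg, character.only = TRUE)
  FALSE
}, error = function(e) TRUE)

print(loaded)      # FALSE when the package is missing
print(lib_failed)  # TRUE when the package is missing
```

Because of this, library() is usually the safer choice at the top of a script: a missing package fails immediately rather than later.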

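Keyboard shortcuts

command+← # jump to the start of the current line

command+shift+← # select everything to the left of the cursor

command+enter # run the selected line(s)

command+shift+enter # run all the code

control+L # clear the console

command+shift+c # comment/uncomment lines

rm(list = ls()) # remove all objects from the environment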

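Five basic dplyr functions

This exercise uses the built-in iris dataset

library(dplyr) # load dplyr before calling its functions
test <- iris[c(1:2,51:52,101:102),] # keep rows 1-2, 51-52 and 101-102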

1.mutate(): add a column

mutate(test, new = Sepal.Length * Sepal.Width) # add a column named new
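Note that mutate() returns a modified copy and leaves test itself unchanged, so the result has to be assigned to be kept. A small sketch of this; the names test_new and ratio are illustrative and not part of the original notes:

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# mutate() can add several columns in one call; assign the result to keep it
test_new <- mutate(test,
                   new   = Sepal.Length * Sepal.Width,  # same column as above
                   ratio = Sepal.Length / Sepal.Width)  # illustrative extra column

ncol(test)      # 5: the original data frame is untouched
ncol(test_new)  # 7: two columns were added
```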

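2.select(): choose columns

Select by column number

select(test, 1) # select the first column of test
select(test, c(1, 5)) # select the first and fifth columns

Select by column name

select(test, Sepal.Length) # select the column named Sepal.Length
select(test, "Petal.Length") # a quoted column name also works

vars <- c("Petal.Length", "Species") # a character vector holding two column names
select(test, one_of(vars)) # select every column named in the character vector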

My understanding of one_of() is that it just lets you select variables using a character vector of their names instead of putting their names into the select() call, but then you get all of the variables whose names are in the vector, not just one of them.
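In dplyr 1.0.0 and later, one_of() is marked as superseded in favour of all_of() and any_of(). A hedged sketch of the difference, assuming a recent dplyr is installed; "NotAColumn" is a deliberately invalid name used only for illustration:

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]
vars <- c("Petal.Length", "Species")

# all_of(): strict, every name in the vector must exist
by_all <- select(test, all_of(vars))

# any_of(): lenient, names that do not exist are silently skipped
vars_extra <- c(vars, "NotAColumn")  # "NotAColumn" is a made-up name
by_any <- select(test, any_of(vars_extra))

# all_of() raises an error when a name is missing
strict_failed <- tryCatch({
  select(test, all_of(vars_extra))
  FALSE
}, error = function(e) TRUE)

names(by_all)
names(by_any)
strict_failed
```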

3.filter(): select rows

filter(test, Species == "setosa" & Sepal.Length > 5) # both conditions must hold
filter(test, Species %in% c("setosa","versicolor")) # rows whose Species is setosa or versicolor
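Conditions passed to filter() as separate arguments are combined with AND, so a comma behaves like &; use | for OR. A short sketch on the same six rows (the 6.5 threshold is an arbitrary value picked for illustration):

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# These two calls are equivalent: comma-separated conditions mean AND
a <- filter(test, Species == "setosa" & Sepal.Length > 5)
b <- filter(test, Species == "setosa", Sepal.Length > 5)

# OR: keep rows that satisfy at least one condition
c_or <- filter(test, Species == "setosa" | Sepal.Length > 6.5)

nrow(a)     # 1
nrow(b)     # 1
nrow(c_or)  # 3
```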

4.arrange(): sort rows

arrange(test, Sepal.Length) # ascending order by default
arrange(test, desc(Sepal.Length)) # descending order
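arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. As a sketch, the call below sorts by Species in reverse level order and then by Sepal.Length within each species:

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# First key: Species, descending (virginica, versicolor, setosa)
# Second key: Sepal.Length, ascending within each species
sorted <- arrange(test, desc(Species), Sepal.Length)

sorted$Sepal.Length  # 5.8 6.3 6.4 7.0 4.9 5.1
```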

5.summarise(): summarize

summarise(test, mean(Sepal.Length), sd(Sepal.Length)) # mean and standard deviation of Sepal.Length
group_by(test, Species) # group by Species
summarise(group_by(test, Species), mean(Sepal.Length), sd(Sepal.Length)) # mean and standard deviation within each group
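Unnamed summaries get column names such as `mean(Sepal.Length)`, which are awkward to reference later. Naming each summary avoids that; a sketch in which mean_len, sd_len and n are illustrative names:

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# Name each summary so the output columns are easy to use afterwards
res <- summarise(group_by(test, Species),
                 mean_len = mean(Sepal.Length),
                 sd_len   = sd(Sepal.Length),
                 n        = n())  # n() counts the rows in each group

names(res)    # "Species" "mean_len" "sd_len" "n"
res$mean_len  # 5.00 6.70 6.05
```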

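6.The pipe

The pipe makes code more concise: you no longer have to write test at every step

library(tidyverse) # loads dplyr, which provides %>%
test %>% 
  group_by(Species) %>% 
  summarise(mean(Sepal.Length), sd(Sepal.Length))
# note: run the three piped lines together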
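The pipe passes the left-hand side as the first argument of the next call, so a piped chain is just a more readable way to write nested calls. A small sketch comparing the two forms (the filter threshold is an arbitrary choice for illustration):

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# Nested form: read from the inside out
nested <- arrange(filter(test, Sepal.Length > 5), desc(Sepal.Length))

# Piped form: read from top to bottom
piped <- test %>%
  filter(Sepal.Length > 5) %>%
  arrange(desc(Sepal.Length))

all.equal(nested, piped)  # both forms give the same result
```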

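7.count(): count the unique values in a column

count(test, Species)

P.S. Type the column name first and let auto-completion finish it, then add the double quotes; this avoids typos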
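count() can be thought of as shorthand for grouping and then counting rows. A sketch showing the longer equivalent with group_by() and summarise():

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# Short form
a <- count(test, Species)

# Longer equivalent: group, then count the rows in each group with n()
b <- summarise(group_by(test, Species), n = n())

a$n  # 2 2 2
b$n  # 2 2 2
```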

Working with relational data in dplyr

Take care not to introduce factors. Two data frames serve as the example

test1 <- data.frame(x = c('b','e','f','x'), 
                    z = c("A","B","C",'D'),
                    stringsAsFactors = FALSE)
test2 <- data.frame(x = c('a','b','c','d','e','f'),
                    y = c(1,2,3,4,5,6),
                    stringsAsFactors = FALSE)

> test1
  x z
1 b A
2 e B
3 f C
4 x D
> test2
  x y
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 f 6
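The stringsAsFactors argument decides whether character columns are stored as factors. Since R 4.0.0 the default is FALSE, but setting it explicitly keeps the behaviour the same on older versions. A minimal sketch with two throwaway data frames (with_factor and without_factor are illustrative names):

```r
# Same data, built with and without factor conversion
with_factor    <- data.frame(x = c("b", "e"), stringsAsFactors = TRUE)
without_factor <- data.frame(x = c("b", "e"), stringsAsFactors = FALSE)

class(with_factor$x)     # "factor"
class(without_factor$x)  # "character"
```

Keeping join keys as plain character vectors avoids surprises when the two tables would otherwise carry factors with different levels.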

1.inner_join()

inner_join(test1, test2, by = "x") # intersection: keep only rows whose x appears in both tables

# output
  x z y
1 b A 2
2 e B 5
3 f C 6
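When the key column has a different name in each table, by accepts a named vector of the form c("name_in_x" = "name_in_y"). A sketch assuming a hypothetical copy of test2 whose key column is called key instead of x:

```r
library(dplyr)

test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)

# Hypothetical variant of test2 with a differently named key column
test2_key <- data.frame(key = c("a", "b", "c", "d", "e", "f"),
                        y = c(1, 2, 3, 4, 5, 6),
                        stringsAsFactors = FALSE)

# Match test1$x against test2_key$key; the output keeps the name from the first table
joined <- inner_join(test1, test2_key, by = c("x" = "key"))

names(joined)  # "x" "z" "y"
joined$y       # 2 5 6
```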

2.left_join()

left_join(test1, test2, by = "x") # test1 is the base: its rows with no match in test2 get NA in the test2 columns
left_join(test2, test1, by = "x") # test2 is the base: its rows with no match in test1 get NA in the test1 columns

#output1
  x z  y
1 b A  2
2 e B  5
3 f C  6
4 x D NA
#output2
  x y    z
1 a 1 <NA>
2 b 2    A
3 c 3 <NA>
4 d 4 <NA>
5 e 5    B
6 f 6    C
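A quick way to see how many rows of the base table found no partner is to count the NA values in a column that comes from the other table. A sketch using the same two data frames:

```r
library(dplyr)

test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                    y = c(1, 2, 3, 4, 5, 6),
                    stringsAsFactors = FALSE)

# y comes from test2, so NA in y marks a test1 row without a match
unmatched_1 <- sum(is.na(left_join(test1, test2, by = "x")$y))

# z comes from test1, so NA in z marks a test2 row without a match
unmatched_2 <- sum(is.na(left_join(test2, test1, by = "x")$z))

unmatched_1  # 1 (the row with x = "x")
unmatched_2  # 3 (the rows with x = "a", "c", "d")
```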

3.full_join()

full_join(test1,test2,by="x")

#output
  x    z  y
1 b    A  2
2 e    B  5
3 f    C  6
4 x    D NA
5 a <NA>  1
6 c <NA>  3
7 d <NA>  4

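4.semi_join()

Returns all rows of table x that have a match in table y; only the columns of x are kept

semi_join(test1, test2, by = "x")
semi_join(x = test1, y = test2, by = 'x') # the same call with named arguments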

#output
  x z
1 b A
2 e B
3 f C

5.anti_join()

Returns all rows of table x that have no match in table y

anti_join(x = test2, y = test1, by = 'x')

#output
  x y
1 a 1
2 c 3
3 d 4
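Both filtering joins can be read as a filter() on the key column: semi_join() keeps keys that appear in the other table, anti_join() keeps those that do not. A sketch of this equivalence for a single key column:

```r
library(dplyr)

test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                    y = c(1, 2, 3, 4, 5, 6),
                    stringsAsFactors = FALSE)

# semi_join(test1, test2): rows of test1 whose x occurs in test2
semi_manual <- filter(test1, x %in% test2$x)

# anti_join(test2, test1): rows of test2 whose x does not occur in test1
anti_manual <- filter(test2, !(x %in% test1$x))

semi_manual$x  # "b" "e" "f"
anti_manual$x  # "a" "c" "d"
```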

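6.bind

bind_rows(test1, test2) # stack rows; columns are matched by name and missing ones are filled with NA
# rbind(test1, test2) # base R version; it errors here because the column names differ (z vs. y)

# test3 is not defined in the original notes; this 6-row data frame is a hypothetical stand-in
test3 <- data.frame(w = 101:106)
cbind(test2, test3) # combine columns; the row counts must match
bind_cols(test2, test3)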
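When stacking tables it is often useful to record which table each row came from; bind_rows() supports this through its .id argument. A sketch with two small data frames (a, b and the labels first and second are illustrative):

```r
library(dplyr)

a <- data.frame(x = c("b", "e"), z = c("A", "B"), stringsAsFactors = FALSE)
b <- data.frame(x = c("a", "b"), y = c(1, 2), stringsAsFactors = FALSE)

# Named inputs plus .id add a column recording the source of each row;
# columns missing from one input are filled with NA
stacked <- bind_rows(first = a, second = b, .id = "source")

names(stacked)   # "source" "x" "z" "y"
stacked$source   # "first" "first" "second" "second"
```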

Reference

生信星球 course, Day 6