46.关于Mutating joins类函数和merge函数
【上一篇:45.关于关系型数据的开篇】
【下一篇:47.关于Filtering joins类函数】
上篇说了两种关系型数据中的需求,处理关系型数据有三大类函数:Mutating joins、Filtering joins、Set Operations。
Mutating joins类函数包括inner_join()、left_join()、right_join()、full_join()。inner_join()函数术语inner join,另外三个属于outer join。另外base R中的merge()函数也能通过设置不同参数分别实现这四种功能。Mutating joins类函数的返回结果是为original table增加新的列,即variables。
Filtering joins类函数包括semi_join()、anti_join(),返回的结果是过滤后的observations。
Set Operations类函数(集合操作类函数)包括intersect()、union()、setdiff(),用来比较两个数据框,返回两个数据框行的交集、并集和差集。
Mutating joins类的四个函数的Usage是一样的:
Mutating joins类函数
x和y:数据框
by:向量,表示根据两个表格中的哪些列match
copy:貌似不懂的样子!
suffix:包含两个值的向量,用于合并后区分相同的列名,下面再详细讲
keep:逻辑值,是否来自x和y的join keys都输出,默认为FALSE
inner_join()的返回结果是x和y中都有的行。
left_join()的返回结果是x中所有的行。
right_join()的返回结果是y中所有的行。
full_join()的返回结果是x或y中的所有行。
outer join类中match后缺失的值直接用NA填充。
返回的结果用Veen图表示大概是这个样子的:
by默认是NULL,表示根据x和y中的公共列进行match,这种情况下R同时会打印出公共列都是什么,便于检查;可以直接指定一个变量名,也可以指定一个向量,这两种情况的前提是x和y中都有这些变量(相同变量名)。如果x和y中你用来match的列名不同,可以设置by=c("a"="b"),即x$a=y$b列,多个不同列可以设置by=c("a1"="b1","a2"="b2","a3"="b3"),即x$a1=y$b1、x$a2=y$b2、x$a3=y$b3。
test1<-tribble(
~name,~sex,~score,
"ZhangS","male",90,
"LiS","male",80,
"WangJ","female",85,
"MuQ","female",95,
"NieP","male",100,
"ZhaoY","female",93,
"JiaW","female",98,
)
test2<-tribble(
~name,~Name,~address,~sex,
"ZhangS","ZhangS","Beijing","male",
"LiS","LiS","Beijing","male",
"WangJ","WangJ","Tanjin","female",
"MuQ","MuQ","Taijin","female",
"NieP","NieP","Henan","male",
"ZhaoY","ZhaoY","Henan","female",
"Qiany","Qiany","Xinjiang","male",
)
# by=NULL,自动打印实际使用的by值,即by = c("name", "sex")
> inner_join(test1,test2)
Joining, by = c("name", "sex")
# A tibble: 6 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 LiS male 80 LiS Beijing
3 WangJ female 85 WangJ Tanjin
4 MuQ female 95 MuQ Taijin
5 NieP male 100 NieP Henan
6 ZhaoY female 93 ZhaoY Henan
# 直接指定两个数据框中的都有的列名
> left_join(test1,test2,by="name")
# A tibble: 7 x 6
name sex.x score Name address sex.y
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing male
2 LiS male 80 LiS Beijing male
3 WangJ female 85 WangJ Tanjin female
4 MuQ female 95 MuQ Taijin female
5 NieP male 100 NieP Henan male
6 ZhaoY female 93 ZhaoY Henan female
7 JiaW female 98 NA NA NA
# 用test1中的name和test2中的Name进行match
> right_join(test1,test2,by=c("name"="Name"))
# A tibble: 7 x 6
name sex.x score name.y address sex.y
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing male
2 LiS male 80 LiS Beijing male
3 WangJ female 85 WangJ Tanjin female
4 MuQ female 95 MuQ Taijin female
5 NieP male 100 NieP Henan male
6 ZhaoY female 93 ZhaoY Henan female
7 Qiany NA NA Qiany Xinjiang male
# 改变相同列名的后缀
> full_join(test1,test2,by="name",suffix=c(".test1",".test2"))
# A tibble: 8 x 6
name sex.test1 score Name address sex.test2
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing male
2 LiS male 80 LiS Beijing male
3 WangJ female 85 WangJ Tanjin female
4 MuQ female 95 MuQ Taijin female
5 NieP male 100 NieP Henan male
6 ZhaoY female 93 ZhaoY Henan female
7 JiaW female 98 NA NA NA
8 Qiany NA NA Qiany Xinjiang male
上面示例考虑的情况都是没有重复值的情况,实际中重复值出现很常见,那这四个函数又是怎么处理重复值的呢?
对于重复值,四个函数返回的结果中会保留所有重复的行。
基于上面的举例进行说明:
1) 如果test1中有两个ZhangS,test2不变,则:
test3<-tribble(
~name,~sex,~score,
"ZhangS","male",90,
"ZhangS","male",98,
"LiS","male",80,
"WangJ","female",85,
"MuQ","female",95,
"NieP","male",100,
"ZhaoY","female",93,
"JiaW","female",98,
)
> inner_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 7 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 ZhangS male 98 ZhangS Beijing
3 LiS male 80 LiS Beijing
4 WangJ female 85 WangJ Tanjin
5 MuQ female 95 MuQ Taijin
6 NieP male 100 NieP Henan
7 ZhaoY female 93 ZhaoY Henan
> left_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 ZhangS male 98 ZhangS Beijing
3 LiS male 80 LiS Beijing
4 WangJ female 85 WangJ Tanjin
5 MuQ female 95 MuQ Taijin
6 NieP male 100 NieP Henan
7 ZhaoY female 93 ZhaoY Henan
8 JiaW female 98 NA NA
> right_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 ZhangS male 98 ZhangS Beijing
3 LiS male 80 LiS Beijing
4 WangJ female 85 WangJ Tanjin
5 MuQ female 95 MuQ Taijin
6 NieP male 100 NieP Henan
7 ZhaoY female 93 ZhaoY Henan
8 Qiany male NA Qiany Xinjiang
> full_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 9 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 ZhangS male 98 ZhangS Beijing
3 LiS male 80 LiS Beijing
4 WangJ female 85 WangJ Tanjin
5 MuQ female 95 MuQ Taijin
6 NieP male 100 NieP Henan
7 ZhaoY female 93 ZhaoY Henan
8 JiaW female 98 NA NA
9 Qiany male NA Qiany Xinjiang
2) 如果test1不变,test2中有两个LiS,则:
test4<-tribble(
~name,~Name,~address,~sex,
"ZhangS","ZhangS","Beijing","male",
"LiS","LiS","Beijing","male",
"LiS","LiS","Beijing_new","male",
"WangJ","WangJ","Tanjin","female",
"MuQ","MuQ","Taijin","female",
"NieP","NieP","Henan","male",
"ZhaoY","ZhaoY","Henan","female",
"Qiany","Qiany","Xinjiang","male",
)
> inner_join(test1,test4)
Joining, by = c("name", "sex")
# A tibble: 7 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 LiS male 80 LiS Beijing
3 LiS male 80 LiS Beijing_new
4 WangJ female 85 WangJ Tanjin
5 MuQ female 95 MuQ Taijin
6 NieP male 100 NieP Henan
7 ZhaoY female 93 ZhaoY Henan
其他的可以自己尝试
3) 如果test1和test2中均有重复值,则:
> inner_join(test3,test4)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
name sex score Name address
<chr> <chr> <dbl> <chr> <chr>
1 ZhangS male 90 ZhangS Beijing
2 ZhangS male 98 ZhangS Beijing
3 LiS male 80 LiS Beijing
4 LiS male 80 LiS Beijing_new
5 WangJ female 85 WangJ Tanjin
6 MuQ female 95 MuQ Taijin
7 NieP male 100 NieP Henan
8 ZhaoY female 93 ZhaoY Henan
可以看出来test1中的两行ZhangS和test2中的两行LiS分别组合了,生成了2x2=4行
关于输出:这四个函数返回的都是一个与x类型相同的对象,尽可能地保留x中行和列的顺序。输出结果中包含x和y中的所有列,当然by指定的列是合并的,除by指定的列之外其他名字相同的列通过加后缀加以区分。
dplyr包中链接函数和merge函数的比较:
dplyr和merge函数的比较
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
关于by的使用:
如果指定by,则不用指定by.x和by.y了