46.关于Mutating joins类函数和merge函数

2021-08-29  本文已影响0人  心惊梦醒

【上一篇:45.关于关系型数据的开篇】
【下一篇:47.关于Filtering joins类函数】

    上篇说了两种关系型数据中的需求,处理关系型数据有三大类函数:Mutating joins、Filtering joins、Set Operations。
    Mutating joins类函数包括inner_join()、left_join()、right_join()、full_join()。inner_join()函数术语inner join,另外三个属于outer join。另外base R中的merge()函数也能通过设置不同参数分别实现这四种功能。Mutating joins类函数的返回结果是为original table增加新的列,即variables。
    Filtering joins类函数包括semi_join()、anti_join(),返回的结果是过滤后的observations。
    Set Operations类函数(集合操作类函数)包括intersect()、union()、setdiff(),用来比较两个数据框,返回两个数据框行的交集、并集和差集。


    Mutating joins类的四个函数的Usage是一样的:


Mutating joins类函数
x和y:数据框
by:向量,表示根据两个表格中的哪些列match
copy:貌似不懂的样子!
suffix:包含两个值的向量,用于合并后区分相同的列名,下面再详细讲
keep:逻辑值,是否来自x和y的join keys都输出,默认为FALSE

    inner_join()的返回结果是x和y中都有的行。
    left_join()的返回结果是x中所有的行。
    right_join()的返回结果是y中所有的行。
    full_join()的返回结果是x或y中的所有行。
    outer join类中match后缺失的值直接用NA填充。
    返回的结果用Veen图表示大概是这个样子的:

四个函数返回结果
    by默认是NULL,表示根据x和y中的公共列进行match,这种情况下R同时会打印出公共列都是什么,便于检查;可以直接指定一个变量名,也可以指定一个向量,这两种情况的前提是x和y中都有这些变量(相同变量名)。如果x和y中你用来match的列名不同,可以设置by=c("a"="b"),即x$a=y$b列,多个不同列可以设置by=c("a1"="b1","a2"="b2","a3"="b3"),即x$a1=y$b1、x$a2=y$b2、x$a3=y$b3。
test1<-tribble(
~name,~sex,~score,
"ZhangS","male",90,
"LiS","male",80,
"WangJ","female",85,
"MuQ","female",95,
"NieP","male",100,
"ZhaoY","female",93,
"JiaW","female",98,
)

test2<-tribble(
~name,~Name,~address,~sex,
"ZhangS","ZhangS","Beijing","male",
"LiS","LiS","Beijing","male",
"WangJ","WangJ","Tanjin","female",
"MuQ","MuQ","Taijin","female",
"NieP","NieP","Henan","male",
"ZhaoY","ZhaoY","Henan","female",
"Qiany","Qiany","Xinjiang","male",
)

# by=NULL,自动打印实际使用的by值,即by = c("name", "sex")
> inner_join(test1,test2)
Joining, by = c("name", "sex")
# A tibble: 6 x 5
name   sex    score Name   address
<chr>  <chr>  <dbl> <chr>  <chr>  
1 ZhangS male      90 ZhangS Beijing
2 LiS    male      80 LiS    Beijing
3 WangJ  female    85 WangJ  Tanjin 
4 MuQ    female    95 MuQ    Taijin 
5 NieP   male     100 NieP   Henan  
6 ZhaoY  female    93 ZhaoY  Henan  

# 直接指定两个数据框中的都有的列名
> left_join(test1,test2,by="name")
# A tibble: 7 x 6
name   sex.x  score Name   address sex.y 
<chr>  <chr>  <dbl> <chr>  <chr>   <chr> 
1 ZhangS male      90 ZhangS Beijing male  
2 LiS    male      80 LiS    Beijing male  
3 WangJ  female    85 WangJ  Tanjin  female
4 MuQ    female    95 MuQ    Taijin  female
5 NieP   male     100 NieP   Henan   male  
6 ZhaoY  female    93 ZhaoY  Henan   female
7 JiaW   female    98 NA     NA      NA    

# 用test1中的name和test2中的Name进行match
> right_join(test1,test2,by=c("name"="Name"))
# A tibble: 7 x 6
name   sex.x  score name.y address  sex.y 
<chr>  <chr>  <dbl> <chr>  <chr>    <chr> 
  1 ZhangS male      90 ZhangS Beijing  male  
2 LiS    male      80 LiS    Beijing  male  
3 WangJ  female    85 WangJ  Tanjin   female
4 MuQ    female    95 MuQ    Taijin   female
5 NieP   male     100 NieP   Henan    male  
6 ZhaoY  female    93 ZhaoY  Henan    female
7 Qiany  NA        NA Qiany  Xinjiang male 

# 改变相同列名的后缀
> full_join(test1,test2,by="name",suffix=c(".test1",".test2"))
# A tibble: 8 x 6
name   sex.test1 score Name   address  sex.test2
<chr>  <chr>     <dbl> <chr>  <chr>    <chr>    
  1 ZhangS male         90 ZhangS Beijing  male     
2 LiS    male         80 LiS    Beijing  male     
3 WangJ  female       85 WangJ  Tanjin   female   
4 MuQ    female       95 MuQ    Taijin   female   
5 NieP   male        100 NieP   Henan    male     
6 ZhaoY  female       93 ZhaoY  Henan    female   
7 JiaW   female       98 NA     NA       NA       
8 Qiany  NA           NA Qiany  Xinjiang male

    上面示例考虑的情况都是没有重复值的情况,实际中重复值出现很常见,那这四个函数又是怎么处理重复值的呢?
    对于重复值,四个函数返回的结果中会保留所有重复的行
    基于上面的举例进行说明:
    1) 如果test1中有两个ZhangS,test2不变,则:

test3<-tribble(
  ~name,~sex,~score,
  "ZhangS","male",90,
  "ZhangS","male",98,
  "LiS","male",80,
  "WangJ","female",85,
  "MuQ","female",95,
  "NieP","male",100,
  "ZhaoY","female",93,
  "JiaW","female",98,
)
> inner_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 7 x 5
name   sex    score Name   address
<chr>  <chr>  <dbl> <chr>  <chr>  
  1 ZhangS male      90 ZhangS Beijing
2 ZhangS male      98 ZhangS Beijing
3 LiS    male      80 LiS    Beijing
4 WangJ  female    85 WangJ  Tanjin 
5 MuQ    female    95 MuQ    Taijin 
6 NieP   male     100 NieP   Henan  
7 ZhaoY  female    93 ZhaoY  Henan
> left_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
name   sex    score Name   address
<chr>  <chr>  <dbl> <chr>  <chr>  
  1 ZhangS male      90 ZhangS Beijing
2 ZhangS male      98 ZhangS Beijing
3 LiS    male      80 LiS    Beijing
4 WangJ  female    85 WangJ  Tanjin 
5 MuQ    female    95 MuQ    Taijin 
6 NieP   male     100 NieP   Henan  
7 ZhaoY  female    93 ZhaoY  Henan  
8 JiaW   female    98 NA     NA
> right_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
name   sex    score Name   address 
<chr>  <chr>  <dbl> <chr>  <chr>   
  1 ZhangS male      90 ZhangS Beijing 
2 ZhangS male      98 ZhangS Beijing 
3 LiS    male      80 LiS    Beijing 
4 WangJ  female    85 WangJ  Tanjin  
5 MuQ    female    95 MuQ    Taijin  
6 NieP   male     100 NieP   Henan   
7 ZhaoY  female    93 ZhaoY  Henan   
8 Qiany  male      NA Qiany  Xinjiang
> full_join(test3,test2)
Joining, by = c("name", "sex")
# A tibble: 9 x 5
name   sex    score Name   address 
<chr>  <chr>  <dbl> <chr>  <chr>   
  1 ZhangS male      90 ZhangS Beijing 
2 ZhangS male      98 ZhangS Beijing 
3 LiS    male      80 LiS    Beijing 
4 WangJ  female    85 WangJ  Tanjin  
5 MuQ    female    95 MuQ    Taijin  
6 NieP   male     100 NieP   Henan   
7 ZhaoY  female    93 ZhaoY  Henan   
8 JiaW   female    98 NA     NA      
9 Qiany  male      NA Qiany  Xinjiang

    2) 如果test1不变,test2中有两个LiS,则:

test4<-tribble(
  ~name,~Name,~address,~sex,
  "ZhangS","ZhangS","Beijing","male",
  "LiS","LiS","Beijing","male",
  "LiS","LiS","Beijing_new","male",
  "WangJ","WangJ","Tanjin","female",
  "MuQ","MuQ","Taijin","female",
  "NieP","NieP","Henan","male",
  "ZhaoY","ZhaoY","Henan","female",
  "Qiany","Qiany","Xinjiang","male",
)
> inner_join(test1,test4)
Joining, by = c("name", "sex")
# A tibble: 7 x 5
  name   sex    score Name   address    
  <chr>  <chr>  <dbl> <chr>  <chr>      
1 ZhangS male      90 ZhangS Beijing    
2 LiS    male      80 LiS    Beijing    
3 LiS    male      80 LiS    Beijing_new
4 WangJ  female    85 WangJ  Tanjin     
5 MuQ    female    95 MuQ    Taijin     
6 NieP   male     100 NieP   Henan      
7 ZhaoY  female    93 ZhaoY  Henan 
其他的可以自己尝试

    3) 如果test1和test2中均有重复值,则:

> inner_join(test3,test4)
Joining, by = c("name", "sex")
# A tibble: 8 x 5
  name   sex    score Name   address    
  <chr>  <chr>  <dbl> <chr>  <chr>      
1 ZhangS male      90 ZhangS Beijing    
2 ZhangS male      98 ZhangS Beijing    
3 LiS    male      80 LiS    Beijing    
4 LiS    male      80 LiS    Beijing_new
5 WangJ  female    85 WangJ  Tanjin     
6 MuQ    female    95 MuQ    Taijin     
7 NieP   male     100 NieP   Henan      
8 ZhaoY  female    93 ZhaoY  Henan

可以看出来test1中的两行ZhangS和test2中的两行LiS分别组合了,生成了2x2=4行

    关于输出:这四个函数返回的都是一个与x类型相同的对象,尽可能地保留x中行和列的顺序。输出结果中包含x和y中的所有列,当然by指定的列是合并的,除by指定的列之外其他名字相同的列通过加后缀加以区分。


    dplyr包中链接函数和merge函数的比较:


dplyr和merge函数的比较
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, ...)

关于by的使用:
如果指定by,则不用指定by.x和by.y了

【上一篇:45.关于关系型数据的开篇】
【下一篇:47.关于Filtering joins类函数】

上一篇下一篇

猜你喜欢

热点阅读