[R语言] Join 连接《R for data science

2020-04-22 本文已影响0人半为花间酒

《R for Data Science》第十三章 Relational data 啃书知识点积累
参考链接：

R for Data Science

R语言中dplyr包join函数之目前我看到过的最形象的教程

Introduction

本章节的应用前提需要了解nycflights13各数据集的内容

library(nycflights13)

?flights # dplyr用的很熟悉就不查了
?airlines
?airports
?planes
?weather

画模式图

涉及了datamodelr和DiagrammeR

library(datamodelr,DiagrammeR)

dm <- dm_from_data_frames(list(
  #  指定数据集来源
  Master = Lahman::Master,
  Managers = Lahman::Managers,
  AwardsManagers = Lahman::AwardsManagers
)) %>%
  # 设置主键
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
  dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
  # 设置外键
  dm_add_references(
    Managers$playerID == Master$playerID,
    AwardsManagers$playerID == Master$playerID
  )

# 绘图
# rankdir可以指定'BT', 'RL'等
# columnArrows默认为F，只是数据集间粗略箭头
dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
  dm_render_graph()

There are three families of verbs designed to work with relational data:

Mutating joins, which add new variables to one data frame from matching observations in another.

Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.

Set operations, which treat observations as if they were set elements.

Keys

There are two types of keys:

A primary key uniquely identifies an observation in its own table.

A foreign key uniquely identifies an observation in another table.

确认是不是主键：分组后各组仅一个观测

planes %>% 
  count(tailnum) %>% 
  filter(n > 1) %>% 
  nrow()
# [1] 0

代理键的概念：

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

flights %>% 
  mutate(id = row_number()) %>% 
  select(id,everything())

Mutating joins

flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  left_join(airlines, by = "carrier")


# 也可以利用mutate和match完成left_join相同操作，但麻烦很多
flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

- inner_join

The most important property of an inner join is that unmatched rows are not included in the result.

- Outer joins

A left join keeps all observations in x.
A right join keeps all observations in y.
A full join keeps all observations in x and y.

The left join should be your default join

base::merge()有类似join的功能

- Defining the key columns

The default, by = NULL, uses all variables that appear in both tables, the so called natural join
A character vector,by = "x". This is like a natural join, but uses only some of the common variables.

flights %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>

A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

flights %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
#> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#> #   tzone <chr>

- Q: 画出航班的大致起落

flights %>%
  inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
             by = "origin") %>% 
  inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
             by = "dest") %>% 
  # 取子集，类似filter(row_number() <= 100)
  slice(1:100) %>%
  ggplot(aes(
    x = ori_lon, xend = des_lon,
    y = ori_lat, yend = des_lat
  )) +
  # 美国地图
  borders("state") +
  # 调整地图到合适比例
  coord_quickmap() +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  labs(y = "Latitude", x = "Longitude")

Filtering joins

semi_join(x, y) keeps all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows.
anti_join(x, y) drops all observations in x that have a match in y.、
Anti-joins are useful for diagnosing join mismatches.

x <- tribble(
  ~key, ~val_x,
  # --/---
    1, "x1",
    2, "x2",
    3, "x3"
)

y <- tribble(
  ~key, ~val_y,
  # --/---
    1, "y1",
    2, "y2",
    4, "y3"
)

semi_join(x,y,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 x1   
# 2     2 x2 

semi_join(y,x,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 y1   
# 2     2 y2

anti_join(x,y,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     3 x3 

anti_join(y,x,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     4 y3 


# 另一个例子，获取飞往受欢迎目的地的所有航班
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  slice(1:10)

flights %>% 
  filter(dest %in% top_dest$dest)

# 用半连接
flights %>% 
  semi_join(top_dest)

Set operations

intersect(x, y): return only observations in both x and y.

union(x, y): return unique observations in x and y.

setdiff(x, y): return observations in x, but not in y.

These expect the x and y inputs to have the same variables

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     1     2

setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1

setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2