R for Data Science

[R语言] Join 连接《R for data science

2020-04-22  本文已影响0人  半为花间酒

《R for Data Science》第十三章 Relational data 啃书知识点积累
参考链接:

  1. R for Data Science
  2. R语言中dplyr包join函数之目前我看到过的最形象的教程

Introduction

本章节的应用前提需要了解nycflights13各数据集的内容

library(nycflights13)

?flights # dplyr用的很熟悉就不查了
?airlines
?airports
?planes
?weather

画模式图

涉及了datamodelrDiagrammeR

library(datamodelr,DiagrammeR)

dm <- dm_from_data_frames(list(
  #  指定数据集来源
  Master = Lahman::Master,
  Managers = Lahman::Managers,
  AwardsManagers = Lahman::AwardsManagers
)) %>%
  # 设置主键
  dm_set_key("Master", "playerID") %>%
  dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
  dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
  # 设置外键
  dm_add_references(
    Managers$playerID == Master$playerID,
    AwardsManagers$playerID == Master$playerID
  )

# 绘图
# rankdir可以指定'BT', 'RL'等
# columnArrows默认为F,只是数据集间粗略箭头
dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
  dm_render_graph()

There are three families of verbs designed to work with relational data:

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
  • Set operations, which treat observations as if they were set elements.

Keys

There are two types of keys:

  • A primary key uniquely identifies an observation in its own table.
  • A foreign key uniquely identifies an observation in another table.
planes %>% 
  count(tailnum) %>% 
  filter(n > 1) %>% 
  nrow()
# [1] 0

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.

flights %>% 
  mutate(id = row_number()) %>% 
  select(id,everything())

Mutating joins

flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  left_join(airlines, by = "carrier")


# 也可以利用mutate和match完成left_join相同操作,但麻烦很多
flights %>% 
  select(year:day, hour,tailnum, carrier) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])

- inner_join

The most important property of an inner join is that unmatched rows are not included in the result.

- Outer joins

  1. A left join keeps all observations in x.
  2. A right join keeps all observations in y.
  3. A full join keeps all observations in x and y.

The left join should be your default join

- Defining the key columns

  1. The default, by = NULL, uses all variables that appear in both tables, the so called natural join
  2. A character vector,by = "x". This is like a natural join, but uses only some of the common variables.
flights %>% 
  left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#>   year.x month   day  hour origin dest  tailnum carrier year.y type 
#>    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
#> 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixe…
#> 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixe…
#> 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixe…
#> 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixe…
#> 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixe…
#> 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  1. A named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
flights %>% 
  left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#>    year month   day  hour origin dest  tailnum carrier name    lat   lon   alt
#>   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
#> 1  2013     1     1     5 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
#> 2  2013     1     1     5 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
#> 3  2013     1     1     5 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
#> 4  2013     1     1     5 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
#> 5  2013     1     1     6 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
#> 6  2013     1     1     5 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
#> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#> #   tzone <chr>

- Q: 画出航班的大致起落

flights %>%
  inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
             by = "origin") %>% 
  inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
             by = "dest") %>% 
  # 取子集,类似filter(row_number() <= 100)
  slice(1:100) %>%
  ggplot(aes(
    x = ori_lon, xend = des_lon,
    y = ori_lat, yend = des_lat
  )) +
  # 美国地图
  borders("state") +
  # 调整地图到合适比例
  coord_quickmap() +
  geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
  labs(y = "Latitude", x = "Longitude")

Filtering joins

  1. semi_join(x, y) keeps all observations in x that have a match in y.
    Semi-joins are useful for matching filtered summary tables back to the original rows.
  2. anti_join(x, y) drops all observations in x that have a match in y.、
    Anti-joins are useful for diagnosing join mismatches.
x <- tribble(
  ~key, ~val_x,
  # --/---
    1, "x1",
    2, "x2",
    3, "x3"
)

y <- tribble(
  ~key, ~val_y,
  # --/---
    1, "y1",
    2, "y2",
    4, "y3"
)

semi_join(x,y,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 x1   
# 2     2 x2 

semi_join(y,x,by='key')
# # A tibble: 2 x 2
#     key val_x
#   <dbl> <chr>
# 1     1 y1   
# 2     2 y2

anti_join(x,y,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     3 x3 

anti_join(y,x,by='key')
# # A tibble: 1 x 2
#     key val_x
#   <dbl> <chr>
# 1     4 y3 


# 另一个例子,获取飞往受欢迎目的地的所有航班
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  slice(1:10)

flights %>% 
  filter(dest %in% top_dest$dest)

# 用半连接
flights %>% 
  semi_join(top_dest)

Set operations

  1. intersect(x, y): return only observations in both x and y.
  2. union(x, y): return unique observations in x and y.
  3. setdiff(x, y): return observations in x, but not in y.

These expect the x and y inputs to have the same variables

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1

union(df1, df2)
#> # A tibble: 3 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     1     2

setdiff(df1, df2)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     1

setdiff(df2, df1)
#> # A tibble: 1 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     2
上一篇下一篇

猜你喜欢

热点阅读