[R语言] Join 连接《R for data science
2020-04-22 本文已影响0人
半为花间酒
《R for Data Science》第十三章 Relational data 啃书知识点积累
参考链接:
Introduction
本章节的应用前提需要了解nycflights13
各数据集的内容
library(nycflights13)
?flights # dplyr用的很熟悉就不查了
?airlines
?airports
?planes
?weather
画模式图
涉及了datamodelr
和DiagrammeR
library(datamodelr,DiagrammeR)
dm <- dm_from_data_frames(list(
# 指定数据集来源
Master = Lahman::Master,
Managers = Lahman::Managers,
AwardsManagers = Lahman::AwardsManagers
)) %>%
# 设置主键
dm_set_key("Master", "playerID") %>%
dm_set_key("Managers", c("yearID", "teamID", "inseason")) %>%
dm_set_key("AwardsManagers", c("playerID", "awardID", "yearID")) %>%
# 设置外键
dm_add_references(
Managers$playerID == Master$playerID,
AwardsManagers$playerID == Master$playerID
)
# 绘图
# rankdir可以指定'BT', 'RL'等
# columnArrows默认为F,只是数据集间粗略箭头
dm_create_graph(dm, rankdir = "LR", columnArrows = TRUE) %>%
dm_render_graph()
There are three families of verbs designed to work with relational data:
- Mutating joins, which add new variables to one data frame from matching observations in another.
- Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
- Set operations, which treat observations as if they were set elements.
Keys
There are two types of keys:
- A primary key uniquely identifies an observation in its own table.
- A foreign key uniquely identifies an observation in another table.
- 确认是不是主键:分组后各组仅一个观测
planes %>%
count(tailnum) %>%
filter(n > 1) %>%
nrow()
# [1] 0
- 代理键的概念:
If a table lacks a primary key, it’s sometimes useful to add one with
mutate()
androw_number()
. That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a surrogate key.
flights %>%
mutate(id = row_number()) %>%
select(id,everything())
Mutating joins
flights %>%
select(year:day, hour,tailnum, carrier) %>%
left_join(airlines, by = "carrier")
# 也可以利用mutate和match完成left_join相同操作,但麻烦很多
flights %>%
select(year:day, hour,tailnum, carrier) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
- inner_join
The most important property of an inner join is that unmatched rows are not included in the result.
- Outer joins
- A left join keeps all observations in x.
- A right join keeps all observations in y.
- A full join keeps all observations in x and y.
The left join should be your default join
-
base::merge()
有类似join的功能
- Defining the key columns
- The default,
by = NULL
, uses all variables that appear in both tables, the so called natural join - A character vector,
by = "x"
. This is like a natural join, but uses only some of the common variables.
flights %>%
left_join(planes, by = "tailnum")
#> # A tibble: 336,776 x 16
#> year.x month day hour origin dest tailnum carrier year.y type
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe…
#> 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe…
#> 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe…
#> 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe…
#> 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe…
#> 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe…
#> # … with 3.368e+05 more rows, and 6 more variables: manufacturer <chr>,
#> # model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
- A named character vector:
by = c("a" = "b")
. This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
flights %>%
left_join(airports, c("origin" = "faa"))
#> # A tibble: 336,776 x 15
#> year month day hour origin dest tailnum carrier name lat lon alt
#> <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 2013 1 1 5 EWR IAH N14228 UA Newa… 40.7 -74.2 18
#> 2 2013 1 1 5 LGA IAH N24211 UA La G… 40.8 -73.9 22
#> 3 2013 1 1 5 JFK MIA N619AA AA John… 40.6 -73.8 13
#> 4 2013 1 1 5 JFK BQN N804JB B6 John… 40.6 -73.8 13
#> 5 2013 1 1 6 LGA ATL N668DN DL La G… 40.8 -73.9 22
#> 6 2013 1 1 5 EWR ORD N39463 UA Newa… 40.7 -74.2 18
#> # … with 3.368e+05 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#> # tzone <chr>
- Q: 画出航班的大致起落
flights %>%
inner_join(select(airports, origin = faa, ori_lat = lat, ori_lon = lon),
by = "origin") %>%
inner_join(select(airports, dest = faa, des_lat = lat, des_lon = lon),
by = "dest") %>%
# 取子集,类似filter(row_number() <= 100)
slice(1:100) %>%
ggplot(aes(
x = ori_lon, xend = des_lon,
y = ori_lat, yend = des_lat
)) +
# 美国地图
borders("state") +
# 调整地图到合适比例
coord_quickmap() +
geom_segment(arrow = arrow(length = unit(0.1, "cm"))) +
labs(y = "Latitude", x = "Longitude")
Filtering joins
-
semi_join(x, y)
keeps all observations in x that have a match in y.
Semi-joins are useful for matching filtered summary tables back to the original rows. -
anti_join(x, y)
drops all observations in x that have a match in y.、
Anti-joins are useful for diagnosing join mismatches.
x <- tribble(
~key, ~val_x,
# --/---
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
# --/---
1, "y1",
2, "y2",
4, "y3"
)
semi_join(x,y,by='key')
# # A tibble: 2 x 2
# key val_x
# <dbl> <chr>
# 1 1 x1
# 2 2 x2
semi_join(y,x,by='key')
# # A tibble: 2 x 2
# key val_x
# <dbl> <chr>
# 1 1 y1
# 2 2 y2
anti_join(x,y,by='key')
# # A tibble: 1 x 2
# key val_x
# <dbl> <chr>
# 1 3 x3
anti_join(y,x,by='key')
# # A tibble: 1 x 2
# key val_x
# <dbl> <chr>
# 1 4 y3
# 另一个例子,获取飞往受欢迎目的地的所有航班
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
slice(1:10)
flights %>%
filter(dest %in% top_dest$dest)
# 用半连接
flights %>%
semi_join(top_dest)
Set operations
intersect(x, y)
: return only observations in both x and y.union(x, y)
: return unique observations in x and y.setdiff(x, y)
: return observations in x, but not in y.
These expect the x and y inputs to have the same variables
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
intersect(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
union(df1, df2)
#> # A tibble: 3 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 2 1
#> 3 1 2
setdiff(df1, df2)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 2 1
setdiff(df2, df1)
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2