[14] R for Data Science: dplyr exercises
2020-11-02
灰常不错
(1) Find all flights that meet the following conditions.
a. Flights that arrived two or more hours late
library(tidyverse)
library(nycflights13)
flights
filter(flights,arr_delay>=120)
# A tibble: 10,200 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 811 630 101 1047 830 137
2 2013 1 1 848 1835 853 1001 1950 851
3 2013 1 1 957 733 144 1056 853 123
4 2013 1 1 1114 900 134 1447 1222 145
5 2013 1 1 1505 1310 115 1638 1431 127
6 2013 1 1 1525 1340 105 1831 1626 125
7 2013 1 1 1549 1445 64 1912 1656 136
8 2013 1 1 1558 1359 119 1718 1515 123
9 2013 1 1 1732 1630 62 2028 1825 123
10 2013 1 1 1803 1620 103 2008 1750 138
# ... with 10,190 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
b. Flights to Houston (IAH or HOU), written two equivalent ways:
filter(flights,dest=='IAH'|dest=='HOU')
filter(flights,dest %in% c('IAH','HOU'))
# A tibble: 9,313 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 623 627 -4 933 932 1
4 2013 1 1 728 732 -4 1041 1038 3
5 2013 1 1 739 739 0 1104 1038 26
6 2013 1 1 908 908 0 1228 1219 9
7 2013 1 1 1028 1026 2 1350 1339 11
8 2013 1 1 1044 1045 -1 1352 1351 1
9 2013 1 1 1114 900 134 1447 1222 145
10 2013 1 1 1205 1200 5 1503 1505 -2
# ... with 9,303 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
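A common slip in this exercise is writing dest == c('IAH', 'HOU') instead of using %in%. The sketch below uses a small made-up vector (not the real flights data) to suggest why %in% is the safer choice: == recycles the right-hand side position by position rather than testing membership.

```r
# Hypothetical destination codes, not taken from nycflights13
dest <- c("IAH", "HOU", "HOU", "IAH", "ORD", "JFK")

# == recycles c("IAH", "HOU") along dest, so it compares position by position
recycled <- dest == c("IAH", "HOU")

# %in% asks, for each element, whether it appears anywhere in the set
member <- dest %in% c("IAH", "HOU")

sum(recycled)  # counts only the positional matches
sum(member)    # counts every IAH or HOU entry
```

Here the two counts differ, so the == version would silently drop matching rows inside filter().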
c. Flights operated by United, American, or Delta
filter(flights,carrier %in% c('DL','UA','AA'))
filter(flights,carrier=='DL'|carrier=='UA'|carrier=='AA')
# A tibble: 139,504 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 554 600 -6 812 837 -25
5 2013 1 1 554 558 -4 740 728 12
6 2013 1 1 558 600 -2 753 745 8
7 2013 1 1 558 600 -2 924 917 7
8 2013 1 1 558 600 -2 923 937 -14
9 2013 1 1 559 600 -1 941 910 31
10 2013 1 1 559 600 -1 854 902 -8
# ... with 139,494 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
d. Flights that departed in July, August, or September
filter(flights, month %in% 7:9)
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 7 1 1 2029 212 236 2359 157
2 2013 7 1 2 2359 3 344 344 0
3 2013 7 1 29 2245 104 151 1 110
4 2013 7 1 43 2130 193 322 14 188
5 2013 7 1 44 2150 174 300 100 120
6 2013 7 1 46 2051 235 304 2358 186
7 2013 7 1 48 2001 287 308 2305 243
8 2013 7 1 58 2155 183 335 43 172
9 2013 7 1 100 2146 194 327 30 177
10 2013 7 1 100 2245 135 337 135 122
# ... with 86,316 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
e. Flights that arrived two or more hours late but did not depart late
filter(flights, dep_delay <= 0 & arr_delay >= 120)
# A tibble: 29 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 27 1419 1420 -1 1754 1550 124
2 2013 10 7 1350 1350 0 1736 1526 130
3 2013 10 7 1357 1359 -2 1858 1654 124
4 2013 10 16 657 700 -3 1258 1056 122
5 2013 11 1 658 700 -2 1329 1015 194
6 2013 3 18 1844 1847 -3 39 2219 140
7 2013 4 17 1635 1640 -5 2049 1845 124
8 2013 4 18 558 600 -2 1149 850 179
9 2013 4 18 655 700 -5 1213 950 143
10 2013 5 22 1827 1830 -3 2217 2010 127
# ... with 19 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
f. Flights that departed at least an hour late but made up at least 30 minutes in flight
filter(flights, dep_delay >= 60 & dep_delay - arr_delay >= 30)
# A tibble: 2,074 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 1716 1545 91 2140 2039 61
2 2013 1 1 2205 1720 285 46 2040 246
3 2013 1 1 2326 2130 116 131 18 73
4 2013 1 3 1503 1221 162 1803 1555 128
5 2013 1 3 1821 1530 171 2131 1910 141
6 2013 1 3 1839 1700 99 2056 1950 66
7 2013 1 3 1850 1745 65 2148 2120 28
8 2013 1 3 1923 1815 68 2036 1958 38
9 2013 1 3 1941 1759 102 2246 2139 67
10 2013 1 3 1950 1845 65 2228 2227 1
# ... with 2,064 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
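The condition in (f) is easier to check with the arithmetic spelled out: the time made up in the air is the departure delay minus the arrival delay. Below is a minimal sketch on a made-up data frame (the values are hypothetical, chosen only to exercise each branch of the condition).

```r
# Hypothetical delays in minutes, not real flights
toy <- data.frame(
  dep_delay = c(91, 65, 60, 45),
  arr_delay = c(61, 40, 30, 10)
)

# Minutes recovered in flight: positive means the flight caught up
made_up <- toy$dep_delay - toy$arr_delay

# Same logic as the filter() call: late by an hour or more AND recovered 30+ minutes
keep <- toy$dep_delay >= 60 & made_up >= 30

toy[keep, ]
```

The last row recovers 35 minutes but is dropped anyway, because it left only 45 minutes late.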
g. Flights that departed between midnight and 6 a.m. (inclusive)
filter(flights, dep_time <= 600 | dep_time == 2400)
# A tibble: 9,373 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 544 545 -1 1004 1022 -18
5 2013 1 1 554 600 -6 812 837 -25
6 2013 1 1 554 558 -4 740 728 12
7 2013 1 1 555 600 -5 913 854 19
8 2013 1 1 557 600 -3 709 723 -14
9 2013 1 1 557 600 -3 838 846 -8
10 2013 1 1 558 600 -2 753 745 8
# ... with 9,363 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
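Departure times are stored as HHMM integers, with midnight recorded as 2400 rather than 0, which is why midnight needs its own clause. One possible alternative is the modulo operator, which maps 2400 to 0 so a single comparison covers both cases. The sketch below checks this on a hand-made vector (hypothetical values, not the full dataset).

```r
# Hypothetical HHMM departure times, including midnight (2400) and a missing value
dep_time <- c(1, 517, 600, 601, 1200, 2359, 2400, NA)

# Two explicit conditions: early morning, or exactly midnight
explicit <- dep_time <= 600 | dep_time == 2400

# Modulo version: 2400 %% 2400 is 0, so midnight falls below 600 too
modulo <- dep_time %% 2400 <= 600

identical(explicit, modulo)
```

If the two agree, filter(flights, dep_time %% 2400 <= 600) should select the same rows as the two-clause version.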
(2) Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous questions?
between(1:12, 7, 9)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
# Select the flights that departed from July through September
filter(flights, between(month, 7, 9))
# A tibble: 86,326 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 7 1 1 2029 212 236 2359 157
2 2013 7 1 2 2359 3 344 344 0
3 2013 7 1 29 2245 104 151 1 110
4 2013 7 1 43 2130 193 322 14 188
5 2013 7 1 44 2150 174 300 100 120
6 2013 7 1 46 2051 235 304 2358 186
7 2013 7 1 48 2001 287 308 2305 243
8 2013 7 1 58 2155 183 335 43 172
9 2013 7 1 100 2146 194 327 30 177
10 2013 7 1 100 2245 135 337 135 122
# ... with 86,316 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
(3) How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
nrow(filter(flights,is.na(dep_time)))
#[1] 8255
# A missing dep_time most likely means the flight was cancelled
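The question also asks about the other variables. One way to look at them all at once is colSums(is.na(...)), which counts the missing values in every column. The sketch below runs it on a tiny made-up data frame (the column names mirror flights, the values are hypothetical); on the real data the equivalent call would be colSums(is.na(flights)).

```r
# Hypothetical rows: two flights that never departed, one with no arrival delay recorded
toy <- data.frame(
  dep_time  = c(517, NA, 542, NA),
  dep_delay = c(2, NA, 2, NA),
  arr_time  = c(830, NA, 923, NA),
  arr_delay = c(11, NA, NA, NA)
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts
```

A row with a departure and arrival time but no arr_delay, like the third one here, is a different case from a cancellation and is worth inspecting separately.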
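(4) Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
I think the rule is that an operation involving NA returns NA whenever the result depends on the unknown value, whether the operation is arithmetic or anything else. The exceptions are the cases where the result is the same no matter what the missing value is: x ^ 0 is 1 for every x, so NA ^ 0 is 1; anything | TRUE is TRUE; and FALSE & anything is FALSE. NA * 0 stays NA because not every value times 0 is 0: Inf * 0 is NaN, so the result is not determined.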
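These cases are quick to verify at the console. The lines below are a small base R check of each expression from the question, plus the Inf * 0 case that explains why NA * 0 is not 0.

```r
# Results that do not depend on the missing value
pow_zero  <- NA ^ 0        # x ^ 0 is 1 for every x
or_true   <- NA | TRUE     # anything | TRUE is TRUE
and_false <- FALSE & NA    # FALSE & anything is FALSE

# A result that does depend on the missing value
times_zero <- NA * 0       # stays NA
inf_zero   <- Inf * 0      # NaN, so "x * 0 is always 0" does not hold

c(pow_zero = pow_zero, times_zero = times_zero, inf_zero = inf_zero)
c(or_true = or_true, and_false = and_false)
```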