R language

[14] "R for Data Science": dplyr exercises

2020-11-02  灰常不错
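(1) Find all flights that meet the following conditions.

a. Arrived two or more hours late

library(tidyverse)
library(nycflights13)
# flights: all 336,776 flights that departed New York City in 2013
flights
# arr_delay is in minutes, so two hours is 120
filter(flights, arr_delay >= 120)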
# A tibble: 10,200 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      811            630       101     1047            830       137
 2  2013     1     1      848           1835       853     1001           1950       851
 3  2013     1     1      957            733       144     1056            853       123
 4  2013     1     1     1114            900       134     1447           1222       145
 5  2013     1     1     1505           1310       115     1638           1431       127
 6  2013     1     1     1525           1340       105     1831           1626       125
 7  2013     1     1     1549           1445        64     1912           1656       136
 8  2013     1     1     1558           1359       119     1718           1515       123
 9  2013     1     1     1732           1630        62     2028           1825       123
10  2013     1     1     1803           1620       103     2008           1750       138
# ... with 10,190 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

b. Flew to Houston (IAH or HOU), written two ways:

# Both calls return the same rows; the output is shown once
filter(flights, dest == 'IAH' | dest == 'HOU')
filter(flights, dest %in% c('IAH', 'HOU'))
# A tibble: 9,313 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      623            627        -4      933            932         1
 4  2013     1     1      728            732        -4     1041           1038         3
 5  2013     1     1      739            739         0     1104           1038        26
 6  2013     1     1      908            908         0     1228           1219         9
 7  2013     1     1     1028           1026         2     1350           1339        11
 8  2013     1     1     1044           1045        -1     1352           1351         1
 9  2013     1     1     1114            900       134     1447           1222       145
10  2013     1     1     1205           1200         5     1503           1505        -2
# ... with 9,303 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
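
A common slip when writing this filter is `dest == c('IAH', 'HOU')`, which looks like a set test but is not one. The sketch below uses a small made-up vector (the `dest` values here are hypothetical, not taken from flights) to show how `==` recycles the right-hand side while `%in%` tests membership.

```r
# Hypothetical destination codes, not taken from flights
dest <- c("HOU", "IAH", "IAH", "JFK")

# == recycles c("IAH", "HOU") along dest, so each element is compared
# with only one of the two codes, depending on its position
recycled <- dest == c("IAH", "HOU")

# %in% checks every element against the whole set of codes
matched <- dest %in% c("IAH", "HOU")

print(recycled)
print(matched)
```

Only the third element survives the recycled comparison, even though the first three are all Houston airports, which is why `%in%` or an explicit `|` is the safer choice here.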

c. Were operated by United, American, or Delta

# Carrier codes: UA = United, AA = American, DL = Delta
filter(flights, carrier %in% c('DL', 'UA', 'AA'))
filter(flights, carrier == 'DL' | carrier == 'UA' | carrier == 'AA')
# A tibble: 139,504 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      554            600        -6      812            837       -25
 5  2013     1     1      554            558        -4      740            728        12
 6  2013     1     1      558            600        -2      753            745         8
 7  2013     1     1      558            600        -2      924            917         7
 8  2013     1     1      558            600        -2      923            937       -14
 9  2013     1     1      559            600        -1      941            910        31
10  2013     1     1      559            600        -1      854            902        -8
# ... with 139,494 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

d. Departed in July, August, or September

filter(flights, month %in% 7:9)
# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     7     1        1           2029       212      236           2359       157
 2  2013     7     1        2           2359         3      344            344         0
 3  2013     7     1       29           2245       104      151              1       110
 4  2013     7     1       43           2130       193      322             14       188
 5  2013     7     1       44           2150       174      300            100       120
 6  2013     7     1       46           2051       235      304           2358       186
 7  2013     7     1       48           2001       287      308           2305       243
 8  2013     7     1       58           2155       183      335             43       172
 9  2013     7     1      100           2146       194      327             30       177
10  2013     7     1      100           2245       135      337            135       122
# ... with 86,316 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

e. Arrived two or more hours late, but did not leave late

filter(flights, dep_delay <= 0 & arr_delay >= 120)
# A tibble: 29 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1    27     1419           1420        -1     1754           1550       124
 2  2013    10     7     1350           1350         0     1736           1526       130
 3  2013    10     7     1357           1359        -2     1858           1654       124
 4  2013    10    16      657            700        -3     1258           1056       122
 5  2013    11     1      658            700        -2     1329           1015       194
 6  2013     3    18     1844           1847        -3       39           2219       140
 7  2013     4    17     1635           1640        -5     2049           1845       124
 8  2013     4    18      558            600        -2     1149            850       179
 9  2013     4    18      655            700        -5     1213            950       143
10  2013     5    22     1827           1830        -3     2217           2010       127
# ... with 19 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

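f. Were delayed by at least an hour on departure, but made up at least 30 minutes in flight

# Time made up in flight = dep_delay - arr_delay
filter(flights, dep_delay >= 60 & dep_delay - arr_delay >= 30)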
# A tibble: 2,074 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1     1716           1545        91     2140           2039        61
 2  2013     1     1     2205           1720       285       46           2040       246
 3  2013     1     1     2326           2130       116      131             18        73
 4  2013     1     3     1503           1221       162     1803           1555       128
 5  2013     1     3     1821           1530       171     2131           1910       141
 6  2013     1     3     1839           1700        99     2056           1950        66
 7  2013     1     3     1850           1745        65     2148           2120        28
 8  2013     1     3     1923           1815        68     2036           1958        38
 9  2013     1     3     1941           1759       102     2246           2139        67
10  2013     1     3     1950           1845        65     2228           2227         1
# ... with 2,064 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>

g. Departed between midnight and 6am (inclusive)

# dep_time is stored as HHMM, and midnight is recorded as 2400
filter(flights, dep_time <= 600 | dep_time == 2400)
# A tibble: 9,373 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     1     1      517            515         2      830            819        11
 2  2013     1     1      533            529         4      850            830        20
 3  2013     1     1      542            540         2      923            850        33
 4  2013     1     1      544            545        -1     1004           1022       -18
 5  2013     1     1      554            600        -6      812            837       -25
 6  2013     1     1      554            558        -4      740            728        12
 7  2013     1     1      555            600        -5      913            854        19
 8  2013     1     1      557            600        -3      709            723       -14
 9  2013     1     1      557            600        -3      838            846        -8
10  2013     1     1      558            600        -2      753            745         8
# ... with 9,363 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
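
Because departure times are stored as HHMM numbers and midnight appears as 2400, the midnight-to-6am test can also be folded into a single comparison with the modulo operator. This is only a sketch on a made-up vector (the `dep_time` values below are hypothetical) to show how the arithmetic behaves.

```r
# Hypothetical departure times in HHMM form; 2400 stands for midnight
dep_time <- c(1, 517, 600, 601, 2359, 2400, NA)

# 2400 %% 2400 is 0, so midnight passes the same <= 600 test;
# a missing time stays NA
early <- dep_time %% 2400 <= 600

print(early)
```

The same condition could be passed to filter() as `filter(flights, dep_time %% 2400 <= 600)`; filter() keeps only rows where the condition is TRUE, so rows with a missing dep_time are dropped.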
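(2) Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
# between(x, left, right) is shorthand for x >= left & x <= right
between(1:12, 7, 9)
#[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
# Flights that departed in July through September
filter(flights, between(month, 7, 9))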
# A tibble: 86,326 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>
 1  2013     7     1        1           2029       212      236           2359       157
 2  2013     7     1        2           2359         3      344            344         0
 3  2013     7     1       29           2245       104      151              1       110
 4  2013     7     1       43           2130       193      322             14       188
 5  2013     7     1       44           2150       174      300            100       120
 6  2013     7     1       46           2051       235      304           2358       186
 7  2013     7     1       48           2001       287      308           2305       243
 8  2013     7     1       58           2155       183      335             43       172
 9  2013     7     1      100           2146       194      327             30       177
10  2013     7     1      100           2245       135      337            135       122
# ... with 86,316 more rows, and 10 more variables: carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>
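
To make the meaning of between() concrete, the sketch below compares it with the explicit pair of comparisons on a small made-up vector (the `month` values are hypothetical and include a missing value on purpose). It assumes only that dplyr is installed.

```r
library(dplyr)

# Hypothetical month values, including a missing one
month <- c(1, 7, 8, 9, 12, NA)

# between() is inclusive at both ends
via_between <- between(month, 7, 9)

# The same test written out by hand
via_compare <- month >= 7 & month <= 9

print(via_between)
print(via_compare)
```

Both boundaries, 7 and 9, count as inside the range, and the missing month gives NA in either form.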
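(3) How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
nrow(filter(flights, is.na(dep_time)))
#[1] 8255
# A missing dep_time most likely means the flight was cancelled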
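
The question also asks about the other variables, which the count above does not cover. One possible way to check, sketched here with base R's is.na() and colSums() on the same flights table, is to count the missing values in every column at once.

```r
library(nycflights13)

# Count the missing values in every column of flights
na_counts <- colSums(is.na(flights))

# Keep only the columns that have at least one NA
print(na_counts[na_counts > 0])
```

Columns that describe the departure, the arrival, and the time in the air are the natural ones to compare against the dep_time count.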
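(4) Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

As I understand it, an operation involving NA normally returns NA, whether it is arithmetic or anything else, unless the result would be the same for every value the NA could stand for. NA ^ 0 is 1 because any number raised to the power 0 is 1; NA | TRUE is TRUE because anything "or" TRUE is TRUE; FALSE & NA is FALSE because FALSE "and" anything is FALSE. NA * 0 stays NA because the product is not always 0: Inf * 0 is NaN.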
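
These cases can be checked directly in the console. The sketch below uses only base R; the variable names are just labels for this illustration.

```r
# Results that do not depend on the unknown value are not NA
pow_zero  <- NA ^ 0       # any number to the power 0 is 1
or_true   <- NA | TRUE    # anything "or" TRUE is TRUE
and_false <- FALSE & NA   # FALSE "and" anything is FALSE

# The counterexample: x * 0 is not 0 for every possible x
times_zero <- NA * 0      # stays NA
inf_zero   <- Inf * 0     # NaN, which is why NA * 0 cannot be 0

print(pow_zero)
print(or_true)
print(and_false)
print(times_zero)
print(inf_zero)
```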
