[15] R for Data Science: Arranging rows with arrange()
2020-11-02
灰常不错
arrange() works much like filter(), except that instead of selecting rows it changes their order. It takes a data frame and a set of column names to sort by.
Summary
- Sorting by several columns in turn
- Sorting in descending order with desc()
- How missing values are sorted
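Sorting by several columns in turn
If you give more than one column name, each later column is used to break ties among rows that have equal values in the earlier columns: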
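library(nycflights13)  # provides the flights data frame
library(dplyr)         # provides arrange() and desc()
arrange(flights, year, month, day)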
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 517 515 2 830 819 11
2 2013 1 1 533 529 4 850 830 20
3 2013 1 1 542 540 2 923 850 33
4 2013 1 1 544 545 -1 1004 1022 -18
5 2013 1 1 554 600 -6 812 837 -25
6 2013 1 1 554 558 -4 740 728 12
7 2013 1 1 555 600 -5 913 854 19
8 2013 1 1 557 600 -3 709 723 -14
9 2013 1 1 557 600 -3 838 846 -8
10 2013 1 1 558 600 -2 753 745 8
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
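To make the tie-breaking rule easier to see than in the full flights output, here is a minimal sketch on a tiny tibble. The `toy` data frame and its values are made up for illustration and are not part of the original notes.

```r
library(dplyr)

# Toy data (made up for illustration): two rows share month 1, two share month 2
toy <- tibble(
  month = c(2, 1, 2, 1),
  day   = c(3, 9, 1, 4)
)

# Sort by month first; day only decides the order among rows with the same month
sorted <- arrange(toy, month, day)
sorted$month  # 1 1 2 2
sorted$day    # 4 9 1 3
```

Note that day is not globally sorted in the result: it is ordered only within each month.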
Sorting in descending order with desc()
arrange(flights, desc(arr_delay))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 9 641 900 1301 1242 1530 1272
2 2013 6 15 1432 1935 1137 1607 2120 1127
3 2013 1 10 1121 1635 1126 1239 1810 1109
4 2013 9 20 1139 1845 1014 1457 2210 1007
5 2013 7 22 845 1600 1005 1044 1815 989
6 2013 4 10 1100 1900 960 1342 2211 931
7 2013 3 17 2321 810 911 135 1020 915
8 2013 7 22 2257 759 898 121 1026 895
9 2013 12 5 756 1700 896 1058 2020 878
10 2013 5 3 1133 2055 878 1250 2215 875
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
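desc() applies only to the column it wraps, so ascending and descending keys can be mixed in one call. The sketch below assumes a made-up `toy` tibble with hypothetical `carrier` and `delay` values.

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- tibble(
  carrier = c("AA", "UA", "AA", "UA"),
  delay   = c(10, 45, 30, 5)
)

# carrier ascending, then delay descending within each carrier
mixed <- arrange(toy, carrier, desc(delay))
mixed$carrier  # "AA" "AA" "UA" "UA"
mixed$delay    # 30 10 45 5
```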
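How missing values are sorted
Missing values are always sorted to the end, whether the order is ascending or descending:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)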
# A tibble: 3 x 1
x
<dbl>
1 2
2 5
3 NA
arrange(df, desc(x))
# A tibble: 3 x 1
x
<dbl>
1 5
2 2
3 NA
Exercises
(1) How can you use arrange() to sort missing values to the start?
arrange(flights, desc(is.na(dep_time)))
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 1 NA 1630 NA NA 1815 NA
2 2013 1 1 NA 1935 NA NA 2240 NA
3 2013 1 1 NA 1500 NA NA 1825 NA
4 2013 1 1 NA 600 NA NA 901 NA
5 2013 1 2 NA 1540 NA NA 1747 NA
6 2013 1 2 NA 1620 NA NA 1746 NA
7 2013 1 2 NA 1355 NA NA 1459 NA
8 2013 1 2 NA 1420 NA NA 1644 NA
9 2013 1 2 NA 1321 NA NA 1536 NA
10 2013 1 2 NA 1545 NA NA 1910 NA
# ... with 336,766 more rows, and 10 more variables: carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
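The reason this answer works can be seen on a three-row tibble: is.na() returns a logical vector, FALSE sorts before TRUE, and desc() reverses that so the TRUE (missing) rows come first. The second sort key `x` in this sketch is an addition for illustration, used to order the non-missing rows.

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

is.na(df$x)  # FALSE FALSE TRUE

# desc() puts TRUE (missing) ahead of FALSE; x then orders the remaining rows
na_first <- arrange(df, desc(is.na(x)), x)
na_first$x  # NA 2 5
```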
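(2) Sort flights to find the most delayed flight. Find the flight that left earliest. Here "earliest" is read as leaving furthest ahead of schedule, that is, the smallest dep_delay.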
head(arrange(flights, desc(dep_delay)), 1)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 9 641 900 1301 1242 1530 1272
head(arrange(flights, dep_delay), 1)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 12 7 2040 2123 -43 40 2352 48
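One caveat with head(..., 1) is that it returns a single row even when several rows tie for the extreme value. A minimal sketch of the difference, using a made-up `toy` tibble (not the flights data) and filter() as one possible way to keep all tied rows:

```r
library(dplyr)

# Toy data (made up for illustration): two flights tie for the largest delay
toy <- tibble(
  flight    = c("A", "B", "C"),
  dep_delay = c(120, 15, 120)
)

# head() keeps a single row even though two rows share the maximum
top_one <- head(arrange(toy, desc(dep_delay)), 1)
nrow(top_one)  # 1

# filter() on the maximum keeps every tied row
worst <- filter(toy, dep_delay == max(dep_delay, na.rm = TRUE))
worst$flight  # "A" "C"
```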
(3) Sort flights to find the fastest flight.
head(arrange(flights, desc(distance / air_time)), 1)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 5 25 1709 1700 9 1923 1937 -14
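In flights, distance is in miles and air_time is in minutes, so distance / air_time is miles per minute. As a sketch, the speed can be stored in a column and converted to miles per hour before sorting. The `toy` tibble and the `mph` column name below are assumptions for illustration.

```r
library(dplyr)

# Toy data (made up for illustration); units follow flights: miles and minutes
toy <- tibble(
  flight   = c("A", "B", "C"),
  distance = c(600, 300, 1000),
  air_time = c(60, 45, 120)
)

# Convert miles per minute to miles per hour, then sort fastest first
by_speed <- arrange(mutate(toy, mph = distance / air_time * 60), desc(mph))
by_speed$flight  # "A" "C" "B"
```

Keeping the speed as a column makes the sort key visible in the output, which the inline expression desc(distance / air_time) does not.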
(4) Which flight spent the longest time in the air? Which spent the shortest?
head(arrange(flights, desc(air_time)), 1)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 3 17 1337 1335 2 1937 1836 61
head(arrange(flights, air_time), 1)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
1 2013 1 16 1355 1315 40 1442 1411 31