order by, sort by, distribu in Spark

2019-10-14 邵红晓

order by
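order by performs a global sort of the input. In Hive on MapReduce this means a single reducer (multiple reducers cannot guarantee a total order), so sorting takes a long time when the input is large. Spark avoids the single reducer by range-partitioning the data before sorting, but the global sort still requires a full shuffle.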
SELECT pdate from xxx.jpush_wemedia_native_hbase order by pdate desc

1   2020-06-10 16:52:55
2   2020-06-10 16:52:42
3   2020-06-10 16:52:26
4   2020-06-10 16:49:59
5   2020-06-10 16:49:57
6   2020-06-10 16:48:16
7   2020-06-10 16:46:46
8   2020-06-10 16:46:16
9   2020-06-10 16:39:35
10  2020-06-10 16:37:12
11  2020-06-10 16:36:05
12  2020-06-10 16:35:57
13  2020-06-10 16:35:08
14  2020-06-10 16:31:32
15  2020-06-10 16:30:15
16  2020-06-10 16:28:18
17  2020-06-10 16:26:50
18  2020-06-10 16:26:06
19  2020-06-10 16:24:24
20  2020-06-10 16:19:34
21  2020-06-10 16:17:33
22  2020-06-10 16:09:22
23  2020-06-10 16:08:09
24  2020-06-10 16:06:28
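
A single reducer is one way to get a total order. Another is to give each reducer a disjoint range of values, sort each range, and read the ranges back in order. The following is a minimal Python sketch of that idea, not engine code: `range_partition_sort` and the split points in `boundaries` are hypothetical names and values chosen for illustration, and a real engine would pick its split points by sampling the data.

```python
import bisect

def range_partition_sort(values, boundaries):
    """Toy model of a global descending sort spread over several reducers.

    boundaries: ascending split points. Partition i receives the values in
    (boundaries[i-1], boundaries[i]].
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        partitions[bisect.bisect_left(boundaries, v)].append(v)
    # Each "reducer" sorts only its own range, in descending order.
    sorted_parts = [sorted(p, reverse=True) for p in partitions]
    # Reading the ranges from highest to lowest yields a total order.
    return [v for part in reversed(sorted_parts) for v in part]

pdates = [
    "2020-06-10 16:52:55", "2020-05-27 16:26:11", "2020-06-08 14:40:45",
    "2020-06-10 14:17:56", "2020-05-28 18:05:15", "2020-06-10 16:46:16",
]
# Hypothetical split points (assumption, not taken from any real job).
boundaries = ["2020-06-01 00:00:00", "2020-06-10 00:00:00"]
globally_sorted = range_partition_sort(pdates, boundaries)
```

Because the ranges do not overlap, concatenating them gives the same result as sorting everything in one place.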

sort by
sort by is not a global sort: it sorts the data before it enters each reducer. If mapred.reduce.tasks > 1, sort by therefore only guarantees that each reducer's output is sorted, not that the overall result is.
SELECT pdate from xxx.jpush_wemedia_native_hbase sort by pdate desc

1   2020-06-10 14:17:56
2   2020-06-08 14:40:45
3   2020-05-27 16:26:11
4   2020-05-27 16:28:18
5   2020-06-10 16:05:15
6   2020-05-28 18:05:15
7   2020-06-10 16:46:16
8   2020-06-10 16:02:15
9   2020-06-08 14:39:13
10  2020-05-28 18:15:58
11  2020-06-10 16:26:06
12  2020-05-27 16:29:29
13  2020-06-10 16:00:25
14  2020-06-10 14:18:12
15  2020-06-10 14:04:08
16  2020-06-10 13:36:33
17  2020-06-08 14:38:56
18  2020-06-10 16:52:42
19  2020-06-08 14:40:25
20  2020-06-10 16:52:26
21  2020-06-10 16:37:12
22  2020-06-10 14:02:25
23  2020-06-10 14:16:32
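
The interleaved output above is what you get when several independently sorted reducer outputs are read one after another. A small Python sketch of that effect follows. It is a toy model only: `sort_by_desc` is a hypothetical helper, and dealing rows round-robin to reducers is an assumption made for the example, since without distribute by the actual assignment is not something to rely on.

```python
def sort_by_desc(rows, num_reducers):
    """Toy model of `sort by ... desc` with several reducers."""
    # Assumption: rows are dealt to reducers round-robin.
    reducers = [rows[i::num_reducers] for i in range(num_reducers)]
    # Each reducer sorts only the rows it received.
    return [sorted(r, reverse=True) for r in reducers]

pdates = [
    "2020-06-10 16:52:42", "2020-05-27 16:28:18", "2020-06-08 14:39:13",
    "2020-06-10 14:04:08", "2020-05-28 18:15:58", "2020-06-10 16:37:12",
]
per_reducer = sort_by_desc(pdates, 2)
# Reading the reducer outputs back to back, as a client would see them.
combined = [v for part in per_reducer for v in part]
is_globally_sorted = combined == sorted(pdates, reverse=True)
```

Each list in `per_reducer` is in descending order, but `combined` is not, which is why `is_globally_sorted` ends up false for this input.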

distribute by
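distribute by controls how map-side output is split among the reducers. Hive distributes rows across the available reducers according to the columns listed after distribute by, using a hash of those columns by default. sort by produces one sorted file per reducer. Sometimes you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, which is why it is so often combined with sort by.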
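Problems it is typically used to address:
1. Map output files of uneven size.
2. Reduce output files of uneven size.
3. Too many small files.
4. Files that are too large.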
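
The hash-based routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: `distribute_by` is a hypothetical helper, `zlib.crc32` stands in for whatever hash function the engine actually uses, and the rows are made up.

```python
import zlib

def distribute_by(rows, key, num_reducers):
    """Toy model: route each row to a reducer by hashing its key column."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # crc32 is only a stand-in for the engine's own hash function.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % num_reducers
        reducers[idx].append(row)
    return reducers

rows = [
    {"user": "a", "amount": 3},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
    {"user": "c", "amount": 1},
    {"user": "b", "amount": 2},
]
reducers = distribute_by(rows, "user", 3)
```

Whatever the hash values turn out to be, all rows with the same `user` land on the same reducer, which is the property a later aggregation depends on.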

INSERT overwrite table store_sales partition 
       ( 
              ss_sold_date_sk 
       ) 
SELECT ss_sold_time_sk,
       ss_net_paid, 
       ss_net_paid_inc_tax, 
       ss_net_profit, 
       ss_sold_date_sk 
FROM   tpcds_1t_ext.et_store_sales 
where ss_sold_date_sk is null
distribute by cast(rand() * 5 as int);
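
To see why distributing by cast(rand() * 5 as int) limits the number of output files, here is a rough Python model. It assumes each task writes one file per table partition it holds rows for; `count_files`, the three partition values, and the random assignment of rows to 200 tasks are all hypothetical simplifications rather than a description of the engine.

```python
import random

def count_files(partition_values, task_of_row):
    """Count files per table partition: one file per (task, partition) pair."""
    files = {(task_of_row(i), p) for i, p in enumerate(partition_values)}
    per_partition = {}
    for _task, p in files:
        per_partition[p] = per_partition.get(p, 0) + 1
    return per_partition

random.seed(0)
# Hypothetical partition values for 10,000 rows.
rows = [random.choice(["2020-06-08", "2020-06-09", "2020-06-10"])
        for _ in range(10_000)]

# Without distribute by: rows may sit on any of 200 shuffle tasks
# (modeled here as a random assignment, which is an assumption).
before = count_files(rows, lambda i: random.randrange(200))
# With distribute by cast(rand() * 5 as int): only 5 distinct keys exist,
# so at most 5 tasks receive rows.
after = count_files(rows, lambda i: int(random.random() * 5))
```

In this model `after` never exceeds 5 files per partition, while `before` is far larger. Changing the multiplier changes the upper bound on files per partition.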

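Scenario 1
When data is inserted with dynamic partitioning and the query involves a shuffle, the number of tasks in the final stage is set by spark.sql.shuffle.partitions (default 200). Each task writes its own file, so the insert can produce up to 200 small files; distribute by is used to control how many files are written.
Scenario 2
Building per-user behavior sequences.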
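
For the behavior-sequence case, the usual pattern is to distribute by the user and sort by user and time, so that each user's events arrive together and in order. The Python sketch below imitates that pattern. The event tuples, `reducer_of`, and `behaviour_sequences` are hypothetical, and `zlib.crc32` again stands in for the engine's hash.

```python
import zlib
from itertools import groupby

# Hypothetical events: (user, time, action).
events = [
    ("u2", "10:05", "click"),
    ("u1", "10:01", "view"),
    ("u1", "10:03", "buy"),
    ("u2", "10:02", "view"),
    ("u1", "10:02", "click"),
]

def reducer_of(user, num_reducers):
    # Stand-in for the engine's hash partitioning.
    return zlib.crc32(user.encode("utf-8")) % num_reducers

def behaviour_sequences(events, num_reducers=2):
    reducers = [[] for _ in range(num_reducers)]
    for e in events:
        reducers[reducer_of(e[0], num_reducers)].append(e)  # distribute by user
    sequences = {}
    for part in reducers:
        part.sort(key=lambda e: (e[0], e[1]))               # sort by user, time
        for user, group in groupby(part, key=lambda e: e[0]):
            sequences[user] = [action for _, _, action in group]
    return sequences

sequences = behaviour_sequences(events)
# sequences == {"u1": ["view", "click", "buy"], "u2": ["view", "click"]}
```

Because every event of a user is on one reducer and sorted by time, a single pass over each reducer's rows is enough to assemble the sequences.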

cluster by
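cluster by combines the functionality of distribute by and sort by on the same columns. The sort, however, is always ascending; you cannot specify ASC or DESC. If you need descending order, write distribute by and sort by ... desc explicitly.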
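
The relationship between cluster by and the distribute by plus sort by pair can be sketched as follows. This is a toy Python model: `distribute_sort` and `cluster_by` are hypothetical helpers, and `zlib.crc32` is a stand-in for the real hash function.

```python
import zlib

def distribute_sort(keys, num_reducers, descending=False):
    """Toy model of `distribute by k sort by k [desc]`."""
    reducers = [[] for _ in range(num_reducers)]
    for k in keys:
        reducers[zlib.crc32(str(k).encode("utf-8")) % num_reducers].append(k)
    return [sorted(r, reverse=descending) for r in reducers]

def cluster_by(keys, num_reducers):
    # Same routing and sort key, with the direction fixed to ascending.
    return distribute_sort(keys, num_reducers, descending=False)

keys = [5, 3, 9, 3, 1, 9, 7]
clustered = cluster_by(keys, 2)
# Descending order has to be requested through the explicit form.
descending = distribute_sort(keys, 2, descending=True)
```

Both results place equal keys on the same reducer; they differ only in the sort direction within each reducer.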
