HQL之数据排序

2019-10-28 本文已影响0人长较瘦

HQL支持以下四种关键字来进行数据排序，

ORDER BY [ASC|DESC]
SORT BY [ASC|DESC]
DISTRIBUTE BY
CLUSTER BY
以下依次为这些关键字的说明和解释，

ORDER BY [ASC|DESC]:
和SQL中SELECT语句的ORDER BY子句相似。需要注意的是，该排序是针对每一个reducer的输出汇总后整体的排序，只使用一个reducer来做全局整体排序，通常会消耗较多时间来完成这个排序任务。
ASC这个开关项是缺省开关项，如果没有指定ASC|DESC，ASC就会被缺省使用。自Hive 2.1.0版本起，HQL支持可以指定NULL值位于排序中的起始/结束。NULL FIRST, NULL值位于起始；NULL LAST, NULL值位于结束位置。缺省情况下，NULL值被视为最小，在做ASC排序时，位于起始位置，在做DESC排序时，位于结束位置。

> SELECT name FROM employee ORDER BY name DESC; -- By columns      
+----------+      
|   name   |      
+----------+      
| Will     |      
| Shelley  |      
| Michael  |      
| Lucy     |      
+----------+      
4 rows selected (24.057 seconds)            
> SELECT name       
> FROM employee  -- Order by expression      
> ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;      
+----------+      
|   name   |      
+----------+      
| Lucy     |      
| Shelley  |      
| Michael  |      
| Will     |      
+----------+      
4 rows selected (25.057 seconds)            
> SELECT * FROM emp_simple ORDER BY work_place NULL LAST;      
+---------+------------+      
| name    | work_place |      
+---------+------------+      
| Lucy    | Montreal   |      
| Michael | Toronto    |      
| Will    | NULL       | -- NULL stays at the last      
+---------+------------+      
3 rows selected (0.263 seconds)

推荐LIMIT子句和ORDER BY一起使用。当设置hive.mapred.mode为strict时，ORDER BY子句一定要和LIMIT子句一起使用，否则执行相关HQL语句时会返回异常。另外，Hive 1.x版本，该设置缺省为 nonstrict，但自从Hive 2.x后，该设置缺省为 strict。

SORT BY [ASC|DESC]: 用来指定提供给Reducer输入记录的排序方式。由于是给Reducer的输入数据排序，所以这个排序操作在Reducer开始处理数据前就已经完成了。该排序不是全局排序，只是在每个Reducer内局部排。当时有且仅有一个Reducer时，和ORDER BY的效果一致。另外SORT BY通常需要和DISTRIBUTE BY一起使用，否则无实际效果。

> SET mapred.reduce.tasks = 2; -- Sort by with more than 1 reducer      
No rows affected (0.001 seconds)            
> SELECT name FROM employee SORT BY name DESC;      
+---------+      
|   name  |      
+---------+      
| Shelley | -- Once result is collected to client, it is       
| Michael |   order-less      
| Lucy    |      
| Will    |      
+---------+      
4 rows selected (54.386 seconds)            
> SET mapred.reduce.tasks = 1; -- Sort by one reducer      
No rows affected (0.002 seconds)            
> SELECT name FROM employee SORT BY name DESC;      
+----------+      
|   name   |      
+----------+      
| Will     | -- Same result to ORDER BY      
| Shelley  |      
| Michael  |      
| Lucy     |      
+----------+      
4 rows selected (46.03 seconds)

DISTRIBUTE BY:
和GROUP BY子句相似，Mapper用来确定数据输出至哪个Reducer。
和GROUP BY子句不一样的地方在于，DISTRIBUTE BY子句无法用于数据聚合函数，如COUNT(*)，而只是指定数据输出至哪个Reducer。
DISTRIBUTE BY子句通常用于按照某些列来重新存放数据。另外该子句后的列，必须包含在SELECT子句的列列表中，否在会报错。

-- Error when not specify distributed column employee_id in select      
> SELECT name FROM employee_hr DISTRIBUTE BY employee_id;       
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference 'employee_id': (possible column names are: name)             
> SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;       
+----------+--------------+      
|   name   | employee_id  |      
+----------+--------------+      
| Lucy     | 103          |      
| Steven   | 102          |      
| Will     | 101          |      
| Michael  | 100          |      
+----------+--------------+      
4 rows selected (38.92 seconds)            
-- Used with SORT BY to order name started on the same day      
> SELECT name, start_date      
> FROM employee_hr      
> DISTRIBUTE BY start_date SORT BY name;      
+----------+--------------+      
|   name   |  start_date  |      
+----------+--------------+      
| Lucy     | 2010-01-03   |      
| Michael  | 2014-01-29   |      
| Steven   | 2012-11-03   |     
| Will     | 2013-10-02   |      
+----------+--------------+      
4 rows selected (38.01 seconds)

CLUSTER BY: 等用于DISTRIBUTE BY子句和SORT BY子句同时作用于同一组数据列。另外，该子句不可以指定ASC或DESC选项。

> SELECT name, employee_id FROM employee_hr CLUSTER BY name;      
+----------+--------------+      
|   name   | employee_id  |      
+----------+--------------+      
| Lucy     | 103          |      
| Michael  | 100          |      
| Steven   | 102          |      
| Will     | 101          |      
+----------+--------------+      
4 rows selected (39.791 seconds)

最后，以上子句的区别，大家可以通过以下图表来体会，

HQL 排序子句的区别

HQL之数据排序

猜你喜欢

热点阅读