HQL之数据排序

2019-10-28  本文已影响0人  长较瘦

HQL支持以下四种关键字来进行数据排序,

  1. ORDER BY [ASC|DESC]:
    和SQL中SELECT语句的ORDER BY子句相似。需要注意的是,该排序是针对每一个reducer的输出汇总后整体的排序,只使用一个reducer来做全局整体排序,通常会消耗较多时间来完成这个排序任务。
    ASC这个开关项是缺省开关项,如果没有指定ASC|DESC,ASC就会被缺省使用。自Hive 2.1.0版本起,HQL支持可以指定NULL值位于排序中的起始/结束。NULL FIRST, NULL值位于起始;NULL LAST, NULL值位于结束位置。缺省情况下,NULL值被视为最小,在做ASC排序时,位于起始位置,在做DESC排序时,位于结束位置。
> SELECT name FROM employee ORDER BY name DESC; -- By columns      
+----------+      
|   name   |      
+----------+      
| Will     |      
| Shelley  |      
| Michael  |      
| Lucy     |      
+----------+      
4 rows selected (24.057 seconds)            
> SELECT name       
> FROM employee  -- Order by expression      
> ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;      
+----------+      
|   name   |      
+----------+      
| Lucy     |      
| Shelley  |      
| Michael  |      
| Will     |      
+----------+      
4 rows selected (25.057 seconds)            
> SELECT * FROM emp_simple ORDER BY work_place NULL LAST;      
+---------+------------+      
| name    | work_place |      
+---------+------------+      
| Lucy    | Montreal   |      
| Michael | Toronto    |      
| Will    | NULL       | -- NULL stays at the last      
+---------+------------+      
3 rows selected (0.263 seconds)

推荐LIMIT子句和ORDER BY一起使用。当设置hive.mapred.mode为strict时,ORDER BY子句一定要和LIMIT子句一起使用,否则执行相关HQL语句时会返回异常。另外,Hive 1.x版本,该设置缺省为 nonstrict,但自从Hive 2.x后,该设置缺省为 strict。

  1. SORT BY [ASC|DESC]: 用来指定提供给Reducer输入记录的排序方式。由于是给Reducer的输入数据排序,所以这个排序操作在Reducer开始处理数据前就已经完成了。该排序不是全局排序,只是在每个Reducer内局部排。当时有且仅有一个Reducer时,和ORDER BY的效果一致。另外SORT BY通常需要和DISTRIBUTE BY一起使用,否则无实际效果。
> SET mapred.reduce.tasks = 2; -- Sort by with more than 1 reducer      
No rows affected (0.001 seconds)            
> SELECT name FROM employee SORT BY name DESC;      
+---------+      
|   name  |      
+---------+      
| Shelley | -- Once result is collected to client, it is       
| Michael |   order-less      
| Lucy    |      
| Will    |      
+---------+      
4 rows selected (54.386 seconds)            
> SET mapred.reduce.tasks = 1; -- Sort by one reducer      
No rows affected (0.002 seconds)            
> SELECT name FROM employee SORT BY name DESC;      
+----------+      
|   name   |      
+----------+      
| Will     | -- Same result to ORDER BY      
| Shelley  |      
| Michael  |      
| Lucy     |      
+----------+      
4 rows selected (46.03 seconds)
  1. DISTRIBUTE BY:
    和GROUP BY子句相似,Mapper用来确定数据输出至哪个Reducer。
    和GROUP BY子句不一样的地方在于,DISTRIBUTE BY子句无法用于数据聚合函数,如COUNT(*),而只是指定数据输出至哪个Reducer。
    DISTRIBUTE BY子句通常用于按照某些列来重新存放数据。另外该子句后的列,必须包含在SELECT子句的列列表中,否在会报错。
-- Error when not specify distributed column employee_id in select      
> SELECT name FROM employee_hr DISTRIBUTE BY employee_id;       
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference 'employee_id': (possible column names are: name)             
> SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;       
+----------+--------------+      
|   name   | employee_id  |      
+----------+--------------+      
| Lucy     | 103          |      
| Steven   | 102          |      
| Will     | 101          |      
| Michael  | 100          |      
+----------+--------------+      
4 rows selected (38.92 seconds)            
-- Used with SORT BY to order name started on the same day      
> SELECT name, start_date      
> FROM employee_hr      
> DISTRIBUTE BY start_date SORT BY name;      
+----------+--------------+      
|   name   |  start_date  |      
+----------+--------------+      
| Lucy     | 2010-01-03   |      
| Michael  | 2014-01-29   |      
| Steven   | 2012-11-03   |     
| Will     | 2013-10-02   |      
+----------+--------------+      
4 rows selected (38.01 seconds)
  1. CLUSTER BY: 等用于DISTRIBUTE BY子句和SORT BY子句同时作用于同一组数据列。另外,该子句不可以指定ASC或DESC选项。
> SELECT name, employee_id FROM employee_hr CLUSTER BY name;      
+----------+--------------+      
|   name   | employee_id  |      
+----------+--------------+      
| Lucy     | 103          |      
| Michael  | 100          |      
| Steven   | 102          |      
| Will     | 101          |      
+----------+--------------+      
4 rows selected (39.791 seconds)

最后,以上子句的区别,大家可以通过以下图表来体会,


HQL 排序子句的区别
上一篇下一篇

猜你喜欢

热点阅读