Hive调优之工具篇

2019-10-29  本文已影响0人  长较瘦

HQL提供EXPLAIN和ANALYZE语句,用于检查和确定查询性能。另外Hive日志包含有足够详细的信息用于性能调查和问题确认。

  1. EXPLAIN 不需要执行查询即可返回一个查询的执行计划。当我们担心查询有性能问题是,可以用该语句来分析查询语句。该语句语法为,

EXPLAIN FORMATTED|EXTENDED|DEPENDENCY|AUTHORIZATION] hql_query
其中:

> EXPLAIN SELECT gender_age.gender, count(*) 
> FROM employee_partitioned WHERE year=2018 
> GROUP BY gender_age.gender LIMIT 2;
+----------------------------------------------------------------------+
| Explain                                                              |
+----------------------------------------------------------------------+
| STAGE DEPENDENCIES:                                                  |
| Stage-1 is a root stage                                              |
| Stage-0 depends on stages: Stage-1                                   |
|                                                                      |
| STAGE PLANS:                                                         |
| Stage: Stage-1                                                       |
| Map Reduce                                                           |
| Map Operator Tree:                                                   |
| TableScan                                                            |
| alias: employee_partitioned                                          |
| Pruned Column Paths: gender_age.gender                               |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Select Operator                                                      |
| expressions: gender_age.gender (type: string)                        |
| outputColumnNames: _col0                                             |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Group By Operator                                                    |
| aggregations: count()                                                |
| keys: _col0 (type: string)                                           |
| mode: hash                                                           |
| outputColumnNames: _col0, _col1                                      |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| Reduce Output Operator                                               |
| key expressions: _col0 (type: string)                                |
| sort order: +                                                        |
| Map-reduce partition columns: _col0 (type: string)                   |
| Statistics:                                                          |
| Num rows: 4 Data size: 223 Basic stats: COMPLETE Column stats: NONE  |
| TopN Hash Memory Usage: 0.1                                          |
| value expressions: _col1 (type: bigint)                              |
| Reduce Operator Tree:                                                |
| Group By Operator                                                    |
| aggregations: count(VALUE._col0)                                     |
| keys: KEY._col0 (type: string)                                       |
| mode: mergepartial                                                   |
| outputColumnNames: _col0, _col1                                      |
| Statistics:                                                          |
| Num rows: 2 Data size: 111 Basic stats: COMPLETE Column stats: NONE  |
| Limit                                                                |
| Number of rows: 2                                                    |
| Statistics:                                                          |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE  |
| File Output Operator                                                 |
| compressed: false                                                    |
| Statistics:                                                          |
| Num rows: 2 Data size: 110 Basic stats: COMPLETE Column stats: NONE  |
| table:                                                               |
| input format:                                                        |
| org.apache.hadoop.mapred.SequenceFileInputFormat                     |
| output format:                                                       |
| org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat            |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe            |
|                                                                      |
| Stage: Stage-0                                                       |
| Fetch Operator                                                       |
| limit: 2                                                             |
| Processor Tree:                                                      |
| ListSink                                                             |
+----------------------------------------------------------------------+
53 rows selected (0.232 seconds)
  1. ANALYZE Hive统计数据是描述更多详细信息如行数、文件数和数据库中各对象源数据大小等数据集合。统计数据是数据的元数据(Metadata),收集和存放于Metastore数据库中。Hive的统计数据支持表、分区和列等级别。这些统计数据是Hive 基于成本优化器(Cost-Based Optimizer (CBO))的输入方,帮助基于成本优化器以消耗系统资源完成查询所需最低成本来设计执行计划。Hive统计数据既可以部分自动收集(自Hive V3.2.0之后),也可以手工通过执行ANALYZE来收集生成。
手工收集表统计数据

但NOSCAN指定时,该操作不会扫描文件,只收集文件数的大小

> ANALYZE TABLE employee COMPUTE STATISTICS;
No rows affected (27.979 seconds)

> ANALYZE TABLE employee COMPUTE STATISTICS NOSCAN;
No rows affected (25.979 seconds)
手工收集分区统计数据
-- Applies for specific partition
> ANALYZE TABLE employee_partitioned 
> PARTITION(year=2018, month=12) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
      
-- Applies for all partitions
> ANALYZE TABLE employee_partitioned 
> PARTITION(year, month) COMPUTE STATISTICS;
No rows affected (45.054 seconds)
手工收集列统计数据
> ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS       
employee_id;       
No rows affected (41.074 seconds)

自动收集统计数据可以通过

SET hive.stats.autogather=true

通过INSERT OVERWRITE/INTO语句来加载表或者分区数据的统计数据会自动收集至Metastore,而LOAD语句不会触发自动收集统计数据机制。
查看统计数据可以使用DESCRIBE EXTENDED/FORMATTED命令,以下为相关示例,

-- Check statistics in a table
> DESCRIBE EXTENDED employee_partitioned PARTITION(year=2018, month=12);
-- Check statistics in a partition
> DESCRIBE EXTENDED employee;
...
parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).
-- Check statistics in a column
> DESCRIBE FORMATTED employee.name;
+--------+---------+---+---+---------+--------------+
|col_name|data_type|min|max|num_nulls|distinct_count| ...
+--------+---------+---+---+---------+--------------+
| name   | string  |   |   | 0       | 5            | ...
+--------+---------+---+---+---------+--------------+
+-----------+-----------+
|avg_col_len|max_col_len| ...
+-----------+-----------+
| 5.6       | 7         | ...
+-----------+-----------+
3 rows selected (0.116 seconds)
  1. Logs 日志提供用于查看一个查询或者Job任务的详细信息。通过查看日志,我们可以发现运行问题和错误,这些问题往往导致很糟糕的性能。Hive提供两类日志,包括系统日志和Job任务日志。

hive.root.logger=WARN,DRFA ## set logger level
hive.log.dir=/tmp/${user.name} ## set log file path
hive.log.file=hive.log ## set log file name

以上设置适用于所有用户,我们也可以通过在Hive命令行中指定这些设置,这样只在当前用户会话中生效,如:

$hive --hiveconf hive.root.logger=DEBUG,console

yarn logs -applicationId <application_id>

上一篇 下一篇

猜你喜欢

热点阅读