Hive Summary, Part 3 (Compression)

2019-07-16 利伊奥克儿

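Reference: related blog posts by 黑泽君.
This post summarizes pitfalls I have run into in day-to-day work; it selects from and supplements the content of 黑泽君's related posts.
Enabling compression of map output reduces the amount of data transferred between the map and reduce tasks of a job.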

The commands below show the current values; to change a setting, append =value to the same command.

Check whether Hive compression of intermediate data is enabled
hive> set hive.exec.compress.intermediate;
hive.exec.compress.intermediate=false

Enable Hive compression of intermediate data
hive>  set hive.exec.compress.intermediate=true;

Check whether MapReduce map output compression is enabled
hive> set mapreduce.map.output.compress;
mapreduce.map.output.compress=true

Check which compression codec MapReduce uses for map output
hive> set mapreduce.map.output.compress.codec;
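
To get a feel for why compressing map output cuts the data moved between map and reduce tasks, here is a minimal Python sketch. It is only an illustration: the rows are made up in the shape of the sample record in this post, and Python's zlib stands in for a Hadoop codec (it is the same DEFLATE family as DefaultCodec; Snappy would trade a lower ratio for speed).

```python
import zlib

# Simulated map output: 100,000 tab-separated rows shaped like the sample
# record in this post (the values themselves are made up).
map_output = b"".join(
    b"%08d\t2019-06-16 00:21:03\t10000\n" % i for i in range(100_000)
)

# Compress the whole buffer, as a codec would before the shuffle.
compressed = zlib.compress(map_output)
ratio = len(compressed) / len(map_output)

print(f"raw: {len(map_output)} bytes")
print(f"compressed: {len(compressed)} bytes ({ratio:.2%} of raw)")
```

The actual saving depends on the data and on the codec, so treat the printed ratio as a rough indication only.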

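When Hive writes output into a table, that output can be compressed as well. The property hive.exec.compress.output controls this feature.
Users may want to keep the default value false in the global configuration file, so that output is uncompressed plain text by default. Compression can then be turned on per query or per script by setting the property to true.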

The commands below show the current values; to change a setting, append =value to the same command.

Check whether Hive compression of final output is enabled
hive> set hive.exec.compress.output;
hive.exec.compress.output=false

Check whether compression of the final MapReduce output is enabled
hive> set mapreduce.output.fileoutputformat.compress;

Check which compression codec is used for the final MapReduce output
hive> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec

Set the codec to SnappyCodec
hive> set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec ;

Check whether the final MapReduce output uses block compression (the Hadoop default is RECORD, i.e. per-record compression)
hive> set mapreduce.output.fileoutputformat.compress.type;
mapreduce.output.fileoutputformat.compress.type=BLOCK

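File storage formats

The main storage formats Hive supports are TEXTFILE, SEQUENCEFILE, ORC, and PARQUET.

TextFile format

The default format. Data is not compressed, so both disk overhead and parsing overhead are high.
It can be combined with Gzip or Bzip2.
With Gzip, however, Hive cannot split the file, so the data cannot be processed in parallel.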
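
The reason a Gzip file cannot be split is that a reader has to start at the beginning of the stream. The following Python sketch illustrates this with the standard library only; the helper name can_decode_from and the sample rows are made up for the illustration, and this is not how Hadoop computes input splits.

```python
import gzip
import zlib


def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if gzip decoding succeeds when starting at `offset`."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start.
        zlib.decompressobj(31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Made-up rows, compressed as one gzip stream (like one .gz file on HDFS).
rows = b"".join(b"user%06d,2019-06-16,10000\n" % i for i in range(50_000))
compressed = gzip.compress(rows)

whole_file_ok = can_decode_from(compressed, 0)
mid_file_ok = can_decode_from(compressed, len(compressed) // 2)

print("decode from start:", whole_file_ok)
print("decode from the middle:", mid_file_ok)
```

A mapper that was handed only the second half of the file would be in the position of the second call, which is why a Gzip text file ends up being processed by a single mapper.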

ORC format

ORC (Optimized Row Columnar) is a storage format introduced in Hive 0.11.

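Index Data
Index Data includes the minimum and maximum value of each column, and the row positions within each column. (It can also include a bit field or a bloom filter.) Row index entries provide offsets that let a reader seek to the right compression block and to the right byte within the decompressed block.
Note that ORC indexes are used only to select stripes and row groups, not to answer queries.
Having relatively frequent row index entries makes it possible to skip rows within a stripe for fast reads, even when stripes are large. By default, rows can be skipped in units of 10,000.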
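
The min/max skipping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ORC reader: RowGroupStats, build_row_index and groups_to_read are hypothetical names, the userid values are generated, and the data is assumed to be sorted by userid so that the min/max ranges of the row groups do not overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    first_row: int   # position of the first row in the group
    min_value: str   # smallest userid in the group
    max_value: str   # largest userid in the group


def build_row_index(values: List[str], stride: int = 10_000) -> List[RowGroupStats]:
    """Record min/max for every `stride` rows (cf. orc.row.index.stride)."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append(RowGroupStats(start, min(chunk), max(chunk)))
    return index


def groups_to_read(index: List[RowGroupStats], wanted: str) -> List[int]:
    """Keep only the row groups whose [min, max] range can contain `wanted`."""
    return [g.first_row for g in index if g.min_value <= wanted <= g.max_value]


# 50,000 generated userids, already sorted.
userids = [f"{i:08d}" for i in range(50_000)]
index = build_row_index(userids)
hits = groups_to_read(index, "00012138")

print(f"{len(index)} row groups, groups that must be read start at rows {hits}")
```

If the column were not sorted, most row groups would have wide, overlapping min/max ranges and far fewer of them could be skipped.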

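Row Data
Holds the actual data: a group of rows is taken first, and those rows are then stored column by column. Each column is encoded and stored as multiple Streams.

Stripe Footer
Holds the type, length, and other information of each Stream.

File Footer
Each file has one File Footer, which holds the number of rows in each Stripe, the data type of each Column, and so on.

PostScript
Each file ends with a PostScript, which records the compression type of the whole file, the length of the File Footer, and so on.

When reading a file, the reader seeks to the end of the file and reads the PostScript, parses the File Footer length from it, then reads the File Footer, parses the information of each Stripe from it, and finally reads the Stripes. In other words, the file is read from back to front.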
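
The back-to-front read order can be illustrated with a toy file layout in Python. Everything here is an assumption made for the sketch: the functions write_toy_file and read_toy_file are hypothetical, and the footer and postscript are encoded as JSON purely for readability, which is not the real ORC encoding. The only thing it mirrors is the order: postscript length, then postscript, then footer, then stripes.

```python
import io
import json
from typing import List


def write_toy_file(stripes: List[bytes]) -> bytes:
    """Layout: stripes | footer | postscript | 1 byte holding the postscript length."""
    buf = io.BytesIO()
    stripe_info = []
    for data in stripes:
        stripe_info.append({"offset": buf.tell(), "length": len(data)})
        buf.write(data)
    footer = json.dumps({"stripes": stripe_info}).encode()
    buf.write(footer)
    postscript = json.dumps(
        {"compression": "NONE", "footer_length": len(footer)}
    ).encode()
    buf.write(postscript)
    buf.write(bytes([len(postscript)]))  # the toy postscript is shorter than 256 bytes
    return buf.getvalue()


def read_toy_file(blob: bytes) -> List[bytes]:
    f = io.BytesIO(blob)
    # 1. Seek to the end and read the postscript length.
    f.seek(-1, io.SEEK_END)
    ps_len = f.read(1)[0]
    # 2. Read the postscript to learn the footer length.
    f.seek(-1 - ps_len, io.SEEK_END)
    postscript = json.loads(f.read(ps_len))
    footer_len = postscript["footer_length"]
    # 3. Read the footer to learn where each stripe lives.
    f.seek(-1 - ps_len - footer_len, io.SEEK_END)
    footer = json.loads(f.read(footer_len))
    # 4. Only now read the stripes themselves.
    result = []
    for info in footer["stripes"]:
        f.seek(info["offset"])
        result.append(f.read(info["length"]))
    return result


stripes = [b"stripe-0", b"stripe-1-data"]
roundtrip = read_toy_file(write_toy_file(stripes))
print(roundtrip)
```

Keeping the metadata at the tail lets a writer stream the stripes out first and record their offsets once they are known.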

[Figure: ORC file structure]

Example:
Data volume: 19017003
hive> CREATE TABLE tmp.orcTest stored AS ORC 
    > TBLPROPERTIES
    > ('orc.compress'='SNAPPY',
    > 'orc.create.index'='true',
    > 'orc.bloom.filter.fpp'='0.05',
    > 'orc.stripe.size'='10485760',
    > 'orc.row.index.stride'='10000') 
    > AS 
    > SELECT 
    > *
    > FROM tmp.textFileTest;
Stage-Stage-1: Map: 16  Reduce: 62   Cumulative CPU: 1350.32 sec   HDFS Read: 4140450145 HDFS Write: 250057437 SUCCESS

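Loading data into the table: LOAD DATA cannot be used here, because it only moves files into place without converting them. Use INSERT INTO or the CREATE TABLE ... AS SELECT statement above; in other words, the data must be written by a MapReduce job.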

ORC: without the index
hive> set hive.optimize.index.filter=false;
hive> select * from tmp.orcTest where userid ='02012138';
Query ID = hdfs_20190716114646_cb74ab6e-3d7f-42a8-b1e8-703abd89c420
Total jobs = 1
Launching Job 1 out of 1
.
.
.
Stage-Stage-1: Map: 5   Cumulative CPU: 32.88 sec   HDFS Read: 250016639 HDFS Write: 175 SUCCESS
Total MapReduce CPU Time Spent: 32 seconds 880 msec
OK
02012138     2019-06-16 00:21:03.000000000   10000                           0       20190616
Time taken: 13.935 seconds, Fetched: 1 row(s)
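Clearly all records were scanned: HDFS Read (250016639 bytes) is about the size of the whole table. Now run the query again with the index enabled:

ORC: with the index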
hive> set hive.optimize.index.filter=true;
hive> select * from tmp.orcTest where userid ='02012138';
Query ID = hdfs_20190716114747_dd02a4a3-7a25-424b-84d8-a314e918464f
Total jobs = 1
Launching Job 1 out of 1
.
.
.
Stage-Stage-1: Map: 5   Cumulative CPU: 24.77 sec   HDFS Read: 58144301 HDFS Write: 175 SUCCESS
Total MapReduce CPU Time Spent: 24 seconds 770 msec
OK
02012138     2019-06-16 00:21:03.000000000   10000
Time taken: 12.916 seconds, Fetched: 1 row(s)

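This time only part of the data was scanned (HDFS Read dropped from 250016639 to 58144301 bytes): the min/max values in the Row Group Index were used to skip the stripes and row groups that cannot match the WHERE condition, so the index is effective.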

TextFile
hive> select * from tmp.tb_tmp_tms_stb_errorcode where userid ='02012138';
Query ID = hdfs_20190716114747_ac4b272c-4317-4b47-9726-91f297fc1af8
Total jobs = 1
Launching Job 1 out of 1
.
.
.
Stage-Stage-1: Map: 16   Cumulative CPU: 112.89 sec   HDFS Read: 4140148821 HDFS Write: 175 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 52 seconds 890 msec
OK
02012138     2019-06-16 00:21:03.000000000   10000
Time taken: 17.072 seconds, Fetched: 1 row(s)

The data is not compressed, and the full data set is read.
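
To put the three runs side by side, the short Python sketch below only does arithmetic on the HDFS Read figures copied from the job summaries above; nothing else is assumed.

```python
# HDFS Read values copied from the three job summaries in this post.
hdfs_read_bytes = {
    "TextFile": 4_140_148_821,
    "ORC, hive.optimize.index.filter=false": 250_016_639,
    "ORC, hive.optimize.index.filter=true": 58_144_301,
}

baseline = hdfs_read_bytes["TextFile"]
for name, size in hdfs_read_bytes.items():
    print(f"{name}: {size / 1024 ** 2:.1f} MiB, {size / baseline:.1%} of TextFile")

# Share of the ORC table that still has to be read once the index is used.
index_ratio = (
    hdfs_read_bytes["ORC, hive.optimize.index.filter=true"]
    / hdfs_read_bytes["ORC, hive.optimize.index.filter=false"]
)
orc_vs_text = hdfs_read_bytes["ORC, hive.optimize.index.filter=false"] / baseline
print(f"with the index, about {index_ratio:.0%} of the ORC bytes are read")
```

So the Snappy-compressed ORC table is roughly 6% of the size of the text table, and with the index filter enabled only about a quarter of that is read.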

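Note: some documents I have read say that, because ORC is columnar, filtering a column with operators such as = or > does not trigger MapReduce. In my tests a job was still launched, although the query was faster overall than on TextFile.
I will keep looking for the reason behind the MapReduce behavior; if you know, please share. One possible factor is hive.fetch.task.conversion (see the Fetch section): it was set to minimal in this environment, and under minimal a filter on a column that is not a partition column still launches a job.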

[Figure: tree structure produced by the ORC format]

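Parquet format

I have used this format very little and do not know it well, so I only note it here.

Parquet files are stored in binary form, so they cannot be read directly as text. A file contains both the data and its metadata, so Parquet files are self-describing.

Typically, when Parquet data is stored, the row group size is set according to the Block size. Because a Block is generally the smallest unit of data that a Mapper task processes, each row group can then be handled by one Mapper task, which increases parallelism. The Parquet file format is shown in the figure below.

[Figure: Parquet file format]

Query speed across storage formats: in my tests the speeds were close; ORC and Parquet were sometimes slightly faster than TextFile.

Summary: in real projects, Hive tables are usually stored as ORC or Parquet, and compressed with Snappy or LZO. (Compression saves a large amount of space.)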

Combining storage and compression

Check which compression codecs Hadoop supports

[hdfs@tmpe2e02 tmp_lillcol]$ hadoop checknative
19/07/16 14:21:20 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/07/16 14:21:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  true /opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib64/libbz2.so.1
openssl: true /usr/local/ssl/lib/libcrypto.so

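ORC file format settings specified in HiveQL statements

Key	Default	Notes
orc.compress	ZLIB	High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size	262,144	Number of bytes in each compression chunk
orc.stripe.size	67,108,864	Number of bytes in each stripe
orc.row.index.stride	10,000	Number of rows between index entries (must be >= 1000)
orc.create.index	true	Whether to create row indexes
orc.bloom.filter.columns	""	Comma-separated list of column names for which bloom filters should be created
orc.bloom.filter.fpp	0.05	False positive probability for the bloom filter (must be > 0.0 and < 1.0)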
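
To see what a false positive probability such as 0.05 costs in space, the Python sketch below applies the standard textbook Bloom filter sizing formulas. It is an assumption that one filter covers the rows of one index stride (10,000 here), and the helper names bloom_bits and bloom_hashes are hypothetical; the exact sizing inside ORC may differ.

```python
import math


def bloom_bits(expected_entries: int, fpp: float) -> int:
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))


def bloom_hashes(expected_entries: int, bits: int) -> int:
    """Textbook optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(bits / expected_entries * math.log(2)))


entries = 10_000  # assumed: one filter per row group of 10,000 rows
for fpp in (0.05, 0.01):
    bits = bloom_bits(entries, fpp)
    hashes = bloom_hashes(entries, bits)
    print(f"fpp={fpp}: {bits} bits ({bits / 8 / 1024:.1f} KiB), {hashes} hash functions")
```

A lower orc.bloom.filter.fpp lets more row groups be ruled out, at the price of a larger filter for every indexed column.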

Example:
CREATE TABLE tmp.orcTest stored AS ORC 
TBLPROPERTIES
('orc.compress'='SNAPPY',
'orc.create.index'='true',
'orc.bloom.filter.fpp'='0.05',
'orc.stripe.size'='10485760',
'orc.row.index.stride'='10000') 
AS 
SELECT 
*
FROM tmp.orcTextFile
DISTRIBUTE BY userid sort BY userid;

ORC files use ZLIB compression by default. ZLIB achieves a higher compression ratio than Snappy, but it is slower to compress and decompress.

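Fetch

Fetch means that for certain queries Hive does not need to run MapReduce at all.
For example: SELECT * FROM orcTest; in this case Hive can simply read the files under the storage directory of orcTest and print the result to the console.

Configuration
hive.fetch.task.conversion in hive-default.xml.template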

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
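      Expects one of [none, minimal, more].
      Some select queries can be converted to a single FETCH task, minimizing latency.
      Currently the query should be single sourced, without any subquery, and should not have any aggregations or distincts (which incur RS), lateral views, or joins.
      0. none : disable hive.fetch.task.conversion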
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
</property>

Test:
hive> set hive.fetch.task.conversion; // show the current setting
hive.fetch.task.conversion=minimal

hive> set hive.fetch.task.conversion=none;
hive> select userid from tmp.orcTest limit 10;
Query ID = hdfs_20190716151010_29239e97-a06c-47aa-9958-f1964ab8eb17
Total jobs = 1
Launching Job 1 out of 1
.
.
.
Stage-Stage-1: Map: 5   Cumulative CPU: 15.09 sec   HDFS Read: 4155591 HDFS Write: 661 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 90 msec
OK
02012138
Time taken: 13.908 seconds, Fetched: 10 row(s)

hive> set hive.fetch.task.conversion=more;
hive> select userid from tmp.orcTest limit 10;
OK
02012138
Time taken: 0.025 seconds, Fetched: 10 row(s)

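With hive.fetch.task.conversion set to none, every query runs a MapReduce job (13.908 seconds above). With more, the same query is served by a fetch task and returns in 0.025 seconds.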
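
The three values of hive.fetch.task.conversion can be read as a small decision rule. The Python sketch below is a simplified model of the property description quoted above, not Hive's implementation: QueryShape and can_use_fetch_task are hypothetical names, and real Hive applies further checks that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class QueryShape:
    select_star: bool = False
    # True also when the query has no WHERE clause at all.
    filters_on_partition_columns_only: bool = True
    has_aggregation_or_distinct: bool = False
    has_join: bool = False
    has_subquery: bool = False
    has_lateral_view: bool = False


def can_use_fetch_task(mode: str, q: QueryShape) -> bool:
    """Simplified reading of the hive.fetch.task.conversion description."""
    if mode == "none":
        return False
    if (q.has_aggregation_or_distinct or q.has_join
            or q.has_subquery or q.has_lateral_view):
        return False
    if mode == "minimal":
        # SELECT STAR, FILTER on partition columns, LIMIT only
        return q.select_star and q.filters_on_partition_columns_only
    if mode == "more":
        # SELECT, FILTER, LIMIT only
        return True
    raise ValueError(f"unknown mode: {mode}")


# select userid from tmp.orcTest limit 10
limit_query = QueryShape(select_star=False)
# select * from tmp.orcTest where userid = '...' (userid assumed not to be a partition column)
filter_query = QueryShape(select_star=True, filters_on_partition_columns_only=False)

for mode in ("none", "minimal", "more"):
    print(mode, can_use_fetch_task(mode, limit_query), can_use_fetch_task(mode, filter_query))
```

Under this reading, neither test query qualifies for a fetch task with minimal, and both qualify with more.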

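Local mode

Sometimes the input to a Hive query is very small.
In that case, the overhead of launching tasks for the query can be much larger than the actual execution time of the job.
For most such cases, Hive can use local mode to run all tasks on a single machine.
For small data sets, execution time can be shortened noticeably.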

// Check whether automatic local MR mode is enabled
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false

// Maximum input size for local MR: local MR is used when the input size is below this value; the default is 134217728, i.e. 128 MB
hive> set hive.exec.mode.local.auto.inputbytes.max;
hive.exec.mode.local.auto.inputbytes.max=134217728

// Maximum number of input files for local MR: local MR is used when the number of input files is below this value; the default is 4
hive> set hive.exec.mode.local.auto.input.files.max;
hive.exec.mode.local.auto.input.files.max=4

The commands above show the current values; to change a setting, use the same command with =value appended.
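
Taken together, the three settings amount to a simple check. The Python sketch below follows the wording of the comments above (input size and file count both below their limits); would_run_locally is a hypothetical helper, the exact boundary handling inside Hive is an assumption, and the reducer condition comes from the Hive documentation rather than from this post.

```python
def would_run_locally(
    input_bytes: int,
    input_files: int,
    num_reducers: int = 1,
    auto: bool = False,             # hive.exec.mode.local.auto
    max_bytes: int = 134_217_728,   # hive.exec.mode.local.auto.inputbytes.max
    max_files: int = 4,             # hive.exec.mode.local.auto.input.files.max
) -> bool:
    """Rough model of when Hive picks local mode automatically."""
    return (
        auto
        and input_bytes < max_bytes
        and input_files < max_files
        and num_reducers <= 1       # per the Hive docs: 0 or 1 reduce tasks
    )


# A single small file of about 22.5 KB.
small_table = dict(input_bytes=int(22.5 * 1024), input_files=1)
# A large text table: 4140148821 bytes read by 16 mappers.
big_table = dict(input_bytes=4_140_148_821, input_files=16)

print(would_run_locally(**small_table, auto=True))
print(would_run_locally(**small_table, auto=False))
print(would_run_locally(**big_table, auto=True))
```

The small-table numbers match the size of the test table used in the local mode comparison in this post, and the large ones match the TextFile query shown earlier.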

Create a test table
hive> CREATE TABLE tmp.orclocal stored AS ORC 
    > TBLPROPERTIES
    > ('orc.compress'='SNAPPY',
    > 'orc.create.index'='true',
    > 'orc.bloom.filter.fpp'='0.05',
    > 'orc.stripe.size'='10485760',
    > 'orc.row.index.stride'='10000') 
    > AS 
    > SELECT 
    > *
    > FROM tmp.orcTest 
    > DISTRIBUTE BY userid sort BY userid limit 1000;

Check the number of files: 1, which is within the configured limit
hive> !hadoop fs -du -h hdfs://ns1/user/hive/warehouse/tmp.db/orclocal;
22.5 K  45.1 K  hdfs://ns1/user/hive/warehouse/tmp.db/orclocal/000000_0

Disable local mode and run the query
hive> set hive.exec.mode.local.auto=false;
hive> select count(*) from tmp.orclocal;
OK
1000
Time taken: 18.98 seconds, Fetched: 1 row(s)

Enable local mode and run the query
hive> set hive.exec.mode.local.auto=true;
hive> select count(*) from tmp.orclocal;
OK
1000
Time taken: 1.576 seconds, Fetched: 1 row(s)

The difference looks almost too good to be true, but after several runs it holds up: for small files, local mode really is much faster.