Data Skew Optimization in Hive
1. Skew Join
When data is distributed very unevenly, skew can leave a small number of compute nodes handling the bulk of the computation. The following settings tell Hive to optimize for skew when it occurs:
jdbc:hive2://> SET hive.optimize.skewjoin=true;
--Set this to true if the join has data skew. The default is false.
jdbc:hive2://> SET hive.skewjoin.key=100000;
--This is the default value. If more than this many rows share the same join key, Hive treats that key as skewed and sets its rows aside to be processed in a follow-up job, instead of sending them all to a single reducer.
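The role of the threshold can be sketched with a toy model in Python. This is only an illustration of the idea, not Hive's implementation: the function name `split_skewed_keys`, the sample rows, and the small threshold are all hypothetical, and real Hive detects skewed keys at runtime inside the join operator.

```python
from collections import Counter


def split_skewed_keys(rows, threshold):
    """Split (key, value) rows into regular rows and rows of skewed keys.

    A key counts as skewed when more than `threshold` rows carry it,
    which mirrors the role of hive.skewjoin.key in this toy model.
    """
    counts = Counter(key for key, _ in rows)
    skewed = {key for key, n in counts.items() if n > threshold}
    regular_rows = [row for row in rows if row[0] not in skewed]
    skewed_rows = [row for row in rows if row[0] in skewed]
    return skewed, regular_rows, skewed_rows


# Hypothetical input: key "a" dominates, "b" and "c" are rare.
rows = [("a", i) for i in range(8)] + [("b", 0), ("c", 1)]
skewed, regular_rows, skewed_rows = split_skewed_keys(rows, threshold=5)
```

In this sketch the rows in `skewed_rows` stand for the work that would be deferred to a separate pass, while `regular_rows` would go through the normal join.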
Note
Data skew can also occur in GROUP BY queries. To optimize for it, enable skew handling for GROUP BY with the following setting:
SET hive.groupby.skewindata=true;
Once this is set, Hive first runs an additional MapReduce job whose map output is distributed randomly across the reducers, so no single reducer receives all the rows for a hot key. Each reducer computes partial aggregates, and a second job then groups those partial results by key to produce the final result.
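The effect of the random distribution step can be illustrated with a small Python simulation. It is a sketch under assumptions, not Hive code: the function `two_stage_count`, the key names, and the reducer count are hypothetical, and a simple count stands in for the GROUP BY aggregate.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count keys in two stages, the way a skew-aware GROUP BY might.

    Stage 1 sends each row to a random reducer, which pre-aggregates.
    Stage 2 merges the partial counts per key.
    """
    rng = random.Random(seed)
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        # Random assignment: rows of a hot key spread over all reducers.
        partials[rng.randrange(num_reducers)][key] += 1

    final = Counter()
    for partial in partials:
        # Merge step: add up the partial counts for each key.
        final.update(partial)

    # Number of rows each stage-1 reducer had to process.
    loads = [sum(partial.values()) for partial in partials]
    return final, loads


# Hypothetical skewed input: one hot key holds 90% of the rows.
keys = ["hot"] * 9000 + ["cold%d" % i for i in range(1000)]
final, loads = two_stage_count(keys, num_reducers=4)
```

If the rows were partitioned by key alone, one reducer would receive all 9000 "hot" rows. With random assignment, each of the four reducers handles roughly a quarter of the input, and the merge step only has to combine a few partial counts per key.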
For more information about Hive join optimization, refer to the Apache Hive wiki at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.