hive之with cube、with rollup、group
2020-01-04 本文已影响0人
愤怒的谜团
一:多维分组聚合背景
企业针对有些指标,希望能够支撑上钻和下钻来进行查看,这样即可以满足精细化分析,又可以满足指标概览,当然现在的一些可视化工具,已经支持了上钻和下钻了,只要导入最细粒度的数据即可。
二:grouping sets介绍
直接上例子,测试数据源如下:
select *
from dw_tmp.pvtable;
测试数据源.png
三个字段分别是系统类型,版本号,以及浏览页面类型,等会通过分组聚合,来查看不同的os,version,达到多少pv(pv就是去重的页面数)。
测试一:
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version)
ORDER BY GROUPING__ID
GROUPING__ID其实就是对分组进行一个编号
测试结果.png
测试二:
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version,(os,version))
ORDER BY GROUPING__ID
测试结果.png
结论:
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version,(os,version))
ORDER BY GROUPING__ID
等同于
select os,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os,version
union all
select os,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os
union all
select null,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY version
三:with cube介绍
测试一:
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
WITH CUBE
ORDER BY GROUPING__ID
测试结果.png
从结论当中可以看出,相对于os,version,做了一个笛卡尔积。
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
WITH CUBE
ORDER BY GROUPING__ID
等同于
select os,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os,version
union all
select os,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os
union all
select null,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY version
union all
select null,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
四:with rollup
测试一:
select os
,version
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
with rollup
ORDER BY GROUPING__ID
测试结果.png
测试二:
select version
,os
,count(DISTINCT page_id) as pv
,GROUPING__ID
from dw_tmp.pvtable
GROUP BY version,os
with rollup
ORDER BY GROUPING__ID
测试结果.png
结论:从两次测试结果来看,with rollup跟with cube最大的区别是,with rollup不是笛卡尔积,而且遵循某种类似于左连接的思想,即不允许左侧为null,右侧非null的情况出现。