hive大数据 爬虫Python AI Sql大数据

hive之with cube、with rollup、group

2020-01-04  本文已影响0人  愤怒的谜团

一:多维分组聚合背景

企业针对有些指标,希望能够支撑上钻和下钻来进行查看,这样即可以满足精细化分析,又可以满足指标概览,当然现在的一些可视化工具,已经支持了上钻和下钻了,只要导入最细粒度的数据即可。

二:grouping sets介绍

直接上例子,测试数据源如下:

select *
from dw_tmp.pvtable;
测试数据源.png

三个字段分别是系统类型,版本号,以及浏览页面类型,等会通过分组聚合,来查看不同的os,version,达到多少pv(pv就是去重的页面数)。

测试一:

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version)
ORDER BY GROUPING__ID 

GROUPING__ID其实就是对分组进行一个编号


测试结果.png

测试二:

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version,(os,version))
ORDER BY GROUPING__ID
测试结果.png

结论

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
GROUPING SETS (os,version,(os,version))
ORDER BY GROUPING__ID

等同于
select os,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os,version
union all
select os,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os
union all
select null,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY version

三:with cube介绍

测试一:

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
WITH CUBE 
ORDER BY GROUPING__ID
测试结果.png

从结论当中可以看出,相对于os,version,做了一个笛卡尔积。

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
WITH CUBE 
ORDER BY GROUPING__ID
等同于
select os,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os,version
union all
select os,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY os
union all
select null,version,count(DISTINCT page_id) as pv
from dw_tmp.pvtable
GROUP BY version
union all
select null,null,count(DISTINCT page_id) as pv
from dw_tmp.pvtable

四:with rollup

测试一:

select os
       ,version
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY os,version
with rollup
ORDER BY GROUPING__ID
测试结果.png

测试二:

select version
       ,os
       ,count(DISTINCT page_id) as pv
       ,GROUPING__ID
from dw_tmp.pvtable
GROUP BY version,os
with rollup
ORDER BY GROUPING__ID
测试结果.png

结论:从两次测试结果来看,with rollup跟with cube最大的区别是,with rollup不是笛卡尔积,而且遵循某种类似于左连接的思想,即不允许左侧为null,右侧非null的情况出现。

上一篇 下一篇

猜你喜欢

热点阅读