Hive的三种去重方式对比

2019-12-06  本文已影响0人  Mr_WangZz

数据量级:从 415816 到 221436

第1种 distinct

SELECT  COUNT(DISTINCT uid)
FROM    tableA
WHERE   dt = '20191205'
AND     event = 'start'
;
--29.379s

第2种 group by

SELECT  COUNT(*)
FROM    (
            SELECT  uid
                    ,collect_set(model)[0]
                    ,collect_set(os)[0]
            FROM    tableA
            WHERE   dt = '20191205'
            AND     event = 'start'
            GROUP BY uid
        ) a
;
--25.239s

第3种 row_number()

SELECT  COUNT(*)
FROM    (
            SELECT  uid
                    ,row_number() OVER(PARTITION BY uid ORDER BY uid) rn
            FROM    tableA
            WHERE   dt = '20191205'
            AND     event = 'start'
        ) c
WHERE   rn = 1
;
-- 25.162s
-- 25.154s

加不加 DESC 除了结果顺序不同,对查询效率的影响

SELECT  COUNT(*)
FROM    (
            SELECT  uid
                    ,row_number() OVER(PARTITION BY uid ORDER BY uid DESC) rn
            FROM    tableA
            WHERE   dt = '20191205'
            AND     event = 'start'
        ) c
WHERE   rn = 1
;
-- 24.436s
-- 25.435s
-- 24.114s

总结:

参考:

  1. 当数据量达到亿级时,group by和distinct区别以及性能比较
  2. Hive 三种去重方式
上一篇下一篇

猜你喜欢

热点阅读