Hive的三种去重方式对比
2019-12-06 本文已影响0人
Mr_WangZz
数据量级:从 415816 到 221436
第1种 distinct
SELECT COUNT(DISTINCT uid)
FROM tableA
WHERE dt = '20191205'
AND event = 'start'
;
--29.379s
第2种 group by
SELECT COUNT(*)
FROM (
SELECT uid
,collect_set(model)[0]
,collect_set(os)[0]
FROM tableA
WHERE dt = '20191205'
AND event = 'start'
GROUP BY uid
) a
;
--25.239s
第3种 row_number()
SELECT COUNT(*)
FROM (
SELECT uid
,row_number() OVER(PARTITION BY uid ORDER BY uid) rn
FROM tableA
WHERE dt = '20191205'
AND event = 'start'
) c
WHERE rn = 1
;
-- 25.162s
-- 25.154s
加不加 DESC 除了结果顺序不同,对查询效率的影响
SELECT COUNT(*)
FROM (
SELECT uid
,row_number() OVER(PARTITION BY uid ORDER BY uid DESC) rn
FROM tableA
WHERE dt = '20191205'
AND event = 'start'
) c
WHERE rn = 1
;
-- 24.436s
-- 25.435s
-- 24.114s
总结:
- distinct使用起来虽然简单,但是效率不及 group by 和 row_number()。
- 当数据量非常大时,尤其达到百万级及以上,应优先使用后两种去重方式。
参考: