HIVE去重问题
2017-07-19 本文已影响0人
小甜瓜Melon
代码1:查询重复记录
SELECT * FROM tb a WHERE (a.account) IN
(SELECT account FROM tb GROUP BY account HAVING COUNT(*) > 1) LIMIT 2;
代码2:查询记录数大于2的记录数目
SELECT COUNT(mobile) AS duplicatetotal,mobile FROM db.tb
GROUP BY mobile HAVING COUNT(*)>1 ORDER BY duplicatetotal DESC;
代码3:去重操作
DROP TABLE IF EXISTS db.tb_new;
CREATE TABLE IF NOT EXISTS db.tb_new AS
SELECT * FROM
(SELECT *, row_number() OVER (PARTITION BY mobile ORDER BY created_at DESC) AS num FROM
(SELECT * FROM db.tb_old WHERE mobile REGEXP "^[1][34578][0-9]{9}$") tmp1 ) t
WHERE t.num=1;
说明:mobile为除重依据, created_at 为排序依据; DESC留下最近的一条记录。
代码4:查询电话数目为3的记录
SELECT account, COUNT(account) AS duplicatetotal FROM db.tb
GROUP BY account HAVING COUNT(*)=3 LIMIT 10;
代码5:
hiveContext.sql("SELECT COUNT(*) AS count_sum, user_profile_id FROM stage_data.temp_table GROUP BY user_profile_id HAVING COUNT(*) > 1").show()
代码6:
hiveContext.sql("SELECT COUNT(*) FROM stage_data.temp_table WHERE LENGTH(TRIM(user_profile_id)) = 0 ").show()
去除user_profile_id两边的空格之后,计算长度。若长度为0,则表明该字段为空。
后记:
- left join之后,发现左表的记录数多出来了,一般是由于左表partition之后,NULL行也被当做单独一行,导致左表记录数增加。查询该记录用
SELECT * FROM temp_table WHERE user_profile_id IS NULL;
去除该记录的方法是用原来的表筛选用 IS NOT NULL;
- join表之前一般不要进行去重操作,join完之后再进行去重操作,效率会提高。
- 关联表之前需要检查左右表关联字段是否有重复字段,若存在重复字段则可能产生一对多问题。
完。