大数据开发

关于HQL使用count()报错的坑

2020-04-08  本文已影响0人  晨冉1688

报错信息:AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

报错代码如下

select 
    grad_id,
    count(distinct user_id) as total_users,
    count(distinct case when status = 1 then user_id else null end) as distinct_total_users
from 
    student_center_info
group by 
    grad_id;

运行报错信息如下:
AnalysisException: all DISTINCT aggregate functions need to have the
same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

出现这个个报错的原因是因为count()按某个列统计时,只能使用一次,如上面的代码,使用两次count()对user_id统计时,就会报错,解决办法如下:

select 
    a.grad_id,
    a.total_user,
    b.total_users
from
    (select 
        grad_id,
        count(distinct user_id) as total_users
    from 
        student_center_info
    group by 
        grad_id) as a
 left join
     (select 
            grad_id,
            count(distinct case when status = 1 then user_id else null end) as total_users
        from 
            student_center_info
    group by 
        grad_id) as b
 on 
     a.grad_id = b.grad_id 

如果用impala的话,可以用ndv()函数来代替,但是ndv()这个去重不是准确的,如果对数据要求精度比较高的,不建议使用ndv()。而且hive对它的支持也不是很友好。

上一篇 下一篇

猜你喜欢

热点阅读