关于HQL使用count()报错的坑

2020-04-08 本文已影响0人晨冉1688

报错信息：AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

报错代码如下

select 
    grad_id,
    count(distinct user_id) as total_users,
    count(distinct case when status = 1 then user_id else null end) as distinct_total_users
from 
    student_center_info
group by 
    grad_id;

运行报错信息如下：
AnalysisException: all DISTINCT aggregate functions need to have the
same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

出现这个个报错的原因是因为count（）按某个列统计时，只能使用一次，如上面的代码，使用两次count（）对user_id统计时，就会报错，解决办法如下：

select 
    a.grad_id,
    a.total_user,
    b.total_users
from
    (select 
        grad_id,
        count(distinct user_id) as total_users
    from 
        student_center_info
    group by 
        grad_id) as a
 left join
     (select 
            grad_id,
            count(distinct case when status = 1 then user_id else null end) as total_users
        from 
            student_center_info
    group by 
        grad_id) as b
 on 
     a.grad_id = b.grad_id

如果用impala的话，可以用ndv()函数来代替，但是ndv()这个去重不是准确的，如果对数据要求精度比较高的，不建议使用ndv()。而且hive对它的支持也不是很友好。

关于HQL使用count()报错的坑

猜你喜欢

热点阅读