Hive小技巧

2020-08-12 本文已影响0人 xxlee

转载请在文章起始处注明出处，谢谢。

1、取字符串某个位置的值

比如，我要取‘abcd’的b，有两种取法：

select split('abcd','')[1]
select substr('abcd',2,1)

2、我要取a表的用户，b表的用户，还有a、b重复的用户

这里可以用sum() over(partition by) 来取，下面举个例子：
假如a表和b表结构一样，都有dt,pin,type。

取数代码示意如下：

set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
set hive.merge.size.per.task = 512000000;
set hive.merge.smallfiles.avgsize = 512000000;
set mapred.max.split.size = 512000000;
set mapred.min.split.size.per.node = 512000000;
set mapred.min.split.size.per.rack = 512000000;
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.hadoop.supports.splittable.combineinputformat = true;
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nostrict;
set hive.exec.max.dynamic.partitions.pernode = 1000;
set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 16;
set hive.new.job.grouping.set.cardinality = 5000;
select
    t2.dt,
    count(distinct case
        when t2.t = 1 then t2.pin
        end) a的人数,
    count(distinct case
        when t2.t = 2 then t2.pin
        end) b的人数,
    count(distinct case
        when t2.t >= 3 then t2.pin
        end) ab重复人数
from 
    (
        select
            t1.dt,
            t1.pin,
            t1.type,
            sum(t1.type) over(partition by t1.pin) t
        from 
            (
                表a a 
                union all
                表b b 
            )
            t1 
        group by 
            t1.dt,
            t1.pin,
            t1.type
    )
    t2
group by 
    t2.dt

注意：t2子查询里的t1.type是一定要写的，不然会报错：

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.

Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 4:16 Invalid column reference 'type'

Hive小技巧

转载请在文章起始处注明出处，谢谢。

1、取字符串某个位置的值

2、我要取a表的用户，b表的用户，还有a、b重复的用户

猜你喜欢

热点阅读