Spark Skew 的一种姿势

2020-10-31  本文已影响0人  zhangliyun

fact_table_optimized = select concat(key,'_', floor(rand(123456)*19)) as salted_key, val from (select * from ${model_tmp_db}.ip_snapshot_filter where dt='')t

dimension_optimized= select dim_key,val, explode(array(0,1,23,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from (select cust_id, from ${model_tmp_db}.ip_increment_filter

where dt=xxx) t

fact_table 生成随机后缀(floor(rand(123456)*19)))key_随机后缀

dimension  table  单独生成随机后缀。扩大N 倍

将Join key 变成加盐后的Key 这样skew 的key 也不skew 了.

Problem

select t.asset_value, t.cnt from (select asset_value,count(asset_value) as cnt from bsl2019prod_tmp.ip_snapshot_filter where dt='2020-10-28' group by asset_value having cnt>1000)t order by t.cnt desc limit 500;

最大的t.cnt 是最小的30倍· skew 的一塌糊涂

t.asset_value t.cnt

xxxxx 31522

xxxx  1209

Code

fact_table_optimized = select concat(t.asset_value,'_', floor(rand(123456)*19)) as salted_key, t.cust_id,t.last_date from (select * from bsl2019prod_tmp.ip_snapshot_filter where dt='2020-10-28')t;

dimension_optimized= select t.cust_id,t.asset_value as key,t.last_date, explode(array(0,1,23,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19)) as salted_key from (select cust_id, asset_value,last_date  from bsl2019prod_tmp.ip_increment_filter  where dt='2020-10-28') t;

joinret =select

a.cust_id as cust_id_origin,

b.cust_id as cust_id_target,

split(a.salted_key,'_')[0] as asset_value,

a.last_date AS last_date_origin,

                b.last_date AS last_date_target,

                a.last_date AS link_crtd_ts  from fact_table_optimized a, dimension_optimized b where a.salted_key=concat(b.key,'_',b.salted_key);

insert joinret into table xxx.joinret location '/tmp/joinret';

Let's see whether skew or not , 一点也不skew , :) x  N

y
上一篇下一篇

猜你喜欢

热点阅读