
ClickHouse Tech Sharing, Round 2 (English Lecture Notes)

2020-06-10  LittleMagic

Preface

Below are the lecture notes from today's ClickHouse tech-sharing session for my colleagues. Making slides was too much trouble, so I simply wrote everything in Markdown and rendered it with the Markdown Preview Plus extension for Chrome, which works very well.

The full text follows — all of it concentrated essence. It covers everything from my two earlier articles, "物化视图简介与ClickHouse中的应用示例" (Introduction to Materialized Views and Application Examples in ClickHouse) and "ClickHouse Better Practices", plus some new material, such as:

  • The combinator suffixes of ClickHouse aggregate functions
  • Read amplification in distributed JOIN/IN and the GLOBAL keyword
  • Workarounds for the lack of windowed analytical functions in ClickHouse SQL
    • Examples: ranking lists and year-over-year / month-over-month calculations
    • Plus the usage of ARRAY JOIN
  • The LowCardinality data type
  • The MergeTree index structure

There was no time to cover arrays and higher-order functions; I'll get to them later.

Let's begin~


Advanced Usage & Better Practice of ClickHouse

Part I - Materialized View

Intro

Engines

Creation

[Image removed here as it contained business data]

CREATE MATERIALIZED VIEW IF NOT EXISTS dw.merchandise_point_pvuv_agg
ON CLUSTER sht_ck_cluster_1
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dw/merchandise_point_pvuv_agg','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,site_id,point_index,merchandise_id)
SETTINGS index_granularity = 8192
[POPULATE] AS SELECT
  ts_date,
  site_id,
  site_name,
  point_index,
  merchandise_id,
  merchandise_abbr,
  sumState(1) AS pv,
  uniqState(user_id) AS uv,
  maxState(ts_date_time) AS last_time,
  minState(ts_date_time) AS first_time
FROM ods.analytics_access_log
WHERE event_type = 'shtOpenGoodsDetail'
AND active_id = 0
AND site_id >= 0 AND merchandise_id >= 0 AND point_index >= 0
GROUP BY ts_date,site_id,site_name,point_index,merchandise_id,merchandise_abbr;

Under the Hood

Distributed MV

CREATE TABLE IF NOT EXISTS dw.merchandise_point_pvuv_agg_all
ON CLUSTER sht_ck_cluster_1
AS dw.merchandise_point_pvuv_agg
ENGINE = Distributed(sht_ck_cluster_1,dw,merchandise_point_pvuv_agg,rand());

Query

SELECT
  merchandise_id,
  merchandise_abbr,
  sumMerge(pv) AS pv,
  uniqMerge(uv) AS uv,
  maxMerge(last_time) AS last_time,
  minMerge(first_time) AS first_time,
  arrayStringConcat(groupUniqArray(site_name),'|') AS site_names
FROM dw.merchandise_point_pvuv_agg_all
WHERE ts_date = today()
AND site_id IN (10030,10031,10036,10037,10038)
AND point_index = 2
GROUP BY merchandise_id,merchandise_abbr
ORDER BY pv DESC LIMIT 10;

[Image removed here as it contained business data]

Part II - Aggregate Function Combinators

-State

[Image removed here as it contained business data]

-Merge

-If

SELECT
  sumIf(quantity, merchandise_abbr LIKE '%苹果%') AS apple_quantity,
  countIf(toStartOfHour(ts_date_time) = '2020-06-09 20:00:00') AS eight_oclock_sub_order_num,
  maxIf(quantity * price, coupon_money > 0) AS couponed_max_gmv
FROM ods.ms_order_done
WHERE ts_date = toDate('2020-06-09');
┌─apple_quantity─┬─eight_oclock_sub_order_num─┬─couponed_max_gmv─┐
│           1365 │                      19979 │           318000 │
└────────────────┴────────────────────────────┴──────────────────┘
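Since the screenshots are gone, a quick semantic sketch may help: the -If combinator makes each aggregate consider only the rows matching its own predicate, all within a single pass over the table. A minimal Python sketch of that behavior, with made-up rows standing in for ods.ms_order_done:

```python
# Sketch of -If combinator semantics: each aggregate sees only the rows
# for which its own predicate holds, within one pass over the data.
# These rows are made-up stand-ins for ods.ms_order_done.
rows = [
    {"quantity": 2, "abbr": "red apple", "hour": 20, "price": 50, "coupon": 10},
    {"quantity": 1, "abbr": "banana",    "hour": 8,  "price": 30, "coupon": 0},
    {"quantity": 3, "abbr": "apple pie", "hour": 20, "price": 40, "coupon": 5},
]

# sumIf(quantity, abbr LIKE '%apple%')
apple_quantity = sum(r["quantity"] for r in rows if "apple" in r["abbr"])
# countIf(hour = 20)
eight_oclock_cnt = sum(1 for r in rows if r["hour"] == 20)
# maxIf(quantity * price, coupon > 0)
couponed_max_gmv = max(
    (r["quantity"] * r["price"] for r in rows if r["coupon"] > 0), default=0
)

print(apple_quantity, eight_oclock_cnt, couponed_max_gmv)  # 5 2 120
```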

-Array

SELECT avgArray([33, 44, 99, 110, 220]);
┌─avgArray([33, 44, 99, 110, 220])─┐
│                            101.2 │
└──────────────────────────────────┘

-ForEach

SELECT sumForEach(arr)
FROM (
  SELECT 1 AS id, [3, 6, 12] AS arr
  UNION ALL
  SELECT 2 AS id, [7, 14, 7, 5] AS arr
);
┌─sumForEach(arr)─┐
│ [10,20,19,5]    │
└─────────────────┘
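sumForEach aggregates arrays position by position across rows; rows whose arrays are shorter behave as if zero-padded to the longest length, which is why the result above has four elements. A small Python sketch of the semantics:

```python
# Sketch of sumForEach semantics: element-wise sum across rows of arrays,
# with shorter arrays treated as zero-padded to the longest length.
from itertools import zip_longest

def sum_for_each(arrays):
    return [sum(col) for col in zip_longest(*arrays, fillvalue=0)]

print(sum_for_each([[3, 6, 12], [7, 14, 7, 5]]))  # [10, 20, 19, 5]
```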

Part III - Using JOIN Correctly

Only two-table equi-joins are considered here.

Use IN When Possible

SELECT sec_category_name,count()
FROM ods.analytics_access_log
WHERE ts_date = today() - 1
AND site_name like '长沙%'
AND merchandise_id IN (
  SELECT merchandise_id
  FROM ods.ms_order_done
  WHERE price > 10000
)
GROUP BY sec_category_name;

Put the Small Table on the Right

No Predicate Pushdown

SELECT l.col1,r.col2 FROM left_table l
INNER JOIN right_table r ON l.key = r.key
WHERE l.col3 > 123 AND r.col4 = '...';

-- Rewrite manually: ClickHouse does not push the outer predicates down
-- into the join sides, so filter inside each sub-select instead.
SELECT l.col1,r.col2 FROM (
  SELECT col1,key FROM left_table
  WHERE col3 > 123
) l INNER JOIN (
  SELECT col2,key FROM right_table
  WHERE col4 = '...'
) r ON l.key = r.key;

Distributed JOIN/IN with GLOBAL

SELECT
  t1.merchandise_id,t1.merchandise_abbr,t1.pv,t1.uv,
  t2.total_quantity,t2.total_gmv
FROM (
  SELECT
    merchandise_id,merchandise_abbr,
    sumMerge(pv) AS pv,
    uniqMerge(uv) AS uv
  FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  AND point_index = 1
  GROUP BY merchandise_id,merchandise_abbr
) t1
GLOBAL LEFT JOIN (  -- GLOBAL
  SELECT
    merchandise_id,
    sumMerge(total_quantity) AS total_quantity,
    sumMerge(total_gmv) AS total_gmv
  FROM dw.merchandise_gmv_agg_all  -- Distributed
  WHERE ts_date = today()
  AND site_id IN (10030,10031,10036,10037,10038)
  GROUP BY merchandise_id
) t2
ON t1.merchandise_id = t2.merchandise_id;

ARRAY JOIN

Part IV - Alternative to Windowed Analytical Functions

Drawback

row_number() OVER (PARTITION BY col1 ORDER BY col2)
rank() OVER (PARTITION BY col1 ORDER BY col2)
dense_rank() OVER (PARTITION BY col1 ORDER BY col2)
lag(col,num) OVER (PARTITION BY col1 ORDER BY col2)
lead(col,num) OVER (PARTITION BY col1 ORDER BY col2)

arrayEnumerate*()

SELECT arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,5,6,7,8]                                │
└──────────────────────────────────────────────────┘
SELECT arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,2,3,4,2,1,2,5]                                     │
└───────────────────────────────────────────────────────┘
SELECT arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55]);
┌─arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55])─┐
│ [1,1,1,1,2,2,3,1]                                    │
└──────────────────────────────────────────────────────┘
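These three functions map directly onto row_number()/dense_rank()-style numbering once a partition has been packed into an array. A pure-Python sketch of their semantics (not the ClickHouse implementation):

```python
# Pure-Python sketches of the arrayEnumerate* semantics shown above.

def array_enumerate(arr):
    # 1-based positions: a row_number() analogue
    return list(range(1, len(arr) + 1))

def array_enumerate_dense(arr):
    # equal values get the same index, assigned in order of first
    # appearance: a dense_rank() analogue for a sorted array
    seen = {}
    return [seen.setdefault(x, len(seen) + 1) for x in arr]

def array_enumerate_uniq(arr):
    # k-th occurrence of each value seen so far
    counts = {}
    out = []
    for x in arr:
        counts[x] = counts.get(x, 0) + 1
        out.append(counts[x])
    return out

data = [99, 88, 77, 66, 88, 99, 88, 55]
print(array_enumerate(data))        # [1, 2, 3, 4, 5, 6, 7, 8]
print(array_enumerate_dense(data))  # [1, 2, 3, 4, 2, 1, 2, 5]
print(array_enumerate_uniq(data))   # [1, 1, 1, 1, 2, 2, 3, 1]
```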

Ranking List

SELECT main_site_id,merchandise_id,gmv,row_number,dense_rank
FROM (
  SELECT main_site_id,
    groupArray(merchandise_id) AS merchandise_arr,
    groupArray(gmv) AS gmv_arr,
    arrayEnumerate(gmv_arr) AS gmv_row_number_arr,
    arrayEnumerateDense(gmv_arr) AS gmv_dense_rank_arr
  FROM (
    SELECT main_site_id,
      merchandise_id,
      sum(price * quantity) AS gmv
    FROM ods.ms_order_done
    WHERE ts_date = toDate('2020-06-01')
    GROUP BY main_site_id,merchandise_id
    ORDER BY gmv DESC
  )
  GROUP BY main_site_id
) ARRAY JOIN
  merchandise_arr AS merchandise_id,
  gmv_arr AS gmv,
  gmv_row_number_arr AS row_number,
  gmv_dense_rank_arr AS dense_rank
ORDER BY main_site_id ASC,row_number ASC;
┌─main_site_id─┬─merchandise_id─┬────gmv─┬─row_number─┬─dense_rank─┐
│          162 │         379263 │ 136740 │          1 │          1 │
│          162 │         360845 │  63600 │          2 │          2 │
│          162 │         400103 │  54110 │          3 │          3 │
│          162 │         404763 │  52440 │          4 │          4 │
│          162 │          93214 │  46230 │          5 │          5 │
│          162 │         304336 │  45770 │          6 │          6 │
│          162 │         392607 │  45540 │          7 │          7 │
│          162 │         182121 │  45088 │          8 │          8 │
│          162 │         383729 │  44550 │          9 │          9 │
│          162 │         404698 │  43750 │         10 │         10 │
│          162 │         102725 │  33284 │         11 │         11 │
│          162 │         404161 │  29700 │         12 │         12 │
│          162 │         391821 │  28160 │         13 │         13 │
│          162 │         339499 │  26069 │         14 │         14 │
│          162 │         404548 │  25600 │         15 │         15 │
│          162 │         167303 │  25520 │         16 │         16 │
│          162 │         209754 │  23940 │         17 │         17 │
│          162 │         317795 │  22950 │         18 │         18 │
│          162 │         404158 │  21780 │         19 │         19 │
│          162 │         326096 │  21540 │         20 │         20 │
│          162 │         404493 │  20950 │         21 │         21 │
│          162 │         389508 │  20790 │         22 │         22 │
│          162 │         301524 │  19900 │         23 │         23 │
│          162 │         404506 │  19900 │         24 │         23 │
│          162 │         404160 │  18130 │         25 │         24 │
........................

neighbor()

neighbor(column,offset[,default_value])
-- offset > 0 acts like lead()
-- offset < 0 acts like lag()
-- default_value is used when the offset is out of bounds
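Note that neighbor() operates within a single data block, so results can be surprising when the input spans multiple blocks. A Python sketch of the intended per-column semantics, applied to a small in-memory list:

```python
# Sketch of neighbor(column, offset[, default]) semantics:
# positive offset looks forward (lead), negative looks back (lag),
# and out-of-range positions yield the default value.
def neighbor(column, offset, default=0):
    n = len(column)
    return [
        column[i + offset] if 0 <= i + offset < n else default
        for i in range(n)
    ]

vals = [10, 20, 30, 40]
print(neighbor(vals, 1))       # lead: [20, 30, 40, 0]
print(neighbor(vals, -1, -1))  # lag:  [-1, 10, 20, 30]
```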

Baseline (YoY/MoM)

WITH toDate('2019-01-01') AS start_date
SELECT
  toStartOfMonth(start_date + number * 32) AS dt,
  rand(number) AS val,
  neighbor(val,-12) AS prev_year_val,
  neighbor(val,-1) AS prev_month_val,
  if (prev_year_val = 0,-32768,round((val - prev_year_val) / prev_year_val, 4) * 100) AS yoy_percent,
  if (prev_month_val = 0,-32768,round((val - prev_month_val) / prev_month_val, 4) * 100) AS mom_percent
FROM numbers(18);
┌─────────dt─┬────────val─┬─prev_year_val─┬─prev_month_val─┬─yoy_percent─┬─────────mom_percent─┐
│ 2019-01-01 │  344308231 │             0 │              0 │      -32768 │              -32768 │
│ 2019-02-01 │ 2125630486 │             0 │      344308231 │      -32768 │              517.36 │
│ 2019-03-01 │  799858939 │             0 │     2125630486 │      -32768 │ -62.370000000000005 │
│ 2019-04-01 │ 1899653667 │             0 │      799858939 │      -32768 │               137.5 │
│ 2019-05-01 │ 3073278541 │             0 │     1899653667 │      -32768 │               61.78 │
│ 2019-06-01 │  882031881 │             0 │     3073278541 │      -32768 │               -71.3 │
│ 2019-07-01 │ 3888311917 │             0 │      882031881 │      -32768 │              340.84 │
│ 2019-08-01 │ 3791703268 │             0 │     3888311917 │      -32768 │               -2.48 │
│ 2019-09-01 │ 3472517572 │             0 │     3791703268 │      -32768 │               -8.42 │
│ 2019-10-01 │ 1010491656 │             0 │     3472517572 │      -32768 │  -70.89999999999999 │
│ 2019-11-01 │ 2841992923 │             0 │     1010491656 │      -32768 │              181.25 │
│ 2019-12-01 │ 1783039500 │             0 │     2841992923 │      -32768 │              -37.26 │
│ 2020-01-01 │ 2724427263 │     344308231 │     1783039500 │      691.28 │  52.800000000000004 │
│ 2020-02-01 │ 2472851287 │    2125630486 │     2724427263 │       16.33 │  -9.229999999999999 │
│ 2020-03-01 │ 1699617807 │     799858939 │     2472851287 │      112.49 │ -31.269999999999996 │
│ 2020-04-01 │  873033696 │    1899653667 │     1699617807 │      -54.04 │              -48.63 │
│ 2020-05-01 │ 3524933462 │    3073278541 │      873033696 │        14.7 │              303.76 │
│ 2020-06-01 │   85437434 │     882031881 │     3524933462 │      -90.31 │              -97.58 │
└────────────┴────────────┴───────────────┴────────────────┴─────────────┴─────────────────────┘

Part V - More on Data Types

Date/DateTime

Nullable

merchandise_id Nullable(Int64)  -- avoid: NULLs are stored in a separate mask, adding overhead
merchandise_id Int64 DEFAULT -1 -- prefer: a sentinel default on a plain column

LowCardinality

-- event_type in access logs is quite suitable
event_type LowCardinality(String)
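LowCardinality is essentially dictionary encoding: distinct values go into a dictionary and each row stores only a small integer key, which shrinks storage and speeds up comparisons for columns with few distinct values. A toy Python sketch of the idea (not ClickHouse's actual on-disk format):

```python
# Toy sketch of dictionary encoding, the idea behind LowCardinality(String):
# store each distinct string once, and per-row only a small integer key.
def dict_encode(values):
    dictionary, keys, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        keys.append(index[v])
    return dictionary, keys

events = ["click", "view", "click", "click", "view", "order"]
dictionary, keys = dict_encode(events)
print(dictionary)  # ['click', 'view', 'order']
print(keys)        # [0, 1, 0, 0, 1, 2]
```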

Arrays & Higher-order Functions [TBD]

Part VI - MergeTree Indices & Table Settings

Index Structure

Index Settings

Table TTL

PARTITION BY ...
ORDER BY (...)
TTL ts_date + INTERVAL 6 MONTH

ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
MODIFY TTL ts_date + INTERVAL 6 MONTH;

SETTINGS merge_with_ttl_timeout = 86400  -- 1 day

ZooKeeper

autopurge.purgeInterval = 1
autopurge.snapRetainCount = 5
SETTINGS use_minimalistic_part_header_in_zookeeper = 1

Review CREATE TABLE statement

CREATE TABLE IF NOT EXISTS ods.analytics_access_log
ON CLUSTER sht_ck_cluster_1 (
  ts_date Date,
  ts_date_time DateTime,
  user_id Int64,
  event_type String,
  column_type String,
  groupon_id Int64,
  site_id Int64,
  site_name String,
  -- ...
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ods/analytics_access_log','{replica}')
PARTITION BY ts_date
ORDER BY (ts_date,toStartOfHour(ts_date_time),main_site_id,site_id,event_type,column_type)
TTL ts_date + INTERVAL 6 MONTH
SETTINGS index_granularity = 8192,
use_minimalistic_part_header_in_zookeeper = 1,
merge_with_ttl_timeout = 86400;

The End

Good night, everyone!
