ClickHouse LowCardinality数据类型

2020-04-29  本文已影响0人  白奕新

一、结论及原理

1、what's this?

(1)LowCardinality(Type),是一种数据类型上的字典编码封装。LowCardinality(String),又成StringWithDictionary,是最常见的LowCardinality类型。
(2)使用LowCardinality编码的字段,在底层数据存储上做了修改。

image.png
(3)适用场景:适用于原始string字段冗长且去重后的计数值<1000w(It works the best when source strings are lengthy and the number of distinct values is not that big. There is no hard limit in ClickHouse, the results are usually good if number of distinct values is below 10M)。
(4)优势:降低磁盘存储。对于单个字段的group by有单独特殊的设计,所以会比较快,多个字段以上就不一定了。(https://github.com/ClickHouse/ClickHouse/issues/4796

2、how to use?

create table datasets.base( 
FlightDate Date,
FlightNum LowCardinality(String),
OriginAirportID LowCardinality(String)) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);

二、Performance Test

0、结论

1、source data

使用官方数据给的美国航班表数据作为数据源。

2、test for writing

通过INSERT INTO SELECT的方式写入新表,评判写入性能。

insert into TABLE select FlightDate,FlightNum,OriginAirportID,DestAirportID,Distance,ArrDelay from datasets.ontime where Year > 2008;
(1)基础表
create table datasets.base( FlightDate Date,
FlightNum String,
OriginAirportID String,
DestAirportID String,
Distance Int32,
ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);

cost time for writing

序号 耗时 速度
1 48.069 sec 1.28 million rows/s., 41.75 MB/s
2 48.398 sec 1.27 million rows/s., 41.46 MB/s.
3 52.804 sec 1.16 million rows/s., 38.00 MB/s.

avg cost time : 49.757 sec

(2)LowCardinality表
create table datasets.lowCardinality( FlightDate Date,
FlightNum LowCardinality(String),
OriginAirportID LowCardinality(String),
DestAirportID LowCardinality(String),
Distance Int32,
ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);

cost time for writing

序号 耗时 速度
1 74.402 sec 826.37 thousand rows/s., 26.97 MB/s.
2 67.489 sec 911.01 thousand rows/s., 29.73 MB/s.
3 72.909 sec 843.29 thousand rows/s., 27.52 MB/s.

avg cost time : 71.6 sec

3、test for querying

SELECT SUM(Distance) as all_distince, SUM(ArrDelay), FlightNum
FROM TABLE
WHERE (OriginAirportID LIKE '15%'
        OR OriginAirportID LIKE '13%')
    AND DestAirportID LIKE '13%'
GROUP BY FlightNum, OriginAirportID
ORDER BY all_distince DESC
LIMIT 10;
(1)基础表
序号 耗时 速度
1 1.368 sec 44.17 million rows/s., 469.02 MB/s.
2 1.131 sec 53.43 million rows/s., 567.38 MB/s.
3 1.208 sec. 50.00 million rows/s., 530.94 MB/s.

avg cost time:1.236 sec

(2)LowCardinality表
序号 耗时 速度
1 0.822 sec 73.53 million rows/s., 600.51 MB/s.
2 0.754 sec 911.01 thousand rows/s., 29.73 MB/s.
3 0.767 sec. 78.72 million rows/s., 642.94 MB/s.

avg cost time:0.781 sec

3、test for disk usage

SELECT column, ANY(type), SUM(column_data_compressed_bytes) AS compressed
    , SUM(column_data_uncompressed_bytes) AS uncompressed
FROM system.parts_columns
WHERE table IN ('lowCardinality', 'base')
    AND active
    AND (column LIKE '%AirportID'
        OR column = 'FlightNum')
GROUP BY column, table
ORDER BY column ASC

结果:


image.png

三、reference

(1)https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/meetup19/string_optimization.pdf
(2)https://www.altinity.com/blog/2019/3/27/low-cardinality

后记:
最近准备在线上慢慢使用上,有什么效果到时再来同步更新

上一篇 下一篇

猜你喜欢

热点阅读