ClickHouse

2022-04-18 想成为大师的学徒小纪

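I. Introduction

ClickHouse was originally developed for Yandex.Metrica, the second-largest web analytics platform in the world, and has served as that system's core component ever since. To date the system holds more than 13 trillion records in ClickHouse and processes more than 20 billion events every day. It allows reports to be generated on the fly directly from raw data.

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP); that is, data is stored by column. It is a database management system rather than a single database, because it allows tables and databases to be created, data to be loaded, and queries to be run at runtime, without reconfiguring or restarting the server.

Common column-oriented databases include: Vertica, Paraccel (Actian Matrix, Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise, Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, and kdb+.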

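II. ClickHouse Features

Advantages

Many column-oriented databases (such as SAP HANA and Google PowerDrill) can only work in RAM, which demands a larger hardware budget than is actually necessary. ClickHouse is designed to work on regular hard drives, which means a lower cost per GB of storage, but it also makes good use of SSDs and additional RAM when they are available.

ClickHouse also supports data compression, parallel processing on multiple cores, and distributed processing on multiple servers. Data can reside on different shards, each shard consists of a group of replicas used for fault tolerance, and a query is processed on all shards in parallel. ClickHouse supports a declarative query language based on SQL that in many cases matches the ANSI SQL standard. Supported constructs include GROUP BY, ORDER BY, subqueries in FROM, JOIN, and IN clauses, and non-correlated subqueries. Correlated (dependent) subqueries and window functions are not supported at the time of writing.

ClickHouse lets you define a primary key for a table and sorts the data by it, which helps it locate specific values or value ranges within tens of milliseconds. So that range lookups on the primary key stay fast, data in a MergeTree table is always stored in sorted order and is added incrementally. As a result, data can be written to a table continuously and efficiently, and no locks are taken while it is being written.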
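
Why sorting by primary key makes range lookups cheap can be shown with a small model. The sketch below is a simplified illustration, not ClickHouse's actual implementation: it keeps one index entry per block of rows and uses binary search to find the blocks that may contain a key range. The function names and the block size of 10 are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys, granularity):
    """Keep only the first key of every block of `granularity` rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def blocks_for_range(marks, low, high):
    """Return the indices of the blocks that may hold keys in [low, high]."""
    # The block just before the first mark >= low may still contain `low`.
    first = max(bisect_left(marks, low) - 1, 0)
    # Blocks whose first key is greater than `high` cannot match.
    last = bisect_right(marks, high) - 1
    return list(range(first, last + 1))


keys = list(range(100))                 # rows already sorted by primary key
marks = build_sparse_index(keys, 10)    # one entry per block of 10 rows
print(blocks_for_range(marks, 25, 37))  # only two of the ten blocks are read
```

In this model a range lookup touches only the blocks that overlap the range, while fetching a single row still reads a whole block.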

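ClickHouse offers several ways to speed up queries when some loss of accuracy is acceptable:

Aggregate functions for approximate calculation of the number of distinct values, medians, and quantiles.
Running a query on a part (sample) of the data to get an approximate result. In this case, proportionally less data is read from disk.
Running an aggregation for a limited number of randomly chosen keys instead of all keys. When the keys follow certain distributions in the data, this gives a reasonably accurate result while using fewer resources.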
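
The second method, querying a sample, can be sketched as follows. This is a toy model under stated assumptions, not ClickHouse's sampling code: rows are picked deterministically by hashing a hypothetical `user_id` column with CRC32, and the count found in the sample is scaled up by the sampling fraction.

```python
import zlib


def in_sample(key, fraction):
    """Deterministically decide whether `key` belongs to the sample."""
    bucket = zlib.crc32(str(key).encode("utf-8")) / 2**32  # value in [0, 1)
    return bucket < fraction


def estimate_count(rows, fraction, predicate):
    """Count matching rows in the sample and scale up to the full data set."""
    hits = sum(
        1 for row in rows
        if in_sample(row["user_id"], fraction) and predicate(row)
    )
    return hits / fraction


rows = [{"user_id": i, "clicks": i % 7} for i in range(10_000)]
approx = estimate_count(rows, 0.1, lambda r: r["clicks"] == 0)
exact = sum(1 for r in rows if r["clicks"] == 0)
print(f"approximate={approx:.0f} exact={exact}")
```

Because membership depends only on the hash of the key, the same rows are chosen on every run, and a smaller sample is always a subset of a larger one.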

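ClickHouse supports JOINs across multiple tables. It prefers the hash join algorithm and falls back to the merge join algorithm when more than one large table is involved.

ClickHouse uses asynchronous multi-master replication. After data is written to any available replica, the system distributes it to the remaining replicas in the background so that all replicas end up with identical data. In most cases ClickHouse recovers from failures automatically; a few complex cases require manual recovery.

ClickHouse implements user account management through SQL queries and supports role-based access control, similar to what the ANSI SQL standard and popular relational database management systems provide.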

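Disadvantages

No full-fledged transaction support.

No ability to modify or delete already inserted data at a high rate and with low latency. Only batch deletes and updates are available.

The sparse index makes ClickHouse a poor fit for point queries that retrieve single rows by key.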

III. Installation and Deployment

RPM installation

$ grep -q sse4_2 /proc/cpuinfo && echo "SSE 4.2 supported" || echo "SSE 4.2 not supported"
$ yum install -y yum-utils
$ yum-config-manager --add-repo https://packages.clickhouse.com/rpm/clickhouse.repo
$ yum install -y clickhouse-server clickhouse-client

Binary package installation

$ cd /usr/local/src
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-common-static-dbg-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-common-static-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-server-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-client-20.8.7.15.tgz
$ tar zxf clickhouse-common-static-dbg-20.8.7.15.tgz
$ tar zxf clickhouse-common-static-20.8.7.15.tgz
$ tar zxf clickhouse-server-20.8.7.15.tgz
$ tar zxf clickhouse-client-20.8.7.15.tgz
$ clickhouse-common-static-20.8.7.15/install/doinst.sh
$ clickhouse-common-static-dbg-20.8.7.15/install/doinst.sh
$ clickhouse-server-20.8.7.15/install/doinst.sh
$ clickhouse-client-20.8.7.15/install/doinst.sh

Configuration file reference

config.xml, users.xml

Reference values for global database settings

name	value	description	type
max_compress_block_size	262144	The maximum size of blocks of uncompressed data before compressing for writing to a table.	UInt64
max_insert_block_size	262144	The maximum block size for insertion, if we control the creation of blocks for insertion.	UInt64
max_threads	15	The maximum number of threads to execute the request. By default, it is determined automatically.	MaxThreads
use_uncompressed_cache	1	Whether to use the cache of uncompressed blocks.	Bool
distributed_directory_monitor_sleep_time_ms	1000	Sleep time for StorageDistributed DirectoryMonitors, in case of any errors delay grows exponentially.	Milliseconds
distributed_directory_monitor_max_sleep_time_ms	256000	Maximum sleep time for StorageDistributed DirectoryMonitors, it limits exponential growth too.	Milliseconds
distributed_directory_monitor_batch_inserts	1	Should StorageDistributed DirectoryMonitors try to batch individual inserts into bigger ones.	Bool
load_balancing	random	Which replicas (among healthy replicas) to preferably send a query to (on the first attempt) for distributed processing.	LoadBalancing
log_queries	1	Log requests and write the log to the system table.	Bool
send_progress_in_http_headers	1	Send progress notifications using X-ClickHouse-Progress headers. Some clients do not support high amount of HTTP headers (Python requests in particular), so it is disabled by default.	Bool
http_headers_progress_interval_ms	60000	Do not send HTTP headers X-ClickHouse-Progress more frequently than at each specified interval.	UInt64
max_bytes_before_external_group_by	20000000000		UInt64
max_execution_time	300		Seconds
max_expanded_ast_elements	50000	Maximum size of query syntax tree in number of nodes after expansion of aliases and the asterisk.	UInt64
readonly	0	0 - everything is allowed. 1 - only read requests. 2 - only read requests, as well as changing settings, except for the 'readonly' setting.	UInt64
max_memory_usage	40000000000	Maximum memory usage for processing of single query. Zero means unlimited.	UInt64
max_memory_usage_for_user	48103633715	Maximum memory usage for processing all concurrently running queries for the user. Zero means unlimited.	UInt64
memory_profiler_step	4194304	Whenever query memory usage becomes larger than every next step in number of bytes the memory profiler will collect the allocating stack trace. Zero means disabled memory profiler. Values lower than a few megabytes will slow down query processing.	UInt64
log_query_threads	1	Log query threads into system.query_thread_log table. This setting have effect only when 'log_queries' is true.	Bool
allow_ddl	1	If it is set to true, then a user is allowed to executed DDL queries.	Bool
mysql_datatypes_support_level	decimal,datetime64	Which MySQL types should be converted to corresponding ClickHouse types (rather than being represented as String). Can be empty or any combination of 'decimal' or 'datetime64'. When empty MySQL's DECIMAL and DATETIME/TIMESTAMP with non-zero precision are seen as String on ClickHouse's side.	MySQLDataTypesSupport
enforce_on_cluster_default_for_ddl	1	Whether ON CLUSTER CLAUSE is auto enforced for DDLs.	Bool
only_allow_replicated_tbls_on_ha_cluster	1	Only allow Replicated*MergeTree on HA clusters	Bool
prefer_remote_call_kill_query_on_cluster	1	Prefer using remote call to kill query on cluster XXX, not use DDL Worker.	Bool
allow_experimental_data_skipping_indices	1	Obsolete setting, does nothing. Will be removed after 2020-05-31	Bool

Startup

<!== ClickHouse needs at least 4 GB of memory to start ==>

$ systemctl start clickhouse-server
$ systemctl enable clickhouse-server
$ systemctl status clickhouse-server

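IV. MergeTree Table Engines

ClickHouse has two kinds of underlying engines, database engines and table engines; table engines deserve most of the attention.

Database engines

Five database engines are currently supported: Ordinary, Dictionary, Memory, Lazy, and MySQL. Ordinary is the default, and a database of this type can hold tables of any table engine. The five engines in brief:

Ordinary engine: the default; a database created without an explicit engine is an Ordinary database.

Dictionary engine: automatically creates a table for every dictionary.

Memory engine: keeps all data in memory only, so the data is lost when the server restarts; only Memory-engine tables can be created in such a database.

MySQL engine: automatically fetches data from a remote MySQL server and creates MySQL table-engine tables for it in this database.

Lazy engine: keeps tables in memory only for expiration_time_in_seconds after the last access; suitable only for Log-engine tables.

Table engines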

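ClickHouse table engines come in four families (Log, MergeTree, Integration, Special). The Log family is meant for analysing small tables, the MergeTree family for analysing large volumes of data, and the Integration family mostly for integrating external data sources. The Log, Special, and Integration engines have comparatively narrow use cases and simple functionality and serve special purposes, whereas the MergeTree engines combine orthogonally with two special engines (Replicated and Distributed) to produce a range of MergeTree variants with different capabilities.

The MergeTree family is the storage engine the project recommends first, and it supports almost all core ClickHouse features. Commonly used engines in the family are MergeTree, ReplacingMergeTree, CollapsingMergeTree, VersionedCollapsingMergeTree, SummingMergeTree, and AggregatingMergeTree.

MergeTree

The MergeTree table engine is mainly used for analysing massive data sets. It supports data partitioning, ordered storage, a primary key index, sparse indexing, data TTL, and more. MergeTree supports the full ClickHouse SQL syntax, but some features behave differently from MySQL; for example, the primary key in MergeTree is not used for deduplication.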
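
ReplacingMergeTree

To work around the fact that MergeTree does not deduplicate rows with the same primary key, ClickHouse provides the ReplacingMergeTree engine. ReplacingMergeTree ensures that data is eventually deduplicated, but it cannot guarantee that a query never sees duplicate keys: rows with the same key may be sharded to different nodes while compaction (merging) only happens within a single node, and the moment at which a merge runs is not predictable. ReplacingMergeTree is therefore suited to clearing out duplicates in the background to save space, but it does not guarantee the absence of duplicates.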
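
The "eventually deduplicated" behaviour can be modelled in a few lines. The sketch below is a simplified illustration, not the engine's code: it assumes each row carries a version (corresponding to the engine's optional version column) and that a merge keeps the row with the highest version for each key. The function name and sample data are hypothetical.

```python
def replacing_merge(parts):
    """Merge parts, keeping for each key the row with the highest version.

    Each row is (key, version, value); on equal versions the later row wins.
    """
    latest = {}
    for part in parts:
        for key, version, value in part:
            if key not in latest or version >= latest[key][0]:
                latest[key] = (version, value)
    return [(key, ver, val) for key, (ver, val) in sorted(latest.items())]


part_1 = [(1, 1, "alice"), (2, 1, "bob")]
part_2 = [(1, 2, "alice-renamed")]

# Before the merge a query reads both parts and sees key 1 twice.
print(part_1 + part_2)
# After the merge only the newest row per key remains.
print(replacing_merge([part_1, part_2]))
```

Until the two parts are merged, a query over both still returns the duplicate, which is the caveat described above.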
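
CollapsingMergeTree

The CollapsingMergeTree engine requires a marker column, Sign, to be declared in the CREATE TABLE statement (1 on insert, -1 on delete). During background compaction, rows with the same primary key and opposite Sign values are collapsed, that is, deleted, which removes the limitation of ReplacingMergeTree. The engine can greatly reduce storage volume and thereby make SELECT queries more efficient.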
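
A rough model of collapsing, for illustration only: the real engine applies more detailed rules when merging a part, whereas this sketch simply cancels each Sign = -1 row against the most recent surviving Sign = 1 row with the same key. The function name and sample rows are assumptions.

```python
def collapse(rows):
    """Cancel every Sign = -1 row against an earlier Sign = 1 row with the same key.

    Each row is (key, value, sign), listed in insertion order.
    """
    survivors = []
    for key, value, sign in rows:
        if sign == -1:
            # Look for the most recent surviving state row with this key.
            for i in range(len(survivors) - 1, -1, -1):
                if survivors[i][0] == key and survivors[i][2] == 1:
                    del survivors[i]
                    break
            else:
                survivors.append((key, value, sign))  # nothing to cancel yet
        else:
            survivors.append((key, value, sign))
    return survivors


rows = [
    ("page-1", 5, 1),    # initial state
    ("page-1", 5, -1),   # cancel the old state ...
    ("page-1", 6, 1),    # ... and write the new one
    ("page-2", 3, 1),
]
print(collapse(rows))
```

The sum of value * sign is the same before and after collapsing, so aggregates written with that pattern give the same answer whether or not the merge has happened yet.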
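
VersionedCollapsingMergeTree

VersionedCollapsingMergeTree serves the same purpose as CollapsingMergeTree but uses a different collapsing algorithm that allows data to be inserted in any order by multiple threads. It adds a Version column to the CREATE TABLE statement, which records which state row belongs to which cancel row even when they are inserted in the wrong order. CollapsingMergeTree, by contrast, only allows strictly consecutive insertion.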
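
SummingMergeTree

ClickHouse supports pre-aggregation by primary key through SummingMergeTree. During background compaction, rows with the same primary key are summed and replaced by a single row, which sharply reduces storage usage and speeds up aggregation queries.

AggregatingMergeTree

AggregatingMergeTree is another pre-aggregating engine used to speed up aggregation. It differs from SummingMergeTree in that SummingMergeTree applies sum to the non-key columns, whereas AggregatingMergeTree lets you specify arbitrary aggregate functions.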
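
The difference between the two pre-aggregating engines can be pictured with one helper that takes the combining function as a parameter. This is a conceptual sketch with hypothetical names and data, not how either engine is implemented.

```python
def merge_by_key(rows, combine):
    """Replace all rows that share a key with one row, combining their metric values."""
    merged = {}
    for key, metric in rows:
        merged[key] = combine(merged[key], metric) if key in merged else metric
    return sorted(merged.items())


rows = [("2022-04-18", 10), ("2022-04-19", 4), ("2022-04-18", 7)]

# Summation, in the spirit of SummingMergeTree.
print(merge_by_key(rows, lambda a, b: a + b))
# Any other aggregate, in the spirit of AggregatingMergeTree.
print(merge_by_key(rows, max))
```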

MergeTree CREATE TABLE syntax:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

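Clause descriptions:

ENGINE - the engine name and its parameters.

ORDER BY - the sorting key. If no primary key is declared explicitly with PRIMARY KEY, ClickHouse uses the sorting key as the primary key.

PARTITION BY - the partitioning key. In most cases you do not need a partitioning key at all, and in most of the remaining cases you do not need one more granular than by month. Partitioning does not speed up queries (unlike the ORDER BY expression). Never use overly granular partitioning, and do not partition by client identifier or name (make it the first column of the ORDER BY expression instead). Partitioning by month is the usual choice.

PRIMARY KEY - the primary key. By default it is the same as the sorting key (given by the ORDER BY clause), so in most cases a separate PRIMARY KEY clause is unnecessary.

SAMPLE BY - the expression used for sampling.

TTL - a list of rules that specify how long rows are stored and how data parts are moved between disks and volumes.

SETTINGS - additional parameters that control MergeTree behaviour.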
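
How PARTITION BY and ORDER BY divide the work can be pictured with a toy layout. The sketch below assumes a table partitioned by month (as `toYYYYMM(day)` would do) and sorted by `(client_id, day)`; the column names and data are hypothetical, and this is a model of the resulting layout rather than of the storage format.

```python
from collections import defaultdict
from datetime import date


def layout(rows):
    """Group rows into monthly partitions and sort each partition by (client_id, day)."""
    partitions = defaultdict(list)
    for client_id, day in rows:
        partitions[day.year * 100 + day.month].append((client_id, day))
    return {month: sorted(part) for month, part in sorted(partitions.items())}


rows = [
    (42, date(2022, 4, 18)),
    (7, date(2022, 4, 2)),
    (42, date(2022, 3, 30)),
    (7, date(2022, 4, 20)),
]
for month, part in layout(rows).items():
    print(month, part)
```

The client identifier leads the sort order inside each partition, so all rows of one client sit next to each other, while the partition key stays coarse.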

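V. Command-Line Client Options

--host, -h - the server host name, localhost by default. You can use either a host name or an IPv4 or IPv6 address.

--port - the port to connect to. Default: 9000. Note that the HTTP interface and the native TCP interface use different ports.

--user, -u - the user name. Default: default.

--password - the password. Default: empty string.

--query, -q - the query to run in non-interactive mode.

--database, -d - the database to operate on by default. Default: the database from the server settings (default unless changed).

--multiline, -m - if specified, allows multi-line queries (Enter only inserts a line break and does not end the query).

--multiquery, -n - if specified, allows several queries separated by semicolons to be processed; only takes effect in non-interactive mode.

--format, -f - the default format for printing results.

--vertical, -E - if specified, results are printed in vertical format by default. This is the same as --format=Vertical. In this format each value is printed on its own line, which helps when displaying wide tables.

--time, -t - if specified, prints the query execution time to stderr in non-interactive mode.

--stacktrace - if specified, prints the stack trace when an exception occurs.

--config-file - the name of the configuration file.

--secure - if specified, connects to the server over a secure connection.

--history_file - the path to the file that stores the command history.

--param_<name> - the value for a query with parameters.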
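
When the client is driven from a script, it helps to assemble the options in one place. The helper below is a hypothetical convenience function, not part of ClickHouse; it uses only options from the list above and returns an argument list suitable for `subprocess.run`.

```python
def build_client_args(query, host="localhost", port=9000, user="default",
                      password="", database=None, output_format=None):
    """Build an argument list for clickhouse-client in non-interactive mode."""
    args = ["clickhouse-client", "--host", host, "--port", str(port), "--user", user]
    if password:
        args += ["--password", password]
    if database:
        args += ["--database", database]
    if output_format:
        args += ["--format", output_format]
    return args + ["--query", query]


args = build_client_args("SELECT 1", database="default", output_format="Vertical")
print(" ".join(args))
# To actually run it (requires a reachable server):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems in the query text.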

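VI. System Tables

system.asynchronous_metric_log

Contains the historical values of system.asynchronous_metrics, recorded once per minute. Enabled by default.

system.asynchronous_metrics

Contains metrics that are calculated periodically in the background, for example the amount of RAM in use.

system.clusters

Contains information about the clusters available in the configuration file and the servers in them.

system.columns

Contains information about the columns of all tables.

You can use this table to get information similar to the DESCRIBE TABLE query, but for several tables at once.

Columns of temporary tables are visible in system.columns only in the session where they were created, and their database field is shown as empty.

system.contributors

Contains information about contributors. The order is randomised at query execution time.

system.crash_log

Contains stack traces of fatal errors. The table does not exist by default; it is created only when a fatal error occurs.

system.current_roles

Contains the active roles of the current user. SET ROLE changes the contents of this table.

system.data_skipping_indices

Contains information about the existing data skipping indices in all tables.

system.data_type_families

Contains information about supported data types.

system.databases

Contains information about the databases available to the current user.

system.detached_parts

Contains information about detached parts of MergeTree tables. The reason column explains why a part was detached.

For parts detached by the user, the reason is empty. Such parts can be attached with the ALTER TABLE ATTACH PARTITION|PART command.

For a description of the other columns, see system.parts.

If a part name is invalid, the values of some columns may be NULL. Such parts can be deleted with ALTER TABLE DROP DETACHED PART.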
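
system.disks

Contains information about the disks defined in the server configuration.

system.distributed_ddl_queue

Contains information about distributed DDL queries (cluster environments) executed on a cluster.

system.distribution_queue

Contains information about local files queued to be sent to shards. These files hold new parts created by inserting data into a Distributed table in asynchronous mode.

system.enabled_roles

Contains all roles currently active, including the current role of the current user and the roles granted to that role.

system.errors

Contains error codes and the number of times each was triggered.

system.functions

Contains information about regular and aggregate functions.

system.grants

The privileges granted to ClickHouse user accounts.

system.merge_tree_settings

Contains information about the settings of MergeTree tables.

system.metrics

Contains metrics that can be calculated instantly or that have a current value, for example the number of queries being processed simultaneously or the current replica delay. This table is always up to date.

system.mutations

Contains information about mutations of MergeTree tables and their progress. Each mutation command is represented by one row.

system.parts

Contains information about the data parts of MergeTree tables. Each row describes one data part.

system.parts_columns

Contains information about the parts and columns of MergeTree tables. Each row describes one data part.

system.query_log

Contains information about executed queries, such as start time, processing duration, and error messages. query_log can be disabled by setting log_queries=0, but this is not recommended because the information in the table is important for troubleshooting. ClickHouse does not delete data from this table automatically.

system.query_views_log

Contains information about the dependent views executed when a query runs, such as the view type or execution time.

system.quota_usage

Quota usage of the current user: how much is used and how much is left.

system.quotas_usage

Quota usage of all users.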
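
system.replicated_fetches

Contains information about the background fetches currently running.

system.replication_queue

Contains information about tasks from the replication queues stored in ZooKeeper for tables of the ReplicatedMergeTree family.

system.role_grants

Contains the role grants for users and roles. To add entries to this table, use GRANT role TO user.

system.roles

Contains information about configured roles.

system.settings

Contains information about the session settings of the current user.

system.settings_profiles

Contains the properties of configured settings profiles.

system.storage_policies

Contains information about the storage policies and volumes defined in the server configuration.

system.tables

Contains the metadata of every table the server knows about. Detached tables are not shown in system.tables. Temporary tables are visible in system.tables only in the session where they were created; their database field is empty and the is_temporary flag is set.

system.time_zones

Contains the list of time zones supported by the ClickHouse server. The list may vary between ClickHouse versions.

system.users

Contains the list of user accounts configured on the server.

system.part_log

The system.part_log table is created only if the part_log server setting is specified. It contains information about events that happen to data parts in MergeTree-family tables, such as adding or merging data.

system.query_thread_log

Contains information about the threads that execute queries, for example the thread name, thread start time, and query processing duration.

system.replicas

Contains information about and the status of the replicated tables residing on the local server. The table can be used for monitoring and contains one row for every Replicated* table.

system.events

Contains information about the number of events that have occurred in the system. For example, you can find out how many SELECT queries have been processed since the ClickHouse server started.

system.table_engines

Contains descriptions of the table engines supported by the server and the features they support.

system.merges

Contains information about the merges and part mutations currently in progress for tables of the MergeTree family.

system.processes

Contains information about the queries currently being processed.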

VII. Data Backup and Restore

<!== This section uses the third-party tool clickhouse-backup, which supports only MergeTree-family table engines ==>

1. Install clickhouse-backup

$ yum -y install https://github.com/AlexAkulov/clickhouse-backup/releases/download/1.3.0/clickhouse-backup-1.3.0-1.x86_64.rpm

2. Configuration file explained

$ cat > /etc/clickhouse-backup/config.yml <<'EOF'
general:
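  remote_storage: none          # if `none`, the `upload` and `download` commands will fail
  max_file_size: 1073741824      # 1 GiB by default; ignored when upload_by_part is true; used to split data part files into archives
  disable_progress_bar: true      # hide the progress bar during upload and download; the bar is only meaningful when `upload_concurrency` and `download_concurrency` are both 1
  backups_to_keep_local: 0       # how many of the latest local backups to keep; 0 keeps every created backup on local disk
  backups_to_keep_remote: 0       # how many of the latest backups to keep on remote storage; 0 keeps every uploaded backup
  log_level: info   # log level
  allow_empty_backups: false       # allow empty backups
  download_concurrency: 1      # number of parallel downloads, at most 255, regardless of remote storage type
  upload_concurrency: 1        # number of parallel uploads, at most 255, regardless of remote storage type
  restore_schema_on_cluster: ""    # run all schema-related SQL queries as distributed DDL with an `ON CLUSTER` clause; look up the correct cluster name in the `system.clusters` table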
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default     # user name
  password: ""          # password
  host: localhost       # host
  port: 9000            # port
  disk_mapping: {}      # use this when `system.disks` on the server you restore to differs from the source server
  skip_tables:          # tables to skip
  - system.*
  - INFORMATION_SCHEMA.*
  - information_schema.*
  timeout: 5m                # connection timeout
  freeze_by_part: false
  secure: false            # connect over SSL
  skip_verify: false        # skip certificate verification
  sync_replicated_tables: false      # sync replicated tables
  log_sql_queries: false            # log clickhouse-backup SQL queries to the `system.query_log` table of clickhouse-server
  config_dir: /etc/clickhouse-server/        # configuration directory
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: true       # avoids backup failures when tables and databases are frequently created and dropped during a backup: clickhouse-backup ignores `code: 60` and `code: 81` errors from `ALTER TABLE ... FREEZE`
  debug: false
s3:
  access_key: ""
  secret_key: ""
  bucket: ""
  endpoint: ""
  region: us-east-1
  acl: private
  assume_role_arn: ""
  force_path_style: false
  path: ""
  disable_ssl: false
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 0
  debug: false
gcs:
  credentials_file: ""
  credentials_json: ""
  bucket: ""
  path: ""
  compression_level: 1
  compression_format: tar
  debug: false
  endpoint: ""
cos:
  url: ""
  timeout: 2m
  secret_id: ""
  secret_key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  debug: false
api:
  listen: localhost:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
  allow_parallel: false
ftp:
  address: ""
  timeout: 2m
  username: ""
  password: ""
  tls: false
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
sftp:
  address: ""
  port: 22
  username: ""
  password: ""
  key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
azblob:
  endpoint_suffix: core.windows.net
  account_name: ""
  account_key: ""
  sas: ""
  use_managed_identity: false
  container: ""
  path: ""
  compression_level: 1
  compression_format: tar
  sse_key: ""
  buffer_size: 0
  buffer_count: 3
EOF

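3. Command reference

USAGE:
  clickhouse-backup <command> [-t, --tables=<db>.<table>] <backup_name>
DESCRIPTION:
  Must be run as root or as the clickhouse user
COMMAND:
  tables                print the tables that can be backed up
  create                create a backup
  create_remote         create a backup and upload it to remote storage
  upload                upload a backup to remote storage
  list                  print backup information
  download              download a backup from remote storage
  restore               restore data from a backup
  restore_remote        download a backup from remote storage and restore it
  delete                delete the specified backup
  default-config        print the default configuration
  print-config          print the current configuration
  clean                 remove data from the 'shadow' folder in every `path` folder available in `system.disks`
  server                run in API server mode
GLOBAL OPTIONS:
  --config FILE, -c FILE    configuration file to use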

4. Backup

$ clickhouse-backup tables
$ clickhouse-backup create full_backup_$(date +%F)
$ clickhouse-backup list
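
Names of the form `full_backup_$(date +%F)` sort chronologically, which makes a retention rule such as `backups_to_keep_local` easy to reason about. The sketch below models that rule as described in the configuration comments (0 keeps everything); it is an illustration with a hypothetical function name, not the tool's own code.

```python
def backups_to_delete(names, keep):
    """Return the backups that fall outside the newest `keep`; 0 keeps everything."""
    if keep == 0:
        return []
    ordered = sorted(names)  # ISO dates in the name sort chronologically
    return ordered[:-keep] if keep < len(ordered) else []


names = ["full_backup_2022-04-16", "full_backup_2022-04-18", "full_backup_2022-04-17"]
print(backups_to_delete(names, 2))
```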

5. Restore

<!== To migrate to another host, copy the backup files to that host before restoring ==>

$ clickhouse-backup restore --rm <backup_name>

VIII. Cluster Installation

1. Install ZooKeeper

<!== Run on all hosts ==>

$ tar zxf jdk8.tar.gz -C /usr/local
$ echo 'export JAVA_HOME="/usr/local/jdk8"' >>/etc/profile
$ echo 'export PATH="$JAVA_HOME/bin:$PATH"' >> /etc/profile
$ source /etc/profile
$ java -version
$ cd /usr/local/src && wget https://downloads.apache.org/zookeeper/zookeeper-3.5.9/apache-zookeeper-3.5.9-bin.tar.gz
$ tar zxf apache-zookeeper-3.5.9-bin.tar.gz -C /usr/local/
$ cd .. && mv apache-zookeeper-3.5.9-bin/ zookeeper
$ echo 'export ZOOKEEPER_HOME="/usr/local/zookeeper"' >>/etc/profile
$ echo 'export PATH="$ZOOKEEPER_HOME/bin:$PATH"' >>/etc/profile
$ source /etc/profile
$ cd zookeeper/conf && cp -p zoo_sample.cfg zoo.cfg
$ mkdir -p /data/zookeeper
$ mkdir /var/log/zookeeper
$ cat > zoo.cfg<<'EOF'
dataDir=/data/zookeeper
dataLogDir=/var/log/zookeeper/
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=1024
#standaloneEnabled=true
#admin.enableServer=true
server.1=10.81.32.44:2888:3888
server.2=10.81.0.101:2888:3888
clientPort=2181
EOF

<!== Run on host1 ==>

$ echo 1 > /data/zookeeper/myid

<!== Run on host2 ==>

$ echo 2 > /data/zookeeper/myid

<!== Run on all hosts ==>

$ cat > /etc/systemd/system/zookeeper.service <<'EOF'
[Unit]
Description=zookeeper.service
After=network.target

[Service]
User=zookeeper
Type=forking
Environment=ZOO_LOG_DIR=/var/log/zookeeper
Environment=JAVA_HOME=/usr/local/jdk8
ExecStart=/usr/local/zookeeper/bin/zkServer.sh start
ExecStop=/usr/local/zookeeper/bin/zkServer.sh stop
ExecReload=/usr/local/zookeeper/bin/zkServer.sh restart
Restart=on-failure
StartLimitInterval=60
StartLimitBurst=3000

[Install]
WantedBy=multi-user.target
EOF
$ groupadd zookeeper
$ useradd -g zookeeper -M -s /sbin/nologin zookeeper
$ chown -R zookeeper. /data/zookeeper/
$ chown -R zookeeper. /var/log/zookeeper/
$ systemctl daemon-reload
$ systemctl start zookeeper
$ systemctl enable zookeeper
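
A note on ensemble size: ZooKeeper stays available only while a majority of its servers is up. The arithmetic below (a plain calculation, with hypothetical function names) shows what that means for the two servers listed in zoo.cfg compared with larger ensembles.

```python
def quorum_size(servers):
    """Smallest number of servers that forms a majority."""
    return servers // 2 + 1


def tolerated_failures(servers):
    """How many servers can fail while a majority is still available."""
    return servers - quorum_size(servers)


for n in (2, 3, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

With two servers the majority is both of them, so losing either one stops the ensemble; three servers are the smallest setup that survives a single failure.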

2. Install ClickHouse

<!== Run on all hosts ==>

$ cd /usr/local/src
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-common-static-dbg-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-common-static-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-server-20.8.7.15.tgz
$ wget -c -t 2 https://github.com/ClickHouse/ClickHouse/releases/download/v20.8.7.15-lts/clickhouse-client-20.8.7.15.tgz
$ tar zxf clickhouse-common-static-dbg-20.8.7.15.tgz
$ tar zxf clickhouse-common-static-20.8.7.15.tgz
$ tar zxf clickhouse-server-20.8.7.15.tgz
$ tar zxf clickhouse-client-20.8.7.15.tgz
$ clickhouse-common-static-20.8.7.15/install/doinst.sh
$ clickhouse-common-static-dbg-20.8.7.15/install/doinst.sh
$ clickhouse-server-20.8.7.15/install/doinst.sh
$ clickhouse-client-20.8.7.15/install/doinst.sh

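3. Configure the cluster

Edit config.xml

<!== Run on all hosts ==>

Modify it as described for the single-node deployment above, and remember to change the listen address.

Add a substitution file that loads the cluster settings

<!== Run on all hosts; adjust the macros section for each host ==>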

$ vim /etc/metrika.xml
<yandex>
<!-- This tag must match <remote_servers incl="clickhouse_remote_servers" > in config.xml -->
<clickhouse_remote_servers>
    <!-- Cluster name; can be changed -->
    <cluster_2shards_1replicas>
        <!-- Two shards, one machine per shard -->
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>10.81.0.101</host>
                <port>9000</port>
                <user>cluster_user</user>
                <password>admin123456</password>
            </replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>10.81.32.44</host>
                <port>9000</port>
                <user>cluster_user</user>
                <password>admin123456</password>
            </replica>
        </shard>
    </cluster_2shards_1replicas>
</clickhouse_remote_servers>
<!-- This tag must match <zookeeper incl="zookeeper-servers" optional="true" > in config.xml -->
<zookeeper-servers>
    <node>
        <host>10.81.32.44</host>
        <port>2181</port>
    </node>
    <node>
        <host>10.81.0.101</host>
        <port>2181</port>
    </node>
</zookeeper-servers>
<!-- Shard and replica identifiers: <shard> holds the shard number and <replica> the host name of this replica; change both on each host -->
<macros>
    <shard>01</shard>
    <replica>10.81.0.101</replica>  
</macros>    
</yandex>

Edit the user configuration file and add the cluster user

<!== Run on all hosts ==>

$ vim /etc/clickhouse-server/users.xml
<?xml version="1.0"?>
<yandex>
    <profiles>
        <default>
            <max_memory_usage>10000000000</max_memory_usage>
            <max_threads>8</max_threads>
            <receive_timeout>600</receive_timeout>
            <send_timeout>600</send_timeout>            
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
        </default>
        <readonly>
            <max_memory_usage>10000000000</max_memory_usage>
            <max_threads>8</max_threads>
            <receive_timeout>600</receive_timeout>
            <send_timeout>600</send_timeout>
            <use_uncompressed_cache>0</use_uncompressed_cache>
            <load_balancing>random</load_balancing>
            <allow_ddl>0</allow_ddl>
            <readonly>1</readonly>
        </readonly>
    </profiles>

    <!-- Users and ACL. -->
    <users>
        <default>
            <password_sha256_hex>ac0e7d037817094e9e0b4441f9bae3209d67b02fa484917065f71b16109a1a78</password_sha256_hex>
            <networks incl="networks" replace="replace">
                <ip>::/0</ip>
            </networks>
            <profile>readonly</profile>
            <quota>default</quota>
        </default>
        <root>
            <password_sha256_hex>ac0e7d037817094e9e0b4441f9bae3209d67b02fa484917065f71b16109a1a78</password_sha256_hex>
            <networks>
                <ip>10.81.0.0/16</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>1</access_management>
        </root>
        <cluster_user>
            <password_sha256_hex>ac0e7d037817094e9e0b4441f9bae3209d67b02fa484917065f71b16109a1a78</password_sha256_hex>
            <networks>
                <ip>10.81.0.0/16</ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </cluster_user>
    </users>

    <!-- Quotas. -->
    <quotas>
        <default>
            <interval>
                <duration>3600</duration>
                <queries>0</queries>
                <errors>0</errors>
                <result_rows>0</result_rows>
                <read_rows>0</read_rows>
                <execution_time>0</execution_time>
            </interval>
        </default>
    </quotas>
</yandex>
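
The `password_sha256_hex` values in users.xml are SHA-256 digests of the password written as hexadecimal. A minimal sketch for producing such a value with the Python standard library follows; the function name and the sample password are placeholders.

```python
import hashlib


def password_sha256_hex(password):
    """Return the SHA-256 digest of the password as 64 hex characters."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


print(password_sha256_hex("example-password"))  # placeholder, not a real credential
```

Paste the printed value into the `<password_sha256_hex>` element of the user in question.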

4. Start the cluster

$ systemctl start clickhouse-server
$ systemctl enable clickhouse-server
$ systemctl status clickhouse-server

Check the cluster information by querying the system.clusters table.

5. Verify that the cluster works

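Create the database and tables

create database test on cluster 'cluster_2shards_1replicas';
-- Create the local table on every node; no PARTITION BY, since partitioning by id would be far too granular
CREATE TABLE test.local ON CLUSTER 'cluster_2shards_1replicas' (id Int32, name String) ENGINE = MergeTree() ORDER BY id;
-- Create the distributed table on top of test.local, using id as the sharding key
CREATE TABLE test.cluster ON CLUSTER 'cluster_2shards_1replicas' (id Int32, name String) ENGINE = Distributed(cluster_2shards_1replicas, test, local, id);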
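
The last argument of the Distributed engine, `id`, is the sharding key. As a rough model of how rows inserted through the distributed table are spread over the shards: the key's remainder modulo the total shard weight selects the shard. The sketch assumes non-negative keys and the default weight of 1 per shard; the function name is hypothetical.

```python
def pick_shard(sharding_value, weights):
    """Return the index of the shard that receives a row."""
    remainder = sharding_value % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise ValueError("weights must be positive")


weights = [1, 1]  # two shards, each with the default weight of 1
for row_id in range(1, 5):
    print(row_id, "-> shard", pick_shard(row_id, weights))
```

After inserting a few rows through test.cluster, counting the rows of test.local on each host should show them split between the two shards along these lines.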