Advanced ClickHouse

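1. Execution Plans

Before ClickHouse 20.6, the only way to see the execution plan of a SQL statement was to set the log level to TRACE, actually execute the statement, and read the plan from the execution log. Version 20.6 introduced native execution plan syntax, which became an official feature in 20.6.3.28.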

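EXPLAIN [AST | SYNTAX | PLAN | PIPELINE] [setting = value, ...] SELECT ... [FORMAT ...]
  • PLAN: shows the execution plan; this is the default
      • header: print the output header of each step in the plan; off by default (0)
      • description: print the description of each step in the plan; on by default (1)
      • actions: print detailed information about each step in the plan; off by default (0)
  • AST: shows the abstract syntax tree
  • SYNTAX: shows the query after syntax optimization
  • PIPELINE: shows the pipeline plan
      • header: print the output header of each step in the plan; off by default
      • graph: describe the pipeline in the DOT graph language; off by default; graphviz is needed to view the resulting graph
      • compact: when graph is enabled, print the graph in compact form; on by default

Note: PLAN and PIPELINE accept the additional display settings listed above.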
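The syntax above can be exercised by assembling EXPLAIN statements as plain strings. The following is a minimal sketch: `build_explain` is a hypothetical helper introduced here for illustration (it is not part of any ClickHouse client library), and it only builds the statement text without contacting a server.

```python
def build_explain(query, kind="PLAN", **settings):
    """Return the text of an EXPLAIN statement (hypothetical helper, no server access)."""
    allowed = {"AST", "SYNTAX", "PLAN", "PIPELINE"}
    kind = kind.upper()
    if kind not in allowed:
        raise ValueError(f"unsupported EXPLAIN type: {kind}")
    # Settings such as header/description/actions are written as "name = 0|1".
    opts = ", ".join(f"{name} = {int(value)}" for name, value in settings.items())
    parts = ["EXPLAIN", kind]
    if opts:
        parts.append(opts)
    parts.append(query.strip().rstrip(";"))
    return " ".join(parts)


plan_sql = build_explain("SELECT count(*) FROM datasets.hits_v1;", header=1, actions=1)
pipeline_sql = build_explain("SELECT 1", kind="pipeline", graph=1)
print(plan_sql)
print(pipeline_sql)
```

The printed strings can then be passed to clickhouse-client with --query, in the same way as the other statements in this document.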

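$ clickhouse-client -h <hostname> --send_logs_level=trace <<< "sql" > /dev/null

Here, the send_logs_level parameter sets the log level to trace, <<< feeds the SQL statement to clickhouse-client for execution, and > /dev/null discards the query result so that only the log is left to read.

Note: on versions without EXPLAIN, its effect can only be approximated in this way, by setting the ClickHouse server log level to DEBUG or TRACE.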

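2. Table Creation Optimization

Although ClickHouse stores DateTime internally as an integer Unix timestamp, storing time fields in a plain integer (Long) column is not recommended. A DateTime column needs no conversion function, so it is more efficient to query and easier to read.

-- When create_time is not a DateTime, it has to go through the toDate conversion function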
CREATE TABLE t_type2
(
    id           UInt32,
    sku_id       String,
    total_amount Decimal(16, 2),
    create_time  Int32
) ENGINE = ReplacingMergeTree(create_time)
      PARTITION BY toYYYYMMDD(toDate(create_time))
      PRIMARY KEY (id)
      ORDER BY (id, sku_id);

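The official documentation points out that Nullable types almost always hurt performance: storing a Nullable column requires an extra file for the NULL markers, and Nullable columns cannot be indexed. So unless there is a very special reason, represent an empty value with the column's default value, or with a value that has no meaning in the business domain (for example, -1 for "no product ID").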

-- Create a table with a Nullable column
CREATE TABLE t_null
(
    x Int8,
    y Nullable(Int8)
) ENGINE = TinyLog;

-- Insert data
INSERT INTO t_null
VALUES (1, NULL),
       (2, 3);

-- Query
SELECT x + y
FROM t_null;
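To make the cost of Nullable more concrete, here is a small sketch of a nullable column modeled as a value array plus a separate null mask, the in-memory analogue of the extra file mentioned above. The class and function names are illustrative assumptions, not ClickHouse internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NullableColumn:
    values: List[int]      # a placeholder value is stored even where the row is NULL
    null_mask: List[bool]  # True where the row is NULL (the extra data to store and read)

    @classmethod
    def from_rows(cls, rows):
        return cls([0 if r is None else r for r in rows], [r is None for r in rows])

    def get(self, i) -> Optional[int]:
        return None if self.null_mask[i] else self.values[i]


def add_columns(x, y):
    """x + y where y is nullable: any NULL operand makes the result NULL."""
    return [None if y.get(i) is None else x[i] + y.get(i) for i in range(len(x))]


x = [1, 2]
y = NullableColumn.from_rows([None, 3])
result = add_columns(x, y)
print(result)
```

Every read of y has to consult the mask as well as the values, which is the overhead a sentinel value such as -1 avoids.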

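Choose the partition granularity according to the business characteristics; it should be neither too coarse nor too fine. Partitioning by day is the usual choice, and a table can also be left unpartitioned by specifying Tuple(). Taking a single table with 100 million rows as an example, keeping the number of partitions at around 10-30 works best.

An index column must be specified. In ClickHouse the index columns are the sort columns, declared with ORDER BY, and they should be the attributes that appear most often as filter conditions in queries. The index can cover a single dimension or a combination of dimensions. Columns are usually ordered with higher-level columns first and more frequently queried columns first. Columns with very high cardinality, such as the userid column of a user table, are not suitable as index columns. Ideally, the data left after filtering stays within a few million rows.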

index_granularity controls the granularity of the index. The default is 8192, and changing it is not recommended unless necessary.
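As a rough illustration of what the granularity means, the sketch below keeps one index entry (mark) per 8192 rows of a sorted key and uses binary search to find the single granule that may contain a key. It assumes unique, sorted integer keys and is a simplified model, not ClickHouse's actual index format.

```python
import bisect


def build_marks(sorted_keys, index_granularity=8192):
    """Keep only the first key of every granule (one mark per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), index_granularity)]


def locate_granule(marks, key):
    """Index of the only granule that can contain `key`, or None if key is below the first mark."""
    pos = bisect.bisect_right(marks, key) - 1
    return pos if pos >= 0 else None


keys = list(range(100_000))           # assumed: unique keys, already sorted
marks = build_marks(keys)
granule = locate_granule(marks, 20_000)
print(len(marks), granule)
```

With 100,000 rows only 13 marks are kept, and a point lookup has to scan at most one granule of 8192 rows instead of the whole column. A smaller granularity narrows each scan but increases the number of marks.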

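If the table does not need to keep its full history, specify a TTL (time to live). This avoids cleaning up expired historical data by hand, and the TTL can be changed at any time with an ALTER TABLE statement.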

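Avoid single-row or small-batch deletes and inserts as far as possible. They produce many small part files and put heavy pressure on the background merge tasks.

Do not write to too many partitions at once, and do not write too fast. When data arrives too quickly, merging cannot keep up and errors are raised. A general recommendation is 2-3 write operations per second, each writing 20,000-50,000 rows (depending on server performance).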
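One common way to follow this advice on the client side is to buffer rows and send them in large batches. The sketch below is a minimal illustration: `BatchWriter` and its `sink` callback are hypothetical names, and the sink stands in for whatever function actually performs the INSERT.

```python
class BatchWriter:
    """Buffer rows and hand them to `sink` in large batches (illustrative sketch)."""

    def __init__(self, sink, batch_size=20_000):
        self.sink = sink              # callable that receives a list of rows
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []          # start a new list so the delivered batch is left intact


batches = []
writer = BatchWriter(sink=batches.append, batch_size=20_000)
for i in range(50_000):
    writer.write((i, f"sku_{i}"))
writer.flush()                        # send the remaining rows
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

A production version would usually also flush on a timer so that rows are not held back indefinitely when traffic is low; that part is left out here.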


Configuration items live mainly in config.xml or users.xml, and most of them are in users.xml.

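  • background_pool_size: size of the background thread pool. Merge threads run in this pool, but it is not used by merges alone. Default 16; recommended value is twice the number of CPUs (counted in threads).
  • background_schedule_pool_size: number of threads that run background tasks. Default 128; recommended value is twice the number of CPUs (counted in threads).
  • background_distributed_schedule_pool_size: number of threads that run background tasks for distributed sends. Default 16; recommended value is twice the number of CPUs (counted in threads).
  • max_concurrent_queries: maximum number of requests processed concurrently (SELECT, INSERT, and so on). Default 100; recommended value is 150-300 (increase it further if that is not enough).
  • max_threads: maximum number of CPUs a single query may use. Defaults to the number of CPU cores.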

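  • max_memory_usage: set in users.xml; the maximum memory a single query may use. It can be set fairly high to raise the upper limit for cluster queries, while leaving part of the memory for the OS; for example, 100GB on a machine with 128GB of memory.
  • max_bytes_before_external_group_by: usually set to half of max_memory_usage. Once GROUP BY uses more memory than this threshold, the data is flushed to disk.
  • max_bytes_before_external_sort: once ORDER BY uses more memory than this value, it spills to disk (disk-based sorting). If it is not set, the query fails outright when memory runs out. With it set, ORDER BY can complete, but more slowly than an in-memory sort (very slowly in practice).
  • max_table_size_to_drop: set in config.xml; applies when tables or partitions are dropped. The default is 50GB, meaning that dropping a table or partition larger than 50GB fails. Changing it to 0 is recommended (any table or partition can then be dropped).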
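The recommendations above are simple arithmetic on the machine size, sketched below. `recommend_settings` is a hypothetical helper; treating GB as GiB and passing the OS reserve as an explicit parameter are assumptions made for the example.

```python
GIB = 1024 ** 3


def recommend_settings(cpu_threads, ram_gib, os_reserve_gib):
    """Derive the values suggested in the text from CPU thread count and RAM size."""
    max_memory_usage = (ram_gib - os_reserve_gib) * GIB
    return {
        "background_pool_size": cpu_threads * 2,
        "background_schedule_pool_size": cpu_threads * 2,
        "background_distributed_schedule_pool_size": cpu_threads * 2,
        "max_memory_usage": max_memory_usage,
        # half of max_memory_usage, as suggested for GROUP BY spilling
        "max_bytes_before_external_group_by": max_memory_usage // 2,
        "max_table_size_to_drop": 0,
    }


settings = recommend_settings(cpu_threads=32, ram_gib=128, os_reserve_gib=28)
for name, value in settings.items():
    print(name, value)
```

For a 128GB machine with 28GB left to the OS this gives a 100GB limit per query and a 50GB spill threshold, matching the example in the list.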

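Without multi-disk storage policies (which older ClickHouse versions lack), ClickHouse cannot be configured with multiple data directories. To improve data I/O performance, you can mount a virtual volume group that binds several physical disks into one volume, which improves read and write performance. In data-heavy query scenarios, SSDs are 2-3 times faster than ordinary mechanical hard disks.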

3. Syntax Optimization

1. Prepare the test data by downloading the official datasets
$ curl -O https://datasets.clickhouse.com/hits/partitions/hits_v1.tar
$ curl -O https://datasets.clickhouse.com/visits/partitions/visits_v1.tar

2. Extract them into the ClickHouse data directory
$ tar xvf hits_v1.tar -C /var/lib/clickhouse
$ tar xvf visits_v1.tar -C /var/lib/clickhouse

3. Change the owner of the data directories
$ sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/data/datasets
$ sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata/datasets

4. Restart ClickHouse
$ sudo systemctl restart clickhouse-server

5. Run the queries. The hits_v1 table has 130+ columns and 8.8+ million rows; the visits_v1 table has 180+ columns and 1.6+ million rows
$ clickhouse-client --query "SELECT count(*) FROM datasets.hits_v1"
$ clickhouse-client --query "SELECT count(*) FROM datasets.visits_v1"

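ClickHouse's SQL optimization is rule-based (RBO, Rule Based Optimization). Some of the optimization rules are shown below.

When the count function is called without a specific column and without a WHERE condition, total_rows from system.tables is used directly, for example:

-- "Optimized trivial count" marks the count optimization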
:) explain plan select count(*) from hits_v1;

EXPLAIN
SELECT count(*)
FROM hits_v1

Query id: 2c1bacf8-187c-430a-95b5-3a3ebdb747af

┌─explain──────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))          │
│   MergingAggregated                                  │
│     ReadFromPreparedSource (Optimized trivial count) │
└──────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.004 sec.

-- If count is applied to a specific column, the optimization is not used
:) explain plan select count(UserID) from hits_v1;

EXPLAIN
SELECT count(UserID)
FROM hits_v1

Query id: f6f060aa-09de-4fa2-a0e9-2909f0800e42

┌─explain───────────────────────────────────────────────────────────────────────┐
│ Expression ((Projection + Before ORDER BY))                                   │
│   Aggregating                                                                 │
│     Expression (Before GROUP BY)                                              │
│       SettingQuotaAndLimits (Set limits and quota after reading from storage) │
│         ReadFromMergeTree                                                     │
└───────────────────────────────────────────────────────────────────────────────┘

5 rows in set. Elapsed: 0.007 sec.

The subquery in the statement below selects the UserID column twice; the duplicate is removed:

:) EXPLAIN SYNTAX
:-] SELECT a.UserID, b.VisitID, a.URL, b.UserID
:-] FROM datasets.hits_v1 AS a
:-]          LEFT JOIN (SELECT UserID, UserID, VisitID FROM datasets.visits_v1) AS b USING (UserID)
:-] LIMIT 3;

EXPLAIN SYNTAX
SELECT
    a.UserID,
    b.VisitID,
    a.URL,
    b.UserID
FROM datasets.hits_v1 AS a
LEFT JOIN
(
    SELECT
        UserID,
        UserID,
        VisitID
    FROM datasets.visits_v1
) AS b USING (UserID)
LIMIT 3

Query id: 119f7b30-39b0-4cc8-a6f4-81f302e35867

┌─explain─────────────────────┐
│ SELECT                      │
│     UserID,                 │
│     VisitID,                │
│     URL,                    │
│     b.UserID                │
│ FROM datasets.hits_v1 AS a  │
│ ALL LEFT JOIN               │
│ (                           │
│     SELECT                  │
│         UserID,             │
│         VisitID             │
│     FROM datasets.visits_v1 │
│ ) AS b USING (UserID)       │
│ LIMIT 3                     │
└─────────────────────────────┘

14 rows in set. Elapsed: 0.008 sec.

When GROUP BY has a HAVING clause but no WITH CUBE, WITH ROLLUP, or WITH TOTALS modifier, the HAVING filter is pushed down to WHERE so that rows are filtered earlier. For example:

-- HAVING example
:) EXPLAIN SYNTAX
:-] SELECT UserID
:-] FROM datasets.hits_v1
:-] GROUP BY UserID
:-] HAVING UserID = '8585742290196126178';

EXPLAIN SYNTAX
SELECT UserID
FROM datasets.hits_v1
GROUP BY UserID
HAVING UserID = '8585742290196126178'

Query id: 761000a4-0043-4f7b-8329-89b2bfbbe055

┌─explain──────────────────────────────┐
│ SELECT UserID                        │
│ FROM datasets.hits_v1                │
│ WHERE UserID = '8585742290196126178' │
│ GROUP BY UserID                      │
└──────────────────────────────────────┘

4 rows in set. Elapsed: 0.003 sec.
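The HAVING-to-WHERE rewrite is safe here because the predicate only refers to the grouping key. The toy sketch below (illustrative data and function names, not ClickHouse code) shows that both orders give the same result while the pushed-down version aggregates far fewer rows.

```python
from collections import defaultdict

rows = [(1, 10), (2, 20), (1, 30), (3, 40), (2, 50)]  # (user_id, value)


def group_then_having(rows, wanted):
    """Aggregate every row, then keep only the wanted group."""
    groups, aggregated = defaultdict(int), 0
    for uid, value in rows:
        groups[uid] += value
        aggregated += 1
    return {k: v for k, v in groups.items() if k == wanted}, aggregated


def where_then_group(rows, wanted):
    """Filter on the grouping key first, then aggregate what is left."""
    groups, aggregated = defaultdict(int), 0
    for uid, value in rows:
        if uid != wanted:
            continue
        groups[uid] += value
        aggregated += 1
    return dict(groups), aggregated


having_result, having_work = group_then_having(rows, 1)
where_result, where_work = where_then_group(rows, 1)
print(having_result, having_work, where_result, where_work)
```

A predicate on an aggregate value (for example a sum) could not be moved this way, since it is only known after grouping.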

-- Predicate pushdown also works for subqueries
:) EXPLAIN SYNTAX
:-] SELECT *
:-] FROM (SELECT UserID FROM datasets.visits_v1)
:-] WHERE UserID = '8585742290196126178';

EXPLAIN SYNTAX
SELECT *
FROM
(
    SELECT UserID
    FROM datasets.visits_v1
)
WHERE UserID = '8585742290196126178'

Query id: a70da9af-c8c1-4787-a536-e1794fcef33d

┌─explain──────────────────────────────────┐
│ SELECT UserID                            │
│ FROM                                     │
│ (                                        │
│     SELECT UserID                        │
│     FROM datasets.visits_v1              │
│     WHERE UserID = '8585742290196126178' │
│ )                                        │
│ WHERE UserID = '8585742290196126178'     │
└──────────────────────────────────────────┘

8 rows in set. Elapsed: 0.006 sec.

-- A more complex example
:) EXPLAIN SYNTAX
:-] SELECT *
:-] FROM (SELECT *
:-]       FROM (SELECT UserID FROM datasets.visits_v1)
:-]       UNION ALL
:-]       SELECT *
:-]       FROM (SELECT UserID FROM datasets.visits_v1))
:-] WHERE UserID = '8585742290196126178';

EXPLAIN SYNTAX
SELECT *
FROM
(
    SELECT *
    FROM
    (
        SELECT UserID
        FROM datasets.visits_v1
    )
    UNION ALL
    SELECT *
    FROM
    (
        SELECT UserID
        FROM datasets.visits_v1
    )
)
WHERE UserID = '8585742290196126178'

Query id: 7b7ddf53-51b5-40d9-b079-ac65989479f2

┌─explain──────────────────────────────────────┐
│ SELECT UserID                                │
│ FROM                                         │
│ (                                            │
│     SELECT UserID                            │
│     FROM                                     │
│     (                                        │
│         SELECT UserID                        │
│         FROM datasets.visits_v1              │
│         WHERE UserID = '8585742290196126178' │
│     )                                        │
│     WHERE UserID = '8585742290196126178'     │
│     UNION ALL                                │
│     SELECT UserID                            │
│     FROM                                     │
│     (                                        │
│         SELECT UserID                        │
│         FROM datasets.visits_v1              │
│         WHERE UserID = '8585742290196126178' │
│     )                                        │
│     WHERE UserID = '8585742290196126178'     │
│ )                                            │
│ WHERE UserID = '8585742290196126178'         │
└──────────────────────────────────────────────┘

22 rows in set. Elapsed: 0.011 sec.

Arithmetic inside an aggregate function is moved outside the function, for example:

:) EXPLAIN SYNTAX
:-] SELECT sum(UserID * 2)
:-] FROM datasets.visits_v1;

EXPLAIN SYNTAX
SELECT sum(UserID * 2)
FROM datasets.visits_v1

Query id: 205e2c27-8f38-41c3-8fe0-566ca9e0168d

┌─explain─────────────────┐
│ SELECT sum(UserID) * 2  │
│ FROM datasets.visits_v1 │
└─────────────────────────┘

2 rows in set. Elapsed: 0.004 sec.

If min, max, or any is applied to an aggregation key, that is, a GROUP BY key, the function is removed, for example:

:) EXPLAIN SYNTAX
:-] SELECT sum(UserID * 2), max(VisitID), max(UserID)
:-] FROM datasets.visits_v1
:-] GROUP BY UserID;

EXPLAIN SYNTAX
SELECT
    sum(UserID * 2),
    max(VisitID),
    max(UserID)
FROM datasets.visits_v1
GROUP BY UserID

Query id: f8931130-1bde-4086-a947-4921b76a6ec4

┌─explain─────────────────┐
│ SELECT                  │
│     sum(UserID) * 2,    │
│     max(VisitID),       │
│     UserID              │
│ FROM datasets.visits_v1 │
│ GROUP BY UserID         │
└─────────────────────────┘

6 rows in set. Elapsed: 0.003 sec.

Duplicate ORDER BY keys are removed, for example:

:) EXPLAIN SYNTAX
:-] SELECT UserID, VisitID
:-] FROM datasets.visits_v1
:-] ORDER BY UserID ASC, UserID ASC, VisitID ASC, VisitID ASC;

EXPLAIN SYNTAX
SELECT
    UserID,
    VisitID
FROM datasets.visits_v1
ORDER BY
    UserID ASC,
    UserID ASC,
    VisitID ASC,
    VisitID ASC

Query id: 82b72116-610f-410e-a64a-3607ff784d0b

┌─explain─────────────────┐
│ SELECT                  │
│     UserID,             │
│     VisitID             │
│ FROM datasets.visits_v1 │
│ ORDER BY                │
│     UserID ASC,         │
│     VisitID ASC         │
└─────────────────────────┘

7 rows in set. Elapsed: 0.003 sec.

Note: duplicate GROUP BY keys are not removed.

Duplicate LIMIT BY keys are removed, for example:

:) EXPLAIN SYNTAX
:-] SELECT UserID, VisitID
:-] FROM datasets.visits_v1
:-] LIMIT 3 BY VisitID, VisitID
:-] LIMIT 10;

EXPLAIN SYNTAX
SELECT
    UserID,
    VisitID
FROM datasets.visits_v1
LIMIT 3 BY
    VisitID,
    VisitID
LIMIT 10

Query id: 23883608-f407-4d24-bff1-b1009567d262

┌─explain─────────────────┐
│ SELECT                  │
│     UserID,             │
│     VisitID             │
│ FROM datasets.visits_v1 │
│ LIMIT 3 BY VisitID      │
│ LIMIT 10                │
└─────────────────────────┘

6 rows in set. Elapsed: 0.003 sec.

Duplicate USING keys are removed, for example:

:) EXPLAIN SYNTAX
:-] SELECT a.UserID, a.UserID, b.VisitID, a.URL, b.UserID
:-] FROM datasets.hits_v1 AS a
:-]          LEFT JOIN datasets.visits_v1 AS b USING (UserID, UserID);

EXPLAIN SYNTAX
SELECT
    a.UserID,
    a.UserID,
    b.VisitID,
    a.URL,
    b.UserID
FROM datasets.hits_v1 AS a
LEFT JOIN datasets.visits_v1 AS b USING (UserID, UserID)

Query id: 1f1519f0-fa32-4520-ad6b-c762b310879e

┌─explain──────────────────────────────────────────────┐
│ SELECT                                               │
│     UserID,                                          │
│     UserID,                                          │
│     VisitID,                                         │
│     URL,                                             │
│     b.UserID                                         │
│ FROM datasets.hits_v1 AS a                           │
│ ALL LEFT JOIN datasets.visits_v1 AS b USING (UserID) │
└──────────────────────────────────────────────────────┘

8 rows in set. Elapsed: 0.008 sec.

If a subquery returns only one row, it is replaced with a scalar wherever it is referenced, as with total_disk_usage in the statement below:

:) EXPLAIN SYNTAX
:-] WITH (SELECT sum(bytes) FROM system.parts WHERE active) AS total_disk_usage
:-] SELECT (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage, table
:-] FROM system.parts
:-] GROUP BY table
:-] ORDER BY table_disk_usage DESC
:-] LIMIT 10;

EXPLAIN SYNTAX
WITH (
        SELECT sum(bytes)
        FROM system.parts
        WHERE active
    ) AS total_disk_usage
SELECT
    (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
    table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10

Query id: b1f006ec-8aaf-4c68-8843-41cbe7e9cc5e

┌─explain─────────────────────────────────────────────────────────────────────────┐
│ WITH identity(_CAST(0, 'Nullable(UInt64)')) AS total_disk_usage                 │
│ SELECT                                                                          │
│     (sum(bytes_on_disk AS bytes) / total_disk_usage) * 100 AS table_disk_usage, │
│     table                                                                       │
│ FROM system.parts                                                               │
│ GROUP BY table                                                                  │
│ ORDER BY table_disk_usage DESC                                                  │
│ LIMIT 10                                                                        │
└─────────────────────────────────────────────────────────────────────────────────┘

8 rows in set. Elapsed: 0.004 sec.

If the optimize_if_chain_to_multiif setting is enabled, a chain of nested if() calls (ternary expressions) is replaced with the multiIf function, for example:

:) EXPLAIN SYNTAX
:-] SELECT if(number = 1, 'hello', if(number = 2, 'world', 'atguigu'))
:-] FROM numbers(10) SETTINGS optimize_if_chain_to_multiif = 1;

EXPLAIN SYNTAX
SELECT if(number = 1, 'hello', if(number = 2, 'world', 'atguigu'))
FROM numbers(10)
SETTINGS optimize_if_chain_to_multiif = 1

Query id: 068adf60-9b81-4af8-9683-a880769ac49d

┌─explain─────────────────────────────────────────────────────────────┐
│ SELECT multiIf(number = 1, 'hello', number = 2, 'world', 'atguigu') │
│ FROM numbers(10)                                                    │
│ SETTINGS optimize_if_chain_to_multiif = 1                           │
└─────────────────────────────────────────────────────────────────────┘

3 rows in set. Elapsed: 0.002 sec.
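The rewrite does not change the result, only the shape of the expression. As a small sanity check, the sketch below models both forms in Python; `if_` and `multi_if` are illustrative stand-ins written for this example, not ClickHouse functions.

```python
def if_(cond, then, otherwise):
    """Model of a single if(cond, then, else) call."""
    return then if cond else otherwise


def multi_if(*args):
    """Model of multiIf(cond1, value1, cond2, value2, ..., default)."""
    if len(args) < 3 or len(args) % 2 == 0:
        raise ValueError("expected cond/value pairs followed by a default")
    *pairs, default = args
    for i in range(0, len(pairs), 2):
        if pairs[i]:
            return pairs[i + 1]   # first condition that holds wins
    return default


nested = [if_(n == 1, "hello", if_(n == 2, "world", "atguigu")) for n in range(10)]
flat = [multi_if(n == 1, "hello", n == 2, "world", "atguigu") for n in range(10)]
print(nested == flat, flat[:3])
```

The flat form evaluates the same conditions in the same order, which is why the optimizer can substitute it for the nested chain.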

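4. Query Optimization

PREWHERE serves the same purpose as WHERE: filtering data. The difference is that PREWHERE is only supported by tables of the *MergeTree engine family. It first reads only the columns named in the condition and evaluates the filter. Once the rows have been filtered, it reads the columns listed in SELECT to fill in the remaining attributes.

When a query selects far more columns than it filters on, PREWHERE can improve query performance by as much as ten times. PREWHERE automatically optimizes how data is read during the filtering stage and reduces I/O.

In some cases, a PREWHERE statement processes less data than the equivalent WHERE statement and performs better.

-- Using WHERE. WHERE is automatically optimized into PREWHERE by default, so the automatic conversion has to be turned off here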
SELECT WatchID,
       JavaEnable,
       Title,
       GoodEvent,
       EventTime,
       EventDate,
       CounterID,
       ClientIP,
       ClientIP6,
       RegionID,
       UserID,
       CounterClass,
       OS,
       UserAgent,
       URL,
       Referer,
       URLDomain,
       RefererDomain,
       Refresh,
       IsRobot,
       RefererCategories,
       URLCategories,
       URLRegions,
       RefererRegions,
       ResolutionWidth,
       ResolutionHeight,
       ResolutionDepth,
       FlashMajor,
       FlashMinor,
       FlashMinor2
FROM datasets.hits_v1
WHERE UserID = '3198390223272470366' SETTINGS optimize_move_to_prewhere = 0;

-- Using PREWHERE
SELECT WatchID,
       JavaEnable,
       Title,
       GoodEvent,
       EventTime,
       EventDate,
       CounterID,
       ClientIP,
       ClientIP6,
       RegionID,
       UserID,
       CounterClass,
       OS,
       UserAgent,
       URL,
       Referer,
       URLDomain,
       RefererDomain,
       Refresh,
       IsRobot,
       RefererCategories,
       URLCategories,
       URLRegions,
       RefererRegions,
       ResolutionWidth,
       ResolutionHeight,
       ResolutionDepth,
       FlashMajor,
       FlashMinor,
       FlashMinor2
FROM datasets.hits_v1
PREWHERE UserID = '3198390223272470366';
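The I/O saving described for PREWHERE can be illustrated with a toy columnar table that counts how many cells each strategy reads. This is a simplified row-level model with made-up data; ClickHouse itself works on blocks of rows, and the function names are assumptions for the example.

```python
def scan_where(table, select_cols, filter_col, predicate):
    """Read every needed column for every row, then filter."""
    needed = set(select_cols) | {filter_col}
    n = len(table[filter_col])
    cells_read = n * len(needed)
    rows = [tuple(table[c][i] for c in select_cols)
            for i in range(n) if predicate(table[filter_col][i])]
    return rows, cells_read


def scan_prewhere(table, select_cols, filter_col, predicate):
    """Read the filter column first, then the other columns only for matching rows."""
    n = len(table[filter_col])
    matching = [i for i in range(n) if predicate(table[filter_col][i])]
    other_cols = [c for c in select_cols if c != filter_col]
    cells_read = n + len(matching) * len(other_cols)
    rows = [tuple(table[c][i] for c in select_cols) for i in matching]
    return rows, cells_read


table = {
    "UserID": [i % 100 for i in range(1000)],
    "URL": [f"/page/{i}" for i in range(1000)],
    "Title": [f"title {i}" for i in range(1000)],
    "OS": [i % 5 for i in range(1000)],
}
cols = ["UserID", "URL", "Title", "OS"]
where_rows, where_cells = scan_where(table, cols, "UserID", lambda u: u == 7)
prewhere_rows, prewhere_cells = scan_prewhere(table, cols, "UserID", lambda u: u == 7)
print(where_cells, prewhere_cells)
```

The gap widens as the query selects more columns or the filter becomes more selective, which matches the guidance above.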

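By default there is normally no reason to turn off the automatic WHERE-to-PREWHERE optimization. In some scenarios, however, the query is not converted to PREWHERE even with the optimization enabled, and PREWHERE has to be written by hand:

  • a constant expression is used;
  • a column whose default value is of the alias type is used;
  • the query contains arrayJoin, globalIn, globalNotIn, or indexHint;
  • the columns in the SELECT list are the same as those in the WHERE predicate;
  • a primary key column is used;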

Sampling lets a query run on only part of the data, for example:

-- SAMPLE: sample first, then aggregate
SELECT Title, count(*) AS PageViews
FROM datasets.hits_v1 SAMPLE 0.1
WHERE CounterID = 57
GROUP BY Title
ORDER BY PageViews DESC
LIMIT 1000;

Note: the SAMPLE modifier only takes effect on MergeTree engine tables, and the sampling policy has to be specified when the table is created.
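The idea behind hash-based sampling can be sketched as follows: rows are kept when a stable hash of the sampling key falls below the requested ratio, and counts computed on the sample are scaled back up by the caller. The use of MD5 and the helper names are assumptions for the illustration, not how ClickHouse hashes keys.

```python
import hashlib


def hash_fraction(key):
    """Map a key to a stable pseudo-random number in [0, 1)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32


def sampled_count(user_ids, predicate, ratio):
    """Count matching rows among the sampled keys, then scale by 1 / ratio."""
    hits = sum(1 for u in user_ids if hash_fraction(u) < ratio and predicate(u))
    return hits / ratio


user_ids = range(200_000)
exact = sum(1 for u in user_ids if u % 2 == 0)
estimate = sampled_count(user_ids, lambda u: u % 2 == 0, ratio=0.1)
print(exact, round(estimate))
```

Only about a tenth of the rows are examined. In ClickHouse the scaling step is likewise left to the query author, for example by multiplying count() by 10 when using SAMPLE 0.1.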

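Avoid SELECT * when the data volume is large. Query performance is inversely related to the size and number of the columns queried: the fewer columns a query reads, the less I/O it consumes and the better it performs.

-- Bad example
SELECT * FROM datasets.hits_v1;

-- Good example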
SELECT WatchID,
       JavaEnable,
       Title,
       GoodEvent,
       EventTime,
       EventDate,
       CounterID,
       ClientIP,
       ClientIP6,
       RegionID,
       UserID,
       CounterClass,
       OS,
       UserAgent,
       URL,
       Referer,
       URLDomain,
       RefererDomain,
       Refresh,
       IsRobot,
       RefererCategories,
       URLCategories,
       URLRegions,
       RefererRegions,
       ResolutionWidth,
       ResolutionHeight,
       ResolutionDepth,
       FlashMajor,
       FlashMinor,
       FlashMinor2
FROM datasets.hits_v1;

Filter on the partition column where possible so that only the needed partitions are read, for example:

SELECT WatchID,
       JavaEnable,
       Title,
       GoodEvent,
       EventTime,
       EventDate,
       CounterID,
       ClientIP,
       ClientIP6,
       RegionID,
       UserID
FROM datasets.hits_v1
WHERE EventDate = '2014-03-23';

When running ORDER BY on data sets of more than ten million rows, combine it with a WHERE condition and a LIMIT clause.

-- Good example
SELECT UserID, Age
FROM datasets.hits_v1
WHERE CounterID = 57
ORDER BY Age DESC
LIMIT 1000;

-- Bad example
SELECT UserID, Age
FROM datasets.hits_v1
ORDER BY Age DESC;
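Part of the reason WHERE and LIMIT help is that only the top N of the filtered rows need to be kept, rather than a full sort of everything. The sketch below compares the two approaches on synthetic data; it illustrates the general technique and is not a description of ClickHouse's sort implementation.

```python
import heapq
import random

random.seed(0)  # fixed seed so the example is repeatable
# (CounterID, Age) pairs standing in for table rows
rows = [(random.randint(1, 100), random.randint(0, 100)) for _ in range(100_000)]


def top_n_full_sort(rows, counter_id, n):
    """Sort every matching row, then cut to n."""
    matching = [age for cid, age in rows if cid == counter_id]
    return sorted(matching, reverse=True)[:n]


def top_n_bounded(rows, counter_id, n):
    """Stream the matching rows and keep only the n largest seen so far."""
    matching = (age for cid, age in rows if cid == counter_id)
    return heapq.nlargest(n, matching)


full = top_n_full_sort(rows, 57, 10)
bounded = top_n_bounded(rows, 57, 10)
print(bounded)
```

The bounded version never holds more than N candidates at a time, so its memory use does not grow with the table.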

Avoid building computed (virtual) columns in the result set unless they are necessary, for example:

-- Bad example
SELECT Income, Age, Income / Age AS IncRate
FROM datasets.hits_v1;

-- Good example
SELECT Income, Age
FROM datasets.hits_v1;

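Replacing count(DISTINCT) with uniqCombined can improve performance by more than 10x. uniqCombined is implemented with an algorithm similar to HyperLogLog and performs approximate deduplication, so it can be used directly to speed up distinct counts. count(DISTINCT) uses uniqExact, which deduplicates exactly.

Running a DISTINCT deduplication query over tens of millions of distinct values is not recommended; use the approximate uniqCombined instead.

-- Bad example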
SELECT count(DISTINCT rand())
FROM datasets.hits_v1;

-- Good example
SELECT uniqCombined(rand())
FROM datasets.hits_v1;
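To show why an approximate distinct count is so much cheaper, here is a minimal textbook HyperLogLog sketch. It is not the uniqCombined implementation; the MD5 hash, the 2^12 registers, and the class name are assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1 # position of the first 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:            # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


values = [i % 50_000 for i in range(200_000)]
exact = len(set(values))          # exact dedup keeps every distinct value in memory
hll = HyperLogLog()
for v in values:
    hll.add(v)
approx = hll.estimate()
print(exact, round(approx))
```

The exact count needs a set of 50,000 entries, while the sketch uses a fixed 4096 small registers regardless of how many distinct values arrive, at the price of an error of a few percent.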

See Section 6.

-- Create a small table
CREATE TABLE IF NOT EXISTS datasets.visits_v2
    ENGINE = CollapsingMergeTree(Sign)
        PARTITION BY toYYYYMM(StartDate)
        ORDER BY (CounterID, StartDate, intHash32(UserID), VisitID)
        SAMPLE BY intHash32(UserID)
        SETTINGS index_granularity = 8192 AS
SELECT *
FROM datasets.visits_v1
LIMIT 10000;

-- Create an empty table to hold the join results, so that slow console output is avoided
CREATE TABLE IF NOT EXISTS datasets.hits_v2
    ENGINE = MergeTree()
        PARTITION BY toYYYYMM(EventDate)
        ORDER BY (CounterID, EventDate, intHash32(UserID))
        SAMPLE BY intHash32(UserID)
        SETTINGS index_granularity = 8192 AS
SELECT *
FROM datasets.hits_v1
WHERE 1 = 0;

When a multi-table query only returns data from one of the tables, consider using IN instead of JOIN.

-- Good example
INSERT INTO datasets.hits_v2
SELECT a.*
FROM datasets.hits_v1 a
WHERE a.CounterID IN (SELECT CounterID FROM datasets.visits_v1);

-- Bad example
INSERT INTO datasets.hits_v2
SELECT a.*
FROM datasets.hits_v1 a
         LEFT JOIN datasets.visits_v1 b ON a.CounterID = b.CounterID;

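In a multi-table JOIN, follow the rule of putting the small table on the right. The right table is loaded into memory when the join runs and is compared against the left table. Whether the join is a LEFT JOIN, RIGHT JOIN, or INNER JOIN, ClickHouse always holds the right table in memory and matches its records against the left table, so the right table must be the small one.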
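The small-table-on-the-right rule follows from how a hash join uses memory: one side is held in a hash table and the other is streamed past it. The sketch below is a generic inner hash join on made-up rows; the function and the `b.` column prefix are illustrative choices, not ClickHouse internals.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner join: build a hash table on the right input, stream the left input."""
    table = defaultdict(list)
    for r in right_rows:                      # build side: held in memory
        table[r[key]].append(r)
    joined = []
    for l in left_rows:                       # probe side: processed row by row
        for r in table.get(l[key], ()):
            joined.append({**l, **{f"b.{k}": v for k, v in r.items()}})
    build_rows = sum(len(v) for v in table.values())
    return joined, build_rows


hits = [{"CounterID": i % 5, "UserID": i} for i in range(1000)]    # large left table
visits = [{"CounterID": c, "VisitID": 100 + c} for c in (1, 3)]    # small right table
joined, build_rows = hash_join(hits, visits, "CounterID")
print(len(joined), build_rows)
```

Memory use is driven by the build side (2 rows here), so swapping the inputs would put the 1000-row table in memory instead.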

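ClickHouse does not push predicates down on its own in JOIN queries, so each subquery has to do its filtering in advance. Whether predicates are pushed down makes a large difference to performance. (Newer versions no longer have this problem, but where the predicate is placed still affects performance.)

-- HAVING on the left table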
:) EXPLAIN SYNTAX
:-] SELECT a.UserID, b.VisitID
:-] FROM datasets.hits_v1 a
:-]          LEFT JOIN datasets.visits_v2 b ON a.CounterID = b.CounterID
:-] HAVING a.EventDate = '2014-03-27';

EXPLAIN SYNTAX
SELECT
    a.UserID,
    b.VisitID
FROM datasets.hits_v1 AS a
LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID
HAVING a.EventDate = '2014-03-27'

Query id: decc4c92-43cd-40fa-b741-6430e0f8a566

┌─explain──────────────────────────────────────────────────────────┐
│ SELECT                                                           │
│     UserID,                                                      │
│     VisitID                                                      │
│ FROM datasets.hits_v1 AS a                                       │
│ ALL LEFT JOIN datasets.visits_v2 AS b ON CounterID = b.CounterID │
│ PREWHERE EventDate = '2014-03-27'                                │
└──────────────────────────────────────────────────────────────────┘

6 rows in set. Elapsed: 0.015 sec.

-- HAVING on the right table
:) EXPLAIN SYNTAX
:-] SELECT a.UserID, b.VisitID
:-] FROM datasets.hits_v1 a
:-]          LEFT JOIN datasets.visits_v2 b ON a.CounterID = b.CounterID
:-] HAVING b.StartDate = '2014-03-27';

EXPLAIN SYNTAX
SELECT
    a.UserID,
    b.VisitID
FROM datasets.hits_v1 AS a
LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID
HAVING b.StartDate = '2014-03-27'

Query id: 830fbbe2-b6ef-4036-bc49-6bc13af5b32d

┌─explain──────────────────────────────────────────────────────────┐
│ SELECT                                                           │
│     UserID,                                                      │
│     VisitID                                                      │
│ FROM datasets.hits_v1 AS a                                       │
│ ALL LEFT JOIN datasets.visits_v2 AS b ON CounterID = b.CounterID │
│ WHERE StartDate = '2014-03-27'                                   │
└──────────────────────────────────────────────────────────────────┘

6 rows in set. Elapsed: 0.005 sec.

-- WHERE in a subquery JOIN is not optimized automatically
-- Good example
:) EXPLAIN SYNTAX
:-] SELECT a.UserID, b.VisitID
:-] FROM (SELECT UserID, CounterID FROM datasets.hits_v1 WHERE EventDate = '2014-03-27') AS a
:-]          LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID;

EXPLAIN SYNTAX
SELECT
    a.UserID,
    b.VisitID
FROM
(
    SELECT
        UserID,
        CounterID
    FROM datasets.hits_v1
    WHERE EventDate = '2014-03-27'
) AS a
LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID

Query id: a8642ba3-a680-42aa-832b-8f8c30d4b808

┌─explain──────────────────────────────────────────────────────────┐
│ SELECT                                                           │
│     UserID,                                                      │
│     VisitID                                                      │
│ FROM                                                             │
│ (                                                                │
│     SELECT                                                       │
│         UserID,                                                  │
│         CounterID                                                │
│     FROM datasets.hits_v1                                        │
│     PREWHERE EventDate = '2014-03-27'                            │
│ ) AS a                                                           │
│ ALL LEFT JOIN datasets.visits_v2 AS b ON CounterID = b.CounterID │
└──────────────────────────────────────────────────────────────────┘

12 rows in set. Elapsed: 0.013 sec.

-- Bad example
:) EXPLAIN SYNTAX
:-] SELECT a.UserID, b.VisitID
:-] FROM datasets.hits_v1 AS a
:-]          LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID
:-] WHERE a.EventDate = '2014-03-27';

EXPLAIN SYNTAX
SELECT
    a.UserID,
    b.VisitID
FROM datasets.hits_v1 AS a
LEFT JOIN datasets.visits_v2 AS b ON a.CounterID = b.CounterID
WHERE a.EventDate = '2014-03-27'

Query id: 17a68a1e-19eb-443d-bbf4-66f9442e00d5

┌─explain──────────────────────────────────────────────────────────┐
│ SELECT                                                           │
│     UserID,                                                      │
│     VisitID                                                      │
│ FROM datasets.hits_v1 AS a                                       │
│ ALL LEFT JOIN datasets.visits_v2 AS b ON CounterID = b.CounterID │
│ PREWHERE EventDate = '2014-03-27'                                │
└──────────────────────────────────────────────────────────────────┘

6 rows in set. Elapsed: 0.015 sec.

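The GLOBAL keyword must be added before IN and JOIN when both sides are distributed tables. With GLOBAL, the right-hand table is queried only once, on the node that receives the request, and the result is then distributed to the other nodes. Without GLOBAL, every node issues its own query against the right-hand table, and because that table is itself distributed, it ends up being queried N² times in total (N being the number of shards of the distributed table). This is query amplification, and it is very expensive.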
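The difference can be sketched as follows. This is a minimal illustration, not taken from the original article: hits_all and visits_all are hypothetical Distributed tables assumed to sit on top of the local hits_v1 and visits_v2 tables.

```sql
-- Assumption: hits_all and visits_all are hypothetical Distributed tables.

-- GLOBAL IN: the subquery runs once on the initiating node and its result
-- is shipped to every shard as a temporary table.
SELECT uniq(UserID)
FROM hits_all
WHERE CounterID GLOBAL IN
(
    SELECT CounterID
    FROM visits_all
    WHERE StartDate = '2014-03-27'
);

-- GLOBAL JOIN: the right-hand side is likewise evaluated once and broadcast.
SELECT a.UserID, b.VisitID
FROM hits_all AS a
GLOBAL LEFT JOIN visits_all AS b ON a.CounterID = b.CounterID
WHERE a.EventDate = '2014-03-27';
```

Depending on the distributed_product_mode setting, the server may refuse the same queries written without GLOBAL instead of silently amplifying them.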

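Tables needed for lookup-style association analysis can be turned into dictionaries and used in place of a JOIN, provided the dictionary is not too large, because dictionaries stay resident in memory.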
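A minimal sketch of the dictionary approach, under stated assumptions: datasets.counter_info is a hypothetical lookup table with columns CounterID and CounterName, and the connection parameters in SOURCE are placeholders for a local default installation.

```sql
-- Assumption: datasets.counter_info (CounterID UInt64, CounterName String)
-- is a hypothetical lookup table; host, user, and password are placeholders.
CREATE DICTIONARY IF NOT EXISTS datasets.counter_dict
(
    CounterID   UInt64,
    CounterName String
)
PRIMARY KEY CounterID
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' PASSWORD '' DB 'datasets' TABLE 'counter_info'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- Look the attribute up with dictGet instead of joining the lookup table
SELECT
    CounterID,
    dictGet('datasets.counter_dict', 'CounterName', toUInt64(CounterID)) AS counter_name,
    count() AS hit_count
FROM datasets.hits_v1
GROUP BY CounterID
ORDER BY hit_count DESC
LIMIT 10;
```

The HASHED layout keeps the whole dictionary in memory, which is why the size caveat above matters.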

5. Data Consistency (Key Topic)

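According to the ClickHouse documentation, even the MergeTree family, which has the best support for data consistency, guarantees only eventual consistency.

ReplacingMergeTree differs from MergeTree in that it removes duplicate rows that share the same sorting key value.

Deduplication happens only during merges. Merges run in the background at unpredictable times, so you cannot plan around them, and some data may remain unprocessed. You can trigger an unscheduled merge with the OPTIMIZE statement, but do not rely on it, because OPTIMIZE causes heavy reads and writes.

ReplacingMergeTree is therefore suitable for clearing duplicate data in the background to save space, but it does not guarantee the absence of duplicates.

With table engines such as ReplacingMergeTree and SummingMergeTree, data can be temporarily inconsistent.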

-- Prepare test data: create the table
CREATE TABLE IF NOT EXISTS datasets.test_a
(
    user_id     UInt64 COMMENT 'Key used for deduplication and updates',
    score       String,
    deleted     UInt8    DEFAULT 0 COMMENT 'Custom flag',
    create_time DateTime DEFAULT toDateTime(0) COMMENT 'Version column'
) ENGINE = ReplacingMergeTree(create_time)
      ORDER BY user_id;

-- Write 10 million test rows
INSERT INTO datasets.test_a (user_id, score)
WITH (SELECT ['A', 'B', 'C', 'D', 'E', 'F', 'G']) AS dict
SELECT number AS user_id, dict[number % 7 + 1] AS score
FROM numbers(10000000);

-- Update the first 500,000 rows: change the score and create_time columns
INSERT INTO datasets.test_a (user_id, score, create_time)
WITH (SELECT ['AA', 'BB', 'CC', 'DD', 'EE', 'FF', 'GG']) AS dict
SELECT number AS user_id, dict[number % 7 + 1] AS score, now() AS create_time
FROM numbers(500000);

-- Count all rows (10.5 million)
SELECT count()
FROM datasets.test_a;

Right after writing the data, run OPTIMIZE to force a merge of the newly written parts.

OPTIMIZE TABLE datasets.test_a FINAL;

-- Syntax
OPTIMIZE TABLE [db.]name [ON CLUSTER cluster] [PARTITION partition | PARTITION ID 'partition_id'] [FINAL] [DEDUPLICATE [BY expression]]
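
One way to see what the merge did is to look at the active data parts before and after OPTIMIZE. A minimal sketch, assuming the datasets.test_a table created above; system.parts is a built-in system table, and the part names will differ from server to server.

```sql
-- Active parts of the test table; run once before and once after OPTIMIZE
SELECT partition, name, rows
FROM system.parts
WHERE database = 'datasets'
  AND table = 'test_a'
  AND active
ORDER BY name;

-- Rows physically stored right now; duplicates stay until parts are merged
SELECT sum(rows) AS stored_rows
FROM system.parts
WHERE database = 'datasets'
  AND table = 'test_a'
  AND active;
```

If the forced merge succeeded, the part list should shrink and stored_rows should drop toward one row per user_id.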

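Add the FINAL modifier to the query (after the table name), so that the special merge logic (such as deduplication or pre-aggregation) is applied while the query runs.

In early versions almost nobody used this approach: once FINAL was added, the query ran as a single-threaded process and was very slow.

Starting with version 20.5.2.7, FINAL queries can run on multiple threads, and the max_final_threads setting controls the number of threads for a single query. Reading the data parts, however, is still serial.

The final performance of a FINAL query depends on many factors, such as column sizes and the number of partitions, so weigh the trade-off against the actual scenario.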
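
Applied to the test table from this section, a FINAL query could look like the sketch below. It assumes datasets.test_a as created above; the thread count of 2 is an arbitrary illustrative value.

```sql
-- Deduplicated count: FINAL keeps only the latest version per user_id
SELECT count()
FROM datasets.test_a FINAL;

-- Same query with an explicit cap on the threads used for FINAL
SELECT count()
FROM datasets.test_a FINAL
SETTINGS max_final_threads = 2;

-- Spot-check a few keys: each user_id should appear once,
-- carrying the row with the newest create_time
SELECT user_id, score, create_time
FROM datasets.test_a FINAL
WHERE user_id < 5
ORDER BY user_id;
```

Comparing the first count with a plain SELECT count() on the same table shows how many duplicate rows have not been merged yet.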

-- Pipeline of an ordinary query
:) EXPLAIN PIPELINE
:-] SELECT *
:-] FROM datasets.visits_v1
:-] WHERE StartDate = '2014-03-17'
:-] LIMIT 100;

EXPLAIN PIPELINE
SELECT *
FROM datasets.visits_v1
WHERE StartDate = '2014-03-17'
LIMIT 100

Query id: 730be1e8-f678-4e30-9de2-3dc3d5c7f228

Connecting to database datasets at localhost:9000 as user default.

Connected to ClickHouse server version 21.9.2 revision 54449.

┌─explain─────────────────────────┐
│ (Expression)                    │
│ ExpressionTransform × 4         │
│   (SettingQuotaAndLimits)       │
│     (Limit)                     │
│     Limit 4 → 4                 │
│       (ReadFromMergeTree)       │
│       MergeTreeThread × 4 0 → 1 │
└─────────────────────────────────┘

7 rows in set. Elapsed: 0.011 sec.

-- Pipeline of a FINAL query
:) EXPLAIN PIPELINE
:-] SELECT *
:-] FROM datasets.visits_v1 FINAL
:-] WHERE StartDate = '2014-03-17'
:-] LIMIT 100;

EXPLAIN PIPELINE
SELECT *
FROM datasets.visits_v1
FINAL
WHERE StartDate = '2014-03-17'
LIMIT 100

Query id: a9bd1e44-23bd-45a3-ade7-4c1534c85f38

┌─explain──────────────────────────────────┐
│ (Expression)                             │
│ ExpressionTransform × 4                  │
│   (Limit)                                │
│   Limit 4 → 4                            │
│     (Filter)                             │
│     FilterTransform × 4                  │
│       (SettingQuotaAndLimits)            │
│         (ReadFromMergeTree)              │
│         ExpressionTransform × 4          │
│           CollapsingSortedTransform × 4  │
│             Copy 1 → 4                   │
│               AddingSelector             │
│                 ExpressionTransform      │
│                   MergeTreeInOrder 0 → 1 │
└──────────────────────────────────────────┘

14 rows in set. Elapsed: 0.022 sec.

From the CollapsingSortedTransform step onward, execution is already multi-threaded, but the read step is still serial.

6. Materialized Views

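A materialized view in ClickHouse persists the result of a query, and it does improve query performance. To the user it looks no different from a table: it is a table, one that behaves as if it were being pre-computed all the time. It is created with a specific table engine followed by AS SELECT, that is, in the style of CREATE TABLE ... AS SELECT.

The "query result set" is a broad notion: it can be a plain copy of part of the base table, the result (or a subset) of a multi-table JOIN, aggregated metrics over the raw data, and so on. Because the result is stored physically, it does not automatically follow every change to the base table (in ClickHouse it is fed only by newly inserted data), which is why it is also called a snapshot.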

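Advantage: queries are fast.

Disadvantages: a materialized view is essentially designed for streaming data and works by accumulation, so analyses that must deduplicate or reconcile historical data are awkward to do in a materialized view. Its usefulness is also limited in some scenarios: if a table has many materialized views attached, writing to that table consumes a lot of machine resources, for example saturated bandwidth and increased storage.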

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db.]table_name [ON CLUSTER] [TO [db.]name] [ENGINE = engine] [POPULATE] AS SELECT ...

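Restrictions on creating a materialized view:

  • Without a TO [db.]name clause, an ENGINE must be specified to store the data.
  • With a TO [db.]name clause, POPULATE cannot be used.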

-- Create the table
CREATE TABLE IF NOT EXISTS datasets.hits_test
(
    EventDate Date,
    CounterID UInt32,
    UserID    UInt64,
    URL       String,
    Income    UInt8
) ENGINE = MergeTree()
      PARTITION BY toYYYYMM(EventDate)
      ORDER BY (CounterID, EventDate, intHash32(UserID))
      SAMPLE BY intHash32(UserID)
      SETTINGS index_granularity = 8192;

-- Load data
INSERT INTO datasets.hits_test
SELECT EventDate, CounterID, UserID, URL, Income
FROM datasets.hits_v1
LIMIT 10000;

-- Create the materialized view
CREATE MATERIALIZED VIEW datasets.hits_mv
            ENGINE = SummingMergeTree()
                PARTITION BY toYYYYMM(EventDate)
                ORDER BY (EventDate, intHash32(UserID))
AS
SELECT UserID, EventDate, count(URL) AS ClickCount, sum(Income) AS IncomeSum
FROM datasets.hits_test
WHERE EventDate >= '2014-03-20'
GROUP BY UserID, EventDate;

-- Load incremental data
INSERT INTO datasets.hits_test
SELECT EventDate, CounterID, UserID, URL, Income
FROM datasets.hits_v1
WHERE EventDate >= '2014-03-23'
LIMIT 10;

-- Query the materialized view
SELECT *
FROM datasets.hits_mv;

-- Backfill historical data
INSERT INTO datasets.hits_mv
SELECT UserID, EventDate, count(URL) AS ClickCount, sum(Income) AS IncomeSum
FROM datasets.hits_test
WHERE EventDate = '2014-03-20'
GROUP BY UserID, EventDate;

-- Query the materialized view
SELECT *
FROM datasets.hits_mv;
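
Because SummingMergeTree collapses rows only when parts are merged, a plain SELECT * on the view can still return several partial rows for the same key. A minimal sketch of a query that aggregates at read time, assuming the datasets.hits_mv view defined above; the aliases TotalClicks and TotalIncome are illustrative names.

```sql
-- Sum the partial rows per key so the result does not depend on merge timing
SELECT
    UserID,
    EventDate,
    sum(ClickCount) AS TotalClicks,
    sum(IncomeSum)  AS TotalIncome
FROM datasets.hits_mv
GROUP BY UserID, EventDate
ORDER BY EventDate, UserID;
```

The aliases deliberately differ from the column names, which avoids the alias being resolved inside the aggregate function.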

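7. MaterializeMySQL Engine

MySQL has a very large user base. To make data available in near real time, many solutions use the Binlog to write data into ClickHouse. Listening to Binlog events requires third-party middleware such as Canal, which undoubtedly adds complexity to the system.

ClickHouse 20.8.2.3 added the MaterializeMySQL database engine. A database of this type maps to a database in MySQL and automatically creates the corresponding ReplacingMergeTree tables in ClickHouse. The ClickHouse server acts as a MySQL replica: it reads the Binlog and executes DDL and DML requests, which provides real-time synchronization of the business database based on the MySQL Binlog mechanism.

The _version column serves as the version parameter of ReplacingMergeTree and is incremented globally within the database each time an INSERT, UPDATE, or DELETE event is observed. The _sign column marks whether a row has been deleted and takes the value 1 or -1.

MaterializeMySQL currently supports the following Binlog events: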

  • MYSQL_WRITE_ROWS_EVENT: _sign = 1, _version++;
  • MYSQL_DELETE_ROWS_EVENT: _sign = -1, _version++;
  • MYSQL_UPDATE_ROWS_EVENT: the new row gets _sign = 1;
  • MYSQL_QUERY_EVENT: supports CREATE TABLE, DROP TABLE, RENAME TABLE, and so on.

-- MySQL side: create the test tables and seed data
CREATE TABLE IF NOT EXISTS teskck.t_organization
(
    id         INT(11) NOT NULL AUTO_INCREMENT,
    code       INT     NOT NULL,
    name       TEXT     DEFAULT NULL,
    updatetime DATETIME DEFAULT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY (code)
) ENGINE = InnoDB
  DEFAULT CHARSET = utf8mb4;

INSERT INTO teskck.t_organization (code, name, updatetime)
    VALUE (1000, 'Realinsight', NOW());
INSERT INTO teskck.t_organization (code, name, updatetime)
    VALUE (1001, 'Realindex', NOW());
INSERT INTO teskck.t_organization (code, name, updatetime)
    VALUE (1002, 'EDT', NOW());

CREATE TABLE IF NOT EXISTS teskck.t_user
(
    id   INT(11) NOT NULL AUTO_INCREMENT,
    code INT,
    PRIMARY KEY (id)
) ENGINE = InnoDB
  DEFAULT CHARSET = utf8mb4;

INSERT INTO teskck.t_user (code)
VALUES (1);

-- ClickHouse side: inspect the MySQL-related settings
SHOW SETTINGS ILIKE '%mysql%';

SET allow_experimental_database_materialized_mysql = 1;
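
With the experimental setting enabled, the replicated database can be created on the ClickHouse side. This is a sketch under stated assumptions: test_binlog is an illustrative database name, and the host, user, and password are placeholders. The engine is written MaterializeMySQL as in this article; newer ClickHouse releases spell it MaterializedMySQL.

```sql
-- Assumption: 'mysql-host:3306', 'repl_user', and 'repl_password' are
-- placeholders; 'teskck' is the MySQL database used in the statements above.
CREATE DATABASE IF NOT EXISTS test_binlog
ENGINE = MaterializeMySQL('mysql-host:3306', 'teskck', 'repl_user', 'repl_password');

-- The MySQL tables should appear as ReplacingMergeTree tables
SHOW TABLES FROM test_binlog;

-- Regular query: returns the current state of the rows
SELECT *
FROM test_binlog.t_organization;

-- Expose the virtual columns to see versions and deletion markers
SELECT *, _sign, _version
FROM test_binlog.t_organization
ORDER BY _sign DESC, _version DESC;
```

After an UPDATE or DELETE on the MySQL side, rerunning the last query shows how _sign and _version change for the affected rows.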

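8. Frequently Asked Questions

Problem: after running a distributed DDL command such as create table on cluster xxx, the table is missing on one node although the client returned success. The log shows the following error: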

<Error> xxx.xxx: Retrying createReplica(), because some other replicas were created at the same time.

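Solution: restart the node that failed to execute the DDL.

Problem: a replica on one data node failed, so the two replicas no longer have the same set of tables. One replica is missing a table, and the two need to be brought back in line.

Solution: on the node that lacks the table, create it as a local table. The table definition can be obtained from another replica with show create table xxx. Once the table exists, ClickHouse synchronizes its data from the other replica automatically; then verify that the row counts match.

Problem: a data replica is broken and cannot start, so the replica has to be rebuilt.

Solution:

Problem: a replicated table's data in zookeeper is lost or missing while its local metadata still exists, so startup fails with: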
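
The check described above can be sketched like this. db_name and table_name are placeholders; system.replicas is a built-in system table that covers Replicated tables only.

```sql
-- On a healthy replica: fetch the definition to re-create on the other node
SHOW CREATE TABLE db_name.table_name;

-- On the repaired replica, after creating the table: watch replication catch up
SELECT database, table, is_readonly, queue_size, absolute_delay
FROM system.replicas
WHERE database = 'db_name'
  AND table = 'table_name';

-- On both replicas: compare row counts once the queue is empty
SELECT count()
FROM db_name.table_name;
```

A queue_size that falls to 0 together with matching counts on both replicas suggests the synchronization has finished.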

Can't get data for node /clickhouse/tables/01-02/xxx/xxx/replicas/xxx/metadata: node doesn't exist (No node): Cannot attach table xxx.

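Solution:

Problem: while rebuilding a table, drop table xxx on cluster xxx was used first to delete the ClickHouse data, but the table meta information for one ClickHouse node was not removed from zookeeper (a low-probability event). Because that meta information still exists in zookeeper, creating the table again with create table xxx on cluster xxx fails with: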

Replica /clickhouse/tables/01-03/xxx/xxx/replicas/xxx already exists.

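Solution: copy the table's metadata from another replica node to this node, then restart the node.

Problem: simulate an unexpected node crash by shutting down one node while a large volume of INSERTs is running.

Symptom: writes and queries are unaffected, but a CREATE TABLE DDL hangs when it reaches the failed node and reports:

Code: 159. DB::Exception: Received from localhost:9000. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000565925 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 1 unfinished hosts (0 of them are currently active), they are going to execute the query in background.

Solution: restart the failed node. Data written to the other replicas in the meantime is synchronized automatically, and so are the tables created by DDL on the other replicas.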
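
Before and after restarting the node, the cluster state can be inspected from any healthy node. A minimal sketch: my_cluster is a placeholder for the cluster name used in ON CLUSTER, and the timeout of 600 seconds is an arbitrary illustrative value.

```sql
-- Per-host view of the cluster; a rising errors_count hints at an unreachable host
SELECT cluster, shard_num, replica_num, host_name, errors_count, estimated_recovery_time
FROM system.clusters
WHERE cluster = 'my_cluster';

-- Optionally let ON CLUSTER statements wait longer in the current session
SET distributed_ddl_task_timeout = 600;
```

Raising the timeout only changes how long the client waits; as the error message says, hosts that are down still execute the queued DDL in the background once they return.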

Original: https://www.cnblogs.com/xiaoQQya/p/16313680.html
Author: Hit不死的小强
Title: ClickHouse高级
