ClickHouse 物化檢視學習總結

授客發表於2024-12-09

物化檢視

物化檢視源表--基礎資料來源

建立源表,因為我們的目標涉及報告聚合資料而不是單條記錄,所以我們可以解析它,將資訊傳遞給物化檢視,並丟棄實際傳入的資料。這符合我們的目標並節省了儲存空間,因此我們將使用Null表引擎。

CREATE DATABASE IF NOT EXISTS analytics;

CREATE TABLE analytics.hourly_data
(
    `domain_name` String,
    `event_time` DateTime,
    `count_views` UInt64
)
ENGINE = Null;

注意:可以在Null表上建立物化檢視。因此,寫入表的資料最終會影響檢視,但原始原始資料仍將被丟棄

月度彙總表和物化檢視

對於第一個物化檢視,需要建立 Target 表(本例子中為analytics.monthly_aggregated_data),例中將按月份和域名儲存檢視的總和。

CREATE TABLE analytics.monthly_aggregated_data
(
    `domain_name` String,
    `month` Date,
    `sumCountViews` AggregateFunction(sum, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (domain_name, month);

將轉發Target表上資料的物化檢視如下:

CREATE MATERIALIZED VIEW analytics.monthly_aggregated_data_mv
TO analytics.monthly_aggregated_data
AS
SELECT
    toDate(toStartOfMonth(event_time)) AS month,
    domain_name,
    sumState(count_views) AS sumCountViews
FROM analytics.hourly_data
GROUP BY domain_name, month;

年度彙總表和物化檢視

現在,建立第二個物化檢視,該檢視將連結到之前的目標表monthly_aggregated_data
首先,建立一個新的目標表,該表將儲存每個域名每年彙總的檢視總和。

CREATE TABLE analytics.year_aggregated_data
(
    `domain_name` String,
    `year` UInt16,
    `sumCountViews` UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (domain_name, year);

然後建立物化檢視,此步驟定義級聯。FROM 語句將使用monthly_aggregated_data表,這意味著資料流將是:
1.資料到達hourly_data表。
2.ClickHouse會將收到的資料轉發到第一個物化檢視monthly_aggregated_data
3.最後,步驟2中接收到的資料將被轉發到 year_aggregated_data

CREATE MATERIALIZED VIEW analytics.year_aggregated_data_mv
TO analytics.year_aggregated_data
AS
SELECT
    toYear(toStartOfYear(month)) AS year,
    domain_name,
    sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, year;

注意:

在使用物化檢視時,一個常見的誤解是資料是從表中讀取的,這不是Materialized views的工作方式;轉發的資料是插入的資料塊,而不是表中的最終結果。

想象一下,在這個例子中,monthly_aggregated_data中使用的引擎是一個摺疊合併樹(CollapsingMergeTree),轉發到第二個物化檢視year_aggregated_data_mv 的資料將不是摺疊表的最終結果,它將轉發具有正如SELECT… GROUP BY中定義的欄位的資料塊。

如果末正在使用CollapsingMergeTreeReplacingMergeTree,甚至SummingMergeTree,並且計劃建立級聯物化檢視,則需要了解此處描述的限制。

採集資料

現在是時候透過插入一些資料來測試我們的級聯物化檢視了:

INSERT INTO analytics.hourly_data (domain_name, event_time, count_views)
VALUES ('clickhouse.com', '2019-01-01 10:00:00', 1),
       ('clickhouse.com', '2019-02-02 00:00:00', 2),
       ('clickhouse.com', '2019-02-01 00:00:00', 3),
       ('clickhouse.com', '2020-01-01 00:00:00', 6);

查詢analytics.hourly_data的內容,將查不到任何記錄,因為表引擎為Null,但資料已被處理

 SELECT * FROM analytics.hourly_data

輸出:

domain_name|event_time|count_views|
-----------+----------+-----------+

結果

如果嘗試查詢目標表的sumCountViews欄位值,將看到欄位值以二進位制表示(在某些終端中),因為該值不是以數字的形式儲存,而是以AggregateFunction型別儲存的。要獲得聚合的最終結果,應該使用-Merge字尾。

透過以下查詢,sumCountViews欄位值無法正常顯示:

SELECT sumCountViews FROM analytics.monthly_aggregated_data

輸出:

sumCountViews|
-------------+
             |
             |
             |

使用 Merge字尾獲取 sumCountViews 值:

SELECT sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data;

輸出:

sumCountViews|
-------------+
           12|

AggregatingMergeTree 中將AggregateFunction 定義為sum,因此可以使用sumMerge。當在AggregateFunction上使用函式avg時,則將使用avgMerge,以此類推。

SELECT month, domain_name, sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, month

輸出:

month     |domain_name   |sumCountViews|
----------+--------------+-------------+
2020-01-01|clickhouse.com|            6|
2019-01-01|clickhouse.com|            1|
2019-02-01|clickhouse.com|            5|

現在我們可以檢視物化檢視是否符合我們定義的目標。

現在已經將資料儲存在目標表monthly_aggregated_data中,可以按月聚合每個域名的資料:

SELECT month, domain_name, sumMerge(sumCountViews) as sumCountViews
FROM analytics.monthly_aggregated_data
GROUP BY domain_name, month;

輸出:

month     |domain_name   |sumCountViews|
----------+--------------+-------------+
2020-01-01|clickhouse.com|            6|
2019-01-01|clickhouse.com|            1|
2019-02-01|clickhouse.com|            5|

按年聚合每個域名的資料:

SELECT year, domain_name, sum(sumCountViews)
FROM analytics.year_aggregated_data
GROUP BY domain_name, year;

輸出:

year|domain_name   |sum(sumCountViews)|
----+--------------+------------------+
2019|clickhouse.com|                 6|
2020|clickhouse.com|                 6|

組合多個源表來建立單個目標表

物化檢視還可以用於將多個源表組合以到一個目標表中。這對於建立類似於 UNION ALL邏輯的物化檢視非常有用。

首先,建立兩個代表不同指標集的源表:

CREATE TABLE analytics.impressions
(
    `event_time` DateTime,
    `domain_name` String
) ENGINE = MergeTree ORDER BY (domain_name, event_time);

CREATE TABLE analytics.clicks
(
    `event_time` DateTime,
    `domain_name` String
) ENGINE = MergeTree ORDER BY (domain_name, event_time);

然後使用組合的指標集建立 Target表:

CREATE TABLE analytics.daily_overview
(
    `on_date` Date,
    `domain_name` String,
    `impressions` SimpleAggregateFunction(sum, UInt64),
    `clicks` SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree ORDER BY (on_date, domain_name);

建立兩個指向同一Target表的物化檢視。不需要顯式地包含缺少的列:

CREATE MATERIALIZED VIEW analytics.daily_impressions_mv
TO analytics.daily_overview
AS                                                
SELECT
    toDate(event_time) AS on_date,
    domain_name,
    count() AS impressions,
    0 clicks   --<<<--- 如果去掉該列,則預設為 clicks為0
FROM                                              
    analytics.impressions
GROUP BY toDate(event_time) AS on_date, domain_name;

CREATE MATERIALIZED VIEW analytics.daily_clicks_mv
TO analytics.daily_overview
AS
SELECT
    toDate(event_time) AS on_date,
    domain_name,
    count() AS clicks,
    0 impressions    --<<<---如果去掉該列,則預設為 impressions 為0
FROM
    analytics.clicks
GROUP BY toDate(event_time) AS on_date, domain_name;

現在,當插入值時,這些值將被聚合到Target表中的相應列中:

INSERT INTO analytics.impressions (domain_name, event_time)
VALUES ('clickhouse.com', '2019-01-01 00:00:00'),
       ('clickhouse.com', '2019-01-01 12:00:00'),
       ('clickhouse.com', '2019-02-01 00:00:00'),
       ('clickhouse.com', '2019-03-01 00:00:00')
;

INSERT INTO analytics.clicks (domain_name, event_time)
VALUES ('clickhouse.com', '2019-01-01 00:00:00'),
       ('clickhouse.com', '2019-01-01 12:00:00'),
       ('clickhouse.com', '2019-03-01 00:00:00')
;

查詢目標表 the Target table:

SELECT
    on_date,
    domain_name,
    sum(impressions) AS impressions,
    sum(clicks) AS clicks
FROM
    analytics.daily_overview
GROUP BY
    on_date,
    domain_name
;

輸出:

on_date   |domain_name   |impressions|clicks|
----------+--------------+-----------+------+
2019-01-01|clickhouse.com|          2|     2|
2019-03-01|clickhouse.com|          1|     1|
2019-02-01|clickhouse.com|          1|     0|

參考連結

https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views

AggregateFunction

聚合函式有一個實現定義的中間狀態,可以序列化為AggregateFunction(...)資料型別,並通常透過物化檢視儲存在表中。生成聚合函式狀態的常見方法是使用State字尾呼叫聚合函式。為了以後能獲得聚合的最終結果,必須使用帶有-Merge字尾的相同聚合函式。

AggregateFunction(name, types_of_arguments...) — 引數資料型別。

引數說明:

  • 聚合函式名稱。如果名稱對應的聚合函式鞋帶引數,則還需要為其它指定引數。
  • 聚合函式引數型別。

示例

CREATE TABLE testdb.aggregated_test_tb
(   
    `__name__` String, 
    `count` AggregateFunction(count),
    `avg_val` AggregateFunction(avg, Float64),
    `max_val` AggregateFunction(max, Float64),
    `time_max` AggregateFunction(argMax, DateTime, Float64),
    `mid_val` AggregateFunction(quantiles(0.5, 0.9), Float64) 
) ENGINE = AggregatingMergeTree() 
ORDER BY (__name__);

備註:如果上述SQL未新增ORDER BY (__name__, create_time),執行會報類似如下錯誤:

SQL 錯誤 [42]: ClickHouse exception, code: 42, host: 192.168.88.131, port: 8123; Code: 42, e.displayText() = DB::Exception: Storage AggregatingMergeTree requires 3 to 4 parameters: 
name of column with date,
[sampling element of primary key],
primary key expression,
index granularity

建立資料來源表並插入測試資料

CREATE TABLE testdb.test_tb 
(
    `__name__` String, 
    `create_time` DateTime, 
    `val` Float64
) ENGINE = MergeTree() 
PARTITION BY toStartOfWeek(create_time) 
ORDER BY (__name__, create_time);

INSERT INTO testdb.test_tb(`__name__`, `create_time`, `val`) VALUES
('xiaoxiao', now(), 80.5),
('xiaolin', addSeconds(now(), 10), 89.5),
('xiaohong', addSeconds(now(), 20), 90.5),
('lisi', addSeconds(now(), 30), 79.5),
('zhangshang', addSeconds(now(), 40), 60),
('wangwu', addSeconds(now(), 50), 65);

插入資料

使用以State字尾的聚合函式的INSERT SELECT 以插入資料--比如希望獲取目標列資料均值,即avg(target_column),那麼插入資料時使用的聚合函式為avgState*State聚合函式返回狀態(state),而不是最終值。換句話說,返回一個 AggregateFunction 型別的值。

INSERT INTO testdb.aggregated_test_tb (`__name__`, `count`, `avg_val`, `max_val`, `time_max`, `mid_val`)
SELECT `__name__`,
countState() AS count,
avgState(val) AS avg_val, 
maxState(val) AS max_val,
argMaxState(create_time, val) AS time_max,
quantilesState(0.5, 0.9)(val) AS `mid_val`
FROM testdb.test_tb
GROUP BY `__name__`, toStartOfMinute(create_time);

注意:SELECT語句中的欄位,要麼使用聚合函式呼叫(比如上述val欄位),要麼保持原欄位不變(比如上述__name__欄位),保持原欄位不變時,該欄位必須包含於GROUP BY子句中,否則會報類似如下錯誤:

SQL 錯誤 [215]: ClickHouse exception, code: 215, host: 192.168.88.131, port: 8123; Code: 215, e.displayText() = DB::Exception: Column `__name__` is not under aggregate function and not in GROUP BY (version 20.3.5.21 (official build))

查詢資料

AggregatingMergeTree表中查詢資料時,使用GROUP BY子句和與插入資料時相同的聚合函式,但使用Merge字尾,比如插入資料時使用的聚合函式為avgState,那麼查詢時使用的聚合函式為avgMerge

字尾為Merge的聚合函式接受一組狀態,將它們組合在一起,並返回完整資料聚合的結果。

例如,以下兩個查詢返回相同的結果

SELECT `__name__`, 
create_time,
avgMerge(avg_val) AS avg_val, 
maxMerge(max_val) AS max_val
FROM ( 
SELECT `__name__`, 
toStartOfMinute(create_time) AS create_time,
avgState(val) AS avg_val, 
maxState(val) AS max_val
FROM testdb.test_tb
GROUP BY `__name__`, create_time
)
GROUP BY `__name__`, create_time;

SELECT `__name__`, 
toStartOfMinute(create_time) AS create_time,
avg(val) AS avg_val, 
max(val) AS max_val
FROM testdb.test_tb
GROUP BY `__name__`, create_time;

例子:

SELECT `__name__`, 
countMerge(`count`), 
avgMerge(`avg_val`), 
maxMerge(`max_val`),
argMaxMerge(`time_max`),
quantilesMerge(0.5, 0.9)(`mid_val`)
FROM testdb.aggregated_test_tb
GROUP BY `__name__`;

參考連結

https://clickhouse.com/docs/en/sql-reference/data-types/aggregatefunction

AggregatingMergeTree

引擎繼承自MergeTree,更改了資料塊合併的邏輯。ClickHouse使用一條儲存了聚合函式狀態組合的單條記錄(在一個資料塊中)替換帶有相同主鍵(或更準確地說,用相同的排序鍵)的所有行

說明:資料塊是指ClickHouse儲存資料的基本單位

可以使用 AggregatingMergeTree 表進行增量資料聚合,包括聚合物化檢視。

引擎處理以下型別的所有列:

  • AggregateFunction

  • SimpleAggregateFunction

    如果能減少有序行數,則使用AggregatingMergeTree是合適的

建表

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = AggregatingMergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[SAMPLE BY expr]
[TTL expr]
[SETTINGS name=value, ...]

有關請求引數的描述,參閱請求描述

查詢語句

建立AggregatingMergeTree表與建立MergeTree表的子句相同。

查詢和插入

要插入資料,使用INSERT SELECT使用aggregateState函式進行查詢。從AggregatingMergeTree表中查詢資料時,使用GROUP BY子句和與插入資料時相同的聚合函式,但使用Merge字尾。

SELECT查詢的結果中,AggregateFunction型別的值對所有ClickHouse輸出格式都有特定於實現的二進位制表示。例如,如果你可以使用SELECT查詢將資料轉儲為TabSeparated格式,則可以使用INSERT查詢將此轉儲重新載入。

一個物化檢視示例

CREATE DATABASE testdb;

建立存放原始資料的testdb.visits表:

CREATE TABLE testdb.visits
(
    StartDate DateTime64, 
    CounterID UInt64,
    Sign Nullable(Int32),
    UserID Nullable(Int32)
) ENGINE = MergeTree 
ORDER BY (StartDate, CounterID);

說明:上述StartDate DateTime64, 如果寫成StartDate DateTime64 NOT NULL, 執行會報錯,如下:

Expected one of: CODEC, ALIAS, TTL, ClosingRoundBracket, Comma, DEFAULT, MATERIALIZED, COMMENT, token (version 20.3.5.21 (official build))

接下來,建立一個AggregatingMergeTree表,該表將儲存AggregationFunction,用於跟蹤訪問總數和唯一使用者數。

建立一個AggregatingMergeTree 物化檢視,用於監視testdb.revisits表,並使用AggregateFunction 型別:

CREATE TABLE testdb.agg_visits (
    StartDate DateTime64,
    CounterID UInt64,
    Visits AggregateFunction(sum, Nullable(Int32)),
    Users AggregateFunction(uniq, Nullable(Int32))
)
ENGINE = AggregatingMergeTree() ORDER BY (StartDate, CounterID);
SQL 錯誤 [70]: ClickHouse exception, code: 70, host: 192.168.88.131, port: 8123; Code: 70, e.displayText() = DB::Exception: Conversion from AggregateFunction(sum, Int32) to AggregateFunction(sum, Nullable(Int32)) is not supported: while converting source column Visits to destination column Visits: while pushing to view testdb.visits_mv (version 20.3.5.21 (official build))

CREATE TABLE testdb.agg_visits (
    StartDate DateTime64,
    CounterID UInt64,
    Visits AggregateFunction(sum, Int32),
    Users AggregateFunction(uniq, Int32)
)
ENGINE = AggregatingMergeTree() ORDER BY (StartDate, CounterID);

建立一個物化檢視,從testdb.revisits填充testdb.agg_visits

CREATE MATERIALIZED VIEW testdb.visits_mv TO testdb.agg_visits
AS SELECT
    StartDate,
    CounterID,
    sumState(Sign) AS Visits,
    uniqState(UserID) AS Users
FROM testdb.visits
GROUP BY StartDate, CounterID;

插入資料到 testdb.visits 表:

INSERT INTO testdb.visits (StartDate, CounterID, Sign, UserID)
 VALUES (1667446031000, 1, 3, 4), (1667446031000, 1, 6, 3);

資料被同時插入到testdb.revisitstestdb.agg_visits中。

執行諸如 SELECT ... GROUP BY ...的語句查詢物化檢視test.mv_visits以獲取聚合資料

SELECT
    StartDate,
    sumMerge(Visits) AS Visits,
    uniqMerge(Users) AS Users
FROM testdb.agg_visits
GROUP BY StartDate
ORDER BY StartDate;

輸出:

StartDate          |Visits|Users|
-------------------+------+-----+
2022-11-03 11:27:11|     9|    2|

testdb.revisits中新增另外2條記錄,但這次嘗試對其中一條記錄使用不同的時間戳:

INSERT INTO testdb.visits (StartDate, CounterID, Sign, UserID)
 VALUES (1669446031000, 2, 5, 10), (1667446031000, 3, 7, 5);

再次查詢,輸出如下:

StartDate          |Visits|Users|
-------------------+------+-----+
2022-11-03 11:27:11|    16|    3|
2022-11-26 15:00:31|     5|    1|

參考連結

https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree

相關文章