ClickHouse Source Code Notes 2: How the Aggregation Flow Is Implemented

Posted by HappenLee on 2020-07-17

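The previous note covered how aggregate functions are implemented, how they are registered in ClickHouse, and how they get called. This note picks up where that one left off and walks through the overall implementation of the aggregation flow in ClickHouse.
So, for this second article, let's look at the aggregation flow together. All aboard!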

1. Reviewing the basics

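Core interfaces in ClickHouse
  • The Block class
    As mentioned in the previous note, ClickHouse is a column-oriented database, and in memory it represents data through the IColumn interface. A Block is a collection of such columns: each Block holds a set of columns, and a sequence of Blocks makes up what we usually think of as a table.
    During query execution, the smallest unit of data that ClickHouse processes is the Block. As the code below shows, a Block consists of a set of columns plus a map from each column name to the column's position.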
class Block
{
private:
    using Container = ColumnsWithTypeAndName;
    using IndexByName = std::map<String, size_t>;

    Container data;
    IndexByName index_by_name;

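This is a very important class, and its implementation is not complicated. Block sits at the core of ClickHouse, and everything that follows is built on top of it.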
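To make the "columns plus name-to-position map" layout concrete, here is a minimal sketch. MiniColumn, MiniBlock and makeSampleBlock are hypothetical names invented for this illustration, and the column is simplified to a vector of integers; the real Block stores ColumnWithTypeAndName entries that wrap an IColumn and a data type.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for ColumnWithTypeAndName:
// a named column of integers (no IColumn, no data type).
struct MiniColumn
{
    std::string name;
    std::vector<long long> values;
};

// Hypothetical, simplified stand-in for Block.
class MiniBlock
{
public:
    void insert(MiniColumn column)
    {
        // Record the position of the new column so it can be found by name later.
        index_by_name[column.name] = data.size();
        data.push_back(std::move(column));
    }

    const MiniColumn & getByName(const std::string & name) const
    {
        return data.at(index_by_name.at(name));
    }

    std::size_t columns() const { return data.size(); }
    std::size_t rows() const { return data.empty() ? 0 : data.front().values.size(); }

private:
    std::vector<MiniColumn> data;                        // plays the role of Container data
    std::map<std::string, std::size_t> index_by_name;    // column name -> position in data
};

// Builds a block with two columns and three rows.
inline MiniBlock makeSampleBlock()
{
    MiniBlock block;
    block.insert({"user_id", {1, 2, 1}});
    block.insert({"amount", {10, 20, 30}});
    return block;
}
```

Looking a column up by name costs one map lookup plus one vector access, which is presumably why the position map is kept next to the column container.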

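  • The abstract class IBlockInputStream
    As the name suggests, IBlockInputStream is an interface.
    It is just as important as Block: ClickHouse's execution model is built on top of it. Its core method is read, which returns a Block that has been processed by the corresponding stream.
    By now the picture should be clear: ClickHouse uses IBlockInputStream to implement the Volcano model. Each stream handles one piece of the query logic, the streams are stacked layer by layer, and the output of the last stream is the result the user asked for.
    IBlockInputStream also has a twin, IBlockOutputStream, which, as the name suggests, is used when data has to be written.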
class IBlockInputStream : public TypePromotion<IBlockInputStream>
{
    friend struct BlockStreamProfileInfo;

public:
    IBlockInputStream() { info.parent = this; }
    virtual ~IBlockInputStream() {}

    IBlockInputStream(const IBlockInputStream &) = delete;
    IBlockInputStream & operator=(const IBlockInputStream &) = delete;

    /// To output the data stream transformation tree (query execution plan).
    virtual String getName() const = 0;

    /** Get data structure of the stream in a form of "header" block (it is also called "sample block").
      * Header block contains column names, data types, columns of size 0. Constant columns must have corresponding values.
      * It is guaranteed that method "read" returns blocks of exactly that structure.
      */
    virtual Block getHeader() const = 0;

    virtual const BlockMissingValues & getMissingValues() const
    {
        static const BlockMissingValues none;
        return none;
    }

    /// If this stream generates data in order by some keys, return true.
    virtual bool isSortedOutput() const { return false; }

    /// In case of isSortedOutput, return corresponding SortDescription
    virtual const SortDescription & getSortDescription() const;

    /** Read next block.
      * If there are no more blocks, return an empty block (for which operator `bool` returns false).
      * NOTE: Only one thread can read from one instance of IBlockInputStream simultaneously.
      * This also applies for readPrefix, readSuffix.
      */
    Block read();
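The pull-based Volcano model described above can be sketched in a few lines. All names here (IMiniStream, SourceStream, FilterEvenStream, drain) are hypothetical and not part of ClickHouse, and a block is reduced to a single vector of integers. The sketch assumes that source blocks are non-empty, because an empty block is used as the end-of-stream marker, just as the comment on read() describes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical one-column block; an empty block signals the end of the stream.
using MiniBlock = std::vector<int>;

struct IMiniStream
{
    virtual ~IMiniStream() = default;
    virtual MiniBlock read() = 0;
};
using MiniStreamPtr = std::shared_ptr<IMiniStream>;

// Leaf stream: hands out prepared blocks one by one.
class SourceStream : public IMiniStream
{
public:
    explicit SourceStream(std::vector<MiniBlock> blocks_) : blocks(std::move(blocks_)) {}

    MiniBlock read() override
    {
        if (pos == blocks.size())
            return {};
        return blocks[pos++];
    }

private:
    std::vector<MiniBlock> blocks;
    std::size_t pos = 0;
};

// Intermediate stream: pulls from its child and keeps only even values.
class FilterEvenStream : public IMiniStream
{
public:
    explicit FilterEvenStream(MiniStreamPtr child_) : child(std::move(child_)) {}

    MiniBlock read() override
    {
        // Skip blocks that become empty after filtering; returning them
        // would be mistaken for the end of the stream by the caller.
        while (true)
        {
            MiniBlock in = child->read();
            if (in.empty())
                return {};

            MiniBlock out;
            for (int v : in)
                if (v % 2 == 0)
                    out.push_back(v);

            if (!out.empty())
                return out;
        }
    }

private:
    MiniStreamPtr child;
};

// The consumer at the top of the tree pulls until the stream is exhausted.
inline std::vector<int> drain(IMiniStream & stream)
{
    std::vector<int> all;
    for (MiniBlock block = stream.read(); !block.empty(); block = stream.read())
        all.insert(all.end(), block.begin(), block.end());
    return all;
}
```

Only the top-level consumer drives execution: each read() call on the outer stream triggers as many read() calls on the child as are needed to produce one output block.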
  • The AggregatingBlockInputStream class
    Here, finally, is our protagonist: AggregatingBlockInputStream, a subclass of the IBlockInputStream above and the class this note focuses on.
class AggregatingBlockInputStream : public IBlockInputStream
{
public:
    /** keys are taken from the GROUP BY part of the query
      * Aggregate functions are searched everywhere in the expression.
      * Columns corresponding to keys and arguments of aggregate functions must already be computed.
      */
    AggregatingBlockInputStream(const BlockInputStreamPtr & input, const Aggregator::Params & params_, bool final_)
        : params(params_), aggregator(params), final(final_)
    {
        children.push_back(input);
    }

    String getName() const override { return "Aggregating"; }

    Block getHeader() const override;

protected:
    Block readImpl() override;

    Aggregator::Params params;
    Aggregator aggregator;
    bool final;

    bool executed = false;

    std::vector<std::unique_ptr<TemporaryFileStream>> temporary_inputs;

     /** From here we will get the completed blocks after the aggregation. */
    std::unique_ptr<IBlockInputStream> impl;
};

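Start with its constructor, which takes these parameters:

  • input (a BlockInputStreamPtr): the child stream, that is, the stream that actually produces the data. The aggregation is computed over the blocks this child stream returns.
  • params: the aggregation parameters, and the most important argument. It records which keys to group by, which aggregate functions to call, and other core information. Its type, Aggregator::Params, is a nested class of Aggregator, and the aggregator member, which actually performs the aggregation, is constructed from it.
  • final: whether this stream produces the final result or intermediate states that still need further computation.

The key point is that AggregatingBlockInputStream implements its logic by overriding the readImpl() method it inherits. It also has a twin, ParallelAggregatingBlockInputStream, which parallelizes the work to speed up aggregation further. (In the author's own tests on simple aggregation queries, parallelization improved performance by nearly a factor of two.)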
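The role of the executed flag in the class above can be illustrated with a small sketch. It assumes, as the member names suggest, that the first read consumes the whole child stream and later reads only return results. The names IMiniStream, VectorSource and SumByKeyStream are hypothetical, the only aggregate is a sum per key, and the whole result is returned as one block, whereas the real class reads its result blocks from impl.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical row-oriented block of (key, value) pairs; an empty block means "end of stream".
using Row = std::pair<int, long long>;
using MiniBlock = std::vector<Row>;

struct IMiniStream
{
    virtual ~IMiniStream() = default;
    virtual MiniBlock read() = 0;
};

// Leaf stream that hands out prepared, non-empty blocks one by one.
class VectorSource : public IMiniStream
{
public:
    explicit VectorSource(std::vector<MiniBlock> blocks_) : blocks(std::move(blocks_)) {}

    MiniBlock read() override
    {
        return pos < blocks.size() ? blocks[pos++] : MiniBlock{};
    }

private:
    std::vector<MiniBlock> blocks;
    std::size_t pos = 0;
};

// Computes sum(value) GROUP BY key over everything the child produces.
class SumByKeyStream : public IMiniStream
{
public:
    explicit SumByKeyStream(std::shared_ptr<IMiniStream> child_) : child(std::move(child_)) {}

    MiniBlock read() override
    {
        if (executed)
            return {};   // the single result block has already been returned
        executed = true;

        // Aggregation is blocking: no group is complete until
        // the child stream has been read to the end.
        std::map<int, long long> sums;
        for (MiniBlock block = child->read(); !block.empty(); block = child->read())
            for (const Row & row : block)
                sums[row.first] += row.second;

        return MiniBlock(sums.begin(), sums.end());
    }

private:
    std::shared_ptr<IMiniStream> child;
    bool executed = false;
};
```

Unlike a filter, this stream cannot emit anything until its input is exhausted, which is why a flag is needed to tell the first call apart from the rest.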

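  • The Aggregator::Params class
    Aggregator::Params is a nested class of Aggregator and the most important class in the whole aggregation process. After a query is parsed and optimized, an execution plan for the aggregation is generated, and the parameters of that plan are initialized through Aggregator::Params: which columns to group by, which aggregate functions to apply, and so on. The Params object is then passed to the Aggregator, which carries out the aggregation.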
 struct Params
    {
        /// Data structure of source blocks.
        Block src_header;
        /// Data structure of intermediate blocks before merge.
        Block intermediate_header;

        /// What to count.
        const ColumnNumbers keys;
        const AggregateDescriptions aggregates;
        const size_t keys_size;
        const size_t aggregates_size;

        /// The settings of approximate calculation of GROUP BY.
        const bool overflow_row;    /// Do we need to put into AggregatedDataVariants::without_key aggregates for keys that are not in max_rows_to_group_by.
        const size_t max_rows_to_group_by;
        const OverflowMode group_by_overflow_mode;



        /// Settings to flush temporary data to the filesystem (external aggregation).
        const size_t max_bytes_before_external_group_by;        /// 0 - do not use external aggregation.

        /// Return empty result when aggregating without keys on empty set.
        bool empty_result_for_aggregation_by_empty_set;

        VolumePtr tmp_volume;

        /// Settings is used to determine cache size. No threads are created.
        size_t max_threads;

        const size_t min_free_disk_space;
        Params(
            const Block & src_header_,
            const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_,
            bool overflow_row_, size_t max_rows_to_group_by_, OverflowMode group_by_overflow_mode_,
            size_t group_by_two_level_threshold_, size_t group_by_two_level_threshold_bytes_,
            size_t max_bytes_before_external_group_by_,
            bool empty_result_for_aggregation_by_empty_set_,
            VolumePtr tmp_volume_, size_t max_threads_,
            size_t min_free_disk_space_)
            : src_header(src_header_),
            keys(keys_), aggregates(aggregates_), keys_size(keys.size()), aggregates_size(aggregates.size()),
            overflow_row(overflow_row_), max_rows_to_group_by(max_rows_to_group_by_), group_by_overflow_mode(group_by_overflow_mode_),
            group_by_two_level_threshold(group_by_two_level_threshold_), group_by_two_level_threshold_bytes(group_by_two_level_threshold_bytes_),
            max_bytes_before_external_group_by(max_bytes_before_external_group_by_),
            empty_result_for_aggregation_by_empty_set(empty_result_for_aggregation_by_empty_set_),
            tmp_volume(tmp_volume_), max_threads(max_threads_),
            min_free_disk_space(min_free_disk_space_)
        {
        }

        /// Only parameters that matter during merge.
        Params(const Block & intermediate_header_,
            const ColumnNumbers & keys_, const AggregateDescriptions & aggregates_, bool overflow_row_, size_t max_threads_)
            : Params(Block(), keys_, aggregates_, overflow_row_, 0, OverflowMode::THROW, 0, 0, 0, false, nullptr, max_threads_, 0)
        {
            intermediate_header = intermediate_header_;
        }
    };
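As a rough illustration of what such a parameter object carries, the sketch below describes a query shaped like SELECT k, sum(v), count() ... GROUP BY k over a source block with columns (k, v). MiniAggregateDescription, MiniParams and makeSumCountParams are hypothetical, heavily reduced stand-ins; only the idea of storing column positions and the delegating "merge-only" constructor are taken from the code above. Treating 0 as "no limit" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using ColumnNumbers = std::vector<std::size_t>;

// Hypothetical, simplified stand-in for AggregateDescription.
struct MiniAggregateDescription
{
    std::string function_name;   // e.g. "sum"
    ColumnNumbers arguments;     // positions of the argument columns in the source block
};

// Hypothetical, reduced version of Aggregator::Params.
struct MiniParams
{
    const ColumnNumbers keys;                                 // positions of the GROUP BY columns
    const std::vector<MiniAggregateDescription> aggregates;
    const std::size_t keys_size;
    const std::size_t aggregates_size;
    const std::size_t max_rows_to_group_by;                   // 0 means "no limit" in this sketch
    std::size_t max_threads;

    MiniParams(ColumnNumbers keys_, std::vector<MiniAggregateDescription> aggregates_,
               std::size_t max_rows_to_group_by_, std::size_t max_threads_)
        : keys(std::move(keys_)), aggregates(std::move(aggregates_)),
          keys_size(keys.size()), aggregates_size(aggregates.size()),
          max_rows_to_group_by(max_rows_to_group_by_), max_threads(max_threads_)
    {
    }

    // Only the parameters that matter during merge; everything else is
    // defaulted by delegating to the full constructor.
    MiniParams(ColumnNumbers keys_, std::vector<MiniAggregateDescription> aggregates_,
               std::size_t max_threads_)
        : MiniParams(std::move(keys_), std::move(aggregates_), 0, max_threads_)
    {
    }
};

// Parameters for a query like: SELECT k, sum(v), count() ... GROUP BY k
// over a source block whose columns are (k, v).
inline MiniParams makeSumCountParams()
{
    ColumnNumbers keys;
    keys.push_back(0);                                    // k is column 0

    std::vector<MiniAggregateDescription> aggregates;
    aggregates.push_back({"sum", ColumnNumbers(1, 1)});   // sum over column 1 (v)
    aggregates.push_back({"count", ColumnNumbers()});     // count takes no arguments
    return MiniParams(keys, aggregates, 0, 1);
}
```

Because keys_size and aggregates_size are initialized from the already-initialized keys and aggregates members, the declaration order of the members matters here, exactly as in the original struct.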
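  • The Aggregator class
    As the name suggests, this is the class that actually does the aggregation work. Its two core methods are:
    • execute: reads blocks from the input stream one by one and writes the aggregated result into result.
    • mergeAndConvertToBlocks: converts the aggregated result into an input stream, whose read function hands the result on to the layer above.
      Calling these two functions completes the flow: input of the data to aggregate -> aggregation -> output. The details are analyzed in the next section.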
class Aggregator
{
public:
    Aggregator(const Params & params_);

    /// Aggregate the source. Get the result in the form of one of the data structures.
    void execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result);

    using AggregateColumns = std::vector<ColumnRawPtrs>;
    using AggregateColumnsData = std::vector<ColumnAggregateFunction::Container *>;
    using AggregateColumnsConstData = std::vector<const ColumnAggregateFunction::Container *>;
    using AggregateFunctionsPlainPtrs = std::vector<IAggregateFunction *>;

    /// Process one block. Return false if the processing should be aborted (with group_by_overflow_mode = 'break').
    bool executeOnBlock(const Block & block, AggregatedDataVariants & result,
        ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns,    /// Passed to not create them anew for each block
        bool & no_more_keys);

    bool executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
        ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns,    /// Passed to not create them anew for each block
        bool & no_more_keys);

    /** Convert the aggregation data structure into a block.
      * If overflow_row = true, then aggregates for rows that are not included in max_rows_to_group_by are put in the first block.
      *
      * If final = false, then ColumnAggregateFunction is created as the aggregation columns with the state of the calculations,
      *  which can then be combined with other states (for distributed query processing).
      * If final = true, then columns with ready values are created as aggregate columns.
      */
    BlocksList convertToBlocks(AggregatedDataVariants & data_variants, bool final, size_t max_threads) const;

    /** Merge several aggregation data structures and output the result as a block stream.
      */
    std::unique_ptr<IBlockInputStream> mergeAndConvertToBlocks(ManyAggregatedDataVariants & data_variants, bool final, size_t max_threads) const;
    ManyAggregatedDataVariants prepareVariantsToMerge(ManyAggregatedDataVariants & data_variants) const;

    /** Merge the stream of partially aggregated blocks into one data structure.
      * (Pre-aggregate several blocks that represent the result of independent aggregations from remote servers.)
      */
    void mergeStream(const BlockInputStreamPtr & stream, AggregatedDataVariants & result, size_t max_threads);

    using BucketToBlocks = std::map<Int32, BlocksList>;
    /// Merge partially aggregated blocks separated to buckets into one data structure.
    void mergeBlocks(BucketToBlocks bucket_to_blocks, AggregatedDataVariants & result, size_t max_threads);

    /// Merge several partially aggregated blocks into one.
    /// Precondition: for all blocks block.info.is_overflows flag must be the same.
    /// (either all blocks are from overflow data or none blocks are).
    /// The resulting block has the same value of is_overflows flag.
    Block mergeBlocks(BlocksList & blocks, bool final);

    using CancellationHook = std::function<bool()>;

    /** Set a function that checks whether the current task can be aborted.
      */
    void setCancellationHook(const CancellationHook cancellation_hook);

    /// Get data structure of the result.
    Block getHeader(bool final) const;
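The comment on convertToBlocks distinguishes final = false (columns of intermediate states that can still be combined) from final = true (columns of ready values). The sketch below shows why that distinction matters, using an average as the example. AvgState, aggregatePartial, mergeInto and finalize are hypothetical names; the two partial results stand for independent aggregations such as those coming from remote servers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Intermediate state of avg(): it can be merged with other states,
// which a finished average cannot.
struct AvgState
{
    double sum = 0.0;
    std::size_t count = 0;
};

using States = std::map<int, AvgState>;   // key -> state

// Partial aggregation, as one node would do with "final = false".
inline States aggregatePartial(const std::vector<std::pair<int, double>> & rows)
{
    States states;
    for (const auto & row : rows)
    {
        AvgState & state = states[row.first];
        state.sum += row.second;
        ++state.count;
    }
    return states;
}

// Merge one set of partial states into another.
inline void mergeInto(States & result, const States & partial)
{
    for (const auto & entry : partial)
    {
        AvgState & dst = result[entry.first];
        dst.sum += entry.second.sum;
        dst.count += entry.second.count;
    }
}

// Turn states into ready values, as "final = true" would.
inline std::map<int, double> finalize(const States & states)
{
    std::map<int, double> values;
    for (const auto & entry : states)
        values[entry.first] = entry.second.sum / static_cast<double>(entry.second.count);
    return values;
}
```

If one node sees the values 1, 2, 3 for a key and another node sees only 5, merging the states gives 11 / 4 = 2.75, whereas averaging the two finished averages (2 and 5) would give 3.5. Keeping states until the last step avoids that error.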

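2. Implementing the aggregation flow

Let's take the Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result) function mentioned above as the starting point and trace through ClickHouse's aggregation implementation: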

void Aggregator::execute(const BlockInputStreamPtr & stream, AggregatedDataVariants & result)
{
    Stopwatch watch;

    size_t src_rows = 0;
    size_t src_bytes = 0;

    /// Read all the data
    while (Block block = stream->read())
    {
        if (isCancelled())
            return;

        src_rows += block.rows();
        src_bytes += block.bytes();

        if (!executeOnBlock(block, result, key_columns, aggregate_columns, no_more_keys))
            break;
    }

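As the code above shows, execute reads the Blocks produced by the child stream one after another and calls executeOnBlock on each of them to aggregate it. Following the trail, we look at executeOnBlock next. The function is fairly long, so it is split into parts here, with the less important code left out. The first part extracts pointers to the key columns and the aggregate argument columns specified in params and packs them, together with the aggregate functions, into the AggregateFunctionInstructions structure.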

bool Aggregator::executeOnBlock(Columns columns, UInt64 num_rows, AggregatedDataVariants & result,
    ColumnRawPtrs & key_columns, AggregateColumns & aggregate_columns, bool & no_more_keys)
{
    /// `result` will destroy the states of aggregate functions in the destructor
    result.aggregator = this;

    /// How to perform the aggregation?
    if (result.empty())
    {
        result.init(method_chosen);
        result.keys_size = params.keys_size;
        result.key_sizes = key_sizes;
        LOG_TRACE(log, "Aggregation method: " << result.getMethodName());
    }

    for (size_t i = 0; i < params.aggregates_size; ++i)
        aggregate_columns[i].resize(params.aggregates[i].arguments.size());

    /** Constant columns are not supported directly during aggregation.
      * To make them work anyway, we materialize them.
      */
    Columns materialized_columns;

    /// Remember the columns we will work with
    for (size_t i = 0; i < params.keys_size; ++i)
    {
        materialized_columns.push_back(columns.at(params.keys[i])->convertToFullColumnIfConst());
        key_columns[i] = materialized_columns.back().get();

        if (!result.isLowCardinality())
        {
            auto column_no_lc = recursiveRemoveLowCardinality(key_columns[i]->getPtr());
            if (column_no_lc.get() != key_columns[i])
            {
                materialized_columns.emplace_back(std::move(column_no_lc));
                key_columns[i] = materialized_columns.back().get();
            }
        }
    }

    AggregateFunctionInstructions aggregate_functions_instructions(params.aggregates_size + 1);
    aggregate_functions_instructions[params.aggregates_size].that = nullptr;

    std::vector<std::vector<const IColumn *>> nested_columns_holder;
    for (size_t i = 0; i < params.aggregates_size; ++i)
    {
        for (size_t j = 0; j < aggregate_columns[i].size(); ++j)
        {
            materialized_columns.push_back(columns.at(params.aggregates[i].arguments[j])->convertToFullColumnIfConst());
            aggregate_columns[i][j] = materialized_columns.back().get();

            auto column_no_lc = recursiveRemoveLowCardinality(aggregate_columns[i][j]->getPtr());
            if (column_no_lc.get() != aggregate_columns[i][j])
            {
                materialized_columns.emplace_back(std::move(column_no_lc));
                aggregate_columns[i][j] = materialized_columns.back().get();
            }
        }

        aggregate_functions_instructions[i].arguments = aggregate_columns[i].data();
        aggregate_functions_instructions[i].state_offset = offsets_of_aggregate_states[i];
        auto that = aggregate_functions[i];
        /// Unnest consecutive trailing -State combinators
        while (auto func = typeid_cast<const AggregateFunctionState *>(that))
            that = func->getNestedFunction().get();
        aggregate_functions_instructions[i].that = that;
        aggregate_functions_instructions[i].func = that->getAddressOfAddFunction();

        if (auto func = typeid_cast<const AggregateFunctionArray *>(that))
        {
            /// Unnest consecutive -State combinators before -Array
            that = func->getNestedFunction().get();
            while (auto nested_func = typeid_cast<const AggregateFunctionState *>(that))
                that = nested_func->getNestedFunction().get();
            auto [nested_columns, offsets] = checkAndGetNestedArrayOffset(aggregate_columns[i].data(), that->getArgumentTypes().size());
            nested_columns_holder.push_back(std::move(nested_columns));
            aggregate_functions_instructions[i].batch_arguments = nested_columns_holder.back().data();
            aggregate_functions_instructions[i].offsets = offsets;
        }
        else
            aggregate_functions_instructions[i].batch_arguments = aggregate_columns[i].data();

        aggregate_functions_instructions[i].batch_that = that;
    }
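One detail in the code above is easy to miss: the instruction array is allocated with params.aggregates_size + 1 elements, and the last element's that pointer is set to nullptr so that it acts as a terminator. A minimal sketch of this sentinel-terminated pattern follows. MiniInstruction, runInstructions and sumAndCount are hypothetical; the sentinel here is a null function pointer instead of a null that, and state_offset is an index into a vector instead of a byte offset into the state memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for AggregateFunctionInstruction.
struct MiniInstruction
{
    // Folds one value into a state; nullptr marks the end of the array.
    void (*add)(long long & state, long long value) = nullptr;
    const std::vector<long long> * argument = nullptr;   // the value column
    std::size_t state_offset = 0;                        // which state this function owns
};

inline void addSum(long long & state, long long value) { state += value; }
inline void addCount(long long & state, long long) { ++state; }

// Walks the array until it reaches the terminating instruction,
// so the caller does not need to pass the number of instructions.
inline void runInstructions(const MiniInstruction * instructions, std::size_t rows,
                            std::vector<long long> & states)
{
    for (const MiniInstruction * inst = instructions; inst->add; ++inst)
        for (std::size_t row = 0; row < rows; ++row)
            inst->add(states[inst->state_offset], (*inst->argument)[row]);
}

// Computes {sum, count} of a column with two instructions and one sentinel.
inline std::vector<long long> sumAndCount(const std::vector<long long> & column)
{
    // The extra, default-constructed element is the terminator.
    std::vector<MiniInstruction> instructions(2 + 1);

    instructions[0].add = addSum;
    instructions[0].argument = &column;
    instructions[0].state_offset = 0;

    instructions[1].add = addCount;
    instructions[1].argument = &column;
    instructions[1].state_offset = 1;

    std::vector<long long> states(2, 0);
    runInstructions(instructions.data(), column.size(), states);
    return states;
}
```

The loop in executeImplBatch, `for (inst = aggregate_instructions; inst->that; ++inst)`, relies on the same convention.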

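With the arguments prepared, the next step is a call to executeImpl(*result.NAME, result.aggregates_pool, num_rows, key_columns, aggregate_functions_instructions.data(), no_more_keys, overflow_row_ptr), which performs the aggregation. Let's look at its implementation. It is a template function that internally calls executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions). Databases commonly work in batches like this, submitting a whole group of rows at once to reduce the overhead of virtual function calls.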

template <typename Method>
void NO_INLINE Aggregator::executeImpl(
    Method & method,
    Arena * aggregates_pool,
    size_t rows,
    ColumnRawPtrs & key_columns,
    AggregateFunctionInstruction * aggregate_instructions,
    bool no_more_keys,
    AggregateDataPtr overflow_row) const
{
    typename Method::State state(key_columns, key_sizes, aggregation_state_cache);

    if (!no_more_keys)
        executeImplBatch(method, state, aggregates_pool, rows, aggregate_instructions);
    else
        executeImplCase<true>(method, state, aggregates_pool, rows, aggregate_instructions, overflow_row);
}
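The no_more_keys branch is not explained further in this note, so the following is only one plausible reading of it, based on the comments on overflow_row and max_rows_to_group_by in Params: once the number of groups exceeds the limit, no new keys are created; rows with known keys are still aggregated, and rows with unknown keys go to a single overflow state. LimitedSum is a hypothetical class, and checking the limit once per block is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sum-by-key aggregator with a cap on the number of groups.
class LimitedSum
{
public:
    explicit LimitedSum(std::size_t max_keys_) : max_keys(max_keys_) {}

    void addBlock(const std::vector<std::pair<int, long long>> & rows)
    {
        for (const auto & row : rows)
        {
            if (!no_more_keys)
            {
                // Normal mode: a missing key is inserted.
                sums[row.first] += row.second;
            }
            else
            {
                // Restricted mode: look up only, never insert.
                auto it = sums.find(row.first);
                if (it != sums.end())
                    it->second += row.second;
                else
                    overflow += row.second;   // plays the role of the overflow row
            }
        }

        // Assumption: the limit is checked once per block, after processing it.
        if (max_keys != 0 && sums.size() > max_keys)
            no_more_keys = true;
    }

    std::size_t groups() const { return sums.size(); }
    long long sumFor(int key) const { return sums.at(key); }
    long long overflowSum() const { return overflow; }
    bool limitReached() const { return no_more_keys; }

private:
    std::size_t max_keys;                        // 0 means "no limit" in this sketch
    bool no_more_keys = false;
    std::unordered_map<int, long long> sums;
    long long overflow = 0;
};
```

Because the check happens after a block has been processed, the number of groups can exceed the limit by up to one block's worth of new keys in this sketch.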

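Let's keep going. executeImplBatch is also a template function.

  • First, it builds places, an array of AggregateDataPtr, which is where the aggregation states are kept. Its length equals the number of rows in the batch, so the state pointers themselves form a column-like array that takes part in the subsequent computation.
  • Next, a for loop calls state.emplaceKey for each row, which hashes the row's aggregation key, finds or creates the group it belongs to, and stores the pointer to that group's state in the corresponding slot of places.
  • Finally, a second for loop calls the addBatch method of each aggregate function (introduced in the previous note). Every AggregateFunctionInstruction carries a state_offset and the value columns to aggregate; addBatch combines the state pointers in places with those value columns, which produces the result of all the aggregate computations.

That completes the core of the aggregation. What remains is to convert result into a block stream through convertToBlocks, as described above, and return it to the caller in the layer above.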

template <typename Method>
void NO_INLINE Aggregator::executeImplBatch(
    Method & method,
    typename Method::State & state,
    Arena * aggregates_pool,
    size_t rows,
    AggregateFunctionInstruction * aggregate_instructions) const
{
    PODArray<AggregateDataPtr> places(rows);

    /// For all rows.
    for (size_t i = 0; i < rows; ++i)
    {
        AggregateDataPtr aggregate_data = nullptr;

        auto emplace_result = state.emplaceKey(method.data, i, *aggregates_pool);

        /// If a new key is inserted, initialize the states of the aggregate functions, and possibly something related to the key.
        if (emplace_result.isInserted())
        {
            /// exception-safety - if you can not allocate memory or create states, then destructors will not be called.
            emplace_result.setMapped(nullptr);

            aggregate_data = aggregates_pool->alignedAlloc(total_size_of_aggregate_states, align_aggregate_states);
            createAggregateStates(aggregate_data);

            emplace_result.setMapped(aggregate_data);
        }
        else
            aggregate_data = emplace_result.getMapped();

        places[i] = aggregate_data;
        assert(places[i] != nullptr);
    }

    /// Add values to the aggregate functions.
    for (AggregateFunctionInstruction * inst = aggregate_instructions; inst->that; ++inst)
    {
        if (inst->offsets)
            inst->batch_that->addBatchArray(rows, places.data(), inst->state_offset, inst->batch_arguments, inst->offsets, aggregates_pool);
        else
            inst->batch_that->addBatch(rows, places.data(), inst->state_offset, inst->batch_arguments, aggregates_pool);
    }
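The two phases of executeImplBatch can be reproduced in a small, self-contained sketch: first resolve one state pointer per row, then let each aggregate function process the whole batch with a single virtual call. Everything here is a hypothetical simplification: States has a fixed layout for sum and count instead of using state_offset, a std::deque stands in for the Arena, and a std::unordered_map stands in for method.data.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <vector>

// Per-group states for "sum(x), count()"; hypothetical fixed layout.
struct States
{
    long long sum = 0;
    long long count = 0;
};

struct IMiniAggregateFunction
{
    virtual ~IMiniAggregateFunction() = default;
    // One virtual call handles the whole batch instead of one call per row.
    virtual void addBatch(std::size_t rows, States * const * places, const long long * column) const = 0;
};

struct MiniSum final : IMiniAggregateFunction
{
    void addBatch(std::size_t rows, States * const * places, const long long * column) const override
    {
        for (std::size_t i = 0; i < rows; ++i)
            places[i]->sum += column[i];
    }
};

struct MiniCount final : IMiniAggregateFunction
{
    void addBatch(std::size_t rows, States * const * places, const long long *) const override
    {
        for (std::size_t i = 0; i < rows; ++i)
            ++places[i]->count;
    }
};

class MiniAggregator
{
public:
    void executeOnBatch(const std::vector<int> & keys, const std::vector<long long> & values)
    {
        assert(keys.size() == values.size());
        const std::size_t rows = keys.size();

        // Phase 1: for every row, find or create the state of its group.
        std::vector<States *> places(rows);
        for (std::size_t i = 0; i < rows; ++i)
        {
            auto it = table.find(keys[i]);
            if (it == table.end())
            {
                // std::deque keeps pointers to existing elements valid
                // when new elements are appended at the end.
                pool.emplace_back();
                it = table.emplace(keys[i], &pool.back()).first;
            }
            places[i] = it->second;
        }

        // Phase 2: each aggregate function walks the whole batch once.
        sum.addBatch(rows, places.data(), values.data());
        count.addBatch(rows, places.data(), values.data());
    }

    const States & get(int key) const { return *table.at(key); }
    std::size_t groups() const { return table.size(); }

private:
    std::deque<States> pool;                    // stands in for the Arena
    std::unordered_map<int, States *> table;    // stands in for method.data
    MiniSum sum;
    MiniCount count;
};
```

With this layout, the cost of dynamic dispatch is paid once per aggregate function per batch, and the inner loops over rows are plain loops that a compiler has a chance to optimize.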

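3. Summary

That wraps up the walkthrough of the ClickHouse aggregation code.
Other physical operators are chained together through streams in the same way, so the approach used in this note can serve as a guide for reading their source code too.
The author is still a ClickHouse beginner. If you are interested in ClickHouse, comments and discussion are very welcome.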

4. References

Official documentation
ClickHouse source code
