openGauss Column-Store Tables: the PSort Index

Posted by openGaussbaby on 2024-04-10

Original URL: https://www.cnblogs.com/helloopenGauss/p/18125721

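Overview
A PSort (partial sort) index is a clustered index built on the columns of a column-store table. CUDesc records the min and max values of each CU (Compression Unit), but if the business data model is scattered, filtering CUs by min and max at query time reads a large number of CUs unnecessarily. For example, when the min-to-max span of every CU is wide, query efficiency approaches that of a full table scan. In the scenario in the figure below, query 2 hits almost every CU, so the query is close to a full table scan.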
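
To make the cost of min/max filtering concrete, here is a minimal C++ sketch. It assumes a hypothetical `CuRange` struct standing in for the min and max that CUDesc keeps for one column, and a hypothetical `CountCandidateCus` helper; neither name comes from openGauss.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one CUDesc entry: the min and max of one column in one CU.
struct CuRange {
    int min;
    int max;
};

// Count the CUs whose [min, max] range overlaps the query range [lo, hi].
// Each of them has to be read, even if it contains no matching row.
std::size_t CountCandidateCus(const std::vector<CuRange>& cus, int lo, int hi)
{
    assert(lo <= hi);
    std::size_t hits = 0;
    for (const CuRange& cu : cus) {
        if (cu.max >= lo && cu.min <= hi) {
            ++hits;
        }
    }
    return hits;
}
```

With scattered ranges such as [1, 98], [3, 99], [2, 100], [5, 97], a query for keys 40 to 45 overlaps all four CUs. With ranges [1, 25], [26, 50], [51, 75], [76, 100], the same query overlaps only one.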

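A PSort index sorts the data within a partial range (which generally covers the rows of several CUs) by the index key, so that the value ranges of the CUs overlap as little as possible, which improves query efficiency.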
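
The effect of sorting a partial range can be sketched in a few lines of C++. This is an illustration under simplifying assumptions (a single integer key, fixed-size CUs, in-memory vectors); `PartialSort` and `BuildCuRanges` are hypothetical names, not openGauss functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sort each window of windowRows consecutive rows in place.
// A window is assumed to span several CUs; the last window may be shorter.
void PartialSort(std::vector<int>& keys, std::size_t windowRows)
{
    assert(windowRows > 0);
    for (std::size_t begin = 0; begin < keys.size(); begin += windowRows) {
        std::size_t end = std::min(begin + windowRows, keys.size());
        std::sort(keys.begin() + begin, keys.begin() + end);
    }
}

// Cut the rows into CUs of cuRows rows each and record (min, max) per CU.
std::vector<std::pair<int, int>> BuildCuRanges(const std::vector<int>& keys, std::size_t cuRows)
{
    assert(cuRows > 0);
    std::vector<std::pair<int, int>> ranges;
    for (std::size_t begin = 0; begin < keys.size(); begin += cuRows) {
        std::size_t end = std::min(begin + cuRows, keys.size());
        auto mm = std::minmax_element(keys.begin() + begin, keys.begin() + end);
        ranges.emplace_back(*mm.first, *mm.second);
    }
    return ranges;
}
```

For the keys 9, 1, 5, 12, 3, 7, 11, 2, 8, 4, 10, 6 with 3 rows per CU, the unsorted CU ranges are [1, 9], [3, 12], [2, 11], [4, 10], so a lookup of key 6 touches all four CUs. After `PartialSort` with a window of 6 rows the ranges become [1, 5], [7, 12], [2, 6], [8, 11], and the same lookup touches one CU.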

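Using the PSort index
During a bulk insert into a column-store table, if a PSort index is found, the batch of data is sorted first. The PSort index table is itself organized as a cstore table (its CUDesc is an astore table). Its columns are the index key columns plus a row-number (TID) column. During the insert, a certain amount of data is sorted by the PSort index key, assembled together with the TID column into a vector array, and then inserted into the cstore table of the PSort index. The PSort index data therefore has one more column than the index key, and the extra column stores the position of the record in the data cstore storage.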
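
As a rough picture of that layout, the sketch below pairs each key with the position of its row and sorts the pairs by key. It assumes a single integer key column and a made-up TID encoding (CU id in the high 32 bits, row offset in the low 32 bits); `IndexRow` and `BuildSortedIndexRows` are hypothetical names, and the real openGauss TID format is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified index row: one key column plus the extra TID column.
struct IndexRow {
    int key;
    std::uint64_t tid;  // assumed encoding: (cuId << 32) | row offset inside the CU
};

// Build the index rows for one CU of the data table and sort them by key.
std::vector<IndexRow> BuildSortedIndexRows(const std::vector<int>& keyColumn, std::uint32_t cuId)
{
    assert(keyColumn.size() <= UINT32_MAX);
    std::vector<IndexRow> rows;
    rows.reserve(keyColumn.size());
    for (std::size_t i = 0; i < keyColumn.size(); ++i) {
        std::uint64_t tid = (static_cast<std::uint64_t>(cuId) << 32) | static_cast<std::uint64_t>(i);
        rows.push_back(IndexRow{keyColumn[i], tid});
    }
    // stable_sort keeps rows with equal keys in their original order.
    std::stable_sort(rows.begin(), rows.end(),
                     [](const IndexRow& a, const IndexRow& b) { return a.key < b.key; });
    return rows;
}
```

Sorting reorders the keys, but every row still carries the TID of its original position, which is what lets a lookup in the index lead back to the data table.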

// Build the index data while constructing the PSort index
inline void ProjectToIndexVector(VectorBatch *scanBatch, VectorBatch *outBatch, IndexInfo *indexInfo)
{
    Assert(scanBatch && outBatch && indexInfo);
    int numAttrs = indexInfo->ii_NumIndexAttrs;
    AttrNumber *attrNumbers = indexInfo->ii_KeyAttrNumbers;
    Assert(outBatch->m_cols == (numAttrs + 1));

    // index column
    for (int i = 0; i < numAttrs; i++) {
        AttrNumber attno = attrNumbers[i];
        Assert(attno > 0 && attno <= scanBatch->m_cols);

        // shallow copy
        outBatch->m_arr[i].copy(&scanBatch->m_arr[attno - 1]);
    }

    // ctid column
    // the last column is the tid
    outBatch->m_arr[numAttrs].copy(scanBatch->GetSysVector(-1));

    outBatch->m_rows = scanBatch->m_rows;
}

The cstore table insert flow follows. If a PSort index exists, the data is first put into the sort queue.

void CStoreInsert::BatchInsert(in VectorBatch* pBatch, in int options)
{
    Assert(pBatch || IsEnd());

    /* keep memory space from leaking during bulk-insert */
    MemoryContext oldCnxt = MemoryContextSwitchTo(m_tmpMemCnxt);

    // Step 1: relation has partial cluster key
    // We need put data into sorter container, and then do
    // batchinsert data
    if (NeedPartialSort()) {
        Assert(m_tmpBatchRows);

        if (pBatch) {
            Assert(pBatch->m_cols == m_relation->rd_att->natts);
            m_sorter->PutVecBatch(m_relation, pBatch); // put into the partial sort queue
        }

        if (m_sorter->IsFull() || IsEnd()) { // the sort queue is full or the input has ended
            m_sorter->RunSort(); // sort by the index key

            /* reset and fetch next batch of values */
            DoBatchInsert(options);
            m_sorter->Reset(IsEnd());

            /* reset and free all memory blocks */
            m_tmpBatchRows->reset(false);
        }
    }

    // Step 2: relation doesn't have partial cluster key
    // We need cache data until batchrows is full
    else {
        Assert(m_bufferedBatchRows);

        // If batch row is full, we can do batchinsert now
        if (IsEnd()) {
            if (ENABLE_DELTA(m_bufferedBatchRows)) {
                InsertDeltaTable(m_bufferedBatchRows, options);
            } else {
                BatchInsertCommon(m_bufferedBatchRows, options);
            }
            m_bufferedBatchRows->reset(true);
        }

        // we need cache data until batchrows is full
        if (pBatch) {
            Assert(pBatch->m_rows <= BatchMaxSize);
            Assert(pBatch->m_cols && m_relation->rd_att->natts);
            Assert(m_bufferedBatchRows->m_rows_maxnum > 0);
            Assert(m_bufferedBatchRows->m_rows_maxnum % BatchMaxSize == 0);

            int startIdx = 0;
            while (m_bufferedBatchRows->append_one_vector(
                       RelationGetDescr(m_relation), pBatch, &startIdx, m_cstorInsertMem)) {
                BatchInsertCommon(m_bufferedBatchRows, options);
                m_bufferedBatchRows->reset(true);
            }
            Assert(startIdx == pBatch->m_rows);
        }
    }

    // Step 3: We must update index data for this batch data
    // if end of batchInsert
    FlushIndexDataIfNeed();

    MemoryContextReset(m_tmpMemCnxt);
    (void)MemoryContextSwitchTo(oldCnxt);
}

Figure: cstore table insert flow
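
The control flow of the sorting branch (buffer incoming batches, sort when the buffer is full or the input ends, then flush) can be reduced to a small standalone sketch. It loosely mirrors the `NeedPartialSort()` branch of `BatchInsert`, but `SortedBatchWriter` and its members are hypothetical, and plain integer vectors stand in for vector batches and CUs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of "buffer, sort when full or at end of input, flush".
class SortedBatchWriter {
public:
    explicit SortedBatchWriter(std::size_t capacity) : m_capacity(capacity)
    {
        assert(capacity > 0);
    }

    // Buffer one incoming batch. Pass nullptr with atEnd = true to signal end of input.
    void BatchInsert(const std::vector<int>* batch, bool atEnd)
    {
        assert(batch != nullptr || atEnd);
        if (batch != nullptr) {
            m_buffer.insert(m_buffer.end(), batch->begin(), batch->end());
        }
        if (m_buffer.size() >= m_capacity || atEnd) {
            std::sort(m_buffer.begin(), m_buffer.end());  // sort the buffered range by key
            if (!m_buffer.empty()) {
                m_flushed.push_back(m_buffer);            // stands in for writing CUs
            }
            m_buffer.clear();
        }
    }

    // Each element is one sorted run that was flushed.
    const std::vector<std::vector<int>>& Flushed() const { return m_flushed; }

private:
    std::size_t m_capacity;
    std::vector<int> m_buffer;
    std::vector<std::vector<int>> m_flushed;
};
```

Each flushed run is sorted internally, but separate runs are not merged with each other, which is why the ordering is only partial.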

Code that updates the index data during the insert flow

void CStoreInsert::InsertIdxTableIfNeed(bulkload_rows* batchRowPtr, uint32 cuId)
{
    Assert(batchRowPtr);

    if (relation_has_indexes(m_resultRelInfo)) {
        /* form all tids */
        bulkload_indexbatch_set_tids(m_idxBatchRow, cuId, batchRowPtr->m_rows_curnum);

        for (int indice = 0; indice < m_resultRelInfo->ri_NumIndices; ++indice) {
            /* form index-keys data for index relation */
            for (int key = 0; key < m_idxKeyNum[indice]; ++key) {
                bulkload_indexbatch_copy(m_idxBatchRow, key, batchRowPtr, m_idxKeyAttr[indice][key]);
            }

            /* form tid-keys data for index relation */
            bulkload_indexbatch_copy_tids(m_idxBatchRow, m_idxKeyNum[indice]);

            /* update the actual number of used attributes */
            m_idxBatchRow->m_attr_num = m_idxKeyNum[indice] + 1;

            if (m_idxInsert[indice] != NULL) {
                /* insert into the PSort index */
                m_idxInsert[indice]->BatchInsert(m_idxBatchRow, 0);
            } else {
                /* insert into the cbtree/cgin index */
                CStoreInsert::InsertNotPsortIdx(indice);
            }
        }
    }
}

The index insert flow is the same as an ordinary cstore data insert.

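When a query uses a PSort index, the data inside each PSort index CU is already sorted, so a binary search quickly finds the row number of the matching data in the psort index. The tid column of that row is the row number of the record in the data cstore.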

Figure 2: PSort index query
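
The lookup step can be sketched with `std::lower_bound` over one sorted index CU. This is a simplified illustration, assuming a single integer key and an opaque 64-bit tid; `SortedEntry` and `LookupTids` are hypothetical names rather than openGauss code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical row of one PSort index CU: the key plus the tid of the data row.
struct SortedEntry {
    int key;
    std::uint64_t tid;
};

static bool KeyLess(const SortedEntry& a, const SortedEntry& b)
{
    return a.key < b.key;
}

// Return the tids of all entries whose key equals target.
// The entries must already be sorted by key, as they are inside a PSort index CU.
std::vector<std::uint64_t> LookupTids(const std::vector<SortedEntry>& entries, int target)
{
    assert(std::is_sorted(entries.begin(), entries.end(), KeyLess));
    // Binary search for the first entry whose key is not less than target.
    auto it = std::lower_bound(entries.begin(), entries.end(), target,
                               [](const SortedEntry& e, int k) { return e.key < k; });
    std::vector<std::uint64_t> tids;
    for (; it != entries.end() && it->key == target; ++it) {
        tids.push_back(it->tid);
    }
    return tids;
}
```

The binary search costs a logarithmic number of comparisons per index CU, and each returned tid is then used to fetch the row from the data cstore.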
