Reading TiDB Source Code to Talk About Design: A Look at the HTAP SQL Optimizer

Published on 2024-02-18

Version 1.0 (2024-02-18): initial publication.

All source-code analysis in this article is based on TiDB 6.5.

1. Introduction

If you were building the optimizer for a distributed database, how would you go about executing the following SQL statements well?

  • SELECT id, name FROM person WHERE age >= 18 or height > 180 limit 100;: Looking at the predicate, the two conditions form a disjunction: age >= 18 OR height > 180. Given that, we would naturally prefer whichever side has an index; if both (or neither) are indexed, we pick the one that scans the fewest rows. If other operators are involved, data locality also has to be considered: under some architectures certain operators can take full advantage of MPP, while others cannot.
  • SELECT orders.order_id, customers.customer_name, orders.order_date FROM orders LEFT JOIN customers ON orders.customer_id=customers.customer_id;: For a join in a distributed database, the optimal strategy is to broadcast the small table to wherever the large table's data lives. So first of all we need to know which table is the small one and which is the large one.

2. Statistics Collection

The two examples above show that, to find the best plan for a SQL statement, we need some metadata about the tables, in other words statistics. At a minimum, the following statistics must be collected:

  • the total row count of the table
  • the average size of each column's values
  • the cardinality of each column, i.e. its NDV (number of distinct values)
  • the number of NULL values in each column

For transactional workloads (row storage), we also need to consider things like the average index length and the value distribution.

For analytical workloads (column storage), we additionally need the minimum and maximum values of the relevant columns, and so on.

Statistics can be collected in several ways; the choice is mainly a trade-off between resource usage and accuracy. Common techniques include:

  • TopN: the n values that occur most frequently.
  • Histogram: a tool for describing the data distribution. Values are partitioned into buckets by magnitude, and each bucket is summarized with a few simple numbers, such as how many values fall into it.
  • 2D histogram: an ordinary histogram cannot capture correlations between columns; a 2D histogram (a joint distribution) can, at the price of much more space.
  • Count-Min Sketch: roughly a hash table combined with counters. It can describe the characteristics of the whole data set with very little space (a toy sketch follows this list).
  • HyperLogLog: a method for estimating the cardinality of huge data sets. In many cases statistics do not need to be exact, and engineering-wise we have to balance resource usage against accuracy. HLL is an algorithm that uses very little memory yet produces a fairly accurate approximation.
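To make the resource/accuracy trade-off concrete, here is a minimal Count-Min Sketch in Go. This is a toy of my own (fixed FNV hashing, made-up names), not TiDB's implementation:

package main

import (
	"fmt"
	"hash/fnv"
)

// cmSketch is a toy Count-Min Sketch: depth rows of width counters.
// An item is hashed into one counter per row; its estimated frequency is
// the minimum of those counters, so the estimate can only overcount.
type cmSketch struct {
	depth, width uint32
	table        [][]uint32
}

func newCMSketch(depth, width uint32) *cmSketch {
	t := make([][]uint32, depth)
	for i := range t {
		t[i] = make([]uint32, width)
	}
	return &cmSketch{depth: depth, width: width, table: t}
}

func (s *cmSketch) hash(item string, row uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(item))
	h.Write([]byte{byte(row)}) // vary the hash per row
	return h.Sum32() % s.width
}

func (s *cmSketch) Insert(item string) {
	for i := uint32(0); i < s.depth; i++ {
		s.table[i][s.hash(item, i)]++
	}
}

func (s *cmSketch) Query(item string) uint32 {
	est := ^uint32(0)
	for i := uint32(0); i < s.depth; i++ {
		if c := s.table[i][s.hash(item, i)]; c < est {
			est = c
		}
	}
	return est
}

func main() {
	s := newCMSketch(4, 1024)
	for i := 0; i < 100; i++ {
		s.Insert("age=18")
	}
	s.Insert("age=30")
	fmt.Println(s.Query("age=18"), s.Query("age=30")) // ~100, ~1 (never under-estimates)
}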

For the statistics TiDB collects, see: https://docs.pingcap.com/zh/tidb/v6.5/statistics#%E7%9B%B4%E6...

3. Cost Evaluation

A single SQL statement may have several viable physical plans. On top of the statistics, we also need to apply some weights. For example:

  1. If the computation can be finished in memory, avoid issuing repeated local I/O.
  2. If it can be finished with local I/O, avoid issuing network requests.

And so on...

TiDB's code ships with default values for exactly these weights (a toy illustration follows the defaults below).

DefOptCPUFactor                                = 3.0
DefOptCopCPUFactor                             = 3.0
DefOptTiFlashConcurrencyFactor                 = 24.0
DefOptNetworkFactor                            = 1.0
DefOptScanFactor                               = 1.5
DefOptDescScanFactor                           = 3.0
DefOptSeekFactor                               = 20.0
DefOptMemoryFactor                             = 0.001
DefOptDiskFactor                               = 1.5
DefOptConcurrencyFactor                        = 3.0
var defaultVer2Factors = costVer2Factors{
    TiDBTemp:      costVer2Factor{"tidb_temp_table_factor", 0.00},
    TiKVScan:      costVer2Factor{"tikv_scan_factor", 40.70},
    TiKVDescScan:  costVer2Factor{"tikv_desc_scan_factor", 61.05},
    TiFlashScan:   costVer2Factor{"tiflash_scan_factor", 11.60},
    TiDBCPU:       costVer2Factor{"tidb_cpu_factor", 49.90},
    TiKVCPU:       costVer2Factor{"tikv_cpu_factor", 49.90},
    TiFlashCPU:    costVer2Factor{"tiflash_cpu_factor", 2.40},
    TiDB2KVNet:    costVer2Factor{"tidb_kv_net_factor", 3.96},
    TiDB2FlashNet: costVer2Factor{"tidb_flash_net_factor", 2.20},
    TiFlashMPPNet: costVer2Factor{"tiflash_mpp_net_factor", 1.00},
    TiDBMem:       costVer2Factor{"tidb_mem_factor", 0.20},
    TiKVMem:       costVer2Factor{"tikv_mem_factor", 0.20},
    TiFlashMem:    costVer2Factor{"tiflash_mem_factor", 0.05},
    TiDBDisk:      costVer2Factor{"tidb_disk_factor", 200.00},
    TiDBRequest:   costVer2Factor{"tidb_request_factor", 6000000.00},
}
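To get a feel for how such factors encode the two heuristics above, here is a toy comparison of my own (a made-up amount of work, and not TiDB's real cost formulas), multiplying the same workload by a few of the v1 defaults:

package main

import "fmt"

// A toy comparison (not TiDB's real formulas): the same amount of "work"
// multiplied by a few of the v1 defaults above shows why the optimizer
// prefers staying in memory over going to the network, and the network
// over re-scanning storage.
func main() {
	const (
		memoryFactor  = 0.001 // DefOptMemoryFactor
		networkFactor = 1.0   // DefOptNetworkFactor
		scanFactor    = 1.5   // DefOptScanFactor
	)
	work := 1_000_000.0 // pretend: one million units of data touched
	fmt.Printf("keep in memory : %.0f\n", work*memoryFactor)  // 1000
	fmt.Printf("send over net  : %.0f\n", work*networkFactor) // 1000000
	fmt.Printf("re-scan storage: %.0f\n", work*scanFactor)    // 1500000
}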

4. Plan Enumeration and Selection

Once we can estimate the cost of a physical plan, we enumerate all the applicable physical plans and pick the one with the lowest cost. Enumeration generally falls into two schools:

  1. Bottom-up: System R is the classic representative.
  2. Top-down: Cascades is the classic representative.

Bottom-up search has a problem it cannot solve: when an upper operator cares about the ordering of its input, looking for a local optimum purely from the bottom level's point of view is not enough.

Take a multi-table join as an example. For the first two tables a HashJoin may be the cheapest choice, but when the next table is joined, a MergeJoin (whose sorted output can be reused upstream) may be cheaper for the plan as a whole. This case shows that picking plans bottom-up is not always appropriate.

Hence Cascades:

  1. The search is top-down, which avoids local optima that hurt the plan globally. While traversing the operator tree top-down, a series of transformations can be applied:
    • Implementation: a logical operator is converted into a physical operator, e.g. a Join becomes a NestLoop or a HashJoin.
    • Exploration: a logical operator is rewritten into logically equivalent forms, e.g. swapping the two children of an inner join enumerates join orders.

Image from: Cascades Optimizer, https://zhuanlan.zhihu.com/p/73545345

  2. It evolved from the Volcano model.
  3. It is modeled in an object-oriented way, so rule authors do not need to care about the search process. Compared with traditional optimizers, where rules are written procedurally one by one, this is a big improvement (a minimal sketch of these abstractions follows this list).
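To make the two kinds of rules and the memo concrete, here is a minimal conceptual sketch. All names are made up for illustration; TiDB's real counterparts live under planner/cascades and planner/memo:

package toy

// LogicalExpr and PhysicalExpr stand in for logical and physical operators.
type LogicalExpr interface{ Children() []LogicalExpr }
type PhysicalExpr interface{ Cost() float64 }

// ExplorationRule rewrites a logical expression into logically equivalent
// ones, e.g. commuting the inputs of an inner join to enumerate join orders.
type ExplorationRule interface {
	Match(e LogicalExpr) bool
	Transform(e LogicalExpr) []LogicalExpr
}

// ImplementationRule turns a logical expression into physical candidates,
// e.g. Join -> HashJoin / MergeJoin / NestedLoopJoin.
type ImplementationRule interface {
	Match(e LogicalExpr) bool
	Implement(e LogicalExpr) []PhysicalExpr
}

// Group is the memo node: it holds all logically equivalent expressions and
// caches the best implementation per required physical property, so the same
// sub-problem is never solved twice during the top-down search.
type Group struct {
	Equivalents []LogicalExpr
	bestImpl    map[string]PhysicalExpr // keyed by the required physical property
}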

5. TiDB's Implementation

The rough call chain is:

|-- session/session.go
    \-- ExecuteStmt                       // entry point of SQL execution
|-- executor/compiler.go
    \-- Compile                           // turns the SQL into an executable plan
|-- planner/optimize.go
    \-- Optimize                          // entry point of optimization
    \-- optimize                          // two entrances: the new (cascades) optimizer and the old one, chosen by configuration; both return the best physical plan
        |-- planner/cascades/optimize.go
            \-- FindBestPlan              // see 5.1
            \-- onPhasePreprocessing
            \-- implGroup                 // see 5.3
        |-- planner/core/optimizer.go     // see 5.4
            \-- DoOptimize
            \-- physicalOptimize
            |-- planner/core/find_best_task.go
                \-- findBestTask
                \-- enumeratePhysicalPlans4Task
                \-- compareTaskCost
                \-- getTaskPlanCost
                |-- planner/core/plan_cost_ver2.go
                    \-- getPlanCost

5.1 Logical Optimization

The core entry point is:


// FindBestPlan is the optimization entrance of the cascades planner. The
// optimization is composed of 3 phases: preprocessing, exploration and implementation.
//
// ------------------------------------------------------------------------------
// Phase 1: Preprocessing
// ------------------------------------------------------------------------------
//
// The target of this phase is to preprocess the plan tree by some heuristic
// rules which should always be beneficial, for example Column Pruning.
//
// ------------------------------------------------------------------------------
// Phase 2: Exploration
// ------------------------------------------------------------------------------
//
// The target of this phase is to explore all the logically equivalent
// expressions by exploring all the equivalent group expressions of each group.
//
// At the very beginning, there is only one group expression in a Group. After
// applying some transformation rules on certain expressions of the Group, all
// the equivalent expressions are found and stored in the Group. This procedure
// can be regarded as searching for a weak connected component in a directed
// graph, where nodes are expressions and directed edges are the transformation
// rules.
//
// ------------------------------------------------------------------------------
// Phase 3: Implementation
// ------------------------------------------------------------------------------
//
// The target of this phase is to search the best physical plan for a Group
// which satisfies a certain required physical property.
//
// In this phase, we need to enumerate all the applicable implementation rules
// for each expression in each group under the required physical property. A
// memo structure is used for a group to reduce the repeated search on the same
// required physical property.
func (opt *Optimizer) FindBestPlan(sctx sessionctx.Context, logical plannercore.LogicalPlan) (p plannercore.PhysicalPlan, cost float64, err error) {
    logical, err = opt.onPhasePreprocessing(sctx, logical)
    if err != nil {
        return nil, 0, err
    }
    rootGroup := memo.Convert2Group(logical)
    err = opt.onPhaseExploration(sctx, rootGroup)
    if err != nil {
        return nil, 0, err
    }
    p, cost, err = opt.onPhaseImplementation(sctx, rootGroup)
    if err != nil {
        return nil, 0, err
    }
    err = p.ResolveIndices()
    return p, cost, err
}

The comments plus the code are very clean. Three things happen here:

  1. onPhasePreprocessing: the comment is honest; it says "for example Column Pruning", and column pruning really is the only thing done inside (a toy sketch follows this list).
  2. onPhaseExploration: explores all logically equivalent alternatives.
  3. onPhaseImplementation: searches for the cheapest physical plan based on cost.
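As promised above, here is a toy sketch of what column pruning does (illustrative types of my own, not the actual rule):

package toy

// A toy sketch of column pruning (illustration only): the columns required
// by the parent operators are pushed down, and each node keeps only those
// columns in its output schema. For `SELECT id, name FROM person WHERE
// age >= 18`, the scan only needs to produce id, name (for the projection)
// and age (for the filter), not every column of person.
type toyScan struct {
	tableCols []string // all columns of the table
	output    []string // columns actually produced after pruning
}

func (s *toyScan) pruneColumns(needed []string) {
	keep := make(map[string]bool, len(needed))
	for _, c := range needed {
		keep[c] = true
	}
	s.output = s.output[:0]
	for _, c := range s.tableCols {
		if keep[c] {
			s.output = append(s.output, c)
		}
	}
}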

The official blog post explains this part very well, so I will not repeat it: https://cn.pingcap.com/blog/tidb-cascades-planner/

5.2 Statistics Collection

This code is concentrated mainly in stats.go, and the core data structure is StatsInfo in stats_info.go. The call chain is roughly:

|-- planner/cascades/optimize.go
    \-- fillGroupStats
|-- planner/core/stats.go
    \-- DeriveStats

type LogicalPlan interface {
    Plan
    // ...... some code omitted
    // DeriveStats derives statistic info for current plan node given child stats.
    // We need selfSchema, childSchema here because it makes this method can be used in
    // cascades planner, where LogicalPlan might not record its children or schema.
    DeriveStats(childStats []*property.StatsInfo, selfSchema *expression.Schema, childSchema []*expression.Schema, colGroups [][]*expression.Column) (*property.StatsInfo, error)
    // ...... some code omitted
}

Quite a few structs implement this method.

The raw statistics themselves are collected mainly by the analyzeWorker invoked from the Next method in analyze.go.
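As a rough illustration of what a DeriveStats implementation computes, here is a simplified sketch with my own types; TiDB's real implementations rely on the histogram and Count-Min-Sketch based selectivity estimation:

package toy

// A simplified sketch of the idea behind DeriveStats (not TiDB's code):
// a selection node takes its child's stats and scales the row count by the
// estimated selectivity of its predicate.
type toyStatsInfo struct {
	RowCount float64
	NDV      map[string]float64 // column name -> number of distinct values
}

// deriveSelectionStats assumes a single equality predicate on col and uses
// the classic 1/NDV selectivity as a stand-in for TiDB's histogram and
// Count-Min-Sketch based estimation.
func deriveSelectionStats(child *toyStatsInfo, col string) *toyStatsInfo {
	selectivity := 1.0
	if ndv, ok := child.NDV[col]; ok && ndv > 0 {
		selectivity = 1.0 / ndv
	}
	return &toyStatsInfo{
		RowCount: child.RowCount * selectivity,
		NDV:      child.NDV,
	}
}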

5.3 Physical Plan Selection (New Optimizer)

The code entry point is:

// implGroup finds the best Implementation which satisfies the required
// physical property for a Group. The best Implementation should have the
// lowest cost among all the applicable Implementations.
//
// g:            the Group to be implemented.
// reqPhysProp: the required physical property.
// costLimit:   the maximum cost of all the Implementations.
func (opt *Optimizer) implGroup(g *memo.Group, reqPhysProp *property.PhysicalProperty, costLimit float64) (memo.Implementation, error) {
    groupImpl := g.GetImpl(reqPhysProp)
    if groupImpl != nil {
        if groupImpl.GetCost() <= costLimit {
            return groupImpl, nil
        }
        return nil, nil
    }
    // Handle implementation rules for each equivalent GroupExpr.
    var childImpls []memo.Implementation
    err := opt.fillGroupStats(g)
    if err != nil {
        return nil, err
    }
    outCount := math.Min(g.Prop.Stats.RowCount, reqPhysProp.ExpectedCnt)
    for elem := g.Equivalents.Front(); elem != nil; elem = elem.Next() {
        curExpr := elem.Value.(*memo.GroupExpr)
        impls, err := opt.implGroupExpr(curExpr, reqPhysProp)
        if err != nil {
            return nil, err
        }
        for _, impl := range impls {
            childImpls = childImpls[:0]
            for i, childGroup := range curExpr.Children {
                childImpl, err := opt.implGroup(childGroup, impl.GetPlan().GetChildReqProps(i), impl.GetCostLimit(costLimit, childImpls...))
                if err != nil {
                    return nil, err
                }
                if childImpl == nil {
                    impl.SetCost(math.MaxFloat64)
                    break
                }
                childImpls = append(childImpls, childImpl)
            }
            if impl.GetCost() == math.MaxFloat64 {
                continue
            }
            implCost := impl.CalcCost(outCount, childImpls...)
            if implCost > costLimit {
                continue
            }
            if groupImpl == nil || groupImpl.GetCost() > implCost {
                groupImpl = impl.AttachChildren(childImpls...)
                costLimit = implCost
            }
        }
    }
    // Handle enforcer rules for required physical property.
    for _, rule := range GetEnforcerRules(g, reqPhysProp) {
        newReqPhysProp := rule.NewProperty(reqPhysProp)
        enforceCost := rule.GetEnforceCost(g)
        childImpl, err := opt.implGroup(g, newReqPhysProp, costLimit-enforceCost)
        if err != nil {
            return nil, err
        }
        if childImpl == nil {
            continue
        }
        impl := rule.OnEnforce(reqPhysProp, childImpl)
        implCost := enforceCost + childImpl.GetCost()
        impl.SetCost(implCost)
        if groupImpl == nil || groupImpl.GetCost() > implCost {
            groupImpl = impl
            costLimit = implCost
        }
    }
    if groupImpl == nil || groupImpl.GetCost() == math.MaxFloat64 {
        return nil, nil
    }
    g.InsertImpl(reqPhysProp, groupImpl)
    return groupImpl, nil
}

This function finds candidate physical plans that satisfy the requirement and keeps searching among them. It also contains a pruning rule: it tracks the lowest cost found so far, and when a plan being searched top-down exceeds that cost, the search returns immediately. This is the top-down optimization mentioned in section 4.
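Stripped of everything else, the cost-limit pruning looks roughly like this (a toy sketch of my own, not implGroup itself):

package toy

// The best cost found so far acts as a budget; any candidate that already
// exceeds it is dropped without implementing its children.
func pickCheapest(candidateCosts []float64, costLimit float64) (best float64, found bool) {
	best = costLimit
	for _, cost := range candidateCosts {
		if cost > best {
			continue // pruned: cannot beat the best complete plan found so far
		}
		// In implGroup, the remaining budget (via impl.GetCostLimit) is passed
		// down when implementing the children of this candidate.
		best, found = cost, true
	}
	return best, found
}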

Next, let's look at the definition of memo.Implementation:


package memo

import (
    plannercore "github.com/pingcap/tidb/planner/core"
)

// Implementation defines the interface for cost of physical plan.
type Implementation interface {
    CalcCost(outCount float64, children ...Implementation) float64
    SetCost(cost float64)
    GetCost() float64
    GetPlan() plannercore.PhysicalPlan

    // AttachChildren is used to attach children implementations and returns it self.
    AttachChildren(children ...Implementation) Implementation

    // GetCostLimit gets the costLimit for implementing the next childGroup.
    GetCostLimit(costLimit float64, children ...Implementation) float64
}

The CalcCost method is what computes a physical plan's cost, and quite a few structs implement it.

5.3.1 Cost Evaluation

Let's use the examples from the introduction to walk through how the cost is evaluated.

Cost calculation for SELECT
// CalcCost calculates the cost of the table scan Implementation.
func (impl *TableScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
    ts := impl.plan.(*plannercore.PhysicalTableScan)
    width := impl.tblColHists.GetTableAvgRowSize(impl.plan.SCtx(), impl.tblCols, kv.TiKV, true)
    sessVars := ts.SCtx().GetSessionVars()
    impl.cost = outCount * sessVars.GetScanFactor(ts.Table) * width
    if ts.Desc {
        impl.cost = outCount * sessVars.GetDescScanFactor(ts.Table) * width
    }
    return impl.cost
}

// GetScanFactor returns the session variable scanFactor
// returns 0 when tbl is a temporary table.
func (s *SessionVars) GetScanFactor(tbl *model.TableInfo) float64 {
    if tbl != nil {
        if tbl.TempTableType != model.TempTableNone {
            return 0
        }
    }
    return s.scanFactor 
}


// CalcCost implements Implementation interface.
func (impl *IndexScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
    is := impl.plan.(*plannercore.PhysicalIndexScan)
    sessVars := is.SCtx().GetSessionVars()
    rowSize := impl.tblColHists.GetIndexAvgRowSize(is.SCtx(), is.Schema().Columns, is.Index.Unique)
    cost := outCount * rowSize * sessVars.GetScanFactor(is.Table)
    if is.Desc {
        cost = outCount * rowSize * sessVars.GetDescScanFactor(is.Table)
    }
    cost += float64(len(is.Ranges)) * sessVars.GetSeekFactor(is.Table)
    impl.cost = cost
    return impl.cost
}

Here we take the full table scan and the index scan as examples. The default scanFactor is 1.5. Note that tableScan multiplies by width, the average row size over all of the table's columns, while indexScan's rowSize only covers the index columns (plus a per-range seek cost), which is usually much smaller. Seen this way, the index scan basically always wins.

In my view this code is slightly inelegant: when Desc is true, the cost computed just before is thrown away and recomputed, wasting CPU.
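A small refactor along those lines could look like this (my own sketch reusing the identifiers of the method above, not an actual TiDB patch):

// CalcCost, reworked so the scan factor is chosen first and the cost is
// computed only once (a sketch of the suggestion above, not TiDB code).
func (impl *TableScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
    ts := impl.plan.(*plannercore.PhysicalTableScan)
    width := impl.tblColHists.GetTableAvgRowSize(impl.plan.SCtx(), impl.tblCols, kv.TiKV, true)
    sessVars := ts.SCtx().GetSessionVars()
    factor := sessVars.GetScanFactor(ts.Table)
    if ts.Desc {
        factor = sessVars.GetDescScanFactor(ts.Table)
    }
    impl.cost = outCount * factor * width
    return impl.cost
}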

Cost calculation for JOIN
// CalcCost implements Implementation CalcCost interface.
func (impl *HashJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
    hashJoin := impl.plan.(*plannercore.PhysicalHashJoin)
    // The children here are only used to calculate the cost.
    hashJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
    selfCost := hashJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), false, 0, nil)
    impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
    return impl.cost
}

// CalcCost implements Implementation CalcCost interface.
func (impl *MergeJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
    mergeJoin := impl.plan.(*plannercore.PhysicalMergeJoin)
    // The children here are only used to calculate the cost.
    mergeJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
    selfCost := mergeJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), 0)
    impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
    return impl.cost
}

The actual computation lives in plan_cost_ver1.go:


// GetCost computes cost of hash join operator itself.
func (p *PhysicalHashJoin) GetCost(lCnt, rCnt float64, isMPP bool, costFlag uint64, op *physicalOptimizeOp) float64 {
    buildCnt, probeCnt := lCnt, rCnt
    build := p.children[0]
    // Taking the right as the inner for right join or using the outer to build a hash table.
    if (p.InnerChildIdx == 1 && !p.UseOuterToBuild) || (p.InnerChildIdx == 0 && p.UseOuterToBuild) {
        buildCnt, probeCnt = rCnt, lCnt
        build = p.children[1]
    }
    sessVars := p.ctx.GetSessionVars()
    oomUseTmpStorage := variable.EnableTmpStorageOnOOM.Load()
    memQuota := sessVars.MemTracker.GetBytesLimit() // sessVars.MemQuotaQuery && hint
    rowSize := getAvgRowSize(build.statsInfo(), build.Schema().Columns)
    spill := oomUseTmpStorage && memQuota > 0 && rowSize*buildCnt > float64(memQuota) && p.storeTp != kv.TiFlash
    // Cost of building hash table.
    cpuFactor := sessVars.GetCPUFactor()
    diskFactor := sessVars.GetDiskFactor()
    memoryFactor := sessVars.GetMemoryFactor()
    concurrencyFactor := sessVars.GetConcurrencyFactor()

    cpuCost := buildCnt * cpuFactor
    memoryCost := buildCnt * memoryFactor
    diskCost := buildCnt * diskFactor * rowSize
    // Number of matched row pairs regarding the equal join conditions.
    helper := &fullJoinRowCountHelper{
        sctx:            p.SCtx(),
        cartesian:       false,
        leftProfile:     p.children[0].statsInfo(),
        rightProfile:    p.children[1].statsInfo(),
        leftJoinKeys:    p.LeftJoinKeys,
        rightJoinKeys:   p.RightJoinKeys,
        leftSchema:      p.children[0].Schema(),
        rightSchema:     p.children[1].Schema(),
        leftNAJoinKeys:  p.LeftNAJoinKeys,
        rightNAJoinKeys: p.RightNAJoinKeys,
    }
    numPairs := helper.estimate()
    // For semi-join class, if `OtherConditions` is empty, we already know
    // the join results after querying hash table, otherwise, we have to
    // evaluate those resulted row pairs after querying hash table; if we
    // find one pair satisfying the `OtherConditions`, we then know the
    // join result for this given outer row, otherwise we have to iterate
    // to the end of those pairs; since we have no idea about when we can
    // terminate the iteration, we assume that we need to iterate half of
    // those pairs in average.
    if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
        p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
        if len(p.OtherConditions) > 0 {
            numPairs *= 0.5
        } else {
            numPairs = 0
        }
    }
    if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
        numPairs = getOperatorActRows(p)
    }
    // Cost of querying hash table is cheap actually, so we just compute the cost of
    // evaluating `OtherConditions` and joining row pairs.
    probeCost := numPairs * cpuFactor
    probeDiskCost := numPairs * diskFactor * rowSize
    // Cost of evaluating outer filter.
    if len(p.LeftConditions)+len(p.RightConditions) > 0 {
        // Input outer count for the above compution should be adjusted by SelectionFactor.
        probeCost *= SelectionFactor
        probeDiskCost *= SelectionFactor
        probeCost += probeCnt * cpuFactor
    }
    diskCost += probeDiskCost
    probeCost /= float64(p.Concurrency)
    // Cost of additional concurrent goroutines.
    cpuCost += probeCost + float64(p.Concurrency+1)*concurrencyFactor
    // Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
    if p.UseOuterToBuild {
        if spill {
            // It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
            cpuCost += buildCnt * cpuFactor
        } else {
            cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
        }
        diskCost += buildCnt * diskFactor * rowSize
    }

    if spill {
        memoryCost *= float64(memQuota) / (rowSize * buildCnt)
    } else {
        diskCost = 0
    }
    if op != nil {
        setPhysicalHashJoinCostDetail(p, op, spill, buildCnt, probeCnt, cpuFactor, rowSize, numPairs,
            cpuCost, probeCost, memoryCost, diskCost, probeDiskCost,
            diskFactor, memoryFactor, concurrencyFactor,
            memQuota)
    }
    return cpuCost + memoryCost + diskCost
}


// GetCost computes cost of merge join operator itself.
func (p *PhysicalMergeJoin) GetCost(lCnt, rCnt float64, costFlag uint64) float64 {
    outerCnt := lCnt
    innerCnt := rCnt
    innerKeys := p.RightJoinKeys
    innerSchema := p.children[1].Schema()
    innerStats := p.children[1].statsInfo()
    if p.JoinType == RightOuterJoin {
        outerCnt = rCnt
        innerCnt = lCnt
        innerKeys = p.LeftJoinKeys
        innerSchema = p.children[0].Schema()
        innerStats = p.children[0].statsInfo()
    }
    helper := &fullJoinRowCountHelper{
        sctx:          p.SCtx(),
        cartesian:     false,
        leftProfile:   p.children[0].statsInfo(),
        rightProfile:  p.children[1].statsInfo(),
        leftJoinKeys:  p.LeftJoinKeys,
        rightJoinKeys: p.RightJoinKeys,
        leftSchema:    p.children[0].Schema(),
        rightSchema:   p.children[1].Schema(),
    }
    numPairs := helper.estimate()
    if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
        p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
        if len(p.OtherConditions) > 0 {
            numPairs *= 0.5
        } else {
            numPairs = 0
        }
    }
    if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
        numPairs = getOperatorActRows(p)
    }
    sessVars := p.ctx.GetSessionVars()
    probeCost := numPairs * sessVars.GetCPUFactor()
    // Cost of evaluating outer filters.
    var cpuCost float64
    if len(p.LeftConditions)+len(p.RightConditions) > 0 {
        probeCost *= SelectionFactor
        cpuCost += outerCnt * sessVars.GetCPUFactor()
    }
    cpuCost += probeCost
    // For merge join, only one group of rows with same join key(not null) are cached,
    // we compute average memory cost using estimated group size.
    NDV, _ := getColsNDVWithMatchedLen(innerKeys, innerSchema, innerStats)
    memoryCost := (innerCnt / NDV) * sessVars.GetMemoryFactor()
    return cpuCost + memoryCost
}

HashJoin has to account for running out of memory: when memory is insufficient, it spills the corresponding data to disk. But since it imposes no ordering requirement on the data, the work can be done concurrently. See:

    // Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
    if p.UseOuterToBuild {
        if spill {
            // It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
            cpuCost += buildCnt * cpuFactor
        } else {
            cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
        }
        diskCost += buildCnt * diskFactor * rowSize
    }

MergeJoin is clearly cheaper, but qualifying for this plan has a stricter requirement: the Merge Join algorithm applies when the join keys of both tables can be read in already-sorted order.

5.4 Physical Plan Selection (Old Optimizer)

5.4.1 Cost Evaluation

This code lives mainly in plan_cost_ver1.go and plan_cost_ver2.go. The v2 cost formulas were recalibrated with more precise regression and some of them were adjusted, so they are more accurate than the earlier version. The code is also tidier: v2 exposes a single public method and dispatches internally based on the plan type.

// GetPlanCost returns the cost of this plan.
func GetPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
    return getPlanCost(p, taskType, option)
}

func getPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
    if p.SCtx().GetSessionVars().CostModelVersion == modelVer2 {
        planCost, err := p.getPlanCostVer2(taskType, option)
        return planCost.cost, err
    }
    return p.getPlanCostVer1(taskType, option)
}

Depending on the concrete PhysicalPlan type, a different bound method is dispatched to; v1 has its own set of such methods, and below we look at the v2 versions.

Cost calculation for SELECT
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = child-cost + filter-cost
func (p *PhysicalSelection) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    inputRows := getCardinality(p.children[0], option.CostFlag)
    cpuFactor := getTaskCPUFactorVer2(p, taskType)

    filterCost := filterCostVer2(option, inputRows, p.Conditions, cpuFactor)

    childCost, err := p.children[0].getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }

    p.planCostVer2 = sumCostVer2(filterCost, childCost)
    p.planCostInit = true
    return p.planCostVer2, nil
}

This part of the code is easy to read: the cost is simply the child's cost plus the filtering cost.

But wait: hitting an index and not hitting an index should cost differently, and that is not visible here. Look closely at childCost, err := p.children[0].getPlanCostVer2(taskType, option): it recursively fetches the cost of the child physical plan, and that is where the difference shows up.

// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
func (p *PointGetPlan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    if p.accessCols == nil { // from fast plan code path
        p.planCostVer2 = zeroCostVer2
        p.planCostInit = true
        return zeroCostVer2, nil
    }
    rowSize := getAvgRowSize(p.stats, p.schema.Columns)
    netFactor := getTaskNetFactorVer2(p, taskType)

    p.planCostVer2 = netCostVer2(option, 1, rowSize, netFactor)
    p.planCostInit = true
    return p.planCostVer2, nil
}

func netCostVer2(option *PlanCostOption, rows, rowSize float64, netFactor costVer2Factor) costVer2 {
    return newCostVer2(option, netFactor,
        rows*rowSize*netFactor.Value,
        func() string { return fmt.Sprintf("net(%v*rowsize(%v)*%v)", rows, rowSize, netFactor) })
}



// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = rows * log2(row-size) * scan-factor
// log2(row-size) is from experiments.
func (p *PhysicalTableScan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    rows := getCardinality(p, option.CostFlag)
    var rowSize float64
    if p.StoreType == kv.TiKV {
        rowSize = getAvgRowSize(p.stats, p.tblCols) // consider all columns if TiKV
    } else { // TiFlash
        rowSize = getAvgRowSize(p.stats, p.schema.Columns)
    }
    rowSize = math.Max(rowSize, 2.0)
    scanFactor := getTaskScanFactorVer2(p, p.StoreType, taskType)

    p.planCostVer2 = scanCostVer2(option, rows, rowSize, scanFactor)

    // give TiFlash a start-up cost to let the optimizer prefers to use TiKV to process small table scans.
    if p.StoreType == kv.TiFlash {
        p.planCostVer2 = sumCostVer2(p.planCostVer2, scanCostVer2(option, 10000, rowSize, scanFactor))
    }

    p.planCostInit = true
    return p.planCostVer2, nil
}

func scanCostVer2(option *PlanCostOption, rows, rowSize float64, scanFactor costVer2Factor) costVer2 {
    if rowSize < 1 {
        rowSize = 1
    }
    return newCostVer2(option, scanFactor,
        // rows * log(row-size) * scanFactor, log2 from experiments
        rows*math.Log2(rowSize)*scanFactor.Value,
        func() string { return fmt.Sprintf("scan(%v*logrowsize(%v)*%v)", rows, rowSize, scanFactor) })
}

The default scanFactor is 40.7 and the default netFactor is 3.96. Combined with the code above, hitting the index clearly comes out cheaper.
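Plugging the defaults into the formulas above makes the gap concrete (a back-of-the-envelope check with made-up cardinalities):

package main

import (
	"fmt"
	"math"
)

// A rough comparison using the ver2 formulas shown above, with made-up
// numbers: a point get returning one 64-byte row versus a full TiKV scan
// of 1,000,000 such rows.
func main() {
	const (
		rowSize    = 64.0
		netFactor  = 3.96  // tidb_kv_net_factor
		scanFactor = 40.70 // tikv_scan_factor
	)
	pointGet := 1 * rowSize * netFactor                      // netCostVer2: rows * rowSize * netFactor
	tableScan := 1_000_000 * math.Log2(rowSize) * scanFactor // scanCostVer2: rows * log2(rowSize) * scanFactor
	fmt.Printf("point get : %.2f\n", pointGet)  // ~253
	fmt.Printf("table scan: %.2e\n", tableScan) // ~2.4e+08
}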

Cost calculation for JOIN
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = build-child-cost + build-filter-cost +
// (probe-cost + probe-filter-cost) / concurrency
// probe-cost = probe-child-cost * build-rows / batchRatio
func (p *PhysicalIndexJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 0)
}

func (p *PhysicalIndexHashJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 1)
}

func (p *PhysicalIndexMergeJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
    return p.getIndexJoinCostVer2(taskType, option, 2)
}

func (p *PhysicalIndexJoin) getIndexJoinCostVer2(taskType property.TaskType, option *PlanCostOption, indexJoinType int) (costVer2, error) {
    if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
        return p.planCostVer2, nil
    }

    build, probe := p.children[1-p.InnerChildIdx], p.children[p.InnerChildIdx]
    buildRows := getCardinality(build, option.CostFlag)
    buildRowSize := getAvgRowSize(build.Stats(), build.Schema().Columns)
    probeRowsOne := getCardinality(probe, option.CostFlag)
    probeRowsTot := probeRowsOne * buildRows
    probeRowSize := getAvgRowSize(probe.Stats(), probe.Schema().Columns)
    buildFilters, probeFilters := p.LeftConditions, p.RightConditions
    probeConcurrency := float64(p.ctx.GetSessionVars().IndexLookupJoinConcurrency())
    cpuFactor := getTaskCPUFactorVer2(p, taskType)
    memFactor := getTaskMemFactorVer2(p, taskType)
    requestFactor := getTaskRequestFactorVer2(p, taskType)

    buildFilterCost := filterCostVer2(option, buildRows, buildFilters, cpuFactor)
    buildChildCost, err := build.getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }
    buildTaskCost := newCostVer2(option, cpuFactor,
        buildRows*10*cpuFactor.Value,
        func() string { return fmt.Sprintf("cpu(%v*10*%v)", buildRows, cpuFactor) })
    startCost := newCostVer2(option, cpuFactor,
        10*3*cpuFactor.Value,
        func() string { return fmt.Sprintf("cpu(10*3*%v)", cpuFactor) })

    probeFilterCost := filterCostVer2(option, probeRowsTot, probeFilters, cpuFactor)
    probeChildCost, err := probe.getPlanCostVer2(taskType, option)
    if err != nil {
        return zeroCostVer2, err
    }

    var hashTableCost costVer2
    switch indexJoinType {
    case 1: // IndexHashJoin
        hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
    case 2: // IndexMergeJoin
        hashTableCost = newZeroCostVer2(traceCost(option))
    default: // IndexJoin
        hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
    }

    // IndexJoin executes a batch of rows at a time, so the actual cost of this part should be
    //  `innerCostPerBatch * numberOfBatches` instead of `innerCostPerRow * numberOfOuterRow`.
    // Use an empirical value batchRatio to handle this now.
    // TODO: remove this empirical value.
    batchRatio := 6.0
    probeCost := divCostVer2(mulCostVer2(probeChildCost, buildRows), batchRatio)

    // Double Read Cost
    doubleReadCost := newZeroCostVer2(traceCost(option))
    if p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate > 0 {
        batchSize := float64(p.ctx.GetSessionVars().IndexJoinBatchSize)
        taskPerBatch := 1024.0 // TODO: remove this magic number
        doubleReadTasks := buildRows / batchSize * taskPerBatch
        doubleReadCost = doubleReadCostVer2(option, doubleReadTasks, requestFactor)
        doubleReadCost = mulCostVer2(doubleReadCost, p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate)
    }

    p.planCostVer2 = sumCostVer2(startCost, buildChildCost, buildFilterCost, buildTaskCost, divCostVer2(sumCostVer2(doubleReadCost, probeCost, probeFilterCost, hashTableCost), probeConcurrency))
    p.planCostInit = true
    return p.planCostVer2, nil
}

The key part is:

    switch indexJoinType {
    case 1: // IndexHashJoin
        hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
    case 2: // IndexMergeJoin
        hashTableCost = newZeroCostVer2(traceCost(option))
    default: // IndexJoin
        hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
    }

The corresponding methods:

func hashBuildCostVer2(option *PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costVer2Factor) costVer2 {
    // TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
    hashKeyCost := newCostVer2(option, cpuFactor,
        buildRows*nKeys*cpuFactor.Value,
        func() string { return fmt.Sprintf("hashkey(%v*%v*%v)", buildRows, nKeys, cpuFactor) })
    hashMemCost := newCostVer2(option, memFactor,
        buildRows*buildRowSize*memFactor.Value,
        func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
    hashBuildCost := newCostVer2(option, cpuFactor,
        buildRows*cpuFactor.Value,
        func() string { return fmt.Sprintf("hashbuild(%v*%v)", buildRows, cpuFactor) })
    return sumCostVer2(hashKeyCost, hashMemCost, hashBuildCost)
}

func newZeroCostVer2(trace bool) (ret costVer2) {
    if trace {
        ret.trace = &costTrace{make(map[string]float64), ""}
    }
    return
}

Skimming the code, we can see that in most scenarios the three joins rank, from cheapest to most expensive, as MergeJoin < HashJoin < IndexJoin.

5.4.2 Plan Enumeration and Selection

Overall this code is fairly straightforward: in essence it enumerates all the qualifying physical plans and picks the one with the lowest cost, so I will not paste the code here. Interested readers can browse it themselves following this outline:

|-- planner/core/find_best_task.go
    \-- findBestTask
    \-- enumeratePhysicalPlans4Task
    \-- compareTaskCost
    \-- getTaskPlanCost
    |-- planner/core/plan_cost_ver2.go
        \-- getPlanCost

6. Miscellaneous

6.1 Referenced Articles

6.2 Background: code generation && vectorized execution

Code generation and vectorized execution are two of the best-known optimization techniques in database execution engines.

Code generation produces a single piece of specialized, optimized code from the query context, the exact opposite of code riddled with nested if...else branches, loops and virtual calls; it is designed purely for performance.

Vectorized execution is based on the pull model. Compared with pulling one tuple at a time, pulling in batches amortizes the per-call overhead and also enables SIMD. Because of this, vectorized execution is particularly well suited to column-oriented databases.
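A minimal sketch of the difference between the two pull styles (illustrative interfaces of my own; TiDB's executors actually pull batches of rows as chunk.Chunk):

package toy

// Tuple-at-a-time pull model: one Next() call per row.
type rowIterator interface {
	Next() (row int64, ok bool)
}

// Vectorized pull model: one call returns a whole batch, amortizing the
// per-call overhead and letting the tight loop over the batch benefit from
// SIMD and better cache behavior.
type batchIterator interface {
	NextBatch(buf []int64) (n int)
}

func sumRows(it rowIterator) (s int64) {
	for {
		v, ok := it.Next()
		if !ok {
			return s
		}
		s += v
	}
}

func sumBatches(it batchIterator) (s int64) {
	buf := make([]int64, 1024)
	for {
		n := it.NextBatch(buf)
		if n == 0 {
			return s
		}
		for _, v := range buf[:n] { // tight loop over a contiguous batch
			s += v
		}
	}
}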
