版本 | 日期 | 備註 |
---|---|---|
1.0 | 2024.2.18 | 文章首發 |
本文的的原始碼分析全部基於TiDB6.5來做分析。
1.引子
如果讓你做一個分散式資料庫的最佳化器,面對以下的SQL,你會想到什麼好的方法去執行他們呢?
SELECT id, name FROM person WHERE age >= 18 or height > 180 limit 100;
:從條件上看,我們看到條件其實是二選一的:age >= 18 or height > 180
。基於這種情況,我們肯定會去選擇有索引的資料,如果都有索引or都沒有,那麼肯定選擇掃描行數最少的資料。如果有一些運算元在裡面的話,則額外需要考慮資料的就近原則——有些運算元在部分架構下可以充分利用MPP的能力,而有些則不行。SELECT orders.order_id, customers.customer_name, orders.order_date FROM orders LEFT JOIN customers ON orders.customer_id=customers.customer_id;
分散式資料庫中的join,最優的方式就是小表廣播到大表資料所在的地方。那麼首先我們得知道誰是小表,誰是大表。
2.統計資訊收集
根據前面的兩個例子,我們可以發現——如果我們需要找到SQL對應的最佳計劃,我們會需要一些表的後設資料,或者說是統計資訊。從常規的角度來說,以下統計資訊是必須收集的:
- 表的總行數
- 每列資料的平均大小
- 每列資料的基數:即NDV(distinct value)
- 列的NULL值個數
如果是事務型的(行式儲存),那麼還要考慮索引平均長度、值的分佈等等。
如果是分析型的(列式儲存),那麼還需要考慮相關列的最大值、最小值等等。
而統計方式的收集,會有多種方式。主要是考慮資源和準確性之間的Trade off。常見的有:
- TopN:相關資料出現次數前 n 的值。
- 直方圖:用於描述資料分佈圖的工具。按照資料的值大小進行分桶,並用一些簡單的資料來描述每個桶,比如落在桶裡的值的個數。
- 2D直方圖:由於直方圖無法反應列之間的關聯,可以用2D直方圖(聯合分佈)做到,但空間佔用也比較多。
- Count-Min Sketch:類似雜湊表加上計算器的實現。可以用很少的資料來描述全體資料的特性。
- HyperLogLog:一種估算海量資料基數的方法。很多情況下,統計並不需要那麼精確。工程方面要在使用資源和準確性裡找平衡。因此有人提出用HLL,這是一種佔用資源少,但能給出較為準確的近似結果的演算法。
TiDB收集的統計資訊見:https://docs.pingcap.com/zh/tidb/v6.5/statistics#%E7%9B%B4%E6...
3.代價的評估
一個SQL真正的物理執行計劃可能是有多個的。在以統計資訊為基礎之上,我們還需要加入相應的權重,舉個例子:
- 如果能夠在記憶體中計算完成,就不用去反覆發起本地IO
- 如果能在本地IO中完成,就不要去發起網路請求
等等...
這在TiDB的程式碼中也有對應的預設值。
DefOptCPUFactor = 3.0
DefOptCopCPUFactor = 3.0
DefOptTiFlashConcurrencyFactor = 24.0
DefOptNetworkFactor = 1.0
DefOptScanFactor = 1.5
DefOptDescScanFactor = 3.0
DefOptSeekFactor = 20.0
DefOptMemoryFactor = 0.001
DefOptDiskFactor = 1.5
DefOptConcurrencyFactor = 3.0
var defaultVer2Factors = costVer2Factors{
TiDBTemp: costVer2Factor{"tidb_temp_table_factor", 0.00},
TiKVScan: costVer2Factor{"tikv_scan_factor", 40.70},
TiKVDescScan: costVer2Factor{"tikv_desc_scan_factor", 61.05},
TiFlashScan: costVer2Factor{"tiflash_scan_factor", 11.60},
TiDBCPU: costVer2Factor{"tidb_cpu_factor", 49.90},
TiKVCPU: costVer2Factor{"tikv_cpu_factor", 49.90},
TiFlashCPU: costVer2Factor{"tiflash_cpu_factor", 2.40},
TiDB2KVNet: costVer2Factor{"tidb_kv_net_factor", 3.96},
TiDB2FlashNet: costVer2Factor{"tidb_flash_net_factor", 2.20},
TiFlashMPPNet: costVer2Factor{"tiflash_mpp_net_factor", 1.00},
TiDBMem: costVer2Factor{"tidb_mem_factor", 0.20},
TiKVMem: costVer2Factor{"tikv_mem_factor", 0.20},
TiFlashMem: costVer2Factor{"tiflash_mem_factor", 0.05},
TiDBDisk: costVer2Factor{"tidb_disk_factor", 200.00},
TiDBRequest: costVer2Factor{"tidb_request_factor", 6000000.00},
}
4.執行計劃列舉與擇優
當我們可以評估出物理執行計劃的代價時,將會列舉所有可以用上物理執行計劃,並在其中選擇代價最小的物理執行計劃。一般列舉分為兩個流派:
- 自底向上:代表有System R。
- 自頂向下:代表有Cascade。
自底向上沒法解決一個問題。當上層對下層的輸出結果順序感興趣時,那就不能只能從底層的視角來尋找區域性最優。
舉個例子,多表Join。一開始兩個表Join可能HashJoin代價很低,但是Join下一個表時,用MergeJoin從整體來看代價更低。從這個case來看,自底向上來做計劃取優並不合適。
所以有了Cascade:
- 搜尋方案是自頂向下的。這意味著它可以避免區域性最優而導致全域性不優。從Operator Tree 自頂向下遍歷時,可以做一系列變換:
- Implementation:邏輯運算元可以轉換成物理運算元;例如Join轉換成NestLoop或者HashJoin等
- Exploration:邏輯運算元可以做等價變換;例如交換Inner Join的兩個子節點,即可列舉Join順序
圖片來自於:Cascades Optimizer——https://zhuanlan.zhihu.com/p/73545345
- 它是基於Volcano模型演進而來的。
- 用物件導向的方式進行建模,編寫規則時不需要關心搜尋過程。相比傳統最佳化器中程式導向去一條條的編寫,的確是很大的改進。
5.TiDB的實現
大致的程式碼呼叫鏈為:
-- session/session.go
\-- ExecuteStmt //SQL執行的入口
|-- executor/compiler.go
\-- Compile //將SQL變成可執行的計劃
|--planner/planner/optmize.go
\-- Optimize //最佳化的入口
\-- optimize //這裡有兩個入口。一種是新的最佳化器入口,一種是老的最佳化器入口。根據配置來選擇。最終都會返回最優的物理執行計劃。
|-- planner/cascades/optmize.go
\--FindBestPlan 見5.1
\-- onPhasePreprocessing //見5.3
\-- implGroup
|--planner/core/optmizer.go //見5.4
\-- DoOptimize
\-- physicalOptimize
|--planner/core/find_best_task.go
\-- findBestTask
\-- enumeratePhysicalPlans4Task
\-- compareTaskCost
\-- getTaskPlanCost
|-- planner/core/plan_cost_ver2.go
\-- getPlanCost
5.1 邏輯最佳化
核心入口為:
// FindBestPlan is the optimization entrance of the cascades planner. The
// optimization is composed of 3 phases: preprocessing, exploration and implementation.
//
// ------------------------------------------------------------------------------
// Phase 1: Preprocessing
// ------------------------------------------------------------------------------
//
// The target of this phase is to preprocess the plan tree by some heuristic
// rules which should always be beneficial, for example Column Pruning.
//
// ------------------------------------------------------------------------------
// Phase 2: Exploration
// ------------------------------------------------------------------------------
//
// The target of this phase is to explore all the logically equivalent
// expressions by exploring all the equivalent group expressions of each group.
//
// At the very beginning, there is only one group expression in a Group. After
// applying some transformation rules on certain expressions of the Group, all
// the equivalent expressions are found and stored in the Group. This procedure
// can be regarded as searching for a weak connected component in a directed
// graph, where nodes are expressions and directed edges are the transformation
// rules.
//
// ------------------------------------------------------------------------------
// Phase 3: Implementation
// ------------------------------------------------------------------------------
//
// The target of this phase is to search the best physical plan for a Group
// which satisfies a certain required physical property.
//
// In this phase, we need to enumerate all the applicable implementation rules
// for each expression in each group under the required physical property. A
// memo structure is used for a group to reduce the repeated search on the same
// required physical property.
func (opt *Optimizer) FindBestPlan(sctx sessionctx.Context, logical plannercore.LogicalPlan) (p plannercore.PhysicalPlan, cost float64, err error) {
logical, err = opt.onPhasePreprocessing(sctx, logical)
if err != nil {
return nil, 0, err
}
rootGroup := memo.Convert2Group(logical)
err = opt.onPhaseExploration(sctx, rootGroup)
if err != nil {
return nil, 0, err
}
p, cost, err = opt.onPhaseImplementation(sctx, rootGroup)
if err != nil {
return nil, 0, err
}
err = p.ResolveIndices()
return p, cost, err
}
註釋+程式碼很乾淨,這裡一共做三件事
- onPhasePreprocessing:註釋很實在,說
for example Column Pruning
,結果裡面真的只做了列裁剪。 - onPhaseExploration:探索所有邏輯等價存在的可能
- onPhaseImplementation:根據代價尋找最優的物理執行計劃
這塊官網的部落格寫的非常好,我就不重複了:https://cn.pingcap.com/blog/tidb-cascades-planner/
5.2 統計資訊收集
這塊程式碼主要集中在stats.go裡,裡面的核心資料結構是stats_info.go裡的StatsInfo。呼叫鏈大致為:
|-- planner/cascades/optimizer.go
\--fillGroupStats
|-- planner/core/stats.go
\--DeriveStats
type LogicalPlan interface {
Plan
//......忽略一些程式碼
// DeriveStats derives statistic info for current plan node given child stats.
// We need selfSchema, childSchema here because it makes this method can be used in
// cascades planner, where LogicalPlan might not record its children or schema.
DeriveStats(childStats []*property.StatsInfo, selfSchema *expression.Schema, childSchema []*expression.Schema, colGroups [][]*expression.Column) (*property.StatsInfo, error)
//......忽略一些程式碼
}
有很多結構體實現了這個方法:
- 收集統計資訊主要是analyze.go#Next方法中呼叫的#analyzeWorker。
5.3 新版本 物理執行計劃的選擇
程式碼入口為:
// implGroup finds the best Implementation which satisfies the required
// physical property for a Group. The best Implementation should have the
// lowest cost among all the applicable Implementations.
//
// g: the Group to be implemented.
// reqPhysProp: the required physical property.
// costLimit: the maximum cost of all the Implementations.
func (opt *Optimizer) implGroup(g *memo.Group, reqPhysProp *property.PhysicalProperty, costLimit float64) (memo.Implementation, error) {
groupImpl := g.GetImpl(reqPhysProp)
if groupImpl != nil {
if groupImpl.GetCost() <= costLimit {
return groupImpl, nil
}
return nil, nil
}
// Handle implementation rules for each equivalent GroupExpr.
var childImpls []memo.Implementation
err := opt.fillGroupStats(g)
if err != nil {
return nil, err
}
outCount := math.Min(g.Prop.Stats.RowCount, reqPhysProp.ExpectedCnt)
for elem := g.Equivalents.Front(); elem != nil; elem = elem.Next() {
curExpr := elem.Value.(*memo.GroupExpr)
impls, err := opt.implGroupExpr(curExpr, reqPhysProp)
if err != nil {
return nil, err
}
for _, impl := range impls {
childImpls = childImpls[:0]
for i, childGroup := range curExpr.Children {
childImpl, err := opt.implGroup(childGroup, impl.GetPlan().GetChildReqProps(i), impl.GetCostLimit(costLimit, childImpls...))
if err != nil {
return nil, err
}
if childImpl == nil {
impl.SetCost(math.MaxFloat64)
break
}
childImpls = append(childImpls, childImpl)
}
if impl.GetCost() == math.MaxFloat64 {
continue
}
implCost := impl.CalcCost(outCount, childImpls...)
if implCost > costLimit {
continue
}
if groupImpl == nil || groupImpl.GetCost() > implCost {
groupImpl = impl.AttachChildren(childImpls...)
costLimit = implCost
}
}
}
// Handle enforcer rules for required physical property.
for _, rule := range GetEnforcerRules(g, reqPhysProp) {
newReqPhysProp := rule.NewProperty(reqPhysProp)
enforceCost := rule.GetEnforceCost(g)
childImpl, err := opt.implGroup(g, newReqPhysProp, costLimit-enforceCost)
if err != nil {
return nil, err
}
if childImpl == nil {
continue
}
impl := rule.OnEnforce(reqPhysProp, childImpl)
implCost := enforceCost + childImpl.GetCost()
impl.SetCost(implCost)
if groupImpl == nil || groupImpl.GetCost() > implCost {
groupImpl = impl
costLimit = implCost
}
}
if groupImpl == nil || groupImpl.GetCost() == math.MaxFloat64 {
return nil, nil
}
g.InsertImpl(reqPhysProp, groupImpl)
return groupImpl, nil
}
這裡個函式會找出潛在符合條件的物理執行計劃,並不斷的搜尋。但它有一個剪枝邏輯——會記錄當前最小的cost,如果一個執行計劃自上向下搜尋時,如果超過了這個cost,則直接返回。這就是在第3節提到的自頂向下的最佳化。
接下來我們看一下memo.Implementation
的定義:
package memo
import (
plannercore "github.com/pingcap/tidb/planner/core"
)
// Implementation defines the interface for cost of physical plan.
type Implementation interface {
CalcCost(outCount float64, children ...Implementation) float64
SetCost(cost float64)
GetCost() float64
GetPlan() plannercore.PhysicalPlan
// AttachChildren is used to attach children implementations and returns it self.
AttachChildren(children ...Implementation) Implementation
// GetCostLimit gets the costLimit for implementing the next childGroup.
GetCostLimit(costLimit float64, children ...Implementation) float64
}
其中CalcCost
方法就是用來計算物理執行計劃的代價。一共有這麼多結構體實現了它:
5.3.1 代價的評估
我們以開頭的例子,講解代價的評估。
select代價計算方式
// CalcCost calculates the cost of the table scan Implementation.
func (impl *TableScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
ts := impl.plan.(*plannercore.PhysicalTableScan)
width := impl.tblColHists.GetTableAvgRowSize(impl.plan.SCtx(), impl.tblCols, kv.TiKV, true)
sessVars := ts.SCtx().GetSessionVars()
impl.cost = outCount * sessVars.GetScanFactor(ts.Table) * width
if ts.Desc {
impl.cost = outCount * sessVars.GetDescScanFactor(ts.Table) * width
}
return impl.cost
}
// GetScanFactor returns the session variable scanFactor
// returns 0 when tbl is a temporary table.
func (s *SessionVars) GetScanFactor(tbl *model.TableInfo) float64 {
if tbl != nil {
if tbl.TempTableType != model.TempTableNone {
return 0
}
}
return s.scanFactor
}
// CalcCost implements Implementation interface.
func (impl *IndexScanImpl) CalcCost(outCount float64, _ ...memo.Implementation) float64 {
is := impl.plan.(*plannercore.PhysicalIndexScan)
sessVars := is.SCtx().GetSessionVars()
rowSize := impl.tblColHists.GetIndexAvgRowSize(is.SCtx(), is.Schema().Columns, is.Index.Unique)
cost := outCount * rowSize * sessVars.GetScanFactor(is.Table)
if is.Desc {
cost = outCount * rowSize * sessVars.GetDescScanFactor(is.Table)
}
cost += float64(len(is.Ranges)) * sessVars.GetSeekFactor(is.Table)
impl.cost = cost
return impl.cost
}
這裡我們以全表掃描表和命中索引的程式碼為例子。其中預設的scanFactor是1.5。這裡可以看到indexScan和tableScan少了一個因數:width。這個變數代表了所需列的平均大小。這麼看基本上是indexScan最優了。
這裡的程式碼筆者認為有點不優雅——當Desc時,其實之前的Cost是沒必要算一遍的,浪費CPU資源。
join代價計算方式
// CalcCost implements Implementation CalcCost interface.
func (impl *HashJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
hashJoin := impl.plan.(*plannercore.PhysicalHashJoin)
// The children here are only used to calculate the cost.
hashJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
selfCost := hashJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), false, 0, nil)
impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
return impl.cost
}
// CalcCost implements Implementation CalcCost interface.
func (impl *MergeJoinImpl) CalcCost(_ float64, children ...memo.Implementation) float64 {
mergeJoin := impl.plan.(*plannercore.PhysicalMergeJoin)
// The children here are only used to calculate the cost.
mergeJoin.SetChildren(children[0].GetPlan(), children[1].GetPlan())
selfCost := mergeJoin.GetCost(children[0].GetPlan().StatsCount(), children[1].GetPlan().StatsCount(), 0)
impl.cost = selfCost + children[0].GetCost() + children[1].GetCost()
return impl.cost
}
具體的計算都在plan_cost_v1.go裡:
// GetCost computes cost of hash join operator itself.
func (p *PhysicalHashJoin) GetCost(lCnt, rCnt float64, isMPP bool, costFlag uint64, op *physicalOptimizeOp) float64 {
buildCnt, probeCnt := lCnt, rCnt
build := p.children[0]
// Taking the right as the inner for right join or using the outer to build a hash table.
if (p.InnerChildIdx == 1 && !p.UseOuterToBuild) || (p.InnerChildIdx == 0 && p.UseOuterToBuild) {
buildCnt, probeCnt = rCnt, lCnt
build = p.children[1]
}
sessVars := p.ctx.GetSessionVars()
oomUseTmpStorage := variable.EnableTmpStorageOnOOM.Load()
memQuota := sessVars.MemTracker.GetBytesLimit() // sessVars.MemQuotaQuery && hint
rowSize := getAvgRowSize(build.statsInfo(), build.Schema().Columns)
spill := oomUseTmpStorage && memQuota > 0 && rowSize*buildCnt > float64(memQuota) && p.storeTp != kv.TiFlash
// Cost of building hash table.
cpuFactor := sessVars.GetCPUFactor()
diskFactor := sessVars.GetDiskFactor()
memoryFactor := sessVars.GetMemoryFactor()
concurrencyFactor := sessVars.GetConcurrencyFactor()
cpuCost := buildCnt * cpuFactor
memoryCost := buildCnt * memoryFactor
diskCost := buildCnt * diskFactor * rowSize
// Number of matched row pairs regarding the equal join conditions.
helper := &fullJoinRowCountHelper{
sctx: p.SCtx(),
cartesian: false,
leftProfile: p.children[0].statsInfo(),
rightProfile: p.children[1].statsInfo(),
leftJoinKeys: p.LeftJoinKeys,
rightJoinKeys: p.RightJoinKeys,
leftSchema: p.children[0].Schema(),
rightSchema: p.children[1].Schema(),
leftNAJoinKeys: p.LeftNAJoinKeys,
rightNAJoinKeys: p.RightNAJoinKeys,
}
numPairs := helper.estimate()
// For semi-join class, if `OtherConditions` is empty, we already know
// the join results after querying hash table, otherwise, we have to
// evaluate those resulted row pairs after querying hash table; if we
// find one pair satisfying the `OtherConditions`, we then know the
// join result for this given outer row, otherwise we have to iterate
// to the end of those pairs; since we have no idea about when we can
// terminate the iteration, we assume that we need to iterate half of
// those pairs in average.
if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
if len(p.OtherConditions) > 0 {
numPairs *= 0.5
} else {
numPairs = 0
}
}
if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
numPairs = getOperatorActRows(p)
}
// Cost of querying hash table is cheap actually, so we just compute the cost of
// evaluating `OtherConditions` and joining row pairs.
probeCost := numPairs * cpuFactor
probeDiskCost := numPairs * diskFactor * rowSize
// Cost of evaluating outer filter.
if len(p.LeftConditions)+len(p.RightConditions) > 0 {
// Input outer count for the above compution should be adjusted by SelectionFactor.
probeCost *= SelectionFactor
probeDiskCost *= SelectionFactor
probeCost += probeCnt * cpuFactor
}
diskCost += probeDiskCost
probeCost /= float64(p.Concurrency)
// Cost of additional concurrent goroutines.
cpuCost += probeCost + float64(p.Concurrency+1)*concurrencyFactor
// Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
if p.UseOuterToBuild {
if spill {
// It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
cpuCost += buildCnt * cpuFactor
} else {
cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
}
diskCost += buildCnt * diskFactor * rowSize
}
if spill {
memoryCost *= float64(memQuota) / (rowSize * buildCnt)
} else {
diskCost = 0
}
if op != nil {
setPhysicalHashJoinCostDetail(p, op, spill, buildCnt, probeCnt, cpuFactor, rowSize, numPairs,
cpuCost, probeCost, memoryCost, diskCost, probeDiskCost,
diskFactor, memoryFactor, concurrencyFactor,
memQuota)
}
return cpuCost + memoryCost + diskCost
}
// GetCost computes cost of merge join operator itself.
func (p *PhysicalMergeJoin) GetCost(lCnt, rCnt float64, costFlag uint64) float64 {
outerCnt := lCnt
innerCnt := rCnt
innerKeys := p.RightJoinKeys
innerSchema := p.children[1].Schema()
innerStats := p.children[1].statsInfo()
if p.JoinType == RightOuterJoin {
outerCnt = rCnt
innerCnt = lCnt
innerKeys = p.LeftJoinKeys
innerSchema = p.children[0].Schema()
innerStats = p.children[0].statsInfo()
}
helper := &fullJoinRowCountHelper{
sctx: p.SCtx(),
cartesian: false,
leftProfile: p.children[0].statsInfo(),
rightProfile: p.children[1].statsInfo(),
leftJoinKeys: p.LeftJoinKeys,
rightJoinKeys: p.RightJoinKeys,
leftSchema: p.children[0].Schema(),
rightSchema: p.children[1].Schema(),
}
numPairs := helper.estimate()
if p.JoinType == SemiJoin || p.JoinType == AntiSemiJoin ||
p.JoinType == LeftOuterSemiJoin || p.JoinType == AntiLeftOuterSemiJoin {
if len(p.OtherConditions) > 0 {
numPairs *= 0.5
} else {
numPairs = 0
}
}
if hasCostFlag(costFlag, CostFlagUseTrueCardinality) {
numPairs = getOperatorActRows(p)
}
sessVars := p.ctx.GetSessionVars()
probeCost := numPairs * sessVars.GetCPUFactor()
// Cost of evaluating outer filters.
var cpuCost float64
if len(p.LeftConditions)+len(p.RightConditions) > 0 {
probeCost *= SelectionFactor
cpuCost += outerCnt * sessVars.GetCPUFactor()
}
cpuCost += probeCost
// For merge join, only one group of rows with same join key(not null) are cached,
// we compute average memory cost using estimated group size.
NDV, _ := getColsNDVWithMatchedLen(innerKeys, innerSchema, innerStats)
memoryCost := (innerCnt / NDV) * sessVars.GetMemoryFactor()
return cpuCost + memoryCost
}
HashJoin要考慮到記憶體不夠的情況,因此在計算到資料不夠時,會將對應的資料壓入硬碟。但它對資料的順序並無要求,因此可以併發的去做。見:
// Cost of traveling the hash table to resolve missing matched cases when building the hash table from the outer table
if p.UseOuterToBuild {
if spill {
// It runs in sequence when build data is on disk. See handleUnmatchedRowsFromHashTableInDisk
cpuCost += buildCnt * cpuFactor
} else {
cpuCost += buildCnt * cpuFactor / float64(p.Concurrency)
}
diskCost += buildCnt * diskFactor * rowSize
}
而MergeJoin的代價顯然會更小,但能夠選則到這個計劃也有較高的要求:當兩個關聯表要 Join 的欄位需要按排好的順序讀取時,適用 Merge Join 演算法。
5.4 老版本 物理執行計劃的選擇
5.4.1 代價的評估
這塊程式碼主要是在plan_cost_ver1.go
和plan_cost_ver2.go
。v2對代價公式進行了更精確的迴歸校準,調整了部分代價公式,比此前版本的代價公式更加準確。程式碼上也更為簡潔:v2只暴露出了一個公共方法,內部透過不同的型別做轉發。
// GetPlanCost returns the cost of this plan.
func GetPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
return getPlanCost(p, taskType, option)
}
func getPlanCost(p PhysicalPlan, taskType property.TaskType, option *PlanCostOption) (float64, error) {
if p.SCtx().GetSessionVars().CostModelVersion == modelVer2 {
planCost, err := p.getPlanCostVer2(taskType, option)
return planCost.cost, err
}
return p.getPlanCostVer1(taskType, option)
}
根據不同的PhysicalPlan
型別,會找到不同繫結方法:
v1的部分方法展示:
select代價計算方式
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = child-cost + filter-cost
func (p *PhysicalSelection) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
return p.planCostVer2, nil
}
inputRows := getCardinality(p.children[0], option.CostFlag)
cpuFactor := getTaskCPUFactorVer2(p, taskType)
filterCost := filterCostVer2(option, inputRows, p.Conditions, cpuFactor)
childCost, err := p.children[0].getPlanCostVer2(taskType, option)
if err != nil {
return zeroCostVer2, err
}
p.planCostVer2 = sumCostVer2(filterCost, childCost)
p.planCostInit = true
return p.planCostVer2, nil
}
這部分程式碼簡單易讀。代價就是子查詢的代價+篩選的代價。
那麼問題來了,中索引的和不中索引的代價應該是不一樣的。這裡沒有體現出來啊。仔細看childCost, err := p.children[0].getPlanCostVer2(taskType, option)
,這裡是會去獲取子物理執行計劃的代價。
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
func (p *PointGetPlan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
return p.planCostVer2, nil
}
if p.accessCols == nil { // from fast plan code path
p.planCostVer2 = zeroCostVer2
p.planCostInit = true
return zeroCostVer2, nil
}
rowSize := getAvgRowSize(p.stats, p.schema.Columns)
netFactor := getTaskNetFactorVer2(p, taskType)
p.planCostVer2 = netCostVer2(option, 1, rowSize, netFactor)
p.planCostInit = true
return p.planCostVer2, nil
}
func netCostVer2(option *PlanCostOption, rows, rowSize float64, netFactor costVer2Factor) costVer2 {
return newCostVer2(option, netFactor,
rows*rowSize*netFactor.Value,
func() string { return fmt.Sprintf("net(%v*rowsize(%v)*%v)", rows, rowSize, netFactor) })
}
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = rows * log2(row-size) * scan-factor
// log2(row-size) is from experiments.
func (p *PhysicalTableScan) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
return p.planCostVer2, nil
}
rows := getCardinality(p, option.CostFlag)
var rowSize float64
if p.StoreType == kv.TiKV {
rowSize = getAvgRowSize(p.stats, p.tblCols) // consider all columns if TiKV
} else { // TiFlash
rowSize = getAvgRowSize(p.stats, p.schema.Columns)
}
rowSize = math.Max(rowSize, 2.0)
scanFactor := getTaskScanFactorVer2(p, p.StoreType, taskType)
p.planCostVer2 = scanCostVer2(option, rows, rowSize, scanFactor)
// give TiFlash a start-up cost to let the optimizer prefers to use TiKV to process small table scans.
if p.StoreType == kv.TiFlash {
p.planCostVer2 = sumCostVer2(p.planCostVer2, scanCostVer2(option, 10000, rowSize, scanFactor))
}
p.planCostInit = true
return p.planCostVer2, nil
}
func scanCostVer2(option *PlanCostOption, rows, rowSize float64, scanFactor costVer2Factor) costVer2 {
if rowSize < 1 {
rowSize = 1
}
return newCostVer2(option, scanFactor,
// rows * log(row-size) * scanFactor, log2 from experiments
rows*math.Log2(rowSize)*scanFactor.Value,
func() string { return fmt.Sprintf("scan(%v*logrowsize(%v)*%v)", rows, rowSize, scanFactor) })
}
scanFactor的代價預設是40.7,netFactor的代價預設是3.96。結合程式碼來看,命中索引的代價更優。
join代價計算方式
// getPlanCostVer2 returns the plan-cost of this sub-plan, which is:
// plan-cost = build-child-cost + build-filter-cost +
// (probe-cost + probe-filter-cost) / concurrency
// probe-cost = probe-child-cost * build-rows / batchRatio
func (p *PhysicalIndexJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
return p.getIndexJoinCostVer2(taskType, option, 0)
}
func (p *PhysicalIndexHashJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
return p.getIndexJoinCostVer2(taskType, option, 1)
}
func (p *PhysicalIndexMergeJoin) getPlanCostVer2(taskType property.TaskType, option *PlanCostOption) (costVer2, error) {
return p.getIndexJoinCostVer2(taskType, option, 2)
}
func (p *PhysicalIndexJoin) getIndexJoinCostVer2(taskType property.TaskType, option *PlanCostOption, indexJoinType int) (costVer2, error) {
if p.planCostInit && !hasCostFlag(option.CostFlag, CostFlagRecalculate) {
return p.planCostVer2, nil
}
build, probe := p.children[1-p.InnerChildIdx], p.children[p.InnerChildIdx]
buildRows := getCardinality(build, option.CostFlag)
buildRowSize := getAvgRowSize(build.Stats(), build.Schema().Columns)
probeRowsOne := getCardinality(probe, option.CostFlag)
probeRowsTot := probeRowsOne * buildRows
probeRowSize := getAvgRowSize(probe.Stats(), probe.Schema().Columns)
buildFilters, probeFilters := p.LeftConditions, p.RightConditions
probeConcurrency := float64(p.ctx.GetSessionVars().IndexLookupJoinConcurrency())
cpuFactor := getTaskCPUFactorVer2(p, taskType)
memFactor := getTaskMemFactorVer2(p, taskType)
requestFactor := getTaskRequestFactorVer2(p, taskType)
buildFilterCost := filterCostVer2(option, buildRows, buildFilters, cpuFactor)
buildChildCost, err := build.getPlanCostVer2(taskType, option)
if err != nil {
return zeroCostVer2, err
}
buildTaskCost := newCostVer2(option, cpuFactor,
buildRows*10*cpuFactor.Value,
func() string { return fmt.Sprintf("cpu(%v*10*%v)", buildRows, cpuFactor) })
startCost := newCostVer2(option, cpuFactor,
10*3*cpuFactor.Value,
func() string { return fmt.Sprintf("cpu(10*3*%v)", cpuFactor) })
probeFilterCost := filterCostVer2(option, probeRowsTot, probeFilters, cpuFactor)
probeChildCost, err := probe.getPlanCostVer2(taskType, option)
if err != nil {
return zeroCostVer2, err
}
var hashTableCost costVer2
switch indexJoinType {
case 1: // IndexHashJoin
hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
case 2: // IndexMergeJoin
hashTableCost = newZeroCostVer2(traceCost(option))
default: // IndexJoin
hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
}
// IndexJoin executes a batch of rows at a time, so the actual cost of this part should be
// `innerCostPerBatch * numberOfBatches` instead of `innerCostPerRow * numberOfOuterRow`.
// Use an empirical value batchRatio to handle this now.
// TODO: remove this empirical value.
batchRatio := 6.0
probeCost := divCostVer2(mulCostVer2(probeChildCost, buildRows), batchRatio)
// Double Read Cost
doubleReadCost := newZeroCostVer2(traceCost(option))
if p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate > 0 {
batchSize := float64(p.ctx.GetSessionVars().IndexJoinBatchSize)
taskPerBatch := 1024.0 // TODO: remove this magic number
doubleReadTasks := buildRows / batchSize * taskPerBatch
doubleReadCost = doubleReadCostVer2(option, doubleReadTasks, requestFactor)
doubleReadCost = mulCostVer2(doubleReadCost, p.ctx.GetSessionVars().IndexJoinDoubleReadPenaltyCostRate)
}
p.planCostVer2 = sumCostVer2(startCost, buildChildCost, buildFilterCost, buildTaskCost, divCostVer2(sumCostVer2(doubleReadCost, probeCost, probeFilterCost, hashTableCost), probeConcurrency))
p.planCostInit = true
return p.planCostVer2, nil
}
關鍵在於:
switch indexJoinType {
case 1: // IndexHashJoin
hashTableCost = hashBuildCostVer2(option, buildRows, buildRowSize, float64(len(p.RightJoinKeys)), cpuFactor, memFactor)
case 2: // IndexMergeJoin
hashTableCost = newZeroCostVer2(traceCost(option))
default: // IndexJoin
hashTableCost = hashBuildCostVer2(option, probeRowsTot, probeRowSize, float64(len(p.LeftJoinKeys)), cpuFactor, memFactor)
}
對應方法:
func hashBuildCostVer2(option *PlanCostOption, buildRows, buildRowSize, nKeys float64, cpuFactor, memFactor costVer2Factor) costVer2 {
// TODO: 1) consider types of keys, 2) dedicated factor for build-probe hash table
hashKeyCost := newCostVer2(option, cpuFactor,
buildRows*nKeys*cpuFactor.Value,
func() string { return fmt.Sprintf("hashkey(%v*%v*%v)", buildRows, nKeys, cpuFactor) })
hashMemCost := newCostVer2(option, memFactor,
buildRows*buildRowSize*memFactor.Value,
func() string { return fmt.Sprintf("hashmem(%v*%v*%v)", buildRows, buildRowSize, memFactor) })
hashBuildCost := newCostVer2(option, cpuFactor,
buildRows*cpuFactor.Value,
func() string { return fmt.Sprintf("hashbuild(%v*%v)", buildRows, cpuFactor) })
return sumCostVer2(hashKeyCost, hashMemCost, hashBuildCost)
}
func newZeroCostVer2(trace bool) (ret costVer2) {
if trace {
ret.trace = &costTrace{make(map[string]float64), ""}
}
return
}
簡單的看一下程式碼,我們可以發現,從大多數的場景來看,按照代價從小到大來排,這幾種Join是MergeJoin<HashJoin<IndexJoin。
5.4.2執行計劃列舉與擇優
總得來說這塊程式碼較為簡單,本質就是列舉所有符合條件的物理執行計劃,並挑選出代價最小的執行計劃,故不再列舉程式碼。有興趣的同學可以根據以下大綱自行翻閱:
|--planner/core/find_best_task.go
\-- findBestTask
\-- enumeratePhysicalPlans4Task
\-- compareTaskCost
\-- getTaskPlanCost
|-- planner/core/plan_cost_ver2.go
\-- getPlanCost
6.其他
6.1 參考與引用的文章
- Cascades Optimizer:https://zhuanlan.zhihu.com/p/73545345
- 揭秘 TiDB 新最佳化器:Cascades Planner 原理解析:https://cn.pingcap.com/blog/tidb-cascades-planner/
- TiDB文件-統計資訊簡介:https://docs.pingcap.com/zh/tidb/v6.5/statistics#%E7%BB%9F%E8...
- TiDB文件-錯誤索引的解決方案:https://docs.pingcap.com/zh/tidb/v6.5/wrong-index-solution#%E...
- The Volcano Optimizer Generator: Extensibility and Efficient Search:https://15721.courses.cs.cmu.edu/spring2019/papers/22-optimiz...
- The Cascades Framework for Query Optimization:https://15721.courses.cs.cmu.edu/spring2018/papers/15-optimiz...
6.2 知識補充:code generation && vectorized execution
資料庫引擎執行器中非常出名的兩種最佳化方式,code generation和 vectorized execution。
code generation主要是根據上下文來生成一整段最佳化過的程式碼,這與那種巢狀大量if...else、for迴圈、虛方法的程式碼完全相反,完全面向效能考慮。
vectorized execution基於拉模型。相比於一次拉一個tuple來說,它的批次拉取減少了多次拉取的開銷,同時還可以使用到SIMD。基於這種場景,vectorized execution的最佳化更加適用於列式資料庫。