本節大體介紹了動態規劃演算法實現(standard_join_search)中的join_search_one_level->make_join_rel->populate_joinrel_with_paths->add_paths_to_joinrel->hash_inner_and_outer中的initial_cost_hashjoin和final_cost_nestloop函式，這些函式用於計算hash join的Cost。

一、資料結構

Cost相關
注意:實際使用的引數值透過系統配置檔案定義,而不是這裡的常量定義!

 typedef double Cost; /* execution cost (in page-access units) */

 /* defaults for costsize.c's Cost parameters */
 /* NB: cost-estimation code should use the variables, not these constants! */
 /* 注意:實際值透過系統配置檔案定義,而不是這裡的常量定義! */
 /* If you change these, update backend/utils/misc/postgresql.sample.conf */
 #define DEFAULT_SEQ_PAGE_COST  1.0       //順序掃描page的成本
 #define DEFAULT_RANDOM_PAGE_COST  4.0      //隨機掃描page的成本
 #define DEFAULT_CPU_TUPLE_COST  0.01     //處理一個元組的CPU成本
 #define DEFAULT_CPU_INDEX_TUPLE_COST 0.005   //處理一個索引元組的CPU成本
 #define DEFAULT_CPU_OPERATOR_COST  0.0025    //執行一次操作或函式的CPU成本
 #define DEFAULT_PARALLEL_TUPLE_COST 0.1    //並行執行,從一個worker傳輸一個元組到另一個worker的成本
 #define DEFAULT_PARALLEL_SETUP_COST  1000.0  //構建並行執行環境的成本
 
 #define DEFAULT_EFFECTIVE_CACHE_SIZE  524288    /*先前已有介紹, measured in pages */

 double      seq_page_cost = DEFAULT_SEQ_PAGE_COST;
 double      random_page_cost = DEFAULT_RANDOM_PAGE_COST;
 double      cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
 double      cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST;
 double      cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
 double      parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST;
 double      parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST;
 
 int         effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
 Cost        disable_cost = 1.0e10;//1後面10個0,透過設定一個巨大的成本,讓最佳化器自動放棄此路徑
 
 int         max_parallel_workers_per_gather = 2;//每次gather使用的worker數

二、原始碼解讀

hash join的演算法實現虛擬碼如下:
Step 1
FOR small_table_row IN (SELECT * FROM small_table)
LOOP
slot := HASH(small_table_row.join_key);
INSERT_HASH_TABLE(slot,small_table_row);
END LOOP;

Step 2
FOR large_table_row IN (SELECT * FROM large_table) LOOP
slot := HASH(large_table_row.join_key);
small_table_row = LOOKUP_HASH_TABLE(slot,large_table_row.join_key);
IF small_table_row FOUND THEN
output small_table_row + large_table_row;
END IF;
END LOOP;

initial_cost_hashjoin和final_cost_nestloop函式初步預估hash join訪問路徑的成本和計算最終hash join的Cost。

initial_cost_hashjoin

//----------------------------------------------------------------------------- initial_cost_hashjoin

/*
 * initial_cost_hashjoin
 *    Preliminary estimate of the cost of a hashjoin path.
 *    初步預估hash join訪問路徑的成本
 *    註釋請參照merge join和nestloop join
 * This must quickly produce lower-bound estimates of the path's startup and
 * total costs.  If we are unable to eliminate the proposed path from
 * consideration using the lower bounds, final_cost_hashjoin will be called
 * to obtain the final estimates.
 *
 * The exact division of labor between this function and final_cost_hashjoin
 * is private to them, and represents a tradeoff between speed of the initial
 * estimate and getting a tight lower bound.  We choose to not examine the
 * join quals here (other than by counting the number of hash clauses),
 * so we can't do much with CPU costs.  We do assume that
 * ExecChooseHashTableSize is cheap enough to use here.
 * 
 * 'workspace' is to be filled with startup_cost, total_cost, and perhaps
 *      other data to be used by final_cost_hashjoin
 * 'jointype' is the type of join to be performed
 * 'hashclauses' is the list of joinclauses to be used as hash clauses
 * 'outer_path' is the outer input to the join
 * 'inner_path' is the inner input to the join
 * 'extra' contains miscellaneous information about the join
 * 'parallel_hash' indicates that inner_path is partial and that a shared
 *      hash table will be built in parallel
 */
void
initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
                      JoinType jointype,
                      List *hashclauses,
                      Path *outer_path, Path *inner_path,
                      JoinPathExtraData *extra,
                      bool parallel_hash)
{
    Cost        startup_cost = 0;
    Cost        run_cost = 0;
    double      outer_path_rows = outer_path->rows;
    double      inner_path_rows = inner_path->rows;
    double      inner_path_rows_total = inner_path_rows;
    int         num_hashclauses = list_length(hashclauses);
    int         numbuckets;
    int         numbatches;
    int         num_skew_mcvs;
    size_t      space_allowed;  /* unused */

    /* cost of source data */
    startup_cost += outer_path->startup_cost;
    run_cost += outer_path->total_cost - outer_path->startup_cost;
    startup_cost += inner_path->total_cost;

    /*
     * Cost of computing hash function: must do it once per input tuple. We
     * charge one cpu_operator_cost for each column's hash function.  Also,
     * tack on one cpu_tuple_cost per inner row, to model the costs of
     * inserting the row into the hashtable.
     * 計算雜湊函式的成本:每個輸入元組必須執行一次。
     * 我們對每個列的雜湊函式的計算成本設定為cpu_operator_cost。
     * 另外，在每個內錶行的處理新增以cpu_tuple_cost為單位的成本，以模擬將行插入到雜湊表中的成本。
     *
     * XXX when a hashclause is more complex than a single operator, we really
     * should charge the extra eval costs of the left or right side, as
     * appropriate, here.  This seems more work than it's worth at the moment.
     * 當一個hash條件子句比單個運算子更復雜時，確實應該在這裡計算連線左邊或右邊的額外的表示式的解析(eval)成本。
     * 這看起來會比現在的工作要多。
     */
    startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
        * inner_path_rows;
    run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

    /*
     * If this is a parallel hash build, then the value we have for
     * inner_rows_total currently refers only to the rows returned by each
     * participant.  For shared hash table size estimation, we need the total
     * number, so we need to undo the division.
     * 如果這是一個並行雜湊構建，那麼當前inner_rows_total的值僅引用每個參與者返回的行。
     * 對於共享雜湊表大小估計，需要總數，因此需要修正。
     */
    if (parallel_hash)
        inner_path_rows_total *= get_parallel_divisor(inner_path);

    /*
     * Get hash table size that executor would use for inner relation.
     * 執行器根據內表資訊,獲取hash表的大小.
     * 
     * XXX for the moment, always assume that skew optimization will be
     * performed.  As long as SKEW_WORK_MEM_PERCENT is small, it's not worth
     * trying to determine that for sure.
     * 目前，總是假設會執行傾斜最佳化(skew optimization)。
     * 但是隻要SKEW_WORK_MEM_PERCENT很小，就不需要考慮。
     *
     * XXX at some point it might be interesting to try to account for skew
     * optimization in the cost estimate, but for now, we don't.
     * 在某種程度上，嘗試在成本估算中考慮歪斜最佳化可能是很有趣的，但現在，不使用這樣的做法。
     */
    ExecChooseHashTableSize(inner_path_rows_total,
                            inner_path->pathtarget->width,
                            true,   /* useskew */
                            parallel_hash,  /* try_combined_work_mem */
                            outer_path->parallel_workers,
                            &space_allowed,
                            &numbuckets,
                            &numbatches,
                            &num_skew_mcvs);

    /*
     * If inner relation is too big then we will need to "batch" the join,
     * which implies writing and reading most of the tuples to disk an extra
     * time.  Charge seq_page_cost per page, since the I/O should be nice and
     * sequential.  Writing the inner rel counts as startup cost, all the rest
     * as run cost.
     * 如果內表太大，那麼將需要“批處理”連線，這意味著要花額外的時間將大多數元組寫入和讀取到磁碟。
     * 因為是連續順序I/O,所以每一頁page的成本為eq_page_cost。
     * 寫內表的成本統計作為啟動成本，其餘的都作為執行成本。
     */
    if (numbatches > 1)
    {
        double      outerpages = page_size(outer_path_rows,
                                           outer_path->pathtarget->width);
        double      innerpages = page_size(inner_path_rows,
                                           inner_path->pathtarget->width);

        startup_cost += seq_page_cost * innerpages;
        run_cost += seq_page_cost * (innerpages + 2 * outerpages);
    }

    /* 後續再計算CPU成本.CPU costs left for later */

    /* Public result fields */
    workspace->startup_cost = startup_cost;
    workspace->total_cost = startup_cost + run_cost;
    /* Save private data for final_cost_hashjoin */
    workspace->run_cost = run_cost;
    workspace->numbuckets = numbuckets;
    workspace->numbatches = numbatches;
    workspace->inner_rows_total = inner_path_rows_total;
}

//----------------------------------------------------------- ExecChooseHashTableSize
 /*
 * Compute appropriate size for hashtable given the estimated size of the
 * relation to be hashed (number of rows and average row width).
 * 給定需要雜湊的關係的估計大小(行數和平均行寬度),計算雜湊表的合適大小。
 * 
 * This is exported so that the planner's costsize.c can use it.
 * 該結果會匯出到其他地方以便最佳化器的costsize.c可以使用此值.
 */

/* Target bucket loading (tuples per bucket) */
#define NTUP_PER_BUCKET         1

void
ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
                        bool try_combined_work_mem,
                        int parallel_workers,
                        size_t *space_allowed,
                        int *numbuckets,
                        int *numbatches,
                        int *num_skew_mcvs)
{
    int         tupsize;
    double      inner_rel_bytes;
    long        bucket_bytes;
    long        hash_table_bytes;
    long        skew_table_bytes;
    long        max_pointers;
    long        mppow2;
    int         nbatch = 1;
    int         nbuckets;
    double      dbuckets;

    /* Force a plausible relation size if no info */
     //如元組數目沒有統計,則預設值為1000
    if (ntuples <= 0.0)
        ntuples = 1000.0;

    /*
     * Estimate tupsize based on footprint of tuple in hashtable... note this
     * does not allow for any palloc overhead.  The manipulations of spaceUsed
     * don't count palloc overhead either.
     * 基於雜湊表中元組的足跡(footprint)估計元組大小.
     * 注意，這不允許任何palloc開銷。空間使用的操作也不計入palloc的開銷。
     */
    tupsize = HJTUPLE_OVERHEAD +
        MAXALIGN(SizeofMinimalTupleHeader) +
        MAXALIGN(tupwidth);
    inner_rel_bytes = ntuples * tupsize;//元組數 * 元組大小

    /*
     * Target in-memory hashtable size is work_mem kilobytes.
     * 目標hash表大小為work_mem
     */
    hash_table_bytes = work_mem * 1024L;

    /*
     * Parallel Hash tries to use the combined work_mem of all workers to
     * avoid the need to batch.  If that won't work, it falls back to work_mem
     * per worker and tries to process batches in parallel.
     * 並行雜湊嘗試使用所有worker的work_mem總數來避免批處理。
     * 如果無法如此操作，將使用每個worker一個work_mem的方式，並嘗試並行處理批處理。
     */
    if (try_combined_work_mem)
        hash_table_bytes += hash_table_bytes * parallel_workers;

    *space_allowed = hash_table_bytes;

    /*
     * If skew optimization is possible, estimate the number of skew buckets
     * that will fit in the memory allowed, and decrement the assumed space
     * available for the main hash table accordingly.
     * 如果傾斜最佳化是可能的，那麼估計能夠在記憶體中容納的傾斜桶數，並相應地減小主雜湊表的假設可用空間。
     *
     * We make the optimistic assumption that each skew bucket will contain
     * one inner-relation tuple.  If that turns out to be low, we will recover
     * at runtime by reducing the number of skew buckets.
     * 我們樂觀地假設每一個斜桶包含一個內表元組。
     * 如果結果不容樂觀，那麼將透過減少傾斜桶的數量在執行時恢復。
     *
     * hashtable->skewBucket will have up to 8 times as many HashSkewBucket
     * pointers as the number of MCVs we allow, since ExecHashBuildSkewHash
     * will round up to the next power of 2 and then multiply by 4 to reduce
     * collisions.
     * hashtable->skewBucket的HashSkewBucket指標數量將是我們允許的MCVs數量的8倍，
     * 因為ExecHashBuildSkewHash將舍入到下一個2的乘方，然後乘以4以減少衝突。
     */
    if (useskew)//使用傾斜最佳化(資料列值非均勻分佈)
    {
        skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;

        /*----------
         * Divisor is:
         * size of a hash tuple +
         * worst-case size of skewBucket[] per MCV +
         * size of skewBucketNums[] entry +
         * size of skew bucket struct itself
         *----------
         */
        *num_skew_mcvs = skew_table_bytes / (tupsize +
                                             (8 * sizeof(HashSkewBucket *)) +
                                             sizeof(int) +
                                             SKEW_BUCKET_OVERHEAD);
        if (*num_skew_mcvs > 0)
            hash_table_bytes -= skew_table_bytes;
    }
    else
        *num_skew_mcvs = 0;

    /*
     * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
     * memory is filled, assuming a single batch; but limit the value so that
     * the pointer arrays we'll try to allocate do not exceed work_mem nor
     * MaxAllocSize.
     * 設定nbucket，使其在記憶體填充時達到平均的桶負載NTUP_PER_BUCKET，
     * 假設是一個批處理;但是需要限制這個值，試圖分配的指標陣列不超過work_mem或MaxAllocSize。
     * 
     * Note that both nbuckets and nbatch must be powers of 2 to make
     * ExecHashGetBucketAndBatch fast.
     * 注意,不管是nbuckets還是nbatch必須是2的倍數,以使ExecHashGetBucketAndBatch函式更快
     */
    max_pointers = *space_allowed / sizeof(HashJoinTuple);
    max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
    /* If max_pointers isn't a power of 2, must round it down to one */
    //設定為2的倍數
    mppow2 = 1L << my_log2(max_pointers);
    if (max_pointers != mppow2)
        max_pointers = mppow2 / 2;

    /* Also ensure we avoid integer overflow in nbatch and nbuckets */
    //確保沒有整數溢位
    /* (this step is redundant given the current value of MaxAllocSize) */
    max_pointers = Min(max_pointers, INT_MAX / 2);

    dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
    dbuckets = Min(dbuckets, max_pointers);
    nbuckets = (int) dbuckets;
    /* don't let nbuckets be really small, though ... */
    //nbuckets最大為1024
    nbuckets = Max(nbuckets, 1024);
    /* ... and force it to be a power of 2. */
    //左移一位,即結果*2
    nbuckets = 1 << my_log2(nbuckets);

    /*
     * If there's not enough space to store the projected number of tuples and
     * the required bucket headers, we will need multiple batches.
     * 如果沒有足夠的空間來儲存元組的投影數量和所需的桶header，將需要多個批處理。
     */
    bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
    if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
    {
        //內表大小 + 桶大小 > hash表大小,需要拆分為多個批次
        /* We'll need multiple batches */
        long        lbuckets;
        double      dbatch;
        int         minbatch;
        long        bucket_size;

        /*
         * If Parallel Hash with combined work_mem would still need multiple
         * batches, we'll have to fall back to regular work_mem budget.
         */
        if (try_combined_work_mem)
        {
            ExecChooseHashTableSize(ntuples, tupwidth, useskew,
                                    false, parallel_workers,
                                    space_allowed,
                                    numbuckets,
                                    numbatches,
                                    num_skew_mcvs);
            return;
        }

        /*
         * Estimate the number of buckets we'll want to have when work_mem is
         * entirely full.  Each bucket will contain a bucket pointer plus
         * NTUP_PER_BUCKET tuples, whose projected size already includes
         * overhead for the hash code, pointer to the next tuple, etc.
         * 估計當work_mem完全滿時，可以得到多少個桶。
         * 每個桶bucket將包含一個bucket指標加上NTUP_PER_BUCKET元組，
         * 它的投影大小已經包括雜湊值的開銷、指向下一個元組的指標等等。
         */
        bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
        lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
        lbuckets = Min(lbuckets, max_pointers);
        nbuckets = (int) lbuckets;
        nbuckets = 1 << my_log2(nbuckets);
        bucket_bytes = nbuckets * sizeof(HashJoinTuple);

        /*
         * Buckets are simple pointers to hashjoin tuples, while tupsize
         * includes the pointer, hash code, and MinimalTupleData.  So buckets
         * should never really exceed 25% of work_mem (even for
         * NTUP_PER_BUCKET=1); except maybe for work_mem values that are not
         * 2^N bytes, where we might get more because of doubling. So let's
         * look for 50% here.
         * 桶bucket是指向hash join元組的簡單指標，而tupsize包括指標/hash值和MinimalTupleData。
         * 所以bucket不應該超過work_mem的25%(即使NTUP_PER_BUCKET=1);
         */
        Assert(bucket_bytes <= hash_table_bytes / 2);

        /* Calculate required number of batches. */
        //計算批次
        dbatch = ceil(inner_rel_bytes / (hash_table_bytes - bucket_bytes));
        dbatch = Min(dbatch, max_pointers);
        minbatch = (int) dbatch;
        nbatch = 2;
        while (nbatch < minbatch)
            nbatch <<= 1;
    }

    Assert(nbuckets > 0);
    Assert(nbatch > 0);

    *numbuckets = nbuckets;
    *numbatches = nbatch;
}

final_cost_hashjoin

//----------------------------------------------------------------------------- final_cost_hashjoin
/*
 * final_cost_hashjoin
 *    Final estimate of the cost and result size of a hashjoin path.
 *    最後的成本估算和hashjoin訪問路徑的結果大小
 *
 * Note: the numbatches estimate is also saved into 'path' for use later
 * 注意:numbatches估算也被儲存到path中以備以後使用
 * 
 * 'path' is already filled in except for the rows and cost fields and
 *      num_batches
 * 'workspace' is the result from initial_cost_hashjoin
 * 'extra' contains miscellaneous information about the join
 */
void

final_cost_hashjoin(PlannerInfo *root, HashPath *path,
                    JoinCostWorkspace *workspace,
                    JoinPathExtraData *extra)
{
    Path       *outer_path = path->jpath.outerjoinpath;
    Path       *inner_path = path->jpath.innerjoinpath;
    double      outer_path_rows = outer_path->rows;
    double      inner_path_rows = inner_path->rows;
    double      inner_path_rows_total = workspace->inner_rows_total;
    List       *hashclauses = path->path_hashclauses;
    Cost        startup_cost = workspace->startup_cost;
    Cost        run_cost = workspace->run_cost;
    int         numbuckets = workspace->numbuckets;
    int         numbatches = workspace->numbatches;
    Cost        cpu_per_tuple;
    QualCost    hash_qual_cost;
    QualCost    qp_qual_cost;
    double      hashjointuples;
    double      virtualbuckets;
    Selectivity innerbucketsize;
    Selectivity innermcvfreq;
    ListCell   *hcl;

    /* Mark the path with the correct row estimate */
    if (path->jpath.path.param_info)
        path->jpath.path.rows = path->jpath.path.param_info->ppi_rows;
    else
        path->jpath.path.rows = path->jpath.path.parent->rows;

    /* For partial paths, scale row estimate. */
    if (path->jpath.path.parallel_workers > 0)
    {
        double      parallel_divisor = get_parallel_divisor(&path->jpath.path);

        path->jpath.path.rows =
            clamp_row_est(path->jpath.path.rows / parallel_divisor);
    }

    /*
     * We could include disable_cost in the preliminary estimate, but that
     * would amount to optimizing for the case where the join method is
     * disabled, which doesn't seem like the way to bet.
     */
    if (!enable_hashjoin)
        startup_cost += disable_cost;//禁用hash join

    /* mark the path with estimated # of batches */
    path->num_batches = numbatches;//處理批數

    /* store the total number of tuples (sum of partial row estimates) */
    //儲存元組總數.
    path->inner_rows_total = inner_path_rows_total;

    /* and compute the number of "virtual" buckets in the whole join */
    //計算虛擬的桶數(buckets)
    virtualbuckets = (double) numbuckets * (double) numbatches;

    /*
     * Determine bucketsize fraction and MCV frequency for the inner relation.
     * We use the smallest bucketsize or MCV frequency estimated for any
     * individual hashclause; this is undoubtedly conservative.
     * 確定桶大小分數和MCV頻率之間的內在聯絡。
     * 使用每個雜湊子句估計的最小桶(bucket)大小或MCV頻率;但這無疑是保守的處理方法。
     * 
     * BUT: if inner relation has been unique-ified, we can assume it's good
     * for hashing.  This is important both because it's the right answer, and
     * because we avoid contaminating the cache with a value that's wrong for
     * non-unique-ified paths.
     * 但是:如果內表元組是唯一的，可以假設它對雜湊處理有好處。
     * 這很重要，因為它是正確的答案，而且避免了髒快取，而這個值對於非唯一化路徑是錯誤的。
     */
    if (IsA(inner_path, UniquePath))
    {
        //內表唯一
        innerbucketsize = 1.0 / virtualbuckets;
        innermcvfreq = 0.0;
    }
    else//內表不唯一
    {
        innerbucketsize = 1.0;
        innermcvfreq = 1.0;
        foreach(hcl, hashclauses)//迴圈遍歷hash條件
        {
            RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);
            Selectivity thisbucketsize;
            Selectivity thismcvfreq;

            /*
             * First we have to figure out which side of the hashjoin clause
             * is the inner side.
             * 首先，必須找出hashjoin子句的哪一邊是內表邊。
             *
             * Since we tend to visit the same clauses over and over when
             * planning a large query, we cache the bucket stats estimates in
             * the RestrictInfo node to avoid repeated lookups of statistics.
             * 因為在計劃大型查詢時，往往會反覆訪問相同的子句，
             * 所以將bucket stats估計快取在限制條件資訊節點中，以避免重複查詢統計資訊。
             */
            if (bms_is_subset(restrictinfo->right_relids,
                              inner_path->parent->relids))
            {
                /* righthand side is inner */
                thisbucketsize = restrictinfo->right_bucketsize;
                if (thisbucketsize < 0)
                {
                    /* not cached yet */
                    estimate_hash_bucket_stats(root,
                                               get_rightop(restrictinfo->clause),
                                               virtualbuckets,
                                               &restrictinfo->right_mcvfreq,
                                               &restrictinfo->right_bucketsize);
                    thisbucketsize = restrictinfo->right_bucketsize;
                }
                thismcvfreq = restrictinfo->right_mcvfreq;
            }
            else
            {
                Assert(bms_is_subset(restrictinfo->left_relids,
                                     inner_path->parent->relids));
                /* lefthand side is inner */
                thisbucketsize = restrictinfo->left_bucketsize;
                if (thisbucketsize < 0)
                {
                    /* not cached yet */
                    estimate_hash_bucket_stats(root,
                                               get_leftop(restrictinfo->clause),
                                               virtualbuckets,
                                               &restrictinfo->left_mcvfreq,
                                               &restrictinfo->left_bucketsize);
                    thisbucketsize = restrictinfo->left_bucketsize;
                }
                thismcvfreq = restrictinfo->left_mcvfreq;
            }

            if (innerbucketsize > thisbucketsize)
                innerbucketsize = thisbucketsize;
            if (innermcvfreq > thismcvfreq)
                innermcvfreq = thismcvfreq;
        }
    }

    /*
     * If the bucket holding the inner MCV would exceed work_mem, we don't
     * want to hash unless there is really no other alternative, so apply
     * disable_cost.  (The executor normally copes with excessive memory usage
     * by splitting batches, but obviously it cannot separate equal values
     * that way, so it will be unable to drive the batch size below work_mem
     * when this is true.)
     * 如果包含內部MCV的bucket會超過work_mem，那麼除非真的沒有其他選擇，否則我們不希望執行雜湊，因此禁用hash連線。
     * (執行程式通常透過分割批處理來處理過多的記憶體使用，但顯然它不能以這種方式分離相等的值，
     * 因此如果這是真的，它將無法驅動work_mem以下的批處理大小。)
     */
    if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
                           inner_path->pathtarget->width) >
        (work_mem * 1024L))
        startup_cost += disable_cost;//MCV的值如果大於work_mem,則禁用hash連線

    /*
     * Compute cost of the hashquals and qpquals (other restriction clauses)
     * separately.
     * 分別計算hash quals連線條件和qpquals(其他限制條件)的成本。
     */
    cost_qual_eval(&hash_qual_cost, hashclauses, root);
    cost_qual_eval(&qp_qual_cost, path->jpath.joinrestrictinfo, root);
    qp_qual_cost.startup -= hash_qual_cost.startup;
    qp_qual_cost.per_tuple -= hash_qual_cost.per_tuple;

    /* CPU costs */

    if (path->jpath.jointype == JOIN_SEMI ||
        path->jpath.jointype == JOIN_ANTI ||
        extra->inner_unique)
    {
        double      outer_matched_rows;
        Selectivity inner_scan_frac;

        /*
         * With a SEMI or ANTI join, or if the innerrel is known unique, the
         * executor will stop after the first match.
         * 對於半連線或反連線，或者如果內表是唯一的，執行器將在第一次匹配後停止。
         *
         * For an outer-rel row that has at least one match, we can expect the
         * bucket scan to stop after a fraction 1/(match_count+1) of the
         * bucket's rows, if the matches are evenly distributed.  Since they
         * probably aren't quite evenly distributed, we apply a fuzz factor of
         * 2.0 to that fraction.  (If we used a larger fuzz factor, we'd have
         * to clamp inner_scan_frac to at most 1.0; but since match_count is
         * at least 1, no such clamp is needed now.)
         * 對於至少有一個匹配的外表outer-rel行，如果匹配是均勻分佈的，
         * 那麼桶掃描在桶行數*1/(match_count+1)之後就會停止。
         * 因為它們可能不是均勻分佈的，所以對這個比例應用了2.0的模糊係數。
         * (如果我們使用更大的係數，將不得不將inner_scan_frac限制為1.0;
         * 但是因為match_count至少為1，所以現在不需要這樣處理了。)
         */
        outer_matched_rows = rint(outer_path_rows * extra->semifactors.outer_match_frac);
        inner_scan_frac = 2.0 / (extra->semifactors.match_count + 1.0);

        startup_cost += hash_qual_cost.startup;
        run_cost += hash_qual_cost.per_tuple * outer_matched_rows *
            clamp_row_est(inner_path_rows * innerbucketsize * inner_scan_frac) * 0.5;

        /*
         * For unmatched outer-rel rows, the picture is quite a lot different.
         * In the first place, there is no reason to assume that these rows
         * preferentially hit heavily-populated buckets; instead assume they
         * are uncorrelated with the inner distribution and so they see an
         * average bucket size of inner_path_rows / virtualbuckets.  In the
         * second place, it seems likely that they will have few if any exact
         * hash-code matches and so very few of the tuples in the bucket will
         * actually require eval of the hash quals.  We don't have any good
         * way to estimate how many will, but for the moment assume that the
         * effective cost per bucket entry is one-tenth what it is for
         * matchable tuples.
         * 對於沒有匹配的外表行來說，情況就大不相同了。
         * 首先，沒有理由假設這些行優先命中了被大量填充的桶(bucket);
         * 相反，假設它們與內表分佈不相關，因此它們看到的是inner_path_rows / virtualbuckets的平均桶大小。
         * 第二，如果有任何確切的雜湊值匹配，它們可能會很少，因此桶中的元組實際上很少需要對雜湊quals條件進行eval解析。
         * 目前沒有很好的方法來估計會有多少個，但是目前假設訪問每個桶的實際耗費的成本是匹配元組的十分之一。
         */
        run_cost += hash_qual_cost.per_tuple *
            (outer_path_rows - outer_matched_rows) *
            clamp_row_est(inner_path_rows / virtualbuckets) * 0.05;

        /* Get # of tuples that will pass the basic join */
        //參與基礎連線的元組數
        if (path->jpath.jointype == JOIN_ANTI)
            hashjointuples = outer_path_rows - outer_matched_rows;
        else
            hashjointuples = outer_matched_rows;
    }
    else
    {
        /*
         * The number of tuple comparisons needed is the number of outer
         * tuples times the typical number of tuples in a hash bucket, which
         * is the inner relation size times its bucketsize fraction.  At each
         * one, we need to evaluate the hashjoin quals.  But actually,
         * charging the full qual eval cost at each tuple is pessimistic,
         * since we don't evaluate the quals unless the hash values match
         * exactly.  For lack of a better idea, halve the cost estimate to
         * allow for that.
         * 需要的元組比較數量是外部元組的數量乘以雜湊桶中的典型元組數量，這是內部關係大小乘以它的桶大小比例。
         * 在每個函式中，都需要計算hash連線條件(hashjoin quals)。
         * 但實際上，認為在每個元組上耗費解析表示式(qual eval)的成本是較為悲觀的，
         * 因為我們不計算條件quals，除非雜湊值完全匹配。
         * 由於缺乏更好的想法，將成本估算減半。
         */
        startup_cost += hash_qual_cost.startup;
        run_cost += hash_qual_cost.per_tuple * outer_path_rows *
            clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;

        /*
         * Get approx # tuples passing the hashquals.  We use
         * approx_tuple_count here because we need an estimate done with
         * JOIN_INNER semantics.
         * 透過hash條件獲得大約的元組數。
         * 這裡使用approx_tuple_count，因為我們需要使用JOIN_INNER語義進行評估。
         */
        hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
    }

    /*
     * For each tuple that gets through the hashjoin proper, we charge
     * cpu_tuple_cost plus the cost of evaluating additional restriction
     * clauses that are to be applied at the join.  (This is pessimistic since
     * not all of the quals may get evaluated at each tuple.)
     * 對於透過雜湊連線的每個元組，每個元組的成本為cpu_tuple_cost，
     * 加上計算將在連線上應用的附加限制條款的成本。
     * (這是悲觀的，因為並不是所有的條件表示式quals都能在每個元組上得到計算。)
     */
    startup_cost += qp_qual_cost.startup;
    cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
    run_cost += cpu_per_tuple * hashjointuples;

    /* tlist eval costs are paid per output row, not per tuple scanned */
    startup_cost += path->jpath.path.pathtarget->cost.startup;
    run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;

    path->jpath.path.startup_cost = startup_cost;
    path->jpath.path.total_cost = startup_cost + run_cost;
}

三、跟蹤分析

測試指令碼如下

select a.*,b.grbh,b.je 
from t_dwxx a,
    lateral (select t1.dwbh,t1.grbh,t2.je 
     from t_grxx t1 
          inner join t_jfxx t2 on t1.dwbh = a.dwbh and t1.grbh = t2.grbh) b
order by b.dwbh;

啟動gdb,設定斷點

(gdb) b try_hashjoin_path
Breakpoint 1 at 0x7af2a4: file joinpath.c, line 737.

進入函式try_hashjoin_path,連線型別為JOIN_INNER

(gdb) c
Continuing.

Breakpoint 1, try_hashjoin_path (root=0x1b73980, joinrel=0x1b91d30, outer_path=0x1b866c0, inner_path=0x1b8e780, 
    hashclauses=0x1b93200, jointype=JOIN_INNER, extra=0x7ffc09955e80) at joinpath.c:737
737     required_outer = calc_non_nestloop_required_outer(outer_path,

進入initial_cost_hashjoin函式

(gdb) n
739     if (required_outer &&
(gdb) 
751     initial_cost_hashjoin(root, &workspace, jointype, hashclauses,
(gdb) step
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200, 
    outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3160
3160        Cost        startup_cost = 0;

初始賦值,hash連線條件只有1個,即t_dwxx.dwbh = t_grxx.dwbh(連線資訊參照上一節)

3160        Cost        startup_cost = 0;
(gdb) n
3161        Cost        run_cost = 0;
(gdb) 
3162        double      outer_path_rows = outer_path->rows;
(gdb) 
3163        double      inner_path_rows = inner_path->rows;
(gdb) 
3164        double      inner_path_rows_total = inner_path_rows;
(gdb) 
3165        int         num_hashclauses = list_length(hashclauses);
(gdb) 
3172        startup_cost += outer_path->startup_cost;
(gdb) p num_hashclauses
$1 = 1

進入ExecChooseHashTableSize函式,該函式根據給定需要雜湊的關係的估計大小(行數和平均行寬度),計算雜湊表的合適大小。

...
(gdb) step
ExecChooseHashTableSize (ntuples=100000, tupwidth=9, useskew=true, try_combined_work_mem=false, parallel_workers=0, 
    space_allowed=0x7ffc09955c38, numbuckets=0x7ffc09955c4c, numbatches=0x7ffc09955c48, num_skew_mcvs=0x7ffc09955c44)
    at nodeHash.c:677

獲取元組大小->48Bytes,內部大小4800000Bytes(4.58M)和hash表大小->4194304(4M)

(gdb) n
690     tupsize = HJTUPLE_OVERHEAD +
(gdb) 
693     inner_rel_bytes = ntuples * tupsize;
(gdb) 
698     hash_table_bytes = work_mem * 1024L;
(gdb) 
705     if (try_combined_work_mem)
(gdb) p tupsize
$2 = 48
(gdb) p inner_rel_bytes
$3 = 4800000
(gdb) p hash_table_bytes
$4 = 4194304

使用行傾斜技術

(gdb) n
708     *space_allowed = hash_table_bytes;
(gdb) 
724     if (useskew)
(gdb) 
726         skew_table_bytes = hash_table_bytes * SKEW_WORK_MEM_PERCENT / 100;
(gdb) p useskew
$5 = true

行傾斜桶數為635,調整後的hash表大小4110418,最大的指標數524288

...
(gdb) p *num_skew_mcvs
$7 = 635
(gdb) p hash_table_bytes
$8 = 4110418
...
(gdb) n
756     max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
(gdb) 
758     mppow2 = 1L << my_log2(max_pointers);
(gdb) 
759     if (max_pointers != mppow2)
(gdb) 
764     max_pointers = Min(max_pointers, INT_MAX / 2);
(gdb) 
766     dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
(gdb) p max_pointers
$10 = 524288

計算桶數=131072

...
(gdb) n
767     dbuckets = Min(dbuckets, max_pointers);
(gdb) 
768     nbuckets = (int) dbuckets;
(gdb) 
770     nbuckets = Max(nbuckets, 1024);
(gdb) 
772     nbuckets = 1 << my_log2(nbuckets);
(gdb) 
778     bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
(gdb) 
779     if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) p nbuckets
$11 = 131072

如果沒有足夠的空間，將需要多個批處理。

779     if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
(gdb) n
791         if (try_combined_work_mem)

計算桶數->131072和批次->2等資訊

...
(gdb) 
837     *numbuckets = nbuckets;
(gdb) 
838     *numbatches = nbatch;
(gdb) p nbuckets
$12 = 131072
(gdb) p nbatch
$13 = 2

回到initial_cost_hashjoin

(gdb) 
initial_cost_hashjoin (root=0x1b73980, workspace=0x7ffc09955d00, jointype=JOIN_INNER, hashclauses=0x1b93200, 
    outer_path=0x1b866c0, inner_path=0x1b8e780, extra=0x7ffc09955e80, parallel_hash=false) at costsize.c:3226
3226        if (numbatches > 1)

initial_cost_hashjoin函式執行完畢,檢視計算結果

(gdb) 
3246        workspace->inner_rows_total = inner_path_rows_total;
(gdb) 
3247    }
(gdb) p workspace
$14 = (JoinCostWorkspace *) 0x7ffc09955d00
(gdb) p *workspace
$15 = {startup_cost = 3465, total_cost = 4261, run_cost = 796, inner_run_cost = 0, 
  inner_rescan_run_cost = 6.9525149533036983e-310, outer_rows = 3.7882102964330281e-317, 
  inner_rows = 1.4285501039407471e-316, outer_skip_rows = 1.4285501039407471e-316, 
  inner_skip_rows = 6.9525149532894692e-310, numbuckets = 131072, numbatches = 2, inner_rows_total = 100000}
(gdb)

設定斷點,進入final_cost_hashjoin

(gdb) b final_cost_hashjoin
Breakpoint 2 at 0x7a0b49: file costsize.c, line 3265.
(gdb) c
Continuing.

Breakpoint 2, final_cost_hashjoin (root=0x1b73980, path=0x1b93238, workspace=0x7ffc09955d00, extra=0x7ffc09955e80)
    at costsize.c:3265
3265        Path       *outer_path = path->jpath.outerjoinpath;

賦值,並計算虛擬桶數

3265        Path       *outer_path = path->jpath.outerjoinpath;
(gdb) n
3266        Path       *inner_path = path->jpath.innerjoinpath;
...
(gdb) 
3314        virtualbuckets = (double) numbuckets * (double) numbatches;
(gdb) 
3326        if (IsA(inner_path, UniquePath))
(gdb) p virtualbuckets
$16 = 262144
(gdb)

開始遍歷hash連線條件,確定桶大小分數和MCV頻率之間的內在聯絡,更新連線條件中的相關資訊。

3335            foreach(hcl, hashclauses)
(gdb) 
3337                RestrictInfo *restrictinfo = lfirst_node(RestrictInfo, hcl);

內表在右端RTE=3,即t_grxx),更新相關資訊

3349                if (bms_is_subset(restrictinfo->right_relids,
(gdb) 
3353                    thisbucketsize = restrictinfo->right_bucketsize;
(gdb) 
3354                    if (thisbucketsize < 0)
(gdb) 
3357                        estimate_hash_bucket_stats(root,
(gdb) 
3358                                                   get_rightop(restrictinfo->clause),
(gdb) 
3357                        estimate_hash_bucket_stats(root,
(gdb) 
3362                        thisbucketsize = restrictinfo->right_bucketsize;
(gdb) p *restrictinfo
$17 = {type = T_RestrictInfo, clause = 0x1b87260, is_pushed_down = true, outerjoin_delayed = false, can_join = true, 
  pseudoconstant = false, leakproof = false, security_level = 0, clause_relids = 0x1b87380, required_relids = 0x1b86c68, 
  outer_relids = 0x0, nullable_relids = 0x0, left_relids = 0x1b87340, right_relids = 0x1b87360, orclause = 0x0, 
  parent_ec = 0x1b85b88, eval_cost = {startup = 0, per_tuple = 0.0025000000000000001}, norm_selec = 0.0001, 
  outer_selec = -1, mergeopfamilies = 0x1b873c8, left_ec = 0x1b85b88, right_ec = 0x1b85b88, left_em = 0x1b85d58, 
  right_em = 0x1b85c80, scansel_cache = 0x1b92e50, outer_is_left = true, hashjoinoperator = 98, left_bucketsize = -1, 
  right_bucketsize = 0.00010013016921998598, left_mcvfreq = -1, right_mcvfreq = 0}

計算過程與先前介紹的類似,計算表示式的成本等等,具體可自行debug.
最終的計算結果

3505    }
(gdb) p *path
$19 = {jpath = {path = {type = T_HashPath, pathtype = T_HashJoin, parent = 0x1b91d30, pathtarget = 0x1b91f68, 
      param_info = 0x0, parallel_aware = false, parallel_safe = true, parallel_workers = 0, rows = 100000, 
      startup_cost = 3465, total_cost = 5386, pathkeys = 0x0}, jointype = JOIN_INNER, inner_unique = false, 
    outerjoinpath = 0x1b866c0, innerjoinpath = 0x1b8e780, joinrestrictinfo = 0x1b92208}, path_hashclauses = 0x1b93200, 
  num_batches = 2, inner_rows_total = 100000}

DONE!

四、參考資料

allpaths.c
cost.h
costsize.c
PG Document:Query Planning

PostgreSQL 原始碼解讀（71）- 查詢語句#56（make_one_rel函式#21-...

一、資料結構

二、原始碼解讀

三、跟蹤分析

四、參考資料

相關文章