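Source: PostgreSQL學徒

Preface

Does shared_buffers affect the performance of DROP TABLE in PostgreSQL? This question was debated at length in our chat group. In short, the size of the shared_buffers parameter directly affects the performance of DROP and TRUNCATE. I had never noticed this before, so I verified it myself, found that it really is the case, and was immediately intrigued.

Reproduction

The reproduction steps follow the article DROP TABLE: KILLING SHARED_BUFFERS, modified slightly here to test TRUNCATE instead.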

[postgres@xiongcc ~]$ cat run.sql 
SET synchronous_commit TO off;
BEGIN;
CREATE TABLE if not exists x(id int);
INSERT INTO x VALUES (1);
truncate TABLE x;
COMMIT;
[postgres@xiongcc ~]$ cat test.sh 
#!/bin/sh
 
DB=postgres
 
for x in '8 MB' '32 MB' '128 MB' '1 GB' '8 GB'
do
      pg_ctl -D /home/postgres/16data -l /dev/null -o "--shared_buffers='$x'" start
      sleep 1
      echo tps for $x
      psql -c "SHOW shared_buffers" $DB
      pgbench --file=run.sql -j 1 -c 1 -T 10 -P 2 $DB 2> /dev/null
      pg_ctl -D /home/postgres/16data stop
      sleep 1
done

The results are as follows:

[postgres@xiongcc ~]$ ./test.sh | grep tps
tps for 8 MB
tps = 950.839041 (without initial connection time)
tps for 32 MB
tps = 945.427848 (without initial connection time)
tps for 128 MB
tps = 928.624541 (without initial connection time)
tps for 1 GB
tps = 541.861555 (without initial connection time)

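As shown, once shared_buffers reaches 1 GB the TPS drops by almost half (from about 929 to about 542), which surprised me. Why would a larger shared_buffers make things slower? That runs against the usual intuition: the larger shared_buffers is, the more objects it can cache, which avoids extra I/O and improves performance.

So where is the problem? It is not hard to see: when an object is dropped, every buffer in shared_buffers that belongs to it must be invalidated. The slowdown is therefore most likely caused by walking through shared_buffers, and the larger the pool, the longer the walk. Let's dig into the code to check whether this is really what happens.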
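As a rough sense of scale, the sketch below counts how many buffers a full pass over the pool would have to visit for a given shared_buffers size, assuming the default 8KB block size. The names TOY_BLCKSZ and toy_nbuffers_for_mb are made up for this illustration and are not PostgreSQL identifiers.

```c
#include <stdint.h>

/* Assumption: the default 8KB block size of a stock build. */
#define TOY_BLCKSZ 8192u

/* Number of buffers in a pool of the given size in megabytes (hypothetical helper). */
static uint64_t
toy_nbuffers_for_mb(uint64_t megabytes)
{
    return megabytes * 1024u * 1024u / TOY_BLCKSZ;
}
```

For example, toy_nbuffers_for_mb(8) is 1024 and toy_nbuffers_for_mb(1024) is 131072, so going from 8 MB to 1 GB multiplies the number of buffers to visit by 128.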

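In-depth analysis

Taking the PostgreSQL 16 source as the example, the TRUNCATE logic is easy to find: DropRelationsAllBuffers. The function is fairly long, so let's go through it step by step. Its header comment is clear: remove from the buffer pool all pages of all forks of the specified relations.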

/* ---------------------------------------------------------------------
 *  DropRelationsAllBuffers
 *
 *  This function removes from the buffer pool all the pages of all
 *  forks of the specified relations.  It's equivalent to calling
 *  DropRelationBuffers once per fork per relation with firstDelBlock = 0.
 *  --------------------------------------------------------------------
 */

First, the code checks whether each relation is local, that is, whether it is a temporary table:

 /* If it's a local relation, it's localbuf.c's problem. */
 for (i = 0; i < nlocators; i++)
 {
  if (RelFileLocatorBackendIsTemp(smgr_reln[i]->smgr_rlocator))
  {
   if (smgr_reln[i]->smgr_rlocator.backend == MyBackendId)
    DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
  }
  else
   rels[n++] = smgr_reln[i];
 }

 /*
  * If there are no non-local relations, then we're done. Release the
  * memory and return.
  */
 if (n == 0)
 {
  pfree(rels);
  return;
 }

Then it allocates a two-dimensional array that records the number of blocks in each fork of every relation being dropped or truncated:

 /*
  * This is used to remember the number of blocks for all the relations
  * forks.
  */
 block = (BlockNumber (*)[MAX_FORKNUM + 1])
  palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));

Next comes the core of the function. Start with the comment:

We can avoid scanning the entire buffer pool if we know the exact size of each of the given relation forks. See DropRelationBuffers.

 /*
  * We can avoid scanning the entire buffer pool if we know the exact size
  * of each of the given relation forks. See DropRelationBuffers.
  */
 for (i = 0; i < n && cached; i++)
 {
  for (int j = 0; j <= MAX_FORKNUM; j++)
  {
   /* Get the number of blocks for a relation's fork. */
   block[i][j] = smgrnblocks_cached(rels[i], j);    /* returns InvalidBlockNumber outside recovery */

   /* We need to only consider the relation forks that exists. */
   if (block[i][j] == InvalidBlockNumber)
   {
    if (!smgrexists(rels[i], j))  /* fork file does not exist: skip it; otherwise cached = false below */
     continue;
    cached = false;
    break;
   }

   /* calculate the total number of blocks to be invalidated */
   nBlocksToInvalidate += block[i][j];
  }
 }

DropRelationBuffers carries a comment that explains why: for now the optimization applies only during recovery and on standbys!

 /*
  * To remove all the pages of the specified relation forks from the buffer
  * pool, we need to scan the entire buffer pool but we can optimize it by
  * finding the buffers from BufMapping table provided we know the exact
  * size of each fork of the relation. The exact size is required to ensure
  * that we don't leave any buffer for the relation being dropped as
  * otherwise the background writer or checkpointer can lead to a PANIC
  * error while flushing buffers corresponding to files that don't exist.
  *
  * To know the exact size, we rely on the size cached for each fork by us
  * during recovery which limits the optimization to recovery and on
  * standbys but we can easily extend it once we have shared cache for
  * relation size.
  *
  * In recovery, we cache the value returned by the first lseek(SEEK_END)
  * and the future writes keeps the cached value up-to-date. See
  * smgrextend. It is possible that the value of the first lseek is smaller
  * than the actual number of existing blocks in the file due to buggy
  * Linux kernels that might not have accounted for the recent write. But
  * that should be fine because there must not be any buffers after that
  * file size.
  */

As a result, outside recovery smgrnblocks_cached returns InvalidBlockNumber, and the code ends up setting cached to false.

/*
 * smgrnblocks_cached() -- Get the cached number of blocks in the supplied
 *         relation.
 *
 * Returns an InvalidBlockNumber when not in recovery and when the relation
 * fork size is not cached.
 */
BlockNumber
smgrnblocks_cached(SMgrRelation reln, ForkNumber forknum)
{
 /*
  * For now, we only use cached values in recovery due to lack of a shared
  * invalidation mechanism for changes in file size.
  */
 if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
  return reln->smgr_cached_nblocks[forknum];

 return InvalidBlockNumber;
}

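With cached = false the optimized branch below is skipped. So this optimization only takes effect while WAL is being replayed, for example on a standby; only there can the scan of shared_buffers be avoided.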

 /*
  * We apply the optimization iff the total number of blocks to invalidate
  * is below the BUF_DROP_FULL_SCAN_THRESHOLD.
  
  * Annotation: if fewer than BUF_DROP_FULL_SCAN_THRESHOLD (NBuffers / 32) blocks have to be
  * invalidated, look them up in the hash table; otherwise scan every buffer.
  */
 if (cached && nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
 {
  for (i = 0; i < n; i++)
  {
   for (int j = 0; j <= MAX_FORKNUM; j++)
   {
    /* ignore relation forks that doesn't exist */
    if (!BlockNumberIsValid(block[i][j]))
     continue;

    /* drop all the buffers for a particular relation fork */
    FindAndDropRelationBuffers(rels[i]->smgr_rlocator.locator,
             j, block[i][j], 0);
   }
  }

  pfree(block);
  pfree(rels);
  return;
 }
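The condition guarding this fast path can be sketched in isolation as follows. This is only an illustration of the check described above, assuming the NBuffers / 32 threshold quoted in the annotation; the toy_ names are hypothetical and do not exist in PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the NBuffers / 32 threshold quoted above; toy name, not the real macro. */
static uint64_t
toy_full_scan_threshold(int nbuffers)
{
    return (uint64_t) (nbuffers / 32);
}

/*
 * True when the targeted hash lookups would be used: every fork size must be
 * known (cached) and the total number of blocks must be below the threshold.
 * In every other case the caller falls back to scanning the whole pool.
 */
static bool
toy_use_targeted_lookup(bool cached, uint64_t nblocks_to_invalidate, int nbuffers)
{
    return cached && nblocks_to_invalidate < toy_full_scan_threshold(nbuffers);
}
```

With 16384 buffers (128 MB) the threshold is 512 blocks. When cached is false, as it is outside recovery, the function returns false regardless of how small the table is.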

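The rest of the logic is straightforward. It checks whether the number of relations being dropped exceeds RELS_BSEARCH_THRESHOLD, which is 20, a value the community developers themselves admit is a guess: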

The threshold to use is rather a guess than an exactly determined value

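For example, with drop test1,test2,test..., if more than 20 relations are being dropped, each buffer is matched against the sorted list by binary search; otherwise a simple linear walk is used, which avoids the bsearch overhead.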

 /*
  * For low number of relations to drop just use a simple walk through, to
  * save the bsearch overhead. The threshold to use is rather a guess than
  * an exactly determined value, as it depends on many factors (CPU and RAM
  * speeds, amount of shared buffers etc.).
  */
 use_bsearch = n > RELS_BSEARCH_THRESHOLD;

 /* sort the list of rlocators if necessary */
 if (use_bsearch)
  pg_qsort(locators, n, sizeof(RelFileLocator), rlocator_comparator);
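The choice between the two lookup strategies can be sketched with plain C library calls. This is a simplified model under stated assumptions: ToyLocator reduces a relation to a single number, and toy_prepare and toy_find are hypothetical helpers, not PostgreSQL functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_BSEARCH_THRESHOLD 20   /* same value as RELS_BSEARCH_THRESHOLD in the text */

/* Simplified stand-in for RelFileLocator: one number identifies a relation. */
typedef struct ToyLocator
{
    uint32_t relnumber;
} ToyLocator;

static int
toy_locator_cmp(const void *a, const void *b)
{
    uint32_t x = ((const ToyLocator *) a)->relnumber;
    uint32_t y = ((const ToyLocator *) b)->relnumber;

    return (x > y) - (x < y);
}

/* Sort the drop list once, but only when binary search will be used. */
static void
toy_prepare(ToyLocator *targets, size_t n)
{
    if (n > TOY_BSEARCH_THRESHOLD)
        qsort(targets, n, sizeof(ToyLocator), toy_locator_cmp);
}

/* Return the matching entry of the drop list, or NULL if the key is not in it. */
static const ToyLocator *
toy_find(const ToyLocator *key, const ToyLocator *targets, size_t n)
{
    if (n > TOY_BSEARCH_THRESHOLD)
        return bsearch(key, targets, n, sizeof(ToyLocator), toy_locator_cmp);

    for (size_t i = 0; i < n; i++)
    {
        if (toy_locator_cmp(key, &targets[i]) == 0)
            return &targets[i];
    }
    return NULL;
}
```

Either way the lookup runs once per buffer in the pool, so the threshold only changes the per-buffer cost, not the number of buffers visited.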

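Then comes the most important part. NBuffers is the size of shared buffers expressed in buffers, and the code simply iterates over all of them! The complexity is O(N), with N = shared_buffers / 8KB.

 for (i = 0; i < NBuffers; i++)    /* the crucial part: every buffer header is visited */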
 {
  RelFileLocator *rlocator = NULL;
  BufferDesc *bufHdr = GetBufferDescriptor(i);
  uint32  buf_state;

  /*
   * As in DropRelationBuffers, an unlocked precheck should be safe and
   * saves some cycles.
   */

  if (!use_bsearch)
  {
   int   j;

   for (j = 0; j < n; j++)
   {
    if (BufTagMatchesRelFileLocator(&bufHdr->tag, &locators[j]))
    {
     rlocator = &locators[j];
     break;
    }
   }
  }
  else
  {
   RelFileLocator locator;

   locator = BufTagGetRelFileLocator(&bufHdr->tag);
   rlocator = bsearch((const void *) &(locator),
          locators, n, sizeof(RelFileLocator),
          rlocator_comparator);
  }

  /* buffer doesn't belong to any of the given relfilelocators; skip it */
  if (rlocator == NULL)
   continue;

  buf_state = LockBufHdr(bufHdr);
  if (BufTagMatchesRelFileLocator(&bufHdr->tag, rlocator))
   InvalidateBuffer(bufHdr); /* releases spinlock */
  else
   UnlockBufHdr(bufHdr, buf_state);
 }

 pfree(locators);
 pfree(rels);

So the reason a large shared_buffers slows down dropping a table is now clear: the code walks all of shared_buffers, at O(N) cost with N = shared_buffers / 8KB, so the larger shared_buffers is, the slower the operation!
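A toy model makes the cost visible: the number of headers inspected depends only on the pool size, not on how many pages the dropped relation actually has in the pool. ToyBuffer and toy_drop_relation_buffers are hypothetical names for this sketch, which ignores locking and the real buffer tag layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a buffer header: which relation the page belongs to. */
typedef struct ToyBuffer
{
    uint32_t relnumber;
    bool     valid;
} ToyBuffer;

/*
 * Invalidate every buffer of one relation by walking the whole pool.
 * Returns the number of buffers invalidated; *visited receives the
 * number of headers inspected, which is always nbuffers.
 */
static size_t
toy_drop_relation_buffers(ToyBuffer *pool, size_t nbuffers,
                          uint32_t relnumber, size_t *visited)
{
    size_t dropped = 0;

    *visited = 0;
    for (size_t i = 0; i < nbuffers; i++)
    {
        (*visited)++;
        if (pool[i].valid && pool[i].relnumber == relnumber)
        {
            pool[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

Dropping a relation that owns a single page in a 16384-buffer pool still inspects all 16384 headers, which matches the one-row table used in run.sql.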

What about temporary tables? Let's take a quick look:

/*
 * DropRelationAllLocalBuffers
 *  This function removes from the buffer pool all pages of all forks
 *  of the specified relation.
 *
 *  See DropRelationsAllBuffers in bufmgr.c for more notes.
 */
void
DropRelationAllLocalBuffers(RelFileLocator rlocator)
{
 int   i;

 for (i = 0; i < NLocBuffer; i++)
 {
  BufferDesc *bufHdr = GetLocalBufferDescriptor(i);
  LocalBufferLookupEnt *hresult;
  uint32  buf_state;

  buf_state = pg_atomic_read_u32(&bufHdr->state);

  if ((buf_state & BM_TAG_VALID) &&
   BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
  {
   if (LocalRefCount[i] != 0)
    elog(ERROR, "block %u of %s is still referenced (local %u)",
      bufHdr->tag.blockNum,
      relpathbackend(BufTagGetRelFileLocator(&bufHdr->tag),
         MyBackendId,
         BufTagGetForkNum(&bufHdr->tag)),
      LocalRefCount[i]);
   /* Remove entry from hashtable */
   hresult = (LocalBufferLookupEnt *)
    hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
   if (!hresult)  /* shouldn't happen */
    elog(ERROR, "local buffer hash table corrupted");
   /* Mark buffer invalid */
   ClearBufferTag(&bufHdr->tag);
   buf_state &= ~BUF_FLAG_MASK;
   buf_state &= ~BUF_USAGECOUNT_MASK;
   pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
  }
 }
}

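The logic is similar: it iterates over NLocBuffer. But NLocBuffer is determined by temp_buffers, which in everyday use is not set very large.

Reproducing again

Let's verify this reasoning once more, this time with gdb: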

(gdb) call smgrnblocks_cached(rels[i], j)
$1 = 4294967295
(gdb) p smgrnblocks_cached(rels[i], j)
$2 = 4294967295
(gdb) p/x smgrnblocks_cached(rels[i], j)
$3 = 0xffffffff
(gdb) p cached
$4 = false

This 0xffffffff is InvalidBlockNumber, so cached ends up false.

#define InvalidBlockNumber  ((BlockNumber) 0xFFFFFFFF)

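In addition, the number of relations counted at the start includes every relation that belongs to the table, such as its TOAST table and the TOAST table's index. You can verify this yourself: for create table test1(id int), n is 1; for create table test2(info text), n is 3. Every one of these relations has to be matched during the scan, so the more complex the table and the more indexes it has, the slower the drop!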

 /* If it's a local relation, it's localbuf.c's problem. */
 for (i = 0; i < nlocators; i++)
 {
  if (RelFileLocatorBackendIsTemp(smgr_reln[i]->smgr_rlocator))
  {
   if (smgr_reln[i]->smgr_rlocator.backend == MyBackendId)
    DropRelationAllLocalBuffers(smgr_reln[i]->smgr_rlocator.locator);
  }
  else
   rels[n++] = smgr_reln[i];
 }

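Here I used until to run to the end of the loop, and you can see that it iterated 16384 times! That is the whole of shared buffers: 16384 buffers of 8KB each, or 128MB.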

(gdb) until
(gdb) p i
$16 = 16384

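The optimization in PostgreSQL 14

As mentioned above, it applies only when a standby is replaying WAL. PostgreSQL 14 introduced this optimization: if the number of pages to invalidate is below NBuffers/32, they are looked up in the BufMapping hash table; otherwise all buffers are scanned.

The recovery path of DropRelFileNodeBuffers() was optimized so that scanning the whole buffer pool can be avoided when the number of blocks to be truncated in a relation is below a certain threshold. In such cases the buffers are found by lookups in the BufMapping table. This improves performance by more than 100 times in many cases (tested with many small tables, 1000 relations), especially when the server is configured with a large amount of shared buffers (100GB or more). The optimization helps in the following cases: (1) when vacuum or autovacuum truncates empty pages off the end of a relation, or (2) when a relation is truncated in the same transaction in which it was created. The commit introduces a new API, smgrnblocks_cached, which returns the cached number of blocks in a relation fork. This helps determine the exact size of the relation, which is required to apply the optimization. The exact size is required to ensure that no buffer of the relation being dropped is left behind, as otherwise the background writer or checkpointer could hit a PANIC error while flushing buffers that correspond to files that no longer exist.

But as noted above, this only covers the primary/standby replication scenario.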

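Summary

So ordinary tables run into trouble because shared buffers is usually large. If you set temp_buffers just as large, you get the same performance degradation. Let's tweak the script slightly and test temporary tables (this assumes run.sql creates x as a temporary table, since only temporary tables use local buffers):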

#!/bin/sh
 
DB=postgres
 
for x in '8 MB' '32 MB' '128 MB' '1 GB' '8 GB'
do
      pg_ctl -D /home/postgres/16data -l /dev/null -o "--temp_buffers='$x'" start
      sleep 1
      echo tps for $x
      # psql -c "SHOW shared_buffers" $DB
      psql -c "SHOW temp_buffers" $DB
      pgbench --file=run.sql -j 1 -c 1 -T 10 -P 2 $DB 2> /dev/null
      pg_ctl -D /home/postgres/16data stop
      sleep 1
done

The results follow the same pattern.

[postgres@xiongcc ~]$ ./test.sh | grep tps
tps for 8 MB
tps = 1643.377319 (without initial connection time)
tps for 32 MB
tps = 1641.946117 (without initial connection time)
tps for 128 MB
tps = 1375.333656 (without initial connection time)
tps for 1 GB
tps = 607.852753 (without initial connection time)
tps for 8 GB
tps = 96.615903 (without initial connection time)

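It seems that tuning shared_buffers really is a deep subject.

My view

Using this case, let me also share my modest view of the community. People often wonder why the community won't do this or that when these are all clearly problems! The issue in this example was in fact raised as early as 2015, %2BAw3atNfj9WkAGg%40mail.gmail.com, but the community's senior developers considered the scenario rare and did not optimize for it. Still, the community clearly does keep optimizing and is not deaf to feedback, as the smgrnblocks_cached work in 14 shows. PostgreSQL remains an open-source database, fully open and free, and unlike a commercial database it is not driven directly by customer demand. On the contrary, I think these pain points and itches come from real needs, and aren't they exactly what the domestic databases built on top of PostgreSQL should be working on? If you think something is not good enough, go and help build it.

From the community, giving back to the community, building the community together.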
