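Deep Dive into TiFlash | Blocking in Thread Creation and Release under High Concurrency

Early versions of TiFlash had a thorny problem: for complex small queries, no matter how much concurrency was added, TiFlash could never come close to saturating the machine's CPU. See the figure below:

After spending some time getting to know TiFlash and the problem itself, we concluded that the cause most likely lay in a "shared component" (a global lock, the underlying storage, an upper-layer service, and so on). After a blanket search in that direction, we finally pinned down one major cause: frequent thread creation and release under high concurrency, which makes threads queue up and block while they are being created or released.

TiFlash's working model relies on starting large numbers of short-lived threads for local computation and other tasks. Because so many thread creations and releases were queuing and blocking, the application's computation was blocked as well. The more concurrency, the worse the problem, so CPU usage stopped rising as concurrency increased.
For reasons of space, this article does not go through the detailed troubleshooting process. Instead, we start by constructing a simple experiment that reproduces the problem: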

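Reproducing and Verifying the Problem Experimentally

Definitions

First, define three working modes: wait, work, and workOnNewThread.

wait: a while loop that waits on a condition_variable.

work: a while loop that performs 20 memcpy calls per iteration (each memcpy copies 1,000,000 bytes).

workOnNewThread: a while loop that, on each iteration, creates a new thread, performs 20 memcpy calls inside that thread, and joins it to wait for it to finish; this repeats indefinitely.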
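The three modes can be sketched roughly as follows. This is a minimal illustration, not the original benchmark code: the names `Buffers`, `work`, `work_on_new_thread`, `StopSignal`, `wait_until_stopped`, and `release_waiters` are hypothetical, and the loops take a bounded round count instead of running forever so that the sketch terminates. The copy buffers are allocated once per worker so that the loop measures memcpy and thread creation rather than buffer allocation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kCopyBytes = 1000000;  // bytes per memcpy, as in the text
constexpr int kCopiesPerRound = 20;          // memcpy calls per round, as in the text

// Source and destination buffers, allocated once per worker.
struct Buffers {
    std::vector<char> src = std::vector<char>(kCopyBytes, 1);
    std::vector<char> dst = std::vector<char>(kCopyBytes, 0);
};

// One round of work: 20 memcpy calls of 1,000,000 bytes each.
inline char copy_round(Buffers& b) {
    for (int i = 0; i < kCopiesPerRound; ++i) {
        std::memcpy(b.dst.data(), b.src.data(), kCopyBytes);
    }
    return b.dst[kCopyBytes - 1];  // read the result so the copy is observable
}

// work: run the copy rounds on the calling thread. Returns rounds completed.
inline int work(int rounds) {
    assert(rounds >= 0);
    Buffers b;
    int done = 0;
    for (int r = 0; r < rounds; ++r) {
        if (copy_round(b) == 1) ++done;
    }
    return done;
}

// workOnNewThread: create a fresh thread for every round and join it.
inline int work_on_new_thread(int rounds) {
    assert(rounds >= 0);
    Buffers b;
    int done = 0;
    for (int r = 0; r < rounds; ++r) {
        char last = 0;
        std::thread t([&b, &last] { last = copy_round(b); });
        t.join();  // wait for the short-lived thread to finish
        if (last == 1) ++done;
    }
    return done;
}

// wait: block on a condition_variable until released.
struct StopSignal {
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};

inline void wait_until_stopped(StopSignal& s) {
    std::unique_lock<std::mutex> lock(s.m);
    s.cv.wait(lock, [&s] { return s.stop; });
}

inline void release_waiters(StopSignal& s) {
    {
        std::lock_guard<std::mutex> lock(s.m);
        s.stop = true;
    }
    s.cv.notify_all();
}
```

To approximate one of the experiments, start the desired number of threads, each running `work` or `work_on_new_thread` with a large round count, and watch CPU usage from outside the process.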

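Next, we run experiments with different combinations of these working modes.

Experiments

Experiment 1: 40 work threads

Experiment 2: 1000 wait threads and 40 work threads

Experiment 3: 40 workOnNewThread threads

Experiment 4: 120 workOnNewThread threads

Experiment 5: 500 workOnNewThread threads

Experiment Results

CPU usage for each experiment is as follows:

Analysis of Results

Experiments 1 and 2 show that even though experiment 2 has 1000 more wait threads than experiment 1, the large number of wait threads does not keep the CPU from being saturated. The reason, discussed later, is that threads in the wait state do not take part in scheduling. In addition, Linux uses the CFS scheduler, whose scheduling cost is O(log n), so in theory even a large number of schedulable threads does not add noticeable scheduling pressure.

Experiments 3, 4, and 5 show that when many worker threads frequently create and release threads, the CPU can fail to reach full utilization.

Next, let's analyze why frequent thread creation and release leads to queuing and blocking, and why the cost is so high.

What Happens During Thread Creation and Release under High Concurrency?

Blocking Observed in GDB

Inspecting a program that frequently creates and releases threads with GDB shows that thread creation and release are blocked on the lock in __lll_lock_wait_private, as in the stacks below: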

#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60d80 in _L_lock_3443 () from /lib64/libpthread.so.0
#2 0x00007fbc55f60200 in get_cached_stack (memp=<synthetic pointer>, sizep=<synthetic pointer>)
   at allocatestack.c:175
#3 allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>,
   attr=0x7fbc56173400 <__default_pthread_attr>) at allocatestack.c:474
#4 __pthread_create_2_1 (newthread=0x7fb8f6c234a8, attr=0x0,
   start_routine=0x88835a0 <std::execute_native_thread_routine(void*)>, arg=0x7fbb8bd10cc0)
   at pthread_create.c:447
#5 0x0000000008883865 in __gthread_create (__args=<optimized out>,
   __func=0x88835a0 <std::execute_native_thread_routine(void*)>,
   __threadid=__threadid@entry=0x7fb8f6c234a8)
   at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/b...
#6 std::thread::_M_start_thread (this=this@entry=0x7fb8f6c234a8, state=...)
   at ../../../../-/libstdc++-v3/src/c++11/thread.cc:163

<center>Figure 1: Stack when thread creation is blocked</center>

#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60e59 in _L_lock_4600 () from /lib64/libpthread.so.0
#2 0x00007fbc55f6089f in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>,
   attr=0x7fbc56173400 <__default_pthread_attr>) at allocatestack.c:552
#3 __pthread_create_2_1 (newthread=0x7fb5f1a5e8b0, attr=0x0,
   start_routine=0x88835a0 <std::execute_native_thread_routine(void*)>, arg=0x7fbb8bcd6500)
   at pthread_create.c:447
#4 0x0000000008883865 in __gthread_create (__args=<optimized out>,
   __func=0x88835a0 <std::execute_native_thread_routine(void*)>,
   __threadid=__threadid@entry=0x7fb5f1a5e8b0)
   at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/...
#5 std::thread::_M_start_thread (this=this@entry=0x7fb5f1a5e8b0, state=...)
   at ../../../.././libstdc++-v3/src/c++11/thread.cc:163

<center>Figure 2: Stack when thread creation is blocked</center>

#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fbc55f60b71 in _L_lock_244 () from /lib64/libpthread.so.0
#2 0x00007fbc55f5ef3c in __deallocate_stack (pd=0x7fbc56173320 <stack_cache_lock>, pd@entry=0x7fb378912700) at allocatestack.c:704
#3 0x00007fbc55f60109 in __free_tcb (pd=pd@entry=0x7fb378912700) at pthread_create.c:223
#4 0x00007fbc55f61053 in pthread_join (threadid=140408798652160, thread_return=0x0) at pthread_join.c:111
#5 0x0000000008883803 in __gthread_join (__value_ptr=0x0, __threadid=<optimized out>)
        at /root/XXX/gcc-7.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:668
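#6 std::thread::join (this=this@entry=0x7fbbc2005668) at ../../../.././libstdc++-v3/src/c++11/thread.cc:136

<center>Figure 3: Stack when thread release is blocked</center>

The stacks above show that thread creation calls allocate_stack and get_cached_stack, while thread release calls __deallocate_stack. These functions block because they contend for a lock inside __lll_lock_wait_private.

To explain this, we need to understand how threads are created and released.

How Thread Creation and Release Work

The threads we use day to day are pthreads implemented by NPTL. NPTL (Native POSIX Thread Library), the native pthread library, is part of glibc. Reading the relevant glibc source code reveals how pthread creation and release work.

Thread creation allocates a stack for the thread, and thread destruction releases it. Both use two linked lists, stack_used and stack_cache: stack_used tracks the stacks currently in use by threads, while stack_cache holds reusable stacks reclaimed from threads that have exited. When a thread needs a stack, it does not allocate a new one right away; it first tries to take one from stack_cache.

__lll_lock_wait_private is the private form of __lll_lock_wait. It is in fact a mutex built on futex, covered later; "private" means the lock is used only within one process and is not shared across processes.
The lock contention happens when threads manipulate these two lists while calling allocate_stack (at thread creation) and deallocate_stack (at thread release).

The allocate_stack path:

/* Returns a usable stack for a new thread either by allocating a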
   new stack or reusing a cached stack of sufficient size.
   ATTR must be non-NULL and point to a valid pthread_attr.
   PDP must be non-NULL.  */
static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
                ALLOCATE_STACK_PARMS)
{
  ... // do something

  /* Get memory for the stack.  */
  if (__glibc_unlikely (attr->flags & ATTR_FLAG_STACKADDR))
    { 
      ... // do something
    }
  else
    {
      // main branch
      /* Allocate some anonymous memory.  If possible use the cache.  */
      ... // do something

      /* Try to get a stack from the cache.  */
      reqsize = size;
      pd = get_cached_stack (&size, &mem);
      /* 
          If get_cached_stack() succeed, it will use cached_stack 
          to do rest work. Otherwise, it will call mmap() to allocate a stack.
      */
      if (pd == NULL) // if pd == NULL, get_cached_stack() failed
        {
          ... // do something
          mem = mmap (NULL, size, prot,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
          ... // do something
          /* Prepare to modify global data.  */
          lll_lock (stack_cache_lock, LLL_PRIVATE); // global lock

          /* And add to the list of stacks in use.  */
          stack_list_add (&pd->list, &stack_used);

          lll_unlock (stack_cache_lock, LLL_PRIVATE);
          ... // do something
        }
      ... //do something
    }
  ... //do something
  return 0;
}

/* Get a stack frame from the cache.  We have to match by size since
   some blocks might be too small or far too large.  */
static struct pthread *
get_cached_stack (size_t *sizep, void **memp)
{
  size_t size = *sizep;
  struct pthread *result = NULL;
  list_t *entry;

  lll_lock (stack_cache_lock, LLL_PRIVATE); // global lock

  /* Search the cache for a matching entry.  We search for the
     smallest stack which has at least the required size.  Note that
     in normal situations the size of all allocated stacks is the
     same.  As the very least there are only a few different sizes.
     Therefore this loop will exit early most of the time with an
     exact match.  */
  list_for_each (entry, &stack_cache)
    {
      ... // do something
    }

  ... // do something

  /* Dequeue the entry.  */
  stack_list_del (&result->list);

  /* And add to the list of stacks in use.  */
  stack_list_add (&result->list, &stack_used);

  /* And decrease the cache size.  */
  stack_cache_actsize -= result->stackblock_size;

  /* Release the lock early.  */
  lll_unlock (stack_cache_lock, LLL_PRIVATE);
  ... // do something
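  return result;
}

<center>Figure 4: allocate_stack code analysis</center>

Combining the stacks with the source code, we can see that pthread_create starts by calling allocate_stack to allocate the thread's stack. The process, shown in the code above, is as follows. It first checks whether the user supplied the stack space; if so, it uses that space directly, although this case is rare. By default the user supplies nothing and the system allocates the stack itself. In that case allocate_stack first calls get_cached_stack to try to reuse a previously allocated stack. If that fails, it calls the mmap syscall to allocate a stack; once it has one, it takes the global lock (stack_cache_lock, acquired through lll_lock) and adds the stack to the stack_used list. get_cached_stack itself also takes the same global lock: it scans the stack_cache list for a usable stack, removes that stack from stack_cache, and adds it to stack_used.

The deallocate_stack path: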

void
internal_function
__deallocate_stack (struct pthread *pd)
{
  lll_lock (stack_cache_lock, LLL_PRIVATE); //global lock

  /* Remove the thread from the list of threads with user defined
     stacks.  */
  stack_list_del (&pd->list); 

  /* Not much to do.  Just free the mmap()ed memory.  Note that we do
     not reset the 'used' flag in the 'tid' field.  This is done by
     the kernel.  If no thread has been created yet this field is
     still zero.  */
  if (__glibc_likely (! pd->user_stack))
    (void) queue_stack (pd); 
  else
    /* Free the memory associated with the ELF TLS.  */
    _dl_deallocate_tls (TLS_TPADJ (pd), false);

  lll_unlock (stack_cache_lock, LLL_PRIVATE);
}

/* Add a stack frame which is not used anymore to the stack.  Must be
   called with the cache lock held.  */
static inline void
__attribute ((always_inline))
queue_stack (struct pthread *stack)
{
  /* We unconditionally add the stack to the list.  The memory may
     still be in use but it will not be reused until the kernel marks
     the stack as not used anymore.  */
  stack_list_add (&stack->list, &stack_cache);

  stack_cache_actsize += stack->stackblock_size;
  if (__glibc_unlikely (stack_cache_actsize > stack_cache_maxsize))
    //if stack_cache is full, release some stacks
    __free_stacks (stack_cache_maxsize); 
}

/* Free stacks until cache size is lower than LIMIT.  */
void
__free_stacks (size_t limit)
{
  /* We reduce the size of the cache.  Remove the last entries until
     the size is below the limit.  */
  list_t *entry;
  list_t *prev;

  /* Search from the end of the list.  */
  list_for_each_prev_safe (entry, prev, &stack_cache)
    {
      struct pthread *curr;

      curr = list_entry (entry, struct pthread, list);
      if (FREE_P (curr))
        {
          ... // do something
          
          /* Remove this block.  This should never fail.  If it does
             something is really wrong.  */
          if (munmap (curr->stackblock, curr->stackblock_size) != 0)
            abort ();

          /* Maybe we have freed enough.  */
          if (stack_cache_actsize <= limit)
            break;
        }
    }
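}

<center>Figure 5: deallocate_stack code analysis</center>

//file path: nptl/allocatestack.c
/* Maximum size in kB of cache.  */
static size_t stack_cache_maxsize = 40 * 1024 * 1024; /* 40MiBi by default.  */
static size_t stack_cache_actsize;

<center>Figure 6: Default value of stack_cache_maxsize, the capacity of the stack_cache list</center>

Combining the stacks with the source code, we can see that when a thread ends, __free_tcb is called to release the thread's TCB (Thread Control Block, the thread's metadata), and then __deallocate_stack is called to reclaim the stack. The main bottleneck in this process is __deallocate_stack. It acquires the same global lock as allocate_stack (stack_cache_lock, through lll_lock) and removes the stack from the stack_used list. It then checks whether the stack was allocated by the system; if so, it adds the stack to the stack_cache list. After that, it checks whether the size of stack_cache exceeds the threshold stack_cache_maxsize; if it does, it calls __free_stacks to release stacks until the size drops below the threshold. Note that __free_stacks calls the munmap syscall to release memory. As for the threshold stack_cache_maxsize, the source above gives its default as 40 * 1024 * 1024, and the comment in the code suggests the unit is kB. Testing later showed that the comment is misleading: the unit of stack_cache_maxsize is actually bytes, so the default is 40 MB. The default thread stack size is typically 8 to 10 MB, which means that by default glibc can cache roughly 4 to 5 thread stacks for the user.

So thread creation and release both compete for the same global mutex (stack_cache_lock, taken through lll_lock), and under highly concurrent creation and release these threads queue up and block. Because only one thread can make progress at a time, if the cost of one creation or release is c, the average latency of n concurrent creations or releases is roughly avg_rt = (1 + 2 + … + n) * c / n = (n * (n + 1) / 2) * c / n = (n + 1) * c / 2. In other words, the average latency grows linearly with concurrency. Instrumenting thread creation in TiFlash showed that with 40 nested queries (max_threads = 4; note that this is TiFlash's concurrency parameter), the number of threads being created and released reached about 3500, and the average thread creation latency reached 30 ms! That latency is alarming; thread creation and release are no longer as "lightweight" as one might imagine. With a single operation already this slow, it is easy to see that nested thread creation, as in TiFlash, makes things even worse.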
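The linear growth can be checked with a small calculation. The sketch below assumes the simplified model from the paragraph above: n operations arrive together, each holds the lock for a fixed cost c, and the k-th operation to acquire the lock completes at time k * c. The function names are hypothetical.

```cpp
#include <cassert>

// Average completion latency computed by summing each operation's finish time.
// The k-th operation through the lock finishes at k * c.
inline double average_latency_by_sum(int n, double c) {
    assert(n > 0);
    double total = 0.0;
    for (int k = 1; k <= n; ++k) {
        total += k * c;
    }
    return total / n;
}

// The same quantity from the closed form (n + 1) * c / 2.
inline double average_latency_closed_form(int n, double c) {
    assert(n > 0);
    return (n + 1) * c / 2.0;
}
```

For example, with n = 4 and c = 2 the finish times are 2, 4, 6, and 8, so the average is 5, which matches (4 + 1) * 2 / 2.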

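At this point we know that thread creation and release try to acquire a global mutex and therefore queue up and block, but lll_lock itself may still be a mystery. What is lll_lock?

lll_lock and Futex

<center>Figure 7: futex</center>

lll is short for low level lock; it is a mutex built on futex. Futex, short for fast userspace mutex, is a non-vDSO system call. The mutex in newer Linux versions is also built on futex. The design of futex assumes that in most cases the lock is uncontended, so the lock operation can complete entirely in user space. When contention does occur, lll_lock enters the kernel through the non-vDSO system call sys_futex(FUTEX_WAIT) and waits to be woken. The thread that wins the lock finishes its work and then uses lll_unlock to wake val threads (val is usually 1); lll_unlock performs the wakeup through the non-vDSO system call sys_futex(FUTEX_WAKE).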
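To make the fast path and slow path concrete, here is a minimal futex-based mutex. It is an illustrative sketch only, not glibc's lll_lock implementation: it follows the well-known three-state design (0 = unlocked, 1 = locked, 2 = locked with possible waiters), it is Linux-specific, and it assumes that `std::atomic<std::uint32_t>` has the same layout as a plain 32-bit integer. The names `FutexMutex` and `contended_increment` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>
#include <vector>

class FutexMutex {
public:
    void lock() {
        std::uint32_t seen = 0;
        // Fast path: an uncontended lock is taken entirely in user space.
        if (state_.compare_exchange_strong(seen, 1, std::memory_order_acquire)) {
            return;
        }
        // Slow path: mark the lock as contended, then sleep in the kernel.
        if (seen != 2) {
            seen = state_.exchange(2, std::memory_order_acquire);
        }
        while (seen != 0) {
            // FUTEX_WAIT only sleeps if state_ still equals 2.
            futex_call(FUTEX_WAIT_PRIVATE, 2);
            seen = state_.exchange(2, std::memory_order_acquire);
        }
    }

    void unlock() {
        // Enter the kernel only if another thread may be waiting.
        if (state_.exchange(0, std::memory_order_release) == 2) {
            futex_call(FUTEX_WAKE_PRIVATE, 1);  // wake one waiter
        }
    }

private:
    long futex_call(int op, std::uint32_t val) {
        return syscall(SYS_futex, reinterpret_cast<std::uint32_t*>(&state_),
                       op, val, nullptr, nullptr, 0);
    }

    std::atomic<std::uint32_t> state_{0};
};

// Hypothetical helper: several threads increment a shared counter under the lock.
inline long contended_increment(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    FutexMutex mutex;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&mutex, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                mutex.lock();
                ++counter;
                mutex.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With a single thread, `lock` and `unlock` never issue a system call; only under contention do threads pay for the kernel round trip.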

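From how lll_lock and futex work, we can see that in the uncontended case the operation is lightweight and never enters the kernel. Under contention, however, threads not only queue up and block but also trigger switches between user mode and kernel mode, which makes thread creation and release even less efficient. Switching between kernel mode and user mode is slow mainly because of the non-vDSO system call, so let's look at the cost of system calls.

The Cost of System Calls

On modern Linux systems, a subset of syscalls is usually exposed to programs through the vDSO. Calling a syscall through the vDSO is very efficient because it involves no switch between user mode and kernel mode. Non-vDSO syscalls are not so lucky, and unfortunately futex is one of them.

<center>Figure 8: How a system call works</center>

Traditionally, a syscall is made through the int 0x80 interrupt. The CPU hands control to the OS, which checks the arguments passed in (for example, SYS_gettimeofday), looks up the system call table using the system call number in the register, and runs the service function for that number (for example, gettimeofday). The interrupt forces the CPU to save its pre-interrupt execution state so that the state can be restored afterward. Beyond the interrupt itself, the kernel does more work. Linux is divided into user space and kernel space; kernel space has the highest privilege level and can operate on hardware directly. To guard against malicious user programs, user applications cannot access kernel space directly. To do work that user mode cannot, they must go through a syscall, and the kernel has to switch between the user and kernel memory segments to carry it out. This switch requires a wholesale swap of CPU register contents, because the previous register state must be saved and restored. The kernel also performs extra permission checks on what the user passes in, so the performance impact is considerable. Modern CPUs provide the syscall instruction, and newer Linux versions use it instead of the int 0x80 interrupt, but the cost is still high.

mutex is also built on futex, so why are thread creation and destruction slow?

As mentioned in the futex introduction, mutex is also built on futex. If both are built on futex, why do thread creation and destruction become slow?

By modifying the glibc and kernel source code to add trace code, we found that the time thread creation and destruction spend inside the futex critical section comes mainly from the munmap syscall. As the source analysis above showed, when a thread is released and the stack_cache list is already full, munmap is called to free stacks.

munmap takes roughly a few microseconds to a few tens of microseconds, and accounts for more than 90% of the time of the whole process. Because munmap runs while the global futex lock is held, all other thread creation and destruction must block for that time, causing severe performance degradation.

So the deeper reason thread creation and destruction are slow is this: when a thread is destroyed and stack_cache is full, munmap must be called to free stacks, and that makes the futex critical section far too long. Creation and destruction compete for the same futex lock, so both block.

Yet mmap, which also manipulates memory (to allocate it), is very fast and usually finishes in under 1 us. Why is the contrast so stark? Next, let's analyze why munmap is so slow.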
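One way to observe the difference on your own machine is to time the two calls directly. The sketch below is an assumption-laden micro-benchmark, not the tracing the authors added to glibc and the kernel: `MapTimings` and `time_mmap_munmap` are hypothetical names, the pages are touched before unmapping so that real mappings exist, and optional busy threads keep the same address space active on other cores. Results vary widely with hardware, kernel version, and load, so no expected numbers are given.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <sys/mman.h>
#include <thread>
#include <unistd.h>
#include <vector>

struct MapTimings {
    long long mmap_ns;    // time spent inside the mmap call
    long long munmap_ns;  // time spent inside the munmap call
};

inline MapTimings time_mmap_munmap(std::size_t bytes, int busy_threads) {
    assert(bytes > 0 && busy_threads >= 0);
    using Clock = std::chrono::steady_clock;
    using std::chrono::duration_cast;
    using std::chrono::nanoseconds;

    // Busy threads share this process's address space while running elsewhere.
    std::atomic<bool> stop{false};
    std::vector<std::thread> busy;
    for (int i = 0; i < busy_threads; ++i) {
        busy.emplace_back([&stop] {
            while (!stop.load(std::memory_order_relaxed)) {
            }
        });
    }

    const auto t0 = Clock::now();
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    const auto t1 = Clock::now();

    MapTimings result{duration_cast<nanoseconds>(t1 - t0).count(), 0};
    if (mem != MAP_FAILED) {
        // Touch every page so page-table entries actually exist.
        const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        volatile char* p = static_cast<volatile char*>(mem);
        for (std::size_t off = 0; off < bytes; off += page) {
            p[off] = 1;
        }
        const auto t2 = Clock::now();
        munmap(mem, bytes);
        const auto t3 = Clock::now();
        result.munmap_ns = duration_cast<nanoseconds>(t3 - t2).count();
    }

    stop.store(true, std::memory_order_relaxed);
    for (auto& t : busy) t.join();
    return result;
}
```

Calling `time_mmap_munmap(8u << 20, 0)` and then again with a few busy threads gives a rough feel for how the unmapping cost changes once other cores are involved.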

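munmap, TLB Shootdown, and Inter-Processor Interrupts (IPI)

First, a brief description of how munmap works. munmap finds the virtual memory area (VMA) that corresponds to the range being released. If the start or end of the range falls in the middle of a VMA, that VMA has to be split. It then unmaps the range, frees the corresponding VMAs and page tables, and invalidates the related TLB entries.

The trace we added to the kernel shows that most of the time is spent in tlb_flush_mmu, which evicts TLB entries. munmap calls this function because, after releasing memory, it must invalidate the TLB entries that are now stale.

Going one level deeper, if the affected TLB entries exist on multiple CPU cores, tlb_flush_mmu calls smp_call_function_many to flush the TLB on all of those cores and waits synchronously for the flush to finish. Flushing the TLB on a single core is done with a local interrupt, while flushing it on multiple cores requires inter-processor interrupts (IPIs).

Tracing shows that most of the time comes from IPIs: the IPI communication alone takes a few microseconds to a few tens of microseconds, while the TLB flush itself takes less than 1 us.

<center>Figure 9: How inter-processor interrupts (IPI) work</center>

The figure above shows how IPIs work. CPU cores exchange IPI messages over the system bus. When one core needs to perform IPI work on several other cores, it sends an IPI request to the system bus and waits for all of those cores to finish. Each target core handles its own interrupt task when it receives the request and then notifies the initiator over the system bus. Because the whole exchange goes through the system bus outside the CPU, and because the initiator can only wait idly from sending the IPI until the other cores finish handling the interrupt, the overhead is very high (a few microseconds or more).

Other people's research further confirms that IPIs are heavyweight. According to the paper "Latr: Lazy Translation Coherence", published in 2018,

“an IPI takes up to 6.6 µs for 120 cores with 8 sockets and 2.7 µs for 16 cores with 2 sockets.”

In other words, the overhead of a single IPI is on the order of microseconds.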

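Context Switches and CFS

Besides thread creation and release, the number of threads is also worth attention. In particular, once there are many running threads, the cost of context switches and scheduling can hurt performance. Why the emphasis on running threads? Because blocked threads (waiting on a lock, in nanosleep, and so on) take no part in scheduling and do not cause context switches. Many people assume that a large number of threads, blocked or not, always means high context-switch and scheduling cost, but that is not entirely correct: the scheduler gives a blocked thread no CPU time at all until it is woken. Linux's scheduler is CFS (Completely Fair Scheduler). On each CPU core it maintains a queue of runnable threads based on a red-black tree, known as the runqueue. The scheduling cost of a runqueue is log(n), where n is the number of runnable threads in that queue. Because CFS schedules only running threads, scheduling and context switches happen mainly among running threads. Having covered the cost of the CFS scheduler, let's turn to context switches.

Context switches can be within a process or between processes. Since we usually develop multithreaded programs inside a single process, the cost discussed here is mainly that of switching between threads in the same process. Switching between threads in a process is cheaper than switching between processes because no TLB (translation lookaside buffer) flush occurs. It is still not cheap, though: registers and the TCB (thread control block) have to be saved and restored, and part of the CPU cache becomes ineffective.

Earlier versions of TiFlash could reach more than 5000 threads in total under highly concurrent queries, which is an alarming number. But the number of running threads usually stayed below 100. Assuming a machine with 40 logical cores, the scheduling cost is at most lg(100) in the worst case and ideally about lg(100/40), which is relatively small. The context-switch cost is roughly that of a few dozen running threads, which is also fairly manageable.

Here I also ran an experiment comparing the elapsed time with 5000 running threads versus 5000 blocked threads. I first defined three thread states. work counts from 0 to 50000000. yield calls sched_yield in a loop, so the thread gives up the CPU without doing any computation while staying runnable; the goal is to increase the number of running threads without adding computational work. wait blocks in condition_variable.wait with a predicate that is always false. The timing results are as follows:

As the results show, threads waiting on a lock are not running, so experiments 1 and 2 take about the same time: 5000 blocked threads have no noticeable effect on performance. In experiment 3, the 500 threads that do nothing but context switch (equivalent to running threads that do no computation) are fewer than the wait threads in experiment 2, yet even without doing any other computation they hurt performance badly. The scheduling and context-switch cost is obvious here: the elapsed time rose by roughly 10x. This shows that scheduling and context-switch cost depends mainly on the number of non-blocked, running threads, which helps us judge performance problems more accurately in the future.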
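The experiment can be approximated with a sketch like the one below. It is not the original test program: `BackgroundMode`, `count_to`, and `time_work_ms` are hypothetical names, the counting work runs on the calling thread only, and the wait threads block on a flag that is set at the end so the program can exit (the original describes a predicate that never becomes true). Timings depend heavily on core count and load.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <sched.h>
#include <thread>
#include <vector>

enum class BackgroundMode { None, Yield, Wait };

// The "work" state: count from 0 up to limit.
inline long long count_to(long long limit) {
    volatile long long i = 0;  // volatile keeps the loop from being optimized away
    while (i < limit) {
        i = i + 1;
    }
    return i;
}

// Times count_to(limit) while background_threads threads run in the given mode.
inline double time_work_ms(BackgroundMode mode, int background_threads, long long limit) {
    assert(background_threads >= 0 && limit >= 0);
    std::atomic<bool> stop{false};
    std::mutex m;
    std::condition_variable cv;
    bool released = false;

    std::vector<std::thread> background;
    for (int i = 0; i < background_threads && mode != BackgroundMode::None; ++i) {
        if (mode == BackgroundMode::Yield) {
            // Stays runnable but does no computation.
            background.emplace_back([&stop] {
                while (!stop.load(std::memory_order_relaxed)) sched_yield();
            });
        } else {
            // Blocks in the kernel until released.
            background.emplace_back([&m, &cv, &released] {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&released] { return released; });
            });
        }
    }

    const auto start = std::chrono::steady_clock::now();
    count_to(limit);
    const auto end = std::chrono::steady_clock::now();

    // Let every background thread finish.
    stop.store(true, std::memory_order_relaxed);
    {
        std::lock_guard<std::mutex> lock(m);
        released = true;
    }
    cv.notify_all();
    for (auto& t : background) t.join();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Comparing `time_work_ms(BackgroundMode::Wait, 5000, 50000000)` with `time_work_ms(BackgroundMode::Yield, 500, 50000000)` mirrors the comparison described above, with the thread counts taken from the text.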

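Beware of Misleading System Monitoring

While troubleshooting, we fell into quite a few monitoring traps; one was dug by the system monitoring tool top. In top, the number of running threads was lower than expected, often hovering in the single digits, which misled us into thinking the problem had something to do with context switching in the system. Yet the host's CPU usage could reach 80%. On reflection that did not add up: if only a handful or a dozen or so threads were working most of the time, a host with 40 logical cores could not possibly reach such high CPU usage. What was going on?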

//Entry Point
static void procs_refresh (void) {
   ...
   read_something = Thread_mode ? readeither : readproc;

   for (;;) {
      ...
      // on the way to n_alloc, the library will allocate the underlying
      // proc_t storage whenever our private_ppt[] pointer is NULL...
      // read_something() is function readeither() in Thread_mode!
      if (!(ptask = read_something(PT, private_ppt[n_used]))) break;
      procs_hlp((private_ppt[n_used] = ptask));  // tally this proc_t
   }

   closeproc(PT);
   ...
} // end: procs_refresh

// read_something points to readeither() in Thread_mode;
// readeither: return a pointer to a proc_t filled with requested info about
// the next unique process or task available.  If no more are available,
// return a null pointer (boolean false).  Use the passed buffer instead
// of allocating space if it is non-NULL.
proc_t* readeither (PROCTAB *restrict const PT, proc_t *restrict x) {
    ...

next_proc:
    ...

next_task:
    // fills in our path, plus x->tid and x->tgid
    // find next thread
    if ((!(PT->taskfinder(PT,&skel_p,x,path)))   // simple_nexttid()
    || (!(ret = PT->taskreader(PT,new_p,x,path)))) { // simple_readtask
        goto next_proc;
    }
    if (!new_p) {
        new_p = ret;
        canary = new_p->tid;
    }
    return ret;

end_procs:
    if (!saved_x) free(x);
    return NULL;
}

// PT->taskfinder points to simple_nexttid();
// This finds tasks in /proc/*/task/ in the traditional way.
// Return non-zero on success.
static int simple_nexttid(PROCTAB *restrict const PT, const proc_t *restrict const p, proc_t *restrict const t, char *restrict const path) {
  static struct dirent *ent;        /* dirent handle */
  if(PT->taskdir_user != p->tgid){ // init
    if(PT->taskdir){
      closedir(PT->taskdir); 
    }
    // use "path" as some tmp space
    // get iterator of directory  /proc/[PID]/task
    snprintf(path, PROCPATHLEN, "/proc/%d/task", p->tgid);
    PT->taskdir = opendir(path);
    if(!PT->taskdir) return 0;
    PT->taskdir_user = p->tgid;
  }
  for (;;) { // iterate files in current directory
    ent = readdir(PT->taskdir); // read state file of a thread
    if(unlikely(unlikely(!ent) || unlikely(!ent->d_name[0]))) return 0;
    if(likely(likely(*ent->d_name > '0') && likely(*ent->d_name <= '9'))) break;
  }
  ...
  return 1;
}

// how top tallies thread states
switch (this->state) {
      case 'R':
         Frame_running++;
         break;
      case 't':     // 't' (tracing stop)
      case 'T':
         Frame_stopped++;
         break;
      case 'Z':
         Frame_zombied++;
         break;
      default:
         /* the following states are counted as sleeping state
            currently: 'D' (disk sleep),
                       'I' (idle),
                       'P' (parked),
                       'S' (sleeping),
                       'X' (dead - actually 'dying' & probably never seen)
         */
         Frame_sleepin++;
         break;
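   }

<center>Figure 10: top source code analysis</center>

After analyzing the top source code, we finally understood why. What top shows is not an instantaneous snapshot, because top does not stop the process. As the code above shows, top scans the thread list as it was at that moment and then fetches each thread's state one by one. The process keeps running throughout, so threads started after top scanned the list are not recorded, while some of the old threads have already exited, and exited threads are counted as sleeping. So for workloads that create and release threads frequently at high concurrency, the running count shown in top will be too low.

In short, for programs that create and release threads frequently, the running thread count in top is not an accurate metric.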
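The same non-atomic sampling is easy to reproduce. The sketch below walks /proc/self/task in the spirit of the top code above and tallies the state letter of each thread; it is a simplified illustration, not top's implementation, and `parse_state` and `count_thread_states` are hypothetical names. Threads that exit between the directory scan and the read are skipped, and threads created after the scan are never seen, which is exactly why the result is not a consistent snapshot.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <dirent.h>
#include <fstream>
#include <map>
#include <string>

// Extracts the state letter ('R', 'S', 'D', ...) from a /proc/.../stat line.
// The state is the first field after the closing ')' of the command name.
// Returns '?' if the line cannot be parsed.
inline char parse_state(const std::string& stat_line) {
    const std::size_t close = stat_line.rfind(')');
    if (close == std::string::npos || close + 2 >= stat_line.size()) {
        return '?';
    }
    return stat_line[close + 2];
}

// Tallies thread states of the current process, one thread at a time.
inline std::map<char, int> count_thread_states() {
    std::map<char, int> counts;
    DIR* dir = opendir("/proc/self/task");
    if (dir == nullptr) {
        return counts;  // /proc not available
    }
    while (dirent* ent = readdir(dir)) {
        // Thread directories are named by numeric thread ID.
        if (!std::isdigit(static_cast<unsigned char>(ent->d_name[0]))) {
            continue;
        }
        std::ifstream in(std::string("/proc/self/task/") + ent->d_name + "/stat");
        std::string line;
        if (std::getline(in, line)) {
            ++counts[parse_state(line)];
        }
        // A thread that exited after readdir() simply fails to open and is skipped.
    }
    closedir(dir);
    return counts;
}
```

Calling `count_thread_states()` repeatedly in a process that churns through short-lived threads typically returns different tallies from one call to the next.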

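In addition, TiFlash is organized as a pipeline. As data flows through the pipeline, any given piece of data is at only one stage at a time; an operator processes data when it has some and waits when it does not (in GDB, most threads are in this waiting state). Most stages of the pipeline spend their time waiting for data and finish quickly once data arrives. Monitoring does not stop the whole TiFlash process, so for each thread there is a high probability of sampling it in the waiting state.

Lessons Learned

Some of the methods used throughout this investigation are worth keeping for future development and troubleshooting:

In multithreaded development, use thread pools, coroutines, and similar techniques wherever possible to avoid frequent thread creation and release.
Reproduce the problem in as simple an environment as possible to reduce factors that interfere with troubleshooting.
Control the number of running threads; once it exceeds the number of CPU cores, extra context-switch cost appears.
When developing code where threads wait for resources, prefer locks, condition variables (cv), and the like. If you use sleep, make the sleep interval reasonably long to reduce unnecessary thread wakeups.
Treat monitoring tools critically. When analysis results contradict monitoring data, do not rule out questioning the monitoring tool itself. Also read the tool's documentation and metric descriptions carefully to avoid misreading the metrics.
For troubleshooting multithreaded hangs, slowness, and contention: use pstack and GDB to inspect the state of each thread.
For finding performance hotspots: use perf and flamegraph.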
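As an illustration of the first recommendation, here is a minimal fixed-size thread pool. It is a sketch under simplifying assumptions, not TiFlash's implementation: `SimpleThreadPool` is a hypothetical name, tasks are restricted to callables returning int, and there is no work stealing, priority, or dynamic resizing. The point is that worker threads are created once and reused, so submitting a task never goes through pthread_create.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class SimpleThreadPool {
public:
    explicit SimpleThreadPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Drains the remaining tasks, then joins all workers.
    ~SimpleThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    SimpleThreadPool(const SimpleThreadPool&) = delete;
    SimpleThreadPool& operator=(const SimpleThreadPool&) = delete;

    // Queues a task and returns a future for its result.
    std::future<int> submit(std::function<int()> task) {
        auto packaged = std::make_shared<std::packaged_task<int()>>(std::move(task));
        std::future<int> result = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push([packaged] { (*packaged)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    // Worker loop: wait for a job, run it outside the lock, repeat.
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
                if (queue_.empty()) {
                    return;  // stopping and nothing left to do
                }
                job = std::move(queue_.front());
                queue_.pop();
            }
            job();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

A caller creates the pool once, for example `SimpleThreadPool pool(4);`, and then calls `pool.submit([] { return 42; }).get()` wherever it would otherwise have started and joined a temporary thread.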