Analysis of Blocking Caused by ANALYZE TABLE

Published by Ctrip DBA on 2020-08-17

Background

  • Problem description

Our DBAs received an alert that QPS had dropped sharply, from 20k down to 10k, and then recovered on its own.
Based on the Analysis Report, we saw a large number of sessions with STATE: Waiting for table flush, so we examined the snapshot taken during the incident and summarized the abnormal findings as follows:

0. Server version: 5.7.23-ctrip-log MySQL Community Server - (GPL)
1. QPS dropped significantly
2. thread_running was high
3. Many sessions with STATE: Waiting for table flush
4. Table being waited on in Waiting for table flush: always the same table
5. A scheduled job ran during this window: an ANALYZE TABLE operation

Based on the above, we quickly came to suspect ANALYZE TABLE.
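
For reference, a quick way to spot the affected sessions during such an incident is to query the processlist; a minimal sketch (the state string matches what we saw in the snapshot):

SELECT id, time, state, LEFT(info, 80) AS info
  FROM information_schema.processlist
 WHERE state = 'Waiting for table flush'
 ORDER BY time DESC;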

  • Why run ANALYZE TABLE

When the optimizer decides which index to use for a query, it relies on the statistics supplied by the table's storage engine. If those statistics are stale, the optimizer may pick the wrong index when building the execution plan, which degrades performance.

To guard against this, the storage engine supports both automatic and manual statistics updates. Automatic updates usually work well, but in some situations they fail to do the job correctly.

MySQL gathers index statistics by sampling, which costs far less than scanning the whole table row by row. When sampling, InnoDB reads N pages (16KB each by default), counts the distinct values on them, averages those counts, and multiplies the average by the number of pages in the index to obtain the index cardinality.
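
Both the resulting cardinality estimate and the sample size can be inspected directly; a sketch, with mydb and mytable as placeholder names:

SHOW INDEX FROM mydb.mytable;  -- the Cardinality column holds the sampled estimate

SELECT index_name, stat_name, stat_value, sample_size
  FROM mysql.innodb_index_stats
 WHERE database_name = 'mydb' AND table_name = 'mytable';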

1) Automatic updates:

innodb_stats_persistent=ON: statistics are persisted to disk. The algorithm: sample 20 16KB pages (innodb_stats_persistent_sample_pages); once more than 1/10 of the rows have been modified, the statistics are refreshed automatically;

innodb_stats_persistent=OFF: statistics are kept in memory. The algorithm: sample 8 16KB pages (innodb_stats_transient_sample_pages); once more than 1/16 of the rows have been modified, the statistics are refreshed automatically;
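
These sampling knobs are plain server variables and can be checked on a live instance; the values in the comments are the 5.7 defaults:

SHOW GLOBAL VARIABLES LIKE 'innodb_stats%';
-- innodb_stats_persistent              = ON
-- innodb_stats_persistent_sample_pages = 20
-- innodb_stats_transient_sample_pages  = 8
-- innodb_stats_auto_recalc             = ON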

2) Manual updates:

Write your own script that calls ANALYZE TABLE on a schedule (a minimal sketch appears after the list below).

3) Weighing the options:

The method we chose is to update statistics periodically, for the following reasons:

a) We can schedule the runs during off-peak hours
b) Automatic updates never fire if the trigger threshold is not reached
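
A minimal sketch of such a scheduled job (table names are placeholders); NO_WRITE_TO_BINLOG keeps the statement out of the binary log so it does not replicate to slaves:

ANALYZE NO_WRITE_TO_BINLOG TABLE mydb.orders, mydb.order_items;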

Reproducing and Analyzing the Problem

Reproduction

In this timeline, conn1 holds the table open, conn2 runs the admin statements, and conn3 runs the test queries:

| Time | conn1 | conn2 | conn3 |
|------|-------|-------|-------|
| 1  |  | show open tables like 'mychecksums' [In_use=0] |  |
| 2  |  | ANALYZE NO_WRITE_TO_BINLOG TABLE mychecksums; [succeeds] |  |
| 3  |  |  | select * from mychecksums limit 1; [succeeds] |
| 4  | SELECT SLEEP(100) FROM mychecksums; [enters sleep] |  |  |
| 5  |  | show open tables like 'mychecksums' [In_use=1] |  |
| 6  |  | ANALYZE NO_WRITE_TO_BINLOG TABLE mychecksums; [succeeds] |  |
| 7  |  |  | select * from mychecksums limit 1; [blocked] |
| 8  | [connection is KILLed] |  | query result returns [succeeds] |
| 9  |  | ANALYZE NO_WRITE_TO_BINLOG TABLE mychecksums; [succeeds] |  |
| 10 | SELECT SLEEP(100) FROM mychecksums; [enters sleep] |  |  |
| 11 |  | show open tables like 'mychecksums' [In_use=1] |  |
| 12 |  |  | select * from mychecksums limit 1; [succeeds] |

The simulation above shows:

  • If, before ANALYZE TABLE mychecksums runs, the table is not open and in use (m_open_in_progress=0), subsequent queries succeed.
  • If, before ANALYZE TABLE mychecksums runs, the table is already open and in use (m_open_in_progress=1), subsequent queries are blocked.
    • If the connection that previously referenced the table is killed, the blocked query completes immediately.
  • If the table is opened and used (m_open_in_progress=1) only after ANALYZE TABLE mychecksums has finished, subsequent queries succeed.

Blocking Stack

  • conn3 select * from mychecksums limit 1
Thread 22 (Thread 0x7f4cbc246700 (LWP 2981)):   conn3 select * from mychecksums limit 1
#0  0x00007f4d870c4cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000c1bf59 in native_cond_timedwait (abstime=0x7f4cbc242700, mutex=0x7f1c27005098, cond=0x7f1c270050c8) at xx/mysql-5.7.23/mysql-5.7.23/include/thr_cond.h:129
#2  my_cond_timedwait (abstime=0x7f4cbc242700, mp=0x7f1c27005098, cond=0x7f1c270050c8) at xx/mysql-5.7.23/mysql-5.7.23/include/thr_cond.h:182
#3  inline_mysql_cond_timedwait (src_file=0x142f2a8 "xx/mysql-5.7.23/mysql-5.7.23/sql/mdl.cc", src_line=1861, abstime=<optimized out>, mutex=0x7f1c27005098, that=<optimized out>) at xx/mysql-5.7.23/mysql-5.7.23/include/mysql/psi/mysql_thread.h:1229
#4  MDL_wait::timed_wait (this=this@entry=0x7f1c27005098, owner=owner@entry=0x7f1c27005000, abs_timeout=abs_timeout@entry=0x7f4cbc242700, set_status_on_timeout=set_status_on_timeout@entry=true, wait_state_name=<optimized out>) at xx/mysql-5.7.23/mysql-5.7.23/sql/mdl.cc:1861
#5  0x0000000000d6d5b4 in TABLE_SHARE::wait_for_old_version (this=this@entry=0x7f1c270b5130, thd=thd@entry=0x7f1c27005000, abstime=abstime@entry=0x7f4cbc242700, deadlock_weight=deadlock_weight@entry=0) at xx/mysql-5.7.23/mysql-5.7.23/sql/table.cc:4611
#6  0x0000000000c71e71 in tdc_wait_for_old_version (deadlock_weight=0, wait_timeout=31536000, table_name=<optimized out>, db=<optimized out>, thd=0x7f1c27005000) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_base.cc:2927
#7  open_table (thd=thd@entry=0x7f1c27005000, table_list=table_list@entry=0x7f1c2713b578, ot_ctx=ot_ctx@entry=0x7f4cbc242bf0) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_base.cc:3517
#8  0x0000000000c79402 in open_and_process_table (flags=0, ot_ctx=0x7f4cbc242bf0, has_prelocking_list=false, prelocking_strategy=0x7f4cbc242c80, counter=0x7f1c27007240, tables=0x7f1c2713b578, lex=<optimized out>, thd=0x7f1c27005000) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_base.cc:5144
#9  open_tables (thd=thd@entry=0x7f1c27005000, start=start@entry=0x7f4cbc242c78, counter=<optimized out>, flags=flags@entry=0, prelocking_strategy=prelocking_strategy@entry=0x7f4cbc242c80) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_base.cc:5755
#10 0x0000000000c79d5a in open_tables_for_query (thd=thd@entry=0x7f1c27005000, tables=tables@entry=0x7f1c2713b578, flags=flags@entry=0) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_base.cc:6530
#11 0x00000000007769fb in execute_sqlcom_select (thd=thd@entry=0x7f1c27005000, all_tables=0x7f1c2713b578) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_parse.cc:5099
#12 0x0000000000ccc3c4 in mysql_execute_command (thd=thd@entry=0x7f1c27005000, first_level=first_level@entry=true) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_parse.cc:2814
#13 0x0000000000cce9dd in mysql_parse (thd=thd@entry=0x7f1c27005000, parser_state=parser_state@entry=0x7f4cbc243d00) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_parse.cc:5554
#14 0x0000000000ccf55d in dispatch_command (thd=thd@entry=0x7f1c27005000, com_data=com_data@entry=0x7f4cbc2445a0, command=COM_QUERY) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_parse.cc:1484
#15 0x0000000000cd0f7f in do_command (thd=thd@entry=0x7f1c27005000) at xx/mysql-5.7.23/mysql-5.7.23/sql/sql_parse.cc:1025
#16 0x0000000000d93030 in handle_connection (arg=arg@entry=0x7f24d193d820) at xx/mysql-5.7.23/mysql-5.7.23/sql/conn_handler/connection_handler_per_thread.cc:308
#17 0x000000000127a504 in pfs_spawn_thread (arg=0x7f24d195dd20) at xx/mysql-5.7.23/mysql-5.7.23/storage/perfschema/pfs.cc:2444
#18 0x00007f4d870c0e25 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f4d85b7d34d in clone () from /lib64/libc.so.6

Root Cause Analysis

  • The key factors in reproducing this problem:
1. Long-running transactions
2. Concurrent transactions
3. Running ANALYZE TABLE before the long transaction ends

Long transactions and concurrent transactions serve the same purpose here: both raise the probability that m_open_in_progress=1, and with it the probability of hitting the lock.

The problem we encountered this time was caused by heavy concurrent transactions. (For the long-transaction ingredient, a detection sketch follows the sequence of events below.)
  • The sequence of events
1. A table is opened, setting m_open_in_progress=1
2. ANALYZE TABLE mychecksums
    2.1 Opens the table and updates its statistics
    2.2 tdc_remove_table: evicts the table from the table definition cache, invalidating it. Before the eviction can complete, every earlier reference to the table must be released.
        2.2.1 In the meantime, any new query against the table finds the in-memory object's version out of date and has to wait; this is where the blocking happens. (mysql_admin_table --> tdc_remove_table::execute) (REPAIR TABLE and CHECK TABLE behave the same way)
        2.2.2 The eviction completes
        2.2.3 Subsequent access is unaffected
    2.3 query_cache.invalidate(thd, table, FALSE)
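
As for the long-transaction ingredient mentioned above, a sketch of how it can be spotted ahead of time (the 60-second threshold is an arbitrary assumption):

SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query
  FROM information_schema.innodb_trx
 WHERE trx_started < NOW() - INTERVAL 60 SECOND;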

Relevant Code

  • sql/sql_admin.cc
    if (table->table)
    {
      if (table->table->s->tmp_table)
      {
        /*
          If the table was not opened successfully, do not try to get
          status information. (Bug#47633)
        */
        if (open_for_modify && !open_error)
          table->table->file->info(HA_STATUS_CONST);
      }
      else if (open_for_modify || fatal_error)
      {
        tdc_remove_table(thd, TDC_RT_REMOVE_UNUSED,
                         table->db, table->table_name, FALSE);
        /*
          May be something modified. Consequently, we have to
          invalidate the query cache.
        */
        table->table= 0;                        // For query cache
        query_cache.invalidate(thd, table, FALSE);
      }


  • sql/sql_base.cc
/**
   Remove all or some (depending on parameter) instances of TABLE and
   TABLE_SHARE from the table definition cache.
   @param  thd          Thread context
   @param  remove_type  Type of removal:
                        TDC_RT_REMOVE_ALL     - remove all TABLE instances and
                                                TABLE_SHARE instance. There
                                                should be no used TABLE objects
                                                and caller should have exclusive
                                                metadata lock on the table.
                        TDC_RT_REMOVE_NOT_OWN - remove all TABLE instances
                                                except those that belong to
                                                this thread. There should be
                                                no TABLE objects used by other
                                                threads and caller should have
                                                exclusive metadata lock on the
                                                table.
                        TDC_RT_REMOVE_UNUSED  - remove all unused TABLE
                                                instances (if there are no
                                                used instances will also
                                                remove TABLE_SHARE).
                        TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE -
                                                remove all TABLE instances
                                                except those that belong to
                                                this thread, but don't mark
                                                TABLE_SHARE as old. There
                                                should be no TABLE objects
                                                used by other threads and
                                                caller should have exclusive
                                                metadata lock on the table.
   @param  db           Name of database
   @param  table_name   Name of table
   @param  has_lock     If TRUE, LOCK_open is already acquired
   @note It assumes that table instances are already not used by any
   (other) thread (this should be achieved by using meta-data locks).
*/

void tdc_remove_table(THD *thd, enum_tdc_remove_table_type remove_type,
                      const char *db, const char *table_name,
                      bool has_lock)
{
  char key[MAX_DBKEY_LENGTH];
  size_t key_length;
  TABLE_SHARE *share;

  if (! has_lock)
    table_cache_manager.lock_all_and_tdc();
  else
    table_cache_manager.assert_owner_all_and_tdc();

  DBUG_ASSERT(remove_type == TDC_RT_REMOVE_UNUSED ||
              thd->mdl_context.owns_equal_or_stronger_lock(MDL_key::TABLE,
                                 db, table_name, MDL_EXCLUSIVE));

  key_length= create_table_def_key(thd, key, db, table_name, false);

  if ((share= (TABLE_SHARE*) my_hash_search(&table_def_cache,(uchar*) key,
                                            key_length)))
  {
    /*
      Since share->ref_count is incremented when a table share is opened
      in get_table_share(), before LOCK_open is temporarily released, it
      is sufficient to check this condition alone and ignore the
      share->m_open_in_progress flag.
      Note that it is safe to call table_cache_manager.free_table() for
      shares with m_open_in_progress == true, since such shares don't
      have any TABLE objects associated.
    */
    if (share->ref_count)
    {
      /*
        Set share's version to zero in order to ensure that it gets
        automatically deleted once it is no longer referenced.
        Note that code in TABLE_SHARE::wait_for_old_version() assumes
        that marking share as old and removal of its unused tables
        and of the share itself from TDC happens atomically under
        protection of LOCK_open, or, putting it another way, that
        TDC does not contain old shares which don't have any tables
        used.
      */
      if (remove_type != TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE)
        share->version= 0;
      table_cache_manager.free_table(thd, remove_type, share);
    }
    else
    {
      DBUG_ASSERT(remove_type != TDC_RT_REMOVE_NOT_OWN_KEEP_SHARE);
      (void) my_hash_delete(&table_def_cache, (uchar*) share);
    }
  }

  if (! has_lock)
    table_cache_manager.unlock_all_and_tdc();
}


  • sql/sql_base.cc
/**
  Check if table's share is being removed from the table definition
  cache and, if yes, wait until the flush is complete.
  @param thd             Thread context.
  @param db              Database name.
  @param table_name      Table name.
  @param wait_timeout    Timeout for waiting.
  @param deadlock_weight Weight of this wait for deadlock detector.
  @retval false   Success. Share is up to date or has been flushed.
  @retval true    Error (OOM, our was killed, the wait resulted
                  in a deadlock or timeout). Reported.
*/

static bool tdc_wait_for_old_version(THD *thd, const char *db,
                                     const char *table_name, ulong wait_timeout,
                                     uint deadlock_weight) {
  TABLE_SHARE *share;
  bool res = false;

  mysql_mutex_lock(&LOCK_open);
  if ((share = get_cached_table_share(thd, db, table_name)) &&
      share->has_old_version()) {
    struct timespec abstime;
    set_timespec(&abstime, wait_timeout);
    res = share->wait_for_old_version(thd, &abstime, deadlock_weight);
  }
  mysql_mutex_unlock(&LOCK_open);
  return res;
}

  • sql/sql_base.cc
/**
  Get an existing table definition from the table definition cache.
  Search the table definition cache for a share with the given key.
  If the share exists, check the m_open_in_progress flag. If true,
  the share is in the process of being opened by another thread,
  so we must wait for the opening to finish. This may make the share
  be destroyed, if open_table_def() fails, so we must repeat the search
  in the hash table. Return the share.
  @note While waiting for the condition variable signaling that a
        table share is completely opened, the thread will temporarily
        release LOCK_open. Thus, the caller cannot rely on LOCK_open
        being held for the duration of the call.
  @param thd        thread descriptor
  @param db         database name
  @param table_name table name
  @retval NULL      a share for the table does not exist in the cache
  @retval != NULL   pointer to existing share in the cache
*/

TABLE_SHARE *get_cached_table_share(THD *thd, const char *db,
                                    const char *table_name)
{
  char key[MAX_DBKEY_LENGTH];
  size_t key_length;
  TABLE_SHARE *share= NULL;
  mysql_mutex_assert_owner(&LOCK_open);

  key_length= create_table_def_key((THD*) 0, key, db, table_name, 0);
  while ((share= reinterpret_cast<TABLE_SHARE*>(
                     my_hash_search(&table_def_cache,
                       reinterpret_cast<uchar*>(const_cast<char*>(key)),
                       key_length))))
  {
    if (!share->m_open_in_progress)
      break;

    DEBUG_SYNC(thd, "get_cached_share_cond_wait");
    mysql_cond_wait(&COND_open, &LOCK_open);
  }
  return share;
}



  • sql/table.cc
/**
  Wait until the subject share is removed from the table
  definition cache and make sure it's destroyed.
  @note This method may access the share concurrently with another
  thread if the share is in the process of being opened, i.e., that
  m_open_in_progress is true. In this case, close_cached_tables() may
  iterate over elements in the table definition cache, and call this
  method regardless of the share being opened or not. This works anyway
  since a new flush ticket is added below, and LOCK_open ensures
  that the share may not be destroyed by another thread in the time
  between finding this share (having an old version) and adding the flush
  ticket. Thus, after this thread has added the flush ticket, the thread
  opening the table will eventually call free_table_share (as a result of
  releasing the share after using it, or as a result of a failing
  open_table_def()), which will notify the owners of the flush tickets,
  and the last one being notified will actually destroy the share.
  @param mdl_context     MDL context for thread which is going to wait.
  @param abstime         Timeout for waiting as absolute time value.
  @param deadlock_weight Weight of this wait for deadlock detector.
  @pre LOCK_open is write locked, the share is used (has
       non-zero reference count), is marked for flush and
       this connection does not reference the share.
       LOCK_open will be unlocked temporarily during execution.
  @retval FALSE - Success.
  @retval TRUE  - Error (OOM, deadlock, timeout, etc...).
*/

bool TABLE_SHARE::wait_for_old_version(THD *thd, struct timespec *abstime,
                                       uint deadlock_weight)
{
  MDL_context *mdl_context= &thd->mdl_context;
  Wait_for_flush ticket(mdl_context, this, deadlock_weight);
  MDL_wait::enum_wait_status wait_status;

  mysql_mutex_assert_owner(&LOCK_open);
  /*
    We should enter this method only when share's version is not
    up to date and the share is referenced. Otherwise our
    thread will never be woken up from wait.
  */
  DBUG_ASSERT(version != refresh_version && ref_count != 0);

  m_flush_tickets.push_front(&ticket);

  mdl_context->m_wait.reset_status();

  mysql_mutex_unlock(&LOCK_open);

  mdl_context->will_wait_for(&ticket);

  mdl_context->find_deadlock();

  DEBUG_SYNC(thd, "flush_complete");

  wait_status= mdl_context->m_wait.timed_wait(thd, abstime, TRUE,
                                              &stage_waiting_for_table_flush);

  mdl_context->done_waiting_for();

  mysql_mutex_lock(&LOCK_open);

  m_flush_tickets.remove(&ticket);

  if (m_flush_tickets.is_empty() && ref_count == 0)
  {
    /*
      If our thread was the last one using the share,
      we must destroy it here.
    */
    destroy();
  }

  DEBUG_SYNC(thd, "share_destroyed");

  /*
    In cases when our wait was aborted by KILL statement,
    a deadlock or a timeout, the share might still be referenced,
    so we don't delete it. Note, that we can't determine this
    condition by checking wait_status alone, since, for example,
    a timeout can happen after all references to the table share
    were released, but before the share is removed from the
    cache and we receive the notification. This is why
    we first destroy the share, and then look at
    wait_status.
  */
  switch (wait_status)
  {
  case MDL_wait::GRANTED:
    return FALSE;
  case MDL_wait::VICTIM:
    my_error(ER_LOCK_DEADLOCK, MYF(0));
    return TRUE;
  case MDL_wait::TIMEOUT:
    my_error(ER_LOCK_WAIT_TIMEOUT, MYF(0));
    return TRUE;
  case MDL_wait::KILLED:
    return TRUE;
  default:
    DBUG_ASSERT(0);
    return TRUE;
  }
}


Generalizing the Problem

In production you will run into this scenario: a SQL statement that has always run fast suddenly turns slow. Why?

Insufficient resources

As everyone knows, a machine's resources are finite; this includes, but is not limited to, CPU, I/O, threads, and caches. When query volume spikes, these resources can run short, and SQL that used to be fast degrades.

The generic remedies for this class of problem: degradation and rate limiting.

Optimizer bugs

As everyone knows, the optimizer picks the most suitable index based on least cost and uses it to fetch data from the engine layer.
The cost here includes, but is not limited to: rows examined (Rows-exam), sorting, temporary tables, and so on.

A bug is a bug precisely because it defies reason, so we will not belabor the specifics. Here are a few remedies we commonly apply during routine tuning:

1) Drop duplicate indexes
2) Add appropriate indexes
3) FORCE INDEX: not a general approach, since it is invasive to the application code and rolling it out takes time (see the sketch after this list)
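
A sketch of option 3 (table and index names are hypothetical):

SELECT order_id, created_at
  FROM orders FORCE INDEX (idx_created_at)
 WHERE created_at >= '2020-08-01';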

The generic remedy for this class of problem: periodically check for duplicate and redundant indexes.

Inaccurate statistics

  • Why do statistics become inaccurate?
    • Because MySQL computes them by sampling physical data pages, the statistics go wrong when the data distribution is skewed.
    • In addition, a MySQL DELETE merely marks rows as deleted; the real removal is done by the purge threads. When certain rows cannot be purged for some reason (for example, the RR isolation level holding them to preserve a consistent view), new inserts are not done in place but consume fresh space, which also skews the statistics.

If cost were judged purely on rows examined (Rows-exam), then running ANALYZE TABLE to refresh the statistics would be enough to steer the optimizer to the right choice.
Real-world workloads are more complicated, though. We currently have no good way to decide which tables need their statistics refreshed; the simple heuristic is to compare the rows estimate in the execution plan against the number of rows the query actually touches.
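
A sketch of that heuristic; names and numbers are hypothetical:

EXPLAIN SELECT * FROM mydb.mytable WHERE created_at >= '2020-08-01';
-- rows: 1500    <- the optimizer's estimate, derived from sampled statistics

SELECT COUNT(*) FROM mydb.mytable WHERE created_at >= '2020-08-01';
-- 150000        <- the actual count; a gap this large suggests stale statistics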

Solutions

  • Operations layer
* Main causes
    - long query
        - Monitor long queries; when one is found, log it and optimize it later (see the example query after this list)
    - high concurrent query
        - Whitelist mechanism: add the query to the whitelist first, then optimize it to reduce concurrency

* Check on demand
    - Time dimension (full)
        - Changed from once a day to once a week
    - Volume dimension (incremental)
        - Since there is no good way to tell which tables need ANALYZE, the workaround is to pick the slow tables out of the slow log and check those
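
An example of the long-query check mentioned above; a sketch, where the 60-second threshold is an assumption:

SELECT id, user, time, state, LEFT(info, 100) AS query
  FROM information_schema.processlist
 WHERE command <> 'Sleep' AND time > 60
 ORDER BY time DESC;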
  • Source code layer
    • Since ANALYZE TABLE modifies neither the table definition nor the table data, we can simply skip tdc_remove_table during this operation.
    • This change has been merged into our 5.7.29 branch; the core logic is shown below.
* sql/sql_admin.cc

    if (table->table)
    {
      const bool skip_flush=
        (operator_func == &handler::ha_analyze) &&
        (table->table->file->ha_table_flags() & HA_ONLINE_ANALYZE);
      if (table->table->s->tmp_table)
      {
        /*
          If the table was not opened successfully, do not try to get
          status information. (Bug#47633)
        */
        if (open_for_modify && !open_error)
          table->table->file->info(HA_STATUS_CONST);
      }
      else if ((!skip_flush && open_for_modify) || fatal_error)
      {
        tdc_remove_table(thd, TDC_RT_REMOVE_UNUSED,
                         table->db, table->table_name, FALSE);
        /*
          May be something modified. Consequently, we have to
          invalidate the query cache.
        */
        table->table= 0;                        // For query cache
        query_cache.invalidate(thd, table, FALSE);
      }
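
With this patch in place (assuming the storage engine sets the HA_ONLINE_ANALYZE flag), re-running the reproduction above no longer blocks:

-- conn1: SELECT SLEEP(100) FROM mychecksums;            -- holds the table open
-- conn2: ANALYZE NO_WRITE_TO_BINLOG TABLE mychecksums;  -- no longer flushes the table from the TDC
-- conn3: SELECT * FROM mychecksums LIMIT 1;             -- returns immediately; no 'Waiting for table flush'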
 
  

Acknowledgements

The main contributors to this article are: 餘聰, 王棟, 餘建鋒, 聖勇進, 徐雲磊, 蘭春
