pt-online-schema-change 使用異常處理及注意事項

小亮520cl發表於2017-02-28

pt-online-schema-change最近使用pt-online-schema-change 做線上大表的線上DDL,發現幾個問題。
我使用的語句如下:
pt-online-schema-change --user=root --password="xxxxx" --host=192.168.xx.xx D=M_xx,t=T_xx  --alter "ADD Fxxxxx'" --charset=utf8 --no-check-replication-filters --alter-foreign-keys-method=auto --recursion-method=none --print --execute
在執行的過程中,有一個庫中斷了,錯誤資訊如下:
Copying `table_01005`.`T_xxx`:  19% 16:30 remain
Copying `table_01005`.`T_xxx`:  21% 16:21 remain
Copying `table_01005`.`T_xxx`:  22% 16:58 remain
2014-11-04T18:20:25 Dropping triggers...
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_del`;
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_upd`;
DROP TRIGGER IF EXISTS `table_01005`.`pt_osc_table_01005_T_xxx_ins`;
2014-11-04T18:20:28 Dropped triggers OK.
2014-11-04T18:20:28 Dropping new table...
DROP TABLE IF EXISTS `table_01005`.`_T_xxx_new`;
2014-11-04T18:20:30 Dropped new table OK.
`table_01005`.`T_xxx` was not altered.
2014-11-04T18:20:25 Error copying rows from `table_01005`.`T_xxx` to `table_01005`.`_T_xxx_new`: Threads_running=199 exceeds its critical threshold 50。

然後就中斷了,這個Threads_running 是活動的執行緒數。根據這個錯誤提示,查了下percona的文件,
簡單的根據錯誤提示檢視bug列表,找到一個bug,根據現象說是工具本身的一個bug。也正好和我們使用的版本一致,
那我就升級版本,完了再來,然後又重現了,顯示不是bug這麼簡單的事。還是再看錯誤資訊,提示為Threads_running 超過了警告的閥值。既然是閥值,
那是不是可以設定了,找來官網文件仔細瞅瞅,裡面有一個引數需要注意,--critical-load 。文件解釋如下:
--critical-load
type: Array; default: Threads_running=50
Examine SHOW GLOBAL STATUS after every chunk,
and abort if the load is too high. The option accepts a comma-separated list of MySQL status variables and thresholds.
An optional =MAX_VALUE (or :MAX_VALUE) can follow each variable. If not given,
the tool determines a threshold by examining the current value at startup and doubling it.
See --max-load for further details. These options work similarly,
except that this option will abort the tool’s operation instead of pausing it,
 and the default value is computed differently if you specify no threshold.
 The reason for this option is as a safety check in case the triggers on the
 original table add so much load to the server that it causes downtime.
 There is probably no single value of Threads_running that is wrong for
 every server, but a default of 50 seems likely to be unacceptably high
 for most servers, indicating that the operation should be canceled immediately.

大致的意思如下:
每次chunk操作前後,會根據show global status統計指定的狀態量的變化,預設是統計Thread_running。
目的是為了安全,防止原始表上的觸發器引起負載過高。這也是為了防止線上DDL對線上的影響。
超過設定的閥值,就會終止操作,線上DDL就會中斷。提示的異常如上報錯資訊。

和這個引數有的類似的還有一個--max-load :
--max-load
type: Array; default: Threads_running=25
Examine SHOW GLOBAL STATUS after every chunk, and pause if any status variables are higher than their thresholds.
The option accepts a comma-separated list of MySQL status variables. An optional =MAX_VALUE (or :MAX_VALUE) can
follow each variable. If not given, the tool determines a threshold by examining the current value and increasing it by 20%.

For example, if you want the tool to pause when Threads_connected gets too high, you can specify “Threads_connected”,
 and the tool will check the current value when it starts working and add 20% to that value. If the current value is 100,
 then the tool will pause when Threads_connected exceeds 120, and resume working when it is below 120 again. If you want to
 specify an explicit threshold, such as 110, you can use either “Threads_connected:110” or “Threads_connected=110”.

The purpose of this option is to prevent the tool from adding too much load to the server. If the data-copy queries are
intrusive, or if they cause lock waits, then other queries on the server will tend to block and queue. This will typically
cause Threads_running to increase, and the tool can detect that by running SHOW GLOBAL STATUS immediately after each query finishes.
If you specify a threshold for this variable, then you can instruct the tool to wait until queries are running normally again. This will
not prevent queueing, however; it will only give the server a chance to recover from the queueing. If you notice queueing, it is best to decrease the chunk time.

--max-load 選項定義一個閥值,在每次chunk操作後,檢視show global status狀態值是否高於指定的閥值。該引數接受一個mysql status狀態變數以及一個閥值,
如果沒有給定閥值,則定義一個閥值為為高於當前值的20%。
注意這個引數不會像--critical-load終止操作,而只是暫停操作。當status值低於閥值時,則繼續往下操作。
是暫停還是終止操作這是--max-load和--critical-load的差別。

引數值為列表形式,可以指定show global status出現的狀態值。比如,Thread_connect 等等。
格式如下:--critical-load="Threads_running=200"  或者--critical-load="Threads_running:200"。

根據引數要求,修改了Threads_runnings的閥值,成功執行了。所以文件還是很重要的,在使用一個工具之前,文件是必須要看的,
而不只是簡單的使用就完事,必須要對使用過程中可能出現的異常錯誤進行處理。

幾個注意事項:

The tool refuses to operate if it detects replication filters. See --[no]check-replication-filters for details.
The tool pauses the data copy operation if it observes any replicas that are delayed in replication. See --max-lag for details.
The tool pauses or aborts its operation if it detects too much load on the server. See --max-load and --critical-load for details.
The tool sets its lock wait timeout to 1 second so that it is more likely to be the victim of any lock contention, and less likely to disrupt other transactions. See --lock-wait-timeout for details.
The tool refuses to alter the table if foreign key constraints reference it, unless you specify --alter-foreign-keys-method.
The tool cannot alter MyISAM tables on “Percona XtraDB Cluster” nodes.


pt-online-schema-change 線上DDL工具,雖然說不會鎖表,但是對效能還是有一定的影響,執行過程中對全表做一次select。
這個過程會將buffer_cache中活躍資料全部交換一遍,這就導致活躍資料的請求都要從磁碟獲取,導致慢SQL增多,file_reads增大。


 

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29096438/viewspace-2134416/,如需轉載,請註明出處,否則將追究法律責任。

相關文章