目錄
mysql複製那點事 - Seconds_behind_Master引數調查筆記
0. 參考文獻
ps:如上的參考資料並非完全在本文中被引用,其他沒有被引用的資料在本文的形成的工作中,提供了思路上和其他方面的參考,因此一併列入參考文獻。感謝如上的文獻作者提供的參考。
1. 問題背景
部門當前的資料庫架構是雙主模式,既線上由2臺互為主從的資料庫搭建而成的叢集。高可用通過vip和headbeat來做保證。通常情況下,vip掛在主(本文稱之為da)上,當da發生了異常比如當機等問題的時候,vip自動漂移至從(本文稱之為dp)。架構如下圖所示:
可以看出,所有的寫入和讀取操作都在da上進行。da產生的binlog會被dp拉取重放。同時dp也會產生binlog並被da拉取重放(注:這裡不會產生迴圈複製(既da產生的binlog事件在dp執行過了之後,再被da讀取到再執行),因為binlog中每個event都有個serverid標記,標記是哪個server產生的事物。當da讀取到binlog的時候,發現serverid和本機的serverid一致,便會跳過當前的binlog event)。正常情況下,da的Seconds_behind_Master引數,應該保持在0或者很小的數值才對。但是監控系統發現,da的Seconds_behind_Master引數經常發生跳變,上一秒為0下一秒就可能為一個很大值(比如10000)。例如下面的資料,是通過指令碼抓取到的線上Seconds_behind_Master情況:
2019-06-27 13:26:00
Read_Master_Log_Pos: 472851480
Relay_Log_File: da01-relay-bin.002323
Relay_Log_Pos: 371
Relay_Master_Log_File: dp01-bin.004267
Exec_Master_Log_Pos: 472851480
Seconds_Behind_Master: 0
2019-06-27 13:26:01
Read_Master_Log_Pos: 473594815
Relay_Log_File: da01-relay-bin.002323
Relay_Log_Pos: 371
Relay_Master_Log_File: dp01-bin.004267
Exec_Master_Log_Pos: 473594649
Seconds_Behind_Master: 622
2019-06-27 13:26:02
Read_Master_Log_Pos: 474422859
Relay_Log_File: da01-relay-bin.002323
Relay_Log_Pos: 371
Relay_Master_Log_File: dp01-bin.004267
Exec_Master_Log_Pos: 474422859
Seconds_Behind_Master: 0
2. 調查結論
前文講到了本文調查的問題背景,在這裡先給出下整個問題的調查結論。在sql/rpl_slave.cc 中計算Seconds_Behind_Master邏輯如下 (文獻2):
if ((mi->get_master_log_pos() == mi->rli->get_group_master_log_pos()) && (!strcmp(mi->get_master_log_name(), mi->rli->get_group_master_log_name()))) {
if (mi->slave_running == MYSQL_SLAVE_RUN_CONNECT)
protocol->store(0LL);
else
protocol->store_null();
} else {
long time_diff= ((long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master);
protocol->store((longlong)(mi->rli->last_master_timestamp ? max(0L, time_diff) : 0));
}
當Exec_Master_Log_Pos <Read_Master_Log_Pos 的時候進入else裡面。在da的io程式拉取dp的binlog的時候,會發現dp的binlog的serverid是自身產生的。 所以不會去更新last_master_timestamp。如果dp的binlog發生過輪轉,會產生一個ROTATE_EVENT事件更新這個last_master_timestamp。因此當sql程式比io程式慢的時候, 計算出來的time_diff 便是當前時間與上次ROTATE_EVENT事件的差值。復現方法:
- dp上flush logs
- da上stop slave sql_thread;
- da上DML資料。此時觀察show slave status 可以看到Read_Master_Log_Pos 很超前。
- da上start slave ;show slave status \G 可以看到Seconds_Behind_Master變大。
3. 調查與分析過程
從前文的結果中可以看出,跟Seconds_Behind_Master(下文簡稱SBM)值有關的mysql 變數就是last_master_timestamp。因此需要追蹤下在收到的binlog的serverid和mysql程式的serverid一致的情況下,last_master_timestamp是如何更新的。根據前文的復現方法,需要去確認如下的2個點:
- dp上flush logs的時候,da獲取到dp的binglog做的操作。
- 收到的serverid和本機一致的binlog的時候,da的mysql程式的操作。
為了便於觀察mysql的執行邏輯,本文參考文獻9的辦法,編譯了一個debug版本。在mysql執行的過程中,tail -f mysqld.trace檔案觀察mysql命令執行的過程。
3.1 輪轉binlog時的執行邏輯
3.1.1 寫入ROTATE_EVENT
在dp 執行flush logs 或者輪轉binlog的時候,會產生一個ROTATE_EVENT事件(具體的解析可以參考文獻11)。在mysqld.trace檔案中,可以觀察到如下的資訊,標誌著da此時接收到了dp的binlog輪轉事件的ROTATE_EVENT。
handle_slave_io: info: IO thread received event of type Rotate
相應的程式碼在sql/rpl_slave.cc檔案的5819行中:
5817 THD_STAGE_INFO(thd, stage_queueing_master_event_to_the_relay_log);
5818 event_buf= (const char*)mysql->net.read_pos + 1;
5819 DBUG_PRINT("info", ("IO thread received event of type %s",
5820 Log_event::get_type_str(
5821 (Log_event_type)event_buf[EVENT_TYPE_OFFSET])));
... ...
5831
5832 /* XXX: 'synced' should be updated by queue_event to indicate
5833 whether event has been synced to disk */
5834 bool synced= 0;
5835 if (queue_event(mi, event_buf, event_len))
5836 {
5837 mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
5838 ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
5839 "could not queue event from master");
5840 goto err;
5841 }
可以看到在sql程式接收到一個binlog之後,會呼叫queue_event進行寫入relaylog。queue_event處理ROTATE_EVENT關鍵程式碼如下:
8236 switch (event_type) {
... ...
8251 case binary_log::ROTATE_EVENT:
8252 {
8253 Rotate_log_event rev(buf, checksum_alg != binary_log::BINLOG_CHECKSUM_ALG_OFF ?
8254 event_len - BINLOG_CHECKSUM_LEN : event_len,
8255 mi->get_mi_description_event());
8256
8257 if (unlikely(process_io_rotate(mi, &rev)))
8258 {
8259 mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
8260 ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
8261 "could not queue event from master");
8262 goto err;
8263 }
在switch中判斷binlog如果是ROTATE_EVENT,則呼叫process_io_rotate進行處理。其中buf是sql進行拉取到的binlog的快取,mi是代表了masterinfo。process_io_rotate主要的工作是更新master_info 的相關資訊,例如下一個binglog的位置,master的binlog位置資訊等。可以在mysqld.trace 檔案中觀察到如下的資訊:
process_io_rotate: info: new (master_log_name, master_log_pos): ('dp-bin.000029', 4)
process_io_rotate函式註釋資訊如下:
7767 /**
7768 Used by the slave IO thread when it receives a rotate event from the
7769 master.
7770
7771 Updates the master info with the place in the next binary log where
7772 we should start reading. Rotate the relay log to avoid mixed-format
7773 relay logs.
7774
7775 @param mi master_info for the slave
7776 @param rev The rotate log event read from the master
7777
7778 @note The caller must hold mi->data_lock before invoking this function.
7779
7780 @retval 0 ok
7781 @retval 1 error
<!--7782 */-->
7783 static int process_io_rotate(Master_info *mi, Rotate_log_event *rev)
... ...
7830 }
在完成相關的工作之後,來到queue_event函式的最後,寫入ROTATE_EVENT,在mysqld.trace中可以看到如下的資訊:
queue_event: info: master_log_pos: 4
harvest_bytes_written: info: counter: 769 bytes_written: 44
queue_event: info: error: 0
相關程式碼如下:
8656 {
8657 bool is_error= false;
8658 /* write the event to the relay log */
8659 if (likely(rli->relay_log.append_buffer(buf, event_len, mi) == 0))
8660 {
8661 mi->set_master_log_pos(mi->get_master_log_pos() + inc_pos);
8662 DBUG_PRINT("info", ("master_log_pos: %lu", (ulong) mi->get_master_log_pos()));
8663 rli->relay_log.harvest_bytes_written(rli, true/*need_log_space_lock=true*/);
... ...
}
}
到此,ROTATE_EVENT事件被完全寫入到了relaylog中。借用一張經典mysql複製原理圖(文獻12)說明這個過程,即完成了圖中從BinaryLog 讀取到RelayLog的寫入過程。
3.1.2 sql 程式讀取和執行ROTATE_EVENT
在前文的描述中,可以觀察到在RelayLog的寫入過程中並沒有去更新last_master_timestamp相關資訊。因此更新last_master_timestamp的工作,只可能在sql程式中去完成。接下來本節將探索下ROTATE_EVENT在sql程式中的執行過程。
mysql複製程式碼中,sql程式的入口函式是handle_slave_sql:
7158 /**
7159 Slave SQL thread entry point.
7160
7161 @param arg Pointer to Relay_log_info object that holds information
7162 for the SQL thread.
7163
7164 @return Always 0.
7165 */
7166 extern "C" void *handle_slave_sql(void *arg)
在handle_slave_sql中,關鍵的程式碼是while (!sql_slave_killed(thd,rli))這個迴圈:
7438 while (!sql_slave_killed(thd,rli))
7439 {
7440 THD_STAGE_INFO(thd, stage_reading_event_from_the_relay_log);
7441 DBUG_ASSERT(rli->info_thd == thd);
7442 THD_CHECK_SENTRY(thd);
7443
7444 if (saved_skip && rli->slave_skip_counter == 0)
7445 {
... ...
7456 }
7457
7458 if (exec_relay_log_event(thd,rli))
7459 {
7460 ... ...
7525 }
7526 goto err;
7527 }
7528 }
可以發現,每次迴圈的時候,都會去執行exec_relay_log_event這個函式:
5098 Top-level function for executing the next event in the relay log.
5099 This is called from the SQL thread.
5100
5101 This function reads the event from the relay log, executes it, and
5102 advances the relay log position. It also handles errors, etc.
5103
5104 This function may fail to apply the event for the following reasons:
5105
5106 - The position specfied by the UNTIL condition of the START SLAVE
5107 command is reached.
5108
5109 - It was not possible to read the event from the log.
5110
5111 - The slave is killed.
5112
5113 - An error occurred when applying the event, and the event has been
5114 tried slave_trans_retries times. If the event has been retried
5115 fewer times, 0 is returned.
5116
5117 - init_info or init_relay_log_pos failed. (These are called
5118 if a failure occurs when applying the event.)
5119
5120 - An error occurred when updating the binlog position.
5121
5122 @retval 0 The event was applied.
5123
5124 @retval 1 The event was not applied.
5125 */
5126 static int exec_relay_log_event(THD* thd, Relay_log_info* rli)
在exec_relay_log_event中,執行對應的relaylog之後,會相應的更新last_master_timestamp。在這裡並行複製模式和非並行複製模式下,更新last_master_timestamp的方式是不同的。在這裡先介紹下非並行複製模式下更新last_master_timestamp的步驟。
3.1.2.1 非並行複製模式下更新last_master_timestamp
在exec_relay_log_event中判斷是否是並行複製是通過is_parallel_exec函式實現的。如果是並行複製模式則返回True,否則返回False。在非並行複製的模式下,會進入如下的程式碼執行:
5175 /*
5176 Even if we don't execute this event, we keep the master timestamp,
5177 so that seconds behind master shows correct delta (there are events
5178 that are not replayed, so we keep falling behind).
5179
5180 If it is an artificial event, or a relay log event (IO thread generated
5181 event) or ev->when is set to 0, or a FD from master, or a heartbeat
5182 event with server_id '0' then we don't update the last_master_timestamp.
5183
5184 In case of parallel execution last_master_timestamp is only updated when
5185 a job is taken out of GAQ. Thus when last_master_timestamp is 0 (which
5186 indicates that GAQ is empty, all slave workers are waiting for events from
5187 the Coordinator), we need to initialize it with a timestamp from the first
5188 event to be executed in parallel.
5189 */
5190 if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
5191 !(ev->is_artificial_event() || ev->is_relay_log_event() ||
5192 (ev->common_header->when.tv_sec == 0) ||
5193 ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
5194 ev->server_id == 0))
5195 {
5196 rli->last_master_timestamp= ev->common_header->when.tv_sec +
5197 (time_t) ev->exec_time;
5198 DBUG_ASSERT(rli->last_master_timestamp >= 0);
5199 }
可以看到,last_master_timestamp的值便是ev->common_header->when.tv_sec和(time_t) ev->exec_time的和。在程式碼中加上點除錯資訊,重新編譯:
DBUG_PRINT("info", ("before rli->last_master_timestamp = %lu", rli->last_master_timestamp));
DBUG_PRINT("info", ("before rli->is_parallel_exec() = %d", int(rli->is_parallel_exec())));
if ((!rli->is_parallel_exec() || rli->last_master_timestamp == 0) &&
!(ev->is_artificial_event() || ev->is_relay_log_event() ||
(ev->common_header->when.tv_sec == 0) ||
ev->get_type_code() == binary_log::FORMAT_DESCRIPTION_EVENT ||
ev->server_id == 0))
{
rli->last_master_timestamp= ev->common_header->when.tv_sec +
(time_t) ev->exec_time;
DBUG_PRINT("info", ("after rli->last_master_timestamp = %lu", rli->last_master_timestamp));
DBUG_PRINT("info", ("after rli->is_parallel_exec() = %d", int(rli->is_parallel_exec())));
DBUG_ASSERT(rli->last_master_timestamp >= 0);
}
可以在mysqld.trace中觀察到如下的除錯資訊,印證了前文關於非並行複製的結論。同時也可以看出,非並行複製下last_master_timestamp的更新是在binlogevent執行之前。
exec_relay_log_event: info: before rli->last_master_timestamp = 0
exec_relay_log_event: info: before rli->is_parallel_exec() = 0
exec_relay_log_event: info: after rli->last_master_timestamp = 1562201989
exec_relay_log_event: info: after rli->is_parallel_exec() = 0
3.1.2.2 並行複製模式下更新last_master_timestamp
前文介紹了非並行複製模式下的last_master_timestamp的更新邏輯。本節將介紹下在並行複製模式下,last_master_timestamp是如何更新的。
在並行複製模式下last_master_timestamp的更新邏輯更為複雜。同時對於ROTATE_EVENT事件和普通的binlog事件更新模式也有所不同。接下來首先介紹下普通的binlog事件下last_master_timestamp的更新邏輯:
3.1.2.2.1 普通的binlog事件下last_master_timestamp的更新邏輯
並行複製有一個分發佇列gaq,sql執行緒將binlog事務讀取到gaq,然後再分發給worker執行緒執行。並行複製時,binlog事件是併發穿插執行的,gaq中有一個checkpoint點稱為lwm, lwm之前的binlog都已經執行,而lwm之後的binlog有些執行有些沒有執行。 假設worker執行緒數為2,gap有1,2,3,4,5,6,7,8個事務。worker 1已執行的事務為1 4 6, woker 2執行的事務為2 3 ,那麼lwm為4。
並行複製更新gap checkpiont時,會推進lwm點,同時更新
last_master_timestamp
為lwm所在事務結束的event的時間。因此,並行複製是在事務執行完成後才更新last_master_timestamp
,更新是以事務為單位。同時更新gap checkpiont還受slave_checkpoint_period
引數的影響。
coordinator會呼叫函式mts_checkpoint_routune定期做“checkpoint”操作,將Relay_log_info.gaq中已經執行結束的事務移出佇列,移除方式為從佇列頭開始檢查,如果done為真則移出,否則停止掃描,並標記Low Water Mark為移出的事務中最大的sequence_number(lock-interval終止點)
可以看出,並行複製下last_master_timestamp的更新邏輯與lwm和mts_checkpoint_routune函式有關。mts_checkpoint_routune函式頭如下:
/**
Processing rli->gaq to find out the low-water-mark (lwm) coordinates
which is stored into the cental recovery table.
@param rli pointer to Relay-log-info of Coordinator
@param period period of processing GAQ, normally derived from
@c mts_checkpoint_period
@param force if TRUE then hang in a loop till some progress
@param need_data_lock False if rli->data_lock mutex is aquired by
the caller.
@return FALSE success, TRUE otherwise
*/
bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period,
bool force, bool need_data_lock)
在coordinator中會定期呼叫mts_checkpoint_routine將Relay_log_info.gaq中已經執行結束的事務移出佇列,並且更新Low Water Mark為移出的事務中最大的sequence_number(參見mts_checkpoint_routine中如下):
do
{
if (!is_mts_db_partitioned(rli))
mysql_mutex_lock(&rli->mts_gaq_LOCK);
cnt= rli->gaq->move_queue_head(&rli->workers);
if (!is_mts_db_partitioned(rli))
mysql_mutex_unlock(&rli->mts_gaq_LOCK);
#ifndef DBUG_OFF
if (DBUG_EVALUATE_IF("check_slave_debug_group", 1, 0) &&
cnt != opt_mts_checkpoint_period)
sql_print_error("This an error cnt != mts_checkpoint_period");
#endif
} while (!sql_slave_killed(rli->info_thd, rli) &&
cnt == 0 && force &&
!DBUG_EVALUATE_IF("check_slave_debug_group", 1, 0) &&
(my_sleep(rli->mts_coordinator_basic_nap), 1));
在函式的最後,會執行last_master_timestamp的更新邏輯:
/*
Update the rli->last_master_timestamp for reporting correct Seconds_behind_master.
If GAQ is empty, set it to zero.
Else, update it with the timestamp of the first job of the Slave_job_queue
which was assigned in the Log_event::get_slave_worker() function.
*/
ts= rli->gaq->empty()
? 0
: reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts;
rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);
可以看出如果GAP為空(既work的佇列為空)則更新last_master_timestamp為0,否則更新為佇列中第一個是元素的timestamp。
3.1.2.2.2 ROTATE_EVENT事件下last_master_timestamp的更新模式
前文提到在並行複製模式下,普通的binlog事件會被coordinator程式分發到work中去執行。並且在mts_checkpoint_routine中去推進Low Water Mark和更新last_master_timestamp。但是如果執行的binlog是一個ROTATE_EVENT事件,則coordinator程式不會將事件分發到work中,而是在coordinator程式中自己執行並更新last_master_timestamp。
對於ROTATE_EVENT事件的執行邏輯,入口依然是exec_relay_log_event函式。在函式中呼叫了apply_event_and_update_pos函式執行binlog事件。
5247 /* ptr_ev can change to NULL indicating MTS coorinator passed to a Worker */
5248 exec_res= apply_event_and_update_pos(ptr_ev, thd, rli);
5249 /*
5250 Note: the above call to apply_event_and_update_pos executes
5251 mysql_mutex_unlock(&rli->data_lock);
5252 */
5253
5254 /* For deferred events, the ptr_ev is set to NULL
5255 in Deferred_log_events::add() function.
5256 Hence deferred events wont be deleted here.
5257 They will be deleted in Deferred_log_events::rewind() funciton.
5258 */
在apply_event_and_update_pos函式中,會首先呼叫apply_event去判斷當前的binlog事件是否可以被分發到work程式執行。在這裡收到的binlog事件是ROTATE_EVENT,因此apply_event返回的exec_res為0,ev->worker==rli(rli 代表了coordinator程式),因此不會進入下面的if邏輯中:
4709 if (reason == Log_event::EVENT_SKIP_NOT)
4710 {
4711 // Sleeps if needed, and unlocks rli->data_lock.
4712 if (sql_delay_event(ev, thd, rli))
4713 DBUG_RETURN(SLAVE_APPLY_EVENT_AND_UPDATE_POS_OK);
4714
4715 exec_res= ev->apply_event(rli);
4717 if (!exec_res && (ev->worker != rli))
4718 {
... ...
}
在mysqld.trace 中可以觀察到如下的輸出:
Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate
apply_event_and_update_pos: info: apply_event error = 0
apply_event函式定義和註釋如下:
/**
Scheduling event to execute in parallel or execute it directly.
In MTS case the event gets associated with either Coordinator or a
Worker. A special case of the association is NULL when the Worker
can't be decided yet. In the single threaded sequential mode the
event maps to SQL thread rli.
@note in case of MTS failure Coordinator destroys all gathered
deferred events.
@return 0 as success, otherwise a failure.
*/
int Log_event::apply_event(Relay_log_info *rli)
因為無法被分發到work執行,因此會進入如下的執行邏輯中:
4832 DBUG_PRINT("info", ("apply_event error = %d", exec_res));
4833 if (exec_res == 0)
4834 {
4835 /*
4836 Positions are not updated here when an XID is processed. To make
4837 a slave crash-safe, positions must be updated while processing a
4838 XID event and as such do not need to be updated here again.
4839
4840 However, if the event needs to be skipped, this means that it
4841 will not be processed and then positions need to be updated here.
4842
4843 See sql/rpl_rli.h for further details.
4844 */
4845 int error= 0;
4846 if (*ptr_ev &&
4847 (ev->get_type_code() != binary_log::XID_EVENT ||
4848 skip_event || (rli->is_mts_recovery() && !is_gtid_event(ev) &&
4849 (ev->ends_group() || !rli->mts_recovery_group_seen_begin) &&
4850 bitmap_is_set(&rli->recovery_groups, rli->mts_recovery_index))))
4851 {
4852 #ifndef DBUG_OFF
4853 /*
4854 This only prints information to the debug trace.
4855
4856 TODO: Print an informational message to the error log?
4857 */
... ...
4873 error= ev->update_pos(rli);
... ...
}
}
可以發現,在此呼叫了update_pos進行更新。因為當前的binlog事件是ROTATE_EVENT,因此會呼叫如下的程式碼:
/*
Got a rotate log event from the master.
This is mainly used so that we can later figure out the logname and
position for the master.
We can't rotate the slave's BINlog as this will cause infinitive rotations
in a A -> B -> A setup.
The NOTES below is a wrong comment which will disappear when 4.1 is merged.
This must only be called from the Slave SQL thread, since it calls
flush_relay_log_info().
@retval
0 ok
*/
int Rotate_log_event::do_update_pos(Relay_log_info *rli)
在do_update_pos 更新last_master_timestamp 的關鍵邏輯如下:
if (rli->is_parallel_exec())
{
bool real_event= server_id && !is_artificial_event();
rli->reset_notified_checkpoint(0,
real_event ?
common_header->when.tv_sec +
(time_t) exec_time : 0,
true/*need_data_lock=true*/,
real_event? true : false);
}
可以在mysqld.trace 中觀察到如下的輸出:
Rotate_log_event::do_update_pos: info: server_id=248; ::server_id=236
Rotate_log_event::do_update_pos: info: new_log_ident: dp-bin.000063
Rotate_log_event::do_update_pos: info: pos: 4
Rotate_log_event::do_update_pos: info: old group_master_log_name: 'dp-bin.000062' old group_master_log_pos: 154
Relay_log_info::inc_group_relay_log_pos: info: log_pos: 4 group_master_log_pos: 154
Rotate_log_event::do_update_pos: info: new group_master_log_name: 'dp-bin.000063' new group_master_log_pos: 4
Rotate_log_event::do_update_pos: info: ------> 1562684581(此處為本文加入的除錯資訊,在reset_notified_checkpoint更新last_master_timestamp的時候列印出。)
至此,對於並行複製和非並行複製下當收到ROTATE_EVENT事件時,更新last_master_timestamp的邏輯分析全部完成。
3.2 da執行與本例項serverid一致的binlog時的執行邏輯
前文分析了當mysql收到一個ROTATE_EVENT事件的時候所執行的邏輯。本節將分析下在mysql收到與自己的serverid一致的binlog事件的時候的執行邏輯。
3.2.1 io thread 處理與本例項serverid一致的binlog
首先來看下io thread 接收到與本身serverid一致的binlog的時候所做的操作。在handle_slave_io函式中,會在while迴圈中不斷的呼叫queue_even函式。
5740 while (!io_slave_killed(thd,mi))
5741 {
5742 ulong event_len;
... ...
5750 event_len= read_event(mysql, mi, &suppress_warnings);
... ...
5813 /* XXX: 'synced' should be updated by queue_event to indicate
5814 whether event has been synced to disk */
5815 bool synced= 0;
5816 if (queue_event(mi, event_buf, event_len))
5817 {
5818 mi->report(ERROR_LEVEL, ER_SLAVE_RELAY_LOG_WRITE_FAILURE,
5819 ER(ER_SLAVE_RELAY_LOG_WRITE_FAILURE),
5820 "could not queue event from master");
5821 goto err;
5822 }
... ...
}
在queue_even函式中,對於收到與自己serverid一致的binlog的處理邏輯如下:
8534 /*
8535 If this event is originating from this server, don't queue it.
8536 We don't check this for 3.23 events because it's simpler like this; 3.23
8537 will be filtered anyway by the SQL slave thread which also tests the
8538 server id (we must also keep this test in the SQL thread, in case somebody
8539 upgrades a 4.0 slave which has a not-filtered relay log).
8540
8541 ANY event coming from ourselves can be ignored: it is obvious for queries;
8542 for STOP_EVENT/ROTATE_EVENT/START_EVENT: these cannot come from ourselves
8543 (--log-slave-updates would not log that) unless this slave is also its
8544 direct master (an unsupported, useless setup!).
8545 */
... ...
8560 if ((s_id == ::server_id && !mi->rli->replicate_same_server_id) ||
8561 /*
8562 the following conjunction deals with IGNORE_SERVER_IDS, if set
8563 If the master is on the ignore list, execution of
8564 format description log events and rotate events is necessary.
8565 */
8566 (mi->ignore_server_ids->dynamic_ids.size() > 0 &&
8567 mi->shall_ignore_server_id(s_id) &&
8568 /* everything is filtered out from non-master */
8569 (s_id != mi->master_id ||
8570 /* for the master meta information is necessary */
8571 (event_type != binary_log::FORMAT_DESCRIPTION_EVENT &&
8572 event_type != binary_log::ROTATE_EVENT))))
8573 {
8574 /*
8575 Do not write it to the relay log.
8576 a) We still want to increment mi->get_master_log_pos(), so that we won't
8577 re-read this event from the master if the slave IO thread is now
8578 stopped/restarted (more efficient if the events we are ignoring are big
8579 LOAD DATA INFILE).
8580 b) We want to record that we are skipping events, for the information of
8581 the slave SQL thread, otherwise that thread may let
8582 rli->group_relay_log_pos stay too small if the last binlog's event is
8583 ignored.
8584 But events which were generated by this slave and which do not exist in
8585 the master's binlog (i.e. Format_desc, Rotate & Stop) should not increment
8586 mi->get_master_log_pos().
8587 If the event is originated remotely and is being filtered out by
8588 IGNORE_SERVER_IDS it increments mi->get_master_log_pos()
8589 as well as rli->group_relay_log_pos.
8590 */
8591 if (!(s_id == ::server_id && !mi->rli->replicate_same_server_id) ||
8592 (event_type != binary_log::FORMAT_DESCRIPTION_EVENT &&
8593 event_type != binary_log::ROTATE_EVENT &&
8594 event_type != binary_log::STOP_EVENT))
8595 {
8596 mi->set_master_log_pos(mi->get_master_log_pos() + inc_pos);
8597 memcpy(rli->ign_master_log_name_end, mi->get_master_log_name(), FN_REFLEN);
8598 DBUG_ASSERT(rli->ign_master_log_name_end[0]);
8599 rli->ign_master_log_pos_end= mi->get_master_log_pos();
8600 }
8601 rli->relay_log.signal_update(); // the slave SQL thread needs to re-check
8602 DBUG_PRINT("info", ("master_log_pos: %lu, event originating from %u server, ignored",
8603 (ulong) mi->get_master_log_pos(), uint4korr(buf + SERVER_ID_OFFSET)));
8604 }
結合註釋,可以看出當mysql收到與本例項serverid一致的binlog的時候,不會將當前的binlog事件寫入relaylog中。同時會完成如下的2個事情:(ps :這是一個關鍵的地方,在下文中將提到它的作用)
- 將mi->get_master_log_name() 拷貝到rli->ign_master_log_name_end中。
- rli->relay_log.signal_update()。此處會新signal_cnt這個變數的值。
在mysqld.trace 中可以觀察到如下的日誌資訊:
queue_event: info: master_log_pos: 219, event originating from 236 server, ignored
queue_event: info: error: 0
3.2.2 sql thread 處理與本例項serverid一致的binlog
接下里看看sql程式對binlog的處理過程。入口的函式還是exec_relay_log_event,在exec_relay_log_event函式中會呼叫next_event函式獲取一個可執行的binlog事件。在這裡next_event對於非並行複製會有一個特殊的處理:
9175 if (!rli->is_parallel_exec())
9176 rli->last_master_timestamp= 0;
如果是非並行複製,則當讀取一個binlog的時候,都會把last_master_timestamp設定成0 。因此在非並行複製下,收到與本例項serverid一致的binlog的時候,mysqld.trace中可以觀察到rli->last_master_timestamp的值會一直為0:
next_event: info: seeing an ignored end segment
handle_slave_io: info: IO thread received event of type Query
exec_relay_log_event: info: ================================before rli->last_master_timestamp = 0
exec_relay_log_event: info: ================================before rli->is_parallel_exec() = 0
apply_event_and_update_pos: info: thd->options: ; rli->last_event_start_time: 0
Log_event::shall_skip: info: ev->server_id=0, ::server_id=236, rli->replicate_same_server_id=0, rli->slave_skip_counter=0
Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate
... ...
... ...
exec_relay_log_event: info: ================================before rli->last_master_timestamp = 0
exec_relay_log_event: info: ================================before rli->is_parallel_exec() = 0
apply_event_and_update_pos: info: thd->options: ; rli->last_event_start_time: 0
Log_event::shall_skip: info: ev->server_id=0, ::server_id=236, rli->replicate_same_server_id=0, rli->slave_skip_counter=0
Log_event::shall_skip: info: skip reason=0=NOT
LOG_EVENT:apply_event: info: event_type=Rotate
apply_event_and_update_pos: info: apply_event error = 0
apply_event_and_update_pos: info: OPTION_BEGIN: 0; IN_STMT: 0
Rotate_log_event::do_update_pos: info: server_id=0; ::server_id=236
Rotate_log_event::do_update_pos: info: new_log_ident: dp-bin.000069
Rotate_log_event::do_update_pos: info: pos: 282
MYSQL_BIN_LOG::signal_update: info: signal_cnt : 13
queue_event: info: master_log_pos: 326, event originating from 236 server, ignored
queue_event: info: error: 0
之後的邏輯中,如果發現rli->ign_master_log_name_end[0]不為空(對應了上一個小節中的第一項:將mi->get_master_log_name() 拷貝到rli->ign_master_log_name_end中),則構造一個serverid為0的Rotate event並返回:
9181 if (rli->ign_master_log_name_end[0])
9182 {
9183 /* We generate and return a Rotate, to make our positions advance */
9184 DBUG_PRINT("info",("seeing an ignored end segment"));
9185 ev= new Rotate_log_event(rli->ign_master_log_name_end,
9186 0, rli->ign_master_log_pos_end,
9187 Rotate_log_event::DUP_NAME);
9188 rli->ign_master_log_name_end[0]= 0;
9189 mysql_mutex_unlock(log_lock);
9190 if (unlikely(!ev))
9191 {
9192 errmsg= "Slave SQL thread failed to create a Rotate event "
9193 "(out of memory?), SHOW SLAVE STATUS may be inaccurate";
9194 goto err;
9195 }
9196 ev->server_id= 0; // don't be ignored by slave SQL thread
9197 DBUG_RETURN(ev);
9198 }
緊接著呼叫 apply_event_and_update_pos函式。在apply_event_and_update_pos函式中,如前所述如果是ROTATE EVENT 則返回0,不會被work程式並行執行,並進入update_pos邏輯中。在update_pos邏輯中關鍵的程式碼如下(sql/log_event.cc 檔案):
6649 if (rli->is_parallel_exec())
6650 {
6651 bool real_event= server_id && !is_artificial_event();
6652 rli->reset_notified_checkpoint(0,
6653 real_event ?
6654 common_header->when.tv_sec +
6655 (time_t) exec_time : 0,
6656 true/*need_data_lock=true*/,
6657 real_event? true : false);
6658 }
可以見得real_event在server_id是0 的時候為false。因此當進入reset_notified_checkpoint函式後,因為update_timestamp條件(傳入的值便是real_event的值)為false便不會更新last_master_timestamp。
262 /**
263 This method is called in mts_checkpoint_routine() to mark that each
264 worker is required to adapt to a new checkpoint data whose coordinates
265 are passed to it through GAQ index.
266
267 Worker notices the new checkpoint value at the group commit to reset
268 the current bitmap and starts using the clean bitmap indexed from zero
269 of being reset checkpoint_seqno.
270
271 New seconds_behind_master timestamp is installed.
272
273 @param shift number of bits to shift by Worker due to the
274 current checkpoint change.
275 @param new_ts new seconds_behind_master timestamp value
276 unless zero. Zero could be due to FD event
277 or fake rotate event.
278 @param need_data_lock False if caller has locked @c data_lock
279 @param update_timestamp if true, this function will update the
280 rli->last_master_timestamp.
281 */
282 void Relay_log_info::reset_notified_checkpoint(ulong shift, time_t new_ts,
283 bool need_data_lock,
284 bool update_timestamp)
285 {
... ...
330 if (update_timestamp)
331 {
332 if (need_data_lock)
333 mysql_mutex_lock(&data_lock);
334 else
335 mysql_mutex_assert_owner(&data_lock);
336 last_master_timestamp= new_ts;
337 if (need_data_lock)
338 mysql_mutex_unlock(&data_lock);
339 }
340 }
3.3 小結
從上文的分析可以得出如下的2個結論 :
- 並行複製下會發現last_master_timestamp的值,一直是ROTATE EVENT發生的時間。當Exec_Master_Log_Pos < Read_Master_Log_Pos的時候,便會發現Seconds_Behind_Master的值突變(突變的值是當前時間和ROTATE EVENT發生的時間差的秒數)
- 在非並行複製下,因為 rli->last_master_timestamp會在next_event被置為0。同時之後的程式碼也不會去更新這個引數的值,因此rli->last_master_timestamp一直為0 。show slave status的時候,當Exec_Master_Log_Pos < Read_Master_Log_Pos的時候,Seconds_Behind_Master的值不會突變。
4. 總結
本文分析了並行複製和非並行複製下,Seconds_Behind_Master引數值更新的相關邏輯。限於本文的作者水平有限,文中的錯誤在所難免,懇請大家批評指正。