1. Overview
In an RPC exchange, every request (req) has a matching response (resp). This article walks through how the node that sent a req handles the resp it receives back.
2. handle_peer_resp source analysis
void raft_server::handle_peer_resp(ptr<resp_msg>& resp, const ptr<rpc_exception>& err)
{
    if (err)
    {
        l_->info(sstrfmt("peer response error: %s").fmt(err->what()));
        return;
    }

    // update peer last response time
    {
        read_lock(peers_lock_);
        auto peer = peers_.find(resp->get_src());
        if (peer != peers_.end())
        {
            peer->second->set_last_resp(system_clock::now());
        }
        else
        {
            l_->info(sstrfmt("Peer %d not found, ignore the message").fmt(resp->get_src()));
            return;
        }
    }

    l_->debug(lstrfmt("Receive a %s message from peer %d with Result=%d, Term=%llu, NextIndex=%llu")
                  .fmt(
                      __msg_type_str[resp->get_type()],
                      resp->get_src(),
                      resp->get_accepted() ? 1 : 0,
                      resp->get_term(),
                      resp->get_next_idx()));

    {
        recur_lock(lock_);
        // if term is updated, no more action is required
        if (update_term(resp->get_term()))
        {
            return;
        }

        // ignore the response that with lower term for safety
        switch (resp->get_type())
        {
            case msg_type::vote_response:
                handle_voting_resp(*resp);
                break;
            case msg_type::append_entries_response:
                handle_append_entries_resp(*resp);
                break;
            case msg_type::install_snapshot_response:
                handle_install_snapshot_resp(*resp);
                break;
            default:
                l_->err(sstrfmt("Received an unexpected message %s for response, system exits.")
                            .fmt(__msg_type_str[resp->get_type()]));
                ctx_->state_mgr_->system_exit(-1);
                ::exit(-1);
                break;
        }
    }
}
- 1. Just like the request handlers, the response handlers are funneled through one top-level function that dispatches to the specific handler via a switch-case on the message type.
- 2. Before delegating to a specific handler, handle_peer_resp first updates the peer's last response time.
- 3. If the resp's term can advance this node's term, the node has fallen behind and steps down; no further handling of the response is needed.
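The step-down rule in point 3 can be condensed into a small sketch. This is a hypothetical standalone model, not cornerstone's actual update_term (which also resets votes and timers); the names server_state and role are illustrative.

```cpp
#include <cassert>
#include <cstdint>

enum class role { follower, candidate, leader };

struct server_state {
    uint64_t term = 1;
    role r = role::leader;
};

// Mirrors handle_peer_resp's early return: a response carrying a higher
// term forces this node back to follower, and the response itself is
// not processed any further.
bool update_term(server_state& s, uint64_t resp_term) {
    if (resp_term > s.term) {
        s.term = resp_term;
        s.r = role::follower;
        return true; // caller returns immediately
    }
    return false;
}
```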
3. handle_voting_resp source analysis
void raft_server::handle_voting_resp(resp_msg& resp)
{
    if (resp.get_term() != state_->get_term())
    {
        l_->info(sstrfmt("Received an outdated vote response at term %llu v.s. current term %llu")
                     .fmt(resp.get_term(), state_->get_term()));
        return;
    }

    if (election_completed_)
    {
        l_->info("Election completed, will ignore the voting result from this server");
        return;
    }

    if (voted_servers_.find(resp.get_src()) != voted_servers_.end())
    {
        l_->info(sstrfmt("Duplicate vote from %d for term %lld").fmt(resp.get_src(), state_->get_term()));
        return;
    }

    {
        read_lock(peers_lock_);
        voted_servers_.insert(resp.get_src());
        if (resp.get_accepted())
        {
            votes_granted_ += 1;
        }

        if (voted_servers_.size() >= (peers_.size() + 1))
        {
            election_completed_ = true;
        }

        if (votes_granted_ > (int32)((peers_.size() + 1) / 2))
        {
            l_->info(sstrfmt("Server is elected as leader for term %llu").fmt(state_->get_term()));
            election_completed_ = true;
            become_leader();
        }
    }
}
- 1. if (resp.get_term() != state_->get_term()) checks that the resp belongs to the current term; only matching responses are processed.
- 2. if (election_completed_) checks whether the election has already been decided. A candidate only needs a majority to win, so responses can still arrive after the election is over; those are ignored.
- 3. If the responding node is already in the candidate's voted_servers_ set, this is a second vote from the same node, so it is dropped as a duplicate.
- 4. if (voted_servers_.size() >= (peers_.size() + 1)) means every server has responded, so the election is over either way.
- 5. resp.get_accepted() is used to tally the candidate's votes; once they exceed half the cluster, the election is won and become_leader() is called.
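The tallying rules above can be sketched as a minimal model. This is an illustrative condensation, not cornerstone's actual state (vote_tally and add_vote are made-up names); it shows the one-response-per-server rule and the strict-majority test votes_granted_ > (N + 1) / 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

struct vote_tally {
    std::set<int32_t> voted;   // servers whose response has been counted
    int granted = 0;           // accepted votes; the candidate's self-vote starts it at 1

    // Count one response; returns true once a strict majority of
    // cluster_size (peers plus the candidate itself) has granted the vote.
    bool add_vote(int32_t src, bool accepted, size_t cluster_size) {
        if (!voted.insert(src).second)
            return false; // duplicate response from the same server, ignored
        if (accepted)
            ++granted;
        return granted > static_cast<int>(cluster_size / 2);
    }
};
```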
4. handle_append_entries_resp source analysis
void raft_server::handle_append_entries_resp(resp_msg& resp)
{
    read_lock(peers_lock_);
    peer_itor it = peers_.find(resp.get_src());
    if (it == peers_.end())
    {
        l_->info(sstrfmt("the response is from an unkonw peer %d").fmt(resp.get_src()));
        return;
    }

    // if there are pending logs to be synced or commit index need to be advanced, continue to send appendEntries to
    // this peer
    bool need_to_catchup = true;
    ptr<peer> p = it->second;
    if (resp.get_accepted())
    {
        {
            auto_lock(p->get_lock());
            p->set_next_log_idx(resp.get_next_idx());
            p->set_matched_idx(resp.get_next_idx() - 1);
        }

        // try to commit with this response
        std::vector<ulong> matched_indexes(peers_.size() + 1);
        matched_indexes[0] = log_store_->next_slot() - 1;
        int i = 1;
        for (it = peers_.begin(); it != peers_.end(); ++it, i++)
        {
            matched_indexes[i] = it->second->get_matched_idx();
        }

        std::sort(matched_indexes.begin(), matched_indexes.end(), std::greater<ulong>());
        commit(matched_indexes[(peers_.size() + 1) / 2]);
        need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
    }
    else
    {
        std::lock_guard<std::mutex> guard(p->get_lock());
        if (resp.get_next_idx() > 0 && p->get_next_log_idx() > resp.get_next_idx())
        {
            // fast move for the peer to catch up
            p->set_next_log_idx(resp.get_next_idx());
        }
        else if (p->get_next_log_idx() > 0)
        {
            p->set_next_log_idx(p->get_next_log_idx() - 1);
        }
    }

    // This may not be a leader anymore, such as the response was sent out long time ago
    // and the role was updated by UpdateTerm call
    // Try to match up the logs for this peer
    if (role_ == srv_role::leader && need_to_catchup)
    {
        request_append_entries(*p);
    }
}
- 1. First check that the responding node is still in the peers_ list; if not, log it and return.
- 2. If the resp has accepted = 1, the append-entries succeeded: update the peer's next_idx and matched_idx.
- 3. Collect the matched_idx of every node (the leader's own last index plus each peer's) and sort them.
- 4. Take the median mid_idx of the sorted list: at least half of the cluster has replicated up to mid_idx, so the leader can commit up to it and apply it to its own state machine.
- 5. If not accepted (accepted = 0), back next_idx off and keep probing until the logs match. (See the analysis of cornerstone's message types for details.)
- 6. If the peer still needs to catch up, send another append-entries. (This can happen when the response answers a request the leader sent long ago, so the peer is still behind.)
Key point:
A log_entry is first replicated to the followers' logs; only once more than half of the nodes hold it does the leader commit it and apply it to its own state machine. In the implementation, this boils down to sorting all nodes' matched_idx and taking the median.
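The median rule can be isolated into a few lines. This is a sketch of the computation in handle_append_entries_resp above (majority_matched_index is an illustrative name, and ulong is aliased locally): sort the matched indexes in descending order, then the value at position N/2 is held by a majority and is safe to commit.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

using ulong_t = unsigned long long;

// Given the last replicated index of every node (leader included),
// return the highest index that a majority of nodes has replicated.
ulong_t majority_matched_index(std::vector<ulong_t> matched) {
    std::sort(matched.begin(), matched.end(), std::greater<ulong_t>());
    // With N entries sorted descending, index N/2 has at least
    // N/2 + 1 nodes (a strict majority) at or above it.
    return matched[matched.size() / 2];
}
```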
5. handle_install_snapshot_resp source analysis
void raft_server::handle_install_snapshot_resp(resp_msg& resp)
{
    read_lock(peers_lock_);
    peer_itor it = peers_.find(resp.get_src());
    if (it == peers_.end())
    {
        l_->info(sstrfmt("the response is from an unkonw peer %d").fmt(resp.get_src()));
        return;
    }

    // if there are pending logs to be synced or commit index need to be advanced, continue to send appendEntries to
    // this peer
    bool need_to_catchup = true;
    ptr<peer> p = it->second;
    if (resp.get_accepted())
    {
        std::lock_guard<std::mutex> guard(p->get_lock());
        ptr<snapshot_sync_ctx> sync_ctx = p->get_snapshot_sync_ctx();
        if (sync_ctx == nilptr)
        {
            l_->info("no snapshot sync context for this peer, drop the response");
            need_to_catchup = false;
        }
        else
        {
            if (resp.get_next_idx() >= sync_ctx->get_snapshot()->size())
            {
                l_->debug("snapshot sync is done");
                ptr<snapshot> nil_snp;
                p->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
                p->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
                p->set_snapshot_in_sync(nil_snp);
                need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
            }
            else
            {
                l_->debug(sstrfmt("continue to sync snapshot at offset %llu").fmt(resp.get_next_idx()));
                sync_ctx->set_offset(resp.get_next_idx());
            }
        }
    }
    else
    {
        l_->info("peer declines to install the snapshot, will retry");
    }

    // This may not be a leader anymore, such as the response was sent out long time ago
    // and the role was updated by UpdateTerm call
    // Try to match up the logs for this peer
    if (role_ == srv_role::leader && need_to_catchup)
    {
        request_append_entries(*p);
    }
}
- The core code is below; everything else matches the handler above.
if (resp.get_next_idx() >= sync_ctx->get_snapshot()->size())
{
    l_->debug("snapshot sync is done");
    ptr<snapshot> nil_snp;
    p->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
    p->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
    p->set_snapshot_in_sync(nil_snp);
    need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
}
else
{
    l_->debug(sstrfmt("continue to sync snapshot at offset %llu").fmt(resp.get_next_idx()));
    sync_ctx->set_offset(resp.get_next_idx());
}
- Since this is a snapshot, there is no log index to report, so resp.get_next_idx() actually carries the offset up to which the follower has received the snapshot. If that offset >= sync_ctx->get_snapshot()->size(), the transfer is complete, and next_idx and matched_idx are set from the snapshot's last log index. Otherwise the leader resumes sending from the acknowledged offset.
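The offset handshake can be modeled in isolation. This is a hypothetical sketch, not cornerstone's snapshot_sync_ctx: snapshot_transfer and on_resp are illustrative names. The leader sends chunks, the follower acks with next_idx reinterpreted as a byte offset, and the transfer finishes once that offset reaches the snapshot size.

```cpp
#include <cassert>
#include <cstdint>

struct snapshot_transfer {
    uint64_t snapshot_size; // total bytes of the snapshot being sent
    uint64_t offset;        // next offset the leader will send from

    // Apply one install_snapshot response; returns true when the
    // snapshot is fully received (caller then sets next/matched idx).
    bool on_resp(uint64_t acked_offset) {
        if (acked_offset >= snapshot_size)
            return true;          // done, switch back to log replication
        offset = acked_offset;    // resume from the acknowledged offset
        return false;
    }
};
```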
6. The extra ext_resp handling, source analysis
void raft_server::handle_ext_resp(ptr<resp_msg>& resp, const ptr<rpc_exception>& err)
{
    recur_lock(lock_);
    if (err)
    {
        handle_ext_resp_err(*err);
        return;
    }

    l_->debug(lstrfmt("Receive an extended %s message from peer %d with Result=%d, Term=%llu, NextIndex=%llu")
                  .fmt(
                      __msg_type_str[resp->get_type()],
                      resp->get_src(),
                      resp->get_accepted() ? 1 : 0,
                      resp->get_term(),
                      resp->get_next_idx()));

    switch (resp->get_type())
    {
        case msg_type::sync_log_response:
            if (srv_to_join_)
            {
                // we are reusing heartbeat interval value to indicate when to stop retry
                srv_to_join_->resume_hb_speed();
                srv_to_join_->set_next_log_idx(resp->get_next_idx());
                srv_to_join_->set_matched_idx(resp->get_next_idx() - 1);
                sync_log_to_new_srv(resp->get_next_idx());
            }
            break;
        case msg_type::join_cluster_response:
            if (srv_to_join_)
            {
                if (resp->get_accepted())
                {
                    l_->debug("new server confirms it will join, start syncing logs to it");
                    sync_log_to_new_srv(resp->get_next_idx());
                }
                else
                {
                    l_->debug("new server cannot accept the invitation, give up");
                }
            }
            else
            {
                l_->debug("no server to join, drop the message");
            }
            break;
        case msg_type::leave_cluster_response:
            if (!resp->get_accepted())
            {
                l_->debug("peer doesn't accept to stepping down, stop proceeding");
                return;
            }

            l_->debug("peer accepted to stepping down, removing this server from cluster");
            rm_srv_from_cluster(resp->get_src());
            break;
        case msg_type::install_snapshot_response:
        {
            if (!srv_to_join_)
            {
                l_->info("no server to join, the response must be very old.");
                return;
            }

            if (!resp->get_accepted())
            {
                l_->info("peer doesn't accept the snapshot installation request");
                return;
            }

            ptr<snapshot_sync_ctx> sync_ctx = srv_to_join_->get_snapshot_sync_ctx();
            if (sync_ctx == nilptr)
            {
                l_->err("Bug! SnapshotSyncContext must not be null");
                ctx_->state_mgr_->system_exit(-1);
                ::exit(-1);
                return;
            }

            if (resp->get_next_idx() >= sync_ctx->get_snapshot()->size())
            {
                // snapshot is done
                ptr<snapshot> nil_snap;
                l_->debug("snapshot has been copied and applied to new server, continue to sync logs after snapshot");
                srv_to_join_->set_snapshot_in_sync(nil_snap);
                srv_to_join_->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
                srv_to_join_->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
            }
            else
            {
                sync_ctx->set_offset(resp->get_next_idx());
                l_->debug(sstrfmt("continue to send snapshot to new server at offset %llu").fmt(resp->get_next_idx()));
            }

            sync_log_to_new_srv(srv_to_join_->get_next_log_idx());
        }
        break;
        case msg_type::prevote_response:
            handle_prevote_resp(*resp);
            break;
        default:
            l_->err(lstrfmt("received an unexpected response message type %s, for safety, stepping down")
                        .fmt(__msg_type_str[resp->get_type()]));
            ctx_->state_mgr_->system_exit(-1);
            ::exit(-1);
            break;
    }
}
Before the analysis, let's lay out the call sequence:
1. The leader sends invite_srv_to_join_cluster to the new node. On receiving it, the new node updates its role_, leader_, and related state, and calls reconfigure to reset the cluster config. It then sends a join_cluster_response back to the leader.
2. The leader handles that response in the msg_type::join_cluster_response branch of the switch-case, which in turn calls sync_log_to_new_srv to send a sync_log_req to the new node.
3. The new node answers the sync_log_req with a sync_log_response, which the leader handles in the msg_type::sync_log_response branch.
void raft_server::sync_log_to_new_srv(ulong start_idx)
{
    // only sync committed logs
    int32 gap = (int32)(quick_commit_idx_ - start_idx);
    if (gap < ctx_->params_->log_sync_stop_gap_)
    {
        l_->info(lstrfmt("LogSync is done for server %d with log gap %d, now put the server into cluster")
                     .fmt(srv_to_join_->get_id(), gap));
        ptr<cluster_config> new_conf = cs_new<cluster_config>(log_store_->next_slot(), config_->get_log_idx());
        new_conf->get_servers().insert(
            new_conf->get_servers().end(), config_->get_servers().begin(), config_->get_servers().end());
        new_conf->get_servers().push_back(conf_to_add_);
        bufptr new_conf_buf(new_conf->serialize());
        ptr<log_entry> entry(cs_new<log_entry>(state_->get_term(), std::move(new_conf_buf), log_val_type::conf));
        log_store_->append(entry);
        config_changing_ = true;
        request_append_entries();
        return;
    }

    ptr<req_msg> req;
    if (start_idx > 0 && start_idx < log_store_->start_index())
    {
        req = create_sync_snapshot_req(*srv_to_join_, start_idx, state_->get_term(), quick_commit_idx_);
    }
    else
    {
        int32 size_to_sync = std::min(gap, ctx_->params_->log_sync_batch_size_);
        bufptr log_pack = log_store_->pack(start_idx, size_to_sync);
        req = cs_new<req_msg>(
            state_->get_term(),
            msg_type::sync_log_request,
            id_,
            srv_to_join_->get_id(),
            0L,
            start_idx - 1,
            quick_commit_idx_);
        req->log_entries().push_back(
            cs_new<log_entry>(state_->get_term(), std::move(log_pack), log_val_type::log_pack));
    }

    srv_to_join_->send_req(req, ex_resp_handler_);
}
- 1. msg_type::sync_log_response:
srv_to_join_->resume_hb_speed();
srv_to_join_->set_next_log_idx(resp->get_next_idx());
srv_to_join_->set_matched_idx(resp->get_next_idx() - 1);
sync_log_to_new_srv(resp->get_next_idx());
On receiving the node's resp, the leader restores the heartbeat pace for it and sets its next_idx and matched_idx, then calls sync_log_to_new_srv to run another round of syncing. sync_log_request keeps being resent until the two sides' data converge. (Much like Redis replication, where writes arriving during the initial sync are buffered on the master and replayed to the replica at the end.)
From the sync_log_to_new_srv source above we can see that log sync is not simply repeated request_append_entries. Because a newly joined node is far behind, the leader first sends a snapshot whenever start_idx falls before the leader's log_store start index; otherwise it packs up to log_sync_batch_size_ committed entries into a log_pack and sends that. Once the remaining gap (quick_commit_idx_ - start_idx) drops below ctx_->params_->log_sync_stop_gap_, log sync is considered done: the new server is appended to the cluster config and normal request_append_entries replication takes over.
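The branching in sync_log_to_new_srv can be condensed into a decision function. This is an illustrative sketch; sync_action, next_sync_step, and the parameter names are stand-ins for the cornerstone fields (quick_commit_idx_, log_store_->start_index(), log_sync_stop_gap_), not real API.

```cpp
#include <cassert>
#include <cstdint>

enum class sync_action { finish_join, send_snapshot, send_log_pack };

// Decide the next step when syncing a joining server, mirroring the
// three branches of sync_log_to_new_srv.
sync_action next_sync_step(uint64_t commit_idx, uint64_t start_idx,
                           uint64_t log_store_start, int32_t stop_gap) {
    int32_t gap = static_cast<int32_t>(commit_idx - start_idx);
    if (gap < stop_gap)
        return sync_action::finish_join;   // close enough: add server to config
    if (start_idx > 0 && start_idx < log_store_start)
        return sync_action::send_snapshot; // needed logs already compacted away
    return sync_action::send_log_pack;     // batch-pack committed logs
}
```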
- 2. msg_type::join_cluster_response:
case msg_type::join_cluster_response:
    if (srv_to_join_)
    {
        if (resp->get_accepted())
        {
            l_->debug("new server confirms it will join, start syncing logs to it");
            sync_log_to_new_srv(resp->get_next_idx());
        }
        else
        {
            l_->debug("new server cannot accept the invitation, give up");
        }
    }
    else
    {
        l_->debug("no server to join, drop the message");
    }
    break;
Having analyzed the previous resp, this one is straightforward: if srv_to_join_ exists and the invitation was accepted, call sync_log_to_new_srv to start syncing data to it.
- 3. msg_type::leave_cluster_response:
case msg_type::leave_cluster_response:
    if (!resp->get_accepted())
    {
        l_->debug("peer doesn't accept to stepping down, stop proceeding");
        return;
    }

    l_->debug("peer accepted to stepping down, removing this server from cluster");
    rm_srv_from_cluster(resp->get_src());
    break;
The key call is rm_srv_from_cluster.
void raft_server::rm_srv_from_cluster(int32 srv_id)
{
    ptr<cluster_config> new_conf = cs_new<cluster_config>(log_store_->next_slot(), config_->get_log_idx());
    for (cluster_config::const_srv_itor it = config_->get_servers().begin(); it != config_->get_servers().end(); ++it)
    {
        if ((*it)->get_id() != srv_id)
        {
            new_conf->get_servers().push_back(*it);
        }
    }

    l_->info(lstrfmt("removed a server from configuration and save the configuration to log store at %llu")
                 .fmt(new_conf->get_log_idx()));
    config_changing_ = true;
    bufptr new_conf_buf(new_conf->serialize());
    ptr<log_entry> entry(cs_new<log_entry>(state_->get_term(), std::move(new_conf_buf), log_val_type::conf));
    log_store_->append(entry);
    request_append_entries();
}
First remove the target srv from the leader's config, then append the new cluster config to the leader's log_store and call request_append_entries to broadcast it to the followers, so that every node eventually picks up the change.
- 4. msg_type::install_snapshot_response:
case msg_type::install_snapshot_response:
{
    if (!srv_to_join_)
    {
        l_->info("no server to join, the response must be very old.");
        return;
    }

    if (!resp->get_accepted())
    {
        l_->info("peer doesn't accept the snapshot installation request");
        return;
    }

    ptr<snapshot_sync_ctx> sync_ctx = srv_to_join_->get_snapshot_sync_ctx();
    if (sync_ctx == nilptr)
    {
        l_->err("Bug! SnapshotSyncContext must not be null");
        ctx_->state_mgr_->system_exit(-1);
        ::exit(-1);
        return;
    }

    if (resp->get_next_idx() >= sync_ctx->get_snapshot()->size())
    {
        // snapshot is done
        ptr<snapshot> nil_snap;
        l_->debug("snapshot has been copied and applied to new server, continue to sync logs after snapshot");
        srv_to_join_->set_snapshot_in_sync(nil_snap);
        srv_to_join_->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
        srv_to_join_->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
    }
    else
    {
        sync_ctx->set_offset(resp->get_next_idx());
        l_->debug(sstrfmt("continue to send snapshot to new server at offset %llu").fmt(resp->get_next_idx()));
    }

    sync_log_to_new_srv(srv_to_join_->get_next_log_idx());
}
break;
(1) Since the snapshot is sent in chunks, without srv_to_join_ there is no way to track the offset, so the response is rejected.
(2) if (sync_ctx == nilptr) follows the same reasoning as (1): without a sync_ctx the offset cannot be tracked.
(3) As noted for handle_install_snapshot_resp earlier, resp->get_next_idx() actually records the snapshot offset. If offset >= sync_ctx->get_snapshot()->size(), the snapshot transfer is complete and next_idx and matched_idx are updated; otherwise syncing resumes from the offset in the resp.
(4) After the snapshot is installed, the finer-grained sync_log_to_new_srv(srv_to_join_->get_next_log_idx()) is still called to sync the remaining data. (Much like Redis persistence: an RDB snapshot for fast bulk sync, followed by AOF for fine-grained catch-up.)
- 5. msg_type::prevote_response:
case msg_type::prevote_response:
    handle_prevote_resp(*resp);
    break;
The key function is handle_prevote_resp:
void raft_server::handle_prevote_resp(resp_msg& resp)
{
    if (!prevote_state_)
    {
        l_->info(sstrfmt("Prevote has completed, term received: %llu, current term %llu")
                     .fmt(resp.get_term(), state_->get_term()));
        return;
    }

    {
        read_lock(peers_lock_);
        bool vote_added = prevote_state_->add_voted_server(resp.get_src());
        if (!vote_added)
        {
            l_->info("Prevote has from %d has been processed.");
            return;
        }

        if (resp.get_accepted())
        {
            prevote_state_->inc_accepted_votes();
        }

        if (prevote_state_->get_accepted_votes() > (int32)((peers_.size() + 1) / 2))
        {
            l_->info(sstrfmt("Prevote passed for term %llu").fmt(state_->get_term()));
            become_candidate();
        }
        else if (prevote_state_->num_of_votes() >= (peers_.size() + 1))
        {
            l_->info(sstrfmt("Prevote failed for term %llu").fmt(state_->get_term()));
            prevote_state_->reset(); // still in prevote state, just reset the prevote state
            restart_election_timer(); // restart election timer for a new round of prevote
        }
    }
}
If more than half of the servers grant the prevote, the node becomes a candidate; if every server has responded without reaching a majority, the prevote state is reset and a new round of prevote begins.
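The prevote resolution above can be reduced to a pure function. This is an illustrative sketch (prevote_result and resolve_prevote are made-up names, not cornerstone API) of the two checks in handle_prevote_resp: a strict majority of accepts passes, and a full round of responses without a majority fails and triggers a restart.

```cpp
#include <cassert>
#include <cstddef>

enum class prevote_result { pending, passed, failed };

// accepted: prevotes granted so far; responses: servers that have answered;
// cluster_size: peers plus the prevoting node itself.
prevote_result resolve_prevote(int accepted, size_t responses, size_t cluster_size) {
    if (accepted > static_cast<int>(cluster_size / 2))
        return prevote_result::passed;   // become_candidate()
    if (responses >= cluster_size)
        return prevote_result::failed;   // reset state, restart election timer
    return prevote_result::pending;      // keep waiting for more responses
}
```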
7. Summary
- 1. A log_entry is first replicated to the followers; only once more than half of the nodes hold it does the leader commit it and apply it to its own state machine. In the implementation, sort all nodes' matched_idx and take the median.
- 2. For snapshot req/resp, the idx field can be reused to carry the transfer offset.
- 3. For syncing a new node, snapshots and log_packs are used to speed up the data transfer.
- 4. Data sync takes several rounds, each at a finer granularity, until the remaining gap is small enough.
- 5. Since snapshots are sent in chunks, a resp whose offset cannot be tracked indicates an error and is dropped.