Source Code Analysis of raft_server_resp_handlers in cornerstone

Published by TomGeller on 2024-11-26

1. Overview

In an RPC exchange, every request (req) has a matching response (resp). This article analyzes how the node that sent the req should handle the resp it receives.

2. handle_peer_resp Source Code Analysis

void raft_server::handle_peer_resp(ptr<resp_msg>& resp, const ptr<rpc_exception>& err)
{
    if (err)
    {
        l_->info(sstrfmt("peer response error: %s").fmt(err->what()));
        return;
    }

    // update peer last response time
    {
        read_lock(peers_lock_);
        auto peer = peers_.find(resp->get_src());
        if (peer != peers_.end())
        {
            peer->second->set_last_resp(system_clock::now());
        }
        else
        {
            l_->info(sstrfmt("Peer %d not found, ignore the message").fmt(resp->get_src()));
            return;
        }
    }

    l_->debug(lstrfmt("Receive a %s message from peer %d with Result=%d, Term=%llu, NextIndex=%llu")
                  .fmt(
                      __msg_type_str[resp->get_type()],
                      resp->get_src(),
                      resp->get_accepted() ? 1 : 0,
                      resp->get_term(),
                      resp->get_next_idx()));

    {
        recur_lock(lock_);
        // if term is updated, no more action is required
        if (update_term(resp->get_term()))
        {
            return;
        }

        // ignore the response that with lower term for safety
        switch (resp->get_type())
        {
            case msg_type::vote_response:
                handle_voting_resp(*resp);
                break;
            case msg_type::append_entries_response:
                handle_append_entries_resp(*resp);
                break;
            case msg_type::install_snapshot_response:
                handle_install_snapshot_resp(*resp);
                break;
            default:
                l_->err(sstrfmt("Received an unexpected message %s for response, system exits.")
                            .fmt(__msg_type_str[resp->get_type()]));
                ctx_->state_mgr_->system_exit(-1);
                ::exit(-1);
                break;
        }
    }
}
  • 1. Like the req handlers, the resp handlers have one top-level function that dispatches each resp to its specific handler through a switch-case.
  • 2. Before handing off to a specific resp handler, handle_peer_resp also updates the peer's last response time.
  • 3. If this resp's term updates the node's own term, the node has fallen behind and no further action is required (a sketch of update_term follows this list).
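
For reference, update_term implements the standard Raft rule: any message carrying a higher term forces the node back to follower. A minimal sketch of that rule (the actual cornerstone implementation also persists the state through state_mgr_ and may differ in detail):

bool raft_server::update_term(ulong term)
{
    if (term > state_->get_term())
    {
        // A newer term was observed: adopt it, forget any vote cast in
        // the old term, clear election bookkeeping, and step down.
        state_->set_term(term);
        state_->set_voted_for(-1);
        election_completed_ = false;
        votes_granted_ = 0;
        voted_servers_.clear();
        become_follower();
        return true; // the caller stops processing this response
    }

    return false;
}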

3. handle_voting_resp Source Code Analysis

void raft_server::handle_voting_resp(resp_msg& resp)
{
    if (resp.get_term() != state_->get_term())
    {
        l_->info(sstrfmt("Received an outdated vote response at term %llu v.s. current term %llu")
                     .fmt(resp.get_term(), state_->get_term()));
        return;
    }

    if (election_completed_)
    {
        l_->info("Election completed, will ignore the voting result from this server");
        return;
    }

    if (voted_servers_.find(resp.get_src()) != voted_servers_.end())
    {
        l_->info(sstrfmt("Duplicate vote from %d for term %lld").fmt(resp.get_src(), state_->get_term()));
        return;
    }

    {
        read_lock(peers_lock_);
        voted_servers_.insert(resp.get_src());
        if (resp.get_accepted())
        {
            votes_granted_ += 1;
        }

        if (voted_servers_.size() >= (peers_.size() + 1))
        {
            election_completed_ = true;
        }

        if (votes_granted_ > (int32)((peers_.size() + 1) / 2))
        {
            l_->info(sstrfmt("Server is elected as leader for term %llu").fmt(state_->get_term()));
            election_completed_ = true;
            become_leader();
        }
    }
}
  • 1. if (resp.get_term() != state_->get_term()) checks whether the terms match; only then does processing continue.
  • 2. if (election_completed_) checks whether the election has already finished: a candidate wins as soon as it has a majority, so responses can still arrive after the election is over.
  • 3. Check whether the node that sent the resp is already in the candidate's voted_servers_; if it is, this is a duplicate vote from the same node and the resp is ignored.
  • 4. If (voted_servers_.size() >= (peers_.size() + 1)), every server has responded, so the election round is complete.
  • 5. resp.get_accepted() tallies the votes granted; once they exceed half of the cluster, the election has succeeded and become_leader() is called (see the quorum example after this list).
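
As a concrete illustration of the quorum arithmetic above, consider a 5-node cluster (4 peers plus the candidate itself). A minimal, self-contained check equivalent to the test in handle_voting_resp:

#include <cstddef>
#include <cstdint>

// Same test as handle_voting_resp: a candidate wins only with strictly
// more than half of (peers + self) votes. For peer_count == 4 the
// threshold is (4 + 1) / 2 == 2, so at least 3 votes are needed,
// including the candidate's vote for itself.
bool has_majority(int32_t votes_granted, std::size_t peer_count)
{
    return votes_granted > static_cast<int32_t>((peer_count + 1) / 2);
}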

4. handle_append_entries_resp Source Code Analysis

void raft_server::handle_append_entries_resp(resp_msg& resp)
{
    read_lock(peers_lock_);
    peer_itor it = peers_.find(resp.get_src());
    if (it == peers_.end())
    {
        l_->info(sstrfmt("the response is from an unkonw peer %d").fmt(resp.get_src()));
        return;
    }

    // if there are pending logs to be synced or commit index need to be advanced, continue to send appendEntries to
    // this peer
    bool need_to_catchup = true;
    ptr<peer> p = it->second;
    if (resp.get_accepted())
    {
        {
            auto_lock(p->get_lock());
            p->set_next_log_idx(resp.get_next_idx());
            p->set_matched_idx(resp.get_next_idx() - 1);
        }

        // try to commit with this response
        std::vector<ulong> matched_indexes(peers_.size() + 1);
        matched_indexes[0] = log_store_->next_slot() - 1;
        int i = 1;
        for (it = peers_.begin(); it != peers_.end(); ++it, i++)
        {
            matched_indexes[i] = it->second->get_matched_idx();
        }

        std::sort(matched_indexes.begin(), matched_indexes.end(), std::greater<ulong>());
        commit(matched_indexes[(peers_.size() + 1) / 2]);
        need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
    }
    else
    {
        std::lock_guard<std::mutex> guard(p->get_lock());
        if (resp.get_next_idx() > 0 && p->get_next_log_idx() > resp.get_next_idx())
        {
            // fast move for the peer to catch up
            p->set_next_log_idx(resp.get_next_idx());
        }
        else if (p->get_next_log_idx() > 0)
        {
            p->set_next_log_idx(p->get_next_log_idx() - 1);
        }
    }

    // This may not be a leader anymore, such as the response was sent out long time ago
    // and the role was updated by UpdateTerm call
    // Try to match up the logs for this peer
    if (role_ == srv_role::leader && need_to_catchup)
    {
        request_append_entries(*p);
    }
}
  • 1. First check whether the node that sent the resp is still in the peers_ map; if not, log it and return.
  • 2. If the resp's accepted flag is set, the append-entries succeeded, so set the peer's matched_idx and next_idx.
  • 3. Collect the matched_idx of every peer in peers_ (plus the leader's own last log index) and sort them.
  • 4. After sorting in descending order, take the median mid_idx: at least a majority of servers have replicated up to mid_idx, so the leader can commit it and apply it to its own state machine.
  • 5. If not accepted (accepted = 0), adjust next_idx and keep probing backwards. (See the analysis of cornerstone's msg types for details.)
  • 6. If the peer still needs to catch up, send append-entries again. (The resp may answer a req the leader sent long ago, so the peer can still be behind.)

Key point:
The leader does not commit a log_entry the moment it appends it locally: the entry is first replicated to the followers' logs, and only once more than half of the servers hold it does the leader commit it and apply it to its own state machine. Concretely, this is implemented by sorting every server's matched_idx and taking the median, as sketched below.
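
A self-contained sketch of that median trick: put the leader's own last log index together with every follower's matched_idx, sort descending, and the element at position n/2 is the highest index already replicated on a majority:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// matched holds one entry per server: the leader's last log index plus
// each follower's matched_idx. Returns the highest index that a
// majority of servers have replicated.
uint64_t majority_commit_index(std::vector<uint64_t> matched)
{
    std::sort(matched.begin(), matched.end(), std::greater<uint64_t>());
    // Sorted descending, the value at position n/2 is held by at least
    // n/2 + 1 servers, i.e. a strict majority.
    return matched[matched.size() / 2];
}

// Example: 5 servers with matched indexes {10, 9, 9, 7, 4}. Sorted
// descending, position 2 holds 9; the servers at 10, 9 and 9 form a
// majority, so the leader may commit up to index 9.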

5. handle_install_snapshot_resp Source Code Analysis

void raft_server::handle_install_snapshot_resp(resp_msg& resp)
{
    read_lock(peers_lock_);
    peer_itor it = peers_.find(resp.get_src());
    if (it == peers_.end())
    {
        l_->info(sstrfmt("the response is from an unkonw peer %d").fmt(resp.get_src()));
        return;
    }

    // if there are pending logs to be synced or commit index need to be advanced, continue to send appendEntries to
    // this peer
    bool need_to_catchup = true;
    ptr<peer> p = it->second;
    if (resp.get_accepted())
    {
        std::lock_guard<std::mutex> guard(p->get_lock());
        ptr<snapshot_sync_ctx> sync_ctx = p->get_snapshot_sync_ctx();
        if (sync_ctx == nilptr)
        {
            l_->info("no snapshot sync context for this peer, drop the response");
            need_to_catchup = false;
        }
        else
        {
            if (resp.get_next_idx() >= sync_ctx->get_snapshot()->size())
            {
                l_->debug("snapshot sync is done");
                ptr<snapshot> nil_snp;
                p->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
                p->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
                p->set_snapshot_in_sync(nil_snp);
                need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
            }
            else
            {
                l_->debug(sstrfmt("continue to sync snapshot at offset %llu").fmt(resp.get_next_idx()));
                sync_ctx->set_offset(resp.get_next_idx());
            }
        }
    }
    else
    {
        l_->info("peer declines to install the snapshot, will retry");
    }

    // This may not be a leader anymore, such as the response was sent out long time ago
    // and the role was updated by UpdateTerm call
    // Try to match up the logs for this peer
    if (role_ == srv_role::leader && need_to_catchup)
    {
        request_append_entries(*p);
    }
}
  • The core of this handler is the snippet below; everything else matches handle_append_entries_resp above.
if (resp.get_next_idx() >= sync_ctx->get_snapshot()->size())
{
    l_->debug("snapshot sync is done");
    ptr<snapshot> nil_snp;
    p->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
    p->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
    p->set_snapshot_in_sync(nil_snp);
    need_to_catchup = p->clear_pending_commit() || resp.get_next_idx() < log_store_->next_slot();
}
else
{
    l_->debug(sstrfmt("continue to sync snapshot at offset %llu").fmt(resp.get_next_idx()));
    sync_ctx->set_offset(resp.get_next_idx());
}
  • Because this is a snapshot, there is no log index to exchange, so resp.get_next_idx() actually carries the offset up to which the follower has received the snapshot. If that offset >= sync_ctx->get_snapshot()->size(), the transfer is done, and next_idx and matched_idx are set from the snapshot's last log index. Otherwise, sending resumes from the already-acknowledged offset (see the sketch below).
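
The following self-contained sketch illustrates such offset-driven, resumable transfer. The names send_chunk and chunk_size are illustrative only, not cornerstone APIs; the acknowledged offset plays the role of resp.get_next_idx():

#include <algorithm>
#include <cstdint>
#include <cstdio>

static const uint64_t chunk_size = 4 * 1024;

// Pretend to ship bytes [offset, offset + len) of the snapshot and
// return the offset the follower would acknowledge next.
uint64_t send_chunk(uint64_t offset, uint64_t len, uint64_t total)
{
    uint64_t end = std::min(offset + len, total);
    std::printf("sent bytes [%llu, %llu)\n",
                (unsigned long long)offset, (unsigned long long)end);
    return end;
}

int main()
{
    uint64_t snapshot_size = 10 * 1024;
    uint64_t ack_offset = 0; // nonzero after a retry: transfer resumes
    while (ack_offset < snapshot_size)
    {
        ack_offset = send_chunk(ack_offset, chunk_size, snapshot_size);
    }
    // ack_offset >= snapshot_size: the same condition the handler
    // checks, after which normal log replication resumes at
    // last_log_idx + 1.
    return 0;
}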

6. Extra ext_resp Handling Source Code Analysis

void raft_server::handle_ext_resp(ptr<resp_msg>& resp, const ptr<rpc_exception>& err)
{
    recur_lock(lock_);
    if (err)
    {
        handle_ext_resp_err(*err);
        return;
    }

    l_->debug(lstrfmt("Receive an extended %s message from peer %d with Result=%d, Term=%llu, NextIndex=%llu")
                  .fmt(
                      __msg_type_str[resp->get_type()],
                      resp->get_src(),
                      resp->get_accepted() ? 1 : 0,
                      resp->get_term(),
                      resp->get_next_idx()));

    switch (resp->get_type())
    {
        case msg_type::sync_log_response:
            if (srv_to_join_)
            {
                // we are reusing heartbeat interval value to indicate when to stop retry
                srv_to_join_->resume_hb_speed();
                srv_to_join_->set_next_log_idx(resp->get_next_idx());
                srv_to_join_->set_matched_idx(resp->get_next_idx() - 1);
                sync_log_to_new_srv(resp->get_next_idx());
            }
            break;
        case msg_type::join_cluster_response:
            if (srv_to_join_)
            {
                if (resp->get_accepted())
                {
                    l_->debug("new server confirms it will join, start syncing logs to it");
                    sync_log_to_new_srv(resp->get_next_idx());
                }
                else
                {
                    l_->debug("new server cannot accept the invitation, give up");
                }
            }
            else
            {
                l_->debug("no server to join, drop the message");
            }
            break;
        case msg_type::leave_cluster_response:
            if (!resp->get_accepted())
            {
                l_->debug("peer doesn't accept to stepping down, stop proceeding");
                return;
            }

            l_->debug("peer accepted to stepping down, removing this server from cluster");
            rm_srv_from_cluster(resp->get_src());
            break;
        case msg_type::install_snapshot_response:
        {
            if (!srv_to_join_)
            {
                l_->info("no server to join, the response must be very old.");
                return;
            }

            if (!resp->get_accepted())
            {
                l_->info("peer doesn't accept the snapshot installation request");
                return;
            }

            ptr<snapshot_sync_ctx> sync_ctx = srv_to_join_->get_snapshot_sync_ctx();
            if (sync_ctx == nilptr)
            {
                l_->err("Bug! SnapshotSyncContext must not be null");
                ctx_->state_mgr_->system_exit(-1);
                ::exit(-1);
                return;
            }

            if (resp->get_next_idx() >= sync_ctx->get_snapshot()->size())
            {
                // snapshot is done
                ptr<snapshot> nil_snap;
                l_->debug("snapshot has been copied and applied to new server, continue to sync logs after snapshot");
                srv_to_join_->set_snapshot_in_sync(nil_snap);
                srv_to_join_->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
                srv_to_join_->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
            }
            else
            {
                sync_ctx->set_offset(resp->get_next_idx());
                l_->debug(sstrfmt("continue to send snapshot to new server at offset %llu").fmt(resp->get_next_idx()));
            }

            sync_log_to_new_srv(srv_to_join_->get_next_log_idx());
        }
        break;
        case msg_type::prevote_response:
            handle_prevote_resp(*resp);
            break;
        default:
            l_->err(lstrfmt("received an unexpected response message type %s, for safety, stepping down")
                        .fmt(__msg_type_str[resp->get_type()]));
            ctx_->state_mgr_->system_exit(-1);
            ::exit(-1);
            break;
    }
}

Before digging in, let's first outline the call sequence:
1. The leader sends an invite_srv_to_join_cluster request to the new node. On receiving it, the new node updates its role_, leader_, and other state, and calls reconfigure to reset the cluster config. Once updated, it sends a join_cluster_response back to the leader.
2. The leader handles that response in the msg_type::join_cluster_response branch of the switch-case. Handling it ends with a call to sync_log_to_new_srv, which sends a sync_log_request to the new node.
3. When the new node receives the sync_log_req, it sends a sync_log_response to the leader, which the leader handles in the msg_type::sync_log_response branch.

void raft_server::sync_log_to_new_srv(ulong start_idx)
{
    // only sync committed logs
    int32 gap = (int32)(quick_commit_idx_ - start_idx);
    if (gap < ctx_->params_->log_sync_stop_gap_)
    {
        l_->info(lstrfmt("LogSync is done for server %d with log gap %d, now put the server into cluster")
                     .fmt(srv_to_join_->get_id(), gap));
        ptr<cluster_config> new_conf = cs_new<cluster_config>(log_store_->next_slot(), config_->get_log_idx());
        new_conf->get_servers().insert(
            new_conf->get_servers().end(), config_->get_servers().begin(), config_->get_servers().end());
        new_conf->get_servers().push_back(conf_to_add_);
        bufptr new_conf_buf(new_conf->serialize());
        ptr<log_entry> entry(cs_new<log_entry>(state_->get_term(), std::move(new_conf_buf), log_val_type::conf));
        log_store_->append(entry);
        config_changing_ = true;
        request_append_entries();
        return;
    }

    ptr<req_msg> req;
    if (start_idx > 0 && start_idx < log_store_->start_index())
    {
        req = create_sync_snapshot_req(*srv_to_join_, start_idx, state_->get_term(), quick_commit_idx_);
    }
    else
    {
        int32 size_to_sync = std::min(gap, ctx_->params_->log_sync_batch_size_);
        bufptr log_pack = log_store_->pack(start_idx, size_to_sync);
        req = cs_new<req_msg>(
            state_->get_term(),
            msg_type::sync_log_request,
            id_,
            srv_to_join_->get_id(),
            0L,
            start_idx - 1,
            quick_commit_idx_);
        req->log_entries().push_back(
            cs_new<log_entry>(state_->get_term(), std::move(log_pack), log_val_type::log_pack));
    }

    srv_to_join_->send_req(req, ex_resp_handler_);
}
  • 1. The msg_type::sync_log_response case:
                srv_to_join_->resume_hb_speed();
                srv_to_join_->set_next_log_idx(resp->get_next_idx());
                srv_to_join_->set_matched_idx(resp->get_next_idx() - 1);
                sync_log_to_new_srv(resp->get_next_idx());

After receiving the node's resp, the leader restores the peer's heartbeat speed (the heartbeat interval is reused as a retry backoff, per the comment in the code), sets its next_idx and matched_idx, and then calls sync_log_to_new_srv to sync once more, repeating sync_log_request until both sides hold the same data. (Similar to Redis master-replica sync, where writes arriving during the sync are buffered on the master and replayed to the replica at the end.)
From the sync_log_to_new_srv source above we can see that log sync is not plain request_append_entries replication, because a newly joined node may be far behind. The leader first computes the gap between quick_commit_idx_ and start_idx. If gap < ctx_->params_->log_sync_stop_gap_, log sync is considered done: the leader appends the new cluster config entry and falls back to ordinary request_append_entries. Otherwise, if start_idx has already been compacted out of the leader's log store (start_idx < log_store_->start_index()), the leader sends a snapshot; if not, it packs up to log_sync_batch_size_ entries into a log_pack and sends a sync_log_request. The decision is sketched after this paragraph.
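
For clarity, here is the branch structure of sync_log_to_new_srv restated with plain integers (a sketch of the decision only, not the real implementation):

#include <cstdint>

enum class sync_action
{
    finish_and_add_to_cluster, // append new config, normal replication
    send_snapshot,             // required entries already compacted away
    send_log_pack              // batch entries into a log_pack
};

sync_action choose_sync_action(
    uint64_t start_idx,
    uint64_t quick_commit_idx,
    uint64_t log_store_start_index,
    int32_t log_sync_stop_gap)
{
    int32_t gap = (int32_t)(quick_commit_idx - start_idx);
    if (gap < log_sync_stop_gap)
    {
        // Close enough: put the server into the cluster and let
        // ordinary append_entries finish the catch-up.
        return sync_action::finish_and_add_to_cluster;
    }

    if (start_idx > 0 && start_idx < log_store_start_index)
    {
        // The entries the new server needs are gone from the log
        // store, so only a snapshot can bring it forward.
        return sync_action::send_snapshot;
    }

    return sync_action::send_log_pack;
}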

  • 2. The msg_type::join_cluster_response case:
case msg_type::join_cluster_response:
            if (srv_to_join_)
            {
                if (resp->get_accepted())
                {
                    l_->debug("new server confirms it will join, start syncing logs to it");
                    sync_log_to_new_srv(resp->get_next_idx());
                }
                else
                {
                    l_->debug("new server cannot accept the invitation, give up");
                }
            }
            else
            {
                l_->debug("no server to join, drop the message");
            }
            break;

Having worked through the previous resp, this one is easy: if srv_to_join_ exists and the invitation was accepted, call sync_log_to_new_srv to start syncing data to the new server.

  • 3. The msg_type::leave_cluster_response case:
case msg_type::leave_cluster_response:
            if (!resp->get_accepted())
            {
                l_->debug("peer doesn't accept to stepping down, stop proceeding");
                return;
            }

            l_->debug("peer accepted to stepping down, removing this server from cluster");
            rm_srv_from_cluster(resp->get_src());
            break;

The key call is rm_srv_from_cluster.

void raft_server::rm_srv_from_cluster(int32 srv_id)
{
    ptr<cluster_config> new_conf = cs_new<cluster_config>(log_store_->next_slot(), config_->get_log_idx());
    for (cluster_config::const_srv_itor it = config_->get_servers().begin(); it != config_->get_servers().end(); ++it)
    {
        if ((*it)->get_id() != srv_id)
        {
            new_conf->get_servers().push_back(*it);
        }
    }

    l_->info(lstrfmt("removed a server from configuration and save the configuration to log store at %llu")
                 .fmt(new_conf->get_log_idx()));
    config_changing_ = true;
    bufptr new_conf_buf(new_conf->serialize());
    ptr<log_entry> entry(cs_new<log_entry>(state_->get_term(), std::move(new_conf_buf), log_val_type::conf));
    log_store_->append(entry);
    request_append_entries();
}

The leader first drops the server being removed from its config, then writes the cluster change into its log_store and calls request_append_entries to broadcast it to every follower, so the change takes effect on all nodes.

  • 4. The msg_type::install_snapshot_response case:
 case msg_type::install_snapshot_response:
        {
            if (!srv_to_join_)
            {
                l_->info("no server to join, the response must be very old.");
                return;
            }

            if (!resp->get_accepted())
            {
                l_->info("peer doesn't accept the snapshot installation request");
                return;
            }

            ptr<snapshot_sync_ctx> sync_ctx = srv_to_join_->get_snapshot_sync_ctx();
            if (sync_ctx == nilptr)
            {
                l_->err("Bug! SnapshotSyncContext must not be null");
                ctx_->state_mgr_->system_exit(-1);
                ::exit(-1);
                return;
            }

            if (resp->get_next_idx() >= sync_ctx->get_snapshot()->size())
            {
                // snapshot is done
                ptr<snapshot> nil_snap;
                l_->debug("snapshot has been copied and applied to new server, continue to sync logs after snapshot");
                srv_to_join_->set_snapshot_in_sync(nil_snap);
                srv_to_join_->set_next_log_idx(sync_ctx->get_snapshot()->get_last_log_idx() + 1);
                srv_to_join_->set_matched_idx(sync_ctx->get_snapshot()->get_last_log_idx());
            }
            else
            {
                sync_ctx->set_offset(resp->get_next_idx());
                l_->debug(sstrfmt("continue to send snapshot to new server at offset %llu").fmt(resp->get_next_idx()));
            }

            sync_log_to_new_srv(srv_to_join_->get_next_log_idx());
        }
        break;

(1) Since the snapshot is sent in chunks, the offset cannot be tracked at all without srv_to_join_, so the handler reports and returns.
(2) if (sync_ctx == nilptr) follows the same logic as (1): the sync_ctx is required to track the offset.
(3) As noted for handle_install_snapshot_resp earlier, resp->get_next_idx() actually records the snapshot offset. Two cases follow: if the offset >= sync_ctx->get_snapshot()->size(), the snapshot is complete, so next_idx and matched_idx are updated; otherwise syncing continues from the offset carried in the resp.
(4) Even after the snapshot is installed, the finer-grained sync_log_to_new_srv(srv_to_join_->get_next_log_idx()) is still called to keep syncing. (Much like Redis persistence: an RDB snapshot for fast bulk sync first, then the AOF for finer-grained catch-up.)

  • 5. The msg_type::prevote_response case:
case msg_type::prevote_response:
            handle_prevote_resp(*resp);
            break;

The key call is handle_prevote_resp:

void raft_server::handle_prevote_resp(resp_msg& resp)
{
    if (!prevote_state_)
    {
        l_->info(sstrfmt("Prevote has completed, term received: %llu, current term %llu")
                     .fmt(resp.get_term(), state_->get_term()));
        return;
    }

    {
        read_lock(peers_lock_);
        bool vote_added = prevote_state_->add_voted_server(resp.get_src());
        if (!vote_added)
        {
            l_->info("Prevote has from %d has been processed.");
            return;
        }

        if (resp.get_accepted())
        {
            prevote_state_->inc_accepted_votes();
        }

        if (prevote_state_->get_accepted_votes() > (int32)((peers_.size() + 1) / 2))
        {
            l_->info(sstrfmt("Prevote passed for term %llu").fmt(state_->get_term()));
            become_candidate();
        }
        else if (prevote_state_->num_of_votes() >= (peers_.size() + 1))
        {
            l_->info(sstrfmt("Prevote failed for term %llu").fmt(state_->get_term()));
            prevote_state_->reset();  // still in prevote state, just reset the prevote state
            restart_election_timer(); // restart election timer for a new round of prevote
        }
    }
}

If the accepted prevotes exceed half of the cluster, the node becomes a candidate; if every server has responded without reaching a majority, the prevote state is reset and the election timer restarts for a new round of prevote. A compact sketch of these two exit conditions follows.
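
A compact restatement of those two exit conditions with plain integers (a sketch, not cornerstone code):

#include <cstddef>
#include <cstdint>

enum class prevote_outcome { pending, passed, failed };

// Pass on a strict majority of acceptances, fail once every server has
// answered without one, otherwise keep waiting for more responses.
prevote_outcome check_prevote(
    int32_t accepted_votes, std::size_t responses, std::size_t peer_count)
{
    const std::size_t cluster_size = peer_count + 1;
    if (accepted_votes > (int32_t)(cluster_size / 2))
        return prevote_outcome::passed; // become_candidate()
    if (responses >= cluster_size)
        return prevote_outcome::failed; // reset + restart election timer
    return prevote_outcome::pending;
}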

7. Summary

  • 1. The leader commits a log_entry only after more than half of the servers have replicated it into their logs; only then does it apply the entry to its own state machine. In the implementation, sort every server's matched_idx and take the median.
  • 2. For snapshot req and resp, the idx field can be reused to carry the transfer offset.
  • 3. When syncing data to a new node, snapshots and log_packs are used to speed up the transfer.
  • 4. Data sync takes multiple rounds, each at a finer granularity, until the new node has caught up.
  • 5. Because snapshots are sent in chunks, a resp whose offset cannot be tracked indicates something has gone wrong.
