Source Code Analysis of raft_server_req_handlers in cornerstone

Posted by TomGeller on 2024-11-25

1. Overview

As discussed previously, raft_server is the core of cornerstone, and it sends a great many requests. So how does a follower process a req received from the leader?
This article walks through the source code in cornerstone that handles those requests.

2. process_req Source Code Analysis

ptr<resp_msg> raft_server::process_req(req_msg& req)
{
    ptr<resp_msg> resp;
    l_->debug(lstrfmt("Receive a %s message from %d with LastLogIndex=%llu, LastLogTerm=%llu, EntriesLength=%d, "
                      "CommitIndex=%llu and Term=%llu")
                  .fmt(
                      __msg_type_str[req.get_type()],
                      req.get_src(),
                      req.get_last_log_idx(),
                      req.get_last_log_term(),
                      req.log_entries().size(),
                      req.get_commit_idx(),
                      req.get_term()));
    {
        recur_lock(lock_);
        if (req.get_type() == msg_type::append_entries_request || req.get_type() == msg_type::vote_request ||
            req.get_type() == msg_type::install_snapshot_request)
        {
            // we allow the server to be continue after term updated to save a round message
            update_term(req.get_term());

            // Reset stepping down value to prevent this server goes down when leader crashes after sending a
            // LeaveClusterRequest
            if (steps_to_down_ > 0)
            {
                steps_to_down_ = 2;
            }
        }

        if (req.get_type() == msg_type::append_entries_request)
        {
            resp = handle_append_entries(req);
        }
        else if (req.get_type() == msg_type::vote_request)
        {
            resp = handle_vote_req(req);
        }
        else if (req.get_type() == msg_type::client_request)
        {
            resp = handle_cli_req(req);
        }
        else
        {
            // extended requests
            resp = handle_extended_msg(req);
        }
    }

    if (resp)
    {
        l_->debug(lstrfmt("Response back a %s message to %d with Accepted=%d, Term=%llu, NextIndex=%llu")
                      .fmt(
                          __msg_type_str[resp->get_type()],
                          resp->get_dst(),
                          resp->get_accepted() ? 1 : 0,
                          resp->get_term(),
                          resp->get_next_idx()));
    }

    return resp;
}

ptr<resp_msg> raft_server::handle_extended_msg(req_msg& req)
{
    switch (req.get_type())
    {
        case msg_type::add_server_request:
            return handle_add_srv_req(req);
        case msg_type::remove_server_request:
            return handle_rm_srv_req(req);
        case msg_type::sync_log_request:
            return handle_log_sync_req(req);
        case msg_type::join_cluster_request:
            return handle_join_cluster_req(req);
        case msg_type::leave_cluster_request:
            return handle_leave_cluster_req(req);
        case msg_type::install_snapshot_request:
            return handle_install_snapshot_req(req);
        case msg_type::prevote_request:
            return handle_prevote_req(req);
        default:
            l_->err(
                sstrfmt("receive an unknown request %s, for safety, step down.").fmt(__msg_type_str[req.get_type()]));
            ctx_->state_mgr_->system_exit(-1);
            ::exit(-1);
            break;
    }

    return ptr<resp_msg>();
}
  • 1. If the request is an append_entries, vote, or install_snapshot request, its term can be used to update this server's own term; at the same time, if the server is stepping down, steps_to_down_ is reset to 2 so the server does not go down prematurely when the leader crashes right after sending a LeaveClusterRequest.
  • 2. The request is then dispatched to the matching handler (an if/else chain in process_req, and a switch in handle_extended_msg for the extended types).

Key point:
The way the term is updated here is clever. append_entries and install_snapshot both synchronize data between leader and follower, and after an election ends the follower does not actually know that the election is over. Rather than polling for the election result, which wastes time and is inefficient, the follower takes an event-driven approach: it updates its term as a side effect of receiving an append_entries or install_snapshot RPC, which is fast and cheap. Updating the term on vote_request as well covers network partitions: nodes stuck in candidate state during a partition can bring their term up to date through a vote_request once the network recovers. (append_entries and install_snapshot likewise help partitioned nodes refresh their term after the network heals.) A sketch of the rule behind update_term follows.
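
For reference, here is a minimal sketch of the standard Raft rule that update_term implements. It is not the verbatim cornerstone source, but every identifier in it appears in the handlers quoted in this post.

bool raft_server::update_term(ulong term)
{
    if (term > state_->get_term())
    {
        // a higher term invalidates our current vote and, if we are the
        // leader or a candidate, our claim to leadership
        state_->set_term(term);
        state_->set_voted_for(-1);
        ctx_->state_mgr_->save_state(*state_); // persist before acting on it
        become_follower();
        return true;
    }

    return false;
}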

3. handle_append_entries Source Code Analysis

ptr<resp_msg> raft_server::handle_append_entries(req_msg& req)
{
    if (req.get_term() == state_->get_term())
    {
        if (role_ == srv_role::candidate)
        {
            become_follower();
        }
        else if (role_ == srv_role::leader)
        {
            l_->debug(lstrfmt("Receive AppendEntriesRequest from another leader(%d) with same term, there must be a "
                              "bug, server exits")
                          .fmt(req.get_src()));
            ctx_->state_mgr_->system_exit(-1);
            ::exit(-1);
        }
        else
        {
            restart_election_timer();
        }
    }

    // After a snapshot the req.get_last_log_idx() may less than log_store_->next_slot() but equals to
    // log_store_->next_slot() -1 In this case, log is Okay if req.get_last_log_idx() == lastSnapshot.get_last_log_idx()
    // && req.get_last_log_term() == lastSnapshot.get_last_log_term() In not accepted case, we will return
    // log_store_->next_slot() for the leader to quick jump to the index that might aligned
    ptr<resp_msg> resp(cs_new<resp_msg>(
        state_->get_term(), msg_type::append_entries_response, id_, req.get_src(), log_store_->next_slot()));
    bool log_okay = req.get_last_log_idx() == 0 || (req.get_last_log_idx() < log_store_->next_slot() &&
                                                    req.get_last_log_term() == term_for_log(req.get_last_log_idx()));
    if (req.get_term() < state_->get_term() || !log_okay)
    {
        return resp;
    }

    // follower & log is okay
    if (req.log_entries().size() > 0)
    {
        // write logs to store, start from overlapped logs
        ulong idx = req.get_last_log_idx() + 1;
        size_t log_idx = 0;
        while (idx < log_store_->next_slot() && log_idx < req.log_entries().size())
        {
            if (log_store_->term_at(idx) == req.log_entries().at(log_idx)->get_term())
            {
                idx++;
                log_idx++;
            }
            else
            {
                break;
            }
        }

        // dealing with overwrites
        while (idx < log_store_->next_slot() && log_idx < req.log_entries().size())
        {
            ptr<log_entry> old_entry(log_store_->entry_at(idx));
            if (old_entry->get_val_type() == log_val_type::app_log)
            {
                state_machine_->rollback(idx, old_entry->get_buf(), old_entry->get_cookie());
            }
            else if (old_entry->get_val_type() == log_val_type::conf)
            {
                l_->info(sstrfmt("revert from a prev config change to config at %llu").fmt(config_->get_log_idx()));
                config_changing_ = false;
            }

            ptr<log_entry> entry = req.log_entries().at(log_idx);
            log_store_->write_at(idx, entry);
            if (entry->get_val_type() == log_val_type::app_log)
            {
                state_machine_->pre_commit(idx, entry->get_buf(), entry->get_cookie());
            }
            else if (entry->get_val_type() == log_val_type::conf)
            {
                l_->info(sstrfmt("receive a config change from leader at %llu").fmt(idx));
                config_changing_ = true;
            }

            idx += 1;
            log_idx += 1;
        }

        // append new log entries
        while (log_idx < req.log_entries().size())
        {
            ptr<log_entry> entry = req.log_entries().at(log_idx++);
            ulong idx_for_entry = log_store_->append(entry);
            if (entry->get_val_type() == log_val_type::conf)
            {
                l_->info(sstrfmt("receive a config change from leader at %llu").fmt(idx_for_entry));
                config_changing_ = true;
            }
            else if (entry->get_val_type() == log_val_type::app_log)
            {
                state_machine_->pre_commit(idx_for_entry, entry->get_buf(), entry->get_cookie());
            }
        }
    }

    leader_ = req.get_src();
    commit(req.get_commit_idx());
    resp->accept(req.get_last_log_idx() + req.log_entries().size() + 1);
    return resp;
}
  • 1. First branch on the current role (normally, a follower that just saw the election end has already matched its term to the leader's in process_req):
    (1) candidate: the election is over and a leader has emerged, so fall back to follower.
    (2) leader: another leader with the same term means a bug, so terminate the process.
    (3) follower: a message from the leader has arrived, so restart_election_timer re-arms the timer and prevents a spurious election on election_timeout.
  • 2. The hardest part to understand is the log_okay check (for the snapshot boundary case, see the term_for_log sketch after this list):
bool log_okay = req.get_last_log_idx() == 0 || (req.get_last_log_idx() < log_store_->next_slot() &&
                                                    req.get_last_log_term() == term_for_log(req.get_last_log_idx()));

(1) Here the req's last_log_idx = the leader's guess of the follower's next_idx - 1. If last_log_idx == 0, there is no preceding entry to match, so the follower's log_store is updated from the very beginning.
(2) req.get_last_log_idx() < log_store_->next_slot() means last_log_idx falls within the follower's log_store. If last_log_idx lies beyond the follower's log_store, the guess was wrong and must be corrected; returning the follower's log_store_->next_slot() in the resp lets the leader jump straight to an index that might align. (See the cornerstone msg type analysis for details.)

  • 3. The rest updates the follower's log_store: entries that already match are skipped for speed, conflicting entries are overwritten, and the remainder is appended at the end.

Key point:
When synchronizing follower and leader data, entries that already match need no update and are simply skipped, which speeds things up.
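
The comment above the resp construction mentions the snapshot case, which is exactly what term_for_log has to handle: once an index has been compacted away, its term survives only in the last snapshot's metadata. Below is a hedged sketch of that logic; start_index() and last_snapshot() are assumed accessor names, not quoted from the source.

ulong raft_server::term_for_log(ulong log_idx)
{
    if (log_idx == 0)
    {
        return 0L; // the empty log prefix has term 0 by convention
    }

    if (log_idx >= log_store_->start_index())
    {
        return log_store_->term_at(log_idx); // still present in the log store
    }

    // compacted away: the index must sit exactly on the snapshot boundary
    ptr<snapshot> last_snapshot(state_machine_->last_snapshot());
    if (!last_snapshot || log_idx != last_snapshot->get_last_log_idx())
    {
        ctx_->state_mgr_->system_exit(-1); // bad log index, unrecoverable
        ::exit(-1);
    }

    return last_snapshot->get_last_log_term();
}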

4. handle_vote_req Source Code Analysis

ptr<resp_msg> raft_server::handle_vote_req(req_msg& req)
{
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::vote_response, id_, req.get_src()));
    bool log_okay = req.get_last_log_term() > log_store_->last_entry()->get_term() ||
                    (req.get_last_log_term() == log_store_->last_entry()->get_term() &&
                     log_store_->next_slot() - 1 <= req.get_last_log_idx());
    bool grant = req.get_term() == state_->get_term() && log_okay &&
                 (state_->get_voted_for() == req.get_src() || state_->get_voted_for() == -1);
    if (grant)
    {
        resp->accept(log_store_->next_slot());
        state_->set_voted_for(req.get_src());
        ctx_->state_mgr_->save_state(*state_);
    }

    return resp;
}
  • 1. The key is the log_okay check (a worked example follows after this list):
bool log_okay = req.get_last_log_term() > log_store_->last_entry()->get_term() ||
                    (req.get_last_log_term() == log_store_->last_entry()->get_term() &&
                     log_store_->next_slot() - 1 <= req.get_last_log_idx());

(1) req.get_last_log_term() > log_store_->last_entry()->get_term() compares the candidate's last log term against this server's last log entry's term; if the candidate's is higher, log_okay = true immediately.
(2) If the terms are equal, check whether the candidate's last log index is at least as high as this server's; if so, log_okay = true.

  • 2. Even when log_okay passes, further checks remain:
  bool grant = req.get_term() == state_->get_term() && log_okay &&
                 (state_->get_voted_for() == req.get_src() || state_->get_voted_for() == -1);

(1) The terms must match.
(2) The server must not have voted for anyone else: raft allows a follower at most one vote per term, because granting two votes could let two candidates each reach a majority in the same term and produce multiple leaders.

Key points:
1. Whether to grant the vote is judged from the candidate's last log entry.
2. A follower may vote for only one candidate per term; otherwise two candidates could both win a majority, yielding multiple leaders.
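
To make the ordering concrete, here is the same up-to-date check as a hypothetical standalone helper with a worked example; the function name and parameters are illustrative only.

// Raft's "at least as up-to-date" comparison from handle_vote_req
bool is_candidate_log_up_to_date(
    ulong candidate_last_term, ulong candidate_last_idx, ulong my_last_term, ulong my_last_idx)
{
    if (candidate_last_term != my_last_term)
    {
        return candidate_last_term > my_last_term; // a higher term always wins
    }

    return candidate_last_idx >= my_last_idx; // same term: the longer log wins
}

// Example: this server's log ends at (term=3, idx=10).
//   candidate ends at (term=4, idx=7)  -> true  (higher term beats longer log)
//   candidate ends at (term=3, idx=9)  -> false (same term, shorter log)
//   candidate ends at (term=3, idx=10) -> true  (ties are granted)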

5. handle_cli_req Source Code Analysis

ptr<resp_msg> raft_server::handle_cli_req(req_msg& req)
{
    bool leader = is_leader();

    // check if leader has expired.
    // there could be a case that the leader just elected, in that case, client can
    // just simply retry, no safety issue here.
    if (role_ == srv_role::leader && !leader)
    {
        return cs_new<resp_msg>(state_->get_term(), msg_type::append_entries_response, id_, -1);
    }

    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::append_entries_response, id_, leader_));
    if (!leader)
    {
        return resp;
    }

    std::vector<ptr<log_entry>>& entries = req.log_entries();
    for (size_t i = 0; i < entries.size(); ++i)
    {
        // force the log's term to current term
        entries.at(i)->set_term(state_->get_term());

        log_store_->append(entries.at(i));
        state_machine_->pre_commit(log_store_->next_slot() - 1, entries.at(i)->get_buf(), entries.at(i)->get_cookie());
    }

    // urgent commit, so that the commit will not depend on hb
    request_append_entries();
    resp->accept(log_store_->next_slot());
    return resp;
}
  • 1. First check whether this server is the leader; if not, return without handling the request.
  • 2. If it is the leader, append the client's entries to its own log_store (pre-committing each one to the state machine), then replicate them to the followers via request_append_entries rather than waiting for the next heartbeat. A sketch of the client side follows.

Key point:
raft uses a strong-leader model: client requests can only be handled by the leader, which appends them to its own log first and then replicates them to the followers.
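
For illustration, a client request could be assembled as below, reusing only the req_msg and log_entry construction patterns visible earlier in this post; make_client_req is a hypothetical helper, not a cornerstone API.

ptr<req_msg> make_client_req(int32 src_id, int32 leader_id, ptr<buffer> payload)
{
    // term / last_log_term / last_log_idx / commit_idx may all be zero here:
    // handle_cli_req ignores them and forces each entry's term to the
    // leader's current term before appending
    ptr<req_msg> req(cs_new<req_msg>(0L, msg_type::client_request, src_id, leader_id, 0L, 0L, 0L));
    req->log_entries().push_back(cs_new<log_entry>(0L, payload, log_val_type::app_log));
    return req;
}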

6. handle_add_srv_req Source Code Analysis

ptr<resp_msg> raft_server::handle_add_srv_req(req_msg& req)
{
    std::vector<ptr<log_entry>>& entries(req.log_entries());
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::add_server_response, id_, leader_));
    if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::cluster_server)
    {
        l_->debug("bad add server request as we are expecting one log entry with value type of ClusterServer");
        return resp;
    }

    if (role_ != srv_role::leader)
    {
        l_->info("this is not a leader, cannot handle AddServerRequest");
        return resp;
    }

    ptr<srv_config> srv_conf(srv_config::deserialize(entries[0]->get_buf()));
    {
        read_lock(peers_lock_);
        if (peers_.find(srv_conf->get_id()) != peers_.end() || id_ == srv_conf->get_id())
        {
            l_->warn(
                lstrfmt("the server to be added has a duplicated id with existing server %d").fmt(srv_conf->get_id()));
            return resp;
        }
    }

    if (config_changing_)
    {
        // the previous config has not committed yet
        l_->info("previous config has not committed yet");
        return resp;
    }

    conf_to_add_ = std::move(srv_conf);
    timer_task<peer&>::executor exec = [this](peer& p) { this->handle_hb_timeout(p); };
    srv_to_join_ = cs_new<peer>(conf_to_add_, *ctx_, exec);
    invite_srv_to_join_cluster();
    resp->accept(log_store_->next_slot());
    return resp;
}

void raft_server::invite_srv_to_join_cluster()
{
    ptr<req_msg> req(cs_new<req_msg>(
        state_->get_term(),
        msg_type::join_cluster_request,
        id_,
        srv_to_join_->get_id(),
        0L,
        log_store_->next_slot() - 1,
        quick_commit_idx_));
    req->log_entries().push_back(cs_new<log_entry>(state_->get_term(), config_->serialize(), log_val_type::conf));
    srv_to_join_->send_req(req, ex_resp_handler_);
}

ptr<resp_msg> raft_server::handle_join_cluster_req(req_msg& req)
{
    std::vector<ptr<log_entry>>& entries = req.log_entries();
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::join_cluster_response, id_, req.get_src()));
    if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::conf)
    {
        l_->info("receive an invalid JoinClusterRequest as the log entry value doesn't meet the requirements");
        return resp;
    }

    if (catching_up_)
    {
        l_->info("this server is already in log syncing mode");
        return resp;
    }

    catching_up_ = true;
    role_ = srv_role::follower;
    leader_ = req.get_src();
    sm_commit_index_ = 0;
    quick_commit_idx_ = 0;
    state_->set_voted_for(-1);
    state_->set_term(req.get_term());
    ctx_->state_mgr_->save_state(*state_);
    reconfigure(cluster_config::deserialize(entries[0]->get_buf()));
    resp->accept(log_store_->next_slot());
    return resp;
}
  • 1. First verify that exactly one srv is being added: if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::cluster_server). To guarantee that an election elects only one leader, raft mandates that each cluster change alters a single srv. (See "Raft Algorithm (3): How to Solve the Membership Change Problem?" for details.)
  • 2. Then check that the role is leader: membership changes must be broadcast to every node, and only the leader can do that by writing the cluster config into its own log_store and replicating it with append_entries.
  • 3. if (config_changing_) checks whether a membership change is already in flight, guaranteeing one srv change at a time.
  • 4. invite_srv_to_join_cluster then brings the new node's state up to date. An admin-side sketch of the initiating request follows after the key point.
  • 5. In the resp the leader receives back, the leader writes the new cluster_config into its own log_store and replicates the membership change to the followers via append_entries. (See the raft_server_resp_handlers source code analysis for details.)

Key point:
To ensure there is only one leader, membership changes alter one node at a time, and a new change cannot start until the previous one has completed.
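
The admin-side request that kicks this off could look like the sketch below, mirroring the validation in handle_add_srv_req: exactly one entry of value type cluster_server carrying a serialized srv_config. make_add_server_req is a hypothetical helper, and srv_config::serialize() is assumed as the inverse of the deserialize call above.

ptr<req_msg> make_add_server_req(ptr<srv_config> new_srv, int32 leader_id)
{
    // term and log positions may be zero: handle_add_srv_req only inspects
    // the entry count, the value type, and the payload
    ptr<req_msg> req(cs_new<req_msg>(0L, msg_type::add_server_request, -1, leader_id, 0L, 0L, 0L));
    req->log_entries().push_back(cs_new<log_entry>(0L, new_srv->serialize(), log_val_type::cluster_server));
    return req;
}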

7. handle_rm_srv_req Source Code Analysis

ptr<resp_msg> raft_server::handle_rm_srv_req(req_msg& req)
{
    std::vector<ptr<log_entry>>& entries(req.log_entries());
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::remove_server_response, id_, leader_));
    if (entries.size() != 1 || entries[0]->get_buf().size() != sz_int)
    {
        l_->info("bad remove server request as we are expecting one log entry with value type of int");
        return resp;
    }

    if (role_ != srv_role::leader)
    {
        l_->info("this is not a leader, cannot handle RemoveServerRequest");
        return resp;
    }

    if (config_changing_)
    {
        // the previous config has not committed yet
        l_->info("previous config has not committed yet");
        return resp;
    }

    int32 srv_id = entries[0]->get_buf().get_int();
    if (srv_id == id_)
    {
        l_->info("cannot request to remove leader");
        return resp;
    }

    ptr<peer> p;
    {
        read_lock(peers_lock_);
        peer_itor pit = peers_.find(srv_id);
        if (pit == peers_.end())
        {
            l_->info(sstrfmt("server %d does not exist").fmt(srv_id));
            return resp;
        }

        p = pit->second;
    }

    ptr<req_msg> leave_req(cs_new<req_msg>(
        state_->get_term(),
        msg_type::leave_cluster_request,
        id_,
        srv_id,
        0,
        log_store_->next_slot() - 1,
        quick_commit_idx_));
    p->send_req(leave_req, ex_resp_handler_);
    resp->accept(log_store_->next_slot());
    return resp;
}

This mirrors handle_add_srv_req: validate that the request carries exactly one int-typed entry, require leadership, reject concurrent config changes, then send a leave_cluster_request to the target peer. The details are not repeated here.

8. handle_log_sync_req Source Code Analysis

ptr<resp_msg> raft_server::handle_log_sync_req(req_msg& req)
{
    std::vector<ptr<log_entry>>& entries = req.log_entries();
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::sync_log_response, id_, req.get_src()));
    if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::log_pack)
    {
        l_->info("receive an invalid LogSyncRequest as the log entry value doesn't meet the requirements");
        return resp;
    }

    if (!catching_up_)
    {
        l_->info("This server is ready for cluster, ignore the request");
        return resp;
    }

    log_store_->apply_pack(req.get_last_log_idx() + 1, entries[0]->get_buf());
    commit(log_store_->next_slot() - 1);
    resp->accept(log_store_->next_slot());
    return resp;
}
  • 1. log_sync here refers specifically to synchronizing a newly added srv's log_store; ordinary follower-leader synchronization goes through append_entries.
  • 2. First validate that the request carries exactly one log_pack entry, then check whether this srv is still catching up: if catching_up_ is false, the server is already up to date and the request is ignored; otherwise the pack is applied to the new srv's log_store. A sketch of the leader-side counterpart follows.
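
The leader side presumably looks something like the sketch below; log_store_->pack() is assumed to be the inverse of the apply_pack call above, and the batch size is an illustrative placeholder. Note how last_log_idx is set to start_idx - 1, which is why the receiver applies the pack at req.get_last_log_idx() + 1.

void raft_server::sync_log_to_new_srv(ulong start_idx)
{
    int32 cnt = 100; // batch size placeholder; real code would derive this from settings
    ptr<buffer> pack = log_store_->pack(start_idx, cnt);
    ptr<req_msg> req(cs_new<req_msg>(
        state_->get_term(),
        msg_type::sync_log_request,
        id_,
        srv_to_join_->get_id(),
        0L,
        start_idx - 1,
        quick_commit_idx_));
    req->log_entries().push_back(cs_new<log_entry>(state_->get_term(), pack, log_val_type::log_pack));
    srv_to_join_->send_req(req, ex_resp_handler_);
}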

9. handle_install_snapshot_req Source Code Analysis

ptr<resp_msg> raft_server::handle_install_snapshot_req(req_msg& req)
{
    if (req.get_term() == state_->get_term() && !catching_up_)
    {
        if (role_ == srv_role::candidate)
        {
            become_follower();
        }
        else if (role_ == srv_role::leader)
        {
            l_->err(lstrfmt("Receive InstallSnapshotRequest from another leader(%d) with same term, there must be a "
                            "bug, server exits")
                        .fmt(req.get_src()));
            ctx_->state_mgr_->system_exit(-1);
            ::exit(-1);
            return ptr<resp_msg>();
        }
        else
        {
            restart_election_timer();
        }
    }

    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::install_snapshot_response, id_, req.get_src()));
    if (!catching_up_ && req.get_term() < state_->get_term())
    {
        l_->info("received an install snapshot request which has lower term than this server, decline the request");
        return resp;
    }

    std::vector<ptr<log_entry>>& entries(req.log_entries());
    if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::snp_sync_req)
    {
        l_->warn("Receive an invalid InstallSnapshotRequest due to bad log entries or bad log entry value");
        return resp;
    }

    ptr<snapshot_sync_req> sync_req(snapshot_sync_req::deserialize(entries[0]->get_buf()));
    if (sync_req->get_snapshot().get_last_log_idx() <= sm_commit_index_)
    {
        l_->warn(sstrfmt("received a snapshot (%llu) that is older than current log store")
                     .fmt(sync_req->get_snapshot().get_last_log_idx()));
        return resp;
    }

    if (handle_snapshot_sync_req(*sync_req))
    {
        resp->accept(sync_req->get_offset() + sync_req->get_data().size());
    }

    return resp;
}
  • 1. Check whether the req's term matches this node's term, and branch by role if so (as noted earlier, install_snapshot requests trigger update_term in process_req, so the terms normally match):
    (1) candidate: a leader has emerged, so fall back to follower.
    (2) leader: two leaders means a bug.
    (3) follower: a message has arrived, so re-arm the election timer.
  • 2. if (!catching_up_ && req.get_term() < state_->get_term()) validates the req. The term comparison alone is not enough because this node might be a newly added srv whose term has not been updated yet, hence the extra !catching_up_ condition.
  • 3. The snapshot request must carry exactly one entry, hence if (entries.size() != 1 || entries[0]->get_val_type() != log_val_type::snp_sync_req).
  • 4. Compare the snapshot's idx with the node's commit idx: if the snapshot's is smaller, the snapshot is stale and rejected; otherwise it is handed to handle_snapshot_sync_req (sketched below) to be applied to the state machine.
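
handle_snapshot_sync_req itself is not shown in this post; below is a simplified sketch of what it plausibly does per chunk. save_snapshot_data, apply_snapshot, is_done, and compact are assumptions based on common snapshot APIs, not quotes from the source.

bool raft_server::handle_snapshot_sync_req(snapshot_sync_req& req)
{
    // persist this chunk at its offset within the snapshot being received
    state_machine_->save_snapshot_data(req.get_snapshot(), req.get_offset(), req.get_data());
    if (req.is_done())
    {
        // last chunk: install the snapshot, then drop the covered log prefix
        // and fast-forward the commit indexes to the snapshot point
        if (!state_machine_->apply_snapshot(req.get_snapshot()))
        {
            return false;
        }

        log_store_->compact(req.get_snapshot().get_last_log_idx());
        sm_commit_index_ = req.get_snapshot().get_last_log_idx();
        quick_commit_idx_ = req.get_snapshot().get_last_log_idx();
    }

    return true;
}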

10. handle_prevote_req Source Code Analysis

ptr<resp_msg> raft_server::handle_prevote_req(req_msg& req)
{
    ptr<resp_msg> resp(cs_new<resp_msg>(state_->get_term(), msg_type::prevote_response, id_, req.get_src()));
    bool log_okay = req.get_last_log_term() > log_store_->last_entry()->get_term() ||
                    (req.get_last_log_term() == log_store_->last_entry()->get_term() &&
                     log_store_->next_slot() - 1 <= req.get_last_log_idx());
    bool grant = req.get_term() >= state_->get_term() && log_okay;
    if (ctx_->params_->defensive_prevote_)
    {
        // In defensive mode, server will deny the prevote when it's operating well.
        grant = grant && prevote_state_;
    }

    if (grant)
    {
        resp->accept(log_store_->next_slot());
    }

    return resp;
}

This code closely mirrors handle_vote_req, with two differences (compared side by side below): voted_for is not checked, so a node may grant multiple prevotes, and the term condition is relaxed to >=. That is safe because a prevote changes no persistent state; it only gauges whether a real election could succeed.
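
Side by side, the two grant conditions quoted above:

// vote:    req.get_term() == state_->get_term() && log_okay
//              && (state_->get_voted_for() == req.get_src() || state_->get_voted_for() == -1)
//
// prevote: req.get_term() >= state_->get_term() && log_okay
//              (plus "&& prevote_state_" in defensive mode)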

11. Summary

  • 1. Terms are updated with an event-driven model, which is fast and efficient.
  • 2. When synchronizing follower and leader data, matching entries are skipped rather than rewritten, which speeds things up.
  • 3. A follower may vote for only one candidate per term; otherwise two candidates could both reach a majority, yielding multiple leaders.
  • 4. raft uses a strong-leader model: client requests are handled only by the leader, which appends them to its own log and then replicates them to the followers.
  • 5. To ensure a single leader, membership changes alter one node at a time, and a new change cannot start until the previous one completes.
