This article was first published on 泊浮目's Jianshu page: https://www.jianshu.com/u/204...

Version	Date	Notes
1.0	2020.6.14	Initial publication

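1. Preface

For a distributed cluster, the simplest way to keep data writes consistent is to rely on a single node that schedules and manages the others. In distributed systems this node is usually called the Leader.

Why is this the simplest way? Imagine writing data to the Leader: after the Leader writes its own copy, it may replicate the data to Followers, and the number of copies and where they live are all decided by that one Leader. With multiple Leaders doing the scheduling, you would also have to deal with problems such as data partitioning and request load balancing.

Today, let's walk through ZK's election process together.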

2. Election algorithm analysis

2.1 ZAB overview

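ZAB is a typical majority-based algorithm, and as the name suggests (Zookeeper Atomic Broadcast) it was built for ZK. Leader election mainly looks at two attributes, the node ID (myid) and the data ID (zxid). A larger zxid means newer data and takes priority; when zxids are equal, the node with the larger myid is preferred as leader.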

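2.2 When an election is triggered

Elections are usually triggered in two scenarios. Either way, at least two ZK machines are required.

2.2.1 Triggered at startup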

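As we know, every zk server must be configured with a distinct myid, and on a brand-new cluster every zxid starts at 0. The comparison therefore falls through to myid: among the servers taking part in the vote when a quorum forms, the one with the largest myid is chosen as leader.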

2.2.2 Triggered by losing the Leader

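Every time a zk node processes a transaction it updates its zxid, so the newer the data, the larger the zxid. In this kind of election, the node with the largest zxid is chosen as leader.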

2.3 ZK election process in detail (with source code analysis)

The core methods are org.apache.zookeeper.server.quorum.QuorumPeer.startLeaderElection and org.apache.zookeeper.server.quorum.QuorumPeer.run, and our source code analysis is built around them.

2.3.1 Startup

We start from QuorumPeerMain, because it is the startup entry point:

/**
 *
 * <h2>Configuration file</h2>
 *
 * When the main() method of this class is used to start the program, the first
 * argument is used as a path to the config file, which will be used to obtain
 * configuration information. This file is a Properties file, so keys and
 * values are separated by equals (=) and the key/value pairs are separated
 * by new lines. The following is a general summary of keys used in the
 * configuration file. For full details on this see the documentation in
 * docs/index.html
 * <ol>
 * <li>dataDir - The directory where the ZooKeeper data is stored.</li>
 * <li>dataLogDir - The directory where the ZooKeeper transaction log is stored.</li>
 * <li>clientPort - The port used to communicate with clients.</li>
 * <li>tickTime - The duration of a tick in milliseconds. This is the basic
 * unit of time in ZooKeeper.</li>
 * <li>initLimit - The maximum number of ticks that a follower will wait to
 * initially synchronize with a leader.</li>
 * <li>syncLimit - The maximum number of ticks that a follower will wait for a
 * message (including heartbeats) from the leader.</li>
 * <li>server.<i>id</i> - This is the host:port[:port] that the server with the
 * given id will use for the quorum protocol.</li>
 * </ol>
 * In addition to the config file. There is a file in the data directory called
 * "myid" that contains the server id as an ASCII decimal value.
 *
 */
@InterfaceAudience.Public
public class QuorumPeerMain {
    private static final Logger LOG = LoggerFactory.getLogger(QuorumPeerMain.class);

    private static final String USAGE = "Usage: QuorumPeerMain configfile";

    protected QuorumPeer quorumPeer;

    /**
     * To start the replicated server specify the configuration file name on
     * the command line.
     * @param args path to the configfile
     */
    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            main.initializeAndRun(args);
        } catch (IllegalArgumentException e) {
            LOG.error("Invalid arguments, exiting abnormally", e);
            LOG.info(USAGE);
            System.err.println(USAGE);
            System.exit(2);
        } catch (ConfigException e) {
            LOG.error("Invalid config, exiting abnormally", e);
            System.err.println("Invalid config, exiting abnormally");
            System.exit(2);
        } catch (DatadirException e) {
            LOG.error("Unable to access datadir, exiting abnormally", e);
            System.err.println("Unable to access datadir, exiting abnormally");
            System.exit(3);
        } catch (AdminServerException e) {
            LOG.error("Unable to start AdminServer, exiting abnormally", e);
            System.err.println("Unable to start AdminServer, exiting abnormally");
            System.exit(4);
        } catch (Exception e) {
            LOG.error("Unexpected exception, exiting abnormally", e);
            System.exit(1);
        }
        LOG.info("Exiting normally");
        System.exit(0);
    }

    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            config.parse(args[0]);
        }

        // Start and schedule the the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
                .getDataDir(), config.getDataLogDir(), config
                .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();

        if (args.length == 1 && config.isDistributed()) {
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }

    public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
      try {
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }

      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = null;
          ServerCnxnFactory secureCnxnFactory = null;

          if (config.getClientPortAddress() != null) {
              cnxnFactory = ServerCnxnFactory.createFactory();
              cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns(),
                      false);
          }

          if (config.getSecureClientPortAddress() != null) {
              secureCnxnFactory = ServerCnxnFactory.createFactory();
              secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                      config.getMaxClientCnxns(),
                      true);
          }

          quorumPeer = getQuorumPeer();
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setConfigFileName(config.getConfigFilename());
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
          if (config.getLastSeenQuorumVerifier()!=null) {
              quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
          }
          quorumPeer.initConfigInZKDatabase();
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
          quorumPeer.setSslQuorum(config.isSslQuorum());
          quorumPeer.setUsePortUnification(config.shouldUsePortUnification());
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          if (config.sslQuorumReloadCertFiles) {
              quorumPeer.getX509Util().enableCertFileReloading();
          }

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();
          
          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

    // @VisibleForTesting
    protected QuorumPeer getQuorumPeer() throws SaslException {
        return new QuorumPeer();
    }
}

We go from QuorumPeerMain.main() -> main.initializeAndRun(args) -> runFromConfig -> quorumPeer.start(), then continue into QuorumPeer.java (this class manages the election-related logic):

    @Override
    public synchronized void start() {
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
         }
        loadDataBase();
        startServerCnxnFactory();
        try {
            adminServer.start();
        } catch (AdminServerException e) {
            LOG.warn("Problem starting AdminServer", e);
            System.out.println(e);
        }
        startLeaderElection();
        super.start();
    }

Now we arrive at the core code, startLeaderElection:

    synchronized public void startLeaderElection() {
       try {
           if (getPeerState() == ServerState.LOOKING) {
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
           RuntimeException re = new RuntimeException(e.getMessage());
           re.setStackTrace(e.getStackTrace());
           throw re;
       }

       // if (!getView().containsKey(myid)) {
      //      throw new RuntimeException("My id " + myid + " not in the peer list");
        //}
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(getQuorumAddress().getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        this.electionAlg = createElectionAlgorithm(electionType);
    }

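The logic is very simple. If the server is in the LOOKING state (the default right after startup), it creates its initial election vote and determines the election algorithm (since 3.4.0 the other implementations are deprecated, so FastLeaderElection is the only one in practical use); the vote is then sent out. The sending code is long, so it is not pasted here; interested readers can read FastLeaderElection.Messenger.WorkerReceiver.run and its counterpart WorkerSender.run. In essence these are threads that take votes off a queue and pass them on: WorkerSender drains the send queue and sends the votes out, while WorkerReceiver handles the incoming ones.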
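
The "thread that drains a queue of votes" idea can be sketched as follows. This is only an illustration under stated assumptions: VoteSenderSketch, PendingVote, VoteSender, and sendAll are hypothetical names rather than ZooKeeper classes, and the network send is replaced by appending to a list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not ZooKeeper code: a worker thread that drains a vote queue.
final class VoteSenderSketch {

    // A vote waiting to go out: the proposed leader and that leader's last zxid.
    static final class PendingVote {
        final long proposedLeader;
        final long zxid;

        PendingVote(long proposedLeader, long zxid) {
            this.proposedLeader = proposedLeader;
            this.zxid = zxid;
        }
    }

    static final class VoteSender implements Runnable {
        private final BlockingQueue<PendingVote> sendQueue;
        private final List<PendingVote> sent; // stands in for the network
        private volatile boolean stop = false;

        VoteSender(BlockingQueue<PendingVote> sendQueue, List<PendingVote> sent) {
            this.sendQueue = sendQueue;
            this.sent = sent;
        }

        void halt() {
            stop = true;
        }

        @Override
        public void run() {
            // Keep going until asked to stop AND the queue has been drained.
            while (!stop || !sendQueue.isEmpty()) {
                try {
                    PendingVote vote = sendQueue.poll(50, TimeUnit.MILLISECONDS);
                    if (vote != null) {
                        sent.add(vote); // a real sender would write to a socket here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    // Queues the votes, runs the worker until the queue is empty, returns what was "sent".
    static List<PendingVote> sendAll(List<PendingVote> votes) {
        BlockingQueue<PendingVote> queue = new LinkedBlockingQueue<>(votes);
        List<PendingVote> sent = new CopyOnWriteArrayList<>();
        VoteSender sender = new VoteSender(queue, sent);
        Thread worker = new Thread(sender, "vote-sender");
        worker.start();
        sender.halt(); // the loop still drains whatever is queued before exiting
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sent;
    }

    public static void main(String[] args) {
        List<PendingVote> sent = sendAll(
                Arrays.asList(new PendingVote(3, 0x10), new PendingVote(2, 0x0f)));
        System.out.println("votes sent: " + sent.size());
    }
}
```

Decoupling vote production from sending through a blocking queue lets the election logic keep running while slow network writes happen on a separate thread.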

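A quick primer on the server states:

LOOKING: searching for a Leader. A server in this state believes the cluster currently has no Leader.
FOLLOWING: follower state; the server's role is Follower.
LEADING: leader state; the server's role is Leader.
OBSERVING: observer state; the server is an Observer.

Next, the relevant core code in QuorumPeer: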

    @Override
    public void run() {
        updateThreadName();

        LOG.debug("Starting quorum peer");
        try {
            jmxQuorumBean = new QuorumBean(this);
            MBeanRegistry.getInstance().register(jmxQuorumBean, null);
            for(QuorumServer s: getView().values()){
                ZKMBeanInfo p;
                if (getId() == s.id) {
                    p = jmxLocalPeerBean = new LocalPeerBean(this);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                        jmxLocalPeerBean = null;
                    }
                } else {
                    RemotePeerBean rBean = new RemotePeerBean(this, s);
                    try {
                        MBeanRegistry.getInstance().register(rBean, jmxQuorumBean);
                        jmxRemotePeerBean.put(s.id, rBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            jmxQuorumBean = null;
        }

        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");

                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");

                        // Create read-only server but don't start it immediately
                        final ReadOnlyZooKeeperServer roZk =
                            new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
    
                        // Instead of starting roZk immediately, wait some grace
                        // period before we decide we're partitioned.
                        //
                        // Thread is used here because otherwise it would require
                        // changes in each of election strategy classes which is
                        // unnecessary code coupling.
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                           reconfigFlagClear();
                            if (shuttingDownLE) {
                               shuttingDownLE = false;
                               startLeaderElection();
                               }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }                        
                    }
                    break;

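Only the LOOKING logic is excerpted here. The if branch in the first half mainly deals with the read-only server, which is used to handle read-only clients. The else branch is the common case, yet both branches share this code block: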

             reconfigFlagClear();
                            if (shuttingDownLE) {
                               shuttingDownLE = false;
                               startLeaderElection();
                               }
                            setCurrentVote(makeLEStrategy().lookForLeader());

In fact, the two branches differ very little. Next comes lookForLeader; to save space, only the LOOKING-related code is excerpted:

    /**
     * Starts a new round of leader election. Whenever our QuorumPeer
     * changes its state to LOOKING, this method is invoked, and it
     * sends notifications to all other peers.
     */
    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

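The comment is clear: this method starts a new round of election. Whenever our server's state changes to LOOKING, this method is invoked and notifies the other servers in the cluster that take part in the election. In this logic, recvqueue holds the election notifications, and one is taken out per iteration. Two branches follow:

The notification is null: find a way to notify the other servers.
The notification is a valid vote (that is, everyone is in the same election round): the votes are then compared against each other.

Let's look at the totalOrderPredicate method: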

    /**
     * Check if a pair (server id, zxid) succeeds our
     * current vote.
     *
     * @param id    Server identifier
     * @param zxid  Last zxid observed by the issuer of this vote
     */
    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }

To sort out the logic:

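If the external vote's epoch is greater than the internal vote's epoch, the vote must be changed.
If the epochs are equal and the external vote's ZXID is greater than the internal one's, the vote must be changed.
If the epochs and ZXIDs are both equal and the external vote's SID is greater than the internal one's, the vote must be changed.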
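
The three rules amount to a lexicographic comparison on (epoch, zxid, server id). Below is a minimal standalone sketch of that ordering; VoteOrderSketch, Vote, and supersedes are hypothetical names, and the quorum-weight check at the top of the real totalOrderPredicate is deliberately left out.

```java
// Hypothetical sketch of the three-level vote comparison; not the ZooKeeper class itself.
final class VoteOrderSketch {

    static final class Vote {
        final long epoch;
        final long zxid;
        final long serverId;

        Vote(long epoch, long zxid, long serverId) {
            this.epoch = epoch;
            this.zxid = zxid;
            this.serverId = serverId;
        }
    }

    // Returns true when the incoming vote should replace the current one.
    static boolean supersedes(Vote incoming, Vote current) {
        if (incoming.epoch != current.epoch) {
            return incoming.epoch > current.epoch;   // rule 1: higher epoch wins
        }
        if (incoming.zxid != current.zxid) {
            return incoming.zxid > current.zxid;     // rule 2: same epoch, higher zxid wins
        }
        return incoming.serverId > current.serverId; // rule 3: tie broken by server id
    }

    public static void main(String[] args) {
        Vote mine = new Vote(1, 0x5, 3);
        System.out.println(supersedes(new Vote(2, 0x0, 1), mine)); // newer epoch
        System.out.println(supersedes(new Vote(1, 0x5, 1), mine)); // same data, smaller id
    }
}
```

Because the comparison is a strict total order over distinct votes, every server that sees the same set of votes ends up preferring the same candidate.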

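When this check passes, the external vote is confirmed to be better than the internal one, meaning it is more suitable to become Leader. The internal vote is then overwritten with the external vote's information and sent out: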

                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();

Next, the server checks whether more than half of the servers in the cluster agree with this vote.

    /**
     * Termination predicate. Given a set of votes, determines if have
     * sufficient to declare the end of the election round.
     * 
     * @param votes
     *            Set of votes
     * @param vote
     *            Identifier of the vote received last
     */
    protected boolean termPredicate(Map<Long, Vote> votes, Vote vote) {
        SyncedLearnerTracker voteSet = new SyncedLearnerTracker();
        voteSet.addQuorumVerifier(self.getQuorumVerifier());
        if (self.getLastSeenQuorumVerifier() != null
                && self.getLastSeenQuorumVerifier().getVersion() > self
                        .getQuorumVerifier().getVersion()) {
            voteSet.addQuorumVerifier(self.getLastSeenQuorumVerifier());
        }

        /*
         * First make the views consistent. Sometimes peers will have different
         * zxids for a server depending on timing.
         */
        for (Map.Entry<Long, Vote> entry : votes.entrySet()) {
            if (vote.equals(entry.getValue())) {
                voteSet.addAck(entry.getKey());
            }
        }

        return voteSet.hasAllQuorums(); // whether a quorum (more than half) agrees
    }

If not, it keeps collecting votes.

After that, the server state is updated.
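
The "more than half" test can be sketched in isolation as below. This assumes a plain majority quorum; MajorityQuorumSketch and hasMajority are hypothetical names, and weighted or hierarchical quorum verifiers would need a different check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a simple-majority check over collected votes.
final class MajorityQuorumSketch {

    // votes maps each voter's server id to the leader id it currently votes for.
    static boolean hasMajority(Map<Long, Long> votes, long candidate, int ensembleSize) {
        int acks = 0;
        for (long votedFor : votes.values()) {
            if (votedFor == candidate) {
                acks++;
            }
        }
        // Strictly more than half of the whole ensemble, not of the votes received.
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L);
        votes.put(2L, 3L);
        System.out.println(hasMajority(votes, 3L, 3)); // 2 of 3 servers agree
    }
}
```

Counting against the full ensemble size rather than the number of votes received is what prevents two disjoint groups from each electing their own leader.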

                         /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, logicalclock.get(), 
                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }

2.3.2 Leader loss

We mentioned QuorumPeer.java above. It contains a main loop in which each role does its own work until it exits. Here we take the Follower as the example:

                case FOLLOWING:
                    try {
                       LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                       LOG.warn("Unexpected exception",e);
                    } finally {
                       follower.shutdown();
                       setFollower(null);
                       updateServerState();
                    }
                    break;

Jump to follower.followLeader():

    /**
     * the main method called by the follower to follow the leader
     *
     * @throws InterruptedException
     */
    void followLeader() throws InterruptedException {
        self.end_fle = Time.currentElapsedTime();
        long electionTimeTaken = self.end_fle - self.start_fle;
        self.setElectionTimeTaken(electionTimeTaken);
        LOG.info("FOLLOWING - LEADER ELECTION TOOK - {} {}", electionTimeTaken,
                QuorumPeer.FLE_TIME_UNIT);
        self.start_fle = 0;
        self.end_fle = 0;
        fzk.registerJMX(new FollowerBean(this, zk), self.jmxLocalPeerBean);
        try {
            QuorumServer leaderServer = findLeader();            
            try {
                connectToLeader(leaderServer.addr, leaderServer.hostname);
                long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
                if (self.isReconfigStateChange())
                   throw new Exception("learned about role change");
                //check to see if the leader zxid is lower than ours
                //this should never happen but is just a safety check
                long newEpoch = ZxidUtils.getEpochFromZxid(newEpochZxid);
                if (newEpoch < self.getAcceptedEpoch()) {
                    LOG.error("Proposed leader epoch " + ZxidUtils.zxidToString(newEpochZxid)
                            + " is less than our accepted epoch " + ZxidUtils.zxidToString(self.getAcceptedEpoch()));
                    throw new IOException("Error: Epoch of leader is lower");
                }
                syncWithLeader(newEpochZxid);                
                QuorumPacket qp = new QuorumPacket();
                while (this.isRunning()) {
                    readPacket(qp);
                    processPacket(qp);
                }
            } catch (Exception e) {
                LOG.warn("Exception when following the leader", e);
                try {
                    sock.close();
                } catch (IOException e1) {
                    e1.printStackTrace();
                }
    
                // clear pending revalidations
                pendingRevalidations.clear();
            }
        } finally {
            zk.unregisterJMX((Learner)this);
        }
    }

Jump to the core method processPacket:

   /**
     * Examine the packet received in qp and dispatch based on its contents.
     * @param qp
     * @throws IOException
     */
    protected void processPacket(QuorumPacket qp) throws Exception{
        switch (qp.getType()) {
        case Leader.PING:            
            ping(qp);            
            break;
        case Leader.PROPOSAL:           
            TxnHeader hdr = new TxnHeader();
            Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr);
            if (hdr.getZxid() != lastQueued + 1) {
                LOG.warn("Got zxid 0x"
                        + Long.toHexString(hdr.getZxid())
                        + " expected 0x"
                        + Long.toHexString(lastQueued + 1));
            }
            lastQueued = hdr.getZxid();
            
            if (hdr.getType() == OpCode.reconfig){
               SetDataTxn setDataTxn = (SetDataTxn) txn;       
               QuorumVerifier qv = self.configFromString(new String(setDataTxn.getData()));
               self.setLastSeenQuorumVerifier(qv, true);                               
            }
            
            fzk.logRequest(hdr, txn);
            break;
        case Leader.COMMIT:
            fzk.commit(qp.getZxid());
            break;
            
        case Leader.COMMITANDACTIVATE:
           // get the new configuration from the request
           Request request = fzk.pendingTxns.element();
           SetDataTxn setDataTxn = (SetDataTxn) request.getTxn();                                                                                                      
           QuorumVerifier qv = self.configFromString(new String(setDataTxn.getData()));                                
 
           // get new designated leader from (current) leader's message
           ByteBuffer buffer = ByteBuffer.wrap(qp.getData());    
           long suggestedLeaderId = buffer.getLong();
            boolean majorChange = 
                   self.processReconfig(qv, suggestedLeaderId, qp.getZxid(), true);
           // commit (writes the new config to ZK tree (/zookeeper/config)                     
           fzk.commit(qp.getZxid());
            if (majorChange) {
               throw new Exception("changes proposed in reconfig");
           }
           break;
        case Leader.UPTODATE:
            LOG.error("Received an UPTODATE message after Follower started");
            break;
        case Leader.REVALIDATE:
            revalidate(qp);
            break;
        case Leader.SYNC:
            fzk.sync();
            break;
        default:
            LOG.warn("Unknown packet type: {}", LearnerHandler.packetToString(qp));
            break;
        }
    }

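In case COMMITANDACTIVATE we can see that when the follower receives a message indicating a leader-related change, it throws an exception. After that it switches itself to the LOOKING state and starts an election.

So how does a node decide that the leader is unavailable? The answer is heartbeats. If no heartbeat from the leader arrives within a certain period of time, the leader is considered unavailable.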

See case Leader.PING in LearnerHandler.run (the leader-side handler for each learner's packets):

                case Leader.PING:
                    // Process the touches
                    ByteArrayInputStream bis = new ByteArrayInputStream(qp
                            .getData());
                    DataInputStream dis = new DataInputStream(bis);
                    while (dis.available() > 0) {
                        long sess = dis.readLong();
                        int to = dis.readInt();
                        leader.zk.touch(sess, to);
                    }
                    break;
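
The heartbeat-timeout idea described above can be sketched as a small monitor. This is a simplified illustration, not how ZooKeeper implements it: HeartbeatMonitorSketch and its methods are hypothetical names, and the timeout of tickTime * syncLimit is an assumption taken from the syncLimit description in the configuration javadoc quoted earlier. The clock is passed in explicitly so that the behavior is deterministic.

```java
// Hypothetical sketch: suspect the leader when no heartbeat arrives within the timeout.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatMonitorSketch(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    // Assumed mapping from configuration: syncLimit ticks of tickTime milliseconds each.
    static HeartbeatMonitorSketch fromTicks(int tickTimeMillis, int syncLimit, long nowMillis) {
        return new HeartbeatMonitorSketch((long) tickTimeMillis * syncLimit, nowMillis);
    }

    // Called whenever any message (including a ping) arrives from the leader.
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    // True once the silence has lasted longer than the timeout.
    boolean leaderSuspected(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = fromTicks(2000, 5, 0L);
        monitor.onHeartbeat(9_000L);
        System.out.println(monitor.leaderSuspected(15_000L)); // 6 s of silence, limit is 10 s
        System.out.println(monitor.leaderSuspected(19_001L)); // just past the limit
    }
}
```

A follower that reaches the suspected state would drop back to LOOKING and start a new election round.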

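3. Other common election algorithms

First, we need to understand that an election algorithm is in essence a consensus algorithm, and the vast majority of consensus algorithms were created to solve data consistency in distributed environments. The leader and follower roles in zk are ultimately just states: only when everyone, within zk's semantics (its context), agrees that a given node is the leader is there a valid consensus.

What are the common consensus algorithms? At present, consensus algorithms fall into three main categories: public chains, consortium chains, and private chains. Their characteristics are described below:

Private chain: the consensus algorithms of private chains are the ones used in traditional distributed systems before the blockchain concept became popular. For example, zookeeper's zab protocol is a paxos-like algorithm. Private chains generally assume there are no malicious nodes in the cluster and only consider nodes that fail because of system or network problems.
Consortium chain: the classic representative is the Fabric project under the Hyperledger organization; Fabric 0.6 uses the pbft algorithm. Besides faulty nodes, a consortium chain also has to account for malicious nodes in the cluster. Every node that joins a consortium chain must be verified and approved.
Public chain: a public chain must consider not only faulty nodes in the network but also malicious nodes, which is similar to a consortium chain. The biggest difference is that nodes in a public chain can join or leave freely, without strict verification and approval.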

Copied from https://zhuanlan.zhihu.com/p/...; author: Meitu technical team

For reasons of space, the following briefly introduces two fairly typical consensus algorithms.

3.1 Raft

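Raft is a typical majority-vote election algorithm. Its election mechanism resembles democratic voting in everyday life, and the core idea is that the minority defers to the majority. In other words, in Raft the node that wins the votes of a majority of nodes becomes the leader.

With Raft elections, cluster nodes have 3 roles: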

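Leader: the master node. There is only one Leader at any given moment, and it coordinates and manages the other nodes.
Candidate: every node can become a Candidate, and only a node in this role can be elected as the new Leader.
Follower: a follower of the Leader; it cannot initiate an election.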

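The Raft election process can be divided into the following steps:

At initialization, all nodes are in the Follower state.
When leader election starts, nodes change from Follower to Candidate (in Raft each node does so when its own election timeout expires) and send vote requests to the other nodes.
The other nodes reply whether they agree, based on the order in which the vote requests arrive. Note that in each round of election a node can cast only one vote.
If the node that sent the vote request obtains more than half of the votes, it becomes the leader node and its state changes to Leader, while the other nodes drop from Candidate back to Follower. The Leader and the Followers exchange heartbeat packets periodically to check whether the leader is still alive.
When the Leader's term ends, that is, when it discovers that other servers have started the next election round (or when the leader goes down), the Leader is demoted to Follower and a new round of leader election begins.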
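
The "one vote per round, first request wins" rule and the majority condition can be sketched as below. This is a deliberately reduced model: RaftVoterSketch, requestVote, and wonElection are hypothetical names, and the log up-to-date check that real Raft also applies before granting a vote is omitted.

```java
// Hypothetical sketch of Raft-style vote granting; the log freshness check is omitted.
final class RaftVoterSketch {
    private long currentTerm = 0;
    private Long votedFor = null; // candidate this node voted for in currentTerm, if any

    // Handles a vote request; returns true when the vote is granted.
    boolean requestVote(long term, long candidateId) {
        if (term < currentTerm) {
            return false;            // stale round: reject
        }
        if (term > currentTerm) {
            currentTerm = term;      // new round: forget the previous vote
            votedFor = null;
        }
        if (votedFor == null || votedFor == candidateId) {
            votedFor = candidateId;  // first request in this round gets the vote
            return true;
        }
        return false;                // already voted for someone else this round
    }

    // A candidate wins once it holds strictly more than half of the cluster's votes.
    static boolean wonElection(int votesGranted, int clusterSize) {
        return votesGranted > clusterSize / 2;
    }

    public static void main(String[] args) {
        RaftVoterSketch voter = new RaftVoterSketch();
        System.out.println(voter.requestVote(1, 10)); // first request in term 1
        System.out.println(voter.requestVote(1, 20)); // second candidate, same term
        System.out.println(wonElection(2, 3));
    }
}
```

Because each node grants at most one vote per term, two candidates cannot both collect a majority in the same term.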

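Compared with ZAB, this algorithm is easier to implement, but because of its heavy message traffic it is better suited to small and medium-sized scenarios than ZAB.

3.2 PoW

PoW is a mechanism in which nodes or servers compete for the right to record transactions (the bookkeeping right) with their computing power, so it is a consensus algorithm built on proof of work. In other words, the stronger a node's computing power (the faster it solves the puzzle), the more likely it is to win the bookkeeping right.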

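For example, suppose a transaction occurs and three nodes (A, B, C) all receive the bookkeeping request. Once node A has worked out the answer, it notifies B and C to verify it. The puzzle is a hash puzzle (Bitcoin, for example, uses SHA-256), so solving it is far slower than verifying a solution. After all nodes have verified it, the record is written.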

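That sounds fair. However, PoW requires the whole network to take part in the computation every time consensus is reached, which increases each node's workload. If the puzzle is too hard, computation takes a long time and consumes a lot of resources; if it is too easy, many nodes win the bookkeeping right at the same time and conflicts multiply. All of these problems lengthen the time needed to reach consensus.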
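
The gap between solving and verifying can be seen in a toy hash puzzle. The sketch below is an assumption-laden illustration, not any real chain's rules: ProofOfWorkSketch, mine, and verify are hypothetical names, the puzzle is "find a nonce so that SHA-256(data + nonce) starts with a given number of zero bytes", and the difficulty is kept tiny so that it runs quickly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical toy proof-of-work: many hashes to solve, a single hash to verify.
final class ProofOfWorkSketch {

    static byte[] sha256(String input) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is available on every Java platform
        }
    }

    // Verification costs one hash: does the digest start with zeroBytes zero bytes?
    static boolean verify(String data, long nonce, int zeroBytes) {
        byte[] hash = sha256(data + nonce);
        for (int i = 0; i < zeroBytes; i++) {
            if (hash[i] != 0) {
                return false;
            }
        }
        return true;
    }

    // Solving is brute force: try nonces in order until one verifies.
    static long mine(String data, int zeroBytes) {
        long nonce = 0;
        while (!verify(data, nonce, zeroBytes)) {
            nonce++;
        }
        return nonce;
    }

    public static void main(String[] args) {
        String record = "A pays B 1 coin";
        long nonce = mine(record, 1); // on average about 256 attempts for one zero byte
        System.out.println("nonce=" + nonce + " valid=" + verify(record, nonce, 1));
    }
}
```

Each extra zero byte multiplies the expected mining work by 256 while verification stays at one hash, which is the difficulty knob the paragraph above describes.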

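4. Summary

In this article we first covered zookeeper's leader election. The rough flow is as follows:

4.1 Election at server startup

Each Server casts a vote.
It receives the votes from every other Server.
It processes the votes (comparing ZXID and myid).
It tallies the votes until more than half of the machines hold the same vote.
It changes its server role.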
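
The steps above can be condensed into a toy simulation. It assumes that every participating server sees every other participant's vote and that all votes share the same epoch, which real elections do not guarantee; StartupElectionSketch, Server, and elect are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a startup election among the servers that are already up.
final class StartupElectionSketch {

    static final class Server {
        final long myid;
        final long zxid;

        Server(long myid, long zxid) {
            this.myid = myid;
            this.zxid = zxid;
        }
    }

    // Returns the myid of the elected leader, or throws if the participants are not a majority.
    static long elect(List<Server> participants, int ensembleSize) {
        if (participants.size() <= ensembleSize / 2) {
            throw new IllegalStateException("no quorum: election cannot finish yet");
        }
        Server best = participants.get(0);
        for (Server s : participants) {
            // Compare ZXID first, then myid as the tie-breaker.
            if (s.zxid > best.zxid || (s.zxid == best.zxid && s.myid > best.myid)) {
                best = s;
            }
        }
        return best.myid;
    }

    public static void main(String[] args) {
        // Fresh 3-node ensemble where only servers 1 and 2 have started so far.
        List<Server> up = Arrays.asList(new Server(1, 0), new Server(2, 0));
        System.out.println("leader = " + elect(up, 3));
    }
}
```

In this model a fresh three-node ensemble elects server 2 as soon as servers 1 and 2 are up, even though server 3 has a larger myid, because the election ends once a quorum agrees.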

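4.2 Election while servers are running

This is very similar to the election at startup, with just one extra state change: when the Leader goes down, the remaining Followers all change their server state to LOOKING and then enter the election process.

4.3 Consistency algorithms and consensus algorithms

We also mentioned the concepts of consistency algorithms and consensus algorithms, so what is the difference between consistency and consensus? The two are often confused in everyday use, so let's make it clear here:

Consistency: across multiple nodes in a distributed system, given a series of operations and under the guarantees of an agreed protocol, the data or state presented to the outside world is consistent.
Consensus: the process by which multiple nodes in a distributed system reach agreement with each other on some state.

That is, consistency emphasizes the result, while consensus emphasizes the process of reaching agreement; consensus algorithms are the core technique for ensuring that a system satisfies different degrees of consistency.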

ZooKeeper in Plain Terms (7): Leader Election