ZooKeeper Source Code (Startup + Election)

Posted by Justhing on 2020-11-07

Introduction

  Today ZooKeeper is most commonly deployed as a service registry for service discovery, but that is only one of its capabilities. The official Apache description reads: "The Apache ZooKeeper system for distributed coordination is a high-performance service for building distributed applications." In other words, ZooKeeper is a high-performance coordination service for building distributed applications. With its hierarchical node structure and watch mechanism it covers a large share of distributed-coordination scenarios: configuration management, naming services, distributed locks, service discovery, publish/subscribe, and so on. In ZooKeeper these scenarios are essentially all implemented as "node change + notification". Distribution is fundamentally about communication, and communication is what coordination is built on.
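
  To make the "change + notification" idea concrete, here is a minimal client-side sketch (not from the ZooKeeper source; the connect string, session timeout and path are made-up values) that reads a configuration node and leaves a watch on it:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatchSketch {
    public static void main(String[] args) throws Exception {
        // made-up connect string and session timeout, for illustration only
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.NodeDataChanged) {
                    System.out.println("config changed: " + event.getPath());
                    // a real client would re-read the data and re-register the watch here
                }
            }
        };
        // read the node once and leave a one-shot watch behind
        byte[] data = zk.getData("/app/config", watcher, null);
        System.out.println("current config: " + new String(data));
    }
}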

  ZooKeeper is written in Java (a C client API also exists). Conceptually it can be viewed as an implementation in the spirit of the Paxos algorithm, with roles such as Leader, Follower and Proposal and concepts like elections, but it differs from Paxos in several ways (it uses the ZAB protocol). Paxos itself is, in my view, a theoretical framework of solutions and fairly complex to grasp. This article therefore does not dig into the algorithm; it only walks through part of the server-side source, covering startup and leader election, and does not cover the client.

Getting the Source Code

  The ZooKeeper source can be cloned from GitHub (https://github.com/apache/zookeeper);

  it can also be downloaded from the Apache ZooKeeper release page: https://zookeeper.apache.org/releases.html.

  Before 3.5.5 ZooKeeper was built with Ant; starting with 3.5.5 it is built with Maven.


       This article analyzes version 3.5.4.

Project Structure

  Directory structure:


    The src directory contains both the C and Java source code; the C API is ignored here. conf holds the configuration files used to start ZooKeeper, and bin contains the startup scripts (server/client).

  org.apache.jute is ZooKeeper's serialization and wire-protocol component. Communication runs over TCP; jute provides the Record interface for serialization/deserialization together with the OutputArchive/InputArchive interfaces.
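
  As a rough sketch of how those interfaces fit together (this is not code from the walkthrough; it assumes the jute-generated Stat record and the binary archive implementations), a record round-trip looks roughly like this:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.jute.BinaryInputArchive;
import org.apache.jute.BinaryOutputArchive;
import org.apache.zookeeper.data.Stat;

public class JuteSketch {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        BinaryOutputArchive oa = BinaryOutputArchive.getArchive(bos);

        Stat stat = new Stat();            // any jute-generated Record works the same way
        stat.serialize(oa, "stat");        // write the record under a tag

        BinaryInputArchive ia = BinaryInputArchive.getArchive(
                new ByteArrayInputStream(bos.toByteArray()));
        Stat copy = new Stat();
        copy.deserialize(ia, "stat");      // read it back
        System.out.println("round-tripped czxid = " + copy.getCzxid());
    }
}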

  The org.apache.zookeeper package holds ZooKeeper's core code, i.e. the core business logic.

Startup Flow

  When running ZooKeeper we start the server with "./zkServer.sh start". Reading zkServer.sh shows that the program entry point is "org.apache.zookeeper.server.quorum.QuorumPeerMain", and that the script also wires up the logging configuration.

ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
#.......
nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
    "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
    if [ $? -eq 0 ]
#.......

  The ZooKeeper startup flow is as follows:


  QuorumPeerMain.main() takes at least one argument; normally there is exactly one, the path to zoo.cfg. The main method contains little business logic: it instantiates a QuorumPeerMain object and then calls main.initializeAndRun(args) to initialize and run the server.

    public static void main(String[] args) {
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            main.initializeAndRun(args);
        } catch (IllegalArgumentException e) {
            LOG.error("Invalid arguments, exiting abnormally", e);
            LOG.info(USAGE);
            System.err.println(USAGE);
            System.exit(2);
        } catch (ConfigException e) {
            LOG.error("Invalid config, exiting abnormally", e);
            System.err.println("Invalid config, exiting abnormally");
            System.exit(2);
        } catch (DatadirException e) {
            LOG.error("Unable to access datadir, exiting abnormally", e);
            System.err.println("Unable to access datadir, exiting abnormally");
            System.exit(3);
        } catch (AdminServerException e) {
            LOG.error("Unable to start AdminServer, exiting abnormally", e);
            System.err.println("Unable to start AdminServer, exiting abnormally");
            System.exit(4);
        } catch (Exception e) {
            LOG.error("Unexpected exception, exiting abnormally", e);
            System.exit(1);
        }
        LOG.info("Exiting normally");
        System.exit(0);
    }

  initializeAndRun instantiates a QuorumPeerConfig object and parses the zoo.cfg properties (parseProperties() under the hood); QuorumPeerConfig holds the configuration for the entire application. It then starts a DatadirCleanupManager, which schedules a timer that periodically purges old snapshots and transaction logs under the data directories.

  Finally the program starts. ZooKeeper can run in standalone or cluster mode, so there are two startup paths: with standaloneEnabled=true and no quorum of servers configured it runs in standalone mode; when server.0, server.1, ... cluster nodes are configured it runs in cluster mode.

    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException
    {
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            config.parse(args[0]);
        }

        // Start and schedule the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
                .getDataDir(), config.getDataLogDir(), config
                .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();

        // isDistributed() is true when multiple nodes are configured:
        // return quorumVerifier != null && (!standaloneEnabled || quorumVerifier.getVotingMembers().size() > 1);
        if (args.length == 1 && config.isDistributed()) {
            // cluster mode
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                    + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            // standalone mode
            ZooKeeperServerMain.main(args);
        }
    }
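
  For reference, a zoo.cfg along the following lines is what drives the branch above (host names and paths are made up; the autopurge.* keys feed the DatadirCleanupManager mentioned earlier):

# minimal standalone setup: no server.N entries, so isDistributed() is false
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=1

# adding server.N entries (more than one voting member) switches to cluster mode
initLimit=10
syncLimit=5
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888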

 Standalone Mode Startup

  When standaloneEnabled=true, or when no cluster nodes (server.*) are configured, ZooKeeper starts in standalone mode. The entry point is ZooKeeperServerMain, which holds ServerCnxnFactory, ContainerManager and AdminServer objects;

public class ZooKeeperServerMain {
    /*.............*/
    // ZooKeeper server supports two kinds of connection: unencrypted and encrypted.
    private ServerCnxnFactory cnxnFactory;
    private ServerCnxnFactory secureCnxnFactory;
    private ContainerManager containerManager;

    private AdminServer adminServer;
    /*.............*/
}

  ServerCnxnFactory is a core ZooKeeper component: it implements the network IO and manages client connections. Two implementations ship with ZooKeeper, one based on JDK NIO and one based on Netty.


   The ContainerManager class maintains znode (container node) housekeeping against the ZKDatabase;

   AdminServer is an embedded Jetty server, listening on port 8080 by default, that exposes ZooKeeper runtime information over HTTP. It was introduced in 3.5.

   ZooKeeperServerMain.main() mirrors QuorumPeerMain: it instantiates itself, initializes, loads the configuration and then starts the server.

  Loading the configuration:

    // parse the standalone configuration and start in standalone mode
    protected void initializeAndRun(String[] args)
        throws ConfigException, IOException, AdminServerException
    {
        try {

            // register JMX
            // JMX (Java Management Extensions) is a mechanism for managing and monitoring
            // running Java programs: threads, memory, log levels, restarts, system environment, etc.
            ManagedUtil.registerLog4jMBeans();
        } catch (JMException e) {
            LOG.warn("Unable to register log4j JMX control", e);
        }

        // create the server configuration object
        ServerConfig config = new ServerConfig();

        // if there is exactly one argument, treat it as the config file path
        if (args.length == 1) {
            // parse the config file
            config.parse(args[0]);
        } else {
            // several arguments: parse them directly
            config.parse(args);
        }

        // run the server from the parsed configuration
        runFromConfig(config);
    }

  Starting the server: runFromConfig() initializes a number of objects before the server comes up:

  1.  Create the FileTxnSnapLog, which manages the data under dataDir and dataLogDir.

  2.  Create the ZooKeeperServer object;

  3.  Create a CountDownLatch; after startup the main thread calls shutdownLatch.await() to park itself while watching the ZooKeeper running state (see the sketch after this list).

  4.  Create and start the AdminServer (Jetty).

  5.  Create the ServerCnxnFactory: cnxnFactory = ServerCnxnFactory.createFactory(); by default ZooKeeper uses NIOServerCnxnFactory for network IO.

  6.  Start the ServerCnxnFactory.

  7.  Create and start the ContainerManager;

  8.  The ZooKeeper server is now up.
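
  Step 3 is a plain CountDownLatch pattern; stripped of the ZooKeeper specifics it amounts to roughly this (a standalone sketch, not ZooKeeper code):

import java.util.concurrent.CountDownLatch;

public class ShutdownLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        final CountDownLatch shutdownLatch = new CountDownLatch(1);

        // stand-in for ZooKeeperServerShutdownHandler: another thread counts the
        // latch down when the server leaves the RUNNING state
        Thread stateWatcher = new Thread(() -> {
            try {
                Thread.sleep(1000);          // pretend the server ran for a while
            } catch (InterruptedException ignored) {
            }
            shutdownLatch.countDown();       // signal "error / shutting down"
        });
        stateWatcher.start();

        shutdownLatch.await();               // the main thread parks here, as runFromConfig does
        System.out.println("latch released, now run shutdown()");
    }
}

  The actual runFromConfig() is shown below: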

    public void runFromConfig(ServerConfig config)
            throws IOException, AdminServerException {
        LOG.info("Starting server");
        FileTxnSnapLog txnLog = null;
        try {
            // Note that this thread isn't going to be doing anything else,
            // so rather than spawning another thread, we will just call
            // run() in this thread.
            // create a file logger url from the command line args
            // initialize the transaction log / snapshot handling
            txnLog = new FileTxnSnapLog(config.dataLogDir, config.dataDir);

            // initialize the ZooKeeperServer object
            final ZooKeeperServer zkServer = new ZooKeeperServer(txnLog,
                    config.tickTime, config.minSessionTimeout, config.maxSessionTimeout, null);

            // shutdown handler, used to learn about server errors or shutdown state changes
            final CountDownLatch shutdownLatch = new CountDownLatch(1);
            zkServer.registerServerShutdownHandler(
                    new ZooKeeperServerShutdownHandler(shutdownLatch));


            // Start Admin server
            // create the admin server (embedded Jetty) that answers information requests
            adminServer = AdminServerFactory.createAdminServer();
            // hand it the ZooKeeper server
            adminServer.setZooKeeperServer(zkServer);
            // AdminServer is a 3.5.0+ feature: a Jetty server on port 8080 by default,
            // exposing ZooKeeper runtime information
            adminServer.start();

            boolean needStartZKServer = true;


            // --- start the ZooKeeperServer
            // check whether clientPortAddress is configured
            if (config.getClientPortAddress() != null) {
                // ServerCnxnFactory is the component that handles client connections
                // initialize the server-side IO object; the default NIOServerCnxnFactory uses plain Java NIO
                cnxnFactory = ServerCnxnFactory.createFactory();

                // apply the configuration
                cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);

                // startup() starts the ServerCnxnFactory and also starts the ZooKeeperServer
                cnxnFactory.startup(zkServer);
                // zkServer has been started. So we don't need to start it again in secureCnxnFactory.
                needStartZKServer = false;
            }
            if (config.getSecureClientPortAddress() != null) {
                secureCnxnFactory = ServerCnxnFactory.createFactory();
                secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), true);
                secureCnxnFactory.startup(zkServer, needStartZKServer);
            }

            // periodically clean up container nodes
            // container znodes are a newer node type (added in the 3.5 line): a container node is
            // deleted once it has no children (newly created containers excepted);
            // ContainerManager performs that periodic check-and-delete work
            containerManager = new ContainerManager(zkServer.getZKDatabase(), zkServer.firstProcessor,
                    Integer.getInteger("znode.container.checkIntervalMs", (int) TimeUnit.MINUTES.toMillis(1)),
                    Integer.getInteger("znode.container.maxPerMinute", 10000)
            );
            containerManager.start();

            // Watch status of ZooKeeper server. It will do a graceful shutdown
            // if the server is not running or hits an internal error.

            // ZooKeeperServerShutdownHandler only counts the latch down when the server is no longer
            // running normally, so execution continues past this point only on error or shutdown
            shutdownLatch.await();

            // shut the service down
            shutdown();

            if (cnxnFactory != null) {
                cnxnFactory.join();
            }
            if (secureCnxnFactory != null) {
                secureCnxnFactory.join();
            }
            if (zkServer.canShutdown()) {
                zkServer.shutdown(true);
            }
        } catch (InterruptedException e) {
            // warn, but generally this is ok
            LOG.warn("Server interrupted", e);
        } finally {
            if (txnLog != null) {
                txnLog.close();
            }
        }
    }

  ServerCnxnFactory defaults to NIOServerCnxnFactory; the Netty implementation can be selected by setting the system property zookeeper.serverCnxnFactory;

    static public ServerCnxnFactory createFactory() throws IOException {
        String serverCnxnFactoryName =
            System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
        if (serverCnxnFactoryName == null) {
            // if no implementation class is specified, fall back to NIOServerCnxnFactory
            serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
        }
        try {
            ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                    .getDeclaredConstructor().newInstance();
            LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
            return serverCnxnFactory;
        } catch (Exception e) {
            IOException ioe = new IOException("Couldn't instantiate "
                    + serverCnxnFactoryName);
            ioe.initCause(e);
            throw ioe;
        }
    }
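
  To pick the Netty implementation instead, the property named above can be set either on the JVM command line or programmatically before createFactory() runs (a short sketch; the factory class name is the one shipped with ZooKeeper):

// JVM flag:
//   -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
// or programmatically, before ServerCnxnFactory.createFactory() is called:
System.setProperty("zookeeper.serverCnxnFactory",
        "org.apache.zookeeper.server.NettyServerCnxnFactory");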

  cnxnFactory.startup(zkServer) starts the ServerCnxnFactory and, with it, the ZooKeeper server:

    public void startup(ZooKeeperServer zks, boolean startServer)
            throws IOException, InterruptedException {
        // start the factory's own threads:
        // - the NIO worker thread pool
        // - the NIO selector threads
        // - the acceptThread that accepts client connections
        start();
        setZooKeeperServer(zks);

        // start the ZooKeeper server itself
        if (startServer) {
            // load data into the ZKDatabase
            zks.startdata();
            // start the session tracker, register JMX, build the request processor chain
            zks.startup();
        }
    }

  zks.startdata();

    public void startdata() throws IOException, InterruptedException {
        // initialize the ZKDatabase, the in-memory structure holding all data stored in ZK
        // check to see if zkDb is not null
        if (zkDb == null) {
            // create the database; a few built-in nodes such as /zookeeper are added here
            zkDb = new ZKDatabase(this.txnLogFactory);
        }
        // load any data already persisted on disk, if present
        if (!zkDb.isInitialized()) {
            loadData();
        }
    }

  zks.startup();

    public synchronized void startup() {
        // create the session tracker
        if (sessionTracker == null) {
            createSessionTracker();
        }
        // start the session tracker
        startSessionTracker();

        // build the request processor chain
        setupRequestProcessors();

        // register JMX
        registerJMX();

        setState(State.RUNNING);
        notifyAll();
    }

  At this point the ZooKeeper server is up and listening for requests.

 Cluster Mode Startup

  After QuorumPeerMain loads the configuration, the QuorumPeerConfig holds a QuorumVerifier object recording the other ZooKeeper server nodes; if zoo.cfg contains server.* entries, a QuorumVerifier is instantiated. Note that AllMembers = VotingMembers + ObservingMembers.

public interface QuorumVerifier {
    long getWeight(long id);
    boolean containsQuorum(Set<Long> set);
    long getVersion();
    void setVersion(long ver);
    Map<Long, QuorumServer> getAllMembers();
    Map<Long, QuorumServer> getVotingMembers();
    Map<Long, QuorumServer> getObservingMembers();
    boolean equals(Object o);
    String toString();
}

  If quorumVerifier.getVotingMembers().size() > 1, the server starts in cluster mode: runFromConfig(QuorumPeerConfig config) is called, the ServerCnxnFactory is created, and a QuorumPeer object is initialized.

  A QuorumPeer represents one ZooKeeper node. It is a Thread subclass standing for one ZooKeeper server instance, and it is eventually started as a thread.

  runFromConfig sets a series of properties on it, including the election type, the server id and the node database, and finally starts the node via quorumPeer.start().

 public void runFromConfig(QuorumPeerConfig config)
            throws IOException, AdminServerException
    {
      try {
          // register JMX
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }

      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = null;
          ServerCnxnFactory secureCnxnFactory = null;

          if (config.getClientPortAddress() != null) {
              cnxnFactory = ServerCnxnFactory.createFactory();
              // configure the client port
              cnxnFactory.configure(config.getClientPortAddress(),
                      config.getMaxClientCnxns(),
                      false);
          }

          if (config.getSecureClientPortAddress() != null) {
              secureCnxnFactory = ServerCnxnFactory.createFactory();
              // configure the secure client port
              secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                      config.getMaxClientCnxns(),
                      true);
          }

          // ------------ initialize this zk server node's configuration ----------------
          // transaction log / snapshot handling
          quorumPeer = getQuorumPeer();
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                      config.getDataLogDir(),
                      config.getDataDir()));
          quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
          quorumPeer.enableLocalSessionsUpgrading(
              config.isLocalSessionsUpgradingEnabled());
          //quorumPeer.setQuorumPeers(config.getAllMembers());
          // election type
          quorumPeer.setElectionType(config.getElectionAlg());
          // server id
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setConfigFileName(config.getConfigFilename());

          // set the node database
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
          if (config.getLastSeenQuorumVerifier()!=null) {
              quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
          }

          // initialize the zk database
          quorumPeer.initConfigInZKDatabase();
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }
          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);

          // ------------- node configuration done ---------------
          quorumPeer.initialize();

          // start the peer
          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

  quorumPeer.start(): ZooKeeper first loads the data on local disk; any previously persisted state is loaded into the in-memory database via loadDataBase().

    public synchronized void start() {

        // sanity check: if this server's id is not in the peer list, throw
        if (!getView().containsKey(myid)) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
         }

        // load the zk database: restore previously persisted state
        loadDataBase();

        // start the server-side connection factory
        startServerCnxnFactory();
        try {
            adminServer.start();
        } catch (AdminServerException e) {
            LOG.warn("Problem starting AdminServer", e);
            System.out.println(e);
        }
        // prepare leader election right after startup: set up everything it needs, e.g. start the related threads
        startLeaderElection();

        // run the election logic (QuorumPeer.run())
        super.start();
    }

  After the data is loaded, the flow matches standalone startup: ServerCnxnFactory.start() brings up the NIOServerCnxnFactory and the ZooKeeper service, and finally the AdminServer is started.

  The difference from standalone mode is that, right after startup, the cluster immediately runs a leader election among all configured ZooKeeper servers: startLeaderElection();

Election

  A ZooKeeper ensemble has three roles: Leader, Follower and Observer, each with its own responsibilities. When the Leader fails, the Followers elect a new Leader.

  The Leader is the primary node of the cluster; there is exactly one Leader, and it handles ZooKeeper's transactional operations, i.e. everything that changes ZooKeeper data and state.

  Followers handle client read requests and take part in elections. They also process the transaction commit requests, i.e. the proposals, sent by the Leader.

  Observers exist to scale the read throughput of the ensemble: they answer read requests but, unlike Followers, do not take part in leader elections and do not vote on the Leader's proposals.
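
  For reference, an Observer is configured roughly as follows (a hedged sketch with made-up host names): its server.N entry is marked with :observer on every node, and the observer itself sets peerType:

# in every node's zoo.cfg
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888:observer

# additionally, in zoo.cfg on node 4 itself
peerType=observer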

  Roles imply elections, and elections need a strategy. ZooKeeper has three election implementations: LeaderElection, AuthFastLeaderElection and FastLeaderElection. FastLeaderElection is the current default; the first two algorithms are marked @Deprecated.

  ZooKeeper node information carried by a vote:

  serverId: the node id, i.e. the myid stored under dataDir and the id used in the server.* entries (0, 1, 2, 3, 4 ...). It does not change after startup.

  zxid: the data state id. It grows with every state change and can be read as a globally ordered id: the larger the zxid, the newer the data. A zxid is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are an incrementing counter (see the sketch after this list).

  epoch: the election clock, i.e. the election round; it is incremented by one every time an election takes place;

  ServerState: the node's role state, one of LOOKING, FOLLOWING, LEADING and OBSERVING, corresponding to the different roles; a node taking part in an election is in the LOOKING state.
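
  A small sketch of the zxid layout described above (ZooKeeper itself has helpers along these lines in ZxidUtils; the values here are made up):

public class ZxidSketch {
    public static void main(String[] args) {
        long epoch = 5L;
        long counter = 42L;
        long zxid = (epoch << 32) | counter;   // high 32 bits: epoch, low 32 bits: counter

        System.out.println("zxid    = 0x" + Long.toHexString(zxid));
        System.out.println("epoch   = " + (zxid >> 32));            // 5
        System.out.println("counter = " + (zxid & 0xffffffffL));     // 42
    }
}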

  Each ballot, a Vote, carries this node information.

  Right after startup ZooKeeper begins electing: each node keeps sending its ballot to the other nodes and keeps receiving theirs. Eventually every node holds the same pool of ballots, and running the same algorithm over that pool yields the same Leader. If the current node is elected Leader it notifies every Follower; otherwise it reports to the Leader.

  ZooKeeper defines the Election interface; lookForLeader() is the election operation.

public interface Election {
    public Vote lookForLeader() throws InterruptedException;
    public void shutdown();
}

  In the cluster startup flow above, startLeaderElection() kicks off the election. It selects the election algorithm and also casts the first ballot for the node itself (back yourself, young man!): a Vote contains the voting node's id, its current zxid and the current epoch. ZooKeeper uses the FastLeaderElection algorithm by default. The QuorumPeer thread is then started and the voting begins.

    synchronized public void startLeaderElection() {
       try {

           // every node starts in the LOOKING state, so each node first creates a ballot voting for itself as leader
           if (getPeerState() == ServerState.LOOKING) {
               currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
           }
       } catch(IOException e) {
           RuntimeException re = new RuntimeException(e.getMessage());
           re.setStackTrace(e.getStackTrace());
           throw re;
       }

       // if (!getView().containsKey(myid)) {
      //      throw new RuntimeException("My id " + myid + " not in the peer list");
        //}
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // create the election algorithm; electionType defaults to 3 (FastLeaderElection)
        this.electionAlg = createElectionAlgorithm(electionType);
    }

  FastLeaderElection defines three inner classes, Notification, ToSend and Messenger; Messenger in turn defines WorkerReceiver and WorkerSender.


  Notification represents a ballot received from another server; it contains the proposed leader's id, zxid, the election epoch, and so on.

  ToSend represents a ballot to be sent to other servers; it carries the same kind of information (proposed leader id, zxid, election epoch, ...).

  Messenger is the message-handling class used for sending and receiving ballots; it contains the WorkerReceiver and WorkerSender thread classes.

  The FastLeaderElection class:

public class FastLeaderElection implements Election {
    //..........
    /**
     * Connection manager. Fast leader election uses TCP for
     * communication between peers, and QuorumCnxManager manages
     * such connections.
     */

    QuorumCnxManager manager;
    /*
        Notification represents a ballot received from another server,
        containing the proposed leader's id, zxid, election epoch, etc.;
        its buildMsg method packs the election info into a ByteBuffer before sending
     */
    static public class Notification {
       //..........
    }
    /**
     * Messages that a peer wants to send to other peers.
     * These messages can be both Notifications and Acks
     * of reception of notification.
     */
    /*
     ToSend represents a ballot sent to other servers, also carrying the proposed leader's id, zxid, election epoch, etc.
     */
    static public class ToSend {
        //..........
    }
    LinkedBlockingQueue<ToSend> sendqueue;
    LinkedBlockingQueue<Notification> recvqueue;

    /**
     * Multi-threaded implementation of message handler. Messenger
     * implements two sub-classes: WorkReceiver and  WorkSender. The
     * functionality of each is obvious from the name. Each of these
     * spawns a new thread.
     */
    protected class Messenger {
        /**
         * Receives messages from instance of QuorumCnxManager on
         * method run(), and processes such messages.
         */

        class WorkerReceiver extends ZooKeeperThread  {
             //..........
        }
        /**
         * This worker simply dequeues a message to send and
         * and queues it on the manager's queue.
         */

        class WorkerSender extends ZooKeeperThread {
            //..........
        }

        WorkerSender ws;
        WorkerReceiver wr;
        Thread wsThread = null;
        Thread wrThread = null;


    }
    //..........
    QuorumPeer self;
    Messenger messenger;
    AtomicLong logicalclock = new AtomicLong(); /* Election instance */
    long proposedLeader;
    long proposedZxid;
    long proposedEpoch;
    //..........
}

  Once the QuorumPeer thread is running it keeps watching the ServerState; whenever the node is in the LOOKING state it runs an election. A ZooKeeper server starts in LOOKING, so it elects right after startup by calling makeLEStrategy().lookForLeader(), i.e. FastLeaderElection.lookForLeader().

  QuorumPeer.run():

   public void run() {
        updateThreadName();

        //..........

        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    LOG.info("LOOKING");

                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        final ReadOnlyZooKeeperServer roZk =
                            new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                            reconfigFlagClear();
                            if (shuttingDownLE) {
                                shuttingDownLE = false;
                                startLeaderElection();
                            }
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }
                    }
                    break;
                case OBSERVING:
                    try {
                        LOG.info("OBSERVING");
                        setObserver(makeObserver(logFactory));
                        observer.observeLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                    } finally {
                        observer.shutdown();
                        setObserver(null);
                        updateServerState();
                    }
                    break;
                case FOLLOWING:
                    try {
                        LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                    } finally {
                        follower.shutdown();
                        setFollower(null);
                        updateServerState();
                    }
                    break;
                case LEADING:
                    LOG.info("LEADING");
                    try {
                        setLeader(makeLeader(logFactory));
                        leader.lead();
                        setLeader(null);
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                    } finally {
                        if (leader != null) {
                            leader.shutdown("Forcing shutdown");
                            setLeader(null);
                        }
                        updateServerState();
                    }
                    break;
                }
                start_fle = Time.currentElapsedTime();
            }
        } finally {
            LOG.warn("QuorumPeer main thread exited");
            MBeanRegistry instance = MBeanRegistry.getInstance();
            instance.unregister(jmxQuorumBean);
            instance.unregister(jmxLocalPeerBean);

            for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
                instance.unregister(remotePeerBean);
            }

            jmxQuorumBean = null;
            jmxLocalPeerBean = null;
            jmxRemotePeerBean = null;
        }
    }

  FastLeaderElection.lookForLeader():

 

    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
            // initial wait of 200 ms (finalizeWait)
            int notTimeout = finalizeWait;

            synchronized(this){
                // bump the logical clock (election round) by one
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            // send our ballot to the other nodes
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */
            // keep looping while this node is still LOOKING
            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                // take the next ballot received from the other nodes
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                // no ballot received
                if(n == null){
                    // check whether we have lost contact with the cluster
                    if(manager.haveDelivered()){
                        // still connected: resend our ballot
                        sendNotifications();
                    } else {
                        // lost contact: reconnect
                        manager.connectAll();
                    }
                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                } // a ballot was received: process it
                else if (validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the current or next
                     * voting view for a replica in the current or next voting view.
                     */
                    // the sender's ServerState
                    switch (n.state) {
                    case LOOKING:
                        // the sender is also LOOKING, i.e. an election is in progress, so process its ballot

                        // If notification > current, replace and send messages out
                        // if our election epoch is behind the sender's, our round is stale: catch up
                        if (n.electionEpoch > logicalclock.get()) {
                            // adopt the newer election epoch
                            logicalclock.set(n.electionEpoch);
                            // clear the ballots received so far
                            recvset.clear();
                            // compare ballots:
                            // if our own ballot loses against the received one, adopt the received ballot,
                            // otherwise keep our own, then resend
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                // adopt the received ballot
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                // keep our own ballot
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            // send the (possibly updated) ballot
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            // the sender's epoch is behind ours: drop the ballot
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            // same epoch: the normal case, all nodes are in the same round
                            // if our ballot loses against the received one, adopt it and resend
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }
                        // archive the Vote into the pool of valid ballots received so far
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        // tally the ballots and check whether the election can end
                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {
                            // a leader has been determined

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            // if the elected leader is this node, become LEADING, otherwise become a learner (Follower/Observer)
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                        proposedZxid, proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        // another node already knows who the leader is
                        // if the ballot is from our election round, add it to the pool,
                        // check whether a quorum agrees on that leader and, if so, run checkLeader
                        /* checkLeader:
                         * [the election can conclude] and
                         * [(the proposed leader is this node and the rounds match) or
                         *  (the proposed leader is another node, it appears in outofelection, and its ServerState is already LEADING)]
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                            if(termPredicate(recvset, new Vote(n.leader,
                                            n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                            && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify that
                         * a majority are following the same leader.
                         * Only peer epoch is used to check that the votes come
                         * from the same ensemble. This is because there is at
                         * least one corner case in which the ensemble can be
                         * created with inconsistent zxid and election epoch
                         * info. However, given that only one ensemble can be
                         * running at a single point in time and that each 
                         * epoch is used only once, using only the epoch to 
                         * compare the votes is sufficient.
                         * 
                         * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                         */
                        outofelection.put(n.sid, new Vote(n.leader, 
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                        // at this point another election round in the cluster has already produced a result:
                        // check the outofelection pool; if it can conclude the election and the received
                        // leader passes checkLeader, update this node's state accordingly
                        if (termPredicate(outofelection, new Vote(n.leader,
                                IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecoginized: " + n.state
                              + " (n.state), " + n.sid + " (n.sid)");
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }

 

  lookForLeader keeps comparing the current ballot against the received ones and updating it until a leader emerges. The comparison is done by totalOrderPredicate(), which works as follows:

  1.  Compare the epoch (election round) first and prefer the node with the larger epoch; if the received epoch is larger, our own round is stale and we adopt the new epoch, otherwise the ballot is discarded.

  2.  Within the same round, prefer the node with the larger zxid, since a larger zxid means newer data.

  3.  With the same round and the same zxid, prefer the node with the larger serverId.

  In short: bigger wins!

  totalOrderPredicate():

    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }
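
  A tiny worked example of those three rules (a standalone sketch; the vote values are made up and the quorum-weight check is omitted):

public class VoteCompareSketch {
    // same ordering as totalOrderPredicate, minus the weight check
    static boolean wins(long newId, long newZxid, long newEpoch,
                        long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch) ||
               ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId))));
    }

    public static void main(String[] args) {
        // same epoch, the new vote carries a larger zxid -> it wins
        System.out.println(wins(2, 0x100000005L, 1, 1, 0x100000003L, 1)); // true
        // same epoch and zxid -> the larger server id wins
        System.out.println(wins(3, 0x100000005L, 1, 2, 0x100000005L, 1)); // true
        // an older epoch always loses, no matter how large the zxid is
        System.out.println(wins(5, 0x200000001L, 1, 1, 0x100000001L, 2)); // false
    }
}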

 Election Flow


  The whole election can be roughly understood as: keep receiving ballots and comparing them until a leader emerges. Every ZooKeeper node holds its own ballot pool, and since all nodes apply the same comparison algorithm, under normal circumstances they all converge on the same leader.

  end;

  This article covers only part of the ZooKeeper source: startup and election. The core transaction processing and the ZAB consistency protocol will be covered in a follow-up. If anything here is wrong or could be improved, feel free to point it out in the comments.

 

  ZooKeeper GitHub: https://github.com/apache/zookeeper/

  Apache ZooKeeper releases: https://zookeeper.apache.org/releases.html

  Some of the source-code comments are adapted from: 拉鉤-子幕

  
