Introduction
Today ZooKeeper is most commonly deployed as a service registry for service discovery, but that is only one of its capabilities. Per the official Apache description: "The Apache ZooKeeper system for distributed coordination is a high-performance service for building distributed applications." ZooKeeper is a high-performance coordination service for building distributed applications. Thanks to its internal node (znode) structure and its watch mechanism, it can serve most distributed-coordination scenarios: configuration management, naming services, distributed locks, service discovery, publish/subscribe, and so on. Essentially all of these are built on the "change + notify" pattern of its nodes, because the heart of a distributed system is communication, and the purpose of that communication is coordination.
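To make the "change + notify" pattern concrete, here is a minimal client-side sketch; the connect string and the /app/config path are made up for illustration, and error handling and session management are omitted:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper server (address and timeout are illustrative).
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> {});
        // Read a znode and register a one-shot watch on it.
        byte[] data = zk.getData("/app/config", new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Fired once when /app/config changes: the "change + notify" pattern.
                System.out.println("Event: " + event.getType() + " on " + event.getPath());
            }
        }, null);
        System.out.println(new String(data));
    }
}

Each of the coordination recipes above is a variation of this: watch a znode, react when it changes.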
ZooKeeper is written in Java (a C client API also exists). In terms of theory, it is roughly an implementation of the Paxos algorithm: it has Leaders, Followers, Proposals, elections, and similar concepts, though it differs from Paxos in places (the ZAB protocol). Paxos itself is, in my view, a theoretical framework for a whole class of solutions, and it takes some effort to understand. This article does not go deep into the algorithm; it only walks through part of the ZooKeeper server source, covering startup and leader election, and does not cover the client.
Getting the Source
The ZooKeeper source can be cloned from GitHub (https://github.com/apache/zookeeper);
it can also be downloaded from the Apache ZooKeeper site: https://zookeeper.apache.org/releases.html.
Before 3.5.5 ZooKeeper was built with Ant; from 3.5.5 onward it is built with Maven.
This walkthrough uses version 3.5.4.
Project Structure
Directory structure:
The src directory contains both the C and Java sources; the C API is ignored here. conf holds the configuration files ZooKeeper starts with, and bin holds the startup scripts (server/client).
org.apache.jute is ZooKeeper's wire-protocol and serialization component. The protocol runs over TCP; jute provides the Record interface for serialization/deserialization, plus the OutputArchive/InputArchive interfaces.
org.apache.zookeeper contains ZooKeeper's core code, i.e. the core business logic.
Startup Flow
When running ZooKeeper, we start the server with "./zkServer.sh start". Reading zkServer.sh shows that the program entry point is "org.apache.zookeeper.server.quorum.QuorumPeerMain", and that the script also wires up the logging configuration.
ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
#.......
nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
    "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
    -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
    -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
if [ $? -eq 0 ]
#.......
ZooKeeper startup flow:
QuorumPeerMain.main() takes at least one argument; typically it is exactly one, the path to zoo.cfg. There is little business logic in main itself: it instantiates a QuorumPeerMain object and delegates to main.initializeAndRun(args).
public static void main(String[] args) {
    QuorumPeerMain main = new QuorumPeerMain();
    try {
        main.initializeAndRun(args);
    } catch (IllegalArgumentException e) {
        LOG.error("Invalid arguments, exiting abnormally", e);
        LOG.info(USAGE);
        System.err.println(USAGE);
        System.exit(2);
    } catch (ConfigException e) {
        LOG.error("Invalid config, exiting abnormally", e);
        System.err.println("Invalid config, exiting abnormally");
        System.exit(2);
    } catch (DatadirException e) {
        LOG.error("Unable to access datadir, exiting abnormally", e);
        System.err.println("Unable to access datadir, exiting abnormally");
        System.exit(3);
    } catch (AdminServerException e) {
        LOG.error("Unable to start AdminServer, exiting abnormally", e);
        System.err.println("Unable to start AdminServer, exiting abnormally");
        System.exit(4);
    } catch (Exception e) {
        LOG.error("Unexpected exception, exiting abnormally", e);
        System.exit(1);
    }
    LOG.info("Exiting normally");
    System.exit(0);
}
initializeAndRun instantiates a QuorumPeerConfig and parses zoo.cfg via parseProperties(); QuorumPeerConfig holds the configuration for the entire application. It then starts a DatadirCleanupManager, which schedules a Timer that purges stale snapshot and transaction-log data under the data directories.
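The cleanup that this timer schedules can also be triggered by hand; a minimal sketch of the equivalent one-shot purge (the directory paths are illustrative, and the retain count must be at least 3):

import java.io.File;
import org.apache.zookeeper.server.PurgeTxnLog;

public class PurgeDemo {
    public static void main(String[] args) throws Exception {
        // Same work the DatadirCleanupManager timer performs on a schedule:
        // keep the newest 3 snapshots (and matching txn logs), delete the rest.
        PurgeTxnLog.purge(new File("/data/zk/datalog"), new File("/data/zk/data"), 3);
    }
}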
Finally it starts the server proper. ZooKeeper has standalone and cluster modes, hence two launch paths: with standaloneEnabled=true in zoo.cfg it runs standalone; if cluster nodes (server.0, server.1, ...) are configured, it runs in cluster mode (a zoo.cfg sketch follows the code below).
protected void initializeAndRun(String[] args)
    throws ConfigException, IOException, AdminServerException
{
    QuorumPeerConfig config = new QuorumPeerConfig();
    if (args.length == 1) {
        config.parse(args[0]);
    }

    // Start and schedule the purge task
    DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
            .getDataDir(), config.getDataLogDir(), config
            .getSnapRetainCount(), config.getPurgeInterval());
    purgeMgr.start();

    // True when multiple nodes are configured:
    // return quorumVerifier != null && (!standaloneEnabled || quorumVerifier.getVotingMembers().size() > 1);
    if (args.length == 1 && config.isDistributed()) {
        // cluster mode
        runFromConfig(config);
    } else {
        LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
        // there is only server in the quorum -- run as standalone
        // standalone mode
        ZooKeeperServerMain.main(args);
    }
}
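As promised above, a minimal zoo.cfg sketch for the two modes; all paths, ports, and addresses are illustrative:

# standalone: no server.* entries (standaloneEnabled defaults to true)
tickTime=2000
dataDir=/data/zk/data
clientPort=2181

# cluster: add initLimit/syncLimit and the voting members,
# at which point config.isDistributed() becomes true
initLimit=10
syncLimit=5
server.0=192.168.0.10:2888:3888
server.1=192.168.0.11:2888:3888
server.2=192.168.0.12:2888:3888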
Standalone Startup
With standaloneEnabled=true, or with no cluster nodes (server.*) configured, ZooKeeper starts in standalone mode. The entry point is ZooKeeperServerMain, which holds ServerCnxnFactory, ContainerManager, and AdminServer objects:
public class ZooKeeperServerMain {
    /*.............*/
    // ZooKeeper server supports two kinds of connection: unencrypted and encrypted.
    private ServerCnxnFactory cnxnFactory;
    private ServerCnxnFactory secureCnxnFactory;
    private ContainerManager containerManager;

    private AdminServer adminServer;
    /*.............*/
}
ServerCnxnFactory is a core ZooKeeper component: it implements the network IO and manages client connections. Two implementations ship with ZooKeeper, one on top of JDK NIO and one on top of Netty.
ContainerManager maintains znode information — in particular it periodically cleans up empty container nodes — and works against the ZKDatabase;
AdminServer is an embedded Jetty server, listening on port 8080 by default, that exposes HTTP endpoints for querying ZooKeeper runtime information. It was introduced in 3.5.
ZooKeeperServerMain.main() mirrors QuorumPeerMain: it instantiates itself, then initializes, loads the configuration, and starts the server.
Loading the configuration:
// Parse the standalone configuration and start in standalone mode
protected void initializeAndRun(String[] args)
    throws ConfigException, IOException, AdminServerException
{
    try {
        // Register JMX beans.
        // JMX (Java Management Extensions) is a mechanism for managing and
        // monitoring running Java programs: threads, memory, log levels,
        // service restarts, system environment, and so on.
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    // Create the server configuration object
    ServerConfig config = new ServerConfig();

    // A single argument is taken to be the config file path
    if (args.length == 1) {
        // parse the config file
        config.parse(args[0]);
    } else {
        // multiple arguments: parse them directly
        config.parse(args);
    }

    // Run the server from the configuration
    runFromConfig(config);
}
Server startup: runFromConfig() initializes a number of objects before the server comes up:
1. Initialize the FileTxnSnapLog object, which manages the dataDir and dataLogDir data.
2. Initialize the ZooKeeperServer object.
3. Instantiate a CountDownLatch. After startup, shutdownLatch.await() parks the main thread while the ZooKeeper server's running state is watched (see the latch sketch after this list).
4. Create and start the AdminServer (Jetty).
5. Create the ServerCnxnFactory: cnxnFactory = ServerCnxnFactory.createFactory(); by default ZooKeeper uses NIOServerCnxnFactory for network IO.
6. Start the ServerCnxnFactory.
7. Create and start the ContainerManager.
8. The ZooKeeper server is up.
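Step 3 deserves a closer look: the main thread parks on a one-shot CountDownLatch and only proceeds to shutdown() once ZooKeeperServerShutdownHandler counts it down. A stripped-down sketch of the same pattern (the worker thread here just stands in for the real server):

import java.util.concurrent.CountDownLatch;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch shutdownLatch = new CountDownLatch(1);

        // Stand-in for ZooKeeperServerShutdownHandler: some other thread
        // counts the latch down when the server leaves the RUNNING state.
        Thread server = new Thread(() -> {
            try {
                Thread.sleep(1000);        // pretend the server ran for a while
            } catch (InterruptedException ignored) {
            } finally {
                shutdownLatch.countDown(); // signal "server stopped"
            }
        });
        server.start();

        shutdownLatch.await();             // main thread parks here, like runFromConfig()
        System.out.println("latch released, run shutdown()");
    }
}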
public void runFromConfig(ServerConfig config)
        throws IOException, AdminServerException {
    LOG.info("Starting server");
    FileTxnSnapLog txnLog = null;
    try {
        // Note that this thread isn't going to be doing anything else,
        // so rather than spawning another thread, we will just call
        // run() in this thread.
        // create a file logger url from the command line args
        // initialize the transaction log / snapshot files
        txnLog = new FileTxnSnapLog(config.dataLogDir, config.dataDir);

        // initialize the zkServer object
        final ZooKeeperServer zkServer = new ZooKeeperServer(txnLog,
                config.tickTime, config.minSessionTimeout, config.maxSessionTimeout, null);

        // Shutdown hook, used to learn of server errors or shutdown state changes.
        final CountDownLatch shutdownLatch = new CountDownLatch(1);
        zkServer.registerServerShutdownHandler(
                new ZooKeeperServerShutdownHandler(shutdownLatch));


        // Start Admin server
        // create the admin server, which accepts HTTP requests (an embedded Jetty server)
        adminServer = AdminServerFactory.createAdminServer();
        // attach the ZooKeeper server to it
        adminServer.setZooKeeperServer(zkServer);
        // AdminServer is a 3.5.0+ feature: a Jetty server on port 8080 by default,
        // exposing runtime information about ZooKeeper
        adminServer.start();

        boolean needStartZKServer = true;


        //--- start the ZooKeeperServer
        // check whether clientPortAddress is configured
        if (config.getClientPortAddress() != null) {
            // ServerCnxnFactory is the key component that handles client/server connections
            // Initialize the server-side IO object; the default is NIOServerCnxnFactory,
            // which handles network IO events with plain Java NIO
            cnxnFactory = ServerCnxnFactory.createFactory();

            // apply the configuration
            cnxnFactory.configure(config.getClientPortAddress(), config.getMaxClientCnxns(), false);

            // startup() starts not only the ServerCnxnFactory but also the ZooKeeper server itself
            cnxnFactory.startup(zkServer);
            // zkServer has been started. So we don't need to start it again in secureCnxnFactory.
            needStartZKServer = false;
        }
        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            secureCnxnFactory.configure(config.getSecureClientPortAddress(), config.getMaxClientCnxns(), true);
            secureCnxnFactory.startup(zkServer, needStartZKServer);
        }

        // Periodically clean up container znodes.
        // Container znodes are a newer node type: a container node is deleted
        // once it has no children (newly created containers excepted); this
        // class performs that check-and-clean work on a schedule
        containerManager = new ContainerManager(zkServer.getZKDatabase(), zkServer.firstProcessor,
                Integer.getInteger("znode.container.checkIntervalMs", (int) TimeUnit.MINUTES.toMillis(1)),
                Integer.getInteger("znode.container.maxPerMinute", 10000)
        );
        containerManager.start();

        // Watch status of ZooKeeper server. It will do a graceful shutdown
        // if the server is not running or hits an internal error.

        // ZooKeeperServerShutdownHandler only releases this latch when the
        // server is no longer running normally
        shutdownLatch.await();

        // shut down
        shutdown();

        if (cnxnFactory != null) {
            cnxnFactory.join();
        }
        if (secureCnxnFactory != null) {
            secureCnxnFactory.join();
        }
        if (zkServer.canShutdown()) {
            zkServer.shutdown(true);
        }
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Server interrupted", e);
    } finally {
        if (txnLog != null) {
            txnLog.close();
        }
    }
}
ServerCnxnFactory defaults to NIOServerCnxnFactory; setting the zookeeper.serverCnxnFactory system property switches it to the Netty implementation (see the sketch after the code below):
static public ServerCnxnFactory createFactory() throws IOException {
    String serverCnxnFactoryName =
        System.getProperty(ZOOKEEPER_SERVER_CNXN_FACTORY);
    if (serverCnxnFactoryName == null) {
        // no implementation specified: default to NIOServerCnxnFactory
        serverCnxnFactoryName = NIOServerCnxnFactory.class.getName();
    }
    try {
        ServerCnxnFactory serverCnxnFactory = (ServerCnxnFactory) Class.forName(serverCnxnFactoryName)
                .getDeclaredConstructor().newInstance();
        LOG.info("Using {} as server connection factory", serverCnxnFactoryName);
        return serverCnxnFactory;
    } catch (Exception e) {
        IOException ioe = new IOException("Couldn't instantiate "
                + serverCnxnFactoryName);
        ioe.initCause(e);
        throw ioe;
    }
}
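Switching to Netty is therefore just a matter of setting that property before the factory is created — either with -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory on the command line, or programmatically, as in this sketch (it assumes Netty is on the classpath):

import org.apache.zookeeper.server.ServerCnxnFactory;

public class FactoryDemo {
    public static void main(String[] args) throws Exception {
        // Same switch the -Dzookeeper.serverCnxnFactory JVM flag flips.
        System.setProperty("zookeeper.serverCnxnFactory",
                "org.apache.zookeeper.server.NettyServerCnxnFactory");
        ServerCnxnFactory factory = ServerCnxnFactory.createFactory();
        System.out.println(factory.getClass().getName()); // NettyServerCnxnFactory
    }
}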
cnxnFactory.startup(zkServer) starts the ServerCnxnFactory and, with it, the ZooKeeper server itself:
public void startup(ZooKeeperServer zks, boolean startServer)
        throws IOException, InterruptedException {
    // Start the worker threads:
    // - the NIOWorker thread pool
    // - the NIO selector threads
    // - the acceptThread that handles incoming client connections
    start();
    setZooKeeperServer(zks);

    // start the server itself
    if (startServer) {
        // load data into the ZKDatabase
        zks.startdata();
        // start the session tracker, register JMX, build the request processor chain
        zks.startup();
    }
}
zks.startdata();
public void startdata() throws IOException, InterruptedException {
    // Initialize the ZKDatabase, the in-memory structure holding all data stored in ZK
    // check to see if zkDb is not null
    if (zkDb == null) {
        // initialize the database; a few built-in nodes such as /zookeeper are created here
        zkDb = new ZKDatabase(this.txnLogFactory);
    }
    // load any data already persisted on disk, if present
    if (!zkDb.isInitialized()) {
        loadData();
    }
}
zks.startup();
public synchronized void startup() {
    // initialize the session tracker
    if (sessionTracker == null) {
        createSessionTracker();
    }
    // start the session tracker
    startSessionTracker();

    // build the request processor chain
    setupRequestProcessors();

    // register JMX beans
    registerJMX();

    setState(State.RUNNING);
    notifyAll();
}
At this point the standalone ZooKeeper server is up and listening.
Cluster Startup
After the main QuorumPeerMain program loads the configuration file, the QuorumPeerConfig container holds a QuorumVerifier object that stores the other ZooKeeper server nodes; one is instantiated whenever zoo.cfg contains server.* entries. Note that AllMembers = VotingMembers + ObservingMembers.
public interface QuorumVerifier {
    long getWeight(long id);
    boolean containsQuorum(Set<Long> set);
    long getVersion();
    void setVersion(long ver);
    Map<Long, QuorumServer> getAllMembers();
    Map<Long, QuorumServer> getVotingMembers();
    Map<Long, QuorumServer> getObservingMembers();
    boolean equals(Object o);
    String toString();
}
If quorumVerifier.getVotingMembers().size() > 1, cluster mode is used: runFromConfig(QuorumPeerConfig config) is called, a ServerCnxnFactory is created, and a QuorumPeer object is initialized.
A QuorumPeer represents one ZooKeeper node. QuorumPeer is a thread class, standing for one running ZooKeeper server, and is ultimately started as a thread.
runFromConfig sets a series of properties — election type, server id, the node database, and so on — and finally starts the node via quorumPeer.start();
public void runFromConfig(QuorumPeerConfig config)
        throws IOException, AdminServerException
{
    try {
        // register JMX beans
        ManagedUtil.registerLog4jMBeans();
    } catch (JMException e) {
        LOG.warn("Unable to register log4j JMX control", e);
    }

    LOG.info("Starting quorum peer");
    try {
        ServerCnxnFactory cnxnFactory = null;
        ServerCnxnFactory secureCnxnFactory = null;

        if (config.getClientPortAddress() != null) {
            cnxnFactory = ServerCnxnFactory.createFactory();
            // configure the client port
            cnxnFactory.configure(config.getClientPortAddress(),
                    config.getMaxClientCnxns(),
                    false);
        }

        if (config.getSecureClientPortAddress() != null) {
            secureCnxnFactory = ServerCnxnFactory.createFactory();
            // configure the secure client port
            secureCnxnFactory.configure(config.getSecureClientPortAddress(),
                    config.getMaxClientCnxns(),
                    true);
        }

        // ------------ initialize this zk server node ----------------
        // transaction log and snapshot handling
        quorumPeer = getQuorumPeer();
        quorumPeer.setTxnFactory(new FileTxnSnapLog(
                config.getDataLogDir(),
                config.getDataDir()));
        quorumPeer.enableLocalSessions(config.areLocalSessionsEnabled());
        quorumPeer.enableLocalSessionsUpgrading(
                config.isLocalSessionsUpgradingEnabled());
        //quorumPeer.setQuorumPeers(config.getAllMembers());
        // election type
        quorumPeer.setElectionType(config.getElectionAlg());
        // server id
        quorumPeer.setMyid(config.getServerId());
        quorumPeer.setTickTime(config.getTickTime());
        quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
        quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
        quorumPeer.setInitLimit(config.getInitLimit());
        quorumPeer.setSyncLimit(config.getSyncLimit());
        quorumPeer.setConfigFileName(config.getConfigFilename());

        // set the node database
        quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
        quorumPeer.setQuorumVerifier(config.getQuorumVerifier(), false);
        if (config.getLastSeenQuorumVerifier() != null) {
            quorumPeer.setLastSeenQuorumVerifier(config.getLastSeenQuorumVerifier(), false);
        }

        // initialize the zk database
        quorumPeer.initConfigInZKDatabase();
        quorumPeer.setCnxnFactory(cnxnFactory);
        quorumPeer.setSecureCnxnFactory(secureCnxnFactory);
        quorumPeer.setLearnerType(config.getPeerType());
        quorumPeer.setSyncEnabled(config.getSyncEnabled());
        quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());

        // sets quorum sasl authentication configurations
        quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
        if (quorumPeer.isQuorumSaslAuthEnabled()) {
            quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
            quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
            quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
            quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
            quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
        }
        quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);

        // ------------- node configuration done ---------------
        quorumPeer.initialize();

        // start the node
        quorumPeer.start();
        quorumPeer.join();
    } catch (InterruptedException e) {
        // warn, but generally this is ok
        LOG.warn("Quorum Peer interrupted", e);
    }
}
quorumPeer.start(): ZooKeeper first loads data from local disk; any previously persisted state is restored into the in-memory database via loadDataBase(), which reads through the FileTxnSnapLog.
public synchronized void start() {

    // sanity check: if this server's id is not in the peer list, throw
    if (!getView().containsKey(myid)) {
        throw new RuntimeException("My id " + myid + " not in the peer list");
    }

    // load the zk database: restore previously persisted state
    loadDataBase();

    // start the server connection factory
    startServerCnxnFactory();
    try {
        adminServer.start();
    } catch (AdminServerException e) {
        LOG.warn("Problem starting AdminServer", e);
        System.out.println(e);
    }
    // an election starts right after boot; this sets up the required
    // environment, e.g. starts the relevant threads
    startLeaderElection();

    // run the election logic (QuorumPeer.run())
    super.start();
}
After the data is loaded, the flow matches standalone startup: ServerCnxnFactory.start() brings up the NIOServerCnxnFactory and the ZooKeeper service, and finally the AdminServer is started.
Unlike standalone mode, a cluster begins an election immediately after startup, choosing a leader from among all configured ZooKeeper server nodes: startLeaderElection();
Election
ZooKeeper has three roles — Leader, Follower, and Observer — each with its own duties. If the Leader fails, the Followers elect a new one.
The Leader is the cluster's primary node; a cluster has exactly one. It handles ZooKeeper's transactional operations, i.e. the operations that change ZooKeeper's data or state.
Followers serve client read requests and take part in elections. They also handle the transaction commit requests — proposals — issued by the Leader.
Observers exist to scale the cluster's read throughput: they serve read requests, but unlike Followers they neither vote in leader elections nor respond to the Leader's proposals.
Where there are roles there are elections, and where there are elections there is a strategy. ZooKeeper ships three election implementations: LeaderElection, AuthFastLeaderElection, and FastLeaderElection. FastLeaderElection is the current default; the first two are marked @Deprecated.
ZooKeeper node information
serverId: the node id, i.e. the myid configured under dataDir and the id in the server.* entries (0, 1, 2, 3, 4, ...). It does not change after startup.
zxid: the data state id. ZooKeeper increments it on every state update, so it can be read as a globally ordered id: the larger the zxid, the newer the data. It is a 64-bit number whose high 32 bits are the epoch and whose low 32 bits are an incrementing counter (see the sketch after this list).
epoch: the election clock, i.e. the election round; it is incremented each time an election takes place.
ServerState: the node's role state, one of LOOKING, FOLLOWING, LEADING, and OBSERVING, each corresponding to a role. A node taking part in an election is in the LOOKING state.
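To make the zxid layout concrete, the epoch/counter split is plain bit arithmetic (ZooKeeper's own ZxidUtils helper does essentially this); the value below is made up:

public class ZxidDemo {
    public static void main(String[] args) {
        long zxid = 0x500000003L;           // illustrative value
        long epoch = zxid >> 32;            // high 32 bits: the epoch
        long counter = zxid & 0xffffffffL;  // low 32 bits: per-epoch counter
        System.out.println("epoch=" + epoch + ", counter=" + counter); // epoch=5, counter=3
    }
}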
Each ballot cast is a Vote carrying this node information.
Right after startup, ZooKeeper begins the election: it keeps sending its ballot to the other nodes while receiving theirs. Eventually every node holds the same shared pool of ballots and, by applying the same algorithm, elects the same Leader. If the local node wins, it announces itself to every other node; otherwise it reports to the Leader.
ZooKeeper defines the Election interface; lookForLeader() is the election operation.
public interface Election {
    public Vote lookForLeader() throws InterruptedException;
    public void shutdown();
}
In the cluster startup flow above, startLeaderElection() is what finally kicks off the election. It selects the election algorithm and casts a first ballot for the local node itself (believe in yourself, young man!); a Vote holds the voting node's id, its current zxid, and the current epoch. ZooKeeper defaults to the FastLeaderElection algorithm. The QuorumPeer thread is then started and voting begins.
synchronized public void startLeaderElection() {
    try {

        // every node boots in the LOOKING state, so each starts
        // by casting a ballot naming itself as leader
        if (getPeerState() == ServerState.LOOKING) {
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        }
    } catch (IOException e) {
        RuntimeException re = new RuntimeException(e.getMessage());
        re.setStackTrace(e.getStackTrace());
        throw re;
    }

    // if (!getView().containsKey(myid)) {
    //     throw new RuntimeException("My id " + myid + " not in the peer list");
    // }
    if (electionType == 0) {
        try {
            udpSocket = new DatagramSocket(myQuorumAddr.getPort());
            responder = new ResponderThread();
            responder.start();
        } catch (SocketException e) {
            throw new RuntimeException(e);
        }
    }
    // initialize the election algorithm; electionType defaults to 3
    this.electionAlg = createElectionAlgorithm(electionType);
}
FastLeaderElection defines three inner classes — Notification, ToSend, and Messenger — and Messenger in turn defines WorkerReceiver and WorkerSender.
Notification represents a received ballot (the election vote sent by another server), carrying the proposed leader's id, zxid, election epoch, and so on.
ToSend represents a ballot to be sent to other servers, carrying the same fields: the proposed leader's id, zxid, election epoch, etc.
Messenger is the message-handling class, responsible for sending and receiving ballots via its two thread classes, WorkerReceiver and WorkerSender.
The FastLeaderElection class:
public class FastLeaderElection implements Election {
    //..........
    /**
     * Connection manager. Fast leader election uses TCP for
     * communication between peers, and QuorumCnxManager manages
     * such connections.
     */

    QuorumCnxManager manager;
    /*
     * Notification represents a ballot received from another server,
     * carrying the proposed leader's id, zxid, election epoch, etc.
     * Its buildMsg method packs the ballot into a ByteBuffer for sending.
     */
    static public class Notification {
        //..........
    }
    /**
     * Messages that a peer wants to send to other peers.
     * These messages can be both Notifications and Acks
     * of reception of notification.
     */
    /*
     * ToSend represents a ballot sent to other servers, also carrying
     * the proposed leader's id, zxid, election epoch, etc.
     */
    static public class ToSend {
        //..........
    }
    LinkedBlockingQueue<ToSend> sendqueue;
    LinkedBlockingQueue<Notification> recvqueue;

    /**
     * Multi-threaded implementation of message handler. Messenger
     * implements two sub-classes: WorkReceiver and WorkSender. The
     * functionality of each is obvious from the name. Each of these
     * spawns a new thread.
     */
    protected class Messenger {
        /**
         * Receives messages from instance of QuorumCnxManager on
         * method run(), and processes such messages.
         */

        class WorkerReceiver extends ZooKeeperThread {
            //..........
        }
        /**
         * This worker simply dequeues a message to send and
         * and queues it on the manager's queue.
         */

        class WorkerSender extends ZooKeeperThread {
            //..........
        }

        WorkerSender ws;
        WorkerReceiver wr;
        Thread wsThread = null;
        Thread wrThread = null;


    }
    //..........
    QuorumPeer self;
    Messenger messenger;
    AtomicLong logicalclock = new AtomicLong(); /* Election instance */
    long proposedLeader;
    long proposedZxid;
    long proposedEpoch;
    //..........
}
Once the QuorumPeer thread is running, it watches the ServerState; while the local node is in the LOOKING state it runs the election. A ZooKeeper server boots in LOOKING, so it starts electing right away, by calling makeLEStrategy().lookForLeader() — i.e. FastLeaderElection.lookForLeader().
QuorumPeer.run():
public void run() {
    updateThreadName();

    //..........

    try {
        /*
         * Main loop
         */
        while (running) {
            switch (getPeerState()) {
            case LOOKING:
                LOG.info("LOOKING");

                if (Boolean.getBoolean("readonlymode.enabled")) {
                    final ReadOnlyZooKeeperServer roZk =
                        new ReadOnlyZooKeeperServer(logFactory, this, this.zkDb);
                    Thread roZkMgr = new Thread() {
                        public void run() {
                            try {
                                // lower-bound grace period to 2 secs
                                sleep(Math.max(2000, tickTime));
                                if (ServerState.LOOKING.equals(getPeerState())) {
                                    roZk.startup();
                                }
                            } catch (InterruptedException e) {
                                LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                            } catch (Exception e) {
                                LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                            }
                        }
                    };
                    try {
                        roZkMgr.start();
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    } finally {
                        roZkMgr.interrupt();
                        roZk.shutdown();
                    }
                } else {
                    try {
                        reconfigFlagClear();
                        if (shuttingDownLE) {
                            shuttingDownLE = false;
                            startLeaderElection();
                        }
                        setCurrentVote(makeLEStrategy().lookForLeader());
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception", e);
                        setPeerState(ServerState.LOOKING);
                    }
                }
                break;
            case OBSERVING:
                try {
                    LOG.info("OBSERVING");
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    observer.shutdown();
                    setObserver(null);
                    updateServerState();
                }
                break;
            case FOLLOWING:
                try {
                    LOG.info("FOLLOWING");
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    follower.shutdown();
                    setFollower(null);
                    updateServerState();
                }
                break;
            case LEADING:
                LOG.info("LEADING");
                try {
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                } catch (Exception e) {
                    LOG.warn("Unexpected exception", e);
                } finally {
                    if (leader != null) {
                        leader.shutdown("Forcing shutdown");
                        setLeader(null);
                    }
                    updateServerState();
                }
                break;
            }
            start_fle = Time.currentElapsedTime();
        }
    } finally {
        LOG.warn("QuorumPeer main thread exited");
        MBeanRegistry instance = MBeanRegistry.getInstance();
        instance.unregister(jmxQuorumBean);
        instance.unregister(jmxLocalPeerBean);

        for (RemotePeerBean remotePeerBean : jmxRemotePeerBean.values()) {
            instance.unregister(remotePeerBean);
        }

        jmxQuorumBean = null;
        jmxLocalPeerBean = null;
        jmxRemotePeerBean = null;
    }
}
FastLeaderElection.lookForLeader():
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }
    if (self.start_fle == 0) {
        self.start_fle = Time.currentElapsedTime();
    }
    try {
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
        // wait 200 milliseconds
        int notTimeout = finalizeWait;

        synchronized (this) {
            // increment the logical clock (election round)
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id = " + self.getId() +
                ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        // send out our ballot
        sendNotifications();

        /*
         * Loop in which we exchange notifications until we find a leader
         */
        // keep going while we are in the LOOKING state
        while ((self.getPeerState() == ServerState.LOOKING) &&
                (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            // take the next ballot received from the other peers
            Notification n = recvqueue.poll(notTimeout,
                    TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            // no ballot received
            if (n == null) {
                // are we still connected to the cluster?
                if (manager.haveDelivered()) {
                    // still connected: resend our ballot
                    sendNotifications();
                } else {
                    // disconnected: reconnect
                    manager.connectAll();
                }
                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ?
                        tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            } // a ballot arrived: process it
            else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                // the sender's ServerState
                switch (n.state) {
                case LOOKING:
                    // the sender is LOOKING too, so an election is in
                    // progress: process the ballot

                    // If notification > current, replace and send messages out
                    // if our election epoch (round) is behind the sender's,
                    // our round is stale: catch up to theirs
                    if (n.electionEpoch > logicalclock.get()) {
                        // adopt the newer round
                        logicalclock.set(n.electionEpoch);
                        // discard the ballots collected so far
                        recvset.clear();
                        // compare ballots: if ours loses to the received one,
                        // adopt it; otherwise keep proposing ourselves
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            // adopt the received ballot
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            // keep our own ballot
                            updateProposal(getInitId(),
                                    getInitLastLoggedZxid(),
                                    getPeerEpoch());
                        }
                        // send the (possibly updated) ballot
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        // the sender's epoch is behind ours: drop the ballot
                        if (LOG.isDebugEnabled()) {
                            LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                    + Long.toHexString(n.electionEpoch)
                                    + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                            proposedLeader, proposedZxid, proposedEpoch)) {
                        // same epoch: the normal case, everyone in the same round.
                        // if our ballot loses to the received one, adopt it and resend
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid +
                                ", proposed leader=" + n.leader +
                                ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }
                    // archive the valid ballot into the vote pool
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    // tally the votes: can the election finish?
                    if (termPredicate(recvset,
                            new Vote(proposedLeader, proposedZxid,
                                    logicalclock.get(), proposedEpoch))) {
                        // a leader has been agreed on

                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait,
                                TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        // if the winner is this node, become LEADING;
                        // otherwise FOLLOWING/OBSERVING
                        if (n == null) {
                            self.setPeerState((proposedLeader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(proposedLeader,
                                    proposedZxid, proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: " + n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    // another node has already settled on a Leader.
                    // same election round: add the ballot to the pool, check
                    // whether a quorum elects that leader, then checkLeader:
                    /* checkLeader:
                     * [a leader can be elected] and
                     * [(the proposed leader is this node and the rounds match) or
                     *  (the proposed leader is another node, it appears in
                     *   outofelection, and its ServerState is already LEADING)]
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        if (termPredicate(recvset, new Vote(n.leader,
                                n.zxid, n.electionEpoch, n.peerEpoch, n.state))
                                && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());

                            Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     * Only peer epoch is used to check that the votes come
                     * from the same ensemble. This is because there is at
                     * least one corner case in which the ensemble can be
                     * created with inconsistent zxid and election epoch
                     * info. However, given that only one ensemble can be
                     * running at a single point in time and that each
                     * epoch is used only once, using only the epoch to
                     * compare the votes is sufficient.
                     *
                     * @see https://issues.apache.org/jira/browse/ZOOKEEPER-1732
                     */
                    outofelection.put(n.sid, new Vote(n.leader,
                            IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state));
                    // an election in another round has already produced a result:
                    // tally the outofelection pool and verify the leader; if the
                    // election can finish and checkLeader passes, update this node
                    if (termPredicate(outofelection, new Vote(n.leader,
                            IGNOREVALUE, IGNOREVALUE, n.peerEpoch, n.state))
                            && checkLeader(outofelection, n.leader, IGNOREVALUE)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            self.setPeerState((n.leader == self.getId()) ?
                                    ServerState.LEADING : learningState());
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state
                            + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(
                        self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}",
                manager.getConnectionThreadCount());
    }
}
lookForLeader keeps comparing the current proposal against incoming ballots and updating it until a leader emerges. The ballot comparison is done by totalOrderPredicate(), which works as follows:
1. Compare the epoch (election round) first: the node with the larger epoch wins. If the received epoch is larger, the local round is stale and is brought up to date; if it is smaller, the ballot is dropped.
2. Within the same round, the node with the larger zxid wins, since a larger zxid means newer data.
3. Same round and same zxid: the node with the larger serverId wins.
In short, across all three criteria: bigger is better!
totalOrderPredicate():
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * We return true if one of the following three cases hold:
     * 1- New epoch is higher
     * 2- New epoch is the same as current epoch, but new zxid is higher
     * 3- New epoch is the same as current epoch, new zxid is the same
     *  as current zxid, but server id is higher.
     */

    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
Election Flow
The whole election boils down to: keep receiving ballots and comparing them until a leader emerges. Every ZooKeeper node holds its own vote pool, and because all nodes apply the same comparison algorithm, under normal conditions they all converge on the same leader.
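To see why the nodes converge, here is a self-contained sketch that replays the totalOrderPredicate ordering over a made-up pool of ballots; every node running the same loop over the same pool picks the same winner (all values are illustrative):

import java.util.Arrays;
import java.util.List;

public class ElectionDemo {

    static final class Vote {
        final long id, zxid, epoch;
        Vote(long id, long zxid, long epoch) { this.id = id; this.zxid = zxid; this.epoch = epoch; }
    }

    // Same ordering rule as totalOrderPredicate(): epoch, then zxid, then server id.
    static boolean supersedes(Vote n, Vote cur) {
        return n.epoch > cur.epoch
                || (n.epoch == cur.epoch
                    && (n.zxid > cur.zxid || (n.zxid == cur.zxid && n.id > cur.id)));
    }

    public static void main(String[] args) {
        // Three LOOKING nodes, each initially voting for itself.
        List<Vote> received = Arrays.asList(
                new Vote(0, 0x100000008L, 1),
                new Vote(1, 0x10000000AL, 1), // has the newest data, so it should win
                new Vote(2, 0x100000008L, 1));

        Vote proposal = received.get(0); // pretend we are server 0
        for (Vote v : received) {
            if (supersedes(v, proposal)) {
                proposal = v; // adopt the stronger ballot, as lookForLeader() does
            }
        }
        System.out.println("agreed leader id = " + proposal.id); // prints 1
    }
}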
end;
This article covers only part of the ZooKeeper source — startup and leader election. The core transaction processing and the ZAB consistency protocol will be covered in a follow-up. If anything here is wrong or imprecise, corrections in the comments are welcome.
ZooKeeper GitHub: https://github.com/apache/zookeeper/
Apache ZooKeeper releases: https://zookeeper.apache.org/releases.html
Some of the in-code comments are adapted from: Lagou - Zimu