Hadoop 3.2.1 [HDFS] Source Code Analysis: Standby Namenode


I. Preface

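In an HA HDFS cluster, two Namenode instances run at the same time. One is the Active Namenode, which serves all client requests in real time. The other is the Standby Namenode, whose namespace is kept in sync with that of the Active Namenode. When the Active Namenode fails, the Standby Namenode can therefore switch to the Active state right away.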


To keep the Standby Namenode's namespace in sync with the Active Namenode's, both Namenodes communicate with a group of JournalNode daemons.

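Whenever the Active Namenode performs an operation that modifies the namespace, it must persist the resulting editlog to at least N-(N-1)/2 JournalNodes (integer division, that is, a majority) before the modification is considered safe. In other words, if N JournalNode processes are started under the HA setup, the JournalNode ensemble tolerates at most (N-1)/2 failed processes while still guaranteeing that the editlog is written completely. With 3 JournalNodes, at most 1 may fail; with 5 JournalNodes, at most 2 may fail. The Standby Namenode watches for changes to the editlog: it reads the editlog from the JournalNodes and applies it to its own namespace. Once the Active Namenode fails, the Standby Namenode makes sure it has read the entire editlog from the JournalNodes and then switches to the Active state.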
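
The quorum arithmetic above can be checked with a few lines of Java. This is only a sketch: `JournalQuorum`, `majority` and `toleratedFailures` are hypothetical names introduced for illustration and are not part of Hadoop.

```java
/** Quorum arithmetic for a JournalNode ensemble (illustrative helper, not a Hadoop class). */
final class JournalQuorum {
    private JournalQuorum() {}

    /** Minimum number of JournalNodes that must persist an edit: N - (N-1)/2, integer division. */
    static int majority(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one JournalNode");
        }
        return n - (n - 1) / 2;
    }

    /** Number of JournalNodes that may be down while writes can still reach a majority. */
    static int toleratedFailures(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("N=%d majority=%d tolerated=%d%n",
                n, majority(n), toleratedFailures(n));
        }
    }
}
```

Note that an even N does not buy extra fault tolerance: 4 JournalNodes need 3 acknowledgements and still tolerate only 1 failure, which is why odd ensemble sizes are the usual choice.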

Reading the entire editlog ensures that, before the failover takes place, the Standby Namenode's namespace is fully synchronized with that of the Active Namenode.

The Standby Namenode always holds an up-to-date version of the namespace, because it continuously merges the editlog it reads into its current namespace.

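The Standby Namenode therefore only needs to check whether either of the two conditions that trigger a checkpoint is met (the number of uncheckpointed transactions, or the time elapsed since the last checkpoint). If so, it writes its namespace to a new fsimage file and sends that fsimage file back to the Active Namenode over HTTP.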


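■ The Standby Namenode checks whether either of the two conditions that trigger a checkpoint is met.
■ The Standby Namenode saves its current namespace to the file fsimage.ckpt_txid, where txid is the id of the latest transaction recorded in the editlog. It then writes the MD5 checksum of the fsimage file and finally renames fsimage.ckpt_txid to fsimage_txid. While this is running, other Standby Namenode operations are blocked, for example a failover that is needed because the Active Namenode has failed, or access to the Standby Namenode's web UI. Note that operations on the Active Namenode, such as listing, reading and writing files, are not affected.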
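■ The Standby Namenode transfers the new fsimage to the ImageServlet of the Active Namenode. In older releases it sent an HTTP GET request to /getimage?putimage=1 whose URL carried the transaction id of the new fsimage together with the port and IP address from which the file could be downloaded, and the Active Namenode then issued its own HTTP GET request to the Standby Namenode's ImageServlet to fetch the file. Recent releases, including Hadoop 3.2.1, instead have the Standby Namenode upload the file directly with an HTTP PUT request; this is what TransferFsImage.uploadImageFromStorage() does in doCheckpoint().
■ The Active Namenode first stores the received file as fsimage.ckpt_*, then creates the MD5 checksum, and finally renames fsimage.ckpt_* to fsimage_*.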
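
The write, checksum, rename sequence in the steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Hadoop's implementation: `CheckpointFileSketch` and its methods are hypothetical, the `.md5` sidecar file name and its one-line format are assumptions, and the txid is not padded the way real fsimage file names are.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative only: write to a temporary name, record a checksum, then rename. */
final class CheckpointFileSketch {
    private CheckpointFileSketch() {}

    /** Returns the MD5 digest of the given bytes as lowercase hex. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is a mandatory JDK algorithm", e);
        }
    }

    /** Writes fsimage.ckpt_<txid>, stores its MD5, then renames it to fsimage_<txid>. */
    static Path saveCheckpoint(Path dir, long txid, byte[] image) {
        try {
            Path tmp = dir.resolve("fsimage.ckpt_" + txid);
            Path fin = dir.resolve("fsimage_" + txid);
            Files.write(tmp, image);
            // Assumed sidecar format: "<hex digest> *<file name>"
            String line = md5Hex(image) + " *" + fin.getFileName() + "\n";
            Files.write(dir.resolve(fin.getFileName().toString() + ".md5"),
                line.getBytes(StandardCharsets.UTF_8));
            // The rename is the last step, so readers never see a half-written fsimage_<txid>.
            Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
            return fin;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the sequence in a fresh temp directory; true if only the final files remain. */
    static boolean runDemo() {
        try {
            Path dir = Files.createTempDirectory("ckpt-sketch");
            Path fin = saveCheckpoint(dir, 42L, "namespace-bytes".getBytes(StandardCharsets.UTF_8));
            return Files.exists(fin)
                && Files.exists(dir.resolve("fsimage_42.md5"))
                && !Files.exists(dir.resolve("fsimage.ckpt_42"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing under a temporary name and renaming last means that a crash in the middle of a checkpoint leaves at most a stray fsimage.ckpt_* file behind and never a truncated fsimage_* file.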


III. FSNamesystem#startStandbyServices

When the Namenode starts it loads FSNamesystem, and FSNamesystem's startStandbyServices method creates and starts a StandbyCheckpointer, which handles the checkpoint operation.

  /**
   * Start services required in standby or observer state
   *
   * @throws IOException
   */
  void startStandbyServices(final Configuration conf, boolean isObserver)
      throws IOException {
    LOG.info("Starting services required for " +
        (isObserver ? "observer" : "standby") + " state");
    if (!getFSImage().editLog.isOpenForRead()) {
      // During startup, we're already open for read.
      getFSImage().editLog.initSharedJournalsForRead();
    }
    blockManager.setPostponeBlocksFromFuture(true);

    // Disable quota checks while in standby.
    dir.disableQuotaChecks();
    editLogTailer = new EditLogTailer(this, conf);
    editLogTailer.start();
    if (!isObserver && standbyShouldCheckpoint) {
      standbyCheckpointer = new StandbyCheckpointer(conf, this);
      standbyCheckpointer.start();
    }
  }

IV. CheckpointerThread

When StandbyCheckpointer's start method is called, it starts the CheckpointerThread thread, and that thread's run method in turn calls doWork.

    private void doWork() {

      // Polling interval in milliseconds
      // (dfs.namenode.checkpoint.check.period, 60 seconds by default)
      final long checkPeriod = 1000 * checkpointConf.getCheckPeriod();
      System.out.println("StandbyCheckpointer#doWork=>checkPeriod : "+ checkPeriod);
      // Reset checkpoint time so that we don't always checkpoint
      // on startup.
      lastCheckpointTime = monotonicNow();
      lastUploadTime = monotonicNow();
      while (shouldRun) {

        boolean needRollbackCheckpoint = namesystem.isNeedRollbackFsImage();

        if (!needRollbackCheckpoint) {
          try {
            Thread.sleep(checkPeriod);
          } catch (InterruptedException ie) {
          }
          if (!shouldRun) {
            break;
          }
        }
        try {
          // We may have lost our ticket since last checkpoint, log in again, just in case
          if (UserGroupInformation.isSecurityEnabled()) {
            UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
          }

          final long now = monotonicNow();
          final long uncheckpointed = countUncheckpointedTxns();

          final long secsSinceLast = (now - lastCheckpointTime) / 1000;

          // if we need a rollback checkpoint, always attempt to checkpoint
          boolean needCheckpoint = needRollbackCheckpoint;

          if (needCheckpoint) {
            LOG.info("Triggering a rollback fsimage for rolling upgrade.");
          } else if (uncheckpointed >= checkpointConf.getTxnCount()) {

            // Checkpoint when the gap between the last txid written to the
            // JournalNodes and the txid of the last checkpoint reaches
            // dfs.namenode.checkpoint.txns (1,000,000 by default)

            LOG.info("Triggering checkpoint because there have been {} txns " +
                "since the last checkpoint, " +
                "which exceeds the configured threshold {}",
                uncheckpointed, checkpointConf.getTxnCount());
            needCheckpoint = true;
          } else if (secsSinceLast >= checkpointConf.getPeriod()) {
            LOG.info("Triggering checkpoint because it has been {} seconds " +
                "since the last checkpoint, which exceeds the configured " +
                "interval {}", secsSinceLast, checkpointConf.getPeriod());

            // Checkpoint when the elapsed time reaches
            // dfs.namenode.checkpoint.period (one hour by default)
            needCheckpoint = true;
          }

          // A trigger condition is met: call doCheckpoint() to run the checkpoint
          if (needCheckpoint) {
            synchronized (cancelLock) {
              if (now < preventCheckpointsUntil) {
                LOG.info("But skipping this checkpoint since we are about to failover!");
                canceledCount++;
                continue;
              }
              assert canceler == null;
              canceler = new Canceler();
            }

            // on all nodes, we build the checkpoint. However, we only ship the checkpoint if have a
            // rollback request, are the checkpointer, are outside the quiet period.
            final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
            boolean sendRequest = isPrimaryCheckPointer || secsSinceLastUpload >= checkpointConf.getQuietPeriod();
            doCheckpoint(sendRequest);

            // reset needRollbackCheckpoint to false only when we finish a ckpt
            // for rollback image
            if (needRollbackCheckpoint
                && namesystem.getFSImage().hasRollbackFSImage()) {
              namesystem.setCreatedRollbackImages(true);
              namesystem.setNeedRollbackFsImage(false);
            }
            lastCheckpointTime = now;
            LOG.info("Checkpoint finished successfully.");
          }
        } catch (SaveNamespaceCancelledException ce) {
          LOG.info("Checkpoint was cancelled: {}", ce.getMessage());
          canceledCount++;
        } catch (InterruptedException ie) {
          LOG.info("Interrupted during checkpointing", ie);
          // Probably requested shutdown.
          continue;
        } catch (Throwable t) {
          LOG.error("Exception in doCheckpoint", t);
        } finally {
          synchronized (cancelLock) {
            canceler = null;
          }
        }
      }
    }
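
The decision logic inside the loop boils down to two small predicates. The sketch below restates them as pure functions so the order of the checks is easy to see. `CheckpointTrigger`, `Reason`, `decide` and `shouldUpload` are hypothetical names, and the threshold values used in the usage example are illustrative.

```java
/** Illustrative restatement of the trigger checks in doWork(); not a Hadoop class. */
final class CheckpointTrigger {
    private CheckpointTrigger() {}

    enum Reason { ROLLBACK, TXN_COUNT, PERIOD, NONE }

    /**
     * Mirrors the if / else-if chain: a pending rollback image always wins,
     * then the transaction-count threshold, then the elapsed-time threshold.
     */
    static Reason decide(boolean needRollback,
                         long uncheckpointedTxns, long txnThreshold,
                         long secsSinceLast, long periodSecs) {
        if (needRollback) {
            return Reason.ROLLBACK;
        }
        if (uncheckpointedTxns >= txnThreshold) {
            return Reason.TXN_COUNT;
        }
        if (secsSinceLast >= periodSecs) {
            return Reason.PERIOD;
        }
        return Reason.NONE;
    }

    /** The image is uploaded if this node is the primary checkpointer or the quiet period has passed. */
    static boolean shouldUpload(boolean isPrimaryCheckpointer,
                                long secsSinceLastUpload, long quietPeriodSecs) {
        return isPrimaryCheckpointer || secsSinceLastUpload >= quietPeriodSecs;
    }

    public static void main(String[] args) {
        // 1,000,000 txns and 3600 s are the defaults mentioned in the comments above.
        System.out.println(decide(false, 1_200_000L, 1_000_000L, 120L, 3600L));
        System.out.println(decide(false, 10L, 1_000_000L, 4000L, 3600L));
        System.out.println(shouldUpload(false, 100L, 5400L));
    }
}
```

Keeping the upload decision separate from the checkpoint decision matches the comment in doWork(): every standby builds the checkpoint locally, but only the primary checkpointer, or a node that has been quiet for long enough, ships it to the active.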

VI. doCheckpoint(sendRequest)

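The entire checkpoint logic is implemented in the doCheckpoint() method. It first obtains prevCheckpointTxId, the txid of the currently saved fsimage, and then thisCheckpointTxId, the txid of the most recently applied editlog. A checkpoint is only necessary when thisCheckpointTxId is greater than prevCheckpointTxId, that is, when the namespace contains changes that have not yet been saved to a new fsimage file. After this check, doCheckpoint() calls saveNamespace() to save the latest namespace to an fsimage file. It then hands the upload to a separate thread, which sends the newly created fsimage file to the Active Namenode over HTTP.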

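  /**
   * The entire checkpoint logic is implemented in doCheckpoint().
   * The method first obtains prevCheckpointTxId, the txid of the currently
   * saved fsimage, and then thisCheckpointTxId, the txid of the most
   * recently applied editlog.
   * A checkpoint is only necessary when thisCheckpointTxId is greater than
   * prevCheckpointTxId, that is, when the namespace contains changes that
   * have not yet been saved to a new fsimage file.
   * After this check, doCheckpoint() calls saveNamespace() to save the
   * latest namespace to an fsimage file.
   * It then uses a separate thread to upload the new fsimage file to the
   * Active Namenode over HTTP.
   * @param sendCheckpoint
   * @throws InterruptedException
   * @throws IOException
   */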
  private void doCheckpoint(boolean sendCheckpoint) throws InterruptedException, IOException {
    assert canceler != null;
    final long txid;
    final NameNodeFile imageType;
    // Acquire cpLock to make sure no one is modifying the name system.
    // It does not need the full namesystem write lock, since the only thing
    // that modifies namesystem on standby node is edit log replaying.
    namesystem.cpLockInterruptibly();
    try {
      assert namesystem.getEditLog().isOpenForRead() :
        "Standby Checkpointer should only attempt a checkpoint when " +
        "NN is in standby mode, but the edit logs are in an unexpected state";

      // Get the FSImage object of this Standby Namenode
      FSImage img = namesystem.getFSImage();

      // txid stored in the most recent fsimage, i.e. the txid of the previous checkpoint
      long prevCheckpointTxId = img.getStorage().getMostRecentCheckpointTxId();

      // Latest txid of the current namespace, i.e. the last editlog txid applied
      long thisCheckpointTxId = img.getCorrectLastAppliedOrWrittenTxId();
      assert thisCheckpointTxId >= prevCheckpointTxId;

      // If they are equal the fsimage is already up to date; no checkpoint is needed
      if (thisCheckpointTxId == prevCheckpointTxId) {
        LOG.info("A checkpoint was triggered but the Standby Node has not " +
            "received any transactions since the last checkpoint at txid {}. " +
            "Skipping...", thisCheckpointTxId);
        return;
      }

      if (namesystem.isRollingUpgrade()
          && !namesystem.getFSImage().hasRollbackFSImage()) {

        // During a rolling upgrade, create the fsimage_rollback file
        // if we will do rolling upgrade but have not created the rollback image
        // yet, name this checkpoint as fsimage_rollback
        imageType = NameNodeFile.IMAGE_ROLLBACK;
      } else {

        imageType = NameNodeFile.IMAGE;
      }

      img.saveNamespace(namesystem, imageType, canceler);

      txid = img.getStorage().getMostRecentCheckpointTxId();
      assert txid == thisCheckpointTxId : "expected to save checkpoint at txid=" +
          thisCheckpointTxId + " but instead saved at txid=" + txid;

      // Save the legacy OIV image, if the output dir is defined.
      String outputDir = checkpointConf.getLegacyOivImageDir();
      if (outputDir != null && !outputDir.isEmpty()) {
        try {
          img.saveLegacyOIVImage(namesystem, outputDir, canceler);
        } catch (IOException ioe) {
          LOG.warn("Exception encountered while saving legacy OIV image; "
                  + "continuing with other checkpointing steps", ioe);
        }
      }
    } finally {
      namesystem.cpUnlock();
    }

    //early exit if we shouldn't actually send the checkpoint to the ANN
    if (!sendCheckpoint) {
      return;
    }

    // Use a separate thread to upload the fsimage to the Active Namenode over HTTP
    // Upload the saved checkpoint back to the active
    // Do this in a separate thread to avoid blocking transition to active, but don't allow more
    // than the expected number of tasks to run or queue up
    // See HDFS-4816
    ExecutorService executor = new ThreadPoolExecutor(0, activeNNAddresses.size(), 100,
        TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(activeNNAddresses.size()),
        uploadThreadFactory);
    // for right now, just match the upload to the nn address by convention. There is no need to
    // directly tie them together by adding a pair class.
    List<Future<TransferFsImage.TransferResult>> uploads =
        new ArrayList<Future<TransferFsImage.TransferResult>>();
    for (final URL activeNNAddress : activeNNAddresses) {
      Future<TransferFsImage.TransferResult> upload =
          executor.submit(new Callable<TransferFsImage.TransferResult>() {
            @Override
            public TransferFsImage.TransferResult call()
                throws IOException, InterruptedException {

              return TransferFsImage.uploadImageFromStorage(activeNNAddress, conf, namesystem
                  .getFSImage().getStorage(), imageType, txid, canceler);
            }
          });
      uploads.add(upload);
    }

    InterruptedException ie = null;
    IOException ioe= null;
    int i = 0;
    boolean success = false;
    for (; i < uploads.size(); i++) {
      Future<TransferFsImage.TransferResult> upload = uploads.get(i);
      try {
        // TODO should there be some smarts here about retries nodes that are not the active NN?
        if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
          success = true;
          //avoid getting the rest of the results - we don't care since we had a successful upload
          break;
        }

      } catch (ExecutionException e) {
        ioe = new IOException("Exception during image upload", e);
        break;
      } catch (InterruptedException e) {
        ie = e;
        break;
      }
    }
    if (ie == null && ioe == null) {
      //Update only when response from remote about success or
      lastUploadTime = monotonicNow();
      // we are primary if we successfully updated the ANN
      this.isPrimaryCheckPointer = success;
    }
    // cleaner than copying code for multiple catch statements and better than catching all
    // exceptions, so we just handle the ones we expect.
    if (ie != null || ioe != null) {

      // cancel the rest of the tasks, and close the pool
      for (; i < uploads.size(); i++) {
        Future<TransferFsImage.TransferResult> upload = uploads.get(i);
        // The background thread may be blocked waiting in the throttler, so
        // interrupt it.
        upload.cancel(true);
      }

      // shutdown so we interrupt anything running and don't start anything new
      executor.shutdownNow();
      // this is a good bit longer than the thread timeout, just to make sure all the threads
      // that are not doing any work also stop
      executor.awaitTermination(500, TimeUnit.MILLISECONDS);

      // re-throw the exception we got, since one of these two must be non-null
      if (ie != null) {
        throw ie;
      } else if (ioe != null) {
        throw ioe;
      }
    }
  }
