1. Introduction

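In an HA HDFS cluster, two Namenode instances run at the same time. One is the Active Namenode, which serves all client requests in real time; the other is the Standby Namenode, whose namespace is kept fully consistent with that of the Active Namenode. When the Active Namenode fails, the Standby Namenode can therefore switch to the Active state immediately.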

2. The checkpoint operation

To keep the Standby Namenode's namespace in sync with the Active Namenode, both of them communicate with the JournalNode daemons.

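When the Active Namenode performs any operation that modifies the namespace, it must persist the resulting editlog records to at least N-(N-1)/2 JournalNodes (integer division, i.e. a majority) before the modification is considered safe. In other words, if N JournalNode processes are started under the HA setup, the JournalNode cluster can tolerate at most (N-1)/2 of them failing while still guaranteeing that the editlog is written completely. For example, with 3 JournalNodes at most 1 may fail; with 5 JournalNodes at most 2 may fail. The Standby Namenode, for its part, watches for changes to the editlog: it reads the editlog files from the JournalNodes and applies them to its own namespace. Once the Active Namenode fails, the Standby Namenode makes sure it has read all editlog files from the JournalNodes and then switches to the Active state.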
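
The quorum arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names (`JournalQuorum`, `requiredAcks`, `toleratedFailures`) are made up for this example and are not part of Hadoop.

```java
public final class JournalQuorum {
    private JournalQuorum() {}

    /** Minimum number of JournalNodes that must persist an edit: N - (N-1)/2, integer division. */
    public static int requiredAcks(int journalNodes) {
        if (journalNodes < 1) {
            throw new IllegalArgumentException("need at least one JournalNode");
        }
        return journalNodes - (journalNodes - 1) / 2;
    }

    /** Maximum number of JournalNodes that may be down: (N-1)/2, integer division. */
    public static int toleratedFailures(int journalNodes) {
        if (journalNodes < 1) {
            throw new IllegalArgumentException("need at least one JournalNode");
        }
        return (journalNodes - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5}) {
            System.out.printf("N=%d requiredAcks=%d toleratedFailures=%d%n",
                n, requiredAcks(n), toleratedFailures(n));
        }
    }
}
```

Note that an even N buys nothing over N-1: 4 JournalNodes still tolerate only 1 failure, which is why odd cluster sizes are the usual choice.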

Reading every editlog file ensures that, before the failover takes place, the Standby Namenode's namespace state is fully in sync with that of the Active Namenode.

The Standby Namenode always holds an up-to-date version of the namespace, because it continuously merges the editlog files it reads into its current namespace.

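The Standby Namenode only has to check whether either of the two conditions that trigger a checkpoint is met. If so, it writes its namespace to a new fsimage file and sends that fsimage file back to the Active Namenode over HTTP.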

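■ The Standby Namenode checks whether either of the two conditions that trigger a checkpoint is met.
■ The Standby Namenode saves the current namespace to a file named fsimage.ckpt_txid, where txid is the transaction id recorded in the most recent editlog. It then writes the MD5 checksum of the fsimage file and finally renames fsimage.ckpt_txid to fsimage_txid. While this operation runs, other Standby Namenode operations are blocked, for example the active/standby failover needed when the Active Namenode fails, or access to the Standby Namenode's web UI. Note that operations on the Active Namenode, such as listing, reading and writing files, are not affected.
■ The Standby Namenode sends an HTTP GET request, /getimage?putimage=1, to the ImageServlet of the Active Namenode. The request URL carries the transaction id of the new fsimage file as well as the port and IP address from which the Standby Namenode serves the download.
■ Using the information supplied by the Standby Namenode, the Active Namenode issues an HTTP GET request to the Standby Namenode's ImageServlet to download the fsimage file. The Active Namenode first names the downloaded file fsimage.ckpt_*, then creates the MD5 checksum, and finally renames fsimage.ckpt_* to fsimage_*.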
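
The "write to a temporary name, record the MD5, then rename" pattern used in the steps above can be sketched with plain JDK calls. This is a simplified illustration, not Hadoop's implementation: the class `ImageFinalizer`, its methods, the unpadded file names and the layout of the `.md5` file are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ImageFinalizer {
    private ImageFinalizer() {}

    /** Hex-encoded MD5 digest of the given bytes. */
    public static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /**
     * Writes the image under a temporary name, stores its checksum,
     * and only then renames it to the final name, so a reader never
     * sees a half-written fsimage_<txid> file.
     */
    public static Path saveImage(Path dir, long txid, byte[] image)
            throws IOException, NoSuchAlgorithmException {
        Path tmp = dir.resolve("fsimage.ckpt_" + txid);   // in-progress name
        Path fin = dir.resolve("fsimage_" + txid);        // final name
        Files.write(tmp, image);

        // Hypothetical checksum file layout: "<md5> *<file name>"
        String line = md5Hex(image) + " *" + fin.getFileName() + "\n";
        Files.write(dir.resolve(fin.getFileName() + ".md5"),
            line.getBytes(StandardCharsets.UTF_8));

        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
        return fin;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt-demo");
        Path saved = saveImage(dir, 42L, "namespace-bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("saved " + saved);
    }
}
```

Because the rename is the last step, a crash at any earlier point leaves only a fsimage.ckpt_* file behind, which can be discarded safely.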

三. FSNamesystem#startStandbyServices

When the Namenode starts, it loads FSNamesystem. Inside FSNamesystem, startStandbyServices starts a StandbyCheckpointer, which is responsible for the checkpoint operation.

 /**
   * Start services required in standby or observer state
   * 
   * @throws IOException
   */
  void startStandbyServices(final Configuration conf, boolean isObserver)
      throws IOException {
    LOG.info("Starting services required for " +
        (isObserver ? "observer" : "standby") + " state");
    if (!getFSImage().editLog.isOpenForRead()) {
      // During startup, we're already open for read.
      getFSImage().editLog.initSharedJournalsForRead();
    }
    blockManager.setPostponeBlocksFromFuture(true);

    // Disable quota checks while in standby.
    dir.disableQuotaChecks();
    editLogTailer = new EditLogTailer(this, conf);
    editLogTailer.start();
    if (!isObserver && standbyShouldCheckpoint) {
      standbyCheckpointer = new StandbyCheckpointer(conf, this);
      standbyCheckpointer.start();
    }
  }

四. CheckpointerThread

When StandbyCheckpointer's start method is called, it launches the CheckpointerThread; the thread's run method in turn calls doWork.

private void doWork() {

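      // Polling interval in milliseconds: how often the trigger conditions are
      // re-checked (dfs.namenode.checkpoint.check.period, 60 seconds by default)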
      final long checkPeriod = 1000 * checkpointConf.getCheckPeriod();
      System.out.println("StandbyCheckpointer#doWork=>checkPeriod : "+ checkPeriod);
      // Reset checkpoint time so that we don't always checkpoint
      // on startup.
      lastCheckpointTime = monotonicNow();
      lastUploadTime = monotonicNow();
      while (shouldRun) {

        boolean needRollbackCheckpoint = namesystem.isNeedRollbackFsImage();

        if (!needRollbackCheckpoint) {
          try {
            Thread.sleep(checkPeriod);
          } catch (InterruptedException ie) {
          }
          if (!shouldRun) {
            break;
          }
        }
        try {
          // We may have lost our ticket since last checkpoint, log in again, just in case
          if (UserGroupInformation.isSecurityEnabled()) {
            UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
          }


          final long now = monotonicNow();
          // Difference between the last txid written to the JournalNodes
          // and the txid of the most recent checkpoint
          final long uncheckpointed = countUncheckpointedTxns();

          // Seconds elapsed since the previous checkpoint
          final long secsSinceLast = (now - lastCheckpointTime) / 1000;

          // if we need a rollback checkpoint, always attempt to checkpoint
          boolean needCheckpoint = needRollbackCheckpoint;

          if (needCheckpoint) {
            LOG.info("Triggering a rollback fsimage for rolling upgrade.");
          } else if (uncheckpointed >= checkpointConf.getTxnCount()) {

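            // First trigger condition:
            // the difference between the last txid written to the JournalNodes and the
            // txid of the most recent checkpoint is greater than or equal to
            // dfs.namenode.checkpoint.txns (1,000,000 by default)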

            LOG.info("Triggering checkpoint because there have been {} txns " +
                "since the last checkpoint, " +
                "which exceeds the configured threshold {}",
                uncheckpointed, checkpointConf.getTxnCount());
            needCheckpoint = true;
          } else if (secsSinceLast >= checkpointConf.getPeriod()) {
            LOG.info("Triggering checkpoint because it has been {} seconds " +
                "since the last checkpoint, which exceeds the configured " +
                "interval {}", secsSinceLast, checkpointConf.getPeriod());

            // Second trigger condition:
            // the time since the last checkpoint is greater than or equal to
            // dfs.namenode.checkpoint.period (one hour)
            needCheckpoint = true;
          }

          // A trigger condition is met: call doCheckpoint() to perform the checkpoint
          if (needCheckpoint) {
            synchronized (cancelLock) {
              if (now < preventCheckpointsUntil) {
                LOG.info("But skipping this checkpoint since we are about to failover!");
                canceledCount++;
                continue;
              }
              assert canceler == null;
              canceler = new Canceler();
            }

            // on all nodes, we build the checkpoint. However, we only ship the checkpoint if have a
            // rollback request, are the checkpointer, are outside the quiet period.
            final long secsSinceLastUpload = (now - lastUploadTime) / 1000;
            boolean sendRequest = isPrimaryCheckPointer || secsSinceLastUpload >= checkpointConf.getQuietPeriod();


            doCheckpoint(sendRequest);

            // reset needRollbackCheckpoint to false only when we finish a ckpt
            // for rollback image
            if (needRollbackCheckpoint
                && namesystem.getFSImage().hasRollbackFSImage()) {
              namesystem.setCreatedRollbackImages(true);
              namesystem.setNeedRollbackFsImage(false);
            }
            lastCheckpointTime = now;
            LOG.info("Checkpoint finished successfully.");
          }
        } catch (SaveNamespaceCancelledException ce) {
          LOG.info("Checkpoint was cancelled: {}", ce.getMessage());
          canceledCount++;
        } catch (InterruptedException ie) {
          LOG.info("Interrupted during checkpointing", ie);
          // Probably requested shutdown.
          continue;
        } catch (Throwable t) {
          LOG.error("Exception in doCheckpoint", t);
        } finally {
          synchronized (cancelLock) {
            canceler = null;
          }
        }
      }
    }
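
Stripped of locking and error handling, the decision made inside the loop above reduces to three checks in a fixed order. The following sketch restates it as a pure function; the names `CheckpointTrigger`, `Reason` and `decide` are hypothetical and exist only for this illustration.

```java
public final class CheckpointTrigger {
    private CheckpointTrigger() {}

    /** Why a checkpoint is (or is not) triggered. */
    public enum Reason { ROLLBACK, TXN_COUNT, PERIOD, NONE }

    /**
     * Mirrors the order of the checks in doWork(): a pending rollback image
     * always wins, then the transaction-count threshold, then the time period.
     */
    public static Reason decide(boolean needRollbackCheckpoint,
                                long uncheckpointedTxns, long txnThreshold,
                                long secsSinceLast, long periodSecs) {
        if (needRollbackCheckpoint) {
            return Reason.ROLLBACK;
        }
        if (uncheckpointedTxns >= txnThreshold) {
            return Reason.TXN_COUNT;
        }
        if (secsSinceLast >= periodSecs) {
            return Reason.PERIOD;
        }
        return Reason.NONE;
    }

    public static void main(String[] args) {
        long txns = 1_000_000L;   // dfs.namenode.checkpoint.txns default
        long period = 3_600L;     // dfs.namenode.checkpoint.period default, in seconds
        System.out.println(decide(false, 1_200_000L, txns, 120L, period));
        System.out.println(decide(false, 500L, txns, 4_000L, period));
        System.out.println(decide(false, 500L, txns, 120L, period));
    }
}
```

Either threshold on its own is enough to trigger a checkpoint, so a busy cluster checkpoints on transaction count while an idle one still checkpoints once per period.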

5. doCheckpoint(sendRequest)

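The whole checkpoint logic is implemented in doCheckpoint(). The method first obtains prevCheckpointTxId from the currently saved fsimage and then thisCheckpointTxId from the most recently applied editlog. A checkpoint is only necessary when thisCheckpointTxId is greater than prevCheckpointTxId, that is, when the namespace has been updated but the update has not yet been saved to a new fsimage file. After this check, doCheckpoint() calls saveNamespace() to save the latest namespace to an fsimage file. It then creates a thread that uploads the newly generated fsimage file to the Active Namenode over HTTP.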

/**
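   * The whole checkpoint logic is implemented in doCheckpoint().
   *
   * doCheckpoint() first obtains prevCheckpointTxId from the currently saved fsimage,
   * then thisCheckpointTxId from the most recently applied editlog.
   * A checkpoint is only necessary when thisCheckpointTxId is greater than
   * prevCheckpointTxId, that is, when the namespace has been updated
   * but the update has not yet been saved to a new fsimage file.
   *
   * After this check,
   * doCheckpoint() calls saveNamespace() to save the latest namespace to an fsimage file.
   *
   * It then creates a thread that uploads the new fsimage file to the Active Namenode over HTTP.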
   *
   * @param sendCheckpoint
   * @throws InterruptedException
   * @throws IOException
   */
  private void doCheckpoint(boolean sendCheckpoint) throws InterruptedException, IOException {
    assert canceler != null;
    final long txid;
    final NameNodeFile imageType;
    // Acquire cpLock to make sure no one is modifying the name system.
    // It does not need the full namesystem write lock, since the only thing
    // that modifies namesystem on standby node is edit log replaying.
    namesystem.cpLockInterruptibly();
    try {
      assert namesystem.getEditLog().isOpenForRead() :
        "Standby Checkpointer should only attempt a checkpoint when " +
        "NN is in standby mode, but the edit logs are in an unexpected state";

      // Get the latest FSImage object held by this Standby Namenode
      FSImage img = namesystem.getFSImage();

      // txid stored in the fsimage, i.e. the txid saved by the previous checkpoint
      long prevCheckpointTxId = img.getStorage().getMostRecentCheckpointTxId();

      // Latest txid of the current namespace, i.e. the latest txid received from the editlog
      long thisCheckpointTxId = img.getCorrectLastAppliedOrWrittenTxId();
      assert thisCheckpointTxId >= prevCheckpointTxId;

      // If they are equal, no checkpoint is needed: the current fsimage is already up to date
      if (thisCheckpointTxId == prevCheckpointTxId) {
        LOG.info("A checkpoint was triggered but the Standby Node has not " +
            "received any transactions since the last checkpoint at txid {}. " +
            "Skipping...", thisCheckpointTxId);
        return;
      }

      if (namesystem.isRollingUpgrade()
          && !namesystem.getFSImage().hasRollbackFSImage()) {

        // The Namenode is in a rolling upgrade: create an fsimage_rollback file
        // if we will do rolling upgrade but have not created the rollback image
        // yet, name this checkpoint as fsimage_rollback
        imageType = NameNodeFile.IMAGE_ROLLBACK;
      } else {

        // In the normal case, create a regular fsimage file
        imageType = NameNodeFile.IMAGE;
      }

      // Call saveNamespace() to save the current namespace to a new file
      img.saveNamespace(namesystem, imageType, canceler);


      txid = img.getStorage().getMostRecentCheckpointTxId();
      assert txid == thisCheckpointTxId : "expected to save checkpoint at txid=" +
          thisCheckpointTxId + " but instead saved at txid=" + txid;

      // Save the legacy OIV image, if the output dir is defined.
      String outputDir = checkpointConf.getLegacyOivImageDir();
      if (outputDir != null && !outputDir.isEmpty()) {
        try {
          img.saveLegacyOIVImage(namesystem, outputDir, canceler);
        } catch (IOException ioe) {
          LOG.warn("Exception encountered while saving legacy OIV image; "
                  + "continuing with other checkpointing steps", ioe);
        }
      }
    } finally {
      namesystem.cpUnlock();
    }

    //early exit if we shouldn't actually send the checkpoint to the ANN
    if(!sendCheckpoint){
      return;
    }


    // Create a thread that uploads the fsimage to the Active Namenode over HTTP
    // Upload the saved checkpoint back to the active
    // Do this in a separate thread to avoid blocking transition to active, but don't allow more
    // than the expected number of tasks to run or queue up
    // See HDFS-4816
    ExecutorService executor = new ThreadPoolExecutor(0, activeNNAddresses.size(), 100,
        TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>(activeNNAddresses.size()),
        uploadThreadFactory);
    // for right now, just match the upload to the nn address by convention. There is no need to
    // directly tie them together by adding a pair class.
    List<Future<TransferFsImage.TransferResult>> uploads =
        new ArrayList<Future<TransferFsImage.TransferResult>>();
    for (final URL activeNNAddress : activeNNAddresses) {
      Future<TransferFsImage.TransferResult> upload =
          executor.submit(new Callable<TransferFsImage.TransferResult>() {
            @Override
            public TransferFsImage.TransferResult call()
                throws IOException, InterruptedException {


              CheckpointFaultInjector.getInstance().duringUploadInProgess();
              return TransferFsImage.uploadImageFromStorage(activeNNAddress, conf, namesystem
                  .getFSImage().getStorage(), imageType, txid, canceler);

            }
          });
      uploads.add(upload);
    }
    InterruptedException ie = null;
    IOException ioe= null;
    int i = 0;
    boolean success = false;
    for (; i < uploads.size(); i++) {
      Future<TransferFsImage.TransferResult> upload = uploads.get(i);
      try {
        // TODO should there be some smarts here about retries nodes that are not the active NN?
        if (upload.get() == TransferFsImage.TransferResult.SUCCESS) {
          success = true;
          //avoid getting the rest of the results - we don't care since we had a successful upload
          break;
        }

      } catch (ExecutionException e) {
        ioe = new IOException("Exception during image upload", e);
        break;
      } catch (InterruptedException e) {
        ie = e;
        break;
      }
    }
    if (ie == null && ioe == null) {
      //Update only when response from remote about success or
      lastUploadTime = monotonicNow();
      // we are primary if we successfully updated the ANN
      this.isPrimaryCheckPointer = success;
    }
    // cleaner than copying code for multiple catch statements and better than catching all
    // exceptions, so we just handle the ones we expect.
    if (ie != null || ioe != null) {

      // cancel the rest of the tasks, and close the pool
      for (; i < uploads.size(); i++) {
        Future<TransferFsImage.TransferResult> upload = uploads.get(i);
        // The background thread may be blocked waiting in the throttler, so
        // interrupt it.
        upload.cancel(true);
      }

      // shutdown so we interrupt anything running and don't start anything new
      executor.shutdownNow();
      // this is a good bit longer than the thread timeout, just to make sure all the threads
      // that are not doing any work also stop
      executor.awaitTermination(500, TimeUnit.MILLISECONDS);

      // re-throw the exception we got, since one of these two must be non-null
      if (ie != null) {
        throw ie;
      } else if (ioe != null) {
        throw ioe;
      }
    }
  }
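
The upload part of doCheckpoint() follows a "submit to a bounded pool, stop at the first success" pattern. Below is a reduced sketch of that pattern using only `java.util.concurrent`. The class `FirstSuccessUpload`, its `Result` enum and the `uploadToAll` method are invented for this example, and unlike the listing above the sketch always shuts the pool down and lets `ExecutionException` propagate instead of wrapping it in an `IOException`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class FirstSuccessUpload {
    private FirstSuccessUpload() {}

    /** Stand-in for the outcome of one upload attempt. */
    public enum Result { SUCCESS, NOT_ACTIVE }

    /**
     * Submits every upload to a pool bounded by the number of targets and
     * returns true as soon as one of them reports SUCCESS.
     */
    public static boolean uploadToAll(List<Callable<Result>> uploads)
            throws InterruptedException, ExecutionException {
        int n = Math.max(1, uploads.size());
        // Same shape as in doCheckpoint(): no core threads, at most n threads,
        // and a queue that can hold at most n pending tasks.
        ExecutorService executor = new ThreadPoolExecutor(
            0, n, 100, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(n));
        try {
            List<Future<Result>> futures = new ArrayList<>();
            for (Callable<Result> upload : uploads) {
                futures.add(executor.submit(upload));
            }
            for (Future<Result> future : futures) {
                if (future.get() == Result.SUCCESS) {
                    return true; // the remaining results no longer matter
                }
            }
            return false;
        } finally {
            // Interrupt anything still running and reject new work.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Result>> uploads = new ArrayList<>();
        uploads.add(() -> Result.NOT_ACTIVE);
        uploads.add(() -> Result.SUCCESS);
        System.out.println("primary checkpointer: " + uploadToAll(uploads));
    }
}
```

Running the uploads off the calling thread keeps the checkpointer responsive to cancellation, which is the motivation the source comments give for using a separate thread (HDFS-4816).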

Hadoop 3.2.1 [HDFS] source code analysis: the Standby Namenode
