Hadoop 3.2.1 [HDFS] Source Code Analysis: Secondary NameNode

Posted by 張伯毅 on 2020-09-28

 

1. Introduction

There is only one Secondary NameNode. Its role is to assist the NameNode with checkpointing the metadata, that is, merging the edit log into the fsimage file.

The Secondary NameNode is a daemon that periodically triggers checkpoint operations and communicates with the NameNode through NamenodeProtocol.

 

Parameters:

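| No. | Parameter | Default | Description |
|-----|-----------|---------|-------------|
| 1 | dfs.namenode.checkpoint.check.period | 60s | The SecondaryNameNode and CheckpointNode poll the NameNode every 'dfs.namenode.checkpoint.check.period' seconds to query the number of uncheckpointed transactions. |
| 2 | dfs.namenode.checkpoint.period | 3600s (1 hour) | Maximum delay between two consecutive checkpoints. |
| 3 | dfs.namenode.checkpoint.txns | 1,000,000 | Number of uncheckpointed transactions that triggers a checkpoint. |
| 4 | dfs.namenode.checkpoint.max-retries | 3 | Maximum number of retries after a failed merge. |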
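To make the defaults above concrete, here is a minimal sketch of how the four settings could be read with fallbacks. `CheckpointSettings` and `readLong` are hypothetical names invented for this illustration, not Hadoop classes, and a plain `Map` stands in for Hadoop's `Configuration`. The sketch assumes values are plain integers and does not handle unit suffixes such as `60s`.

```java
import java.util.Map;

/** Hypothetical holder for the four checkpoint settings (not a Hadoop class). */
class CheckpointSettings {
    final long checkPeriodSeconds; // dfs.namenode.checkpoint.check.period
    final long periodSeconds;      // dfs.namenode.checkpoint.period
    final long txnCount;           // dfs.namenode.checkpoint.txns
    final int maxRetries;          // dfs.namenode.checkpoint.max-retries

    CheckpointSettings(Map<String, String> conf) {
        checkPeriodSeconds = readLong(conf, "dfs.namenode.checkpoint.check.period", 60L);
        periodSeconds = readLong(conf, "dfs.namenode.checkpoint.period", 3600L);
        txnCount = readLong(conf, "dfs.namenode.checkpoint.txns", 1_000_000L);
        maxRetries = Math.toIntExact(
            readLong(conf, "dfs.namenode.checkpoint.max-retries", 3L));
    }

    // Assumption: values are plain integers; suffixes like "60s" are not parsed here.
    private static long readLong(Map<String, String> conf, String key, long defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        // Override only the checkpoint period; everything else falls back to defaults.
        CheckpointSettings s =
            new CheckpointSettings(Map.of("dfs.namenode.checkpoint.period", "1800"));
        System.out.println("period=" + s.periodSeconds + "s, txns=" + s.txnCount);
    }
}
```

Any key that is missing from the map simply takes the default listed in the table.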
    

 

 

2. Checkpoint workflow

In a non-HA deployment, the FSImage merge is performed by the Secondary NameNode.

An FSImage merge is triggered when either of the following conditions holds:

① The configured checkpoint period has elapsed (dfs.namenode.checkpoint.period, default: 1 hour);

② The number of transactions recorded since the last checkpoint exceeds the configured limit (dfs.namenode.checkpoint.txns, default: 1,000,000).
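The two conditions can be expressed as one small predicate. This is a simplified sketch; `CheckpointTrigger` and its parameter names are hypothetical. The exact boundary of the transaction check (`>=` here) is an assumption of this sketch, and the time check follows the comparison used in doWork() later in this article.

```java
/** Hypothetical helper that evaluates the two checkpoint trigger conditions. */
class CheckpointTrigger {
    static boolean shouldCheckpoint(long uncheckpointedTxns, long txnLimit,
                                    long nowMs, long lastCheckpointMs,
                                    long periodSeconds) {
        // Condition 2: too many transactions since the last checkpoint.
        // Assumption: reaching the limit exactly already counts.
        boolean tooManyTxns = uncheckpointedTxns >= txnLimit;
        // Condition 1: the configured period has elapsed.
        boolean periodElapsed = nowMs >= lastCheckpointMs + 1000L * periodSeconds;
        return tooManyTxns || periodElapsed;
    }

    public static void main(String[] args) {
        // 10 minutes after the last checkpoint, with only 500 pending transactions.
        boolean result = shouldCheckpoint(500, 1_000_000, 600_000, 0, 3600);
        System.out.println("checkpoint needed: " + result);
    }
}
```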

 

The workflow is as follows:
 

 

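■ The Secondary NameNode checks whether either of the two checkpoint trigger conditions is met. Because the Secondary NameNode and the NameNode do not share an editlog directory in a non-HA setup, the Secondary NameNode obtains the latest transaction ID (transactionId) by calling the RPC method NamenodeProtocol.getTransactionId().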
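■ The Secondary NameNode calls the RPC method NamenodeProtocol.rollEditLog() to roll the edit log: the segment currently being written is closed and a new one is opened (edits.new in older releases; an edits_inprogress_<txid> segment in the current storage layout). The call also returns the transaction IDs of the current fsimage and of the edit log that was just rolled (seen_id). While the Secondary NameNode reads the finished editlog files from the NameNode, new operations are written to the new segment, so edit logging is not interrupted. In HA mode there is no need to trigger the roll explicitly, because the Standby NameNode rolls the edit log periodically.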
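■ With the latest txid and seen_id in hand, the Secondary NameNode sends HTTP GET requests to the NameNode's image servlet (GetImageServlet in older releases, ImageServlet in the 3.2.1 code) to fetch the new fsimage and editlog files. Note that the Secondary NameNode may already hold some of the fsimage and edits files from its previous checkpoint.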

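■ The Secondary NameNode loads the newly downloaded fsimage file to rebuild its own namespace.
■ The Secondary NameNode reads the records in the edits files and applies them to the current namespace, which brings its namespace in sync with the NameNode's.
■ The Secondary NameNode writes the synchronized namespace to a new fsimage file.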
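■ The Secondary NameNode sends the HTTP GET request /getimage?putimage=1 to the NameNode's ImageServlet. The URL also carries the transaction ID of the new fsimage file, plus the port and IP address from which the Secondary NameNode serves the download.
■ Using that information, the NameNode issues an HTTP GET request to the Secondary NameNode's GetImageServlet to download the fsimage file. The NameNode first names the downloaded file fsimage.ckpt_, then creates its MD5 checksum, and finally renames fsimage.ckpt_ to fsimage_xxxxx. (This GET-then-GET exchange is the flow of older releases. In the 3.2.1 code, TransferFsImage.uploadImageFromStorage() pushes the image to the NameNode's ImageServlet with an HTTP PUT.)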
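The load, replay, and save steps in the list above can be illustrated with a toy in-memory model. This is only a sketch: `Image`, `Edit`, and `merge` are hypothetical names, the "namespace" is reduced to a set of paths, and only two made-up operations (CREATE and DELETE) are supported. The real FSImage and edit log formats are far richer.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of a checkpoint merge: load an image, replay edits, produce a new image. */
class CheckpointMergeSketch {
    /** One edit-log record: a transaction ID plus a toy operation on a path. */
    static final class Edit {
        final long txid;
        final String op;
        final String path;
        Edit(long txid, String op, String path) {
            this.txid = txid;
            this.op = op;
            this.path = path;
        }
    }

    /** A toy fsimage: the set of paths and the last transaction it reflects. */
    static final class Image {
        final Set<String> paths;
        final long lastTxid;
        Image(Set<String> paths, long lastTxid) {
            this.paths = new TreeSet<>(paths);
            this.lastTxid = lastTxid;
        }
    }

    static Image merge(Image base, List<Edit> edits) {
        Set<String> namespace = new TreeSet<>(base.paths); // "load" the image
        long applied = base.lastTxid;
        for (Edit e : edits) {
            if (e.txid <= applied) {
                continue; // already reflected in the image
            }
            if (e.txid != applied + 1) {
                throw new IllegalStateException("gap in edit log at txid " + e.txid);
            }
            switch (e.op) {
                case "CREATE": namespace.add(e.path); break;
                case "DELETE": namespace.remove(e.path); break;
                default: throw new IllegalArgumentException("unknown op " + e.op);
            }
            applied = e.txid;
        }
        return new Image(namespace, applied); // the "new fsimage"
    }

    public static void main(String[] args) {
        Image base = new Image(Set.of("/a"), 5);
        Image merged = merge(base, List.of(
            new Edit(6, "CREATE", "/b"),
            new Edit(7, "DELETE", "/a")));
        System.out.println(merged.paths + " @ txid " + merged.lastTxid);
    }
}
```

The new image carries the ID of the last applied transaction, which corresponds to the transaction ID sent along with the upload in the final steps.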

 

 

3. Startup

Start with the main function. There are two startup modes.

Mode 1: run a single command, then terminate.

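CHECKPOINT: run a checkpoint manually. If the trigger condition is not met, the checkpoint is still skipped (unless the force argument is given, as in -checkpoint force).
GETEDITSIZE: print the number of transactions that have not been checkpointed yet.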

Mode 2: start as a daemon [starts the InfoServer and the CheckpointThread, which runs checkpoints periodically].

 /**
   *
   * main() has some simple utility methods.
   * @param argv Command line parameters.
   * @exception Exception if the filesystem does not exist.
   */
  public static void main(String[] argv) throws Exception {
    CommandLineOpts opts = SecondaryNameNode.parseArgs(argv);
    if (opts == null) {
      LOG.error("Failed to parse options");
      terminate(1);
    } else if (opts.shouldPrintHelp()) {
      opts.usage();
      System.exit(0);
    }

    try {
      StringUtils.startupShutdownMessage(SecondaryNameNode.class, argv, LOG);
      Configuration tconf = new HdfsConfiguration();
      SecondaryNameNode secondary = null;
      secondary = new SecondaryNameNode(tconf, opts);

      // SecondaryNameNode can be started in 2 modes:
      // 1. run a command (i.e. checkpoint or geteditsize) then terminate
      // 2. run as a daemon when {@link #parseArgs} yields no commands
      if (opts != null && opts.getCommand() != null) {
        // mode 1
        int ret = secondary.processStartupCommand(opts);
        terminate(ret);
      } else {
        // mode 2
        secondary.startInfoServer();
        secondary.startCheckpointThread();
        secondary.join();
      }
    } catch (Throwable e) {
      LOG.error("Failed to start secondary namenode", e);
      terminate(1);
    }
  }
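The mode selection in main() boils down to "is there a command or not". The sketch below shows that dispatch in isolation. `StartupModeSketch`, `parse`, and `describe` are hypothetical names; the real parseArgs returns a CommandLineOpts object and supports more options than the two shown here.

```java
/** Simplified sketch of choosing between one-shot mode and daemon mode. */
class StartupModeSketch {
    enum Command { CHECKPOINT, GETEDITSIZE }

    /** Returns the one-shot command named by the arguments, or null for daemon mode. */
    static Command parse(String[] argv) {
        if (argv.length == 0) {
            return null; // no command: run as a daemon
        }
        switch (argv[0]) {
            case "-checkpoint": return Command.CHECKPOINT;
            case "-geteditsize": return Command.GETEDITSIZE;
            default: throw new IllegalArgumentException("unknown option: " + argv[0]);
        }
    }

    static String describe(String[] argv) {
        Command c = parse(argv);
        return c == null
            ? "mode 2: start info server and checkpoint thread"
            : "mode 1: run " + c + " then terminate";
    }

    public static void main(String[] args) {
        System.out.println(describe(new String[] {"-geteditsize"}));
        System.out.println(describe(new String[0]));
    }
}
```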

We go straight to the second mode.

 

4. startInfoServer

First, an HTTP server is started [default: dfs.namenode.secondary.http-address : 0.0.0.0:9868] for communicating with the NameNode.

 

/**
   * Start the web server.
   */
  @VisibleForTesting
  public void startInfoServer() throws IOException {
    final InetSocketAddress httpAddr = getHttpAddress(conf);

    // Default: dfs.namenode.secondary.https-address : 0.0.0.0:9869
    final String httpsAddrString = conf.getTrimmed(
        DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_KEY,
        DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_DEFAULT);
    InetSocketAddress httpsAddr = NetUtils.createSocketAddr(httpsAddrString);


    // Build the HTTP server
    HttpServer2.Builder builder = DFSUtil.httpServerTemplateForNNAndJN(conf,
        httpAddr, httpsAddr, "secondary", DFSConfigKeys.
            DFS_SECONDARY_NAMENODE_KERBEROS_INTERNAL_SPNEGO_PRINCIPAL_KEY,
        DFSConfigKeys.DFS_SECONDARY_NAMENODE_KEYTAB_FILE_KEY);

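    // dfs.xframe.enabled : default true
    // If true, clickjacking protection is enabled by returning the X-Frame-Options header set to SAMEORIGIN.
    // Clickjacking protection prevents an attacker from using transparent or opaque layers to trick a user into clicking a button or link on another page.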
    final boolean xFrameEnabled = conf.getBoolean(
        DFSConfigKeys.DFS_XFRAME_OPTION_ENABLED,
        DFSConfigKeys.DFS_XFRAME_OPTION_ENABLED_DEFAULT);

    // dfs.xframe.value : default SAMEORIGIN; allowed values: DENY, SAMEORIGIN, ALLOW-FROM
    final String xFrameOptionValue = conf.getTrimmed(
        DFSConfigKeys.DFS_XFRAME_OPTION_VALUE,
        DFSConfigKeys.DFS_XFRAME_OPTION_VALUE_DEFAULT);

    builder.configureXFrame(xFrameEnabled).setXFrameOption(xFrameOptionValue);

    infoServer = builder.build();
    infoServer.setAttribute("secondary.name.node", this);
    infoServer.setAttribute("name.system.image", checkpointImage);
    infoServer.setAttribute(JspHelper.CURRENT_CONF, conf);
    infoServer.addInternalServlet("imagetransfer", ImageServlet.PATH_SPEC,
        ImageServlet.class, true);
    infoServer.start();

    LOG.info("Web server init done");

    HttpConfig.Policy policy = DFSUtil.getHttpPolicy(conf);
    int connIdx = 0;
    if (policy.isHttpEnabled()) {


      InetSocketAddress httpAddress =
          infoServer.getConnectorAddress(connIdx++);

      // dfs.namenode.secondary.http-address
      conf.set(DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTP_ADDRESS_KEY,
          NetUtils.getHostPortString(httpAddress));
    }

    if (policy.isHttpsEnabled()) {
      InetSocketAddress httpsAddress =
          infoServer.getConnectorAddress(connIdx);
      conf.set(DFSConfigKeys.DFS_NAMENODE_SECONDARY_HTTPS_ADDRESS_KEY,
          NetUtils.getHostPortString(httpsAddress));
    }
  }
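The effect of the two X-Frame settings can be shown without a real server. The sketch below computes the response header that would be added for a given pair of settings. `XFrameHeaderSketch` and `responseHeaders` are hypothetical names; this is not how HttpServer2 is implemented, and the case-insensitive handling of the value is an assumption of this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch: derive the clickjacking-protection header from the two settings. */
class XFrameHeaderSketch {
    // The three values listed for dfs.xframe.value.
    private static final Set<String> ALLOWED = Set.of("DENY", "SAMEORIGIN", "ALLOW-FROM");

    static Map<String, String> responseHeaders(boolean xFrameEnabled, String xFrameValue) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (!xFrameEnabled) {
            return headers; // protection disabled: no header is added
        }
        // Assumption: the configured value is matched case-insensitively.
        String value = xFrameValue.trim().toUpperCase(Locale.ROOT);
        if (!ALLOWED.contains(value)) {
            throw new IllegalArgumentException("unsupported X-Frame-Options value: " + xFrameValue);
        }
        headers.put("X-Frame-Options", value);
        return headers;
    }

    public static void main(String[] args) {
        System.out.println(responseHeaders(true, "SAMEORIGIN"));
        System.out.println(responseHeaders(false, "SAMEORIGIN"));
    }
}
```

With SAMEORIGIN, a browser refuses to render the page inside a frame hosted on a different origin, which is what defeats the overlay trick described in the comments.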

 

5. startCheckpointThread

Starts the checkpoint thread. There is not much to say here: it simply starts a daemon thread.

SecondaryNameNode implements the Runnable interface, so the thread invokes its run() method directly, which in turn calls doWork().

  public void startCheckpointThread() {
    Preconditions.checkState(checkpointThread == null,
        "Should not already have a thread");
    Preconditions.checkState(shouldRun, "shouldRun should be true");
    
    checkpointThread = new Daemon(this);
    checkpointThread.start();
  }
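The same pattern can be reproduced with plain JDK classes. In this sketch, `CheckpointerSketch` is a hypothetical stand-in for SecondaryNameNode, a `Thread` with `setDaemon(true)` stands in for Hadoop's `Daemon` helper, and `IllegalStateException` replaces the Guava `Preconditions` checks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Sketch of a Runnable that starts itself on a single daemon thread. */
class CheckpointerSketch implements Runnable {
    private volatile boolean shouldRun = true;
    private Thread checkpointThread;
    private final CountDownLatch ran = new CountDownLatch(1);

    @Override
    public void run() {
        // Stands in for the real work loop (doWork()).
        ran.countDown();
    }

    synchronized void startCheckpointThread() {
        if (checkpointThread != null) {
            throw new IllegalStateException("Should not already have a thread");
        }
        if (!shouldRun) {
            throw new IllegalStateException("shouldRun should be true");
        }
        checkpointThread = new Thread(this, "checkpoint-sketch");
        checkpointThread.setDaemon(true); // a daemon thread does not keep the JVM alive
        checkpointThread.start();
    }

    synchronized boolean isDaemon() {
        return checkpointThread != null && checkpointThread.isDaemon();
    }

    /** Waits up to five seconds for run() to have been entered. */
    boolean awaitFirstRun() {
        try {
            return ran.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        CheckpointerSketch sketch = new CheckpointerSketch();
        sketch.startCheckpointThread();
        System.out.println("daemon=" + sketch.isDaemon() + ", ran=" + sketch.awaitFirstRun());
    }
}
```

Because the thread is a daemon, the process stays alive only as long as something else does. In the real main() that role is played by secondary.join().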

 

6. doWork()

 //
  // The main work loop
  //
  public void doWork() {
    //
    // Poll the Namenode (once every checkpointCheckPeriod seconds) to find the
    // number of transactions in the edit log that haven't yet been checkpointed.
    //
    long period = checkpointConf.getCheckPeriod();
    int maxRetries = checkpointConf.getMaxRetriesOnMergeError();

    while (shouldRun) {
      try {
        Thread.sleep(1000 * period);
      } catch (InterruptedException ie) {
        // do nothing
      }
      if (!shouldRun) {
        break;
      }
      try {
        // We may have lost our ticket since last checkpoint, log in again, just in case
        if(UserGroupInformation.isSecurityEnabled())
          UserGroupInformation.getCurrentUser().checkTGTAndReloginFromKeytab();
        
        final long monotonicNow = Time.monotonicNow();
        final long now = Time.now();

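        // Has the transaction count reached the limit [default: 1,000,000],
        // or has more than the checkpoint period [default: 1 hour] passed since the last checkpoint?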
        if (shouldCheckpointBasedOnCount() ||
            monotonicNow >= lastCheckpointTime + 1000 * checkpointConf.getPeriod()) {

          // Run the checkpoint
          doCheckpoint();
          
          lastCheckpointTime = monotonicNow;
          lastCheckpointWallclockTime = now;
        }
      } catch (IOException e) {
        LOG.error("Exception in doCheckpoint", e);
        e.printStackTrace();
        // Prevent a huge number of edits from being created due to
        // unrecoverable conditions and endless retries.
        if (checkpointImage.getMergeErrorCount() > maxRetries) {
          LOG.error("Merging failed " +
              checkpointImage.getMergeErrorCount() + " times.");
          terminate(1);
        }
      } catch (Throwable e) {
        LOG.error("Throwable Exception in doCheckpoint", e);
        e.printStackTrace();
        terminate(1, e);
      }
    }
  }
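The retry guard in the IOException branch is easier to follow when isolated. The sketch below replays a sequence of merge outcomes and reports when the loop would give up. `RetryLimitSketch` is a hypothetical name, and the sketch assumes that every failure counts as a merge error and that a successful merge resets the counter, mirroring setMergeError() and clearMergeError() in doCheckpoint.

```java
/** Sketch of the "give up after too many merge errors" guard. */
class RetryLimitSketch {
    /**
     * Returns the 1-based attempt at which the loop would terminate,
     * or -1 if the given sequence never exceeds the retry limit.
     */
    static int attemptThatTerminates(boolean[] mergeSucceeded, int maxRetries) {
        int mergeErrorCount = 0;
        for (int i = 0; i < mergeSucceeded.length; i++) {
            if (mergeSucceeded[i]) {
                mergeErrorCount = 0; // success clears the error state
                continue;
            }
            mergeErrorCount++;
            if (mergeErrorCount > maxRetries) {
                return i + 1; // same comparison as in doWork(): count > maxRetries
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] allFail = {false, false, false, false};
        System.out.println("terminates at attempt "
            + attemptThatTerminates(allFail, 3));
    }
}
```

Under these assumptions, the default of 3 retries means the process terminates on the fourth consecutive failed merge.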

7. doCheckpoint [the core checkpoint logic]

 


  /**
   * Create a new checkpoint
   * @return if the image is fetched from primary or not
   */
  @VisibleForTesting
  @SuppressWarnings("deprecated")
  public boolean doCheckpoint() throws IOException {
    checkpointImage.ensureCurrentDirExists();
    NNStorage dstStorage = checkpointImage.getStorage();
    
    // Tell the namenode to start logging transactions in a new edit file
    // Returns a token that would be used to upload the merged image.

    // Note: rollEditLog() fails if the NameNode is in safe mode.
    CheckpointSignature sig = namenode.rollEditLog();
    
    boolean loadImage = false;
    boolean isFreshCheckpointer = (checkpointImage.getNamespaceID() == 0);

    boolean isSameCluster =
        (dstStorage.versionSupportsFederation(NameNodeLayoutVersion.FEATURES)
            && sig.isSameCluster(checkpointImage)) ||
        (!dstStorage.versionSupportsFederation(NameNodeLayoutVersion.FEATURES)
            && sig.namespaceIdMatches(checkpointImage));


    if (isFreshCheckpointer ||
        (isSameCluster &&
         !sig.storageVersionMatches(checkpointImage.getStorage()))) {
      // if we're a fresh 2NN, or if we're on the same cluster and our storage
      // needs an upgrade, just take the storage info from the server.
      dstStorage.setStorageInfo(sig);
      dstStorage.setClusterID(sig.getClusterID());
      dstStorage.setBlockPoolID(sig.getBlockpoolID());
      loadImage = true;
    }
    sig.validateStorageInfo(checkpointImage);

    // error simulation code for junit test
    CheckpointFaultInjector.getInstance().afterSecondaryCallsRollEditLog();

    RemoteEditLogManifest manifest =
      namenode.getEditLogManifest(sig.mostRecentCheckpointTxId + 1);

    // Fetch fsimage and edits. Reload the image if previous merge failed.
    // downloadCheckpointFiles returns true when a new fsimage was downloaded and must be loaded.
    loadImage |= downloadCheckpointFiles(
        fsName, checkpointImage, sig, manifest) |
        checkpointImage.hasMergeError();
    try {
      // Perform the merge
      doMerge(sig, manifest, loadImage, checkpointImage, namesystem);
    } catch (IOException ioe) {
      // A merge error occurred. The in-memory file system state may be
      // inconsistent, so the image and edits need to be reloaded.
      checkpointImage.setMergeError();
      throw ioe;
    }
    // Clear any error since merge was successful.
    checkpointImage.clearMergeError();

    
    //
    // Upload the new image into the NameNode. Then tell the Namenode
    // to make this new uploaded image as the most current image.
    long txid = checkpointImage.getLastAppliedTxId();
    
    // Perform the upload.
    TransferFsImage.uploadImageFromStorage(fsName, conf, dstStorage,
        NameNodeFile.IMAGE, txid);

    // error simulation code for junit test
    CheckpointFaultInjector.getInstance().afterSecondaryUploadsNewImage();

    LOG.warn("Checkpoint done. New Image Size: " 
             + dstStorage.getFsImageName(txid).length());

    if (legacyOivImageDir != null && !legacyOivImageDir.isEmpty()) {
      try {
        checkpointImage.saveLegacyOIVImage(namesystem, legacyOivImageDir,
            new Canceler());
      } catch (IOException e) {
        LOG.warn("Failed to write legacy OIV image: ", e);
      }
    }
    return loadImage;
  }
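The value returned by doCheckpoint, loadImage, is built up from several flags. The sketch below extracts just that decision as a pure function. `LoadImageDecisionSketch` and its parameter names are hypothetical; each boolean stands for the result of the corresponding call in the method above.

```java
/** Sketch of how doCheckpoint decides whether the fsimage must be (re)loaded. */
class LoadImageDecisionSketch {
    static boolean mustLoadImage(boolean isFreshCheckpointer,
                                 boolean isSameCluster,
                                 boolean storageVersionMatches,
                                 boolean downloadedNewImage,
                                 boolean hadMergeError) {
        // A fresh 2NN, or one whose storage needs an upgrade, takes its
        // storage info from the NameNode and must load the image.
        boolean loadImage =
            isFreshCheckpointer || (isSameCluster && !storageVersionMatches);
        // A newly downloaded image or a previous merge error also forces a reload.
        loadImage |= downloadedNewImage | hadMergeError;
        return loadImage;
    }

    public static void main(String[] args) {
        // Steady state: same cluster, matching storage, nothing new downloaded.
        System.out.println(mustLoadImage(false, true, true, false, false));
        // A previous merge failed, so the in-memory state cannot be trusted.
        System.out.println(mustLoadImage(false, true, true, false, true));
    }
}
```

In the real method the right-hand side of `|=` is always evaluated, so the download step runs even when loadImage is already true. The sketch takes the download result as an input and therefore cannot show that side effect.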

 


 

 

 

 

 

 

 

 

 

 
