Hadoop 3.2.1 [HDFS] Source Code Analysis: File System Dataset [Part 1]

Posted by 張伯毅 on 2020-11-10

Original article: https://zhangboyi.blog.csdn.net/article/details/109589382

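1. Preface

A Datanode can be configured with multiple storage directories for block files. (Note that the storage directories of a Datanode hold different blocks, and the directories may be heterogeneous; this design improves block I/O throughput.) The Datanode therefore abstracts the management of the underlying blocks into several layers and defines a separate class to manage blocks at each layer.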


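■ BlockPoolSlice: manages all blocks of one block pool under one storage directory. Because a Datanode can define multiple storage directories, the blocks of a block pool are spread across them. A block pool therefore has several BlockPoolSlice objects, and together they manage all blocks of that pool.
■ FsVolumeImpl: manages all blocks under one storage directory of the Datanode. Because one storage directory can hold blocks from several block pools, FsVolumeImpl holds the BlockPoolSlice object of every block pool stored in that directory.
■ FsVolumeList: a Datanode can define multiple storage directories, and the blocks under each one are managed by an FsVolumeImpl object, so the Datanode defines the FsVolumeList class to hold all of its FsVolumeImpl objects. FsVolumeList offers FsDatasetImpl a disk-like service.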
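The containment between the three layers can be pictured with a small model. This is only a sketch: the class names Slice, Volume, and VolumeList and their fields are hypothetical simplifications, not the Hadoop classes, and all locking and I/O is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of the three layers; not the Hadoop classes. */
class VolumeLayersSketch {

  /** One block pool's blocks inside one storage directory (the BlockPoolSlice role). */
  static final class Slice {
    final String bpid;
    final List<Long> blockIds = new ArrayList<>();

    Slice(String bpid) {
      this.bpid = bpid;
    }
  }

  /** One storage directory holding one Slice per block pool (the FsVolumeImpl role). */
  static final class Volume {
    final String storageDir;
    final Map<String, Slice> slices = new LinkedHashMap<>();

    Volume(String storageDir) {
      this.storageDir = storageDir;
    }

    Slice addBlockPool(String bpid) {
      return slices.computeIfAbsent(bpid, Slice::new);
    }
  }

  /** All storage directories of one Datanode (the FsVolumeList role). */
  static final class VolumeList {
    final List<Volume> volumes = new ArrayList<>();

    /** Registers the block pool on every volume, creating one slice per volume. */
    void addBlockPool(String bpid) {
      for (Volume v : volumes) {
        v.addBlockPool(bpid);
      }
    }

    /** Counts the blocks of one pool across all of its slices. */
    int countBlocks(String bpid) {
      int n = 0;
      for (Volume v : volumes) {
        Slice s = v.slices.get(bpid);
        if (s != null) {
          n += s.blockIds.size();
        }
      }
      return n;
    }
  }
}
```

Counting the blocks of a pool has to visit every volume, which mirrors the point that a pool's blocks are spread over all storage directories.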

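2. States of block replicas on a Datanode

FsDatasetImpl holds a ReplicaMap object that tracks the state of every block replica on the Datanode.

A block replica stored on a Datanode is in one of the following five states.
■ FINALIZED: a replica whose write has completed. The Datanode defines the FinalizedReplica class to hold information about replicas in the FINALIZED state.
■ RBW (Replica Being Written): a replica that was just created by an HDFS client or is being appended to. Its data is still being written, and part of the data is visible to clients. The Datanode defines the ReplicaBeingWritten class to hold information about replicas in the RBW state.
■ RUR (Replica Under Recovery): a replica that is undergoing block recovery. The Datanode defines the ReplicaUnderRecovery class to hold information about replicas in the RUR state.
■ RWR (Replica Waiting to be Recovered): if the Datanode restarts or crashes, every RBW replica is loaded as RWR after the restart and waits for block recovery. The Datanode defines the ReplicaWaitingToBeRecovered class to hold information about replicas in the RWR state.
■ TEMPORARY: the state of a replica that is being written while blocks are copied between Datanodes or while the cluster is being rebalanced. Unlike RBW, a TEMPORARY replica is not visible to clients, and a Datanode deletes TEMPORARY replicas outright when it restarts. The Datanode defines the ReplicaInPipeline class to hold information about replicas in the TEMPORARY state.

The Datanode stores replicas in different states under different on-disk directories; in other words, replica state on a Datanode is persisted. HDFS uses the BlockPoolSlice, FsVolumeImpl, and FsVolumeList classes to manage these storage directories.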

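3. BlockPoolSlice implementation

The BlockPoolSlice class manages all blocks of one block pool under one storage directory. Every storage directory contains a block pool directory for each block pool, and that block pool directory is exactly what a BlockPoolSlice manages.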


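A block pool directory contains the current and tmp directories.
The current directory in turn contains three subdirectories: finalized, rbw, and lazypersist.
The finalized directory stores all FINALIZED replicas, the rbw directory stores replicas in the RBW (being written), RWR (waiting to be recovered), and RUR (under recovery) states, and the tmp directory stores TEMPORARY replicas.
When a client write request creates a new replica, the replica is placed in the rbw directory.
When a new replica is created during block replication or cluster rebalancing, it is placed in the tmp directory.
Once a replica has been completely written and committed, it is moved to the finalized directory.
When the Datanode restarts, all replicas in tmp are deleted, replicas in rbw are loaded as RWR, and replicas in finalized are loaded as FINALIZED.
The lazypersist directory was added in HDFS 2.6 to support writing temporary block replicas in memory and persisting them to disk lazily; it stores those lazily persisted replicas.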
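The restart rule above fits in a few lines. The following is a minimal sketch, not Hadoop code: the class RestartLoadSketch and its enum are hypothetical names, and the quick-restart case (where an rbw replica can come back as RBW) is deliberately ignored here.

```java
import java.util.Optional;

/** Hypothetical sketch of the directory-to-state rule applied on Datanode restart. */
class RestartLoadSketch {

  enum State { FINALIZED, RBW, RWR, RUR, TEMPORARY }

  /**
   * Returns the state a replica is loaded with after a restart, based on the
   * directory it was found in. An empty result means the replica is deleted
   * instead of being loaded.
   */
  static Optional<State> stateAfterRestart(String dirName) {
    switch (dirName) {
      case "finalized":
        return Optional.of(State.FINALIZED);
      case "rbw":
        // Simplification: the quick-restart window is not modeled.
        return Optional.of(State.RWR);
      case "tmp":
        return Optional.empty();
      default:
        throw new IllegalArgumentException("unknown directory: " + dirName);
    }
  }
}
```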

The finalized subdirectory has a special structure: it contains both directories and files. It stores two kinds of files.


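■ Block files, for example blk_1073741825, where 1073741825 is the block ID.
■ Checksum files with the .meta suffix, which store the checksum information of a block file. blk_1073741825_1001.meta is such a file; the 1001 in its name is the block's generation stamp (version number).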
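To make the naming scheme concrete, here is a small parser for the two file name shapes. It is a sketch under assumptions: the class BlockFileNameSketch and its regular expressions are written for this article (including the guess that IDs may be negative), and it is not how Hadoop's Block class parses names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for "blk_<id>" and "blk_<id>_<genStamp>.meta" file names. */
class BlockFileNameSketch {

  private static final Pattern BLOCK = Pattern.compile("blk_(-?\\d+)");
  private static final Pattern META = Pattern.compile("blk_(-?\\d+)_(\\d+)\\.meta");

  /** True only for a block data file, not for its .meta companion. */
  static boolean isBlockFile(String name) {
    return BLOCK.matcher(name).matches();
  }

  /** Extracts the block ID from either a block file name or a meta file name. */
  static long blockId(String name) {
    Matcher m = BLOCK.matcher(name);
    if (m.matches()) {
      return Long.parseLong(m.group(1));
    }
    m = META.matcher(name);
    if (m.matches()) {
      return Long.parseLong(m.group(1));
    }
    throw new IllegalArgumentException("not a block or meta file: " + name);
  }

  /** Extracts the generation stamp from a meta file name. */
  static long generationStamp(String metaName) {
    Matcher m = META.matcher(metaName);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a meta file: " + metaName);
    }
    return Long.parseLong(m.group(2));
  }
}
```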

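Once a cluster holds enough data, the finalized directory contains a very large number of blocks. To keep them organized, blocks are hashed by block ID, and blocks with the same hash value are placed in the same subdirectory.
In 2.7.x, finalized has at most 256 first-level subdirectories, each with up to 256 second-level subdirectories.
From 2.8.x up to the current 3.3.0, finalized has at most 32 first-level subdirectories, each with up to 32 second-level subdirectories.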
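The difference between the two layouts comes down to how many bits of the block ID each directory level keeps. The sketch below parameterizes that; the class SubdirLayoutSketch and the bitsPerLevel parameter are hypothetical, and the 8-bit case assumes the older layout used the same shifts with a 0xFF mask.

```java
/** Hypothetical helper comparing the 32x32 and 256x256 subdirectory layouts. */
class SubdirLayoutSketch {

  /**
   * Returns the path, relative to finalized, for a block ID when each
   * directory level keeps bitsPerLevel bits (5 gives 32 dirs, 8 gives 256).
   */
  static String relativeDir(long blockId, int bitsPerLevel) {
    long mask = (1L << bitsPerLevel) - 1; // 5 bits -> 0x1F, 8 bits -> 0xFF
    int d1 = (int) ((blockId >> 16) & mask); // first-level index
    int d2 = (int) ((blockId >> 8) & mask);  // second-level index
    return "subdir" + d1 + "/" + "subdir" + d2;
  }
}
```

For block ID 1073741825 (0x40000001) both indexes are 0, so the block lands in subdir0/subdir0. Block IDs are allocated sequentially, so 256 consecutive IDs share one second-level directory before the index moves on.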

3.1. BlockPoolSlice fields


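■ bpid: the ID of the block pool that this BlockPoolSlice belongs to.
■ Directory fields: references to the current subdirectory (currentDir), the finalized subdirectory (finalizedDir), the lazypersist subdirectory (lazypersistDir), the rbw subdirectory (rbwDir), and the tmp subdirectory (tmpDir) of this block pool directory.
■ dfsUsage: of type DU; describes the disk usage of this block pool directory.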

3.2. BlockPoolSlice methods

3.2.1. Constructor

The BlockPoolSlice constructor first sets up the current, finalized, lazypersist, rbw, and tmp directories and assigns their references to the corresponding fields. After the subdirectories are in place, it initializes the dfsUsage field, which tracks the disk usage of the block pool directory. Finally it registers a shutdown hook that saves the disk usage information when the Datanode process exits.

Call chain (a long one):
DataNode#runDatanodeDaemon() ⇒ BlockPoolManager#startAll() ⇒ BPOfferService#start() ⇒ BPServiceActor#start() ⇒ BPServiceActor#run() ⇒ connectToNNAndHandshake() ⇒ bpos.verifyAndSetNamespaceInfo(this, nsInfo) ⇒ DataNode.initBlockPool(this) ⇒ FsDatasetImpl#addBlockPool() ⇒ FsVolumeList#addBlockPool() ⇒ FsVolumeImpl#addBlockPool() ⇒ new BlockPoolSlice(bpid, this, bpdir, c, new Timer());


  /**
   * Create a block pool slice
   * @param bpid Block pool Id
   * @param volume {@link FsVolumeImpl} to which this BlockPool belongs to
   * @param bpDir directory corresponding to the BlockPool
   * @param conf configuration
   * @param timer include methods for getting time
   * @throws IOException
   */
  BlockPoolSlice(String bpid, FsVolumeImpl volume, File bpDir,
      Configuration conf, Timer timer) throws IOException {
    // BP-451827885-192.168.8.156-1584099133244
    this.bpid = bpid;
    // /tools/hadoop-3.2.1/data/hdfs/data
    this.volume = volume;
    //
    this.fileIoProvider = volume.getFileIoProvider();
    //  /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244/current
    this.currentDir = new File(bpDir, DataStorage.STORAGE_DIR_CURRENT);
    //  /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244/current/finalized
    this.finalizedDir = new File( currentDir, DataStorage.STORAGE_DIR_FINALIZED);
    //  /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244/current/lazypersist
    this.lazypersistDir = new File(currentDir, DataStorage.STORAGE_DIR_LAZY_PERSIST);

    if (!this.finalizedDir.exists()) {
      if (!this.finalizedDir.mkdirs()) {
        throw new IOException("Failed to mkdirs " + this.finalizedDir);
      }
    }
    // 4096
    this.ioFileBufferSize = DFSUtilClient.getIoFileBufferSize(conf);
    // dfs.datanode.duplicate.replica.deletion : true
    this.deleteDuplicateReplicas = conf.getBoolean(
        DFSConfigKeys.DFS_DATANODE_DUPLICATE_REPLICA_DELETION,
        DFSConfigKeys.DFS_DATANODE_DUPLICATE_REPLICA_DELETION_DEFAULT);
    // dfs.datanode.cached-dfsused.check.interval.ms : 600000 ms  =>  10 minutes
    this.cachedDfsUsedCheckTime =
        conf.getLong(
            DFSConfigKeys.DFS_DN_CACHED_DFSUSED_CHECK_INTERVAL_MS,
            DFSConfigKeys.DFS_DN_CACHED_DFSUSED_CHECK_INTERVAL_DEFAULT_MS);
    // ipc.maximum.data.length : 64mb
    this.maxDataLength = conf.getInt(
        CommonConfigurationKeys.IPC_MAXIMUM_DATA_LENGTH,
        CommonConfigurationKeys.IPC_MAXIMUM_DATA_LENGTH_DEFAULT);

    this.timer = timer;

    // Files that were being written when the datanode was last shutdown
    // are now moved back to the data directory. It is possible that
    // in the future, we might want to do some sort of datanode-local
    // recovery for these blocks. For example, crc validation.
    // tmpDir :  /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244/tmp
    this.tmpDir = new File(bpDir, DataStorage.STORAGE_DIR_TMP);

    if (tmpDir.exists()) {
      fileIoProvider.fullyDelete(volume, tmpDir);
    }
    // /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244/current/rbw
    this.rbwDir = new File(currentDir, DataStorage.STORAGE_DIR_RBW);

    // create the rbw and tmp directories if they don't exist.

    fileIoProvider.mkdirs(volume, rbwDir);

    fileIoProvider.mkdirs(volume, tmpDir);

    // Use cached value initially if available. Or the following call will block until the initial du command completes.
    // du -sk /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244
    // 132947968	/tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244
    this.dfsUsage = new CachingGetSpaceUsed.Builder().setPath(bpDir)
                                                     .setConf(conf)
                                                     .setInitialUsed(loadDfsUsed())
                                                     .build();

    if (addReplicaThreadPool == null) {
      // initialize add replica fork join pool
      initializeAddReplicaPool(conf);
    }
    // Make the dfs usage to be saved during shutdown.

    shutdownHook = new Runnable() {
      @Override
      public void run() {
        if (!dfsUsedSaved) {
          saveDfsUsed();
          addReplicaThreadPool.shutdownNow();
        }
      }
    };

    ShutdownHookManager.get().addShutdownHook(shutdownHook,
        SHUTDOWN_HOOK_PRIORITY);
  }

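3.2.2. Disk space methods

BlockPoolSlice defines five methods that deal with the disk space used by the current block pool directory: incDfsUsed(), decDfsUsed(), getDfsUsed(), loadDfsUsed(), and saveDfsUsed(). All five are built on the dfsUsage field.
The dfsUsage field is of type DU. DU is a utility class for measuring disk usage; it obtains the usage from the operating system by running a shell command:

For example:
# -k: like --block-size=1K, report sizes in KB.
# -s: --summarize, print only the total, not the size of each directory and file.

du -sk /tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244

Output: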

132947968	/tools/hadoop-3.2.1/data/hdfs/data/current/BP-451827885-192.168.8.156-1584099133244
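Turning that output into a byte count is a matter of reading the first token and scaling it by 1024, since -k reports kilobytes. The helper below is a minimal sketch with a hypothetical class name (DuOutputSketch); it does not run du itself and is not the DU class.

```java
/** Hypothetical parser for a single line of `du -sk` output. */
class DuOutputSketch {

  /** Parses "<kilobytes><whitespace><path>" and returns the usage in bytes. */
  static long usedBytes(String duLine) {
    String[] tokens = duLine.trim().split("\\s+", 2);
    if (tokens.length < 2) {
      throw new IllegalArgumentException("unexpected du output: " + duLine);
    }
    // du -k reports sizes in units of 1024 bytes.
    return Long.parseLong(tokens[0]) * 1024L;
  }
}
```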

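3.2.3. Block pool subdirectory methods

As described above, the block pool directory contains several subdirectories, including current, finalized, rbw, and lazypersist. BlockPoolSlice defines accessor methods for them: getDirectory(), getFinalizedDir(), getLazypersistDir(), getRbwDir(), and getTmpDir(). The implementations are trivial: each returns the field initialized by the constructor, except getDirectory(), which returns the parent of currentDir, that is, the block pool directory itself.

  File getDirectory() {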
    return currentDir.getParentFile();
  }

  File getFinalizedDir() {
    return finalizedDir;
  }

  File getLazypersistDir() {
    return lazypersistDir;
  }

  File getRbwDir() {
    return rbwDir;
  }

  File getTmpDir() {
    return tmpDir;
  }

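3.2.4. Replica operation methods

BlockPoolSlice provides several methods that operate on block replicas in the block pool directory, such as creating replicas in different states and loading the states of the replicas stored in the directory. FsVolumeImpl calls these methods through the BlockPoolSlice objects it holds.

Creating a block replica in the block pool directory

Creating a replica in the current block pool directory covers several cases: creating a TEMPORARY replica, an RBW replica, or a FINALIZED replica.
Creating TEMPORARY and RBW replicas is simple: the replica file is created with plain I/O operations under the tmp directory or the rbw directory respectively.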


  /**
   * Temporary files. They get moved to the finalized block directory when
   * the block is finalized.
   */
  File createTmpFile(Block b) throws IOException {
    File f = new File(tmpDir, b.getBlockName());
    // Create the tmp file
    File tmpFile = DatanodeUtil.createFileWithExistsCheck( volume, b, f, fileIoProvider);
    // If any exception during creation, its expected that counter will not be
    // incremented, So no need to decrement
    incrNumBlocks();
    return tmpFile;
  }



  /**
   * RBW files. They get moved to the finalized block directory when
   * the block is finalized.
   */
  File createRbwFile(Block b) throws IOException {
    File f = new File(rbwDir, b.getBlockName());
    File rbwFile = DatanodeUtil.createFileWithExistsCheck(
        volume, b, f, fileIoProvider);
    // If any exception during creation, its expected that counter will not be
    // incremented, So no need to decrement
    incrNumBlocks();
    return rbwFile;
  }
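Both methods delegate to DatanodeUtil.createFileWithExistsCheck(), whose body is not shown here. A rough sketch of the same idea, refusing to create a replica file that already exists, could look like the following. The class name CreateReplicaFileSketch is hypothetical, and the real method goes through FileIoProvider rather than calling java.io.File directly.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical sketch: create a replica file and fail if it already exists. */
class CreateReplicaFileSketch {

  static File createBlockFile(File dir, String blockName) throws IOException {
    File f = new File(dir, blockName);
    // createNewFile() is atomic and returns false when the file already exists.
    if (!f.createNewFile()) {
      throw new IOException("Failed to create block file " + f + ": it already exists");
    }
    return f;
  }
}
```

Failing on an existing file matters because two replicas with the same block name in one directory would silently overwrite each other.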

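Creating a FINALIZED replica is handled by addFinalizedBlock(). It first calls DatanodeUtil.idToBlockDir() to get the replica's storage path under the finalized directory, then moves the replica's block file and checksum file into that path, and finally updates dfsUsage to reflect the new disk usage.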


  File addFinalizedBlock(Block b, ReplicaInfo replicaInfo) throws IOException {
    // Call DatanodeUtil.idToBlockDir() to get the replica's storage path under finalized
    File blockDir = DatanodeUtil.idToBlockDir(finalizedDir, b.getBlockId());
    fileIoProvider.mkdirsWithExistsCheck(volume, blockDir);

    // Move the block file and its checksum (meta) file into the storage path
    File blockFile = FsDatasetImpl.moveBlockFiles(b, replicaInfo, blockDir);
    File metaFile = FsDatasetUtil.getMetaFile(blockFile, b.getGenerationStamp());

    // Update dfsUsage
    if (dfsUsage instanceof CachingGetSpaceUsed) {
      ((CachingGetSpaceUsed) dfsUsage).incDfsUsed(
          b.getNumBytes() + metaFile.length());
    }
    return blockFile;
  }
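The three steps (compute the target directory, move both files, account for the space) can be tried outside Hadoop with java.nio. This is a sketch under assumptions: FinalizeReplicaSketch is a hypothetical class, the AtomicLong counter stands in for dfsUsage, and the order in which the two files are moved is this sketch's choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the finalize step, using java.nio instead of FileIoProvider. */
class FinalizeReplicaSketch {
  final Path finalizedDir;
  final AtomicLong dfsUsed = new AtomicLong(); // stands in for dfsUsage

  FinalizeReplicaSketch(Path finalizedDir) {
    this.finalizedDir = finalizedDir;
  }

  /** Moves a block file and its meta file under finalized and returns the new block path. */
  Path finalizeReplica(Path blockFile, Path metaFile, long blockId) throws IOException {
    // Same index computation as the 32x32 layout described earlier.
    int d1 = (int) ((blockId >> 16) & 0x1F);
    int d2 = (int) ((blockId >> 8) & 0x1F);
    Path blockDir = finalizedDir.resolve("subdir" + d1).resolve("subdir" + d2);
    Files.createDirectories(blockDir);

    Path newMeta = Files.move(metaFile, blockDir.resolve(metaFile.getFileName()));
    Path newBlock = Files.move(blockFile, blockDir.resolve(blockFile.getFileName()));

    // Account for both files, as addFinalizedBlock() does with dfsUsage.
    dfsUsed.addAndGet(Files.size(newBlock) + Files.size(newMeta));
    return newBlock;
  }
}
```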

Replica files under the finalized directory are stored in a multi-level subdirectory layout, and DatanodeUtil.idToBlockDir() determines the storage path of a given replica.


  /**
   * Get the directory where a finalized block with this ID should be stored.
   * Do not attempt to create the directory.
   * @param root the root directory where finalized blocks are stored
   * @param blockId
   * @return
   */
  public static File idToBlockDir(File root, long blockId) {


    // Debug output added for this walkthrough (not part of the Hadoop source)
    System.out.println(blockId + " ====>   " + Long.toBinaryString(blockId));
    System.out.println(" 0x1F ====> long :   " + 0x1F);
    System.out.println(" 0x1F ====>   " + Long.toBinaryString(0x1F));

    // The low 5 bits of the third byte of blockId are the first-level directory index
    int d1 = (int) ((blockId >> 16) & 0x1F);
    System.out.println(" d1 ====>   " + Integer.toBinaryString(d1));

    // The low 5 bits of the second byte of blockId are the second-level directory index
    int d2 = (int) ((blockId >> 8) & 0x1F);
    System.out.println(" d2 ====>   " + Integer.toBinaryString(d2));

    // Build the storage path and return it
    String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP +
        DataStorage.BLOCK_SUBDIR_PREFIX + d2;
    return new File(root, path);
  }

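Loading the state of replicas stored in the block pool directory
At startup, the Datanode calls BlockPoolSlice.getVolumeMap() to obtain the state of every block replica held by the current BlockPoolSlice, and keeps the replica information and states in a ReplicaMap object. Note that the ReplicaMap is populated only with FINALIZED replicas and with replicas from the rbw directory (RBW, RWR, RUR). getVolumeMap() is shown below: it first recovers lazypersist replicas by moving them to FINALIZED, and then loads all replicas stored in the finalized and rbw directories, either from the replica cache file or by running addToReplicasMap() over both directories.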



  void getVolumeMap(ReplicaMap volumeMap,
                    final RamDiskReplicaTracker lazyWriteReplicaMap)
      throws IOException {
    // Recover lazy persist replicas, they will be added to the volumeMap
    // when we scan the finalized directory.
    if (lazypersistDir.exists()) {
      int numRecovered = moveLazyPersistReplicasToFinalized(lazypersistDir);
      FsDatasetImpl.LOG.info(
          "Recovered " + numRecovered + " replicas from " + lazypersistDir);
    }

    // Try the replica cache file first; if that fails, scan finalized and rbw on disk
    boolean  success = readReplicasFromCache(volumeMap, lazyWriteReplicaMap);
    if (!success) {
      List<IOException> exceptions = Collections
          .synchronizedList(new ArrayList<IOException>());
      Queue<RecursiveAction> subTaskQueue =
          new ConcurrentLinkedQueue<RecursiveAction>();

      // add finalized replicas
      AddReplicaProcessor task = new AddReplicaProcessor(volumeMap,
          finalizedDir, lazyWriteReplicaMap, true, exceptions, subTaskQueue);
      ForkJoinTask<Void> finalizedTask = addReplicaThreadPool.submit(task);

      // add rbw replicas
      task = new AddReplicaProcessor(volumeMap, rbwDir, lazyWriteReplicaMap,
          false, exceptions, subTaskQueue);
      ForkJoinTask<Void> rbwTask = addReplicaThreadPool.submit(task);

      try {
        finalizedTask.get();
        rbwTask.get();
      } catch (InterruptedException | ExecutionException e) {
        exceptions.add(new IOException(
            "Failed to start sub tasks to add replica in replica map :"
                + e.getMessage()));
      }

      //wait for all the tasks to finish.
      waitForSubTaskToFinish(subTaskQueue, exceptions);
    }
  }
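The disk scan relies on the fork/join framework: one task per directory, with a new task forked for every subdirectory. The sketch below shows that pattern in isolation. ParallelScanSketch is a hypothetical class, and unlike getVolumeMap(), which collects sub tasks in a shared queue and waits on it afterwards, each task here simply joins the sub tasks it forked.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Hypothetical sketch of a recursive, parallel scan for block files. */
class ParallelScanSketch {

  static final class ScanTask extends RecursiveAction {
    private final File dir;
    private final Queue<String> found;

    ScanTask(File dir, Queue<String> found) {
      this.dir = dir;
      this.found = found;
    }

    @Override
    protected void compute() {
      File[] files = dir.listFiles();
      if (files == null) {
        return; // not a directory, or an I/O error
      }
      List<ScanTask> subTasks = new ArrayList<>();
      for (File f : files) {
        if (f.isDirectory()) {
          ScanTask sub = new ScanTask(f, found);
          sub.fork(); // scan the subdirectory in parallel
          subTasks.add(sub);
        } else if (f.getName().matches("blk_-?\\d+")) {
          found.add(f.getName()); // block file; .meta files are skipped
        }
      }
      for (ScanTask sub : subTasks) {
        sub.join();
      }
    }
  }

  /** Scans root recursively and returns the names of all block files found. */
  static Queue<String> scan(File root, int parallelism) {
    Queue<String> found = new ConcurrentLinkedQueue<>();
    ForkJoinPool pool = new ForkJoinPool(parallelism);
    try {
      pool.invoke(new ScanTask(root, found));
    } finally {
      pool.shutdown();
    }
    return found;
  }
}
```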


  private boolean readReplicasFromCache(ReplicaMap volumeMap,  final RamDiskReplicaTracker lazyWriteReplicaMap) {

    ReplicaMap tmpReplicaMap = new ReplicaMap(new AutoCloseableLock());
    File replicaFile = new File(currentDir, REPLICA_CACHE_FILE);

    // Check whether the file exists or not.
    if (!replicaFile.exists()) {
      LOG.info("Replica Cache file: "+  replicaFile.getPath() +
          " doesn't exist ");
      return false;
    }

    long fileLastModifiedTime = replicaFile.lastModified();

    if (System.currentTimeMillis() > fileLastModifiedTime + replicaCacheExpiry) {
      LOG.info("Replica Cache file: " + replicaFile.getPath() +
          " has gone stale");
      // Just to make findbugs happy
      if (!replicaFile.delete()) {
        LOG.info("Replica Cache file: " + replicaFile.getPath() +
            " cannot be deleted");
      }
      return false;
    }
    FileInputStream inputStream = null;
    try {

      inputStream = fileIoProvider.getFileInputStream(volume, replicaFile);

      BlockListAsLongs blocksList =  BlockListAsLongs.readFrom(inputStream, maxDataLength);

      if (blocksList == null) {
        return false;
      }

      for (BlockReportReplica replica : blocksList) {
        switch (replica.getState()) {
        case FINALIZED:
          // Add FINALIZED replicas to the temporary ReplicaMap
          addReplicaToReplicasMap(replica, tmpReplicaMap, lazyWriteReplicaMap, true);
          break;
        case RUR:
        case RBW:
        case RWR:
          // Add RUR, RBW, and RWR replicas through the non-finalized path
          addReplicaToReplicasMap(replica, tmpReplicaMap, lazyWriteReplicaMap, false);
          break;
        default:
          break;
        }
      }
      // Now it is safe to add the replica into volumeMap
      // In case of any exception during parsing this cache file, fall back
      // to scan all the files on disk.
      for (Iterator<ReplicaInfo> iter =
          tmpReplicaMap.replicas(bpid).iterator(); iter.hasNext(); ) {
        ReplicaInfo info = iter.next();
        // We use a lightweight GSet to store replicaInfo, we need to remove
        // it from one GSet before adding to another.
        iter.remove();
        volumeMap.add(bpid, info);
      }
      LOG.info("Successfully read replica from cache file : "
          + replicaFile.getPath());
      return true;
    } catch (Exception e) {
      // Any exception we need to revert back to read from disk
      // Log the error and return false
      LOG.info("Exception occurred while reading the replicas cache file: "
          + replicaFile.getPath(), e );
      return false;
    }
    finally {
      // close the inputStream
      IOUtils.closeStream(inputStream);

      if (!fileIoProvider.delete(volume, replicaFile)) {
        LOG.info("Failed to delete replica cache file: " +
            replicaFile.getPath());
      }
    }
  }
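A detail worth noticing in readReplicasFromCache() is that entries are first collected in tmpReplicaMap and only copied into volumeMap once the whole cache file has been parsed, so a corrupt file never leaves a half-filled map behind. The same stage-then-commit idea in miniature, with a hypothetical class StagedLoadSketch and a made-up "blockId:state" entry format:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of staging parsed entries before committing them. */
class StagedLoadSketch {

  /**
   * Parses "blockId:state" entries into target. Returns false and leaves
   * target untouched if any entry is malformed.
   */
  static boolean loadAll(List<String> cacheEntries, Map<Long, String> target) {
    Map<Long, String> staged = new HashMap<>();
    try {
      for (String entry : cacheEntries) {
        String[] parts = entry.split(":");
        staged.put(Long.parseLong(parts[0]), parts[1]);
      }
    } catch (RuntimeException e) {
      // Bad number or missing field: the caller falls back to a disk scan.
      return false;
    }
    target.putAll(staged); // commit only after every entry parsed cleanly
    return true;
  }
}
```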

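addToReplicasMap() walks every replica file under the directory given by the dir parameter. Replicas in the finalized directory are loaded as FINALIZED. For any other directory, the helper addReplicaToReplicasMap() checks whether a .restart file exists for the replica (during a quick restart the Datanode saves a file on disk that records an approximate restart time, which is used to recover the file's IO stream). If the file exists and the restart window has not yet expired, the replica can be recovered and is loaded as RBW; otherwise it is loaded as RWR.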



  /**
   * Add replicas under the given directory to the volume map
   * @param volumeMap the replicas map
   * @param dir an input directory
   * @param lazyWriteReplicaMap Map of replicas on transient
   *                                storage.
   * @param isFinalized true if the directory has finalized replicas;
   *                    false if the directory has rbw replicas
   * @param exceptions list of exception which need to return to parent thread.
   * @param subTaskQueue queue of sub tasks
   */
  void addToReplicasMap(ReplicaMap volumeMap, File dir,
      final RamDiskReplicaTracker lazyWriteReplicaMap, boolean isFinalized,
      List<IOException> exceptions, Queue<RecursiveAction> subTaskQueue)
      throws IOException {
    // Iterate over all files in the directory
    File[] files = fileIoProvider.listFiles(volume, dir);

    Arrays.sort(files, FILE_COMPARATOR);

    for (int i = 0; i < files.length; i++) {

      File file = files[i];

      if (file.isDirectory()) {

        // Launch new sub task.
        AddReplicaProcessor subTask = new AddReplicaProcessor(volumeMap, file,
            lazyWriteReplicaMap, isFinalized, exceptions, subTaskQueue);
        subTask.fork();
        subTaskQueue.add(subTask);
      }

      // If this is an unlinked tmp file, recover it first
      if (isFinalized && FsDatasetUtil.isUnlinkTmpFile(file)) {
        file = recoverTempUnlinkedBlock(file);
        if (file == null) { // the original block still exists, so we cover it
          // in another iteration and can continue here
          continue;
        }
      }
      if (!Block.isBlockFilename(file)) {
        continue;
      }

      long genStamp = FsDatasetUtil.getGenerationStampFromFile(
          files, file, i);

      long blockId = Block.filename2id(file.getName());

      Block block = new Block(blockId, file.length(), genStamp);

      addReplicaToReplicasMap(block, volumeMap, lazyWriteReplicaMap,  isFinalized);
    }
  }


  private void addReplicaToReplicasMap(Block block, ReplicaMap volumeMap,
      final RamDiskReplicaTracker lazyWriteReplicaMap,boolean isFinalized)
      throws IOException {
    ReplicaInfo newReplica = null;
    long blockId = block.getBlockId();
    long genStamp = block.getGenerationStamp();

    // Replicas from the finalized directory are all loaded as FINALIZED
    if (isFinalized) {
      newReplica = new ReplicaBuilder(ReplicaState.FINALIZED)
          .setBlockId(blockId)
          .setLength(block.getNumBytes())
          .setGenerationStamp(genStamp)
          .setFsVolume(volume)
          .setDirectoryToUse(DatanodeUtil.idToBlockDir(finalizedDir, blockId))
          .build();
    } else {
      File file = new File(rbwDir, block.getBlockName());
      boolean loadRwr = true;
      File restartMeta = new File(file.getParent()  +
          File.pathSeparator + "." + file.getName() + ".restart");
      Scanner sc = null;
      try {
        // Check that the restart meta file exists and the restart window is still open
        sc = new Scanner(restartMeta, "UTF-8");
        // The restart meta file exists
        if (sc.hasNextLong() && (sc.nextLong() > timer.now())) {
          // It didn't expire. Load the replica as a RBW.
          // We don't know the expected block length, so just use 0
          // and don't reserve any more space for writes.
          newReplica = new ReplicaBuilder(ReplicaState.RBW)
              .setBlockId(blockId)
              .setLength(validateIntegrityAndSetLength(file, genStamp))
              .setGenerationStamp(genStamp)
              .setFsVolume(volume)
              .setDirectoryToUse(file.getParentFile())
              .setWriterThread(null)
              .setBytesToReserve(0)
              .build();
          loadRwr = false;
        }
        sc.close();
        if (!fileIoProvider.delete(volume, restartMeta)) {
          FsDatasetImpl.LOG.warn("Failed to delete restart meta file: " +
              restartMeta.getPath());
        }
      } catch (FileNotFoundException fnfe) {
        // nothing to do hereFile dir =
      } finally {
        if (sc != null) {
          sc.close();
        }
      }
      // Restart meta doesn't exist or expired: load the replica as RWR.
      if (loadRwr) {
        ReplicaBuilder builder = new ReplicaBuilder(ReplicaState.RWR)
            .setBlockId(blockId)
            .setLength(validateIntegrityAndSetLength(file, genStamp))
            .setGenerationStamp(genStamp)
            .setFsVolume(volume)
            .setDirectoryToUse(file.getParentFile());
        newReplica = builder.build();
      }
    }

    // Put the replica into volumeMap; if an entry already exists, it is returned instead
    ReplicaInfo tmpReplicaInfo = volumeMap.addAndGet(bpid, newReplica);

    ReplicaInfo oldReplica = (tmpReplicaInfo == newReplica) ? null
        : tmpReplicaInfo;
    if (oldReplica != null) {
      // We have multiple replicas of the same block so decide which one
      // to keep.
      newReplica = resolveDuplicateReplicas(newReplica, oldReplica, volumeMap);
    }

    // If we are retaining a replica on transient storage make sure
    // it is in the lazyWriteReplicaMap so it can be persisted
    // eventually.
    if (newReplica.getVolume().isTransientStorage()) {
      lazyWriteReplicaMap.addReplica(bpid, blockId,
          (FsVolumeImpl) newReplica.getVolume(), 0);
    } else {
      lazyWriteReplicaMap.discardReplica(bpid, blockId, false);
    }
    if (oldReplica == null) {
      incrNumBlocks();
    }
  }
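The RBW-versus-RWR decision depends only on the restart meta file: if it holds a timestamp that is still in the future, the replica is loaded as RBW, otherwise as RWR. A stripped-down sketch of that check follows; RestartMarkerSketch is a hypothetical class that takes the current time as a parameter instead of using Hadoop's Timer and, unlike the real method, does not delete the marker file.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/** Hypothetical sketch of the restart-window check for replicas in rbw. */
class RestartMarkerSketch {

  enum Loaded { RBW, RWR }

  /** Decides the load state from the marker file content and the current time. */
  static Loaded stateFor(File restartMeta, long nowMillis) {
    try (Scanner sc = new Scanner(restartMeta, "UTF-8")) {
      // The marker stores a deadline; a deadline in the future means the window is open.
      if (sc.hasNextLong() && sc.nextLong() > nowMillis) {
        return Loaded.RBW;
      }
    } catch (FileNotFoundException e) {
      // No marker: not a quick restart, fall through to RWR.
    }
    return Loaded.RWR;
  }
}
```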

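recoverTempUnlinkedBlock()
addToReplicasMap() calls recoverTempUnlinkedBlock() to recover temporary files (*.unlinked files). First, what is such a temporary file? While a Datanode is being upgraded, or before it is rolled back to the previous version, the previous and current directories use hard links to reference the same blocks. If a client appends to any block at that point, the write affects both references, that is, it changes the snapshot of the previous version's data. The hard link from the current directory to that block therefore has to be removed, so that later changes to the block do not affect the previous-version snapshot kept in the previous directory.
The current Datanode logic is to copy the block that is about to change into a temporary file named "${blockName}.unlinked". The temporary file has no hard links, and its content is identical to the original file. Once the temporary file has been created, it is renamed to the original file name. At that point current and previous hold two separate block files with identical content.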
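The copy-then-rename step can be reproduced with java.nio. This is a sketch of the idea rather than the Datanode code: UnlinkBlockSketch is a hypothetical class, and it assumes the file system replaces the target on rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical sketch of breaking a hard link by copying and renaming. */
class UnlinkBlockSketch {

  /** Replaces blockFile with a private copy so that other hard links keep the old data. */
  static void unlink(Path blockFile) throws IOException {
    Path tmp = blockFile.resolveSibling(blockFile.getFileName() + ".unlinked");
    // The copy is a new file with no hard links and identical content.
    Files.copy(blockFile, tmp, StandardCopyOption.REPLACE_EXISTING);
    // Renaming over the original swaps in the copy; other links still see the old file.
    Files.move(tmp, blockFile, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

If the process dies between the copy and the rename, a stray *.unlinked file is left next to the original block file, which is the situation the recovery step has to clean up.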

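recoverTempUnlinkedBlock() recovers these temporary files (*.unlinked files). The logic is simple: check whether the original file exists; if it does not, rename the temporary file to the original name; if it does, delete the temporary file.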


  /**
   * Recover an unlinked tmp file on datanode restart. If the original block
   * does not exist, then the tmp file is renamed to be the
   * original file name and the original name is returned; otherwise the tmp
   * file is deleted and null is returned.
   */
  File recoverTempUnlinkedBlock(File unlinkedTmp) throws IOException {
    File blockFile = FsDatasetUtil.getOrigFile(unlinkedTmp);
    if (blockFile.exists()) {
      // If the original block file still exists, then no recovery is needed.
      if (!fileIoProvider.delete(volume, unlinkedTmp)) {
        // The original file exists, so the tmp file is deleted; throw if the delete fails
        throw new IOException("Unable to cleanup unlinked tmp file " +
            unlinkedTmp);
      }
      return null;
    } else {
      // The original file does not exist, so rename the tmp file to the original name
      fileIoProvider.rename(volume, unlinkedTmp, blockFile);
      return blockFile;
    }
  }
