Hadoop2 Source Code Analysis - HDFS Core Module Analysis

Posted by 哥不是小蘿莉 on 2015-06-04

1.Overview

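This post continues from "Hadoop2 Source Code Analysis - A First Look at the RPC Mechanism". In the earlier posts we analyzed and explored MapReduce, serialization, and RPC, and gained a general understanding of these modules in Hadoop V2: the execution flow and internal implementation of MapReduce, Hadoop's serialization, and its communication mechanism (RPC). Today we study another core module, Hadoop's distributed file storage system, HDFS. The topics covered today are: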

HDFS in Brief
NameNode
DataNode

Let's get started.

2.HDFS in Brief

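HDFS stands for Hadoop Distributed File System. It has a few basic concepts, the first of which is the data block (Block). HDFS is designed to support large files, and the programs that run on it process large data sets. These programs write their data once and read it one or more times, and the reads need to be served at streaming speed; in other words, HDFS supports write-once, read-many access to files. The default block size in Hadoop 2.x is 128 MB (it was 64 MB in Hadoop 1.x). An HDFS file is split into blocks of that size, and each block can be placed on a different DataNode if necessary. A file smaller than one block does not occupy a full block's worth of storage.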
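
To make the block arithmetic concrete, here is a minimal stand-alone sketch that splits a file length into block lengths. It is not HDFS code: the class and method names are made up for illustration, the block size is simply a parameter, and the sizes used in main are arbitrary example values.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitSketch {

    /**
     * Returns the length in bytes of each block that a file of fileSize bytes
     * would be cut into for the given block size. Only the last block can be
     * shorter than blockSize.
     */
    public static List<Long> blockLengths(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> lengths = new ArrayList<>();
        long remaining = fileSize;
        while (remaining > 0) {
            long length = Math.min(blockSize, remaining);
            lengths.add(length);
            remaining -= length;
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb; // example value only
        // A file larger than one block is cut into several blocks.
        System.out.println("300 MB file: " + blockLengths(300 * mb, blockSize));
        // A small file yields a single block whose length is the file length.
        System.out.println("1 MB file:   " + blockLengths(1 * mb, blockSize));
    }
}
```

The second call illustrates the last point of the paragraph: a 1 MB file produces one block of 1 MB, not a block padded to the full block size.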

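HDFS provides an abstract class for operating on file systems, org.apache.hadoop.fs.FileSystem. The class belongs to the Hadoop-Common part, and its source is located at: hadoop-2.6.0-src/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java. Part of the FileSystem source is shown below: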

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class FileSystem extends Configured implements Closeable {
        // body omitted
        // ...            
}

We can use this abstract class to operate on the contents of HDFS. An example implementation is shown below:

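// Assumes the enclosing class defines conf (an org.apache.hadoop.conf.Configuration) and LOGGER.
private static void dfs() {
        FileSystem fs = null;
        try {
            fs = FileSystem.get(conf);// get the FileSystem instance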
            FileStatus[] list = fs.listStatus(new Path("/"));// file status list
            for (FileStatus file : list) {
                LOGGER.info(file.getPath().getName());// print file names
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOGGER.error("Get hdfs path has error,msg is " + e.getMessage());
        } finally {
            try {
                if (fs != null) {
                    fs.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
                LOGGER.error("Close fs object has error,msg is " + e.getMessage());
            }
        }
    }

Next, let's look at another concept: the metadata node (NameNode) and the data node (DataNode). These two are the core modules of HDFS, and we examine each of them in turn.

3.NameNode

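The NN manages the file system NameSpace. It keeps the metadata of all files and directories in a file system tree, and it is the manager of the file directory and of file allocation in HDFS. The important information it stores is shown below:

[Figure: important information stored by the NN]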

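An HDFS cluster may contain hundreds or thousands of DataNode (DN) nodes. These DNs communicate with the NameNode (NN) periodically and accept instructions from it. To reduce the load on the NN, the NN does not permanently store the block information reported by each DN; instead, it updates its mapping table from the state the DNs report. After a DN connects to the NN, it keeps sending heartbeats, and the heartbeat response carries the NN's instructions for that DN, such as deleting data or copying data to other DNs. Note that the NN never initiates a request to a DN; this is a client/server model in the strict sense. In addition, when a client operates on the HDFS cluster, the DNs cooperate with one another to keep the data consistent.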
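
The interaction described above can be sketched with a toy model. This is not the HDFS protocol or its classes: ToyNameNode, Command, Action, and all method names are hypothetical and exist only to show the shape of the exchange. The DN is always the caller, the NN rebuilds its block mapping from what the DN reports, and instructions for a DN wait in a queue until that DN's next heartbeat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HeartbeatSketch {

    /** Hypothetical command kinds, mirroring the two examples in the text. */
    public enum Action { DELETE_BLOCK, COPY_BLOCK }

    /** Hypothetical instruction the NN hands to a DN in a heartbeat response. */
    public static final class Command {
        public final Action action;
        public final long blockId;
        public final String targetDn; // only meaningful for COPY_BLOCK, otherwise null

        public Command(Action action, long blockId, String targetDn) {
            this.action = action;
            this.blockId = blockId;
            this.targetDn = targetDn;
        }
    }

    public static final class ToyNameNode {
        // Block-to-DN mapping, held only in memory and rebuilt from reports.
        private final Map<Long, Set<String>> blockToDns = new HashMap<>();
        // Instructions waiting for each DN's next heartbeat.
        private final Map<String, List<Command>> pending = new HashMap<>();

        /** Called by a DN: replaces whatever the NN believed this DN held. */
        public void blockReport(String dnId, Collection<Long> blockIds) {
            for (Set<String> holders : blockToDns.values()) {
                holders.remove(dnId);
            }
            blockToDns.values().removeIf(Set::isEmpty);
            for (Long blockId : blockIds) {
                blockToDns.computeIfAbsent(blockId, k -> new HashSet<>()).add(dnId);
            }
        }

        /** NN-internal: queue an instruction; nothing is sent to the DN yet. */
        public void schedule(String dnId, Command command) {
            pending.computeIfAbsent(dnId, k -> new ArrayList<>()).add(command);
        }

        /** Called by a DN: the response carries any queued instructions. */
        public List<Command> heartbeat(String dnId) {
            List<Command> commands = pending.remove(dnId);
            return commands == null ? Collections.<Command>emptyList() : commands;
        }

        public Set<String> locations(long blockId) {
            Set<String> holders = blockToDns.get(blockId);
            return holders == null ? Collections.<String>emptySet() : holders;
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.blockReport("dn1", Arrays.asList(1L, 2L));
        nn.schedule("dn1", new Command(Action.COPY_BLOCK, 1L, "dn2"));
        for (Command c : nn.heartbeat("dn1")) {
            System.out.println(c.action + " block " + c.blockId + " -> " + c.targetDn);
        }
        System.out.println("Queued after heartbeat: " + nn.heartbeat("dn1").size());
    }
}
```

Note that ToyNameNode has no method that calls a DN. That is the point of the design: every exchange starts on the DN side, and the NN only answers.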

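Part of the information stored on the NN is shown in the screenshot below:

[Figure: screenshot of part of the information stored on the NN]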

4.DataNode

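Now let's analyze the DN implementation. It consists of two parts: one manages the local Blocks, and the other exchanges data with other entities. We start with local Block management. When setting up a Hadoop cluster, we specify where Blocks are stored. The configured storage path can be found in the hdfs-site.xml file, as shown below: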

<property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/data/dfs/data</value>
</property>

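Then we log in to the DN node and find the corresponding storage directory, as shown in the figure below:

[Figure: listing of the DN storage directory]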

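Here in_use.lock is a lock file used for mutual exclusion: it ensures that only one DataNode process uses this storage directory at a time. The current directory holds the currently valid Blocks. Entering the current directory shows the contents in the figure below:

[Figure: contents of the current directory]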

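VERSION holds metadata about the storage directory. Next to it are a series of Block files and Meta files; the Block files hold the actual HDFS data. Each Block is replicated on multiple DN nodes, and the replication factor can be adjusted with the following property in hdfs-site.xml: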

<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>
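
As a rough illustration of what the replication factor means for storage, the sketch below computes how many block replicas a file produces and how many raw bytes of block data the cluster holds for it. The class and method names are hypothetical, the block size is an example parameter, and the calculation ignores the accompanying Meta files.

```java
public class ReplicationSketch {

    /** Number of blocks needed for a file (ceiling division). */
    public static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    /** Total number of block replicas stored across all DataNodes. */
    public static long totalReplicas(long fileSize, long blockSize, int replication) {
        if (replication <= 0) {
            throw new IllegalArgumentException("replication must be > 0");
        }
        return blockCount(fileSize, blockSize) * replication;
    }

    /** Raw bytes of block data stored across the cluster for one file. */
    public static long rawBytes(long fileSize, int replication) {
        if (fileSize < 0 || replication <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and replication > 0");
        }
        return fileSize * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 300 * mb;  // example file
        long blockSize = 128 * mb; // example block size
        int replication = 3;       // matches the dfs.replication value above
        System.out.println("blocks:   " + blockCount(fileSize, blockSize));
        System.out.println("replicas: " + totalReplicas(fileSize, blockSize, replication));
        System.out.println("raw MB:   " + rawBytes(fileSize, replication) / mb);
    }
}
```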

Next, let's look at the DataNode class. Part of its code is shown below:

@VisibleForTesting
  @InterfaceAudience.Private
  public static DataNode createDataNode(String args[], Configuration conf,
      SecureResources resources) throws IOException {
    DataNode dn = instantiateDataNode(args, conf, resources);// init dn
    if (dn != null) {
      dn.runDatanodeDaemon();// start the DN services (block pools, data transfer and IPC servers)
    }
    return dn;
  }

/** Instantiate a single datanode object, along with its secure resources. 
   * This must be run by invoking{@link DataNode#runDatanodeDaemon()} 
   * subsequently. 
   */
  public static DataNode instantiateDataNode(String args [], Configuration conf,
      SecureResources resources) throws IOException {
    if (conf == null)
      conf = new HdfsConfiguration();
    
    if (args != null) {
      // parse generic hadoop options
      GenericOptionsParser hParser = new GenericOptionsParser(conf, args);
      args = hParser.getRemainingArgs();
    }
    
    if (!parseArguments(args, conf)) {
      printUsage(System.err);
      return null;
    }
    Collection<StorageLocation> dataLocations = getStorageLocations(conf);
    UserGroupInformation.setConfiguration(conf);
    SecurityUtil.login(conf, DFS_DATANODE_KEYTAB_FILE_KEY,
        DFS_DATANODE_KERBEROS_PRINCIPAL_KEY);
    return makeInstance(dataLocations, conf, resources);
  }

static DataNode makeInstance(Collection<StorageLocation> dataDirs,
      Configuration conf, SecureResources resources) throws IOException {
    LocalFileSystem localFS = FileSystem.getLocal(conf);
    FsPermission permission = new FsPermission(
        conf.get(DFS_DATANODE_DATA_DIR_PERMISSION_KEY,
                 DFS_DATANODE_DATA_DIR_PERMISSION_DEFAULT));
    DataNodeDiskChecker dataNodeDiskChecker =
        new DataNodeDiskChecker(permission);
    List<StorageLocation> locations =
        checkStorageLocations(dataDirs, localFS, dataNodeDiskChecker);
    DefaultMetricsSystem.initialize("DataNode");

    assert locations.size() > 0 : "number of data directories should be > 0";
    return new DataNode(conf, locations, resources);// create the DN object
  }

public void runDatanodeDaemon() throws IOException {
    blockPoolManager.startAll();

    // start dataXceiverServer
    dataXceiverServer.start();
    if (localDataXceiverServer != null) {
      localDataXceiverServer.start();
    }
    ipcServer.start();
    startPlugins(conf);
  }

public static void secureMain(String args[], SecureResources resources) {
    int errorCode = 0;
    try {
      StringUtils.startupShutdownMessage(DataNode.class, args, LOG);
      DataNode datanode = createDataNode(args, null, resources);
      if (datanode != null) {
        datanode.join();
      } else {
        errorCode = 1;
      }
    } catch (Throwable e) {
      LOG.fatal("Exception in secureMain", e);
      terminate(1, e);
    } finally {
      // We need to terminate the process here because either shutdown was called
      // or some disk related conditions like volumes tolerated or volumes required
      // condition was not met. Also, In secure mode, control will go to Jsvc
      // and Datanode process hangs if it does not exit.
      LOG.warn("Exiting Datanode");
      terminate(errorCode);
    }
  }
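
The makeInstance method shown above validates the configured data directories before it constructs the DataNode, and it expects at least one directory to remain. The following is a simplified stand-alone sketch of that idea using only java.nio.file. It is not Hadoop's checkStorageLocations: the class name, the method name, and the placeholder path are made up for illustration, and the real code does more (for example, it takes a permission setting, as the snippet above shows).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class DataDirCheckSketch {

    /** Keeps only the configured directories that exist and are readable and writable. */
    public static List<Path> usableDirs(Collection<String> configuredDirs) {
        List<Path> usable = new ArrayList<>();
        for (String dir : configuredDirs) {
            Path path = Paths.get(dir.trim());
            if (Files.isDirectory(path) && Files.isReadable(path) && Files.isWritable(path)) {
                usable.add(path);
            } else {
                System.err.println("Skipping unusable data directory: " + path);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        // The temp directory stands in for a real data directory;
        // the second path is a placeholder that should not exist.
        List<String> configured = Arrays.asList(
                System.getProperty("java.io.tmpdir"),
                "/hypothetical/missing/dfs/data");
        List<Path> usable = usableDirs(configured);
        if (usable.isEmpty()) {
            // Same condition that makeInstance asserts on.
            throw new IllegalStateException("number of data directories should be > 0");
        }
        System.out.println("Usable data directories: " + usable);
    }
}
```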

Main Function Entry Point

The entry point (main function) of the DN class is shown below:

 public static void main(String args[]) {
    if (DFSUtil.parseHelpArgument(args, DataNode.USAGE, System.out, true)) {
      System.exit(0);
    }

    secureMain(args, null);
  }

5.Summary

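When studying the HDFS modules, it is important to understand the function and role of each one. This post presented some code fragments from the DN class and added comments to their important parts. For the detailed flow and code, read the HDFS portion of the Hadoop source.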

6.Closing Remarks

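That is all for this post. If you run into any problems while studying, you can join the group to discuss them or send me an email, and I will do my best to answer. Let us encourage one another!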
