MapReduce InputFormat: FileInputFormat

Posted by Thinkgamer_gyt on 2015-11-30

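1. A quick look at the InputFormat class

InputFormat describes the format of the input data and provides two functions:

1) Data splitting: it divides the input data into a number of splits according to some strategy, which determines the number of Map Tasks (that is, the number of Mappers). In the MapReduce framework, one split means one Map Task.

2) Supplying input to the Mapper: given a split, it parses the split into key/value pairs (using the RecordReader object it provides).

Let's first look at the old InputFormat interface from the 1.0 release:

Java code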

public interface InputFormat<K,V>{
   // Returns all of the splits
   public InputSplit[] getSplits(JobConf job,int numSplits) throws IOException;

   // Returns the RecordReader that reads a split; it is the RecordReader
   // that actually parses the split into key/value pairs
   public RecordReader<K,V> getRecordReader(InputSplit split,
                               JobConf job,
                               Reporter reporter) throws IOException;
}

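InputSplit
The getSplits(...) method splits the data: it tries to divide the input data into numSplits InputSplit pieces. An InputSplit has the following characteristics:
1) It is a logical split. As covered earlier when comparing splits and blocks, a split only partitions the data logically; nothing is physically cut into splits on disk, and the data in HDFS is still stored in units of blocks. An InputSplit records only the metadata of the data a Mapper will process, such as the start offset, the length, and the nodes that hold it.

2) It is serializable. In Hadoop, serialization serves two purposes: inter-process communication and persistent storage. Here, InputSplit is used for inter-process communication.
Before the job is submitted to the JobTracker, the client calls getSplits() on the job's InputFormat and serializes the resulting split information to a file. When the job is initialized on the JobTracker side, that file is read, all the splits are parsed out of it, and the corresponding Map Tasks are created.
InputSplit is itself an interface; which implementation is returned is up to the concrete InputFormat. The interface declares only two methods of its own:

Java code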

public interface InputSplit extends Writable {
  /**
   * Returns the length of the split.
   *
   * @return the number of bytes in the input split.
   * @throws IOException
   */
  long getLength() throws IOException;

  /**
   * Returns the locations of this split, i.e. the machines in HDFS that hold
   * its data. The data may have several replicas, stored on several machines.
   *
   * @return list of hostnames where data of the <code>InputSplit</code> is
   *         located as an array of <code>String</code>s.
   * @throws IOException
   */
  String[] getLocations() throws IOException;
}

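When a split needs to be read, its InputSplit is passed to the second InputFormat method, getRecordReader, and used to initialize a RecordReader that parses the input data. The details that describe a split are hidden; only the concrete InputFormat knows them. An InputFormat only has to make sure that the InputSplit returned by getSplits and the InputSplit expected by getRecordReader are the same implementation, which gives InputFormat implementations a great deal of flexibility.
Take FileInputFormat, the most commonly used InputFormat in the MapReduce framework: internally it uses FileSplit to describe an InputSplit. Here is the definition of FileSplit:

Java code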

/** A section of an input file.  Returned by {@link
 * InputFormat#getSplits(JobConf, int)} and passed to
 * {@link InputFormat#getRecordReader(InputSplit,JobConf,Reporter)}.
 */
public class FileSplit extends org.apache.hadoop.mapreduce.InputSplit
                       implements InputSplit {
  // File that contains the split
  private Path file;
  // Start offset of the split
  private long start;
  // Length of the split
  private long length;
  // Names of the hosts that hold the split
  private String[] hosts;

  FileSplit() {}

  /** Constructs a split.
   * @deprecated
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   */
  @Deprecated
  public FileSplit(Path file, long start, long length, JobConf conf) {
    this(file, start, length, (String[])null);
  }

  /** Constructs a split with host information
   *
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   * @param hosts the list of hosts containing the block, possibly null
   */
  public FileSplit(Path file, long start, long length, String[] hosts) {
    this.file = file;
    this.start = start;
    this.length = length;
    this.hosts = hosts;
  }

  /** The file containing this split's data. */
  public Path getPath() { return file; }

  /** The position of the first byte in the file to process. */
  public long getStart() { return start; }

  /** The number of bytes in the file to process. */
  public long getLength() { return length; }

  public String toString() { return file + ":" + start + "+" + length; }

  ////////////////////////////////////////////
  // Writable methods
  ////////////////////////////////////////////

  public void write(DataOutput out) throws IOException {
    UTF8.writeString(out, file.toString());
    out.writeLong(start);
    out.writeLong(length);
  }

  public void readFields(DataInput in) throws IOException {
    file = new Path(UTF8.readString(in));
    start = in.readLong();
    length = in.readLong();
    hosts = null;
  }

  public String[] getLocations() throws IOException {
    if (this.hosts == null) {
      return new String[]{};
    } else {
      return this.hosts;
    }
  }
}

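As the code above shows, FileSplit is simply an implementation of the InputSplit interface. The RecordReader used by the InputFormat takes the FileSplit and reads from it the file, start offset, length and host locations of the data it needs.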
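
To make the serialization behaviour concrete, here is a minimal, self-contained sketch that mimics the write/readFields pair of FileSplit with plain java.io streams. MiniFileSplit is a hypothetical stand-in, not a Hadoop class, and it uses writeUTF/readUTF where the real class uses UTF8.writeString/readString. The point is only that file, start and length survive the round trip while the host list does not.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for FileSplit; not part of Hadoop.
final class MiniFileSplit {
    String file;
    long start;
    long length;
    String[] hosts;

    MiniFileSplit() {}

    MiniFileSplit(String file, long start, long length, String[] hosts) {
        this.file = file;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    // Same field order as FileSplit.write(): file, start, length. Hosts are not written.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(file);
        out.writeLong(start);
        out.writeLong(length);
    }

    void readFields(DataInputStream in) throws IOException {
        file = in.readUTF();
        start = in.readLong();
        length = in.readLong();
        hosts = null; // the host list is not part of the serialized form
    }

    String[] getLocations() {
        return hosts == null ? new String[0] : hosts;
    }

    // Serializes a split to bytes and reads it back into a fresh object.
    static MiniFileSplit roundTrip(MiniFileSplit split) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            split.write(out);
            out.flush();
            MiniFileSplit copy = new MiniFileSplit();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniFileSplit original =
            new MiniFileSplit("/data/input.txt", 128, 64, new String[] {"node1", "node2"});
        MiniFileSplit copy = roundTrip(original);
        System.out.println(copy.file + ":" + copy.start + "+" + copy.length
            + " hosts=" + copy.getLocations().length);
    }
}
```

In this sketch the copy reports zero locations, which matches readFields() in the listing above setting hosts to null.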

RecordReader
The getRecordReader(...) method returns a RecordReader object that parses an input split into key/value pairs. While a Map Task runs, it repeatedly calls the RecordReader's methods to fetch the next key/value pair and hands it to map():

Java code

// Call InputFormat.getRecordReader() to obtain a RecordReader<K1,V1>;
// the RecordReader (input) then parses the split ...
K1 key = input.createKey();
V1 value = input.createValue();
while (input.next(key, value)) { // read the next key/value pair from input
    // call the user-written map() method
}
input.close();

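RecordReader has two main jobs:
● Locating record boundaries: FileInputFormat cuts files by data size, so a complete record may be cut in two, with the parts falling into two different splits. To handle reads that cross InputSplit boundaries, the RecordReader assigns the incomplete first record of each split to the previous split.
● Parsing key/value pairs: it locates the next record and breaks it into a key and a value for the Mapper to process.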
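
The boundary rule can be sketched without Hadoop. The class below is a hypothetical illustration that treats a byte array as the "file" and newline-terminated lines as records. It is modelled on the behaviour described above, not copied from Hadoop's reader: a split that does not start at offset 0 backs up one byte and discards everything through the next newline, and every split keeps reading until it finishes the last line that starts before its end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of record-boundary handling; not a Hadoop class.
final class SplitLineReader {

    // Returns "offset:line" for every line that belongs to the split [start, start + length).
    static List<String> readSplit(byte[] file, long start, long length) {
        int end = (int) Math.min(file.length, start + length);
        int pos = (int) start;
        if (pos != 0) {
            // Back up one byte and skip through the next newline: the partial
            // first line is handled by the previous split.
            pos = nextLineStart(file, pos - 1);
        }
        List<String> records = new ArrayList<>();
        while (pos < end) {
            int next = nextLineStart(file, pos); // may lie beyond 'end'
            int lineEnd = (file[next - 1] == '\n') ? next - 1 : next;
            records.add(pos + ":" + new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    // Index just past the next '\n' at or after 'from', or file.length if there is none.
    static int nextLineStart(byte[] file, int from) {
        int i = from;
        while (i < file.length && file[i] != '\n') {
            i++;
        }
        return Math.min(file.length, i + 1);
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, 0, 6)); // [0:aaa, 4:bbb]
        System.out.println(readSplit(file, 6, 6)); // [8:ccc]
    }
}
```

The line "bbb" straddles the boundary at byte 6, yet it is emitted exactly once, by the first split. The offsets in front of each line are positions in the whole file, not in the split.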

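InputFormat
MapReduce ships with a number of InputFormat implementations.

Below we look at a few representative ones:
FileInputFormat
FileInputFormat is an abstract class. Its most important job is to provide a common getSplits() method for the various file-based InputFormats. The core of that method is the file splitting algorithm and the host selection algorithm:

Java code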

/** Splits files returned by {@link #listStatus(JobConf)} when
   * they're too big.*/
@SuppressWarnings("deprecation")
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    FileStatus[] files = listStatus(job);

    // Save the number of input files in the job-conf
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDir()) {
        throw new IOException("Not a file: "+ file.getPath());
      }
      totalSize += file.getLen();
    }

    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    long minSize = Math.max(job.getLong("mapred.min.split.size", 1),
                            minSplitSize);

    // The collection of splits (FileSplit) to generate
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file: files) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job);
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        // Final split size; it may well differ from blockSize
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          // Hosts that hold the data of this split
          String[] splitHosts = getSplitHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          // Add a full-sized split
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
              splitHosts));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          // Add the remaining tail as the last split of this file
          splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkLocations.length-1].getHosts()));
        }
      } else if (length != 0) {
        // Non-splittable file: one split covers the whole file
        String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
        splits.add(new FileSplit(path, 0, length, splitHosts));
      } else {
        //Create empty hosts array for zero length files
        splits.add(new FileSplit(path, 0, length, new String[0]));
      }
    }
    LOG.debug("Total # of splits: " + splits.size());
    return splits.toArray(new FileSplit[splits.size()]);
}

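1) The file splitting algorithm
The file splitting algorithm determines the number of InputSplits and the data range each one covers. FileInputFormat generates InputSplits file by file. Three values determine the number of InputSplits:
● goalSize: computed as totalSize/numSplits, this is the InputSplit length derived from the number of InputSplits the user asks for; numSplits is the number of Map Tasks set by the user, 1 by default.
● minSize: the minimum InputSplit length, set by the configuration parameter mapred.min.split.size, 1 by default (the code above also takes the format's own minSplitSize into account).
● blockSize: the size of a file block in HDFS, 64MB by default.
These three values determine the final length of an InputSplit, computed as:
            splitSize = max{minSize, min{goalSize, blockSize}}
Once the split length is known, the number of InputSplits is determined as well.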
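
As a worked example, the sketch below applies the formula and then cuts one file the same way the while loop in getSplits() does. SplitSizeDemo and splitFile are hypothetical names for illustration. The value 1.1 for SPLIT_SLOP is the constant Hadoop's FileInputFormat defines; the listing above references it without showing the value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the split-size arithmetic; not a Hadoop class.
final class SplitSizeDemo {
    // Slack factor: the tail of a file is not cut again unless it exceeds
    // 1.1 times the split size.
    static final double SPLIT_SLOP = 1.1;

    // splitSize = max{minSize, min{goalSize, blockSize}}
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Cuts one file into {start, length} pairs, mirroring the loop in getSplits().
    static List<long[]> splitFile(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        long totalSize = 130 * mb;                                // one 130 MB input file
        int numSplits = 2;                                        // requested number of map tasks
        long goalSize = totalSize / numSplits;                    // 65 MB
        long splitSize = computeSplitSize(goalSize, 1, 64 * mb);  // 64 MB
        for (long[] s : splitFile(totalSize, splitSize)) {
            System.out.println("start=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With these numbers the file yields two splits, 0 to 64 MB and 64 to 130 MB. The 66 MB tail stays in one piece because 66/64 is below SPLIT_SLOP, so the last split of a file can be slightly larger than splitSize.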

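2) The host selection algorithm
Once the splitting scheme is fixed, the next step is to determine the metadata of each InputSplit. That metadata usually has four parts, <file,start,length,hosts>:
● file: the file the InputSplit belongs to;
● start: the start offset of the InputSplit within the file;
● length: the length of the InputSplit;
● hosts: the list of host nodes that hold the split.
The strategy used to choose an InputSplit's host list directly affects the data locality of the job. The blocks of a large file stored in HDFS may be spread across the whole Hadoop cluster, and the splitting algorithm may produce a split that covers several blocks that are not on the same node. A Map Task then has to read some of its blocks from other nodes, so data locality is lost and more network transfer is incurred.
The blocks of one InputSplit may sit on several data nodes, but for scheduling efficiency not every node involved is added to the host list. Instead, the few nodes holding the largest share of the split's data are chosen, and these serve as the main basis for deciding whether a task is local when it is scheduled.
FileInputFormat uses a heuristic host selection algorithm: it first sorts racks by the amount of the split's data each rack holds, then sorts the nodes within each rack by the amount of data each node holds, and finally takes the first N nodes (N being the block replication factor) as the InputSplit's host list. When the scheduler assigns a Task, placing it on any node in the host list is enough for the Task to count as local.
It follows that when an InputSplit is larger than a block, the Map Task cannot be fully data-local: part of the data always has to be read over the network from remote nodes. To improve Map Task data locality and reduce network transfer, the InputSplit size should be kept as close as possible to the HDFS block size.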
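
The heuristic can be sketched in a few lines. This is a simplified illustration of the rack-then-node ranking described above, not Hadoop's getSplitHosts() implementation. The class, the "rack/host" string encoding, the per-rack counting rule and the alphabetical tie-break are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of rack-first host ranking; not a Hadoop class.
final class HostSelectionSketch {

    // The part of one block that lies inside the split, plus its replicas as "rack/host".
    static final class BlockPart {
        final long bytesInSplit;
        final String[] replicas;

        BlockPart(long bytesInSplit, String... replicas) {
            this.bytesInSplit = bytesInSplit;
            this.replicas = replicas;
        }
    }

    // Picks up to n hosts: racks by bytes (descending), then hosts within each rack.
    static List<String> selectHosts(List<BlockPart> parts, int n) {
        Map<String, Long> rackBytes = new HashMap<>();
        Map<String, Map<String, Long>> hostBytesByRack = new HashMap<>();
        for (BlockPart part : parts) {
            Set<String> racksSeen = new HashSet<>();
            for (String replica : part.replicas) {
                int slash = replica.indexOf('/');
                String rack = replica.substring(0, slash);
                String host = replica.substring(slash + 1);
                hostBytesByRack.computeIfAbsent(rack, r -> new HashMap<>())
                               .merge(host, part.bytesInSplit, Long::sum);
                if (racksSeen.add(rack)) { // assumption: count a block once per rack
                    rackBytes.merge(rack, part.bytesInSplit, Long::sum);
                }
            }
        }
        List<String> hosts = new ArrayList<>();
        for (String rack : byBytesDescending(rackBytes)) {
            for (String host : byBytesDescending(hostBytesByRack.get(rack))) {
                if (hosts.size() == n) {
                    return hosts;
                }
                hosts.add(host);
            }
        }
        return hosts;
    }

    // Names sorted by byte count, largest first; ties broken alphabetically (assumption).
    static List<String> byBytesDescending(Map<String, Long> bytes) {
        List<String> names = new ArrayList<>(bytes.keySet());
        names.sort((a, b) -> {
            int cmp = Long.compare(bytes.get(b), bytes.get(a));
            return cmp != 0 ? cmp : a.compareTo(b);
        });
        return names;
    }

    // A split that covers two full 64 MB blocks and 10 MB of a third (sizes in MB).
    static List<String> demo() {
        List<BlockPart> parts = Arrays.asList(
            new BlockPart(64, "r1/h1", "r2/h3", "r2/h4"),
            new BlockPart(64, "r1/h1", "r1/h2", "r2/h3"),
            new BlockPart(10, "r1/h2", "r3/h5", "r3/h6"));
        return selectHosts(parts, 3);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [h1, h2, h3]
    }
}
```

In the demo, rack r1 holds 138 MB of the split and r2 holds 128 MB, so r1's hosts come first (h1 with 128 MB, h2 with 74 MB) and the third slot goes to h3, the best host in r2.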

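TextInputFormat
By default, MapReduce uses TextInputFormat to read splits and parse the records into key/value pairs. The key is the byte offset of the line within the whole file (note: not within a block), and the value is the content of the line.
CombineFileInputFormat
CombineFileInputFormat combines many files into the input of a single map. The idea is to divide the large files in the input directory among several maps while merging small files into the input of one map. It suits jobs that process many small files.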
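
To illustrate the idea only, the sketch below packs file lengths into combined splits with a simple greedy, size-based rule. This is not the algorithm CombineFileInputFormat actually uses (which also considers where the data lives); SmallFilePacker, pack and maxSplitSize are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical greedy packer; illustrates combining small files, not Hadoop's algorithm.
final class SmallFilePacker {

    // Groups file lengths, in order, into combined splits of at most maxSplitSize.
    // A file larger than maxSplitSize is first cut into pieces of at most that size.
    static List<List<Long>> pack(long[] fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long fileLength : fileLengths) {
            long remaining = fileLength;
            while (remaining > 0) {
                long piece = Math.min(remaining, maxSplitSize);
                if (currentSize + piece > maxSplitSize) {
                    // The piece does not fit: close the current split and start a new one.
                    splits.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
                current.add(piece);
                currentSize += piece;
                remaining -= piece;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Sizes in MB: two small files, one large file, two more small files.
        long[] files = {10, 20, 300, 5, 5};
        System.out.println(pack(files, 128)); // [[10, 20], [128], [128], [44, 5, 5]]
    }
}
```

Five files become four map inputs: the two leading small files share one split, the large file is divided, and its tail is merged with the trailing small files.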
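SequenceFileInputFormat
SequenceFileInputFormat is a FileInputFormat for sequence files, a binary format that stores data as key/value pairs. It is often combined with the LZO or Snappy compression codecs to read or store data files that remain splittable.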
