Hadoop Source Code Series --- Reading the Mapper Source: the Input Side

Posted by LHBlog on 2018-01-09

I. Preface

Last time we analyzed the client-side source code. This time we walk through the Mapper source so that everyone gets a clearer picture of the Hadoop framework.

II. Code

The user-defined code is as follows:

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

      
       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();
       
       public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
           
           
         StringTokenizer itr = new StringTokenizer(value.toString());
         
         
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
}
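
For context, a minimal driver that registers MyMapper on a job might look like the sketch below. This is an illustrative sketch, not code from the original post; the class name and the input/output paths taken from args are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // This is the class that taskContext.getMapperClass() later hands back to MapTask.
    job.setMapperClass(MyMapper.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
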
The Mapper class being extended (framework source):

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

Analysis: we overrode the map method, so when our Mapper instance is handed to run it is invoked repeatedly, once per key/value pair.
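
The Mapper source above also exposes setup and cleanup hooks, called once per task before the first and after the last map call. Below is a hypothetical sketch of overriding them; the stop-word filtering is invented purely for illustration and is not part of the original example.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StopWordMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  private Set<String> stopWords;

  @Override
  protected void setup(Context context) {
    // called once per map task, before the first map() call
    stopWords = new HashSet<>(Arrays.asList("a", "an", "the"));
  }

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();
      if (!stopWords.contains(token)) {
        word.set(token);
        context.write(word, one);
      }
    }
  }

  @Override
  protected void cleanup(Context context) {
    // called once per map task, after the last map() call
    stopWords = null;
  }
}
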

MapTask source analysis:

The Container encapsulates a script command; a remote call launches YarnChild. For a map task, YarnChild reflectively instantiates a MapTask object and invokes its run method.

The run method of MapTask:

public void run(final JobConf job, final TaskUmbilicalProtocol umbilical)
    throws IOException, ClassNotFoundException, InterruptedException {
    this.umbilical = umbilical;

    if (isMapTask()) {
      // If there are no reducers then there won't be any sort. Hence the map 
      // phase will govern the entire attempt's progress.
      if (conf.getNumReduceTasks() == 0) { // if there is no reduce phase
        mapPhase = getProgress().addPhase("map", 1.0f);
      } else {
        // If there are reducers then the entire attempt's progress will be 
        // split between the map phase (67%) and the sort phase (33%).
        mapPhase = getProgress().addPhase("map", 0.667f);
        sortPhase  = getProgress().addPhase("sort", 0.333f); // a sort phase only exists when there are reducers; with no reduce tasks there is nothing to sort
      }
    }
    if (useNewApi) {
      runNewMapper(job, splitMetaInfo, umbilical, reporter); // use the new API
    } else {
      runOldMapper(job, splitMetaInfo, umbilical, reporter);
    }
    done(umbilical, reporter);
  }
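
Since the sort phase only exists when reducers are present, a job that just needs raw map output can switch sorting off by configuring zero reduce tasks. An illustrative fragment, not taken from the post (the helper class is hypothetical):

import org.apache.hadoop.mapreduce.Job;

// Illustrative only: with zero reduce tasks, MapTask.run() above registers a
// single "map" phase covering 100% of the attempt's progress and never sorts.
public final class MapOnlyConfig {
  public static void makeMapOnly(Job job) {
    job.setNumReduceTasks(0); // no reducers -> no sort phase
  }
}
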

Analysis of runNewMapper:

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, // our user-configured job
                                                                  getTaskID(),
                                                                  reporter); // create the task attempt context
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job); // reflectively instantiate the user-defined Mapper class; see Analysis 1
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job); // reflectively instantiate the configured InputFormat class; see Analysis 2
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()), // each split entry corresponds to one MapTask; a split records four things (owning file, offset, length, location info)
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE> // see Analysis 3
        (split, inputFormat, reporter, taskContext); // the InputFormat and split prepared above are used to build the input: a stream read as text, one line at a time
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output =
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
    mapContext =
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), // see Analysis 4
          input, output, // the mapContext wraps both input and output, so values flow through the context; getCurrentKey in the Mapper's Context ultimately reads from the input's LineRecordReader
          committer,
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
        mapperContext =
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);
    try {
      input.initialize(split, mapperContext);   // initialize the input; see Analysis 5
      mapper.run(mapperContext);                // run the mapper; see Analysis 6
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);              // close the output
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

Analysis 1 source:

 @SuppressWarnings("unchecked")
  public Class<? extends Mapper<?,?,?,?>> getMapperClass() 
     throws ClassNotFoundException {
    return (Class<? extends Mapper<?,?,?,?>>) 
      conf.getClass(MAP_CLASS_ATTR, Mapper.class); // if the user configured a mapper class, take it from the config; otherwise fall back to the default Mapper
  }

Analysis 2 source:

 public Class<? extends InputFormat<?,?>> getInputFormatClass()
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class); // if the user set an InputFormat, use it; otherwise default to TextInputFormat
  }


Conclusion: the framework uses TextInputFormat by default.

Note: the inheritance chain is InputFormat > FileInputFormat > TextInputFormat.
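
If the default does not fit, the InputFormat can be overridden on the job. An illustrative sketch, not from the post (the helper class is hypothetical); it switches to KeyValueTextInputFormat:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Illustrative: with this setting, getInputFormatClass() above returns
// KeyValueTextInputFormat instead of the TextInputFormat default, so each line
// is split into a Text key and Text value (tab-separated by default).
public final class InputFormatConfig {
  public static void useKeyValueText(Job job) {
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  }
}
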

 

Analysis 3 source:

static class NewTrackingRecordReader<K,V> 
    extends org.apache.hadoop.mapreduce.RecordReader<K,V> {
    private final org.apache.hadoop.mapreduce.RecordReader<K,V> real;
    private final org.apache.hadoop.mapreduce.Counter inputRecordCounter;
    private final org.apache.hadoop.mapreduce.Counter fileInputByteCounter;
    private final TaskReporter reporter;
    private final List<Statistics> fsStats;
    
    NewTrackingRecordReader(org.apache.hadoop.mapreduce.InputSplit split,
        org.apache.hadoop.mapreduce.InputFormat<K, V> inputFormat,
        TaskReporter reporter,
        org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
        throws InterruptedException, IOException {
      this.reporter = reporter;
      this.inputRecordCounter = reporter
          .getCounter(TaskCounter.MAP_INPUT_RECORDS);
      this.fileInputByteCounter = reporter
          .getCounter(FileInputFormatCounter.BYTES_READ);

      List <Statistics> matchedStats = null;
      if (split instanceof org.apache.hadoop.mapreduce.lib.input.FileSplit) {
        matchedStats = getFsStatistics(((org.apache.hadoop.mapreduce.lib.input.FileSplit) split)
            .getPath(), taskContext.getConfiguration());
      }
      fsStats = matchedStats;

      long bytesInPrev = getInputBytes(fsStats);
      this.real = inputFormat.createRecordReader(split, taskContext); // see Analysis 3.1: real is the LineRecordReader
      long bytesInCurr = getInputBytes(fsStats);
      fileInputByteCounter.increment(bytesInCurr - bytesInPrev);
    }

Analysis 3.1 source:
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes); // return a LineRecordReader, the line-oriented record reader
  }
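
As the code above shows, the record delimiter comes from the textinputformat.record.delimiter property. A hedged sketch of setting a custom delimiter before submitting a job (the helper class is hypothetical; the property name is the one read in the source above):

import org.apache.hadoop.conf.Configuration;

// Illustrative: with this setting, createRecordReader() above builds a
// LineRecordReader that splits records on "||" instead of newlines.
public final class DelimiterConfig {
  public static void useCustomDelimiter(Configuration conf) {
    conf.set("textinputformat.record.delimiter", "||");
  }
}
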

Analysis 4 source:

 public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader, // the reader is the input object built in runNewMapper
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }
 /**
   * Get the input split for this map.
   */
  public InputSplit getInputSplit() {
    return split;
  }

  @Override
  public KEYIN getCurrentKey() throws IOException, InterruptedException {
    return reader.getCurrentKey(); // delegates to the input, i.e. the LineRecordReader it wraps
  }

  @Override
  public VALUEIN getCurrentValue() throws IOException, InterruptedException {
    return reader.getCurrentValue();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
  }

Analysis 5 source:

public void initialize(InputSplit genericSplit,
                         TaskAttemptContext context) throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    Configuration job = context.getConfiguration();
    this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
    start = split.getStart();           // start position of this split
    end = start + split.getLength();    // end position of this split
    final Path file = split.getPath();

    // open the file and seek to the start of the split
    final FileSystem fs = file.getFileSystem(job);
    fileIn = fs.open(file);
    
    CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
    if (null!=codec) {
      isCompressedInput = true;    
      decompressor = CodecPool.getDecompressor(codec);
      if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
      } else {
        in = new SplitLineReader(codec.createInputStream(fileIn,
            decompressor), job, this.recordDelimiterBytes);
        filePosition = fileIn;
      }
    } else {
      fileIn.seek(start); // each of the many mappers seeks to the start of its own split
      in = new UncompressedSplitLineReader(
          fileIn, job, this.recordDelimiterBytes, split.getLength());
      filePosition = fileIn;
    }
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) { // every split except the first
      start += in.readLine(new Text(), 0, maxBytesToConsume(start)); // during input initialization, a non-first split reads its first line into an anonymous throwaway Text and advances start past it; since every split except the last also reads one extra line beyond its end, records that straddle split boundaries are reassembled exactly once
    }
    this.pos = start;
  }
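
To build intuition for the "skip the first line of every split except the first" rule, here is a toy, plain-Java sketch. It is not Hadoop code; the data and the split size are made up, and it simply applies the same two rules to an in-memory byte array.

// Toy illustration (not Hadoop code) of LineRecordReader's boundary rules:
//   1) every split except the first discards its first line;
//   2) a split keeps reading whole lines while the line starts at or before 'end',
//      so the last line it reads may run past the split boundary.
public class SplitBoundaryDemo {

  public static void main(String[] args) {
    byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
    int splitSize = 10;                              // hypothetical split size in bytes

    for (int start = 0; start < data.length; start += splitSize) {
      int end = Math.min(start + splitSize, data.length);
      int pos = start;

      if (start != 0) {                              // rule 1: a non-first split skips a line,
        pos = skipLine(data, pos);                   // because the previous split already read it
      }

      StringBuilder lines = new StringBuilder();
      while (pos < data.length && pos <= end) {      // rule 2: the line starts at or before 'end'
        int next = skipLine(data, pos);
        lines.append(new String(data, pos, next - pos));
        pos = next;
      }
      System.out.println("split [" + start + ", " + end + ") reads: "
          + lines.toString().replace("\n", "\\n"));
    }
  }

  // advance past the next '\n' (or to the end of the data)
  private static int skipLine(byte[] data, int pos) {
    while (pos < data.length && data[pos++] != '\n') { }
    return pos;
  }
}

Every line ends up being read by exactly one split, which is the guarantee the real LineRecordReader relies on.
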
  

Analysis 6: this is simply the run method of Mapper.

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) { // see Analysis 6.1
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

Analysis 6.1: tracing further, what actually runs is the nextKeyValue method of LineRecordReader.

public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable(); // the key holds the byte offset of the line
    }
    key.set(pos); // current byte offset
    if (value == null) {
      value = new Text(); // default empty Text, refilled for every record
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
      if (pos == 0) {
        newSize = skipUtfByteOrderMark();
      } else {
        newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos)); // here the actual line is read into value
        pos += newSize;
      }

      if ((newSize == 0) || (newSize < maxLineLength)) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }
  @Override // nextKeyValue has already updated key and value, so these getters simply return the same objects; the caller receives them by reference
  public LongWritable getCurrentKey() {
    return key;
  }

  @Override
  public Text getCurrentValue() {
    return value;
  }
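
Because nextKeyValue refills the very same key and value objects on every call and the getters hand back references to them, a mapper that buffers its inputs must copy them. A hypothetical illustration (not from the post):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: the framework reuses one Text instance for every record,
// so buffering the reference itself would leave the list full of copies of the
// LAST line; new Text(value) takes a real copy.
public class BufferingMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final List<Text> buffered = new ArrayList<>();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    buffered.add(new Text(value)); // copy; do NOT just add(value)
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit everything at the end of the task to show the buffered copies are distinct
    for (Text line : buffered) {
      context.write(line, new IntWritable(1));
    }
  }
}
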

 


Continuously being updated... Everyone is welcome to follow my WeChat official account, LHWorld.


 

