1. Preface
Starting today we dissect the source code, beginning with the Client, because the Client takes on many important roles in the MapReduce process.
2. The MapReduce Framework Main Class
The code is as follows:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWC {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(true);

    // Create a new Job
    Job job = Job.getInstance(conf);
    // Job job = Job.getInstance();
    job.setJarByClass(MyWC.class);

    // Specify various job-specific parameters
    job.setJobName("myjob");
    // job.setInputPath(new Path("in"));
    // job.setOutputPath(new Path("out"));
    Path input = new Path("/user/root");
    FileInputFormat.addInputPath(job, input);

    Path output = new Path("/output/wordcount");
    // delete the output directory if it already exists
    if (output.getFileSystem(conf).exists(output)) {
      output.getFileSystem(conf).delete(output, true);
    }
    FileOutputFormat.setOutputPath(job, output);

    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    job.waitForCompletion(true);
  }
}
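The MyMapper and MyReducer classes referenced by this driver are not shown in the post; the following is a minimal word-count sketch of what they could look like (the class bodies are my assumption, not the author's code):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// emits (word, 1) for every whitespace-separated token in a line
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, one);
    }
  }
}

// sums the 1s for each word and emits (word, total)
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}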
Step 1: analyze Job. In the source you can see that Job is declared as public class Job extends JobContextImpl implements JobContext.
The JobContext interface in turn extends MRJobConfig, which defines a great many configuration constants.
Because the job is constructed from conf, these constants correspond directly to the property values in our configuration files.
Job job = Job.getInstance(conf);
Let's pick out a few important ones:
public static final int DEFAULT_MAP_MEMORY_MB = 1024; // default memory size for a map task
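These defaults can be overridden on the Configuration before the Job is built. A small sketch (assuming the driver's imports plus org.apache.hadoop.mapreduce.MRJobConfig; 2048 is just an illustrative value):

Configuration conf = new Configuration(true);
// MRJobConfig.MAP_MEMORY_MB is the constant behind "mapreduce.map.memory.mb"
conf.setInt(MRJobConfig.MAP_MEMORY_MB, 2048); // raise map task memory to 2 GB
Job job = Job.getInstance(conf);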
Step 2: analyze the submission process, job.waitForCompletion(true);. Tracing the source shows the work is mainly carried out by this method:
JobStatus submitJobInternal(Job job, Cluster cluster)
throws ClassNotFoundException, InterruptedException, IOException
The internal submission steps, per the method's Javadoc:
- Checking the input and output specifications of the job. (validate the input/output paths)
- Computing the InputSplits for the job. (compute the splits)
- Setup the requisite accounting information for the DistributedCache of the job, if necessary.
- Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
- Submitting the job to the JobTracker and optionally monitoring its status.
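For context, here is a lightly trimmed sketch of how waitForCompletion(true) reaches submitJobInternal, based on the Hadoop 2.x Job class (the branch that polls silently when verbose is false is omitted):

public boolean waitForCompletion(boolean verbose)
    throws IOException, InterruptedException, ClassNotFoundException {
  if (state == JobState.DEFINE) {
    // submit() connects to the cluster and ultimately calls
    // JobSubmitter.submitJobInternal(Job.this, cluster)
    submit();
  }
  if (verbose) {
    monitorAndPrintJob(); // poll and print progress until the job completes
  }
  return isSuccessful();
}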
Within this method, the key call to focus on is int maps = writeSplits(job, submitJobDir);
Tracing further, the concrete implementation is:
private <T extends InputSplit> int writeNewSplits(JobContext job, Path jobSubmitDir)
    throws IOException, InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}
Tracing job.getInputFormatClass() reveals the following code:
public Class<? extends InputFormat<?,?>> getInputFormatClass()
    throws ClassNotFoundException {
  // use the class from the user's configuration if one was set; otherwise
  // fall back to the default input format, TextInputFormat
  return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}
So the user's default input class is TextInputFormat, with the following inheritance chain:
TextInputFormat --> FileInputFormat --> InputFormat
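Because TextInputFormat is only a fallback default, it can be replaced per job. For example, KeyValueTextInputFormat (another FileInputFormat subclass shipped in org.apache.hadoop.mapreduce.lib.input) could be selected like this:

// after this call, job.getInputFormatClass() returns KeyValueTextInputFormat
// instead of falling back to TextInputFormat
job.setInputFormatClass(KeyValueTextInputFormat.class);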
Tracing List<InputSplit> splits = input.getSplits(job); yields the following source, the single most important method in this walkthrough!
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  // take the user-configured minimum if set, otherwise 1
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  // take the user-configured maximum if set, otherwise Long.MAX_VALUE
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath(); // the input file's path and size
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        // get the locations of all of the file's blocks
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        // compute the split size
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          // the argument is the split's offset; the return value is the
          // index of the block containing that offset
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          // look up that block's locations (including replicas) and hand
          // them to the split, so the split knows where to run:
          // computation moves to where the block's data lives
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits;
}
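One subtlety in the while loop deserves a worked example: SPLIT_SLOP is 1.1 in FileInputFormat, so the last split is allowed to be up to 10% larger than splitSize. A standalone sketch with illustrative numbers (a 260 MB file and 128 MB splits):

long splitSize = 128, bytesRemaining = 260; // MB rather than bytes, for readability
while (((double) bytesRemaining) / splitSize > 1.1) { // 260/128 = 2.03 > 1.1
  System.out.println("split of " + splitSize + " MB");
  bytesRemaining -= splitSize; // 260 -> 132
}
// 132/128 = 1.03 <= 1.1, so the loop stops: the 132 MB tail becomes one
// final split instead of a 128 MB split plus a wasteful 4 MB one
System.out.println("final split of " + bytesRemaining + " MB");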
1. long splitSize = computeSplitSize(blockSize, minSize, maxSize); tracing the source gives:
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
The split size therefore defaults to the block size!
To make the split size smaller than the block size, lower the configured maximum (maxSize) below the blockSize.
To make the split size larger than the block size, raise the configured minimum (minSize) above the blockSize.
These are set via FileInputFormat.setMaxInputSplitSize and FileInputFormat.setMinInputSplitSize respectively, as sketched below.
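A minimal sketch of both knobs (the sizes are illustrative; both static helpers live on org.apache.hadoop.mapreduce.lib.input.FileInputFormat):

// make splits SMALLER than a 128 MB block by capping the maximum split size
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB splits
// or make splits LARGER than the block by raising the minimum split size
FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB splits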
2. int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining); tracing the source gives:
protected int getBlockIndex(BlockLocation[] blkLocations, long offset) {
  for (int i = 0; i < blkLocations.length; i++) {
    // is the offset inside this block? the split's offset must be >= the
    // block's starting offset and < the block's end offset
    if ((blkLocations[i].getOffset() <= offset)
        && (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())) {
      return i; // return this block's index
    }
  }
  BlockLocation last = blkLocations[blkLocations.length - 1];
  long fileLength = last.getOffset() + last.getLength() - 1;
  throw new IllegalArgumentException("Offset " + offset +
      " is outside of file (0.." + fileLength + ")");
}
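A standalone walk-through of that logic with made-up numbers (this mimics getBlockIndex rather than calling it, since the real method is protected):

long blockLen = 128L << 20;                        // three 128 MB blocks
long[] blockOffsets = {0, 128L << 20, 256L << 20};
long splitOffset = 130L << 20;                     // a split starting at 130 MB
for (int i = 0; i < blockOffsets.length; i++) {
  if (blockOffsets[i] <= splitOffset
      && splitOffset < blockOffsets[i] + blockLen) {
    System.out.println("offset falls in block index " + i); // prints index 1
  }
}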
3. splits.add(makeSplit(path, length-bytesRemaining, splitSize, blkLocations[blkIndex].getHosts()
When a split is created, one split corresponds to one mapper task, so each split is built from four pieces of information: (path, offset, length, hosts), e.g. (path, 0, 10, host).
The hosts tell the framework where that mapper task can run, which is exactly "moving computation to the data"! Note that a split is logical: no data is physically cut, as shown in the sketch below.
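In FileInputFormat, makeSplit builds a FileSplit from exactly those four pieces of information. A sketch of the equivalent object (the path and host name here are hypothetical):

// a 10-byte logical slice of the file starting at offset 0, preferring
// to run on host1; nothing in HDFS is physically cut
FileSplit split = new FileSplit(
    new Path("/user/root/data.txt"), 0L, 10L, new String[] {"host1"});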
4. The getSplits method above ultimately produces a list of splits, and the size of that list is the number of mapper tasks! It is exactly the return value of the entry-point call int maps = writeSplits(job, submitJobDir);
5. When computation moves to the data, each mapper pulls only the part of the file that belongs to its own split.
Continuously updated... You are welcome to follow my WeChat official account, LHWorld.