Hadoop Learning — Client Source Code Analysis

Posted by Hiway on 2019-04-06

More Java fundamentals articles:
Java Basics (table of contents)


Let's start from the client-side entry point, job.waitForCompletion(true):

public static void main(String[] args) throws Exception{
        Configuration configuration = new Configuration();
        configuration.set("hbase.zookeeper.quorum",Hconfiguration.hbase_zookeeper_quorum);
        configuration.set("hbase.zookeeper.clientPort","2181");
        configuration.set(TableOutputFormat.OUTPUT_TABLE,Hconfiguration.tableName);
        configuration.set("dfs.socket.timeout","1800000");

        MRDriver myDriver = MRDriver.getInstance();

        /*try{
            myDriver.creatTable(Hconfiguration.tableName, Hconfiguration.colFamily);
        }catch (Exception e){
            e.printStackTrace();
        }*/

        Job job = new Job(configuration,"Map+ReduceImport");
        job.setMapperClass(HMapper.class);
        job.setReducerClass(HReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TableOutputFormat.class);

        FileInputFormat.setInputPaths(job, Hconfiguration.mapreduce_inputPath);
        job.waitForCompletion(true);
    }
public boolean waitForCompletion(boolean verbose)
        throws IOException, InterruptedException, ClassNotFoundException {
    if (state == JobState.DEFINE) {
        // The key step: submitting the job
        submit();
    }
    if (verbose) {
        // Monitor the job and print its progress
        monitorAndPrintJob();
    } else {
        ……
    }
    return isSuccessful();
}

Stepping into submit(): it calls submitJobInternal() to submit the job to the cluster.

public void submit() throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    // Decide whether the old (1.x) or new (2.x) MapReduce API is in use
    setUseNewAPI();
    // Connect to the cluster
    connect();
    final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
        public JobStatus run() throws IOException, InterruptedException, ClassNotFoundException {
            // Submit the job to the cluster
            return submitter.submitJobInternal(Job.this, cluster);
        }
    });
    ……
}

A closer look at submitJobInternal():

JobStatus submitJobInternal(Job job, Cluster cluster)
        throws ClassNotFoundException, InterruptedException, IOException {
    // validate the jobs output specs
    // Check the job's input and output specifications (paths)
    checkSpecs(job);
    Configuration conf = job.getConfiguration();
    addMRFrameworkToDistributedCache(conf);
    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, conf);
    // configure the command line options correctly on the submitting dfs
    InetAddress ip = InetAddress.getLocalHost();
    if (ip != null) {
        // Record the submitting host's info
        submitHostAddress = ip.getHostAddress();
        submitHostName = ip.getHostName();
        conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
        conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
    }
    JobID jobId = submitClient.getNewJobID();
    job.setJobID(jobId);
    // The job's submit directory under the staging area
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    JobStatus status = null;
    ……
    // Copy the configuration files
    copyAndConfigureFiles(job, submitJobDir);
    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);
    // Create the splits for the job
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    // Compute the input splits
    int maps = writeSplits(job, submitJobDir);

    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);
    ……
    // Write job file to submit dir
    writeConf(conf, submitJobFile);
    // Now, actually submit the job (using the submit name)
    printTokens(jobId, job.getCredentials());
    // Everything above was preparation; this call actually submits the job
    status = submitClient.submitJob(
            jobId, submitJobDir.toString(), job.getCredentials());
    ……
}

writeSplits() calls writeNewSplits():

private int writeSplits(org.apache.hadoop.mapreduce.JobContext job, Path jobSubmitDir)
        throws IOException, InterruptedException, ClassNotFoundException {
    JobConf jConf = (JobConf) job.getConfiguration();
    int maps;
    // Take the new-API (2.x) or old-API (1.x) path, based on the flag set earlier
    if (jConf.getUseNewMapper()) {
        maps = writeNewSplits(job, jobSubmitDir);
    } else {
        maps = writeOldSplits(jConf, jobSubmitDir);
    }
    return maps;
}

Let's follow writeNewSplits(job, jobSubmitDir):

private <T extends InputSplit> int writeNewSplits(JobContext job, Path jobSubmitDir)
        throws IOException, InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    // Instantiate the InputFormat via reflection
    InputFormat<?, ?> input = ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
    List<InputSplit> splits = input.getSplits(job);
    ……
}

Two methods here deserve a closer look: getInputFormatClass() and getSplits(job). First, getInputFormatClass():

public Class<? extends InputFormat<?,?>> getInputFormatClass() throws ClassNotFoundException {
    // If the user set an InputFormat via job.setInputFormatClass(cls), use it;
    // otherwise fall back to the default, TextInputFormat
    return (Class<? extends InputFormat<?,?>>)
        conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
}

Next, getSplits(job). This method is crucial:

public List<InputSplit> getSplits(JobContext job) throws IOException {
    Stopwatch sw = new Stopwatch().start();
    // 1 unless the user overrides it
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    /*
    protected long getFormatMinSplitSize() {
        return 1;
    }
    public static long getMinSplitSize(JobContext job) {
        // If the user set it via FileInputFormat.setMinInputSplitSize(job, size),
        // use that value; otherwise 1
        return job.getConfiguration().getLong(SPLIT_MINSIZE, 1L);
    }
    */
    long maxSize = getMaxSplitSize(job);
    /*
    // If the user set it, use that value; otherwise effectively unbounded
    public static long getMaxSplitSize(JobContext context) {
        return context.getConfiguration().getLong(SPLIT_MAXSIZE, Long.MAX_VALUE);
    }
    */
    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    // Iterate over every file under the input paths and
    // fetch each file's BlockLocations
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                blkLocations = fs.getFileBlockLocations(file, 0, length);
            }
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);
                /*
                With no user intervention:
                min(maxSize, blockSize) is blockSize by default, and
                max(minSize, blockSize) is still blockSize,
                so splitSize defaults to blockSize.
                protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
                    return Math.max(minSize, Math.min(maxSize, blockSize));
                }
                */
                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    // Find which block this split's offset falls in
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    /*
                    protected int getBlockIndex(BlockLocation[] blkLocations, long offset) {
                        // Determine which block's offset range contains this offset
                        for (int i = 0; i < blkLocations.length; i++) {
                            // is the offset inside this block?
                            if ((blkLocations[i].getOffset() <= offset)
                                    && (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())) {
                                return i;
                            }
                        }
                        BlockLocation last = blkLocations[blkLocations.length - 1];
                        long fileLength = last.getOffset() + last.getLength() - 1;
                        throw new IllegalArgumentException("Offset " + offset
                            + " is outside of file (0.." + fileLength + ")");
                    }
                    */
                    splits.add(makeSplit(path,
                            length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                    bytesRemaining -= splitSize;
                }
                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    // Create the split; a split records the file name,
                    // offset, length, and host locations
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                }
            } else { // not splitable
                ……
            }
        } else {
            // Create empty hosts array for zero length files
            ……
        }
    }
    ……
    return splits;
}
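The splitting loop above reduces to simple arithmetic. Here is a minimal, self-contained sketch (not Hadoop's actual class) that reproduces the split math with the same formulas and SPLIT_SLOP = 1.1; the 300 MB file and 128 MB block size are made-up example values:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // A split may run up to 10% over splitSize before another cut is made
    static final double SPLIT_SLOP = 1.1;

    // Same formula as FileInputFormat.computeSplitSize
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns each split as long[]{offset, length}
    static List<long[]> splitFile(long length, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Cut full-size splits while the remainder exceeds splitSize by more than 10%
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[]{length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // Whatever is left becomes the final (possibly smaller or slightly larger) split
        if (bytesRemaining != 0) {
            splits.add(new long[]{length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Defaults: minSize = 1, maxSize = Long.MAX_VALUE, so splitSize == blockSize
        List<long[]> splits = splitFile(300 * mb, 128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : splits) {
            System.out.println("offset=" + (s[0] / mb) + "MB length=" + (s[1] / mb) + "MB");
        }
    }
}
```

With the defaults, a 300 MB file on 128 MB blocks yields three splits of 128 MB, 128 MB, and 44 MB; note that because of SPLIT_SLOP, the final split can be up to 1.1 × splitSize.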

To sum up, the client does the following:

  1. Completes the job configuration
  2. Checks the input/output paths
  3. Computes the splits (which determine the number of map tasks)
  4. Uploads the job resources to HDFS
  5. Submits the job

The ApplicationMaster then requests resources from the ResourceManager according to the split list; the ResourceManager allocates containers, and the ApplicationMaster launches them and runs the map and reduce tasks inside.

Summary diagram

(figure: the client job-submission flow)
The default map task memory is 1 GB (tunable):

public static final String MAP_MEMORY_MB = "mapreduce.map.memory.mb";
public static final int DEFAULT_MAP_MEMORY_MB = 1024;

The default reduce task memory is also 1 GB, which is usually too small and worth raising:

public static final String REDUCE_MEMORY_MB = "mapreduce.reduce.memory.mb";
public static final int DEFAULT_REDUCE_MEMORY_MB = 1024;
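These properties can be overridden cluster-wide in mapred-site.xml. A sketch with illustrative values (2048/4096 are examples, not recommendations — size them to fit your YARN container limits):

```xml
<!-- mapred-site.xml: example per-task memory overrides -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
```

The same keys can also be set per job on the job's Configuration before submission.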
