A MapReduce Program Example, Details Decide Success or Failure (Part 6): CombineFileInputFormat
Hadoop's MapReduce framework is designed for large files, but in practice you will inevitably run into large numbers of small files, as with our character-count job.
Our input is three small files, so each file produces at least one split, and each split in turn produces one map task. For such a small amount of data, the cost of starting one JVM per task is disproportionately high.
This is where CombineFileInputFormat comes in: it merges small input files into combined splits.
In this experiment we use a custom InputFormat to reduce the number of mapper tasks.
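(A hedged aside: if plain line records are all you need, newer Hadoop 2.x releases also ship a stock org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, added around Hadoop 2.1, that does this out of the box. We build our own here because it doubles as a template for the custom record formats shown in the next post. Assuming your Hadoop version bundles it, the one-liner would be:)

// Alternative, assuming CombineTextInputFormat is available in your Hadoop version:
job.setInputFormatClass(CombineTextInputFormat.class);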
An InputFormat describes the input format of an MR job. It serves the following purposes:
- Validate that the input files are well formed.
- Divide the input into logical InputSplit instances, each of which is sent to a different Mapper (this is the focus of this post and the place we optimize: too many InputSplits mean too many map tasks).
- Provide a RecordReader that extracts key/value pairs from an InputSplit instance for the mapper to consume.
By default, file-based InputFormats divide InputSplits according to the size of the input files. Normally the maximum InputSplit size is the block size of the distributed file system (128 MB by default; 64 MB in earlier versions), and the minimum InputSplit size can be set with mapreduce.input.fileinputformat.split.minsize.
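As a minimal sketch of setting both bounds in code (the property names match the deprecation notices visible in the run log further down; `job` is assumed to be a mapreduce Job instance):

Configuration conf = job.getConfiguration();
// Lower bound on split size, in bytes; splits are not made smaller than this.
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 1L);
// Upper bound on split size, in bytes: 64 MB here.
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
// FileInputFormat also exposes equivalent static setters:
// FileInputFormat.setMinInputSplitSize(job, ...) and setMaxInputSplitSize(job, ...)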
An InputSplit describes which chunk of data is sent to which mapper. Normally an InputSplit presents a byte-oriented view of the data; parsing those bytes is the job of the RecordReader discussed next.
The RecordReader is responsible for parsing <key, value> pairs out of an InputSplit for the map function to consume.
The two central functions of an InputFormat are getSplits and createRecordReader. getSplits is already implemented in FileInputFormat, the base class that essentially all file-based InputFormats extend, so our attention goes to the RecordReader.
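For reference, this is the shape of the abstraction in the org.apache.hadoop.mapreduce API (a sketch based on the Hadoop 2 javadoc):

public abstract class InputFormat<K, V> {
    // Logically split the job's input files into InputSplits.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;
    // Create a reader that turns one split into <K, V> records.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}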
Let's go straight to the code.
Pay attention to the following:
- MyCombinedFilesInputFormat extends CombineFileInputFormat.
- createRecordReader returns a CombineFileRecordReader, whose constructor is given our custom RecordReader class.
- MyRecordReader extends RecordReader.
- Note how the split is handled in the initialize method.
package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyCombinedFilesInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the combined split
        // and creates one MyRecordReader per file.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, MyRecordReader.class);
    }

    public static class MyRecordReader extends RecordReader<LongWritable, Text> {
        // Index of the file (within the CombineFileSplit) this reader handles.
        private Integer index;
        // Delegate that does the actual line-by-line reading.
        private LineRecordReader reader;

        public MyRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.index = index;
            reader = new LineRecordReader();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Carve the single file at `index` out of the combined split and
            // hand it to the wrapped LineRecordReader as an ordinary FileSplit.
            CombineFileSplit cfsplit = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                    cfsplit.getOffset(index),
                    cfsplit.getLength(index),
                    cfsplit.getLocations());
            reader.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return reader.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return reader.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return reader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return reader.getProgress();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}
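A note on how this fits together: CombineFileRecordReader is a generic wrapper that walks the files packed into a CombineFileSplit and reflectively constructs the reader class it was given for each of them, which is why MyRecordReader must provide a constructor with the signature (CombineFileSplit, TaskAttemptContext, Integer). Each MyRecordReader instance then only ever deals with the single file at its index.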
Now let's look at the job configuration.
Note these two lines:
- job.setInputFormatClass(MyCombinedFilesInputFormat.class);
- MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024*1024*64);
@Override
public int run(String[] args) throws Exception {
    // Validate the parameters.
    if (args.length != 2) {
        return -1;
    }

    Job job = Job.getInstance(getConf(), "MyWordCountJob");
    job.setJarByClass(MyWordCountJob.class);

    Path inPath = new Path(args[0]);
    Path outPath = new Path(args[1]);

    // Remove the output directory if it already exists.
    outPath.getFileSystem(getConf()).delete(outPath, true);
    TextInputFormat.setInputPaths(job, inPath);
    TextOutputFormat.setOutputPath(job, outPath);

    job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
    job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);

    // Use the combining input format and cap each combined split at 64 MB.
    job.setInputFormatClass(MyCombinedFilesInputFormat.class);
    MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
}
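The driver's main method is not shown in this post; a minimal sketch, assuming MyWordCountJob extends Configured and implements Tool as in the earlier parts of this series:

public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyWordCountJob(), args));
}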
As the run log shows, the input is three small files ("Total input paths to process : 3") but only one map task is launched ("number of splits:1").
[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ output
16/05/12 11:12:48 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:49 INFO input.FileInputFormat: Total input paths to process : 3
16/05/12 11:12:49 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 157
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: number of splits:1
16/05/12 11:12:49 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0104
16/05/12 11:12:50 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0104 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:50 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0104/
16/05/12 11:12:50 INFO mapreduce.Job: Running job: job_1462517728035_0104
16/05/12 11:12:58 INFO mapreduce.Job: Job job_1462517728035_0104 running in uber mode : false
16/05/12 11:12:58 INFO mapreduce.Job:  map 0% reduce 0%
16/05/12 11:13:10 INFO mapreduce.Job:  map 100% reduce 0%
16/05/12 11:13:17 INFO mapreduce.Job:  map 100% reduce 100%
16/05/12 11:13:18 INFO mapreduce.Job: Job job_1462517728035_0104 completed successfully
16/05/12 11:13:18 INFO mapreduce.Job: Counters: 43
	File System Counters
		FILE: Number of bytes read=1198
		FILE: Number of bytes written=170905
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=487
		HDFS: Number of bytes written=108
		HDFS: Number of read operations=8
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Other local map tasks=1
		Total time spent by all maps in occupied slots (ms)=71600
		Total time spent by all reduces in occupied slots (ms)=33720
	Map-Reduce Framework
		Map input records=8
		Map output records=149
		Map output bytes=894
		Map output materialized bytes=1198
		Input split bytes=330
		Combine input records=0
		Combine output records=0
		Reduce input groups=26
		Reduce shuffle bytes=1198
		Reduce input records=149
		Reduce output records=26
		Spilled Records=298
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=54
		CPU time spent (ms)=1850
		Physical memory (bytes) snapshot=445575168
		Virtual memory (bytes) snapshot=1995010048
		Total committed heap usage (bytes)=345636864
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=108
MyRecordReader is kept deliberately simple in this post; the next post will show how to use a custom RecordReader to feed custom key/value pairs into the map function.
From the ITPUB blog: http://blog.itpub.net/30066956/viewspace-2109215/ (please credit the source when reposting).