MapReduce: Custom InputFormat

Posted by 孫晨c on 2020-07-19

Original URL: https://www.cnblogs.com/sunbr/p/13339671.html

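In enterprise development, the InputFormat types that ship with Hadoop cannot cover every use case, so a custom InputFormat is sometimes needed to solve a real problem.
The steps for writing a custom InputFormat are: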

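(1) Define a class that extends FileInputFormat.
(2) Define a class that extends RecordReader and reads one complete file at a time, using the file name as the key and the file content as the value.
(3) On output, use SequenceFileOutputFormat to write the merged file.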

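Both HDFS and MapReduce are very inefficient when handling small files, yet workloads with large numbers of small files are hard to avoid, so a solution is needed. One option is to merge the small files with a custom InputFormat.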

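1. Requirements

Merge multiple small files into a single SequenceFile (SequenceFile is the Hadoop file format for storing key-value pairs in binary form). The SequenceFile holds many files, each stored with its file path plus name as the key and its content as the value.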
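
To make the target layout concrete, here is a minimal plain-Java sketch of the same idea: several small files packed into one container as (name, bytes) records. It uses only the JDK and a simple length-prefixed layout invented for illustration. It is not the real SequenceFile format, and the class `SmallFilePacker` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: a length-prefixed key-value container,
// NOT the on-disk format that Hadoop's SequenceFile actually uses.
final class SmallFilePacker {

    // Packs every (file name -> content) entry into one byte array.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(files.size());               // number of records
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                out.writeUTF(e.getKey());             // key: file name
                out.writeInt(e.getValue().length);    // value length
                out.write(e.getValue());              // value: raw file bytes
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in the order they were written.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "world".getBytes(StandardCharsets.UTF_8));
        Map<String, byte[]> restored = unpack(pack(files));
        for (Map.Entry<String, byte[]> e : restored.entrySet()) {
            System.out.println(e.getKey() + " -> " + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

The point of the sketch is only the record shape: one key per original file, one opaque byte value per file, all inside a single larger file that HDFS and MapReduce can handle efficiently.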

(1) Input data
(2) Expected output file format

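2. Requirements analysis

Define a class that extends FileInputFormat:
(1) Override isSplitable() to return false so that a file cannot be split and each whole file becomes one split.
(2) Override createRecordReader() to return the custom RecordReader object.
Define a class that extends RecordReader:
nextKeyValue() is the most important method in a RecordReader. It reads the next key-value pair and returns true if one was read, in which case the framework calls the Mapper's map() to process it; otherwise it returns false.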
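
The contract described above can be sketched in plain Java without Hadoop on the classpath. The `SimpleRecordReader` interface and the `WholeValueReader` and `runMapper` names below are hypothetical stand-ins for illustration. The loop mirrors how Hadoop's `Mapper.run()` keeps calling `nextKeyValue()` and passes each pair to `map()` until it returns false.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's RecordReader (illustration only).
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue();   // advance to the next pair; false when exhausted
    K getCurrentKey();
    V getCurrentValue();
}

// Emits exactly one record: the whole "file" as a single value.
final class WholeValueReader implements SimpleRecordReader<String, byte[]> {
    private final String name;
    private final byte[] content;
    private boolean processed = false;

    WholeValueReader(String name, byte[] content) {
        this.name = name;
        this.content = content;
    }

    @Override
    public boolean nextKeyValue() {
        if (processed) {
            return false;      // second and later calls: nothing left to read
        }
        processed = true;      // first call: the single record is now current
        return true;
    }

    @Override
    public String getCurrentKey() { return name; }

    @Override
    public byte[] getCurrentValue() { return content; }
}

final class ReaderLoopDemo {

    // Mirrors the shape of the framework loop: one map() call per record.
    static List<String> runMapper(SimpleRecordReader<String, byte[]> reader) {
        List<String> mapCalls = new ArrayList<>();
        while (reader.nextKeyValue()) {
            // Stand-in for map(key, value, context).
            mapCalls.add(reader.getCurrentKey() + ":" + reader.getCurrentValue().length);
        }
        return mapCalls;
    }

    public static void main(String[] args) {
        SimpleRecordReader<String, byte[]> reader =
                new WholeValueReader("a.txt", new byte[] {1, 2, 3});
        System.out.println(runMapper(reader));   // prints [a.txt:3]
    }
}
```

Because the reader returns true exactly once, map() runs once per file, which is the behavior the whole-file RecordReader needs.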

3. Writing the program

MyInputFormat.java

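import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/*
 * 1. Changes the split strategy: each file is always exactly one split, because files are marked as not splittable.
 * 
 * 2. Provides the RecordReader (RR), which uses the split's file name as the key and wraps the split's content in bytes as the value.
 */
public class MyInputFormat extends FileInputFormat<Text, BytesWritable> {

	@Override
	public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
		return new MyRecordReader();
	}
	
	@Override
	protected boolean isSplitable(JobContext context, Path filename) {
		return false;
	}
}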

MyRecordReader.java

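import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/*
 * A RecordReader reads data from the split that the current MapTask is processing.
 * 
 * Every XXXContext is a Job context; the Job's Configuration object can be obtained from it.
 */
public class MyRecordReader extends RecordReader<Text, BytesWritable> {
	
	private Text key;
	private BytesWritable value;
	
	private String filename;
	private int length;
	
	private FileSystem fs;
	private Path path;
	
	private FSDataInputStream is;
	
	// true until the single record of this split has been read
	private boolean flag=true;

	// Called automatically after MyRecordReader is created and before Mapper.run() starts.
	// The whole file is one split, so the split length equals the file length.
	@Override
	public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {

		FileSplit fileSplit=(FileSplit) split;
		
		// File name only; use fileSplit.getPath().toString() if the key should include the path
		filename=fileSplit.getPath().getName();
		
		// Assumes the file is smaller than 2 GB so that it fits in one byte array
		length=(int) fileSplit.getLength();
		
		path=fileSplit.getPath();
		
		// Get the configuration object of the current Job
		Configuration conf = context.getConfiguration();
		
		// Get the file system that owns this path
		fs=path.getFileSystem(conf);
		
		is = fs.open(path);
		
	}

	// Reads one input key-value pair; returns true if a pair was read, otherwise false.
	// The first call wraps the file name in the key and the file content in a BytesWritable value, then returns true.
	// The second call to nextKeyValue() returns false.
	@Override
	public boolean nextKeyValue() throws IOException, InterruptedException {
		
		if (flag) {
			
			// Instantiate the key and value objects
			if (key==null) {
				key=new Text();
			}
			
			if (value==null) {
				value=new BytesWritable();
			}
			
			// Assign values
			// Wrap the file name in the key
			key.set(filename);
			
			// Read the file content into the BytesWritable
			byte [] content=new byte[length];
			
			IOUtils.readFully(is, content, 0, length);
			
			value.set(content, 0, length);
			
			flag=false;
			
			return true;
			
		}
		return false;
	}

	// Returns the key of the key-value pair that was just read
	@Override
	public Text getCurrentKey() throws IOException, InterruptedException {
		return key;
	}

	// Returns the value of the key-value pair that was just read
	@Override
	public BytesWritable getCurrentValue() throws IOException, InterruptedException {
		return value;
	}

	// Returns the progress through the split: 0 before the record is read, 1 afterwards
	@Override
	public float getProgress() throws IOException, InterruptedException {
		return flag ? 0f : 1f;
	}

	// Called when the Mapper's input is closed; performs cleanup
	@Override
	public void close() throws IOException {
		if (is != null) {
			IOUtils.closeStream(is);
		}
		// fs is deliberately not closed: FileSystem instances are cached and shared,
		// so closing one here could break other users of it in the same JVM.
	}
}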
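
One detail in initialize() deserves care: the split length is a `long`, but a Java array is indexed by `int`, so a plain `(int)` cast would silently wrap for very large files. The sketch below shows one possible guard. `SplitLengths.toBufferSize` is a hypothetical helper, not part of Hadoop or of the original code.

```java
// Hypothetical helper (not part of Hadoop): converts a split length to a
// buffer size, failing loudly instead of silently wrapping on overflow.
final class SplitLengths {

    static int toBufferSize(long splitLength) {
        if (splitLength < 0 || splitLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                    "File of " + splitLength + " bytes cannot be read into one byte array");
        }
        return (int) splitLength;
    }

    public static void main(String[] args) {
        System.out.println(toBufferSize(1024L));          // prints 1024
        long threeGiB = 3L * 1024 * 1024 * 1024;
        System.out.println((int) threeGiB);               // plain cast wraps: prints -1073741824
        try {
            toBufferSize(threeGiB);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Since this approach targets small files, the guard should rarely trigger; it mainly turns an obscure negative-array-size failure into a clear error message.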

CustomIFMapper.java

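import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Identity mapper: the inherited map() writes each input pair through unchanged.
public class CustomIFMapper extends Mapper<Text, BytesWritable, Text, BytesWritable>{
}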

CustomIFReducer.java

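import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Identity reducer: the inherited reduce() writes every value out with its key.
public class CustomIFReducer extends Reducer<Text, BytesWritable, Text, BytesWritable>{
}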

CustomIFDriver.java

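import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CustomIFDriver {
	
	public static void main(String[] args) throws Exception {
		
		Path inputPath=new Path("e:/mrinput/custom");
		Path outputPath=new Path("e:/mroutput/custom");
		
		// Configuration for the whole Job
		Configuration conf = new Configuration();
		// Make sure the output directory does not exist
		FileSystem fs=FileSystem.get(conf);
		
		if (fs.exists(outputPath)) {
			fs.delete(outputPath, true);
		}
		
		// Create the Job
		Job job = Job.getInstance(conf);
		// Lets Hadoop locate the jar containing this class when the Job runs on a cluster
		job.setJarByClass(CustomIFDriver.class);

		// Set the Mapper and Reducer classes that the Job runs
		job.setMapperClass(CustomIFMapper.class);
		job.setReducerClass(CustomIFReducer.class);
		
		// The Job prepares serializers based on the key-value types output by the Mapper and Reducer,
		// and uses them to serialize and deserialize the output key-value pairs.
		// When the Mapper and Reducer output the same key-value types, setting the Job's final output types is enough.
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(BytesWritable.class);
		
		// Set the input and output directories
		FileInputFormat.setInputPaths(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		
		// Set the input and output formats
		job.setInputFormatClass(MyInputFormat.class);
		job.setOutputFormatClass(SequenceFileOutputFormat.class);
		
		// Run the Job and report success or failure through the exit code
		System.exit(job.waitForCompletion(true) ? 0 : 1);
			
	}
}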
