MapReduce: Sorting with WritableComparable

Posted by 孫晨c on 2020-07-29

Original URL: https://www.cnblogs.com/sunbr/p/13398499.html

Sorting overview

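Sorting is one of the most important operations in the MapReduce framework.
Both MapTask and ReduceTask sort data by key by default. This is built-in Hadoop behavior: the data in every application is sorted, whether or not the logic requires it.
By default, keys are sorted in dictionary order, and the sort is implemented with quicksort.
A MapTask places its output in an in-memory buffer. When buffer usage reaches a certain threshold, the data in the buffer is sorted and the sorted run is written to disk. Once all input has been processed, the MapTask merges all the files on disk into a single large sorted file.
A ReduceTask remotely copies the relevant data files from each MapTask. A file whose size exceeds a certain threshold is placed on disk; otherwise it is kept in memory. When the number of files on disk reaches a certain threshold, they are merged into one larger file. When the size or the number of files in memory exceeds a certain threshold, they are merged and the result is written to disk. After all data has been copied, the ReduceTask performs one merge sort over everything in memory and on disk.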
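The "sort each buffer, then merge the sorted runs" flow described above can be sketched in plain Java. This is a toy model and not Hadoop code: the class name `SpillMergeSketch`, the threshold counted in records, and the in-memory lists standing in for spill files are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of "sort each spill, then merge"; it does not use any Hadoop classes. */
class SpillMergeSketch {

    /** Buffers keys and, whenever the buffer reaches the threshold, sorts it and "spills" it. */
    static List<List<Long>> spill(List<Long> keys, int threshold) {
        List<List<Long>> spills = new ArrayList<>();
        List<Long> buffer = new ArrayList<>();
        for (long key : keys) {
            buffer.add(key);
            if (buffer.size() >= threshold) {
                Collections.sort(buffer);      // sort the buffer before spilling it
                spills.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) {               // final flush once the input is exhausted
            Collections.sort(buffer);
            spills.add(buffer);
        }
        return spills;
    }

    /** K-way merge of already sorted runs into one sorted list. */
    static List<Long> merge(List<List<Long>> runs) {
        // Each heap entry is {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new long[] {runs.get(i).get(0), i, 0});
            }
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] head = heap.poll();
            merged.add(head[0]);
            int run = (int) head[1];
            int next = (int) head[2] + 1;
            if (next < runs.get(run).size()) {
                heap.add(new long[] {runs.get(run).get(next), run, next});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Long>> spills = spill(List.of(42L, 7L, 19L, 3L, 88L, 7L, 1L), 3);
        System.out.println(spills);            // [[7, 19, 42], [3, 7, 88], [1]]
        System.out.println(merge(spills));     // [1, 3, 7, 7, 19, 42, 88]
    }
}
```

The merge step only ever compares the heads of the runs, which is why sorting each spill first makes the final pass cheap. The reduce side follows the same idea, merging the sorted files it copied from the MapTasks.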
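Sorter: the sorter affects the speed (efficiency) of the sort, that is, which algorithm does the work, for example QuickSorter.
Comparator: the comparator affects the result of the sort, that is, the rule by which keys are ordered.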
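To make the sorter/comparator distinction concrete, here is a small sketch in plain Java, with no Hadoop involved. The sorting routine stays the same in both calls and only the comparator changes, which is enough to change the result. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class ComparatorVsSorterSketch {

    /** Sorts a copy of the keys with the same routine every time; only the comparator varies. */
    static String[] sortWith(String[] keys, Comparator<String> comparator) {
        String[] copy = keys.clone();
        Arrays.sort(copy, comparator);
        return copy;
    }

    public static void main(String[] args) {
        String[] keys = {"10", "9", "100", "2"};
        // Dictionary order: [10, 100, 2, 9]
        System.out.println(Arrays.toString(sortWith(keys, Comparator.naturalOrder())));
        // Numeric order: [2, 9, 10, 100]
        System.out.println(Arrays.toString(sortWith(keys, Comparator.comparingLong(Long::parseLong))));
    }
}
```

Swapping the algorithm inside `sortWith` would change how long the sort takes but not either output, which is the role the sorter plays.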

Getting the comparator for the Mapper output key (source code)

public RawComparator getOutputKeyComparator() {

    // Read mapreduce.job.output.key.comparator.class from the configuration.
    // The value must be a RawComparator type; if it is not configured, the default is null.
    Class<? extends RawComparator> theClass = getClass(JobContext.KEY_COMPARATOR, null, RawComparator.class);

    // If the user configured this parameter, instantiate the user-defined comparator.
    if (theClass != null) {
      return ReflectionUtils.newInstance(theClass, this);
    }

    // Nothing was configured: check whether the Mapper output key type is a subclass of WritableComparable.
    // If it is not, an exception is thrown; if it is, the framework supplies a comparator for the key automatically.
    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this);
}
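The decision order in the method above (a configured comparator wins, otherwise the key type itself must be comparable) can be mimicked with the standard library alone. The sketch below is an analogy, not the Hadoop implementation: `java.util.Comparator` stands in for `RawComparator`, `Comparable` stands in for `WritableComparable`, and the `Descending` class is a hypothetical user-supplied comparator.

```java
import java.util.Comparator;

class ComparatorLookupSketch {

    /** Hypothetical user-defined comparator, standing in for a configured comparator class. */
    static class Descending implements Comparator<Long> {
        @Override
        public int compare(Long a, Long b) {
            return Long.compare(b, a);
        }
    }

    /**
     * Same decision order as getOutputKeyComparator(): use the configured class if there is one,
     * otherwise require the key class to be Comparable and fall back to its natural order.
     */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static Comparator<Object> lookup(Class<? extends Comparator> configured, Class<?> keyClass) {
        if (configured != null) {
            try {
                return configured.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot instantiate " + configured, e);
            }
        }
        keyClass.asSubclass(Comparable.class);   // throws ClassCastException if the key is not Comparable
        return (a, b) -> ((Comparable) a).compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(lookup(Descending.class, Long.class).compare(1L, 2L) > 0);  // true
        System.out.println(lookup(null, Long.class).compare(1L, 2L) < 0);              // true
        try {
            lookup(null, Object.class);
        } catch (ClassCastException e) {
            System.out.println("key class is not Comparable");
        }
    }
}
```

The practical consequence for a job is the same as in the sketch: a key type that is not a `WritableComparable` fails at this lookup unless a comparator class has been configured explicitly.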

Hands-on example (sorting within a partition)

Requirement
Sort the phone-number records, within each partition, by total traffic (upstream plus downstream).

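Approach
Because both MapTask and ReduceTask sort data by key by default, the total traffic has to be the key, and the phone number and the remaining fields go into the value.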
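Before reading the Hadoop classes, it may help to see the same idea run locally. The sketch below imitates the three phases in plain Java: emit (total, rest of line) pairs, sort by key, then swap key and value back on output. The input layout (phone number, upstream, downstream, total, separated by tabs) follows the Mapper in this article; the sample phone numbers and traffic values are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SortByTotalFlowSketch {

    /** Returns the lines reordered by ascending total traffic (the fourth tab-separated field). */
    static List<String> sortByTotal(List<String> lines) {
        // "Map" phase: the total becomes the key, the other fields become the value.
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] words = line.split("\t");
            pairs.add(Map.entry(Long.parseLong(words[3]),
                    words[0] + "\t" + words[1] + "\t" + words[2]));
        }
        // "Shuffle" phase: the framework sorts by key; a plain list sort stands in for it here.
        pairs.sort(Map.Entry.comparingByKey());
        // "Reduce" phase: write the value first and the key last.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> pair : pairs) {
            out.add(pair.getValue() + "\t" + pair.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "13500000001\t100\t200\t300",      // made-up sample records
                "13500000002\t10\t20\t30",
                "13500000003\t500\t500\t1000");
        sortByTotal(lines).forEach(System.out::println);
    }
}
```

With the sample records, the output lists the phone numbers ending in 2, 1 and 3, in that order, because their totals are 30, 300 and 1000.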

FlowBeanMapper.java

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowBeanMapper extends Mapper<LongWritable, Text, LongWritable, Text>{
	
	private LongWritable out_key=new LongWritable();
	private Text out_value=new Text();
	
	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String[] words = value.toString().split("\t");
		
		// Use the total traffic as the key
		out_key.set(Long.parseLong(words[3]));// after splitting, the total traffic is at index 3
		
		// Pack the remaining fields into the value
		out_value.set(words[0]+"\t"+words[1]+"\t"+words[2]);
		
		context.write(out_key, out_value);
	}

}

FlowBeanReducer.java

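import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowBeanReducer extends Reducer<LongWritable, Text, Text, LongWritable>{
	
	@Override
	protected void reduce(LongWritable key, Iterable<Text> values,
			Reducer<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
		
		// Keys arrive already sorted; swap key and value back so each output line ends with the total traffic
		for (Text value : values) {
			context.write(value, key);
		}
	}
	
}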

FlowBeanDriver.java

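import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowBeanDriver {
	
	public static void main(String[] args) throws Exception {
		
		Path inputPath=new Path("E:\\mroutput\\flowbean");
		Path outputPath=new Path("e:/mroutput/flowbeanSort1");
		
		// Configuration for the whole Job
		Configuration conf = new Configuration();
		
		// Make sure the output directory does not exist
		FileSystem fs=FileSystem.get(conf);
		
		if (fs.exists(outputPath)) {
			fs.delete(outputPath, true);
		}
		
		// ① Create the Job
		Job job = Job.getInstance(conf);
		
		// ② Configure the Job
		// Set the Mapper and Reducer classes and the key-value types they output
		job.setMapperClass(FlowBeanMapper.class);
		job.setReducerClass(FlowBeanReducer.class);
		
		// The Job prepares serializers based on the key-value types output by the Mapper and the Reducer,
		// and uses them to serialize and deserialize the output key-value pairs.
		// If the Mapper and the Reducer output the same key-value types, setting the final output types is enough.
		// Here they differ (the Mapper outputs LongWritable-Text, the Reducer outputs Text-LongWritable),
		// so the Mapper output types must be set as well.
		job.setMapOutputKeyClass(LongWritable.class);
		job.setMapOutputValueClass(Text.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		
		// Set the input and output directories
		FileInputFormat.setInputPaths(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		
		// Keys are sorted in ascending order by default; a custom comparator can be set instead
		//job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
		
		// ③ Run the Job
		job.waitForCompletion(true);
			
	}
}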

Result (ascending order by default)

Custom comparator: sorting in descending order

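Method 1: define a custom class that is a RawComparator, and register it by setting mapreduce.job.output.key.comparator.class to that class.
The custom class can either extend WritableComparator or implement RawComparator directly.
At sort time, the framework first calls compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2). The default WritableComparator implementation of that method deserializes both keys and then calls the object-level compare(), which in turn delegates to the key's compareTo().
Method 2: define the Mapper output key type yourself, have it implement WritableComparable, and implement compareTo().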
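The article only shows code for Method 1, so here is a rough sketch of what a Method 2 key might look like. The class name `FlowKey` and its single field are hypothetical. So that the sketch runs without Hadoop on the classpath, it implements only `Comparable`; in a real job the class would implement `org.apache.hadoop.io.WritableComparable<FlowKey>`, whose `write(DataOutput)` and `readFields(DataInput)` methods have the same signatures as the ones below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical key type; a stand-in for a class implementing WritableComparable<FlowKey>. */
class FlowKey implements Comparable<FlowKey> {
    private long sumFlow;

    FlowKey() { }                                   // keys are created reflectively, so keep a no-arg constructor
    FlowKey(long sumFlow) { this.sumFlow = sumFlow; }

    long getSumFlow() { return sumFlow; }

    // Serialization: same method signatures as Writable.write and Writable.readFields.
    public void write(DataOutput out) throws IOException { out.writeLong(sumFlow); }
    public void readFields(DataInput in) throws IOException { sumFlow = in.readLong(); }

    @Override
    public int compareTo(FlowKey other) {
        return Long.compare(other.sumFlow, this.sumFlow);   // arguments reversed: descending order
    }

    /** Serializes a key and reads it back, as happens between map and reduce. */
    static FlowKey roundTrip(FlowKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            FlowKey copy = new FlowKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts keys built from the given totals using compareTo and returns the totals in that order. */
    static List<Long> sortedTotals(long... totals) {
        List<FlowKey> keys = new ArrayList<>();
        for (long total : totals) {
            keys.add(new FlowKey(total));
        }
        Collections.sort(keys);
        List<Long> out = new ArrayList<>();
        for (FlowKey key : keys) {
            out.add(key.getSumFlow());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedTotals(30, 1000, 300));          // [1000, 300, 30]
        System.out.println(roundTrip(new FlowKey(42)).getSumFlow());  // 42
    }
}
```

With this approach no comparator class needs to be configured, because the lookup shown earlier falls back to a comparator derived from the key type. The trade-off is that keys are deserialized before each comparison, whereas a byte-level comparator can compare the serialized form directly.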

MyDescComparator.java

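import org.apache.hadoop.io.WritableComparator;

public class MyDescComparator extends WritableComparator{
	
	@Override
    public int compare(byte[] b1, int s1, int l1,byte[] b2, int s2, int l2) {
      // Read the two serialized long keys directly from the byte arrays
      long thisValue = readLong(b1, s1);
      long thatValue = readLong(b2, s2);
      // Compared with the ascending version, the first -1 becomes 1 and the second 1 becomes -1, which gives descending order
      return (thisValue<thatValue ? 1 : (thisValue==thatValue ? 0 : -1));
    }

}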
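To see what the byte-level comparison is doing, the following sketch reproduces it with the standard library only. It assumes the key is serialized as 8 big-endian bytes, which is the layout `DataOutput.writeLong` produces; the `readLong` helper here is a local stand-in for `WritableComparator.readLong`, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class RawLongCompareSketch {

    /** Serializes a long as 8 big-endian bytes (ByteBuffer is big-endian by default). */
    static byte[] serialize(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    /** Local stand-in for WritableComparator.readLong(bytes, start). */
    static long readLong(byte[] bytes, int start) {
        return ByteBuffer.wrap(bytes, start, Long.BYTES).getLong();
    }

    /** Same shape as MyDescComparator.compare: the larger key sorts first. */
    static int compareDesc(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        long thisValue = readLong(b1, s1);
        long thatValue = readLong(b2, s2);
        return Long.compare(thatValue, thisValue);
    }

    /** Sorts serialized keys without turning them back into objects during comparison. */
    static List<Long> sortDesc(long... values) {
        List<byte[]> records = new ArrayList<>();
        for (long value : values) {
            records.add(serialize(value));
        }
        records.sort((a, b) -> compareDesc(a, 0, a.length, b, 0, b.length));
        List<Long> out = new ArrayList<>();
        for (byte[] record : records) {
            out.add(readLong(record, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortDesc(30, 1000, 300));   // [1000, 300, 30]
    }
}
```

In the actual job, the comparator from this section would be registered in the driver with `job.setSortComparatorClass(MyDescComparator.class)`, which is the programmatic way of setting mapreduce.job.output.key.comparator.class.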

Result
