MapReduce Secondary Sort Explained

Posted by Thinkgamer_gyt on 2015-08-06

1 First, how it works:

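In the map phase, the InputFormat set with job.setInputFormatClass divides the input data set into smaller chunks called splits, and the InputFormat also supplies a RecordReader implementation. This example uses TextInputFormat, whose RecordReader uses the byte offset of each line within the file as the key and the text of that line as the value. That is why the input to the custom Map is <LongWritable, Text>. The framework then calls the map method of the custom Map once for each <LongWritable, Text> pair. Note that the output must match the output types declared by the custom Map, <IntPair, IntWritable>. The end result is a list of <IntPair, IntWritable> pairs. At the end of the map phase, the partitioner class set with job.setPartitionerClass first partitions this list, and each partition maps to one reducer. Within each partition, the records are then sorted with the key comparator class set with job.setSortComparatorClass. As you can see, this is already a secondary sort in itself. If no key comparator class has been set with job.setSortComparatorClass, the compareTo method implemented by the key is used instead. The first example uses the compareTo method implemented by IntPair, while the next example defines a dedicated key comparator class.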
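The map-side flow described above can be mimicked in plain Java without Hadoop. The sketch below is only an illustration: the class name MapSideSketch, the helpers partitionOf and partitionAndSort, and the two-partition setup are assumptions made for this example, and records are plain int[]{first, second} pairs rather than IntPair objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class MapSideSketch {

    // Stand-in for the partitioner: chooses a partition from the first field only.
    static int partitionOf(int first, int numPartitions) {
        return (first & Integer.MAX_VALUE) % numPartitions;
    }

    // Stand-in for the sort comparator: first field, then second field.
    static final Comparator<int[]> KEY_ORDER =
            Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);

    // Distributes the records over the partitions, then sorts each partition.
    static List<List<int[]>> partitionAndSort(List<int[]> records, int numPartitions) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int[] record : records) {
            partitions.get(partitionOf(record[0], numPartitions)).add(record);
        }
        for (List<int[]> partition : partitions) {
            partition.sort(KEY_ORDER);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
                new int[] {50, 52}, new int[] {20, 21}, new int[] {50, 51},
                new int[] {7, 82}, new int[] {7, 8});
        List<List<int[]>> partitions = partitionAndSort(records, 2);
        for (int i = 0; i < partitions.size(); i++) {
            StringBuilder line = new StringBuilder("partition " + i + ":");
            for (int[] record : partitions.get(i)) {
                line.append(' ').append(record[0]).append('/').append(record[1]);
            }
            System.out.println(line);
        }
    }
}
```

All records with the same first field land in the same partition, and inside each partition they come out ordered by first and then by second, which is the property the rest of the article relies on.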

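In the reduce phase, once the reducer has received all the map output assigned to it, it again sorts all the key/value pairs with the key comparator class set with job.setSortComparatorClass. It then starts building a value iterator for each key. This is where grouping comes in, using the grouping comparator class set with job.setGroupingComparatorClass. Whenever this comparator reports two keys as equal, they belong to the same group and their values go into the same value iterator; the key of that iterator is the first key among all the keys in the group. Finally the reduce method of the Reducer is called, once for each key and its value iterator. Again, the input and output types must match those declared by the custom Reducer.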
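The grouping step can be pictured with a small Hadoop-free sketch. Everything here is hypothetical and exists only for illustration: the class GroupingSketch, its nested Group type and the group helper. It assumes the records are already sorted by (first, second) and that, as in this article, the map output value is the second field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GroupingSketch {

    // One reduce call: the key of the group plus the values handed to the iterator.
    static final class Group {
        final int[] key;                                  // first key seen in the group
        final List<Integer> values = new ArrayList<>();

        Group(int[] key) {
            this.key = key;
        }
    }

    // Walks records already sorted by (first, second) and starts a new group
    // whenever the grouping comparison (first field only) reports a difference.
    static List<Group> group(List<int[]> sortedRecords) {
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (int[] record : sortedRecords) {
            if (current == null || current.key[0] != record[0]) {
                current = new Group(record);
                groups.add(current);
            }
            current.values.add(record[1]);   // the value is the second field
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[] {7, 8}, new int[] {7, 82},
                new int[] {20, 21}, new int[] {20, 53}, new int[] {20, 522});
        for (Group g : group(sorted)) {
            System.out.println("key " + g.key[0] + "/" + g.key[1] + " values " + g.values);
        }
    }
}
```

Because the records arrive sorted, each value list is already ordered by the second field, so the reducer only has to iterate.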

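2 Secondary sort means sorting first by the first field and then, among rows whose first field is equal, by the second field, without disturbing the result of the first sort. For example: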

Input file

20 21
50 51
50 52
50 53
50 54
60 51
60 53
60 52
60 56
60 57
70 58
60 61
70 54
70 55
70 56
70 57
70 58
1 2
3 4
5 6
7 82
203 21
50 512
50 522
50 53
530 54
40 511
20 53
20 522
60 56
60 57
740 58
63 61
730 54
71 55
71 56
73 57
74 58
12 211
31 42
50 62
7 8

Output (note that the separator lines are required):

------------------------------------------------
1       2
------------------------------------------------
3       4
------------------------------------------------
5       6
------------------------------------------------
7       8
7       82
------------------------------------------------
12      211
------------------------------------------------
20      21
20      53
20      522
------------------------------------------------
31      42
------------------------------------------------
40      511
------------------------------------------------
50      51
50      52
50      53
50      53
50      54
50      62
50      512
50      522
------------------------------------------------
60      51
60      52
60      53
60      56
60      56
60      57
60      57
60      61
------------------------------------------------
63      61
------------------------------------------------
70      54
70      55
70      56
70      57
70      58
70      58
------------------------------------------------
71      55
71      56
------------------------------------------------
73      57
------------------------------------------------
74      58
------------------------------------------------
203     21
------------------------------------------------
530     54
------------------------------------------------
730     54
------------------------------------------------
740     58
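As a cross-check on the expected output above, the same result can be produced for a handful of rows in plain Java. This is a sketch under stated assumptions, not the MapReduce job itself: the class ExpectedOutputSketch and its render method are made up for this example, and a tab is assumed between the two columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ExpectedOutputSketch {

    static final String SEPARATOR = "------------------------------------------------";

    // Sorts by first then second and emits one separator line before each new first value.
    static String render(List<int[]> rows) {
        List<int[]> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.<int[]>comparingInt(r -> r[0]).thenComparingInt(r -> r[1]));

        StringBuilder out = new StringBuilder();
        boolean firstRow = true;
        int previousFirst = 0;
        for (int[] row : sorted) {
            if (firstRow || row[0] != previousFirst) {
                out.append(SEPARATOR).append('\n');
            }
            out.append(row[0]).append('\t').append(row[1]).append('\n');
            previousFirst = row[0];
            firstRow = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[] {7, 82}, new int[] {1, 2}, new int[] {7, 8}, new int[] {3, 4});
        System.out.print(render(rows));
    }
}
```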

3 Detailed steps:

1 Define a custom key.

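In MapReduce every key has to be compared and sorted, and this happens in two stages: first the partitioner decides where a key goes, then keys are ordered by value. This example also needs two comparisons: sort by the first field, then sort rows with the same first field by the second field. Based on this we can build a composite class IntPair with two fields. Partitioning on the first field sends all keys with the same first field to the same reducer, and the comparison inside each partition then orders them by the second field.
Every custom key should implement the WritableComparable interface, because it must be both serializable and comparable, and should override the following methods:
// Deserialization: convert the binary data in the stream into an IntPair
public void readFields(DataInput in) throws IOException

// Serialization: convert the IntPair into binary data to send over the stream
public void write(DataOutput out) throws IOException

// Key comparison
public int compareTo(IntPair o)

A newly defined class should also override these two methods:
//The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
public int hashCode()
public boolean equals(Object right)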
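To see the contract of these methods in isolation, here is a minimal sketch that runs on the JDK alone. It is not the article's IntPair: the class PairKey, its roundTrip helper and the 31 * first + second hash are assumptions chosen for illustration, and the class implements java.lang.Comparable instead of Hadoop's WritableComparable so that no Hadoop jar is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PairKey implements Comparable<PairKey> {
    private int first;
    private int second;

    PairKey() { }

    PairKey(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Serialization: two ints, written in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialization: must read the fields in exactly the order write() emits them.
    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Order by first, then by second.
    @Override
    public int compareTo(PairKey o) {
        int byFirst = Integer.compare(first, o.first);
        return byFirst != 0 ? byFirst : Integer.compare(second, o.second);
    }

    @Override
    public int hashCode() {
        return 31 * first + second;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PairKey)) return false;
        PairKey that = (PairKey) other;
        return first == that.first && second == that.second;
    }

    // Writes the key to a byte array and reads it back into a fresh instance.
    static PairKey roundTrip(PairKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            PairKey copy = new PairKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairKey original = new PairKey(50, 512);
        System.out.println(roundTrip(original).equals(original));
        System.out.println(new PairKey(50, 62).compareTo(original) < 0);
    }
}
```

The round trip shows why readFields has to mirror write field by field, and keeping equals consistent with compareTo avoids surprises when keys are hashed and sorted.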

2 Because the key is custom, the following classes also need to be defined:

2.1 The partitioner class. This is the first comparison of the key.
public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>

Set it on the job with setPartitionerClass.
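The partition formula used later in this article, Math.abs(key.getFirst()*127) % numPartitions, can be examined as a pure function. The sketch below is an illustration only: PartitionSketch and both method names are hypothetical, and maskedPartition is an alternative I am adding for comparison, not something the article uses. For the small non-negative numbers in the sample input the two behave the same; they differ only when the multiplication overflows to Integer.MIN_VALUE.

```java
final class PartitionSketch {

    // The formula from the article's partitioner, reproduced as a pure function.
    static int articlePartition(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    // Variant that clears the sign bit instead of calling Math.abs, so the
    // result is never negative even if first * 127 overflows.
    static int maskedPartition(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Same first field, same partition, regardless of the second field.
        System.out.println(articlePartition(50, 4));
        System.out.println(maskedPartition(50, 4));

        // Math.abs(Integer.MIN_VALUE) is still negative, so the remainder is too.
        System.out.println(articlePartition(Integer.MIN_VALUE, 3));
        System.out.println(maskedPartition(Integer.MIN_VALUE, 3));
    }
}
```

Whichever formula is used, the important point is that it reads only the first field, so every key with the same first field reaches the same reducer.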

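2.2 The key comparator class. This is the second comparison of the key. It is a comparator and must extend WritableComparator.
public static class KeyComparator extends WritableComparator
It must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
The alternative is to implement the RawComparator interface.
Set it on the job with setSortComparatorClass.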
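A RawComparator works on the serialized bytes instead of deserialized objects. The sketch below shows what that means for a key written as two big-endian ints, which is the layout DataOutput.writeInt produces. It is an illustration only: the class RawCompareSketch and its methods are hypothetical, and it assumes the byte comparison is lexicographic over unsigned bytes.

```java
import java.nio.ByteBuffer;

final class RawCompareSketch {

    // Serializes (first, second) as two big-endian ints.
    static byte[] serialize(int first, int second) {
        return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
    }

    // Lexicographic comparison of the first four bytes, each treated as unsigned.
    static int compareFirstFieldBytes(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x < y ? -1 : 1;
            }
        }
        return 0;
    }

    // Decodes the first field of each key and compares numerically.
    static int compareFirstFieldInts(byte[] a, byte[] b) {
        return Integer.compare(ByteBuffer.wrap(a).getInt(), ByteBuffer.wrap(b).getInt());
    }

    public static void main(String[] args) {
        // Equal first fields compare as equal whatever the second field is.
        System.out.println(compareFirstFieldBytes(serialize(50, 51), serialize(50, 522)));
        // Non-negative values order the same way in both comparisons.
        System.out.println(compareFirstFieldBytes(serialize(20, 1), serialize(50, 1)));
        // A negative value sorts after a positive one byte-wise, but before it numerically.
        System.out.println(compareFirstFieldBytes(serialize(-1, 0), serialize(1, 0)));
        System.out.println(compareFirstFieldInts(serialize(-1, 0), serialize(1, 0)));
    }
}
```

For grouping, only the "equal or not" answer matters, so a byte-wise comparison of the first four bytes is enough. For sorting, a byte-wise comparison matches numeric order only while the values are non-negative.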

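2.3 The grouping comparator class. In the reduce phase, when the value iterator for a key is built, all keys with the same first belong to the same group and their values go into one value iterator. It is a comparator and must extend WritableComparator.
public static class GroupingComparator extends WritableComparator
As with the key comparator class, it must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
As with the key comparator class, the alternative is to implement the RawComparator interface.
Set it on the job with setGroupingComparatorClass.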

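Also note that if the input and output types of reduce differ, do not reuse the Reducer class as the Combiner, because the Combiner's output is the input of reduce. In that case, define a separate Combiner instead.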

4 Code. This example does not use a key comparator class; it relies on the compareTo method implemented by the key.

package sorttwice;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class sorttwice {

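	// Custom key class; it implements the WritableComparable interface
	public static class IntPair implements WritableComparable<IntPair>{

		int first,second;
		/**
		 * Set left and right
		 */
		public void set(int left,int right){
			first=left;
			second=right;
		}
		public int getFirst(){
			return first;
		}
		public int getSecond(){
			return second;
		}
		// Deserialization: read the binary form from the stream into this IntPair
		@Override
		public void readFields(DataInput in) throws IOException {
			first = in.readInt();
			second = in.readInt();
		}
		// Serialization: write this IntPair out in binary form
		@Override
		public void write(DataOutput out) throws IOException {
			out.writeInt(first);
			out.writeInt(second);
		}
		// Key comparison: by first, then by second
		@Override
		public int compareTo(IntPair o) {
			if(first!=o.first){
				return first<o.first ? -1 : 1;
			}else if(second!=o.second){
				return second<o.second ? -1 : 1;
			}else{
				return 0;
			}
		}
		// Assumed reconstruction: the listing is cut off in the source between
		// compareTo and getPartition, so hashCode, equals and the partitioner
		// class header follow the method list given in section 3.
		@Override
		public int hashCode(){
			return first*157+second;
		}
		@Override
		public boolean equals(Object right){
			if(right instanceof IntPair){
				IntPair r = (IntPair) right;
				return r.first==first && r.second==second;
			}
			return false;
		}
	}
	/**
	 * Partitioner class. The partition is chosen from first only.
	 * The name FirstPartition matches the class referenced in main().
	 */
	public static class FirstPartition extends Partitioner<IntPair,IntWritable>{

		@Override
		public int getPartition(IntPair key, IntWritable value, int numPartitions) {
			return Math.abs(key.getFirst()*127) % numPartitions;
		}
	}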
	/**
	 * Grouping comparator class. Keys with the same first belong to the same group.
	 */
	// First approach: implement the RawComparator interface
	public static class GroupingComparator implements RawComparator<IntPair> {
		@Override
		public int compare(IntPair o1, IntPair o2) {
			int l = o1.getFirst();
			int r = o2.getFirst();
			return l == r ? 0 : (l < r ? -1 : 1);
		}
		// Compares byte by byte until a differing byte is found; the order of that
		// byte decides the order of the two byte streams. Only the first 4 bytes
		// (the serialized first field) are compared.
		@Override
		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2){
			return WritableComparator.compareBytes(b1, s1, Integer.SIZE/8,
					b2, s2, Integer.SIZE/8);
		}
	}

	/*
	// Second approach: extend WritableComparator
	public static class GroupingComparator extends WritableComparator{
		private GroupingComparator(){
			super(IntPair.class,true);
		}
		@Override
		// Compare two WritableComparables by their first field
		public int compare(WritableComparable w1,WritableComparable w2){
			IntPair ip1 = (IntPair) w1;
			IntPair ip2 = (IntPair) w2;
			int l = ip1.getFirst();
			int r = ip2.getFirst();
			return l == r ? 0 : (l < r ? -1 : 1);
		}
	}
	*/

	// Custom map. The class header is cut off in the source; the type parameters
	// here are assumed from the map signature below. The input key (the byte
	// offset of the line) is not used, so it is declared as Object.
	public static class Map extends Mapper<Object, Text, IntPair, IntWritable>{
		private final IntPair intkey = new IntPair();
		private final IntWritable intvalue = new IntWritable();
		@Override
		public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
			StringTokenizer itr = new StringTokenizer(value.toString());
			int left =0;
			int right =0;
			while(itr.hasMoreTokens()){
				left = Integer.parseInt(itr.nextToken());
				if(itr.hasMoreTokens())
					right = Integer.parseInt(itr.nextToken());
				// The key carries both fields; the value repeats the second field
				intkey.set(left, right);
				intvalue.set(right);
				context.write(intkey, intvalue);
			}
		}
	}
	
	public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable>{
		private final Text left = new Text();
		private static final Text SEPARATOR = new Text("================================");
		@Override
		public void reduce(IntPair key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
			// One separator line per group, then one line per value in the group
			context.write(SEPARATOR, null);
			left.set(Integer.toString(key.getFirst()));
			for(IntWritable val:values){
				context.write(left, val);
			}
		}
	}
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

		// Load the Hadoop configuration
		Configuration conf = new Configuration();

		// Initialize the job
		Job job =new Job(conf,"Secondary Sort");
		job.setJarByClass(sorttwice.class);

		job.setNumReduceTasks(1);

		// Set the map and reduce classes
		job.setMapperClass(Map.class);
		job.setReducerClass(Reduce.class);

		// Set the key and value classes of the map output
		job.setMapOutputKeyClass(IntPair.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Set the key and value classes of the reduce output
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Input format: splits the input into chunks and supplies the RecordReader
		job.setInputFormatClass(TextInputFormat.class);

		// Output format: supplies the RecordWriter that writes the results
		job.setOutputFormatClass(TextOutputFormat.class);

		// Custom partitioner
		job.setPartitionerClass(FirstPartition.class);

		// Custom grouping comparator
		job.setGroupingComparatorClass(GroupingComparator.class);

		// Input and output paths
		FileInputFormat.addInputPath(job, new Path("/thinkgamer/input"));
		FileOutputFormat.setOutputPath(job,new Path("/thinkgamer/output"));

		// Submit the job and wait for it to finish
		System.exit(job.waitForCompletion(true)?0:1);
	}
}
