MapReduce (3): Partitioning, Sorting, and Combining
1. Partitioning
Steps to implement partitioning:
1.1 Analyze the concrete business logic and decide roughly how many partitions are needed.
1.2 Write a class that extends org.apache.hadoop.mapreduce.Partitioner.
1.3 Override public int getPartition; based on the concrete logic (for example a database lookup or a configuration file), return the same partition number for records that belong to the same partition.
1.4 In the main method, set the partitioner class: job.setPartitionerClass(DataPartitioner.class);
1.5 Set the number of reduce tasks: job.setNumReduceTasks(6);
The following example sums the upstream traffic, downstream traffic, and total traffic for each phone number, and requires the output to be partitioned (here by the first three digits of the phone number).
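The mapper below assumes space-separated input lines of the form "phone upFlow downFlow"; a few sample lines (the numbers and phone numbers are made up purely for illustration) might look like:
13601234567 1116 954
13501234568 132 1512
13401234569 240 0
18601234570 6000 1320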
DataInfo.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class DataInfo implements Writable{
private String tel;//phone number
private long upFlow;//upstream traffic
private long downFlow;//downstream traffic
private long sumFlow;//total traffic
public DataInfo(){}
public DataInfo(String tel,long upFlow,long downFlow)
{
this.tel=tel;
this.upFlow=upFlow;
this.downFlow=downFlow;
this.sumFlow=upFlow+downFlow;
}
@Override//serialize the fields to the output stream
public void write(DataOutput out) throws IOException {
out.writeUTF(tel);
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
@Override//deserialize back into an object; the field order must match write()
public void readFields(DataInput in) throws IOException {
this.tel=in.readUTF();
this.upFlow=in.readLong();
this.downFlow=in.readLong();
this.sumFlow=in.readLong();
}
@Override
public String toString() {
return (upFlow+"\t"+downFlow+"\t"+sumFlow);
}
public String getTel() {
return tel;
}
public void setTel(String tel) {
this.tel = tel;
}
public long getUpFlow() {
return upFlow;
}
public void setUpFlow(long upFlow) {
this.upFlow = upFlow;
}
public long getDownFlow() {
return downFlow;
}
public void setDownFlow(long downFlow) {
this.downFlow = downFlow;
}
public long getSumFlow() {
return sumFlow;
}
public void setSumFlow(long sumFlow) {
this.sumFlow = sumFlow;
}
}
DataCount.java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DataCount {
//Map
public static class DCMapper extends Mapper<LongWritable,Text,Text,DataInfo>
{
private Text text=new Text();
protected void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException
{
String line=value.toString();
String[] str=line.split(" ");
String tel=str[0];
long up=Long.parseLong(str[1]);
long down=Long.parseLong(str[2]);
DataInfo data=new DataInfo(tel,up,down);
text.set(tel);
context.write(text, data);
}
}
//Partition
public static class DCPartitioner extends Partitioner<Text,DataInfo>
{
private static Map<String,Integer> provider=new HashMap<String,Integer>();
static{
//map each phone-number prefix to its partition number
provider.put("134", 1);
provider.put("135", 2);
provider.put("136", 3);
}
@Override
public int getPartition(Text key, DataInfo data, int numPartitions) {
//look up the partition number; in practice this could be read from a database or a configuration file
String tel=key.toString().substring(0, 3);
Integer num=provider.get(tel);
if(num==null)
num=0;//unknown prefixes all fall into partition 0
return num;
}
}
//Reducer
public static class DCReducer extends Reducer<Text,DataInfo,Text,DataInfo>
{
protected void reduce(Text key,Iterable<DataInfo> values,Context context) throws IOException, InterruptedException
{
long up=0;
long down=0;
for(DataInfo data:values)
{
up+=data.getUpFlow();
down+=data.getDownFlow();
}
DataInfo dataInfo=new DataInfo("",up,down);
context.write(key, dataInfo);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf=new Configuration();
Job job=Job.getInstance(conf,"patition");
job.setJarByClass(DataCount.class);
job.setMapperClass(DCMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(DataInfo.class);
job.setReducerClass(DCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DataInfo.class);
job.setPartitionerClass(DCPartitioner.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setNumReduceTasks(Integer.parseInt(args[2]));//set the number of reduce tasks; the partitioner above can return 0-3, so this must be at least 4 (with a single reduce task the partitioner is effectively ignored)
System.exit(job.waitForCompletion(true)? 0:1);
}
}
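With four reduce tasks, phone numbers whose prefix is not in the table land in part-r-00000, while the 134, 135, and 136 prefixes go to part-r-00001, part-r-00002, and part-r-00003 respectively.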
2. Sorting
By default, MapReduce sorts records by k2, the map output key. To define a custom sort order, the object being sorted must implement the WritableComparable interface and put the ordering logic in its compareTo method (the MapReduce shuffle calls this method automatically); use that object as k2 and the records come out in the desired order.
Partial code below (a fuller sketch of the surrounding class follows the snippet):
@Override
public int compareTo(InfoBean o) {
if(this.income == o.getIncome()){
return this.expenses > o.getExpenses() ? 1 : -1;
}
return this.income > o.getIncome() ? 1 : -1;
}
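For context, here is a minimal sketch of the class such a compareTo could sit in. The class name InfoBean and the income/expenses fields come from the snippet above; the double field type, the constructors, and the serialization order are assumptions for illustration (and the tie case returns 0 here to honour the compareTo contract).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
public class InfoBean implements WritableComparable<InfoBean>{
private double income;//income (field type assumed)
private double expenses;//expenses (field type assumed)
public InfoBean(){}//no-arg constructor required by Hadoop
public InfoBean(double income,double expenses)
{
this.income=income;
this.expenses=expenses;
}
public double getIncome() {
return income;
}
public double getExpenses() {
return expenses;
}
@Override//serialize the fields to the stream
public void write(DataOutput out) throws IOException {
out.writeDouble(income);
out.writeDouble(expenses);
}
@Override//deserialize in the same field order
public void readFields(DataInput in) throws IOException {
this.income=in.readDouble();
this.expenses=in.readDouble();
}
@Override//ascending by income, then by expenses when incomes tie
public int compareTo(InfoBean o) {
if(this.income == o.getIncome()){
return Double.compare(this.expenses, o.getExpenses());
}
return this.income > o.getIncome() ? 1 : -1;
}
@Override
public String toString() {
return (income+"\t"+expenses);
}
}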
3. Combining (Combiner)
The combiner performs a first round of aggregation on the map side (it is effectively a reducer that runs locally), which reduces the amount of data shipped to the reducers. Without a combiner, all aggregation is left to the reduce phase and efficiency suffers; with one, each finished map aggregates its own output locally, which speeds the job up.
Note: the combiner's output becomes the reducer's input, and since the combiner is pluggable, adding one must never change the final result. Use a combiner only when the reducer's input and output key/value types are identical and the computation is insensitive to partial aggregation, such as summation or taking a maximum.
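For example, an average cannot be pushed into a combiner: if one map task emits the values 1, 2, 3 and another emits 4, 5, combining them locally into the averages 2 and 4.5 and then averaging those in the reducer gives 3.25, whereas the true average of all five values is 3.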
Let's walk through an inverted-index example.
a.txt: hello tom
hello jerry
.....
b.txt: hello jerry
hello tom
....
Output:
hello a.txt->2 b.txt->2
......
---------------------------------
Map stage
<0,"hello tom">
....
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->b.txt",1);
context.write("hello->b.txt",1);
context.write("hello->b.txt",1);
--------------------------------------------------------
Combiner stage
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->b.txt",1>
<"hello->b.txt",1>
<"hello->b.txt",1>
context.write("hello","a.txt->5");
context.write("hello","b.txt->3");
--------------------------------------------------------
Reducer stage
<"hello",{"a.txt->5","b.txt->3"}>
context.write("hello","a.txt->5 b.txt->3");
-------------------------------------------------------
Result:
hello "a.txt->5 b.txt->3"
tom "a.txt->2 b.txt->1"
kitty "a.txt->1"
.......
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Combine {
//Map
public static class CBMapper extends Mapper<Object,Text,Text,Text>
{
private Text k=new Text();
private Text v=new Text();
protected void map(Object key,Text value,Context context) throws IOException, InterruptedException
{
String line=value.toString();
String[]str=line.split(" ");
FileSplit inputSplit=(FileSplit)context.getInputSplit();
String path=inputSplit.getPath().getName();//file name of this split (e.g. a.txt), obtained through the context
for(String word:str)
{
k.set(word+"->"+path);
v.set("1");
context.write(k, v);
}
}
}
//Combiner (local first reduce pass): turns each (word->file, 1) group into a single (word, file->count) pair
public static class CBCombiner extends Reducer<Text,Text,Text,Text>
{
private Text k=new Text();
private Text v=new Text();
protected void reduce(Text key,Iterable<Text>values,Context context) throws IOException, InterruptedException
{
String line=key.toString();
String[]str=line.split("->");
String k1=str[0];
String path=str[1];
int count=0;
for(Text t:values)
{
count+=Integer.parseInt(t.toString());
}
k.set(k1);
v.set(path+"->"+count);
context.write(k,v);
}
}
//Reducer (second pass): concatenates the per-file counts for each word
public static class CBReducer extends Reducer<Text,Text,Text,Text>
{
private Text v=new Text();
@Override
protected void reduce(Text key, Iterable<Text> values,Context context)
throws IOException, InterruptedException {
String result="";
for(Text t:values)
{
result+=t.toString()+"\t";
}
v.set(result);
context.write(key, v);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException
{
//build the job object
Configuration conf=new Configuration();
Job job=Job.getInstance(conf, "Combiner");
//set the class that contains the main method
job.setJarByClass(Combine.class);
//mapper settings
job.setMapperClass(CBMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
//combiner settings
job.setCombinerClass(CBCombiner.class);
//reducer settings
job.setReducerClass(CBReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//input and output paths
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
//submit the job and wait for it to finish
System.exit(job.waitForCompletion(true)? 0:1);
}
}
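One caveat on this example: Hadoop only promises to run a combiner zero or more times, and CBCombiner changes the key/value format from (word->file, 1) to (word, file->count), so the job quietly relies on the combiner being applied exactly once to each map output. It works here in practice, but the safer pattern, per the note above, keeps the combiner's input and output key/value semantics identical.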