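MapReduce (3): Partitioning, Sorting, and Combining
1. Partitioning
Steps to implement partitioning:
1.1 Analyze the business logic first and work out roughly how many partitions are needed.
1.2 Write a class that extends org.apache.hadoop.mapreduce.Partitioner.
1.3 Override public int getPartition. Following the business logic (for example by reading a database or a configuration), return the same number for every key that belongs in the same partition.
1.4 Register the Partitioner class in the main method: job.setPartitionerClass(DataPartitioner.class);
1.5 Set the number of Reducers: job.setNumReduceTasks(6); The number must be greater than the largest value getPartition returns.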
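The following example totals the upstream traffic, downstream traffic, and overall traffic for each phone number, and partitions the output by phone-number prefix.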
DataInfo.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class DataInfo implements Writable {
    private String tel;     // phone number
    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    // No-argument constructor, needed so the object can be created before readFields is called
    public DataInfo() {}

    public DataInfo(String tel, long upFlow, long downFlow) {
        this.tel = tel;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // Serialize the fields to a stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(tel);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // Deserialize into the object; the fields must be read in the same order they were written
    @Override
    public void readFields(DataInput in) throws IOException {
        this.tel = in.readUTF();
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    // Tab-separated output: upstream, downstream, total
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    public String getTel() {
        return tel;
    }

    public void setTel(String tel) {
        this.tel = tel;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }
}
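The reminder about read and write order can be tried out without a cluster. The following is a minimal sketch that uses only the JDK's DataOutputStream and DataInputStream; FlowRecordSketch is a hypothetical stand-in for DataInfo, not part of the original code, and it leaves out sumFlow for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical, Hadoop-free stand-in for DataInfo.
class FlowRecordSketch {
    final String tel;
    final long upFlow;
    final long downFlow;

    FlowRecordSketch(String tel, long upFlow, long downFlow) {
        this.tel = tel;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    // Write the fields in a fixed order: tel, upFlow, downFlow.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(tel);
            out.writeLong(upFlow);
            out.writeLong(downFlow);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read the fields back in exactly the order they were written.
    static FlowRecordSketch fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String tel = in.readUTF();
            long up = in.readLong();
            long down = in.readLong();
            return new FlowRecordSketch(tel, up, down);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping the readUTF call with one of the readLong calls in fromBytes would make the reader interpret the string's bytes as a number, which is the kind of mistake the comment on readFields warns about.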
DataCount.java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DataCount {
    // Map: parse one line and emit <phone number, DataInfo>
    public static class DCMapper extends Mapper<LongWritable, Text, Text, DataInfo> {
        private Text text = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each line is split on a single space: phone number, upstream, downstream
            String line = value.toString();
            String[] str = line.split(" ");
            String tel = str[0];
            long up = Long.parseLong(str[1]);
            long down = Long.parseLong(str[2]);
            DataInfo data = new DataInfo(tel, up, down);
            text.set(tel);
            context.write(text, data);
        }
    }

    // Partition: route each key to a partition by its three-digit prefix
    public static class DCPartitioner extends Partitioner<Text, DataInfo> {
        private static Map<String, Integer> provider = new HashMap<String, Integer>();
        static {
            provider.put("134", 1);
            provider.put("135", 2);
            provider.put("136", 3);
        }

        @Override
        public int getPartition(Text key, DataInfo data, int numPartitions) {
            // In a real job this mapping could be read from a database or a configuration
            String tel = key.toString().substring(0, 3);
            Integer num = provider.get(tel);
            if (num == null) {
                num = 0; // every other prefix goes to partition 0
            }
            return num;
        }
    }

    // Reducer: add up the traffic of one phone number
    public static class DCReducer extends Reducer<Text, DataInfo, Text, DataInfo> {
        @Override
        protected void reduce(Text key, Iterable<DataInfo> values, Context context)
                throws IOException, InterruptedException {
            long up = 0;
            long down = 0;
            for (DataInfo data : values) {
                up += data.getUpFlow();
                down += data.getDownFlow();
            }
            // The phone number is already the key, so tel is left empty in the value
            DataInfo dataInfo = new DataInfo("", up, down);
            context.write(key, dataInfo);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "partition");
        job.setJarByClass(DataCount.class);
        job.setMapperClass(DCMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DataInfo.class);
        job.setReducerClass(DCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DataInfo.class);
        job.setPartitionerClass(DCPartitioner.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Number of reduce tasks, i.e. the number of partitions. DCPartitioner returns 0 to 3,
        // so this must be at least 4; with 2 or 3 the job fails on keys sent to a partition that does not exist.
        job.setNumReduceTasks(Integer.parseInt(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
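To see how the prefix table relates to the reducer count, the rule in DCPartitioner can be run on its own. This is a minimal sketch, assuming the same three-entry table; PrefixPartitionSketch and the sample phone numbers are made up for illustration and nothing here touches the Hadoop API.

```java
import java.util.HashMap;
import java.util.Map;

// Hadoop-free sketch of the prefix-to-partition rule used by DCPartitioner.
class PrefixPartitionSketch {
    private static final Map<String, Integer> PROVIDER = new HashMap<>();
    static {
        PROVIDER.put("134", 1);
        PROVIDER.put("135", 2);
        PROVIDER.put("136", 3);
    }

    // Known prefixes map to 1..3, everything else (including short keys) to 0.
    static int partitionFor(String tel) {
        String prefix = tel.length() >= 3 ? tel.substring(0, 3) : tel;
        return PROVIDER.getOrDefault(prefix, 0);
    }

    // Smallest reducer count that covers every partition number the rule can return.
    static int minReducers() {
        int max = 0; // partition 0 is always possible (unknown prefixes)
        for (int p : PROVIDER.values()) {
            max = Math.max(max, p);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        String[] samples = {"13412345678", "13512345678", "13612345678", "18812345678"}; // made-up numbers
        for (String tel : samples) {
            System.out.println(tel + " -> partition " + partitionFor(tel));
        }
        System.out.println("minimum reducers: " + minReducers()); // prints "minimum reducers: 4"
    }
}
```

Because partition numbers start at 0, three named prefixes plus the catch-all partition need four reduce tasks, not three.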
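2. Sorting
By default MapReduce sorts records by k2 (the map output key). To define a custom sort order, the object being sorted must implement the WritableComparable interface and put the ordering rule in its compareTo method (the MapReduce shuffle calls this method automatically). Use that object as k2 and the sort is done.
Part of the code is shown below: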
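// Ascending by income; records with equal income are ordered by ascending expenses.
// This method never returns 0, so two beans with the same income and expenses still count as different keys.
@Override
public int compareTo(InfoBean o) {
    if (this.income == o.getIncome()) {
        return this.expenses > o.getExpenses() ? 1 : -1;
    }
    return this.income > o.getIncome() ? 1 : -1;
}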
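The ordering rule can be checked outside MapReduce with an ordinary Comparable. The sketch below is an assumption-laden illustration: InfoBeanSketch, its account field, and the long field types are hypothetical, since the original only shows the compareTo fragment. It also uses Long.compare, which returns 0 for equal values, instead of the original's 1 / -1 logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, Hadoop-free version of InfoBean; field names and types are assumptions.
class InfoBeanSketch implements Comparable<InfoBeanSketch> {
    final String account;
    final long income;
    final long expenses;

    InfoBeanSketch(String account, long income, long expenses) {
        this.account = account;
        this.income = income;
        this.expenses = expenses;
    }

    // Ascending by income, then ascending by expenses; 0 only when both are equal.
    @Override
    public int compareTo(InfoBeanSketch o) {
        int byIncome = Long.compare(this.income, o.income);
        return byIncome != 0 ? byIncome : Long.compare(this.expenses, o.expenses);
    }

    // Sort a copy of the input and return the accounts in sorted order.
    static List<String> sortedAccounts(List<InfoBeanSketch> beans) {
        List<InfoBeanSketch> copy = new ArrayList<>(beans);
        Collections.sort(copy);
        List<String> accounts = new ArrayList<>();
        for (InfoBeanSketch bean : copy) {
            accounts.add(bean.account);
        }
        return accounts;
    }
}
```

For beans ("c", 300, 10), ("a", 100, 50), and ("b", 100, 20), sortedAccounts returns b, a, c. In a real job, whether compareTo may return 0 is a design choice: keys that compare as equal are handed to the same reduce call, so a tie-breaker field may be needed if every record has to stay separate.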
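3. Combiner
A combiner merges the map output once on the map side (it is effectively a local reducer), which reduces the amount of data transferred to the reducers.
Without a combiner, all aggregation is done in the reduce phase, which is comparatively inefficient. With a combiner, each map task aggregates its own output locally as it finishes, which speeds the job up.
Note: the combiner's output is the reducer's input, and the combiner is pluggable (Hadoop may run it zero, one, or several times), so adding one must never change the final result. A combiner should therefore only be used where the reduce input key/value types are exactly the same as the output key/value types and the final result is unaffected, for example summation or taking a maximum.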
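A small, Hadoop-free sketch may make the rule concrete. It compares reducing all values at once with reducing values that were first combined per map task, once for a sum and once for an average. The class name and the sample numbers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hadoop-free sketch: does pre-aggregating per map task change the final answer?
class CombinerSafetySketch {
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    static List<Long> concat(List<Long> a, List<Long> b) {
        List<Long> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }

    public static void main(String[] args) {
        // Made-up values emitted for the same key by two map tasks.
        List<Long> map1 = Arrays.asList(1L, 2L, 3L);
        List<Long> map2 = Arrays.asList(10L);
        List<Long> all = concat(map1, map2);

        long sumDirect = sum(all);                                    // 16
        long sumCombined = sum(Arrays.asList(sum(map1), sum(map2)));  // 6 + 10 = 16
        System.out.println("sum: " + sumDirect + " vs " + sumCombined);

        double avgDirect = average(all);                              // 16 / 4 = 4.0
        double avgCombined = (average(map1) + average(map2)) / 2;     // (2.0 + 10.0) / 2 = 6.0
        System.out.println("average: " + avgDirect + " vs " + avgCombined);
    }
}
```

The sum is the same either way, so summing is safe to use as a combiner. The average of the per-task averages differs from the true average, so a plain averaging reducer cannot be reused as a combiner.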
Now look at an inverted index example.
a.txt:
hello tom
hello jerry
.....
b.txt:
hello jerry
hello tom
....
Output:
hello a.txt->2 b.txt->2
......
---------------------------------
Map phase
<0,"hello tom">
....
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->a.txt",1);
context.write("hello->b.txt",1);
context.write("hello->b.txt",1);
context.write("hello->b.txt",1);
--------------------------------------------------------
Combiner phase
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->a.txt",1>
<"hello->b.txt",1>
<"hello->b.txt",1>
<"hello->b.txt",1>
context.write("hello","a.txt->5");
context.write("hello","b.txt->3");
--------------------------------------------------------
Reducer phase
<"hello",{"a.txt->5","b.txt->3"}>
context.write("hello","a.txt->5 b.txt->3");
-------------------------------------------------------
Result:
hello "a.txt->5 b.txt->3"
tom "a.txt->2 b.txt->1"
kitty "a.txt->1"
.......
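The three phases above can be simulated in memory to check the data flow before running it on a cluster. This is a minimal sketch under stated assumptions: InvertedIndexSketch and its method names are hypothetical, the input is a map from file name to file content, and the combine step runs exactly once over all map output, which a real Hadoop job does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free simulation of the map, combiner, and reducer phases of the inverted index.
class InvertedIndexSketch {
    // Map phase: emit one "word->file" key per word occurrence (the value "1" is implicit).
    static List<String> map(String file, String content) {
        List<String> keys = new ArrayList<>();
        for (String word : content.trim().split("\\s+")) {
            keys.add(word + "->" + file);
        }
        return keys;
    }

    // Combiner phase: count each "word->file" key, then re-key by word with value "file->count".
    static Map<String, List<String>> combine(List<String> mapOutput) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : mapOutput) {
            counts.merge(key, 1, Integer::sum);
        }
        Map<String, List<String>> byWord = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] parts = e.getKey().split("->");
            byWord.computeIfAbsent(parts[0], w -> new ArrayList<>())
                  .add(parts[1] + "->" + e.getValue());
        }
        return byWord;
    }

    // Reducer phase: join all "file->count" values of one word into a single string.
    static Map<String, String> reduce(Map<String, List<String>> combined) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : combined.entrySet()) {
            index.put(e.getKey(), String.join(" ", e.getValue()));
        }
        return index;
    }

    // Run all three phases over a map of file name to file content.
    static Map<String, String> build(Map<String, String> files) {
        List<String> mapOutput = new ArrayList<>();
        for (Map.Entry<String, String> f : files.entrySet()) {
            mapOutput.addAll(map(f.getKey(), f.getValue()));
        }
        return reduce(combine(mapOutput));
    }
}
```

With a.txt containing "hello tom hello jerry" and b.txt containing "hello jerry hello tom", build returns "a.txt->2 b.txt->2" for hello and "a.txt->1 b.txt->1" for tom and jerry.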
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Combine {
    // Map: emit <"word->path", "1"> for every word
    public static class CBMapper extends Mapper<Object, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] str = line.split(" ");
            // The input split comes from the context and tells us which file this line belongs to.
            // FileSplit must be the org.apache.hadoop.mapreduce.lib.input class, not the old mapred one.
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            // Full path of the file; getPath().getName() would give only the file name
            String path = inputSplit.getPath().toString();
            for (String word : str) {
                k.set(word + "->" + path);
                v.set("1");
                context.write(k, v);
            }
        }
    }

    // Combiner (Reduce 1): sum the counts of one "word->path" key and re-key the result by word.
    // Caution: this changes the key and value format, so the output is only as shown above when the
    // combiner runs exactly once over each map task's output, which Hadoop does not guarantee.
    public static class CBCombiner extends Reducer<Text, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String line = key.toString();
            String[] str = line.split("->");
            String k1 = str[0];
            String path = str[1];
            int count = 0;
            for (Text t : values) {
                count += Integer.parseInt(t.toString());
            }
            k.set(k1);
            v.set(path + "->" + count);
            context.write(k, v);
        }
    }

    // Reduce (Reduce 2): join all "path->count" values of one word
    public static class CBReducer extends Reducer<Text, Text, Text, Text> {
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String result = "";
            for (Text t : values) {
                result += t.toString() + "\t";
            }
            v.set(result);
            context.write(key, v);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Build the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Combiner");
        // Set the class that contains the main method
        job.setJarByClass(Combine.class);
        // Mapper settings
        job.setMapperClass(CBMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Combiner settings
        job.setCombinerClass(CBCombiner.class);
        // Reducer settings
        job.setReducerClass(CBReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}