A MapReduce Program Example, Where Details Decide Success or Failure (Part 5): Partitioner

Posted by self_control on 2016-05-27
In the previous post we introduced the combiner and in-map aggregation as two ways to optimize a MapReduce job.
In this post we continue optimizing the same character-counting MR program.
To restate the goal of the program: count how many of each of the characters a~z appear in the input files, with 'A' and 'a' both counted under 'a'.
First, look at the Job Counters from the last run; you can see that only one reduce task was launched.


        Job Counters
                Launched map tasks=3
                Launched reduce tasks=1
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=1207744
                Total time spent by all reduces in occupied slots (ms)=341424
Key point: an MR job uses 1 reducer by default. The number of reducers can be set with job.setNumReduceTasks(num).
A question to leave for later: how do you set the number of map tasks?

When the data volume is large, a single reduce task can become the bottleneck of the whole job, so let's set the number of reducers by hand.


        @Override
        public int run(String[] args) throws Exception {
                // validate the parameters
                if(args.length != 2){
                        return -1;
                }

                Job job = Job.getInstance(getConf(), "MyWordCountJob");
                job.setJarByClass(MyWordCountJob.class);

                Path inPath = new Path(args[0]);
                Path outPath = new Path(args[1]);

                outPath.getFileSystem(getConf()).delete(outPath, true);
                TextInputFormat.setInputPaths(job, inPath);
                TextOutputFormat.setOutputPath(job, outPath);

                job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
                job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);

                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(IntWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                job.setNumReduceTasks(3);
                return job.waitForCompletion(true) ? 0 : 1;
        }
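As an aside, the driver is launched through ToolRunner (see the full listing at the end), so if the reducer count were not hard-coded in run(), it could also be overridden at submit time through the standard mapreduce.job.reduces property instead of recompiling. The jar name below is only a placeholder:

        hadoop jar MyWordCount.jar wordcount.MyWordCountJob -D mapreduce.job.reduces=3 input output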
Recompile and run again, then check the log: the number of reduce tasks is now 3, as expected. But we are not done yet!


        File System Counters
                FILE: Number of bytes read=642
                FILE: Number of bytes written=507033
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=556
                HDFS: Number of bytes written=107
                HDFS: Number of read operations=18
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Job Counters
                Launched map tasks=3
                Launched reduce tasks=3
                Data-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=1207744
                Total time spent by all reduces in occupied slots (ms)=341424
        Map-Reduce Framework
                Map input records=8
                Map output records=78
                Map output bytes=468
                Map output materialized bytes=678
                Input split bytes=399
                Combine input records=0
                Combine output records=0
                Reduce input groups=26
                Reduce shuffle bytes=678
                Reduce input records=78
                Reduce output records=26
                Spilled Records=156
                Shuffled Maps =9
                Failed Shuffles=0
                Merged Map outputs=9
                GC time elapsed (ms)=507
                CPU time spent (ms)=7770
                Physical memory (bytes) snapshot=1329672192
                Virtual memory (bytes) snapshot=5978918912
                Total committed heap usage (bytes)=1004273664

Now look at the output: there are three output files, which is normal, one per reducer. (Also note Shuffled Maps=9 in the counters above: each of the 3 map outputs was shuffled to each of the 3 reducers.)


        [train@sandbox MyWordCount]$ hdfs dfs -ls output
        Found 4 items
        -rw-r--r-- 3 train hdfs 0 2016-05-11 11:48 output/_SUCCESS
        -rw-r--r-- 3 train hdfs 37 2016-05-11 11:48 output/part-r-00000
        -rw-r--r-- 3 train hdfs 34 2016-05-11 11:48 output/part-r-00001
        -rw-r--r-- 3 train hdfs 36 2016-05-11 11:48 output/part-r-00002

Let's look at what each result file contains.


        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00000
        b 3
        e 11
        h 8
        k 3
        n 4
        q 3
        t 4
        w 7
        z 3
        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00001
        c 4
        f 4
        i 5
        l 6
        o 12
        r 13
        u 6
        x 3
        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00002
        a 8
        d 4
        g 3
        j 3
        m 7
        p 3
        s 4
        v 0
        y 6
Notice that the letters are scattered across the three files rather than grouped in order. What we actually want is a~h in the first file, i~p in the second, and q~z in the third. How do we get that?
This is where the tool introduced in this post comes in: the Partitioner.
The Partitioner decides which reducer each map output record is sent to. When there is only one reducer, the partitioner is not invoked at all, even if one is configured.
The default Partitioner in the Hadoop framework is HashPartitioner, which assigns a record by taking the key's hash value modulo the number of reducers.
To be clear: returning 0 sends the record to the first reducer, returning 1 sends it to the second, and so on.


        public class HashPartitioner<K, V> extends Partitioner<K, V> {

          public int getPartition(K key, V value,
                                  int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
          }
        }
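To see why this scatters the letters, here is a small local sketch (the class name HashPartitionDemo is made up for illustration; it only needs the Hadoop client jars on the classpath) that pushes the 26 single-letter keys through HashPartitioner with 3 reducers. Its output matches the listing above, e.g. 'a' lands on reducer 2 (part-r-00002) and 'b' on reducer 0 (part-r-00000):

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

        public class HashPartitionDemo {
                public static void main(String[] args) {
                        // Reproduce the job's partitioning decision locally:
                        // same key type (Text), same value type (IntWritable), 3 reducers.
                        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
                        IntWritable dummyValue = new IntWritable(1);
                        for (char c = 'a'; c <= 'z'; c++) {
                                int partition = partitioner.getPartition(new Text(String.valueOf(c)), dummyValue, 3);
                                System.out.println(c + " -> reducer " + partition);
                        }
                }
        }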
So next we define our own Partitioner that routes a~h to the first reducer, i~p to the second, and q~z to the third.
 


        public static class MyWordCountPartitioner extends Partitioner<Text,IntWritable>{

                @Override
                public int getPartition(Text key, IntWritable value, int numPartitions) {
                        char c = key.toString().charAt(0);
                        if(c>='a' && c<'i')           // a~h -> first reducer
                                return 0;
                        else if(c>='i' && c<'q')      // i~p -> second reducer
                                return 1;
                        else                          // q~z -> third reducer
                                return 2;
                }

        }
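For the new class to take effect it also has to be registered in the driver; the full listing below does this in run() with:

        job.setPartitionerClass(MyWordCountPartitioner.class);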
Run the job again and check the results: the output is now split exactly as planned.


        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00000
        a 8
        b 3
        c 4
        d 4
        e 11
        f 4
        g 3
        h 8
        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00001
        i 5
        j 3
        k 3
        l 6
        m 7
        n 4
        o 12
        p 3
        [train@sandbox MyWordCount]$ hdfs dfs -cat output/part-r-00002
        q 3
        r 13
        s 4
        t 4
        u 6
        v 0
        w 7
        x 3
        y 6
        z 3
As usual, the complete code is pasted below.


        package wordcount;

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.conf.Configured;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Partitioner;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
        import org.apache.hadoop.util.Tool;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.log4j.Logger;

        public class MyWordCountJob extends Configured implements Tool {
                Logger log = Logger.getLogger(MyWordCountJob.class);

                public static class MyWordCountMapper extends
                                Mapper<LongWritable, Text, Text, IntWritable> {
                        Logger log = Logger.getLogger(MyWordCountJob.class);
                        Map<Character,Integer> map = new HashMap<Character,Integer>();

                        Text mapKey = new Text();
                        IntWritable mapValue = new IntWritable(1);

                        @Override
                        protected void map(LongWritable key, Text value, Context context)
                                        throws IOException, InterruptedException {
                                // in-map aggregation: only update the local counts here
                                for(char c : value.toString().toLowerCase().toCharArray()){
                                        if(c>='a' && c<='z'){
                                                map.put(c, map.get(c)+1);
                                        }
                                }
                        }

                        @Override
                        protected void cleanup(Context context) throws IOException,
                                        InterruptedException {
                                // emit the aggregated counts once per map task
                                for(char key : map.keySet()){
                                        mapKey.set(String.valueOf(key));
                                        mapValue.set(map.get(key));
                                        context.write(mapKey, mapValue);
                                }
                        }

                        @Override
                        protected void setup(Context context) throws IOException,
                                        InterruptedException {
                                for(char c='a'; c<='z'; c++){
                                        map.put(c, 0);
                                }
                        }
                }

                // the partitioner must be a static nested class (as in the snippet above)
                // so the framework can instantiate it by reflection
                public static class MyWordCountPartitioner extends Partitioner<Text,IntWritable>{

                        @Override
                        public int getPartition(Text key, IntWritable value, int numPartitions) {
                                char c = key.toString().charAt(0);
                                if(c>='a' && c<'i')           // a~h -> first reducer
                                        return 0;
                                else if(c>='i' && c<'q')      // i~p -> second reducer
                                        return 1;
                                else                          // q~z -> third reducer
                                        return 2;
                        }

                }

                public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                        Text rkey = new Text();
                        IntWritable rvalue = new IntWritable(1);

                        @Override
                        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                                        throws IOException, InterruptedException {
                                int n = 0;
                                for(IntWritable value : values){
                                        n += value.get();
                                }
                                rvalue.set(n);
                                context.write(key, rvalue);
                        }
                }

                @Override
                public int run(String[] args) throws Exception {
                        // validate the parameters
                        if(args.length != 2){
                                return -1;
                        }

                        Job job = Job.getInstance(getConf(), "MyWordCountJob");
                        job.setJarByClass(MyWordCountJob.class);

                        Path inPath = new Path(args[0]);
                        Path outPath = new Path(args[1]);

                        outPath.getFileSystem(getConf()).delete(outPath, true);
                        TextInputFormat.setInputPaths(job, inPath);
                        TextOutputFormat.setOutputPath(job, outPath);

                        job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
                        job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
                        job.setPartitionerClass(MyWordCountPartitioner.class);
                        job.setInputFormatClass(TextInputFormat.class);
                        job.setOutputFormatClass(TextOutputFormat.class);
                        job.setMapOutputKeyClass(Text.class);
                        job.setMapOutputValueClass(IntWritable.class);
                        job.setOutputKeyClass(Text.class);
                        job.setOutputValueClass(IntWritable.class);

                        job.setNumReduceTasks(3);
                        return job.waitForCompletion(true) ? 0 : 1;
                }

                public static void main(String[] args){
                        int result = 0;
                        try {
                                result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
                        } catch (Exception e) {
                                e.printStackTrace();
                        }
                        System.exit(result);
                }

        }
Two final remarks: in most cases a custom partitioner is written to deal with data that is distributed unevenly across reducers, also known as data skew. And whatever the logic, a custom partitioner must guarantee that identical keys are always sent to the same reducer.
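To illustrate both points, here is a minimal sketch of a skew-aware partitioner (the hot key "the" is an invented example, not part of this job, and at least two reducers are assumed): one known hot key gets its own reducer while every other key is hashed as usual, and since the decision depends only on the key, identical keys always reach the same reducer.

        public static class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
                // hypothetical hot key that would otherwise overload one reducer
                private static final String HOT_KEY = "the";

                @Override
                public int getPartition(Text key, IntWritable value, int numPartitions) {
                        if (key.toString().equals(HOT_KEY))
                                return numPartitions - 1;   // dedicate the last reducer to the hot key
                        // spread all remaining keys over the other reducers, as HashPartitioner does
                        return (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
                }
        }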

 
