Details Determine Success: A Hands-on MapReduce Task, the Map Join

Posted by self_control on 2016-06-14

1. Task Description

There are two files on HDFS.

Student information file: hdfs://***.***.***:8020/user/train/joinjob/student.txt

Comma-separated. The first column is the student ID, which is unique per student; the second column is the student name.


2016001,Join
2016002,Abigail
2016003,Abby
2016004,Alexandra
2016005,Cathy
2016006,Katherine

Student score file: hdfs://***.***.***:8020/user/train/joinjob/student_score.txt

Comma-separated. The first column is the student ID, the second is the course code, and the third is the score.


2016001,YY,60
2016001,SX,88
2016001,YW,91
2016002,SX,77
2016002,YW,33
.............

The expected result file looks like this:


2016001,Join,YY,60
2016001,Join,SX,88
2016001,Join,YW,91
2016002,Abigail,SX,77
2016002,Abigail,YW,33
2016002,Abigail,YY,56
2016003,Abby,YY,34
2016003,Abby,SX,84
2016003,Abby,YW,69
2016004,Alexandra,YY,89
2016004,Alexandra,SX,84
.......

2. Task Analysis

  1. This is a task that joins two datasets. A join can be done on the map side or on the reduce side. When one dataset is small enough to fit entirely in memory, choose a map-side join; when both datasets are large, use a reduce-side join, where a Bloom filter can improve efficiency. This task uses a map-side join.
  2. The joined records can be written out directly, so no reduce phase is needed.
  3. Techniques used in this task: the DistributedCache, and reading local files inside a MapReduce task.
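Hadoop aside, the core idea in point 1 — load the small dataset into an in-memory hash map, then stream the large dataset past it one record at a time — can be sketched in plain Java. The class and method names below are illustrative only, not part of the actual job:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal, Hadoop-free sketch of a map-side (hash) join.
public class HashJoinSketch {
    public static List<String> join(List<String> students, List<String> scores) {
        // Build the lookup table from the small dataset: id -> name.
        Map<String, String> idToName = new HashMap<>();
        for (String line : students) {
            String[] f = line.split(",");
            idToName.put(f[0], f[1]);
        }
        // Stream the large dataset and emit the joined records.
        List<String> out = new ArrayList<>();
        for (String line : scores) {
            String[] f = line.split(",");
            out.add(f[0] + "," + idToName.get(f[0]) + "," + f[1] + "," + f[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> students = List.of("2016001,Join", "2016002,Abigail");
        List<String> scores = List.of("2016001,YY,60", "2016002,SX,77");
        join(students, scores).forEach(System.out::println);
    }
}
```

The small dataset is read exactly once per task, and each large-dataset record is joined with a single O(1) lookup, which is why this pattern avoids the shuffle that a reduce-side join requires.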

3. Implementation Approach

  1. Distribute the student information file to every task node via the distributed cache.
  2. Use the student score file as the map input.
  3. In the mapper's setup method, read the locally cached student information and store it in a HashMap, keyed by student ID with the name as the value.
  4. In the map method, for each score record, look up the student's name in that HashMap by student ID.
  5. Wrap the joined record in a Text object and emit it.

4. Implementation Code


package join;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapJoinDirectOutPutJob extends Configured implements Tool {

    private static final String INPATH_SCORE = "joinjob/student_score.txt";
    private static final String OUTPATH = "joinjob/output";

    public static class MapJoinDirectOutPutMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        // In-memory lookup table built from the cached student file: id -> name.
        private final Map<String, String> map = new HashMap<String, String>();
        private final Text newValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // The cache file was added with the fragment "#studentLink.txt" in run(),
            // so it appears in the task's working directory under that symlink name.
            FileSystem fs = FileSystem.getLocal(conf);
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("studentLink.txt"))));
            try {
                String current;
                while ((current = br.readLine()) != null) {
                    String[] words = current.split(",");
                    map.put(words[0], words[1]); // student ID -> student name
                }
            } finally {
                br.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is a score record: id,course,score.
            String[] words = StringUtils.split(value.toString(), ',');
            String name = map.get(words[0]);
            // Emit id,name,course,score with a null key: direct output, no reduce.
            newValue.set(words[0] + "," + name + "," + words[1] + "," + words[2]);
            context.write(NullWritable.get(), newValue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "MapJoinDirectOutPutJob");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();

        Path in = new Path(INPATH_SCORE);
        Path out = new Path(OUTPATH);
        // Remove any previous output so the job can be rerun.
        FileSystem.get(conf).delete(out, true);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(MapJoinDirectOutPutMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Map-only job: no reduce phase.
        job.setNumReduceTasks(0);

        // Cache the student file on every task node; the "#studentLink.txt" fragment
        // creates a symlink of that name in each task's working directory.
        URI uri = new URI("hdfs://***.***.***.***:8020/user/train/joinjob/student.txt#studentLink.txt");
        job.addCacheFile(uri);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int r = 0;
        try {
            r = ToolRunner.run(new Configuration(), new MapJoinDirectOutPutJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
            r = 1; // signal failure to the caller
        }
        System.exit(r);
    }
}
Permalink: http://blog.itpub.net/30066956/viewspace-2120063/


