1. Design approach
Deduplication boils down to one requirement: no matter how many times a record appears in the input files, it must be written to the output exactly once. This maps naturally onto MapReduce: in the reduce phase the input arrives in the form <key, value list>, and records with the same key are grouped together during the shuffle phase. So it is enough to emit the record to be deduplicated as the key in the map phase; each distinct record then reaches the reducer exactly once.
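For example, the record wangkun 13 from the test data below appears in both input files. The map phase emits the pair <wangkun 13, ""> twice; the shuffle groups these into a single reduce input <wangkun 13, ["", ""]>, and the reducer writes wangkun 13 exactly once.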
2. Implementation
package moverepeat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author Amei
 * Removes duplicate records from the input.
 */
public class Remove {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context output)
                throws IOException, InterruptedException {
            // Use the whole input line as the map output key.
            output.write(value, new Text(""));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context output)
                throws IOException, InterruptedException {
            // After the shuffle phase the reduce input has the form <key, value list>,
            // and all keys are distinct, so writing each key once yields the deduplicated output.
            output.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        Job job = new Job(configuration, "remove");
        job.setJarByClass(Remove.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop_admin/removein"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop_admin/removeout"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
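As a variation on the code above, here is a minimal sketch (assuming Hadoop 2.x) that uses NullWritable instead of an empty Text value, so no placeholder strings are serialized, and replaces the deprecated Job constructor with Job.getInstance. It also reuses the reducer as a combiner so duplicates are already dropped on the map side. The class name RemoveNullValue is my own; the input and output paths are the same ones used above.

package moverepeat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveNullValue {

    public static class DedupMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The whole input line is the key; NullWritable carries no payload at all.
            context.write(value, NullWritable.get());
        }
    }

    public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Each distinct line arrives here exactly once after the shuffle.
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "remove-nullwritable");
        job.setJarByClass(RemoveNullValue.class);
        job.setMapperClass(DedupMapper.class);
        // Using the reducer as a combiner removes map-side duplicates before the shuffle.
        job.setCombinerClass(DedupReducer.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop_admin/removein"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop_admin/removeout"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}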
Test case
file01
wangkun 12
wangkun 13
wangkun 15
amei 12
amei 13
file02
wangkun 11
wangkun 13
wangkun 16
amei 12
amei 13
Deduplicated result (each distinct line appears once, in the sorted order produced by the shuffle on Text keys):
amei 12
amei 13
wangkun 11
wangkun 12
wangkun 13
wangkun 15
wangkun 16
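As a quick sanity check (plain Java, no Hadoop; the class name LocalDedupCheck is made up for illustration), deduplicating the combined test lines with a sorted set reproduces the distinct, lexicographically ordered output listed above.

import java.util.Arrays;
import java.util.TreeSet;

public class LocalDedupCheck {
    public static void main(String[] args) {
        // All lines from file01 and file02; the TreeSet keeps each distinct line once, sorted.
        TreeSet<String> distinct = new TreeSet<>(Arrays.asList(
                "wangkun 12", "wangkun 13", "wangkun 15", "amei 12", "amei 13",   // file01
                "wangkun 11", "wangkun 13", "wangkun 16", "amei 12", "amei 13")); // file02
        // Prints the same seven lines as the MapReduce job output.
        distinct.forEach(System.out::println);
    }
}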