MapReduce Use Case: Simple Data Deduplication

Posted by Amei1314 on 2016-04-11

1. Design approach

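  The point of deduplication is that no matter how many times a record appears in the input files, it is written to the output exactly once. The input to the reduce phase has the form <key, value list>, and the shuffle phase groups all records that share a key into a single entry. So it is enough for the map phase to emit the data to be deduplicated as the key; the reducer then writes each key once.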
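The grouping step described above can be imitated in plain Java, without Hadoop, to see why each key comes out only once. This is only a sketch: the class name `ShuffleSketch` and its methods are hypothetical, and a `TreeMap` stands in for the shuffle's group-and-sort behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; no Hadoop classes are involved.
class ShuffleSketch {

    // "Map" + "shuffle": emit each line as a key with an empty value,
    // then group the values by key (a TreeMap keeps the keys sorted).
    static Map<String, List<String>> mapAndShuffle(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        return grouped;
    }

    // "Reduce": ignore the value list and output each key once.
    static List<String> reduce(Map<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("b 2", "a 1", "b 2", "a 1", "c 3");
        Map<String, List<String>> grouped = mapAndShuffle(lines);
        // Show how many values were grouped under each key.
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " value(s)");
        }
        System.out.println(reduce(grouped)); // prints [a 1, b 2, c 3]
    }
}
```

However many values end up in a key's list, the reduce step touches the key only once, which is the whole trick.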

2. Implementation

  

package moverepeat;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Removes duplicate records from the input.
 *
 * @author Amei
 */

public class Remove {
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context output)
                throws IOException, InterruptedException {
            // Emit each input line as the map output key; the value is unused.
            output.write(value, new Text(""));
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context output)
                throws IOException, InterruptedException {
            // After the shuffle, the reducer input is <key, value list> and each
            // distinct key arrives exactly once, so write it once.
            output.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        // Job.getInstance replaces the Job constructor, which is deprecated in Hadoop 2.x.
        Job job = Job.getInstance(configuration, "remove");
        job.setJarByClass(Remove.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(
                "/user/hadoop_admin/removein"));
        FileOutputFormat.setOutputPath(job, new Path(
                "/user/hadoop_admin/removeout"));
        System.exit((job.waitForCompletion(true) ? 0 : 1));
    }
}

   Test case

  file01

wangkun 12
wangkun 13
wangkun 15
amei 12
amei 13

  file02

  

wangkun 11
wangkun 13
wangkun 16
amei 12
amei 13

 Deduplicated result:

  

amei 12    
amei 13    
wangkun 11    
wangkun 12    
wangkun 13    
wangkun 15    
wangkun 16    
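The expected result can be checked locally without a cluster. The sketch below is not part of the original job: the class name `DedupCheck` is hypothetical, and a `TreeSet` stands in for the job's deduplicate-and-sort behaviour. For ASCII lines like these, Java's `String` ordering agrees with the byte-wise ordering Hadoop applies to `Text` keys, so the order should match the listing above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Local, Hadoop-free check of the expected result (illustrative only).
class DedupCheck {

    // Merge all input lines and keep each distinct line once, in sorted order.
    static Set<String> dedup(List<List<String>> files) {
        Set<String> unique = new TreeSet<>();
        for (List<String> file : files) {
            unique.addAll(file);
        }
        return unique;
    }

    public static void main(String[] args) {
        // Contents of file01 and file02 from the test case.
        List<String> file01 = Arrays.asList(
                "wangkun 12", "wangkun 13", "wangkun 15", "amei 12", "amei 13");
        List<String> file02 = Arrays.asList(
                "wangkun 11", "wangkun 13", "wangkun 16", "amei 12", "amei 13");
        for (String line : dedup(Arrays.asList(file01, file02))) {
            System.out.println(line);
        }
    }
}
```

The trailing whitespace on each line of the real job output most likely comes from the default text output format, which writes the key, a tab separator, and then the (empty) value.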

 
