2. Introduction to MapReduce
MapReduce is in fact divided into two phases:
- the map phase: reading the data
- the reduce phase: computing over the data
Parallel computation is a very complex process, and MapReduce is a parallel framework that manages it for us.
In Hadoop, every MapReduce task is initialized as a Job, and every Job is divided into two phases: the map phase and the reduce phase. The two phases are expressed by two functions, the map function and the reduce function.
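Conceptually, following the signatures in the original MapReduce paper, the two functions have the following shapes (a sketch of the model, not code from this example):

map:    (k1, v1)        -> list(k2, v2)
reduce: (k2, list(v2))  -> list(k3, v3)

In the word-count example below, map turns every line of text into (word, 1) pairs, and reduce sums the 1s collected for each word.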
Let's look at the typical official example.
Development
Develop with IDEA.
Add the dependencies to pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>
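Note that hadoop-core 1.2.1 is what provides the old org.apache.hadoop.mapred API used in the code below, while hadoop-common 2.7.2 supplies the common classes. As a hedged alternative (not what this article uses), a pure Hadoop 2.x project could instead depend on the aggregate client artifact, since the old mapred API still ships with Hadoop 2.x:

<dependency>
    <!-- Alternative for Hadoop 2.x: pulls in the MapReduce client libraries. -->
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.2</version>
</dependency>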
Write the code:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

/**
 * Created by diwu.sld on 2016/4/13.
 */
public class WordCount {

    public static class CountMap extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Split each input line into tokens and emit (word, 1) for every token.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class CountReduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Sum the counts collected for each word.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(CountMap.class);
        conf.setCombinerClass(CountReduce.class);
        conf.setReducerClass(CountReduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
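To make the data flow concrete, here is a hypothetical trace (the input line is made up for illustration). For the input

hello world hello hadoop

the map phase emits (hello, 1), (world, 1), (hello, 1), (hadoop, 1); after the shuffle, reduce receives hello -> [1, 1], world -> [1], hadoop -> [1], and the job writes:

hadoop  1
hello   2
world   1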
Then build the jar HadoopDemo:
1. Project Structure -> Artifacts -> +
2. Build Artifacts
Copy the jar into the hadoop directory and run it.
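If you prefer the command line to IDEA, a project with the standard Maven layout could also be packaged with (assuming default Maven conventions; the article itself only shows the IDEA route):

mvn clean package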
Run
- bin/hadoop fs -mkdir -p input
- bin/hadoop fs -copyFromLocal README.txt input
- bin/hadoop jar demos/HadoopDemo.jar WordCount input output
- bin/hadoop fs -cat output/* or bin/hadoop fs -ls output
- bin/hadoop fs -cat output/part-00000
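Each output line is a word and its count separated by a tab (the format produced by TextOutputFormat); the actual words and counts depend on the contents of README.txt.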
Summary
If we have N files and computations to run over those N files, we can use parallelism to improve execution efficiency. But files vary in size and the computations vary in cost, so deciding how to parallelize and how to distribute the tasks is a very tedious job. That is why we have Hadoop, a parallel framework that solves this problem for us.
Hadoop consists of two major parts: distributed file storage and distributed computation.
In distributed file storage, it splits a file into many blocks of the same size.
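For example, assuming HDFS's default block size of 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x), a 200 MB file would be stored as four blocks of 64 MB + 64 MB + 64 MB + 8 MB, and each block can then be processed by its own map task in parallel.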