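1. Overview
Recently some readers asked how to run a MapReduce program on a Hadoop platform configured with HA. For anyone new to Hadoop this is a real question. If you have a solid programming background, automatic reconnection should come to mind: having the client fail over to the other NameNode automatically also solves the problem of running MapReduce programs. In this post I describe how to do that with Hadoop's Java API.
2. Introduction
The code is given directly below, with comments throughout.
2.1 Java API for operating on HDFS HA
The code is as follows: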
package cn.hdfs.mr.example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author dengjie
 * @date 2015-03-24
 * @description Lists the root directory of an HA-enabled HDFS cluster.
 */
public class DFS {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default file system URI: the logical nameservice cluster1, not a single NameNode host
        conf.set("fs.defaultFS", "hdfs://cluster1");
        // Name of the HDFS nameservice
        conf.set("dfs.nameservices", "cluster1");
        // cluster1 has two NameNodes, nna and nns
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        // RPC address of nna
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");
        // RPC address of nns
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");
        // Class the client uses to fail over automatically between the NameNodes
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // try-with-resources closes the FileSystem even if the listing fails
        try (FileSystem fs = FileSystem.get(conf)) {
            // Status of every entry directly under the root directory
            FileStatus[] list = fs.listStatus(new Path("/"));
            for (FileStatus file : list) {
                // Print the name of each entry
                System.out.println(file.getPath().getName());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
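The six HA settings in the listing follow a fixed naming pattern built from the nameservice name and the NameNode IDs. As a minimal sketch of that pattern, the helper below derives the same key/value pairs from a nameservice and a map of NameNode IDs to RPC addresses. The class name `HaClientSettings` and the method `build` are hypothetical and not part of Hadoop; the sketch uses only the JDK, and each resulting entry would be copied into a `Configuration` with `conf.set(key, value)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: derives the client-side HA keys.
final class HaClientSettings {

    /**
     * Builds the client-side HA properties for one nameservice.
     * namenodes maps a NameNode ID (for example "nna") to its RPC address (host:port).
     */
    static Map<String, String> build(String nameservice, Map<String, String> namenodes) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.defaultFS", "hdfs://" + nameservice);
        props.put("dfs.nameservices", nameservice);
        // Comma-separated list of NameNode IDs, in insertion order
        props.put("dfs.ha.namenodes." + nameservice, String.join(",", namenodes.keySet()));
        for (Map.Entry<String, String> nn : namenodes.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + nn.getKey(), nn.getValue());
        }
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> namenodes = new LinkedHashMap<>();
        namenodes.put("nna", "10.211.55.26:9000");
        namenodes.put("nns", "10.211.55.27:9000");
        // Print the derived settings, one "key = value" pair per line
        build("cluster1", namenodes).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keeping the settings in one place like this avoids repeating the same six `conf.set` calls in every program that talks to the cluster.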
Next is the Java API code for running a MapReduce program.
2.2 Java API for running a MapReduce program
Taking WordCount as the example, the code is as follows:
package cn.jpush.hdfs.mr.example;

import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.jpush.hdfs.utils.ConfigUtils;

/**
 * @author dengjie
 * @date 2014-11-29
 * @description WordCount is the classic MapReduce example, the "hello world" of Hadoop.
 *              It splits the words out of the input file (map phase), shuffles and
 *              sorts them, aggregates the counts (reduce phase), and finally writes
 *              the result to HDFS. That is the basic flow.
 */
public class WordCount {

    private static final Logger log = LoggerFactory.getLogger(WordCount.class);

    /*
     * Input file: a b b
     *
     * After map:
     *   a 1
     *   b 1
     *   b 1
     */
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Each call receives one whole line of input
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // Split the line into words on whitespace
                word.set(itr.nextToken());
                // Emit (word, 1) for every occurrence
                context.write(word, one);
            }
        }
    }

    /*
     * Before reduce:
     *   a 1
     *   b 1
     *   b 1
     *
     * After reduce:
     *   a 1
     *   b 2
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all counts emitted for this word
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Same HA client settings as in the HDFS example
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");
        conf.set("dfs.client.failover.proxy.provider.cluster1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Random suffix so that every run writes to a new output directory
        long random1 = new Random().nextLong();
        log.info("random1 -> {}", random1);

        // Job.getInstance replaces the deprecated Job constructor
        Job job1 = Job.getInstance(conf, "word count");
        job1.setJarByClass(WordCount.class);
        job1.setMapperClass(TokenizerMapper.class);   // Map class
        job1.setCombinerClass(IntSumReducer.class);   // Combiner class
        job1.setReducerClass(IntSumReducer.class);    // Reduce class
        job1.setOutputKeyClass(Text.class);           // Output key type
        job1.setOutputValueClass(IntWritable.class);  // Output value type
        // Input path (on HDFS)
        FileInputFormat.addInputPath(job1, new Path("/home/hdfs/test/hello.txt"));
        // Output path
        FileOutputFormat.setOutputPath(job1, new Path(String.format(ConfigUtils.HDFS.WORDCOUNT_OUT, random1)));
        // Exit once the MR job has finished
        System.exit(job1.waitForCompletion(true) ? 0 : 1);
    }
}
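The random suffix is there because a MapReduce job refuses to start when its output directory already exists. `new Random().nextLong()` can return a negative number, which is why the log in the next section shows an output directory named `-3636988299559297154`. If a non-negative suffix is preferred, one option is sketched below. The class `OutputPaths` and the template `/output/result/%d` are assumptions made for illustration: the article does not show the value of `ConfigUtils.HDFS.WORDCOUNT_OUT`, and the template is only inferred from the path in the log.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: builds a fresh output path with a non-negative suffix.
final class OutputPaths {

    // Assumed template; the real ConfigUtils.HDFS.WORDCOUNT_OUT is not shown in the article.
    static final String WORDCOUNT_OUT = "/output/result/%d";

    static String uniqueOutputPath(String template) {
        // nextLong(bound) returns a value in [0, bound), so the suffix is never negative
        long suffix = ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        return String.format(template, suffix);
    }

    public static void main(String[] args) {
        System.out.println(uniqueOutputPath(WORDCOUNT_OUT));
    }
}
```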
3. Results
Part of the run log is shown below:
[Job.main] - Running job: job_local551164419_0001
2015-03-24 11:52:09 INFO [LocalJobRunner.Thread-12] - OutputCommitter set in config null
2015-03-24 11:52:09 INFO [LocalJobRunner.Thread-12] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - Waiting for map tasks
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Starting task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [ProcfsBasedProcessTree.LocalJobRunner Map Task Executor #0] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Processing split: hdfs://cluster1/home/hdfs/test/hello.txt:0+24
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - (EQUATOR) 0 kvi 26214396(104857584)
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - mapreduce.task.io.sort.mb: 100
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - soft limit at 83886080
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufvoid = 104857600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396; length = 6553600
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] -
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Starting flush of map output
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Spilling map output
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufend = 72; bufvoid = 104857600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396(104857584); kvend = 26214352(104857408); length = 45/6553600
2015-03-24 11:52:10 INFO [MapTask.LocalJobRunner Map Task Executor #0] - Finished spill 0
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Task:attempt_local551164419_0001_m_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - map
2015-03-24 11:52:10 INFO [Task.LocalJobRunner Map Task Executor #0] - Task 'attempt_local551164419_0001_m_000000_0' done.
2015-03-24 11:52:10 INFO [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Finishing task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - map task executor complete.
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - Waiting for reduce tasks
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - Starting task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO [ProcfsBasedProcessTree.pool-6-thread-1] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO [ReduceTask.pool-6-thread-1] - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1197414
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - MergerManager: memoryLimit=1503238528, maxSingleShuffleLimit=375809632, mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2015-03-24 11:52:10 INFO [EventFetcher.EventFetcher for fetching Map Completion Events] - attempt_local551164419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2015-03-24 11:52:10 INFO [LocalFetcher.localfetcher#1] - localfetcher#1 about to shuffle output of map attempt_local551164419_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
2015-03-24 11:52:10 INFO [InMemoryMapOutput.localfetcher#1] - Read 50 bytes from map-output for attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO [MergeManagerImpl.localfetcher#1] - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
2015-03-24 11:52:10 INFO [EventFetcher.EventFetcher for fetching Map Completion Events] - EventFetcher is interrupted.. Returning
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merging 1 files, 54 bytes from disk
2015-03-24 11:52:10 INFO [MergeManagerImpl.pool-6-thread-1] - Merging 0 segments, 0 bytes from memory into reduce
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [deprecation.pool-6-thread-1] - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task:attempt_local551164419_0001_r_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task attempt_local551164419_0001_r_000000_0 is allowed to commit now
2015-03-24 11:52:10 INFO [FileOutputCommitter.pool-6-thread-1] - Saved output of task 'attempt_local551164419_0001_r_000000_0' to hdfs://cluster1/output/result/-3636988299559297154/_temporary/0/task_local551164419_0001_r_000000
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - reduce > reduce
2015-03-24 11:52:10 INFO [Task.pool-6-thread-1] - Task 'attempt_local551164419_0001_r_000000_0' done.
2015-03-24 11:52:10 INFO [LocalJobRunner.pool-6-thread-1] - Finishing task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO [LocalJobRunner.Thread-12] - reduce task executor complete.
2015-03-24 11:52:10 INFO [Job.main] - Job job_local551164419_0001 running in uber mode : false
2015-03-24 11:52:10 INFO [Job.main] - map 100% reduce 100%
2015-03-24 11:52:10 INFO [Job.main] - Job job_local551164419_0001 completed successfully
2015-03-24 11:52:10 INFO [Job.main] - Counters: 35
    File System Counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=466172
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=48
        HDFS: Number of bytes written=24
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=2
        Map output records=12
        Map output bytes=72
        Map output materialized bytes=54
        Input split bytes=105
        Combine input records=12
        Combine output records=6
        Reduce input groups=6
        Reduce shuffle bytes=54
        Reduce input records=6
        Reduce output records=6
        Spilled Records=12
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=13
        Total committed heap usage (bytes)=514850816
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=24
    File Output Format Counters
        Bytes Written=24
The original input file is as follows:
a a c v d d
a d d s s x
[Figure: Reduce output]
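Because the input is only two lines, the expected counts can be reproduced without a cluster. The sketch below applies the same tokenize-and-sum logic as the mapper and reducer using plain JDK collections. `LocalWordCount` is a hypothetical class for checking the result and is not part of the job itself.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local check: same tokenize-and-sum logic as the job, without Hadoop.
final class LocalWordCount {

    static Map<String, Integer> count(String... lines) {
        // TreeMap keeps the words sorted, like the keys arriving at the reducer
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // Add 1 to the running total for this word
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The two lines of the input file shown above
        count("a a c v d d", "a d d s s x")
                .forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

For this input the sketch yields a=3, c=1, d=4, s=2, v=1, x=1. Those six words from twelve tokens agree with the `Map output records=12` and `Reduce output records=6` counters in the log.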
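4. Summary
We can verify that the code works with the following steps:
- Make sure NNA is in the active state and NNS is in the standby state, and that all DataNodes are running normally.
- Run the WordCount program and check whether it produces the counts.
- If it does, continue with the next step. If it does not, troubleshoot the error first and then continue.
- Kill the NameNode process on the NNA node. NNS should then switch from standby to active.
- Run the WordCount program again. If it still produces the counts, the code works.
That is the whole verification procedure.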
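The behaviour this procedure checks can be modelled in a few lines. The sketch below is a simplified illustration of client-side failover and is not Hadoop's actual `ConfiguredFailoverProxyProvider` code: the client walks the configured NameNode addresses in order and uses the first one that is active. The class `FailoverSketch`, the method `firstActive`, and the `isActive` predicate are hypothetical stand-ins for a real RPC call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Simplified model of client-side failover; not Hadoop's real implementation.
final class FailoverSketch {

    /** Returns the first address that is active, trying the addresses in configured order. */
    static String firstActive(List<String> addresses, Predicate<String> isActive) {
        for (String address : addresses) {
            if (isActive.test(address)) {
                return address;
            }
            System.out.println(address + " is not active, trying the next NameNode");
        }
        throw new IllegalStateException("No active NameNode among " + addresses);
    }

    public static void main(String[] args) {
        List<String> nameNodes = Arrays.asList("10.211.55.26:9000", "10.211.55.27:9000");
        // Before the kill: NNA is active
        System.out.println("using " + firstActive(nameNodes, a -> a.startsWith("10.211.55.26")));
        // After killing NNA: NNS has become active
        System.out.println("using " + firstActive(nameNodes, a -> a.startsWith("10.211.55.27")));
    }
}
```

In the second call the first address is rejected and the second is used, which corresponds to the rerun of WordCount after NNA has been killed.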
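5. Closing remarks
That is all for this post. If you run into any problems during verification, you can join the group to discuss them or send me an email, and I will do my best to answer. Let's keep improving together!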