Highly Available Hadoop Platform: Running MapReduce Programs

Posted by 哥不是小蘿莉 on 2015-03-24

1. Overview

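Recently some readers asked how to run a MapReduce program on a Hadoop platform configured for HA. For anyone new to Hadoop this is a natural question. If you think it through, and your programming background is solid, automatic reconnection should come to mind, and automatic reconnection can indeed solve the problem of running MapReduce programs. Today, however, I will describe how to do it with Hadoop's Java API.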

2. Introduction

The code is attached directly below, with comments throughout.

2.1 Java API for Operating on HDFS HA

The code is as follows:

/**
 * 
 */
package cn.hdfs.mr.example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author dengjie
 * @date 2015-03-24
 * @description Lists the entries under the HDFS root directory through an HA nameservice.
 */
public class DFS {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default file system URI: the logical nameservice cluster1, not a single NameNode host
        conf.set("fs.defaultFS", "hdfs://cluster1");
        // Declare the HDFS nameservice cluster1
        conf.set("dfs.nameservices", "cluster1");
        // cluster1 has two NameNodes: nna and nns
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        // RPC address of nna
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");
        // RPC address of nns
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");
        // Class the client uses to fail over automatically between the NameNodes
        conf.set("dfs.client.failover.proxy.provider.cluster1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = null;
        try {
            fs = FileSystem.get(conf); // Obtain the FileSystem instance
            FileStatus[] list = fs.listStatus(new Path("/")); // Status of every entry under the root
            for (FileStatus file : list) {
                System.out.println(file.getPath().getName()); // Print the entry name
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (fs != null) {
                    fs.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

}
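The six properties above follow a fixed naming pattern built from the nameservice name and the NameNode IDs. As a minimal sketch of that pattern (the class `HaClientConfig` and its `build` method are hypothetical helpers, not part of Hadoop or of the original code), the keys can be generated once in plain Java and then applied to a `Configuration` with `conf.set(key, value)`:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HaClientConfig {

    /**
     * Builds the client-side HDFS HA properties for one nameservice.
     * nameNodes maps each NameNode ID (for example "nna") to its RPC address (host:port).
     * Hypothetical helper for illustration only.
     */
    public static Map<String, String> build(String nameservice, Map<String, String> nameNodes) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.defaultFS", "hdfs://" + nameservice);
        props.put("dfs.nameservices", nameservice);
        // Comma-separated list of NameNode IDs that belong to the nameservice
        props.put("dfs.ha.namenodes." + nameservice, String.join(",", nameNodes.keySet()));
        // One RPC address entry per NameNode ID
        for (Map.Entry<String, String> nn : nameNodes.entrySet()) {
            props.put("dfs.namenode.rpc-address." + nameservice + "." + nn.getKey(), nn.getValue());
        }
        props.put("dfs.client.failover.proxy.provider." + nameservice,
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> nameNodes = new LinkedHashMap<>();
        nameNodes.put("nna", "10.211.55.26:9000");
        nameNodes.put("nns", "10.211.55.27:9000");
        // Prints the same six key/value pairs that the DFS class sets by hand
        build("cluster1", nameNodes).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keeping the property names in one place like this would avoid repeating the same `conf.set` calls in every program that talks to the HA cluster.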

Next comes the Java API code for running a MapReduce program.

2.2 Java API for Running a MapReduce Program

Taking WordCount as the example, the code is as follows:

package cn.jpush.hdfs.mr.example;

import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import cn.jpush.hdfs.utils.ConfigUtils;

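/**
 * 
 * @author dengjie
 * @date 2014-11-29
 * @description WordCount is a classic MapReduce example, the Hadoop version of hello world.
 *              It splits the words out of the file (the map phase); after shuffle and sort
 *              the counts are aggregated (the reduce phase) and the result is finally
 *              written to HDFS. That is the basic flow.
 */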
public class WordCount {

    private static Logger log = LoggerFactory.getLogger(WordCount.class);

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        /*
         * Source file: a b b
         * 
         * After map:
         * 
         * a 1
         * 
         * b 1
         * 
         * b 1
         */
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());// Tokenize the whole line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());// Next whitespace-delimited word
                context.write(word, one);// Emit (word, 1) for every occurrence
            }
        }
    }

    /*
     * Before reduce:
     * 
     * a 1
     * 
     * b 1
     * 
     * b 1
     * 
     * After reduce:
     * 
     * a 1
     * 
     * b 2
     */
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();// Add up the counts for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://cluster1");
        conf.set("dfs.nameservices", "cluster1");
        conf.set("dfs.ha.namenodes.cluster1", "nna,nns");
        conf.set("dfs.namenode.rpc-address.cluster1.nna", "10.211.55.26:9000");
        conf.set("dfs.namenode.rpc-address.cluster1.nns", "10.211.55.27:9000");
        conf.set("dfs.client.failover.proxy.provider.cluster1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        long random1 = new Random().nextLong();// Random suffix so that each run writes to a new output directory
        log.info("random1 -> " + random1);

        Job job1 = Job.getInstance(conf, "word count");// Replaces the deprecated new Job(conf, name) constructor
        job1.setJarByClass(WordCount.class);
        job1.setMapperClass(TokenizerMapper.class);// Mapper class
        job1.setCombinerClass(IntSumReducer.class);// Combiner class
        job1.setReducerClass(IntSumReducer.class);// Reducer class
        job1.setOutputKeyClass(Text.class);// Output key type
        job1.setOutputValueClass(IntWritable.class);// Output value type

        FileInputFormat.addInputPath(job1, new Path("/home/hdfs/test/hello.txt"));// Input path
        FileOutputFormat.setOutputPath(job1, new Path(String.format(ConfigUtils.HDFS.WORDCOUNT_OUT, random1)));// Output path

        System.exit(job1.waitForCompletion(true) ? 0 : 1);// Exit once the MR job has finished
    }
}
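The listing does not show `ConfigUtils.HDFS.WORDCOUNT_OUT`. Judging from the output location that appears in the run log (`hdfs://cluster1/output/result/-3636988299559297154/...`), it is presumably a format string along the lines of `/output/result/%d`. The sketch below works under that assumption; the constant is a hypothetical stand-in, not the author's actual value:

```java
import java.util.Random;

public class OutputPathSketch {

    // Hypothetical stand-in for ConfigUtils.HDFS.WORDCOUNT_OUT (assumed, not taken from the source)
    static final String WORDCOUNT_OUT = "/output/result/%d";

    // Formats the output directory the same way the WordCount driver does
    public static String outputPath(long suffix) {
        return String.format(WORDCOUNT_OUT, suffix);
    }

    public static void main(String[] args) {
        long random1 = new Random().nextLong(); // nextLong() can return negative values
        System.out.println(outputPath(random1));
        // The suffix seen in the run log reproduces the logged directory name
        System.out.println(outputPath(-3636988299559297154L));
    }
}
```

Because `Random.nextLong()` covers the whole `long` range, roughly half of the generated directory names start with a minus sign, which is why the logged path contains one.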

3. Execution Results

Part of the run log is shown below:

[Job.main] - Running job: job_local551164419_0001
2015-03-24 11:52:09 INFO  [LocalJobRunner.Thread-12] - OutputCommitter set in config null
2015-03-24 11:52:09 INFO  [LocalJobRunner.Thread-12] - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2015-03-24 11:52:10 INFO  [LocalJobRunner.Thread-12] - Waiting for map tasks
2015-03-24 11:52:10 INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Starting task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO  [ProcfsBasedProcessTree.LocalJobRunner Map Task Executor #0] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO  [Task.LocalJobRunner Map Task Executor #0] -  Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - Processing split: hdfs://cluster1/home/hdfs/test/hello.txt:0+24
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - (EQUATOR) 0 kvi 26214396(104857584)
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - mapreduce.task.io.sort.mb: 100
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - soft limit at 83886080
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufvoid = 104857600
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396; length = 6553600
2015-03-24 11:52:10 INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #0] - 
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - Starting flush of map output
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - Spilling map output
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - bufstart = 0; bufend = 72; bufvoid = 104857600
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - kvstart = 26214396(104857584); kvend = 26214352(104857408); length = 45/6553600
2015-03-24 11:52:10 INFO  [MapTask.LocalJobRunner Map Task Executor #0] - Finished spill 0
2015-03-24 11:52:10 INFO  [Task.LocalJobRunner Map Task Executor #0] - Task:attempt_local551164419_0001_m_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #0] - map
2015-03-24 11:52:10 INFO  [Task.LocalJobRunner Map Task Executor #0] - Task 'attempt_local551164419_0001_m_000000_0' done.
2015-03-24 11:52:10 INFO  [LocalJobRunner.LocalJobRunner Map Task Executor #0] - Finishing task: attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO  [LocalJobRunner.Thread-12] - map task executor complete.
2015-03-24 11:52:10 INFO  [LocalJobRunner.Thread-12] - Waiting for reduce tasks
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - Starting task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO  [ProcfsBasedProcessTree.pool-6-thread-1] - ProcfsBasedProcessTree currently is supported only on Linux.
2015-03-24 11:52:10 INFO  [Task.pool-6-thread-1] -  Using ResourceCalculatorProcessTree : null
2015-03-24 11:52:10 INFO  [ReduceTask.pool-6-thread-1] - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1197414
2015-03-24 11:52:10 INFO  [MergeManagerImpl.pool-6-thread-1] - MergerManager: memoryLimit=1503238528, maxSingleShuffleLimit=375809632, mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2015-03-24 11:52:10 INFO  [EventFetcher.EventFetcher for fetching Map Completion Events] - attempt_local551164419_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2015-03-24 11:52:10 INFO  [LocalFetcher.localfetcher#1] - localfetcher#1 about to shuffle output of map attempt_local551164419_0001_m_000000_0 decomp: 50 len: 54 to MEMORY
2015-03-24 11:52:10 INFO  [InMemoryMapOutput.localfetcher#1] - Read 50 bytes from map-output for attempt_local551164419_0001_m_000000_0
2015-03-24 11:52:10 INFO  [MergeManagerImpl.localfetcher#1] - closeInMemoryFile -> map-output of size: 50, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->50
2015-03-24 11:52:10 INFO  [EventFetcher.EventFetcher for fetching Map Completion Events] - EventFetcher is interrupted.. Returning
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO  [MergeManagerImpl.pool-6-thread-1] - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2015-03-24 11:52:10 INFO  [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO  [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO  [MergeManagerImpl.pool-6-thread-1] - Merged 1 segments, 50 bytes to disk to satisfy reduce memory limit
2015-03-24 11:52:10 INFO  [MergeManagerImpl.pool-6-thread-1] - Merging 1 files, 54 bytes from disk
2015-03-24 11:52:10 INFO  [MergeManagerImpl.pool-6-thread-1] - Merging 0 segments, 0 bytes from memory into reduce
2015-03-24 11:52:10 INFO  [Merger.pool-6-thread-1] - Merging 1 sorted segments
2015-03-24 11:52:10 INFO  [Merger.pool-6-thread-1] - Down to the last merge-pass, with 1 segments left of total size: 46 bytes
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO  [deprecation.pool-6-thread-1] - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2015-03-24 11:52:10 INFO  [Task.pool-6-thread-1] - Task:attempt_local551164419_0001_r_000000_0 is done. And is in the process of committing
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - 1 / 1 copied.
2015-03-24 11:52:10 INFO  [Task.pool-6-thread-1] - Task attempt_local551164419_0001_r_000000_0 is allowed to commit now
2015-03-24 11:52:10 INFO  [FileOutputCommitter.pool-6-thread-1] - Saved output of task 'attempt_local551164419_0001_r_000000_0' to hdfs://cluster1/output/result/-3636988299559297154/_temporary/0/task_local551164419_0001_r_000000
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - reduce > reduce
2015-03-24 11:52:10 INFO  [Task.pool-6-thread-1] - Task 'attempt_local551164419_0001_r_000000_0' done.
2015-03-24 11:52:10 INFO  [LocalJobRunner.pool-6-thread-1] - Finishing task: attempt_local551164419_0001_r_000000_0
2015-03-24 11:52:10 INFO  [LocalJobRunner.Thread-12] - reduce task executor complete.
2015-03-24 11:52:10 INFO  [Job.main] - Job job_local551164419_0001 running in uber mode : false
2015-03-24 11:52:10 INFO  [Job.main] -  map 100% reduce 100%
2015-03-24 11:52:10 INFO  [Job.main] - Job job_local551164419_0001 completed successfully
2015-03-24 11:52:10 INFO  [Job.main] - Counters: 35
    File System Counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=466172
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=48
        HDFS: Number of bytes written=24
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=2
        Map output records=12
        Map output bytes=72
        Map output materialized bytes=54
        Input split bytes=105
        Combine input records=12
        Combine output records=6
        Reduce input groups=6
        Reduce shuffle bytes=54
        Reduce input records=6
        Reduce output records=6
        Spilled Records=12
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=13
        Total committed heap usage (bytes)=514850816
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=24
    File Output Format Counters 
        Bytes Written=24

The original file is as follows:

a a c v d d
a d d s s x

Reduce result (figure):
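The counts can also be reproduced without a cluster. The following is a minimal plain-Java sketch (the class `LocalWordCount` is hypothetical and uses no Hadoop classes) that tokenizes the two input lines the same way `TokenizerMapper` does and sums per word like `IntSumReducer`. It assumes the job writes one `word<TAB>count` line per key, which is the default text output layout:

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Counts whitespace-delimited tokens; TreeMap keeps the keys sorted, as reducer input is
    public static Map<String, Integer> count(String... lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum); // add 1 per occurrence
            }
        }
        return counts;
    }

    // Size in characters of the output when each key is written as "word<TAB>count<NEWLINE>"
    public static int outputSize(Map<String, Integer> counts) {
        int size = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            size += (e.getKey() + "\t" + e.getValue() + "\n").length();
        }
        return size;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("a a c v d d", "a d d s s x");
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
        System.out.println("distinct words = " + counts.size() + ", output size = " + outputSize(counts));
    }
}
```

For the input above this gives a=3, c=1, d=4, s=2, v=1, x=1: six distinct words from twelve tokens and 24 characters of output, which agrees with the counters in the log (Map output records=12, Reduce output records=6, Bytes Written=24).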

4. Summary

We can verify that the code works with the following steps:

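Make sure NNA is in the active state and NNS is in the standby state. Note that all DN (DataNode) nodes must be running normally.
Then run the WordCount program and check whether it produces the counts.
If it does, continue with the next step. If it does not, troubleshoot the error before continuing.
Next, kill the NameNode process on the NNA node. NNS will then switch from standby to active.
Then run the WordCount program again and check whether it produces the counts. If it does, the code works.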

That is the whole verification process.
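The logic behind these steps can be mimicked with a toy model. The sketch below is plain Java and not Hadoop code: the class `FailoverCheck`, its states, and the promotion rule are all assumptions made for illustration. It only shows why a client that knows both NameNode IDs keeps working after the active one is killed:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FailoverCheck {

    enum State { ACTIVE, STANDBY, DOWN }

    // Toy cluster: NameNode ID -> state, in configured order
    private final Map<String, State> nodes = new LinkedHashMap<>();

    void add(String id, State state) {
        nodes.put(id, state);
    }

    // Marks a node as down; if it was active, promotes the first standby node
    void kill(String id) {
        boolean wasActive = nodes.get(id) == State.ACTIVE;
        nodes.put(id, State.DOWN);
        if (!wasActive) {
            return;
        }
        for (Map.Entry<String, State> e : nodes.entrySet()) {
            if (e.getValue() == State.STANDBY) {
                e.setValue(State.ACTIVE);
                return;
            }
        }
    }

    // Client side: walk the configured NameNodes and return the ID of the active one
    String submitJob() {
        for (Map.Entry<String, State> e : nodes.entrySet()) {
            if (e.getValue() == State.ACTIVE) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("no active NameNode");
    }

    // Replays the verification steps and records which node served each job
    public static List<String> runScenario() {
        FailoverCheck cluster = new FailoverCheck();
        cluster.add("nna", State.ACTIVE);
        cluster.add("nns", State.STANDBY);
        List<String> servedBy = new ArrayList<>();
        servedBy.add(cluster.submitJob()); // first WordCount run, served by nna
        cluster.kill("nna");               // kill the NameNode process on NNA
        servedBy.add(cluster.submitJob()); // second run, served by nns
        return servedBy;
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

In the real cluster the promotion of NNS is done by the HA machinery and the client-side retry by the configured failover proxy provider; the model only mirrors the observable outcome of the two runs.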

5. Closing Remarks

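That is all for this article. If you run into any problems during verification, you can join the group to discuss them or send me an email, and I will do my best to answer. Let us encourage each other!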
