java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond

Posted by hackeruncle on 2016-07-19

1: Task:

/datacenter/shell/user/full/bdl_social_test1.sh
While running the full data load, the NameNode crashed.

2: Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3
 
3: Fault analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode rolled its edit log as part of merging the fsimage and edits files, it had to write the new log segment to the JournalNode cluster, but port 8485 on yhbd02 was unreachable the whole time. The NameNode therefore timed out waiting for a quorum of JournalNodes and shut itself down. A quick check is sketched below.
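
A minimal check, assuming standard Linux tools and the hostnames used above (telnet is optional and may need to be installed):

# On yhbd02: is the JournalNode JVM running, and is anything listening on 8485?
jps | grep JournalNode
netstat -lntp | grep 8485

# From the NameNode host yhbd01: is the JournalNode RPC port reachable at all?
telnet yhbd02 8485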
(2)
Later, after the NameNode failover completed and all JournalNodes were healthy, the job was run again and the NameNode on yhbd02 also went down.
The log again pointed to a JournalNode access timeout. This kind of problem is usually caused by a full GC: the NameNode performed an unusually long full GC at that moment, the write to the JournalNodes exceeded the default 20 s timeout, and the NameNode process exited. The mapred-site.xml parameters therefore need to be adjusted.
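
To confirm that a long GC pause lines up with the timeout, GC logging can be enabled for the NameNode. A sketch for hadoop-env.sh, using standard JDK 7 GC-log flags (the log path is an assumption, chosen to match the log directory above):

# hadoop-env.sh: write NameNode GC activity to a dedicated log
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/namenode-gc.log"

# Or watch GC behaviour live on the running NameNode process:
# jstat -gcutil <namenode-pid> 1000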

Some pointers for JVM tuning, taken from the JVM crash dump file:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version:  (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again


4: Solutions:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
Then start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:


  <property>
          <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
          <value>0.13</value>
  </property>
  The default was 0.25; lowering it to 0.13 reduces the memory used by the shuffle phase so that more data spills to local disk.
 
Further investigation showed that the amount of data assigned to each reducer was set too large, so these lines were commented out before rerunning:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;

Tune the map and reduce memory parameters in mapred-site.xml:
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)

Follow-up fixes:

1) Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
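
For example, in hdfs-site.xml (the 600000 ms value below is only an illustration; pick a value that comfortably covers the longest GC pause observed):

  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000</value>
  </property>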

2) Tune the NameNode JVM parameters so that full GC is triggered earlier; each full GC then takes less time.

3) By default the NameNode's full GC uses the parallel collector, which is entirely stop-the-world; switch it to CMS by adjusting the NameNode startup parameters:

-XX:+UseCompressedOops 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
 -XX:SoftRefLRUPolicyMSPerMB=0
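
These options are normally applied through HADOOP_NAMENODE_OPTS in hadoop-env.sh. A sketch (the heap sizes are placeholders and must be sized to the actual amount of NameNode metadata):

# hadoop-env.sh: run the NameNode with CMS using the flags above
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0"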

The full set of QJM client timeouts that were raised (these dfs.qjournal.* properties go in hdfs-site.xml):

  <property>
   <name>dfs.qjournal.start-segment.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.accept-recovery.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.finalize-segment.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.select-input-streams.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.get-journal-state.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.new-epoch.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.write-txns.timeout.ms</name>
   <value>600000000</value>
  </property>

NameNode full GCs are too frequent, and each GC stalls the system for 3-4 minutes

  Because the NameNode keeps all of its metadata in memory, the old generation is under heavy pressure, and every full GC freezes the system for several minutes. Remedies:
  (1) Use the CMS collector.
  (2) Change the young:old generation ratio to 1:2. With a 1:4 ratio, large objects promoted during a young GC go straight into the old generation, the old generation fills up quickly, and full GCs become even more frequent. A sketch of the corresponding JVM options follows.
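
A sketch of the flags, assuming the heap is otherwise left as configured (NewRatio=2 makes the old generation twice the size of the young generation, i.e. young:old = 1:2):

# Added to HADOOP_NAMENODE_OPTS alongside the CMS flags above
-XX:NewRatio=2
# Or fix the young generation size explicitly (placeholder value):
# -XX:NewSize=2g -XX:MaxNewSize=2g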



5: Further notes:
  In Hadoop, edits and fsimage are two critically important files. edits records every namespace change since the latest checkpoint and acts as a journal, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither file can be read with an ordinary text editor, but Hadoop ships dedicated tools for inspecting them, oev and oiv, both invoked through the hdfs command, for example:
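
A sketch using the edits naming pattern from the log above (the fsimage file name is a placeholder; use whatever checkpoint file actually exists under the name directory):

# Offline edits viewer: dump an edits segment to readable XML
hdfs oev -p XML \
  -i /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 \
  -o /tmp/edits.xml

# Offline image viewer: dump an fsimage checkpoint to XML
hdfs oiv -p XML \
  -i /Data/cloudera/hadoop/dfs/name/current/fsimage_0000000000029706040 \
  -o /tmp/fsimage.xml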

This also makes it possible to study how Hadoop 2 implements HA through edits and fsimage.

The steps can be summarized as follows:
  (1) Once HA is configured, every update made by a client is written to the shared directory on the JournalNodes, configured as follows:


  <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://XXXX/mycluster</value>
  </property>

  <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/export1/hadoop2x/dfs/journal</value>
  </property>


  (2) Both the Active NameNode and the Standby NameNode sync the edits from the JournalNodes' shared edits directory into their own edits directories;
  (3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met and, if they are, merges the fsimage and edits files;
  (4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode;
  (5) When the Active NameNode receives the latest fsimage, it cleans up its old fsimage and edits files;
  (6) Through these steps the fsimage and edits files are merged, and because of the HA mechanism both the Standby and Active NameNodes hold the latest fsimage and edits (unlike the SecondaryNameNode in Hadoop 1.x, whose fsimage and edits were not guaranteed to be up to date).
 
Because the Active NameNode moved to yhbd02 while the active ResourceManager stayed on yhbd01, the NameNode and the ResourceManager now run on separate machines. Performance improved noticeably, since both processes consume a lot of resources.
For example, resource allocation for submitted jobs became more reasonable and containers were spread more evenly across the cluster.
 
 
 
Start of the tuning work:
1. Modified mapred-site.xml
2. Modified hadoop-env.sh

For the full GC problem, the approach was:
Split the user's full-load statement into per-year/per-month batches and run the data that way.
Increase the JournalNode timeout.

Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
After changing it, restart the JournalNode cluster, for example:
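
A sketch assuming the same CDH package init scripts used earlier in this post; run it on every JournalNode host after updating hdfs-site.xml:

service hadoop-hdfs-journalnode restart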
After that the NameNode no longer failed because it could not reach the JournalNode cluster; instead it failed with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to the second NameNode it ran stably.
JVM tuning continued.


Errors seen after the JournalNode timeouts were increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception

 
Tracing the cause of the error:
step1:
 org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
org.apache.hadoop.hdfs.server.blockmanagement;
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.

step2: the errors begin:
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error


After the yarn-site.xml and mapred-site.xml parameters below were changed, the NameNode stopped crashing.


Changes:


  yarn-site.xml:

        <property>
                <description>The minimum allocation for every container request at the RM,
                in MBs. Memory requests lower than this won't take effect,
                and the specified value will get allocated at minimum.</description>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>1500</value>
        </property>

        <property>
                <description>The maximum allocation for every container request at the RM,
                in MBs. Memory requests higher than this won't take effect,
                and will get capped to this value.</description>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>8192</value>
        </property>

  mapred-site.xml:

        <property>
                <name>mapreduce.map.memory.mb</name>
                <value>1500</value>
        </property>

        <property>
                <name>mapreduce.map.java.opts</name>
                <value>-Xmx1024M</value>
                <description>Larger heap-size for child jvms of maps.</description>
        </property>

        <property>
                <name>mapreduce.reduce.memory.mb</name>
                <value>2100</value>
        </property>

        <property>
                <name>mapreduce.reduce.java.opts</name>
                <value>-Xmx1900M</value>
                <description>Larger heap-size for child jvms of reduces.</description>
        </property>

        <property>
                <name>mapreduce.job.counters.limit</name>
                <value>12000</value>
        </property>

To summarize, when this situation occurs, work on three fronts:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning of the yarn-site.xml and mapred-site.xml configuration files.

 

Source: ITPUB blog, http://blog.itpub.net/30089851/viewspace-2122226/