java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond

Posted by hackeruncle on 2016-07-19

1: Task:

/datacenter/shell/user/full/bdl_social_test1.sh
While running the full data load, the NameNode crashed.

2: Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  ]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
        at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
        at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3
 
3: Fault analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode rolled its edit log as part of merging the fsimage and edits files, it had to write the new log segment to the JournalNode cluster, but port 8485 on yhbd02 was unreachable the whole time. The NameNode therefore timed out waiting for a quorum of JournalNodes and shut itself down. A quick check is sketched below.
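
A minimal check, assuming standard Linux tools and the hostnames used above (telnet is optional and may need to be installed):

# On yhbd02: is the JournalNode JVM running, and is anything listening on 8485?
jps | grep JournalNode
netstat -lntp | grep 8485

# From the NameNode host yhbd01: is the JournalNode RPC port reachable at all?
telnet yhbd02 8485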
(2)
Later, after the NameNode failover completed and all JournalNodes were healthy, the job was run again and the NameNode on yhbd02 also went down.
The log again pointed to a JournalNode access timeout. This kind of problem is usually caused by a full GC: the NameNode performed an unusually long full GC at that moment, the write to the JournalNodes exceeded the default 20 s timeout, and the NameNode process exited. The mapred-site.xml parameters therefore need to be adjusted.
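
To confirm that a long GC pause lines up with the timeout, GC logging can be enabled for the NameNode. A sketch for hadoop-env.sh, using standard JDK 7 GC-log flags (the log path is an assumption, chosen to match the log directory above):

# hadoop-env.sh: write NameNode GC activity to a dedicated log
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hadoop-hdfs/namenode-gc.log"

# Or watch GC behaviour live on the running NameNode process:
# jstat -gcutil <namenode-pid> 1000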

Some pointers for JVM tuning, taken from the JVM crash dump file:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version:  (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again


4: Solutions:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
Then start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:


  <property>
          <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
          <value>0.13</value>
  </property>
  The default was 0.25; lowering it to 0.13 reduces the memory used by the shuffle phase so that more data spills to local disk.
 
Further investigation showed that the amount of data assigned to each reducer was set too large, so these lines were commented out before rerunning:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;

Tune the map and reduce memory parameters in mapred-site.xml:
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)

Follow-up fixes:

1) Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
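
For example, in hdfs-site.xml (the 600000 ms value below is only an illustration; pick a value that comfortably covers the longest GC pause observed):

  <property>
    <name>dfs.qjournal.write-txns.timeout.ms</name>
    <value>600000</value>
  </property>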

2) Tune the NameNode JVM parameters so that full GC is triggered earlier; each full GC then takes less time.

3) By default the NameNode's full GC uses the parallel collector, which is entirely stop-the-world; switch it to CMS by adjusting the NameNode startup parameters:

-XX:+UseCompressedOops 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
 -XX:SoftRefLRUPolicyMSPerMB=0
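
These options are normally applied through HADOOP_NAMENODE_OPTS in hadoop-env.sh. A sketch (the heap sizes are placeholders and must be sized to the actual amount of NameNode metadata):

# hadoop-env.sh: run the NameNode with CMS using the flags above
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS \
  -Xms8g -Xmx8g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0"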

The full set of QJM client timeouts that were raised (these dfs.qjournal.* properties go in hdfs-site.xml):

  <property>
   <name>dfs.qjournal.start-segment.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.accept-recovery.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.finalize-segment.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.select-input-streams.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.get-journal-state.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.new-epoch.timeout.ms</name>
   <value>600000000</value>
  </property>

  <property>
   <name>dfs.qjournal.write-txns.timeout.ms</name>
   <value>600000000</value>
  </property>

NameNode full GCs are too frequent, and each GC stalls the system for 3-4 minutes

  Because the NameNode keeps all of its metadata in memory, the old generation is under heavy pressure, and every full GC freezes the system for several minutes. Remedies:
  (1) Use the CMS collector.
  (2) Change the young:old generation ratio to 1:2. With a 1:4 ratio, large objects promoted during a young GC go straight into the old generation, the old generation fills up quickly, and full GCs become even more frequent. A sketch of the corresponding JVM options follows.
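
A sketch of the flags, assuming the heap is otherwise left as configured (NewRatio=2 makes the old generation twice the size of the young generation, i.e. young:old = 1:2):

# Added to HADOOP_NAMENODE_OPTS alongside the CMS flags above
-XX:NewRatio=2
# Or fix the young generation size explicitly (placeholder value):
# -XX:NewSize=2g -XX:MaxNewSize=2g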



5: Further notes:
  In Hadoop, edits and fsimage are two critically important files. edits records every namespace change since the latest checkpoint and acts as a journal, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither file can be read with an ordinary text editor, but Hadoop ships dedicated tools for inspecting them, oev and oiv, both invoked through the hdfs command, for example:
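
A sketch using the edits naming pattern from the log above (the fsimage file name is a placeholder; use whatever checkpoint file actually exists under the name directory):

# Offline edits viewer: dump an edits segment to readable XML
hdfs oev -p XML \
  -i /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 \
  -o /tmp/edits.xml

# Offline image viewer: dump an fsimage checkpoint to XML
hdfs oiv -p XML \
  -i /Data/cloudera/hadoop/dfs/name/current/fsimage_0000000000029706040 \
  -o /tmp/fsimage.xml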

This also makes it possible to study how Hadoop 2 implements HA through edits and fsimage.

The steps can be summarized as follows:
  (1) Once HA is configured, every update made by a client is written to the shared directory on the JournalNodes, configured as follows:


  <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://XXXX/mycluster</value>
  </property>

  <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/export1/hadoop2x/dfs/journal</value>
  </property>


  (2) Both the Active NameNode and the Standby NameNode sync the edits from the JournalNodes' shared edits directory into their own edits directories;
  (3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met and, if they are, merges the fsimage and edits files;
  (4) After the merge, the StandbyCheckpointer uploads the new fsimage to the corresponding directory on the Active NameNode;
  (5) When the Active NameNode receives the latest fsimage, it cleans up its old fsimage and edits files;
  (6) Through these steps the fsimage and edits files are merged, and because of the HA mechanism both the Standby and Active NameNodes hold the latest fsimage and edits (unlike the SecondaryNameNode in Hadoop 1.x, whose fsimage and edits were not guaranteed to be up to date).
 
Because the Active NameNode moved to yhbd02 while the active ResourceManager stayed on yhbd01, the NameNode and the ResourceManager now run on separate machines. Performance improved noticeably, since both processes consume a lot of resources.
For example, resource allocation for submitted jobs became more reasonable and containers were spread more evenly across the cluster.
 
 
 
Start of the tuning work:
1. Modified mapred-site.xml
2. Modified hadoop-env.sh

For the full GC problem, the approach was:
Split the user's full-load statement into per-year/per-month batches and run the data that way.
Increase the JournalNode timeout.

Increase the JournalNode write timeout:

dfs.qjournal.write-txns.timeout.ms
After changing it, restart the JournalNode cluster, for example:
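
A sketch assuming the same CDH package init scripts used earlier in this post; run it on every JournalNode host after updating hdfs-site.xml:

service hadoop-hdfs-journalnode restart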
After that the NameNode no longer failed because it could not reach the JournalNode cluster; instead it failed with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to the second NameNode it ran stably.
JVM tuning continued.


Errors seen after the JournalNode timeouts were increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
     at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
     at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
     at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
     at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception

 
Tracing the cause of the error:
step1:
 org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
org.apache.hadoop.hdfs.server.blockmanagement;
Description:
Scans the namesystem, scheduling blocks to be cached as appropriate.
The CacheReplicationMonitor does a full scan when the NameNode first starts up, and at configurable intervals afterwards.

step2: the errors begin:
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error


After the yarn-site.xml and mapred-site.xml parameters below were changed, the NameNode stopped crashing.


Changes:


  yarn-site.xml:

        <property>
                <description>The minimum allocation for every container request at the RM,
                in MBs. Memory requests lower than this won't take effect,
                and the specified value will get allocated at minimum.</description>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>1500</value>
        </property>

        <property>
                <description>The maximum allocation for every container request at the RM,
                in MBs. Memory requests higher than this won't take effect,
                and will get capped to this value.</description>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>8192</value>
        </property>

  mapred-site.xml:

        <property>
                <name>mapreduce.map.memory.mb</name>
                <value>1500</value>
        </property>

        <property>
                <name>mapreduce.map.java.opts</name>
                <value>-Xmx1024M</value>
                <description>Larger heap-size for child jvms of maps.</description>
        </property>

        <property>
                <name>mapreduce.reduce.memory.mb</name>
                <value>2100</value>
        </property>

        <property>
                <name>mapreduce.reduce.java.opts</name>
                <value>-Xmx1900M</value>
                <description>Larger heap-size for child jvms of reduces.</description>
        </property>

        <property>
                <name>mapreduce.job.counters.limit</name>
                <value>12000</value>
        </property>

To summarize, when this situation occurs, work on three fronts:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) tuning of the yarn-site.xml and mapred-site.xml configuration files.

 

Source: ITPUB blog, http://blog.itpub.net/30089851/viewspace-2122226/