java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond
1. Task:
/datacenter/shell/user/full/bdl_social_test1.sh
While running the full-volume data job, the NameNode crashed.
2. Error log:
/var/log/hadoop-hdfs/hadoop-hdfs-namenode-yhbd01.log
Contents:
2015-08-27 10:23:04,392 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:05,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:06,393 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:15,483 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000029706041 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137
2015-08-27 10:23:15,484 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 29706138
2015-08-27 10:23:16,485 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:17,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:18,486 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:19,487 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:20,488 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:21,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 6001 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:21,489 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:22,486 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7002 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:22,490 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:23,337 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-08-27 10:23:23,487 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8003 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:23,491 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:24,489 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9005 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:24,492 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:25,490 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10006 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]
2015-08-27 10:23:25,493 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: yhbd02/10.171.81.96:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-08-27 10:23:26,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 11007 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:27,492 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 12008 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:28,493 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 13009 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:29,494 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 14010 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:30,495 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 15011 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:31,496 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 16012 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:32,497 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17013 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:33,499 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18015 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:34,500 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19016 ms (timeout=20000 ms) for a response for startLogSegment(29706138). Succeeded so far: [10.171.82.67:8485]. Exceptions so far: [10.171.81.96:8485: Call From yhbd01/10.171.82.67 to yhbd02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: ]
2015-08-27 10:23:35,484 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log segment 29706138 failed for required journal (JournalAndStream(mgr=QJM to [10.171.82.67:8485, 10.171.81.96:8485, 10.170.255.96:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:403)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$3.apply(JournalSet.java:222)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:219)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:1181)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:1150)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1231)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6119)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:908)
    at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:139)
    at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:11214)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
2015-08-27 10:23:35,486 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-08-27 10:23:35,490 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at yhbd01/10.171.82.67
************************************************************/
2015-08-27 10:28:48,352 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = yhbd01/10.171.82.67
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.5.0-cdh5.3.3
3. Fault analysis:
(1)
The JournalNode process on yhbd02 was not running.
When the NameNode rolls the edit log (as part of the fsimage/edits checkpoint cycle), it must commit the new log segment to the JournalNode quorum. Since port 8485 on yhbd02 could not be reached, the NameNode timed out waiting for the JournalNodes and shut itself down.
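Whether a JournalNode is actually listening on its RPC port can be verified from the NameNode host with a small shell probe. The host and port below come from the log above; `check_jn_port` is a helper name invented for this sketch:

```shell
# Minimal reachability probe for a JournalNode's RPC port, using bash's
# built-in /dev/tcp so it works even where nc/telnet are not installed.
check_jn_port() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    # This branch corresponds to the "Connection refused" in the NameNode log
    echo "${host}:${port} unreachable"
  fi
}

check_jn_port yhbd02 8485
```

If the port is unreachable while the host itself is up, the JournalNode process is down, which matches case (1) above.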
(2)
Later, after failover completed and all JournalNodes were healthy, the job was resumed and the NameNode on yhbd02 crashed as well.
The log again pointed to a JournalNode access timeout. This is typically caused by a full GC: at that moment the NameNode performed a long full GC, so the write to the JournalNodes exceeded the timeout (20 s by default) and the NameNode process exited. The mapred-site.xml parameters therefore needed tuning.
Some leads for JVM tuning, taken from the file the JVM dumped on crash:
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 350748672 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2809), pid=16666, tid=139882867275520
#
# JRE version: (7.0_79-b14) (build )
# Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.5.5
# Distribution: Built on CentOS release 6.6 (Final) (Wed Apr 15 00:02:21 UTC 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
4. Solutions:
(1)
Start the JournalNode on yhbd02:
service hadoop-hdfs-journalnode start
Then start the NameNode:
service hadoop-hdfs-namenode start
(2)
Adjust the mapred-site.xml parameters:
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.13</value>
</property>
The value was originally 0.25; lowering it to 0.13 reduces the shuffle phase's memory footprint and spills more data to local disk.
Investigation also showed that the data size configured per reducer here was too large; comment these lines out and rerun:
#set hive.exec.reducers.bytes.per.reducer=1000000000;
#set mapred.min.split.size.per.node=300000000;
#set mapred.min.split.size.per.rack=300000000;
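Why the per-reducer bytes setting matters: Hive sizes the reduce stage roughly as ceil(input bytes / hive.exec.reducers.bytes.per.reducer), so a large value means few reducers, each holding far more data, which drives up memory pressure. A sketch of that arithmetic (`estimated_reducers` is a helper name invented here; the exact heuristic in a given Hive version may differ):

```shell
# Approximate reducer count: ceil(input_bytes / bytes_per_reducer),
# computed with integer arithmetic.
estimated_reducers() {
  local input_bytes=$1 per_reducer=$2
  echo $(( (input_bytes + per_reducer - 1) / per_reducer ))
}

estimated_reducers 500000000000 1000000000   # 500 GB at 1 GB/reducer -> 500
estimated_reducers 500000000000 256000000    # same input at ~256 MB -> 1954
```

Dropping the override lets Hive fall back to its default, which yields more, smaller reducers.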
Tune the map and reduce memory parameters in mapred-site.xml:
mapreduce.map.memory.mb      // controls map task memory
mapreduce.reduce.memory.mb   // controls reduce task memory
(3)
Follow-up measures:
1) Increase the JournalNode write timeout:
dfs.qjournal.write-txns.timeout.ms
2) Tune the NameNode's Java options so that full GC is triggered earlier; each full GC then takes less time.
3) By default the NameNode's full GC uses the parallel collector, which is fully stop-the-world; switch it to CMS by adjusting the NameNode startup options:
-XX:+UseCompressedOops
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
-XX:SoftRefLRUPolicyMSPerMB=0
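Where these flags go: on a CDH-style install the NameNode JVM options are typically appended to HADOOP_NAMENODE_OPTS in hadoop-env.sh. A sketch of such a fragment (the -Xms/-Xmx heap sizes are illustrative placeholders, not values from this post):

```shell
# hadoop-env.sh fragment: switch the NameNode to the CMS collector.
# The heap sizes here are placeholders; size them to your metadata footprint.
# The GC flags are the ones listed above.
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g \
  -XX:+UseCompressedOops \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled \
  -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 \
  -XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:SoftRefLRUPolicyMSPerMB=0 \
  ${HADOOP_NAMENODE_OPTS}"
```

CMSInitiatingOccupancyFraction=75 starts the concurrent cycle when the old generation is 75% full, which is what "trigger full GC earlier" in step 2) refers to.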
<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.prepare-recovery.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.accept-recovery.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.finalize-segment.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.select-input-streams.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.get-journal-state.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.new-epoch.timeout.ms</name>
  <value>600000000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>600000000</value>
</property>
NameNode full GCs were too frequent, and each one paused the system for 3-4 minutes.
Because the NameNode keeps all metadata in memory, pressure on the old generation is high, and during every full GC the system freezes for minutes. Remedies:
(1) Use the CMS collector.
(2) Change the young:old generation ratio to 1:2. At 1:4, large objects promoted during young GC go straight into the old generation, so the old generation grows quickly and full GCs become even more frequent.
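The generation ratio is set with the -XX:NewRatio flag: NewRatio=N makes old:young = N:1, so a 1:2 young:old split corresponds to -XX:NewRatio=2, giving the young generation one third of the heap. A small sketch of the split (`young_gen_mb` is a helper name invented here; actual sizes are further affected by survivor-space settings):

```shell
# Young-generation share of the heap for a given -XX:NewRatio value:
# old:young = ratio:1, so young = heap / (ratio + 1).
young_gen_mb() {
  local heap_mb=$1 new_ratio=$2
  echo $(( heap_mb / (new_ratio + 1) ))
}

young_gen_mb 3072 2   # NewRatio=2 (young:old = 1:2) -> 1024 MB young gen
young_gen_mb 3072 4   # NewRatio=4 (young:old = 1:4) -> 614 MB young gen
```

With the larger young generation, fewer big objects overflow into the old generation during young GC.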
5. Further notes:
In Hadoop, edits and fsimage are two critical files: edits records the namespace changes since the latest checkpoint, acting as a journal, while fsimage (the HDFS metadata image file) holds the latest checkpoint itself. Neither can be read with an ordinary text editor, but Hadoop ships dedicated viewers for them, oev and oiv, both invoked through the hdfs command.
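As a hedged example of those viewers (the edits file name is taken from the log above; the fsimage file name and the output paths are placeholders; the guard lets the sketch no-op where no hdfs client exists):

```shell
# Dump an edits file to XML with the offline edits viewer (oev, XML is its
# default output) and an fsimage with the offline image viewer (oiv -p XML).
dump_metadata() {
  if command -v hdfs >/dev/null 2>&1; then
    hdfs oev -i "$1" -o /tmp/edits.xml && echo "wrote /tmp/edits.xml"
    hdfs oiv -i "$2" -o /tmp/fsimage.xml -p XML && echo "wrote /tmp/fsimage.xml"
  else
    echo "hdfs CLI not available; run this on a cluster node"
  fi
}

# Example invocation; the fsimage name below is a hypothetical placeholder.
dump_metadata /Data/cloudera/hadoop/dfs/name/current/edits_0000000000029706041-0000000000029706137 \
              /Data/cloudera/hadoop/dfs/name/current/fsimage_0000000000029706040
```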
With these tools you can study how Hadoop 2 uses edits and fsimage to implement HA. The steps break down as follows:
(1) With HA configured, every namespace update from clients is written to the shared directory on the JournalNodes, set by the following configuration:
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://XXXX/mycluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/export1/hadoop2x/dfs/journal</value>
</property>
(2) The Active and Standby NameNodes both sync edits from the JournalNodes' shared directory into their own edits directories.
(3) The StandbyCheckpointer class in the Standby NameNode periodically checks whether the checkpoint conditions are met; if so, it merges fsimage and edits.
(4) After the merge, the StandbyCheckpointer uploads the resulting fsimage to the corresponding directory on the Active NameNode.
(5) On receiving the latest fsimage, the Active NameNode cleans up its old fsimage and edits files.
(6) Through these steps the fsimage/edits merge is complete, and thanks to the HA mechanism both the Standby and Active NameNodes hold the latest fsimage and edits (unlike the SecondaryNameNode in Hadoop 1.x, whose fsimage and edits were not current).
Because the Active NameNode moved to yhbd02 while the active ResourceManager stayed on yhbd01, the NameNode and the ResourceManager now run on separate machines, and performance improved noticeably, since both processes are resource-hungry. For example, submitted jobs get more sensible resource allocation, and containers are spread more evenly across the cluster.
The optimization work:
1: modified mapred-site.xml
2: modified hadoop-env.sh
For the full GC problem, the approach was:
Split the user's full-volume query into per-year/per-month jobs.
Lengthen the JournalNode timeout, i.e. increase the JournalNode write timeout:
dfs.qjournal.write-txns.timeout.ms
Then restart the JournalNode cluster.
After that, the NameNode no longer crashed because it could not reach the JournalNode cluster; instead it failed with the following:
IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
After failing over to namenode2 it ran stably.
JVM tuning continued.
The error reported after the JournalNode timeouts had been increased:
2015-09-02 00:13:03,408 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631554
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-09-02 00:13:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:13:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:04,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:14:34,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:14:34,167 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-09-02 00:15:04,049 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.171.81.96
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 31631554
2015-09-02 00:15:04,050 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 54
2015-09-02 00:15:04,121 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 76 Total time for transactions(ms): 6 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 73 56
2015-09-02 00:15:04,166 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /Data/cloudera/hadoop/dfs/name/current/edits_inprogress_0000000000031631554 -> /Data/cloudera/hadoop/dfs/name/current/edits_0000000000031631554-0000000000031631629
2015-09-02 00:15:04,167 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 31631630
2015-09-02 00:16:28,744 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 30
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 10.171.81.96:15527 Call#19 Retry#0: output error
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.172.7.24:36833 Call#486892 Retry#0: output error
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.getServiceStatus from 10.171.82.67:48509 Call#1099647 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.165.115.190:56256 Call#464826 Retry#0: output error
2015-09-02 00:16:28,945 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8020 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,944 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 8020 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15538 Call#1238456 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15554 Call#1238459 Retry#0
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:58665 Call#495767 Retry#0: output error
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:265)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:474)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2534)
    at org.apache.hadoop.ipc.Server.access$1900(Server.java:130)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:965)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1030)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2068)
2015-09-02 00:16:28,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for active state
2015-09-02 00:16:28,946 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020: skipped org.apache.hadoop.ha.HAServiceProtocol.transitionToStandby from 10.171.81.96:15561 Call#1238462 Retry#0
2015-09-02 00:16:28,946 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 84780 millisecond(s).
2015-09-02 00:16:28,944 WARN org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.96:15952 Call#452032 Retry#0: output error
2015-09-02 00:16:28,948 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 8020 caught an exception
Tracking down the cause:
Step 1:
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
Source location:
org.apache.hadoop.hdfs.server.blockmanagement
Explanation:
The CacheReplicationMonitor scans the namesystem, scheduling blocks to be cached as appropriate. It does a full scan when the NameNode first starts up, and at configurable intervals afterwards.
Step 2: the errors begin.
org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8020, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.170.255.188:53949 Call#525896 Retry#0: output error
After adjusting the yarn-site.xml and mapred-site.xml parameters, the NameNode stopped crashing.
Changes:
yarn-site.xml:
<property>
  <description>The minimum allocation for every container request at the RM,
  in MBs. Memory requests lower than this won't take effect,
  and the specified value will get allocated at minimum.</description>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1500</value>
</property>
<property>
  <description>The maximum allocation for every container request at the RM,
  in MBs. Memory requests higher than this won't take effect,
  and will get capped to this value.</description>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value>
</property>

mapred-site.xml:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1500</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024M</value>
  <description>Larger heap-size for child jvms of maps.</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2100</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1900M</value>
  <description>Larger heap-size for child jvms of reduces.</description>
</property>
<property>
  <name>mapreduce.job.counters.limit</name>
  <value>12000</value>
</property>
To summarize, tackle this kind of failure from three directions:
(1) the JournalNode timeouts;
(2) JVM parameter tuning;
(3) the yarn-site.xml and mapred-site.xml configuration files.
Reposted from the ITPUB blog: http://blog.itpub.net/30089851/viewspace-2122226/. Please credit the source when reposting; otherwise legal liability may be pursued.