A quick note on an incident handled over the weekend.
Symptoms:
Starting at around 06:01, the Hadoop cluster went abnormal and every HDFS operation failed.
Dozens of jobs failed with errors like:
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby
or:
FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
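Both messages mean the client reached a NameNode that was in standby state, i.e. no active NameNode was serving requests at the time. A quick way to confirm each NameNode's state (nn1/nn2 stand in for this cluster's actual service ids):

    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2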
Root cause analysis:
1) The NameNode log showed the failure came from writing to the JournalNodes, so a network problem was the first suspect. With 5 JournalNodes, the majority (quorum) rule means at least 3 of them have to fail a write before the NameNode gives up (see the sketch after this list). And although some of the JournalNodes and the NameNode sit under different switches, monitoring showed no network anomaly during the window, so the network was ruled out.
2) A closer read of the NameNode log, together with the source code, revealed the real cause: at that moment the NameNode went through one very long full GC (96s), which pushed the pending JournalNode writes past the timeout (20s by default), and the NameNode process exited. The JournalNode logs agree: no further requests arrived from the NameNode afterwards (because it was already dead).
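As a sanity check on the majority argument in 1), here is the quorum arithmetic as a trivial sketch (plain arithmetic, not Hadoop source):

    // With 5 JournalNodes, an edit is durable only once a majority acks it.
    public class QuorumMath {
        public static void main(String[] args) {
            int journalNodes = 5;
            int quorum = journalNodes / 2 + 1;     // 3 of 5 must ack
            int tolerable = journalNodes - quorum; // up to 2 may fail
            System.out.println("need " + quorum + " acks; survive " + tolerable + " failures");
        }
    }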
GC log:
2014-05-25T05:59:53.578+0800: 2478553.316: [Full GC [PSYoungGen: 2784K->0K(4179328K)] [PSOldGen: 28566944K->9208608K(28573696K)] 28569728K->9208608K(32753024K) [PSPermGen: 41405K->41405K(524288K)], 95.6706740 secs] [Times: user=95.62 sys=0.00, real=95.67 secs]
Heap
 PSYoungGen      total 4179328K, used 87037K [0x00007f70e5000000, 0x00007f71e5000000, 0x00007f71e5000000)
  eden space 4164800K, 2% used [0x00007f70e5000000,0x00007f70ea4ff590,0x00007f71e3330000)
  from space 14528K, 0% used [0x00007f71e3330000,0x00007f71e3330000,0x00007f71e4160000)
  to   space 14336K, 0% used [0x00007f71e4200000,0x00007f71e4200000,0x00007f71e5000000)
 PSOldGen        total 28573696K, used 9208608K [0x00007f6a15000000, 0x00007f70e5000000, 0x00007f70e5000000)
  object space 28573696K, 32% used [0x00007f6a15000000,0x00007f6c470c8078,0x00007f70e5000000)
 PSPermGen       total 524288K, used 41466K [0x00007f69f5000000, 0x00007f6a15000000, 0x00007f6a15000000)
  object space 524288K, 7% used [0x00007f69f5000000,0x00007f69f787e8f8,0x00007f6a15000000)
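(For reference, a GC log in this shape comes from startup flags along these lines; the log path here is illustrative:)

    -Xloggc:/var/log/hadoop/namenode-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps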
JournalNode log:
2014-05-25 05:58:45,432 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_inprogress_0000000001665741687 -> /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_0000000001665741687-0000000001665777062
2014-05-25 06:44:54,299 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2014-05-25 06:44:54,302 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:
NameNode error log:
2014-05-25 06:01:29,258 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96275 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2014-05-25 06:01:29,259 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 96145ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=95670ms
GC pool 'PS Scavenge' had collection(s): count=1 time=592ms
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96281ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96283ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 20019 Total time for transactions(ms): 436 Number of transactions batched in Syncs: 9441 Number of syncs: 5160 SyncTimes(ms): 41623 9305
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96275ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 1665777063))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
	at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:53)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
	at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1879)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1845)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:439)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:207)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44942)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96276ms to send a batch of 13 edits (1638 bytes) to remote journal 10.201.202.21:8485
2014-05-25 06:01:29,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /tmp/hive-hdfs/hive_2014-05-25_05-59-16_296_5866222407565920996/_task_tmp.-ext-10002/_tmp.001379_0. BP-180494678-xxx-1366627257763 blk_-3717787322078480343_437088977{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW]]}
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 2 to reach 3 For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 1665777063
......
2014-05-25 06:01:29,268 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96285ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
......
2014-05-25 06:01:29,324 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-25 06:01:29,364 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
When the write to the JournalNodes timed out, the terminate method of the ExitUtil class was triggered, which kills the current process. The policy is deliberate: as the comments below note, a failure on any required journal aborts all journals and shuts the NameNode down, rather than risking divergent edit logs.
In the JournalSet class:
for (JournalAndStream jas : journals) {
  try {
    closure.apply(jas);
  } catch (Throwable t) {
    if (jas.isRequired()) {
      final String msg = "Error: " + status + " failed for required journal ("
          + jas + ")";
      LOG.fatal(msg, t);
      // If we fail on *any* of the required journals, then we must not
      // continue on any of the other journals. Abort them to ensure that
      // retry behavior doesn't allow them to keep going in any way.
      abortAllJournals();
      // the current policy is to shutdown the NN on errors to shared edits
      // dir. There are many code paths to shared edits failures - syncs,
      // roll of edits etc. All of them go through this common function
      // where the isRequired() check is made. Applying exit policy here
      // to catch all code paths.
      terminate(1, msg);
    } else {
      LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
      badJAS.add(jas);
    }
  }
}
The terminate method of the ExitUtil class then calls System.exit:
/**
 * Terminate the current process. Note that terminate is the *only* method
 * that should be used to terminate the daemon processes.
 *
 * @param status exit code
 * @param msg message used to create the ExitException
 * @throws ExitException if System.exit is disabled for test purposes
 */
public static void terminate(int status, String msg) throws ExitException {
  LOG.info("Exiting with status " + status);
  if (systemExitDisabled) {
    ExitException ee = new ExitException(status, msg);
    LOG.fatal("Terminate called", ee);
    if (null == firstExitException) {
      firstExitException = ee;
    }
    throw ee;
  }
  System.exit(status);
}
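There is no recovery path here; the only escape hatch is a test hook. A minimal sketch (assuming hadoop-common on the classpath) showing that once disableSystemExit() has been called, terminate throws ExitException instead of killing the process:

    import org.apache.hadoop.util.ExitUtil;

    public class ExitUtilDemo {
        public static void main(String[] args) {
            // Test hook: after this call, terminate() throws instead of exiting.
            ExitUtil.disableSystemExit();
            try {
                ExitUtil.terminate(1, "simulated required-journal flush failure");
            } catch (ExitUtil.ExitException ee) {
                System.out.println("terminate() requested exit status " + ee.status);
            }
        }
    }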
Immediate fix:
Restarted the cluster; everything was back to normal at around 06:50.
Follow-up actions:
1) Increase the JournalNode write timeout (a config sketch follows this list):
dfs.qjournal.write-txns.timeout.ms
2) Tune the NameNode's JVM parameters so that full GC is triggered earlier, while there is less garbage to collect, which keeps each full GC pause shorter.
3) By default the NameNode's full GCs use the parallel collector, which is entirely stop-the-world; switch to the mostly-concurrent CMS collector instead. The NameNode startup parameters were changed to:
-XX:+UseCompressedOops
-XX:+UseParNewGC                        (young-generation collector that pairs with CMS)
-XX:+UseConcMarkSweepGC                 (CMS for the old generation)
-XX:+CMSClassUnloadingEnabled
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0        (compact on every full collection to limit fragmentation)
-XX:+CMSParallelRemarkEnabled           (parallelize the remark pause)
-XX:+DisableExplicitGC                  (ignore System.gc() calls)
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=75   (start CMS once the old gen is 75% full, rather than letting the JVM guess)
-XX:SoftRefLRUPolicyMSPerMB=0           (drop soft references aggressively)
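For 1), the timeout in the log above (timeout=20000 ms) corresponds to dfs.qjournal.write-txns.timeout.ms. A sketch of the hdfs-site.xml change (the 60000 ms value is an example, not a recommendation):

    <property>
      <name>dfs.qjournal.write-txns.timeout.ms</name>
      <!-- example value: raised from the 20000 ms default -->
      <value>60000</value>
    </property>

Note that raising the timeout only buys headroom; the GC pause itself is what 2) and 3) attack.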
Another gotcha: Flume does not recover from this on its own and has to be restarted, otherwise it keeps throwing java.lang.InterruptedException: sleep interrupted.