namenode gc導致的故障一例薦

菜菜光發表於2014-05-29

原文網址 : https://flycode.co/archives/256360

週末處理的一次故障，這裡簡單記錄下。

故障現象：

6點1 分左右開始， Hadoop叢集異常，所有的hdfs操作都出現問題。

幾十個 job報以下錯

FAILED:
 RuntimeException 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
 Operation category WRITE is not supported in state standby
或者： FAILED:
 RuntimeException 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
 Operation category READ is not supported in state standby

故障原因分析:

1)看nn的日誌，可以發現是寫jn導致的，懷疑是網路問題導致，由於是5臺jn，如果是網路問題的話，根據major的演算法，需要3臺寫入失敗才會導致出現問題，雖然jn和nn有些機器不在同一個交換機下，但是，從監控上來看，網路沒有異常，排除網路問題

2)仔細檢視nn的日誌，和對原始碼的分析，發現是由於 namenode這個時間點進行了一次時間比較長的 full gc （96s ），導致寫 journalnode 超時（預設是20s）， namenode程式退出。同時從jn的日誌可以看出，後面是沒有nn的請求過來的（因為nn已經掛掉了）。

Gc日誌：

2014-05-25T05:59:53.578 +0800:
 2478553.316: [Full GC [PSYoungGen: 2784K->0K(4179328K)] [PSOldGen: 
28566944K->9208608K(28573696K)] 28569728K->9208608K(32753024K)
[PSPermGen: 41405K->41405K(524288K)], 95.6706740 secs] [Times: user=95.62 sys=0.00, real= 95.67 secs]
Heap
PSYoungGen      total 4179328K, used 87037K [0x00007f70e5000000, 0x00007f71e5000000, 0x00007f71e5000000)
  eden space 4164800K, 2% used [0x00007f70e5000000,0x00007f70ea4ff590,0x00007f71e3330000)
  from space 14528K, 0% used [0x00007f71e3330000,0x00007f71e3330000,0x00007f71e4160000)
  to   space 14336K, 0% used [0x00007f71e4200000,0x00007f71e4200000,0x00007f71e5000000)
PSOldGen        total 28573696K, used 9208608K [0x00007f6a15000000, 0x00007f70e5000000, 0x00007f70e5000000)
  object space 28573696K, 32% used [0x00007f6a15000000,0x00007f6c470c8078,0x00007f70e5000000)
PSPermGen       total 524288K, used 41466K [0x00007f69f5000000, 0x00007f6a15000000, 0x00007f6a15000000)
  object space 524288K, 7% used [0x00007f69f5000000,0x00007f69f787e8f8,0x00007f6a15000000)

jn日誌：

2014-05-25 05:58:45,432 INFO 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing 
edits file 
/home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_inprogress_0000000001665741687
 -> 
/home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_0000000001665741687-0000000001665777062
2014-05-25 06:44:54,299 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2014-05-25 06:44:54,302 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:

nn錯誤日誌：

2014-05-25 06:01:29,258 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 
96275 ms (timeout=20000 ms) for a response for sendEdits. No responses 
yet.
2014-05-25 06:01:29,259 WARN 
org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host 
machine (eg GC): pause of approximately 96145ms
GC pool `PS MarkSweep` had collection(s): count=1 time=95670ms
GC pool `PS Scavenge` had collection(s): count=1 time=592ms
2014-05-25
 06:01:29,259 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 
96281ms to send a batch of 2 edits (358 bytes) to remote journal 
xxx:8485
2014-05-25 06:01:29,259 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 
96283ms to send a batch of 2 edits (358 bytes) to remote journal 
xxx:8485
2014-05-25 06:01:29,259 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog:
Number of transactions: 20019
Total time for transactions(ms): 436
Number of transactions batched in Syncs: 9441
Number of syncs: 5160
SyncTimes(ms): 41623 9305
2014-05-25
 06:01:29,259 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 
96275ms to send a batch of 13 edits (1638 bytes) to remote journal 
xxx:8485
2014-05-25 06:01:29,259 FATAL 
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed 
for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, 
xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at 
txid 1665777063))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:53)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1879)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1845)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:439)
       
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:207)
       
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44942)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
2014-05-25
 06:01:29,259 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 
96276ms to send a batch of 13 edits (1638 bytes) to remote journal 
10.201.202.21:8485
2014-05-25 06:01:29,263 INFO 
org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: 
/tmp/hive-hdfs/hive_2014-05-25_05-59-16_296_5866222407565920996/_task_tmp.-ext-10002/_tmp.001379_0.
 BP-180494678-xxx-1366627257763 
blk_-3717787322078480343_437088977{blockUCState=UNDER_CONSTRUCTION, 
primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[xxx:50010|RBW], 
ReplicaUnderConstruction[xxx:50010|RBW], 
ReplicaUnderConstruction[xxx:50010|RBW]]}
2014-05-25 06:01:29,263 
WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault:
 Not able to place enough replicas, still in need of 2 to reach 3
For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-05-25
 06:01:29,263 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting 
QuorumOutputStream starting at txid 1665777063
......
2014-05-25 
06:01:29,268 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 
96285ms to send a batch of 13 edits (1638 bytes) to remote journal 
xxx:8485
......
2014-05-25 06:01:29,324 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-25 06:01:29,364 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

在寫journalnode超時時，觸發了 ExitUtil類的terminate 方法，終止當前的程式：

JournalSet類中：

     for (JournalAndStream jas :  journals) {
      try {
        closure.apply(jas);
      }  catch (Throwable t) {
         if (jas.isRequired()) {
           final String msg = "Error: "  + status +  " failed for required journal ("
            + jas +  ")" ;
           LOG .fatal(msg, t);
           // If we fail on *any* of the required journals, then we must not
           // continue on any of the other journals. Abort them to ensure that
           // retry behavior doesn`t allow them to keep going in any way.
          abortAllJournals();
           // the current policy is to shutdown the NN on errors to shared edits
           // dir  . There are many code paths to shared edits failures - syncs ,
           // roll of edits etc. All of them go through this common function
           // where the isRequired() check is made. Applying exit policy here
           // to catch all code paths.
          terminate(1, msg);
        }  else {
           LOG .error("Error: "  + status +  " failed for (journal " + jas + ")"  , t);
          badJAS.add(jas);         
        }
      }
    }

ExitUtil類的terminate方法，呼叫了System.exit方法：

  /**
   * Terminate the current process. Note that terminate is the *only* method
   * that should be used to terminate the daemon processes.
   * @param status exit code
   * @param msg message used to create the ExitException
   * @throws ExitException if System.exit is disabled for test purposes
   */
  public static void terminate(int status, String msg) throws ExitException {
    LOG.info( "Exiting with status " + status);
    if (systemExitDisabled) {
      ExitException ee = new ExitException(status, msg);
      LOG.fatal( "Terminate called", ee);
      if (null == firstExitException) {
        firstExitException = ee;
      }
      throw ee;
    }
    System.exit(status);
  }

處理方法：

重啟叢集，在 6:50左右恢復正常

後續解決方法：

1）調節journalnode 的寫入超時時間

dfs.qjournal.write-txns.timeout.ms

2）調整namenode 的java引數，提前觸發 full gc，這樣full gc 的時間就會小一些。

3）預設namenode的fullgc方式是parallel gc，是stw模式的，更改為cms的格式。調整namenode的啟動引數：

-XX:+UseCompressedOops 
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled 
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 
-XX:+CMSParallelRemarkEnabled -XX:+DisableExplicitGC 
-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=75
 -XX:SoftRefLRUPolicyMSPerMB=0

另外一個比較坑的是flume不會自動恢復，需要重啟。。

否則會一直報java.lang.InterruptedException: sleep interrupted 錯誤。

hive 故障一例
2020-07-09
Hive
pymysql 非執行緒安全導致的故障.
2020-04-07
MySql執行緒
strict weak ordering導致公司級故障
2022-01-06
Hadoop中Namenode單點故障的解決方案
2018-04-15
Hadoop
故障分析 | replace into 導致主備不一致
2022-12-06
故障分析 | DDL 導致的 Xtrabackup 備份失敗
2022-06-02
ORA-19909(一次DataGuard Failover導致的故障
2020-06-10
AI
故障分析 | MySQL : slave_compressed_protocol 導致 crash
2021-12-28
MySqlProtocol
MAC address(實體地址)重複導致的網路故障
2019-02-22
Mac
案例分享-full gc導致k8s pod重啟
2023-05-04
GCK8S
Oracle RAC日常運維-NetworkManager導致叢集故障
2019-06-08
Oracle運維
MySQL：產生大量小relay log的故障一例
2019-10-11
MySql
修改PT622光貓導致的一系列故障
2019-09-22
磁碟IO故障導致的SQLServer資料庫無法寫入
2021-11-25
SQLServer資料庫
Oracle SYSAUX 表空間使用率100% 導致的DB 故障
2021-08-25
OracleUX
記一次儲存問題導致的rac故障案例
2022-10-09
JVM 輸出 GC 日誌導致 JVM 卡住，我 TM 人傻了
2022-06-17
JVMGC
enq: TX - index contention故障修復一例
2021-06-01
ENQIndex
執行SQL查詢導致磁碟耗盡故障演示
2018-06-04
SQL
又抓了一個導致頻繁GC的鬼--陣列動態擴容
2020-05-28
GC陣列
Oracle死鎖一例(ORA-00060)，鎖表導致的業務死鎖問題
2018-07-05
Oracle
gc current request 引起長期鎖表的故障
2018-06-20
GC
一次對pool的誤用導致的.net頻繁gc的診斷分析
2023-03-28
GC
線上排查：記憶體異常使用導致full gc頻繁
2023-02-07
記憶體GC
故障分析 | MySQL鎖等待超時一例分析
2022-11-23
MySql
子游標過多導致大量mutex爭用故障分析
2023-12-25
Mutex
故障分析 | 手動 rm 掉 binlog 導致主從報錯
2022-05-16
故障分析 | MySQL convert 函式導致的字符集報錯處理
2023-02-08
MySql函式
Linux主機USB RNDIS網路卡驅動實現不完整導致的一例問題
2021-05-22
Linux
年年出妖事，一例由JSON解析導致的"薛定諤BUG"排查過程記錄
2022-05-07
JSON
故障分析 | MySQL 5.7 使用臨時表導致資料庫 Crash
2022-05-18
MySql資料庫
MySQL：5.6 大事務show engine innodb status故障一例
2020-03-02
MySql
伺服器不同的故障導致資料丟失都怎麼解決的
2019-09-26
伺服器
因壞道問題導致的硬碟故障如何進行資料恢復？
2019-07-25
硬碟資料恢復
成功恢復某公司伺服器故障導致的資料庫損壞
2019-04-25
伺服器資料庫
阿里雲大面積故障原因：觸發一個未知程式碼bug導致
2018-06-28
阿里
三星Galaxy Fold螢幕故障原因浮出：系設計缺陷導致
2019-04-26
故障分析 | 大量短時程式導致 cpu 負載過高案例一則
2022-05-31
負載
故障分析 | MySQL 耗盡主機記憶體一例分析
2022-07-12
MySql記憶體

namenode gc導致的故障一例薦

相關文章