A Hive job in production failed with the following error:
Container [pid=28474,containerID=container_1411897705890_0181_01_000012] is running beyond physical memory limits. Current usage: 1.0 GB of 1 GB physical memory used; 1.5 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1411897705890_0181_01_000012 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 28474 19508 28474 28474 (bash) 0 0 9416704 309 /bin/bash -c /usr/java/jdk1.7.0_67/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Djava.io.tmpdir=/data/yarn/local/usercache/hadoop/appcache/application_1411897705890_0181/container_1411897705890_0181_01_000012/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.10.11.161 32875 attempt_1411897705890_0181_r_000000_3 12 1>/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012/stdout 2>/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012/stderr
|- 28481 28474 28474 28474 (java) 2356 397 1630285824 264098 /usr/java/jdk1.7.0_67/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1024m -Djava.io.tmpdir=/data/yarn/local/usercache/hadoop/appcache/application_1411897705890_0181/container_1411897705890_0181_01_000012/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/data/yarn/logs/application_1411897705890_0181/container_1411897705890_0181_01_000012 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.10.11.161 32875 attempt_1411897705890_0181_r_000000_3 12
Judging from the exception, memory usage exceeded the limit and ContainersMonitorImpl killed the process. Let's look at the JVM garbage collection log:
[GC [PSYoungGen: 241753K->16036K(306176K)] 241753K->16116K(1005568K), 0.0362550 secs] [Times: user=0.31 sys=0.05, real=0.04 secs]
[GC [PSYoungGen: 210741K->4826K(306176K)] 210821K->282228K(1005568K), 0.0996080 secs] [Times: user=1.58 sys=0.15, real=0.10 secs]
[GC [PSYoungGen: 194630K->4762K(306176K)] 472032K->624439K(1005568K), 0.1418910 secs] [Times: user=2.30 sys=0.14, real=0.14 secs]
[Full GC [PSYoungGen: 4762K->0K(306176K)] [ParOldGen: 619677K->359650K(699392K)] 624439K->359650K(1005568K) [PSPermGen: 21635K->21622K(43520K)], 0.1742260 secs] [Times: user=0.82 sys=0.07, real=0.17 secs]
[GC-- [PSYoungGen: 192581K->192581K(306176K)] 655085K->833707K(1005568K), 0.0634170 secs] [Times: user=1.08 sys=0.00, real=0.06 secs]
[Full GC [PSYoungGen: 192581K->0K(306176K)] [ParOldGen: 641125K->640707K(699392K)] 833707K->640707K(1005568K) [PSPermGen: 21674K->21674K(49152K)], 0.0663990 secs] [Times: user=0.65 sys=0.05, real=0.07 secs]
[Full GC [PSYoungGen: 262656K->0K(306176K)] [ParOldGen: 640709K->8142K(699392K)] 903365K->8142K(1005568K) [PSPermGen: 24649K->24647K(49152K)], 0.0662210 secs] [Times: user=0.37 sys=0.00, real=0.07 secs]
[GC [PSYoungGen: 262656K->15936K(327680K)] 270798K->24078K(1027072K), 0.0175890 secs] [Times: user=0.14 sys=0.14, real=0.02 secs]
Heap
 PSYoungGen total 327680K, used 201250K [0x00000000eaa80000, 0x0000000100000000, 0x0000000100000000)
  eden space 284160K, 65% used [0x00000000eaa80000,0x00000000f5f78b18,0x00000000fc000000)
  from space 43520K, 36% used [0x00000000fd580000,0x00000000fe510010,0x0000000100000000)
  to space 22016K, 0% used [0x00000000fc000000,0x00000000fc000000,0x00000000fd580000)
 ParOldGen total 699392K, used 8142K [0x00000000bff80000, 0x00000000eaa80000, 0x00000000eaa80000)
  object space 699392K, 1% used [0x00000000bff80000,0x00000000c07738d8,0x00000000eaa80000)
 PSPermGen total 49152K, used 24726K [0x00000000bad80000, 0x00000000bdd80000, 0x00000000bff80000)
  object space 49152K, 50% used [0x00000000bad80000,0x00000000bc5a5908,0x00000000bdd80000)
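One way to read the young-collection lines in this log is to subtract the young generation's occupancy after the collection from the whole heap's occupancy after the collection; what remains is sitting outside the young generation. The following is a minimal sketch of that arithmetic, assuming the ParallelGC line format shown in the log. The class and method names are hypothetical, and Full GC lines are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Hadoop or the JDK.
final class GcLineSketch {
    // Matches "[PSYoungGen: <before>K-><after>K(<cap>K)] <heapBefore>K-><heapAfter>K(<cap>K)".
    // Full GC lines have "[ParOldGen: ..." after the young section, so they do not match.
    private static final Pattern YOUNG_GC = Pattern.compile(
        "\\[PSYoungGen: (\\d+)K->(\\d+)K\\(\\d+K\\)\\] (\\d+)K->(\\d+)K\\(\\d+K\\)");

    // Heap occupancy outside the young generation after a young collection, in KB.
    static long oldGenAfterKb(String gcLine) {
        Matcher m = YOUNG_GC.matcher(gcLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a young-collection line: " + gcLine);
        }
        long youngAfter = Long.parseLong(m.group(2));
        long heapAfter = Long.parseLong(m.group(4));
        return heapAfter - youngAfter;
    }

    public static void main(String[] args) {
        String[] lines = {
            "[GC [PSYoungGen: 241753K->16036K(306176K)] 241753K->16116K(1005568K), 0.0362550 secs]",
            "[GC [PSYoungGen: 210741K->4826K(306176K)] 210821K->282228K(1005568K), 0.0996080 secs]",
            "[GC [PSYoungGen: 194630K->4762K(306176K)] 472032K->624439K(1005568K), 0.1418910 secs]",
        };
        for (String line : lines) {
            System.out.println(oldGenAfterKb(line) + "K outside the young generation");
        }
    }
}
```

For the third line the result is 619677K, which agrees with the ParOldGen figure at the start of the Full GC that follows it. Across the first three young collections the figure climbs from 80K to 277402K to 619677K, so the old generation fills quickly even though a later Full GC can bring it back down to 8142K.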
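There is no obvious memory leak or out-of-memory condition. The heap is still the place to start, though: objects in an MR job are generally short-lived, so I tried enlarging the young generation so that more objects are collected there and collection becomes more efficient. The parameters were changed to: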
-Xms1024m -Xmx1024m -Xmn600m
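That solved the problem. This time the physical memory limit was exceeded; the other case is a task being killed because it exceeds the virtual memory limit.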
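The parameter yarn.nodemanager.vmem-pmem-ratio sets how much virtual memory is allowed per unit of physical memory. The default is 2.1, meaning that for every 1 MB of physical memory a container may use at most 2.1 MB of virtual memory. To fix a virtual memory problem, either raise this parameter moderately or, again, tune JVM garbage collection.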
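The "2.1 GB virtual memory" ceiling in the error message follows from this ratio applied to the 1 GB container. Below is a minimal sketch of that arithmetic, assuming the virtual limit is simply the container's physical allocation multiplied by the ratio. The class and method names are hypothetical and are not Hadoop APIs; the usage figures are the VMEM_USAGE(BYTES) values of the bash and java processes in the process-tree dump.

```java
// Hypothetical helper: illustrates the ratio arithmetic only.
final class VmemLimitSketch {
    // Virtual memory limit derived from the physical limit and the configured ratio.
    static long vmemLimitBytes(long pmemLimitBytes, double vmemPmemRatio) {
        return (long) (pmemLimitBytes * vmemPmemRatio);
    }

    public static void main(String[] args) {
        long pmemLimit = 1024L * 1024 * 1024;            // 1 GB container
        long vmemLimit = vmemLimitBytes(pmemLimit, 2.1); // default ratio from the text

        // VMEM_USAGE(BYTES) of the bash and java processes in the dump.
        long vmemUsed = 9416704L + 1630285824L;

        System.out.println("vmem limit (bytes): " + vmemLimit);
        System.out.println("vmem used  (bytes): " + vmemUsed);
        System.out.println("over vmem limit: " + (vmemUsed > vmemLimit));
    }
}
```

With these inputs the tree uses about 1.5 GB of virtual memory against a limit of about 2.1 GB, which matches the error message and shows why the virtual check was not the one that fired here.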
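Finally, a word on how ContainersMonitorImpl does its monitoring. It stores the pid of every Container, and its internal MonitoringThread periodically scans the process trees of the running Containers. The NodeManager builds the process tree rooted at the Container process by reading the /proc/<pid>/stat files, and enforces the task's memory limit by watching how much memory that process tree uses.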
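To make the /proc/<pid>/stat step concrete, here is a minimal sketch that parses stat lines and sums the resident memory of a process tree. It is not the NodeManager's implementation. The two sample lines are made up for illustration: only the pid, ppid, vsize and rss fields are taken from the process-tree dump in the error message, the other fields are filler, and the 4096-byte page size is an assumption. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses stat lines supplied as strings instead of reading /proc.
final class ProcStatSketch {
    static final long ASSUMED_PAGE_SIZE = 4096L; // assumption; query the OS in real code

    // Illustrative stat lines; only pid, ppid, vsize and rss come from the dump.
    static final String BASH_STAT =
        "28474 (bash) S 19508 28474 28474 0 -1 4202496 0 0 0 0 0 0 0 0 20 0 1 0 123400 9416704 309";
    static final String JAVA_STAT =
        "28481 (java) S 28474 28474 28474 0 -1 4202496 1000 0 0 0 2356 397 0 0 20 0 30 0 123456 1630285824 264098";

    static final class ProcInfo {
        final long pid, ppid, vsizeBytes, rssPages;
        ProcInfo(long pid, long ppid, long vsizeBytes, long rssPages) {
            this.pid = pid; this.ppid = ppid;
            this.vsizeBytes = vsizeBytes; this.rssPages = rssPages;
        }
    }

    static ProcInfo parse(String statLine) {
        // The command name is in parentheses and may contain spaces,
        // so split only the part after the last ')'.
        int open = statLine.indexOf('(');
        int close = statLine.lastIndexOf(')');
        long pid = Long.parseLong(statLine.substring(0, open).trim());
        String[] rest = statLine.substring(close + 1).trim().split("\\s+");
        // rest[0] is field 3 (state), so field n of proc(5) is rest[n - 3]:
        // ppid = field 4, vsize = field 23, rss = field 24.
        return new ProcInfo(pid, Long.parseLong(rest[1]),
                            Long.parseLong(rest[20]), Long.parseLong(rest[21]));
    }

    // True if p is the root or has the root among its ancestors.
    static boolean inTree(ProcInfo p, long rootPid, List<ProcInfo> all) {
        long cur = p.pid;
        for (int hops = 0; hops <= all.size(); hops++) {
            if (cur == rootPid) return true;
            long parent = -1;
            for (ProcInfo q : all) if (q.pid == cur) parent = q.ppid;
            if (parent < 0) return false; // walked out of the known processes
            cur = parent;
        }
        return false;
    }

    // Resident memory of the tree rooted at rootPid, in bytes.
    static long treeRssBytes(long rootPid, List<String> statLines) {
        List<ProcInfo> all = new ArrayList<>();
        for (String line : statLines) all.add(parse(line));
        long pages = 0;
        for (ProcInfo p : all) if (inTree(p, rootPid, all)) pages += p.rssPages;
        return pages * ASSUMED_PAGE_SIZE;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(BASH_STAT);
        lines.add(JAVA_STAT);
        long rss = treeRssBytes(28474L, lines);
        System.out.println("tree RSS bytes: " + rss);
        System.out.println("over 1 GB: " + (rss > 1024L * 1024 * 1024));
    }
}
```

Under the 4096-byte page assumption, the 309 + 264098 pages from the dump come to slightly more than 1 GB, which is consistent with the "1.0 GB of 1 GB physical memory used" message.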
private class MonitoringThread extends Thread {
    public void run() {
        while (true) {
            // Get the process tree
            ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
            pTree.updateProcessTree();
            // Get the memory usage of the container's process tree
            long currentVmemUsage = pTree.getCumulativeVmem();
            long currentPmemUsage = pTree.getCumulativeRssmem();
            // Get the memory usage of the processes in the tree whose age is greater than 1
            long curMemUsageOfAgedProcesses = pTree.getCumulativeVmem(1);
            long curRssMemUsageOfAgedProcesses = pTree.getCumulativeRssmem(1);
            long vmemLimit = ptInfo.getVmemLimit();
            long pmemLimit = ptInfo.getPmemLimit();
            boolean isMemoryOverLimit = false;
            String msg = "";
            // Kill the Container if the total memory of all processes in its tree
            // (age greater than 0) exceeds twice the configured limit, or if the
            // total memory of the processes with age greater than 1 exceeds the limit
            if (isVmemCheckEnabled()
                && isProcessTreeOverLimit(containerId.toString(),
                    currentVmemUsage, curMemUsageOfAgedProcesses, vmemLimit)) {
                msg = formatErrorMessage("virtual", currentVmemUsage, vmemLimit,
                    currentPmemUsage, pmemLimit, pId, containerId, pTree);
                isMemoryOverLimit = true;
            } else if (isPmemCheckEnabled()
                && isProcessTreeOverLimit(containerId.toString(),
                    currentPmemUsage, curRssMemUsageOfAgedProcesses, pmemLimit)) {
                msg = formatErrorMessage("physical", currentVmemUsage, vmemLimit,
                    currentPmemUsage, pmemLimit, pId, containerId, pTree);
                isMemoryOverLimit = true;
            }
        }
    }
}
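The excerpt calls isProcessTreeOverLimit without showing its body. The rule stated in the comment can be sketched as follows. This is a minimal illustration written from that comment, not the Hadoop source; the class name is hypothetical, and the sample figures assume 4096-byte pages and that both processes in the dump had already survived more than one monitoring pass.

```java
// Hypothetical helper: restates the over-limit rule from the comment in the excerpt.
final class OverLimitSketch {
    // currentUsage: memory of all processes in the tree
    // agedUsage:    memory of the processes with age greater than 1
    static boolean isOverLimit(long currentUsage, long agedUsage, long limit) {
        if (currentUsage > 2 * limit) {
            return true;          // whole tree above twice the limit
        }
        return agedUsage > limit; // long-lived processes alone above the limit
    }

    public static void main(String[] args) {
        long limit = 1024L * 1024 * 1024; // 1 GB physical limit

        // (309 + 264098) pages from the dump, assuming 4096-byte pages.
        long treeRss = (309L + 264098L) * 4096L;
        System.out.println("aged tree over limit: " + isOverLimit(treeRss, treeRss, limit));

        // A tree that briefly reaches 1.5x the limit because of a brand-new child
        // is tolerated as long as the aged processes stay under the limit.
        long burst = limit + limit / 2;
        long aged = limit * 3 / 4;
        System.out.println("young burst over limit: " + isOverLimit(burst, aged, limit));
    }
}
```

The two-tier rule gives newly started processes some slack: a tree is killed immediately only when it is far over the limit, while processes that have been around for more than one pass are held to the limit itself.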
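So a Task can be killed even when no JVM process has actually run out of heap, and the corresponding parameters have to be tuned properly. A bigger heap is not always better either; getting the proportions of the generations right matters just as much.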