Oracle process killed by mistake when the system ran out of memory: terminating the instance due to error 822

Posted by shawnloong on 2015-09-21
Today I received an alert email: the Oracle process was no longer running.
Alarm Time:2015-09-21 17:45:38
Trigger: Alive xyxdb_oa
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:
Item values:
1. Alive (x.x.x.x:alive): 0
2. *UNKNOWN* (x.x.x.x :*UNKNOWN*): *UNKNOWN*
Original event ID: 760121


The alert log showed:
System state dump requested by (instance=1, osid=2044 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/xyxdbp/xyxdb/trace/xyxdb_diag_2062_20150921174417.trc
Mon Sep 21 17:44:18 2015
PMON (ospid: 2044): terminating the instance due to error 822
Dumping diagnostic data in directory=[cdmp_20150921174417], requested by (instance=1, osid=2044 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 2044
Mon Sep 21 17:46:39 2015
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = 64 KB

Total Shared Global Region in Large Pages = 0 KB (0%)

Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB

RECOMMENDATION:
  Total System Global Area size is 3282 MB. For optimal performance,
  prior to the next instance restart:
  1. Increase the number of unused large pages by
at least 1641 (page size 2048 KB, total size 3282 MB) system wide to
  get 100% of the System Global Area allocated with large pages
  2. Large pages are automatically locked into physical memory.
Increase the per process memlock (soft) limit to at least 3290 MB to lock
100% System Global Area's large pages into physical memory
********************************************************************
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 6
Number of processor cores in the system is 6
Number of processor sockets in the system is 1
CELL communication is configured to use 0 interface(s):
CELL IP affinity details:
    NUMA status: non-NUMA system
    cellaffinity.ora status: N/A
CELL communication will use 1 IP group(s):
    Grp 0:
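
The Large Pages banner in the log above reports that none of the 3282 MB SGA is backed by large pages. The page count it recommends is simply the SGA size divided by the large page size, rounded up. A minimal sketch of that arithmetic follows; the function name `hugepages_needed` is hypothetical, and on Linux the result would normally be applied through the `vm.nr_hugepages` sysctl.

```shell
#!/bin/bash
# hugepages_needed SGA_MB PAGE_KB   (hypothetical helper, not part of any tool)
# Prints how many large pages are needed to hold an SGA of SGA_MB megabytes
# when each large page is PAGE_KB kilobytes, rounding up.
hugepages_needed() {
    local sga_mb=$1 page_kb=$2
    echo $(( (sga_mb * 1024 + page_kb - 1) / page_kb ))
}

# Values taken from the alert log: 3282 MB SGA, 2048 KB large pages.
hugepages_needed 3282 2048
```

With the values from the log this gives 1641 pages, the same number the banner recommends.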



[root@OA01-1-24 scripts]# cat /proc/50966/oom_
oom_adj        oom_score      oom_score_adj 
[root@OA01-1-24 scripts]# cat /proc/50966/oom_adj
0
[root@OA01-1-24 scripts]# vim oomscore.sh
[root@OA01-1-24 scripts]# chmod u+x oomscore.sh
[root@OA01-1-24 scripts]# ./oomscore.sh
63 37608 /usr/bin/java -Djava.util.logging.config.file=/usr
31 51010 ora_mman_xyxdb
20 37579 /usr/bin/java -Djava.util.logging.config.file=/usr
16 51938 /usr/java/jdk1.7.0_79/jre/bin/java -Djava.util.log
14 51496 oraclexyxdb (LOCAL=NO)
13 51026 ora_smon_xyxdb
8 51167 oraclexyxdb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROT
7 51034 ora_mmon_xyxdb
7 51014 ora_dbw0_xyxdb
6 51480 oraclexyxdb (LOCAL=NO)

Checking the system log:
Sep 21 17:44:15 OA01-1-24 kernel: [39519]   500 39519   900699     5848   2       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [39521]   500 39521   900699     5877   5       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [42514]   500 42514   900846    10963   1       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [42578]   500 42578   900706     9012   1       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [43519]     0 43519    24998     1489   5       0             0 sshd
Sep 21 17:44:15 OA01-1-24 kernel: [43533]     0 43533    14309      550   5       0             0 sftp-server
Sep 21 17:44:15 OA01-1-24 kernel: [43557]     0 43557    14432      671   5       0             0 sftp-server
Sep 21 17:44:15 OA01-1-24 kernel: [44331]    89 44331    20234      861   2       0             0 pickup
Sep 21 17:44:15 OA01-1-24 kernel: [44491]     0 44491  1107908   148835   4       0             0 java
Sep 21 17:44:15 OA01-1-24 kernel: [44684]   500 44684   900015     4658   0       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45199]   500 45199   900699     5525   3       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45201]   500 45201   900699     5548   4       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45203]   500 45203   900704     8184   5       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45211]   500 45211   900699     5506   0       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45213]   500 45213   900699     5504   4       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45901]     0 45901  1051478   117538   2       0             0 java
Sep 21 17:44:15 OA01-1-24 kernel: [45943]   500 45943   900956     7194   0       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45945]   500 45945   900315     5444   1       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [45947]   500 45947   900315     5423   5       0             0 oracle
Sep 21 17:44:15 OA01-1-24 kernel: [46232]     0 46232    25226      152   4       0             0 sleep
Sep 21 17:44:15 OA01-1-24 kernel: Out of memory: Kill process 2074 (oracle) score 125 or sacrifice child
Sep 21 17:44:15 OA01-1-24 kernel: Killed process 2074, UID 500, (oracle) total-vm:3600064kB, anon-rss:3444kB, file-rss:1510892kB
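
The last line of the log excerpt above is the one that names the victim. A minimal sketch for pulling those lines out of a syslog file is shown below. The function name `oom_victims` and the default path `/var/log/messages` are assumptions, and the message format differs between kernel versions, so the pattern may need adjusting.

```shell
#!/bin/bash
# oom_victims [LOGFILE]   (hypothetical helper)
# Prints "PID UID NAME" for every process the OOM killer reports as killed.
# LOGFILE defaults to /var/log/messages, which is an assumption about the
# syslog configuration on this host.
oom_victims() {
    local logfile=${1:-/var/log/messages}
    # Matches lines such as: "Killed process 2074, UID 500, (oracle) total-vm:..."
    sed -nE 's/.*Killed process ([0-9]+), UID ([0-9]+), \(([^)]*)\).*/\1 \2 \3/p' "$logfile"
}
```

Run against the log above, this would report PID 2074, UID 500, name oracle.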

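This usually happens when an application suddenly requests a large amount of memory and the system runs short. That triggers the Out of Memory (OOM) killer in the Linux kernel, which kills a process to free memory for the system so that it does not crash outright.
It later turned out that developers had deployed two Tomcat applications on this DB server, and an application fault made them consume a large amount of memory.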
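
To see who was holding memory at the moment of the kill, the task dump that the kernel writes before choosing a victim can be summed per process name. The sketch below assumes the column layout seen in the log above, where the last five columns are rss (in pages), cpu, oom_adj, oom_score_adj and name, and it assumes 4 KB pages. The function name `rss_by_name` and the default log path are hypothetical.

```shell
#!/bin/bash
# rss_by_name [LOGFILE]   (hypothetical helper)
# Sums the rss column of the OOM killer's task dump per process name and
# prints "NAME RSS_KB", largest first.
# Assumptions: rss is the 5th column from the end, and a page is 4 KB.
rss_by_name() {
    local logfile=${1:-/var/log/messages}
    awk '/kernel: \[ *[0-9]+\]/ { rss[$NF] += $(NF-4) }
         END { for (n in rss) printf "%s %d\n", n, rss[n] * 4 }' "$logfile" |
        sort -k2,2nr
}
```

Note that each Oracle process counts the shared SGA pages it has touched in its own rss, so the total for `oracle` overstates real usage; the `java` total is the more telling figure here.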

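We can adjust the per-process OOM settings under /proc so that key processes are less likely to be killed.
First, use a script to find the processes most likely to be killed: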
# vi oomscore.sh
#!/bin/bash
# Print the 10 processes with the highest oom_score: score, PID, command line.
for proc in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
    printf "%2d %5d %s\n" \
        "$(cat "$proc/oom_score")" \
        "$(basename "$proc")" \
        "$(tr '\0' ' ' < "$proc/cmdline" | head -c 50)"
done 2>/dev/null | sort -nr | head -n 10


[root@OA01-1-24 scripts]# ./oomscore.sh
63 37608 /usr/bin/java -Djava.util.logging.config.file=/usr
31 51010 ora_mman_xyxdb
20 37579 /usr/bin/java -Djava.util.logging.config.file=/usr
16 51938 /usr/java/jdk1.7.0_79/jre/bin/java -Djava.util.log
14 51496 oraclexyxdb (LOCAL=NO)
13 51026 ora_smon_xyxdb
8 51167 oraclexyxdb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROT
7 51034 ora_mmon_xyxdb
7 51014 ora_dbw0_xyxdb
6 51480 oraclexyxdb (LOCAL=NO)
[root@OA01-1-24 scripts]# cat /proc/51034/oom_score
7
[root@OA01-1-24 scripts]# cat /proc/51034/oom_score_adj
0
[root@OA01-1-24 scripts]# echo -15 >/proc/51034/oom_adj
[root@OA01-1-24 scripts]# cat /proc/51034/oom_score
1
[root@OA01-1-24 scripts]# cat /proc/51026/oom_adj
0
[root@OA01-1-24 scripts]# cat /proc/51026/oom_score
13
[root@OA01-1-24 scripts]# echo -15 >/proc/51026/oom_adj
[root@OA01-1-24 scripts]# cat /proc/51026/oom_adj
-15
[root@OA01-1-24 scripts]# cat /proc/51026/oom_score
1
[root@OA01-1-24 scripts]# echo -15 >/proc/51010/oom_adj
[root@OA01-1-24 scripts]# cat /proc/51010/oom_score
1
[root@OA01-1-24 scripts]# ./oomscore.sh
63 37608 /usr/bin/java -Djava.util.logging.config.file=/usr
20 37579 /usr/bin/java -Djava.util.logging.config.file=/usr
16 51938 /usr/java/jdk1.7.0_79/jre/bin/java -Djava.util.log
14 51496 oraclexyxdb (LOCAL=NO)
8 51167 oraclexyxdb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROT
7 51014 ora_dbw0_xyxdb
6 51480 oraclexyxdb (LOCAL=NO)
5 52007 oraclexyxdb (LOCAL=NO)
5 51474 oraclexyxdb (LOCAL=NO)
5 51466 oraclexyxdb (LOCAL=NO)
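
The session above writes to `oom_adj`, which newer kernels deprecate in favour of `oom_score_adj` (range -1000 to 1000). The setting is also per process, so it is lost when the instance restarts with new PIDs. The sketch below shows one way to script the same adjustment. The function names and the `PROC_ROOT` variable are hypothetical, and the conversion mirrors the linear scaling that, as far as I know, the kernel applies when `oom_adj` is written.

```shell
#!/bin/bash
# oom_adj_to_score_adj ADJ   (hypothetical helper)
# Converts a legacy oom_adj value (-17..15) to the oom_score_adj scale.
# Assumption: linear scaling by 1000/17, with 15 mapped to the maximum 1000.
oom_adj_to_score_adj() {
    local adj=$1
    if [ "$adj" -eq 15 ]; then
        echo 1000
    else
        echo $(( adj * 1000 / 17 ))
    fi
}

# set_oom_score_adj VALUE PID...   (hypothetical helper)
# Writes VALUE to oom_score_adj for each PID; needs root on a real system.
# PROC_ROOT exists only so the function can be tried against a scratch directory.
set_oom_score_adj() {
    local value=$1; shift
    local proc_root=${PROC_ROOT:-/proc}
    local pid
    for pid in "$@"; do
        if [ -w "$proc_root/$pid/oom_score_adj" ]; then
            echo "$value" > "$proc_root/$pid/oom_score_adj"
        fi
    done
}

# Example (commented out): protect the background processes adjusted above.
# set_oom_score_adj "$(oom_adj_to_score_adj -15)" $(pgrep -f '^ora_(mman|smon|mmon)_xyxdb')
```

Because the values do not survive a restart, something like this would have to be rerun after every instance startup, for example from a startup script or cron job.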

Later I found one more problem, this time with the swap configuration:
[root@OA01-1-24 scripts]# cat /proc/sys/vm/swappiness
0
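A value of 0 here means swap is effectively not used. Strictly speaking it does not disable swap; it tells the kernel to avoid swapping anonymous memory for as long as it can, so under memory pressure the OOM killer may fire before swap is touched.
The system engineer overlooked this when changing the setting. For Oracle it is best not to turn swap off.
After changing it back: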
[root@OA01-1-24 scripts]# cat /proc/sys/vm/swappiness
60
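
The original session does not show how the value was changed. A minimal sketch of one common way to do it is below: it reads the current value and prints the `sysctl` command to run if it differs from the target. The function name `check_swappiness` is hypothetical, and the optional file argument exists only so the check can be tried on a scratch file.

```shell
#!/bin/bash
# check_swappiness TARGET [FILE]   (hypothetical helper)
# Compares the current vm.swappiness with TARGET and prints what to run.
# FILE defaults to /proc/sys/vm/swappiness.
check_swappiness() {
    local target=$1 file=${2:-/proc/sys/vm/swappiness}
    local current
    current=$(cat "$file")
    if [ "$current" -eq "$target" ]; then
        echo "vm.swappiness is already $target"
    else
        echo "vm.swappiness is $current; to change it run: sysctl -w vm.swappiness=$target"
    fi
}
```

`sysctl -w` only changes the running kernel. To keep the value across reboots, a line `vm.swappiness = 60` is usually added to /etc/sysctl.conf and loaded with `sysctl -p`.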
Summary: keep DB servers dedicated to the database whenever possible; otherwise all sorts of unexpected things will happen.

From the ITPUB blog, link: http://blog.itpub.net/24486203/viewspace-1805598/. If you repost this article, please credit the source; otherwise legal action may be taken.
