The culprit behind abnormal database crashes: the OOM killer

Posted by jolly10 on 2010-03-19

The database version is 10.2.0.4 and the operating system is Red Hat Enterprise Linux AS 4 Update 4.

Twice in the past week, one of our databases has terminated abnormally with the following errors:

Fri Mar 19 12:15:09 2010
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl_pmon_13421.trc:
ORA-00470: LGWR process terminated with error
Fri Mar 19 12:15:09 2010
PMON: terminating instance due to error 470


The trace file reads as follows:

/u01/app/oracle/admin/orcl/bdump/orcl_pmon_13421.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/10201
System name: Linux
Node name: qht113
Release: 2.6.9-55.ELsmp
Version: #1 SMP Fri Apr 20 17:03:35 EDT 2007
Machine: i686
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 13421, image: oracle@qht113 (PMON)

*** 2010-03-19 12:14:58.874
*** SERVICE NAME:(SYS$BACKGROUND) 2010-03-19 12:14:58.679
*** SESSION ID:(555.1) 2010-03-19 12:14:58.679
Background process LGWR found dead
......

error 470 detected in background process
*** 2010-03-19 12:15:09.299
ORA-00470: LGWR process terminated with error

According to the error messages, the LGWR process died, and PMON then terminated the instance.

Next, check the OS log for the same period:

# vi /var/log/messages

Mar 19 12:14:03 qht113 kernel: 0 bounce buffer pages
Mar 19 12:14:03 qht113 kernel: Free swap: 9988688kB
Mar 19 12:14:03 qht113 kernel: 1179648 pages of RAM
Mar 19 12:14:03 qht113 kernel: 818784 pages of HIGHMEM
Mar 19 12:14:03 qht113 kernel: 142184 reserved pages
Mar 19 12:14:03 qht113 kernel: 4277680 pages shared
Mar 19 12:14:03 qht113 kernel: 8025 pages swap cached
Mar 19 12:14:03 qht113 kernel: Out of Memory: Killed process 29907 (oracle).
Mar 19 12:14:03 qht113 kernel: oom-killer: gfp_mask=0xd0
Mar 19 12:14:03 qht113 kernel: Mem-info:
Mar 19 12:14:03 qht113 kernel: DMA per-cpu:
Mar 19 12:14:03 qht113 kernel: cpu 0 hot: low 2, high 6, batch 1
Mar 19 12:14:03 qht113 kernel: cpu 0 cold: low 0, high 2, batch 1
Mar 19 12:14:03 qht113 kernel: cpu 1 hot: low 2, high 6, batch 1
Mar 19 12:14:03 qht113 kernel: cpu 1 cold: low 0, high 2, batch 1
Mar 19 12:14:03 qht113 kernel: Normal per-cpu:
Mar 19 12:14:03 qht113 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 19 12:14:03 qht113 kernel: HighMem per-cpu:
Mar 19 12:14:03 qht113 kernel: cpu 0 hot: low 32, high 96, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 0 cold: low 0, high 32, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 1 hot: low 32, high 96, batch 16
Mar 19 12:14:03 qht113 kernel: cpu 1 cold: low 0, high 32, batch 16
Mar 19 12:14:03 qht113 kernel:
Mar 19 12:14:03 qht113 kernel: Free pages: 202364kB (189056kB HighMem)
Mar 19 12:14:03 qht113 kernel: Active:365041 inactive:591049 dirty:0 writeback:0 unstable:0 free:50591 slab:11215 mapped:392246 pagetables:14438
Mar 19 12:14:03 qht113 kernel: DMA free:12516kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:890 all_unreclaimable? yes
Mar 19 12:14:03 qht113 kernel: protections[]: 0 0 0
Mar 19 12:14:03 qht113 kernel: Normal free:792kB min:928kB low:1856kB high:2784kB active:343544kB inactive:458948kB present:901120kB pages_scanned:1035962 all_unreclaimable? yes
Mar 19 12:14:03 qht113 kernel: protections[]: 0 0 0
Mar 19 12:14:03 qht113 kernel: HighMem free:189056kB min:512kB low:1024kB high:1536kB active:1116620kB inactive:1905248kB present:3801088kB pages_scanned:0 all_unreclaimable? no
Mar 19 12:14:03 qht113 kernel: protections[]: 0 0 0
Mar 19 12:14:03 qht113 kernel: DMA: 1*4kB 2*8kB 3*16kB 3*32kB 3*64kB 3*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 2*4096kB = 12516kB
Mar 19 12:14:03 qht113 kernel: Normal: 12*4kB 13*8kB 2*16kB 1*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 792kB
Mar 19 12:14:03 qht113 kernel: HighMem: 1190*4kB 8601*8kB 6366*16kB 402*32kB 4*64kB 0*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 189056kB
Mar 19 12:14:03 qht113 kernel: Swap cache: add 12012517, delete 12004498, find 5695310/7634496, race 2+35
Mar 19 12:14:03 qht113 kernel: 0 bounce buffer pages
Mar 19 12:14:03 qht113 kernel: Free swap: 9988688kB
Mar 19 12:14:03 qht113 kernel: 1179648 pages of RAM
Mar 19 12:14:03 qht113 kernel: 818784 pages of HIGHMEM
Mar 19 12:14:03 qht113 kernel: 142184 reserved pages
Mar 19 12:14:03 qht113 kernel: 4141749 pages shared
Mar 19 12:14:03 qht113 kernel: 8025 pages swap cached
Mar 19 12:14:03 qht113 kernel: Out of Memory: Killed process 29909 (oracle).
Mar 19 12:14:03 qht113 kernel: oom-killer: gfp_mask=0xd0

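The log shows oracle processes being killed by the OOM killer, so it very likely killed LGWR as well. Looking back, when I added tablespaces to this database the session would often drop halfway through. I had always blamed the disks, but it now looks like the machine was short of memory and the OOM killer was killing those processes. Note where the shortage is: the Normal zone has only 792kB free, below its min watermark of 928kB, while HighMem still has 189056kB free and 9988688kB of swap is unused. In other words, it is low memory that is exhausted, not memory overall.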
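
To make the kill events easier to spot in a long /var/log/messages, here is a minimal Python sketch that pulls out the "Out of Memory: Killed process" lines. The regular expression is based only on the message format seen in the excerpt above; the function name `find_oom_kills` and the sample list are illustrative, and other kernel versions may word the message differently.

```python
import re

# Matches the kernel message format seen in the log excerpt above.
KILL_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Out of Memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def find_oom_kills(lines):
    """Return (timestamp, pid, process name) for each OOM kill message."""
    kills = []
    for line in lines:
        m = KILL_RE.match(line)
        if m:
            kills.append((m.group("ts"), int(m.group("pid")), m.group("name")))
    return kills


# Lines copied from the excerpt above; a real run would read /var/log/messages.
sample = [
    "Mar 19 12:14:03 qht113 kernel: Out of Memory: Killed process 29907 (oracle).",
    "Mar 19 12:14:03 qht113 kernel: oom-killer: gfp_mask=0xd0",
    "Mar 19 12:14:03 qht113 kernel: Out of Memory: Killed process 29909 (oracle).",
]

kills = find_oom_kills(sample)
for ts, pid, name in kills:
    print(ts, pid, name)
```

The excerpt does not show LGWR's pid, so this only narrows things down: comparing the killed pids against the background process pids recorded at instance startup would be needed to confirm that LGWR itself was a victim.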

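See Metalink notes 228203.1, 452000.1 and 445163.1.

Put simply, the OOM killer is a protection mechanism: when Linux runs out of memory, it kills processes it considers expendable so that the system as a whole does not fail more seriously. As seen here, its victims can include Oracle processes. RHEL 4 added a parameter for this situation, vm.lower_zone_protection. Its unit is MB, and according to the references, with the default value of 0 the protected LowMem is 16MB. The recommendation is to set vm.lower_zone_protection = 200 or even higher to relieve pressure and fragmentation in the LowMem area, which should resolve this problem.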
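
To see which memory zone is actually under pressure, the per-zone lines of the oom-killer dump can be compared against their min watermark. The following is a rough Python sketch, assuming the `free:` and `min:` field layout shown in the log excerpt above; `zones_below_min` is an illustrative name, and the sample lines are shortened to the fields the sketch uses.

```python
import re

# Per-zone summary lines look like: "kernel: Normal free:792kB min:928kB low:..."
ZONE_RE = re.compile(
    r"kernel: (?P<zone>DMA|Normal|HighMem) free:(?P<free>\d+)kB min:(?P<min>\d+)kB"
)


def zones_below_min(lines):
    """Map zone name -> (free kB, min kB, True if free is below min)."""
    result = {}
    for line in lines:
        m = ZONE_RE.search(line)
        if m:
            free_kb = int(m.group("free"))
            min_kb = int(m.group("min"))
            result[m.group("zone")] = (free_kb, min_kb, free_kb < min_kb)
    return result


# Values taken from the dump above, each line cut off after the "high" field.
sample = [
    "Mar 19 12:14:03 qht113 kernel: DMA free:12516kB min:16kB low:32kB high:48kB",
    "Mar 19 12:14:03 qht113 kernel: Normal free:792kB min:928kB low:1856kB high:2784kB",
    "Mar 19 12:14:03 qht113 kernel: HighMem free:189056kB min:512kB low:1024kB high:1536kB",
]

zones = zones_below_min(sample)
for zone, (free_kb, min_kb, starved) in zones.items():
    print(zone, free_kb, min_kb, "BELOW MIN" if starved else "ok")
```

For this dump only the Normal zone falls below its min watermark, which fits the LowMem explanation rather than a general lack of RAM or swap.
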
Fix:
# vi /etc/sysctl.conf (add the following parameter)
vm.lower_zone_protection=200
# sysctl -p (apply the setting)
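
As a quick sanity check after editing, the file can be parsed to confirm the parameter is really present and not commented out. This is only a sketch: `parse_sysctl_conf` is an illustrative helper, the sample text (including the kernel.shmmax value) is hypothetical, and a real check would read /etc/sysctl.conf itself.

```python
def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blank lines and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical excerpt; only the last line comes from the fix above.
sample_conf = """\
# hypothetical excerpt of /etc/sysctl.conf
kernel.shmmax = 2147483648
#vm.lower_zone_protection=100
vm.lower_zone_protection=200
"""

settings = parse_sysctl_conf(sample_conf)
print(settings.get("vm.lower_zone_protection"))
```

This only verifies what is written in the file; the value the kernel is actually using still depends on `sysctl -p` having been run.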

Tip: to disable (0) or re-enable (1) the OOM killer:

# echo "0" > /proc/sys/vm/oom-kill

# echo "1" > /proc/sys/vm/oom-kill

References:

http://hi.baidu.com/edeed/blog/item/03e5cd116ae4b816b9127bde.html

http://www.dbanotes.net/database/linux_outofmemory_oom_killer.html

Source: "ITPUB blog", link: http://blog.itpub.net/271283/viewspace-1032172/. If you repost this article, please credit the source; otherwise legal action may be taken.
