Writing to drop_caches on Linux triggers ORA-600 [KGHLKREM1]

Posted by skuary on 2012-06-26

Yesterday I ran the following command on all three nodes of the main site:

echo 3 > /proc/sys/vm/drop_caches

This immediately brought down the instance on one of the nodes, node 2. The relevant alert log entries are:

Mon Jun 25 17:06:51 CST 2012
Errors in file /oracle/admin/yesmynet/bdump/yesmynet2_lmon_10048.trc:
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x4BC000020], [], [], [], [], [], []
Mon Jun 25 17:06:52 CST 2012
Trace dumping is performing id=[cdmp_20120625170652]
Mon Jun 25 17:06:52 CST 2012
Errors in file /oracle/admin/yesmynet/bdump/yesmynet2_lmon_10048.trc:
ORA-00600: internal error code, arguments: [KGHLKREM1], [0x4BC000020], [], [], [], [], [], []
Mon Jun 25 17:06:52 CST 2012
LMON: terminating instance due to error 481
Mon Jun 25 17:06:52 CST 2012
Shutting down instance (abort)
License high water mark = 798
Mon Jun 25 17:06:57 CST 2012
Instance terminated by LMON, pid = 10048
Mon Jun 25 17:06:57 CST 2012
Instance terminated by USER, pid = 29345

As the log shows, at 17:06 the LMON process terminated instance 2. The related MOS (My Oracle Support) note describes the problem as follows:

ORA-600 [KGHLKREM1] On Linux Using Parameter drop_cache On hugepages Configuration [ID 1070812.1]

  Modified 20-DEC-2011     Type PROBLEM     Status PUBLISHED  


asm1_lmd0_8600.trc
~~~~~~~~~~~~~~~~~~
*** 2010-02-08 15:57:38.274
***** Internal heap ERROR KGHLKREM1 addr=0x6c400020 ds=0x60000058 *****
***** Dump of memory around addr 0x6c400020:
06C3FF020 00000000 00000000 00000000 00000000 [................]
Repeat 511 times





 

Changes

1. On your system you are running with vm.drop_caches=1 (or 3), that is, drop_caches has been set to a value greater than zero, or you are executing

echo 3 > /proc/sys/vm/drop_caches


 

/proc/sys/vm/drop_caches (since Linux 2.6.16)
Writing to this file causes the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.

To free pagecache:

* echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:

* echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:

* echo 3 > /proc/sys/vm/drop_caches

As this is a non-destructive operation, and dirty objects are not freeable, the user should run "sync" first in order to make sure all cached objects are freed.


2. You have set up hugepages.

Cause

This is a Linux Kernel issue.
Using the Linux kernel "drop_caches" parameter while hugepages are configured can cause memory corruption.

Per internal Bug 9461825, executing vm.drop_caches corrupts Oracle Database SGA hugepages;
it is fixed in Linux Kernel version 2.6.18-194.0.0.0.4.EL5


Solution

1.  As a workaround, when hugepages are configured, avoid any vm.drop_caches settings.

OR

2.  Upgrade to Linux Kernel version 2.6.18-194.0.0.0.4.EL5


References

BUG:9358381 - ASM INSTANCE IS CRASHING AS ORA-600[KGHLKREM1] WHEN HUGEPAGES ARE IN USE
https://bugzilla.redhat.com/show_bug.cgi?id=578977

Of the three nodes, only node 2 uses hugepages:

[root@rac2 ~]# grep Huge /proc/meminfo
HugePages_Total:  9885
HugePages_Free:   9836
HugePages_Rsvd:   4868
Hugepagesize:     2048 kB
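
These counters can be combined into a single figure. The following is a minimal sketch, assuming the usual kernel semantics: HugePages_Total minus HugePages_Free is the number of pages already faulted in, and HugePages_Rsvd counts pages reserved but not yet touched. The function name hugepage_summary is hypothetical and is not part of any Oracle or Linux tool.

```shell
#!/bin/sh
# Hypothetical helper: summarize huge page usage from meminfo-style input.
# Reads the file named in $1 (default /proc/meminfo); "-" means standard input.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }
        /^HugePages_Free:/  { free = $2 }
        /^HugePages_Rsvd:/  { rsvd = $2 }
        /^Hugepagesize:/    { size_kb = $2 }
        END {
            # committed = pages already faulted in + pages reserved but untouched
            committed = total - free + rsvd
            printf "faulted=%d reserved=%d committed=%d committed_mib=%d\n",
                   total - free, rsvd, committed, committed * size_kb / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Usage with the figures from node 2 shown above.
hugepage_summary - <<'EOF'
HugePages_Total:  9885
HugePages_Free:   9836
HugePages_Rsvd:   4868
Hugepagesize:     2048 kB
EOF
# prints: faulted=49 reserved=4868 committed=4917 committed_mib=9834
```

On these numbers, roughly 9.6 GiB of the pool is either in use or reserved, which is consistent with an SGA backed by hugepages on this node.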
 
The Linux kernel version is:
 
[root@rac2 ~]# uname -a
Linux rac2 2.6.18-128.el5
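
To compare a running kernel against the fixed version named in the MOS note, a version-aware sort can help. This is only a sketch: it assumes a sort that supports -V (GNU coreutils does), and kernel_older_than is a hypothetical helper name. Vendor kernels may carry backported fixes under other version strings, so treat the result as a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: return 0 when version $1 sorts strictly before
# version $2 under version-aware ordering (assumes sort supports -V).
kernel_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Version string taken from the MOS note quoted in this post.
fixed="2.6.18-194.0.0.0.4.EL5"

# Usage with the kernel release from node 2; on a live system
# "$(uname -r)" could be passed as the first argument instead.
if kernel_older_than "2.6.18-128.el5" "$fixed"; then
    echo "kernel predates $fixed"
fi
# prints: kernel predates 2.6.18-194.0.0.0.4.EL5
```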

The takeaway: on Linux, when hugepages are in use, avoid writing to drop_caches casually; otherwise, upgrade the Linux kernel to 2.6.18-194.0.0.0.4.EL5.
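
One way to make this rule harder to forget is a small guard that checks for hugepages before anyone drops caches. The sketch below is an illustration under stated assumptions: hugepages_total and safe_to_drop_caches are hypothetical names, the guard treats any non-zero HugePages_Total as unsafe regardless of kernel version, and it only reports its decision instead of writing to /proc/sys/vm/drop_caches.

```shell
#!/bin/sh
# Hypothetical helper: print HugePages_Total from a meminfo-style file
# (default /proc/meminfo); prints 0 when the field is absent.
hugepages_total() {
    awk '/^HugePages_Total:/ { print $2; found = 1 }
         END { if (!found) print 0 }' "${1:-/proc/meminfo}"
}

# Hypothetical guard: succeed only when no huge pages are configured.
safe_to_drop_caches() {
    [ "$(hugepages_total "$1")" -eq 0 ]
}

# Dry run: report the decision rather than touching drop_caches.
if safe_to_drop_caches; then
    echo "no huge pages configured: sync followed by a drop_caches write would be allowed"
else
    echo "huge pages configured or state unknown: skipping drop_caches"
fi
```

The guard is deliberately conservative: it refuses whenever hugepages exist, which matches workaround 1 in the MOS note and avoids relying on kernel version strings.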

Finally, I shut down all the cluster processes on node 2 and started them again, and everything returned to normal!

Noting this down for future reference~~

From the "ITPUB Blog", link: http://blog.itpub.net/25618347/viewspace-733804/. Please credit the source when reposting; otherwise legal liability will be pursued.
