這個案例發生有段時間了，但是今天無意中看到當時的郵件，感覺還是收益匪淺，看來還是細節決定成敗啊。從一些日誌或trace 檔案中的警告資訊中我們可以發掘出潛在的問題。
當時系統中的使用者數很小,所以每天都能抽時間看看日誌記錄,看有沒有明顯的問題。結果在grep的時候發現trace檔案中有一些警告記錄。
當時的庫是10gR2的庫，現在已經升級到了11gR2.
/xxxx/oracle/xxxxx/oradmp/udump/xxxxx01_ora_15070.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/app/oracle/xxxxxx/product/10.2.0
System name: Linux
Node name: xxxx
Release: 2.6.18-308.el5
Version: #1 SMP Fri Jan 27 17:17:51 EST 2012
Machine: x86_64
Instance name: xxxxxx
Redo thread mounted by this instance: 1
Oracle process number: 472
Unix process pid: 15070, image: oracle@xxxxx (TNS V1-V3)

*** 2013-03-29 20:18:54.711
*** ACTION NAME:() 2013-03-29 20:18:54.707
*** MODULE NAME:(OGG-ECC_ARCG-GLOPEN_DATA_SOURCE) 2013-03-29 20:18:54.707
*** SERVICE NAME:(SYS$USERS) 2013-03-29 20:18:54.707
*** SESSION ID:(5029.3122) 2013-03-29 20:18:54.707
WARNING:Could not increase the asynch I/O limit to 224 for SQL direct I/O. It is set to 128
WARNING:Could not increase the asynch I/O limit to 256 for SQL direct I/O. It is set to 128
WARNING:Could not increase the asynch I/O limit to 224 for SQL direct I/O. It is set to 128
*** 2013-03-29 22:25:26.506
WARNING:Could not increase the asynch I/O limit to 224 for SQL direct I/O. It is set to 128
WARNING:Could not increase the asynch I/O limit to 256 for SQL direct I/O. It is set to 128
WARNING:Could not increase the asynch I/O limit to 256 for SQL direct I/O. It is set to 128
WARNING:Could not increase the asynch I/O limit to 288 for SQL direct I/O. It is set to 128

從警告的資訊可以看出，和IO相關，可能是什麼引數的設定出問題了。檢視MOS 發現有一篇相關的文章。
Warning:Could Not Increase The Asynch I/O Limit To XX For Sql Direct I/O (Doc ID 1302633.1)
其中給出的solution如下：

	[root@xyz ~]# cat /proc/sys/fs/aio-max-nr

65536

SOLUTION

The aio-max-size kernel parameter doesn't exist in the 2.6.x Linux kernels.
This feature is now "automatic" in the 2.6.x kernel, based on the physical capabilities of the disk device driver.
This should mean that the Linux Kernel is ready to perform ASYNC I/O.

All install requirements should be met.

To ensure ASYNC I/O can be performed by Oracle Database you need to verify or set the following parameters in the Database:

	sql> alter system set disk_asynch_io=true scope=spfile;

sql> alter system set filesystemio_options=setall scope=spfile;

Then shutdown and startup the database and check if the warning reappears.
An HCVE report (refer to Note 250262.1) should report no remaining issues

If the above doesn't resolve the problem, then increase fs.aio-max-nr

關於檢視aio的內容可以使用如下的方式來檢視。
cat /proc/sys/fs/aio-max-nr
/sbin/sysctl -a |grep aio
當時得到的值是
fs.aio-max-nr = 65536
fs.aio-nr = 65472

可能一般來說問題處理到這個地方就告一段落了。
這個問題當時是叫給一個資深的專家來做的調優，他又要了一些其他的資訊。
最後給出的建議如下：
- Increase fs.aio-max-nr as advised in my initial e-mail to 3145728
- Change Oracle filesystemio_options to “SETALL” (enable both asynch and direct IO)

Following is suggested:
- VXFS – change mount options to:
mincache=direct,convosync=direct (Must be done along with the change of the Oracle filesystemio_options)
- Enable ODM
Pending upon the VXFS version – and True VXFS license supporting ODM (should be by default with VXFS 5.X)
Enable Veritas ODM - by doing “ln –s /opt/VRTSodm/lib64/libodm.so $ORACLE_HOME/lib/libodm10.so”
- Memory
As server has sufficient memory – consult with DBAs if it is worthwhile increasing Oracle cache
- Power
Consider disabling power management savings for better response / latency (BIOS level)
- Disable C-State
RH 5.X do not support C-State (BIOS level)

Kernel parameters:
vm.min_free_kbytes = 32768
vm.dirty_expire_centisecs = 500
vm.dirty_ratio = 15
vm.dirty_writeback_centisecs = 100
vm.swappiness = 0
vm.hugetlb_shm_group = <set to group id of user Oracle)
vm.nr_hugepages = 16384 # set hugepages to 32 GB - if total of Oracle instances cache grows beyond must increase it accordingly it is 2 MB pages)
# any increase requires a reboot !!

其中關於huge page的部分還是有點意思。其中關於設定的Hugepage數還可以透過一個指令碼來計算。
當時看得雲裡霧裡，詳細的學習了一下關於hugepage的設定，還是有所收穫。
其實在官方文件中已經有詳細的描述。hugepage的配置可以參考文件。

發現官方文件還是很有權威性和價值的。
計算hugepage的指令碼如下：

#!/bin/bash
#
# hugepages_settings.sh
#
# Linux bash script to compute values for the
# recommended HugePages/HugeTLB configuration
#
# Note: This script does calculation for all shared memory
# segments available when the script is run, no matter it
# is an Oracle RDBMS shared memory segment or not.
# Check for the kernel version
KERN=`uname -r | awk -F. '{ printf("%d.%d\n",$1,$2); }'`
# Find out the HugePage size
HPG_SZ=`grep Hugepagesize /proc/meminfo | awk {'print $2'}`
# Start from 1 pages to be on the safe side and guarantee 1 free HugePage
NUM_PG=1
# Cumulative number of pages required to handle the running shared memory segments
for SEG_BYTES in `ipcs -m | awk {'print $5'} | grep "[0-9][0-9]*"`
do
   MIN_PG=`echo "$SEG_BYTES/($HPG_SZ*1024)" | bc -q`
   if [ $MIN_PG -gt 0 ]; then
      NUM_PG=`echo "$NUM_PG+$MIN_PG+1" | bc -q`
   fi
done
# Finish with results
case $KERN in
   '2.4') HUGETLB_POOL=`echo "$NUM_PG*$HPG_SZ/1024" | bc -q`;
          echo "Recommended setting: vm.hugetlb_pool = $HUGETLB_POOL" ;;
   '2.6') echo "Recommended setting: vm.nr_hugepages = $NUM_PG" ;;
    *) echo "Unrecognized kernel version $KERN. Exiting." ;;
esac
# End

所以問題的分析和解決思路還是最重要的，可能大多數時候碰到問題都是頭疼醫頭，腳疼醫腳。處理問題還是沒有從整體的情況來做，以點帶面，能夠舉一反三才是問題處理的高手。

由一條日誌警告所做的調優分析

SOLUTION

相關文章