A case study: system slowness during an RMAN backup and how it was resolved

Posted by kewin on 2010-01-04
I stumbled across this while searching for material on a foreign forum, and I am recording it here for reference.
Original source: http://www.tek-tips.com/viewthread.cfm?qid=1428879&page=41
To save myself some work I did not translate it; it is quoted below as copied (Ctrl+C, Ctrl+V).
The problem shows up as following, when they do an Online Backup of their DBs:
- Up to 50 kthreads in the r-queue in vmstat, while up to about 5-7 are in the b-queue.
- No paging space in/out occurs.
- minperm% is 5 all the time
- With lru_file_repage=0, lrud takes about 25%-60% of CPU, blocking the rest of the system so severely that typing in a terminal/shell is nearly impossible or lags very badly.
- With lru_file_repage=1 and maxperm% at 50% or 80%, numperm% climbs to whichever limit is actually set, and lrud gets just as busy as described above for lru_file_repage=0.
- The fre column in vmstat is about 300k pages while the problem is occurring, so minfree and maxfree are not being hit, I think.
- Traffic on the disks is up to 30 MB/sec and doesn't look bad.
- The disk arrays themselves are in good shape.
- The filesystems are JFS2 and have neither cio nor dio activated.
- There is a veeery slow rate 
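
For reference, the VMM tunables mentioned in the list above (lru_file_repage, minperm%, maxperm%) are inspected and changed with vmo on AIX. A minimal sketch, using only the values quoted in the post; everything else (the grep filter, whether to add -p) is just illustration:

# Show the current values of the tunables discussed above
vmo -a | grep -E 'lru_file_repage|minperm%|maxperm%|maxclient%'

# Toggle lru_file_repage for the running system (add -p to persist across reboots)
vmo -o lru_file_repage=0

# Set the file-cache limits mentioned in the post
vmo -o minperm%=5 -o maxperm%=80 -o maxclient%=80
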
root@srdbhv05:/oracle> vmstat -t -I 1
System Configuration: lcpu=4 mem=16384MB
  kthr      memory             page              faults          cpu        time
-------- ----------- ------------------------ ------------ ----------- --------
 r  b  p   avm   fre  fi  fo  pi  po  fr  sr   in   sy  cs us sy id wa hr mi se
 6  2  0 847720 139620 1016 998   0   0 1665 5203 854 10912 1151  3 20 69  8 17:25:53
13  0  0 847664 139775 1165 1367   0   0 2462 4877 1085 9277 1075  2 80  6 12 17:25:54
 2  6  0 847668 139770 1029 875   0   0 2207 3606 774 23026 848  3 52  1 44 17:25:55
34  1  0 847630 139920 1163 858   0   0 1132 1912 772 7819 541  1 80  1 18 17:25:57
38  0  0 849085 138478 517 282   0   0 1093 1706 730 14130 656  1 74 17  8 17:25:58
 4  1  0 848920 138638 259 792   0   0 901 1445 688 3528 477  1 70 19 11 17:25:59
17  6  0 847644 139918 1887 1970   0   0 3738 6398 1158 21265 1925  5 75  7 12 17:26:00
 2  2  0 847806 139436   2 549   0   0 804 1405 667 10602 623  1 59 16 24 17:26:01
19  1  0 848161 139395 519 415   0   0 772 1351 574 8230 466  2 55 22 20 17:26:02
45  2  0 848396 138791 1208 1364   0   0 2664 6237 827 14631 989  4 91  1  4 17:26:03
11  6  0 848015 139544 1026 802   0   0 1673 4141 766 10809 628  2 62  5 32 17:26:04
 3  2  0 847649 139692 132 790   0   0 1217 2296 682 21417 722  4 62  0 34 17:26:05


Topas Monitor for host:    srdbhv05             EVENTS/QUEUES    FILE/TTY
Wed Nov 21 17:27:44 2007   Interval:  1         Cswitch    1019  Readch  7337.8K
                                                Syscall    7906  Writech 6134.5K
Kernel   74.2   |#####################       |  Reads       251  Rawin         0
User      5.1   |##                          |  Writes      196  Ttyout      632
Wait     19.7   |######                      |  Forks         7  Igets         0
Idle      1.0   |#                           |  Execs         5  Namei       315
                                                Runqueue    7.4  Dirblk        0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   1.6
en0     151.4    161.8   155.4    66.8    84.6
lo0       2.5     15.3    15.3     1.3     1.3  PAGING           MEMORY
en1       0.3      3.8     1.3     0.2     0.1  Faults     2076  Real,MB   16383
                                                Steals     4512  % Comp     21.0
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  76.6
skpower0 99.3   6792.4   300.6  6527.4   265.0  PgspOut       0  % Client   76.9
hdisk27  98.0   4484.1   205.1  4346.5   137.6  PageIn     1956
skpower2 95.5   3913.4     3.8     0.0  3913.4  PageOut    2357  PAGING SPACE
hdisk0   95.5     25.5     6.4     0.0    25.5  Sios       4771  Size,MB    6912
hdisk38  95.5   2935.0    24.2     0.0  2935.0                   % Used      0.7
                                                NFS (calls/sec)  % Free     99.2
Name            PID  CPU%  PgSp Owner           ServerV2       0
lrud         143430  45.7   0.1 root            ClientV2       0   Press:
oracle      1163354  22.0   6.8 oracle          ServerV3       0   "h" for help
sshd        2404354   1.0   0.7 root            ClientV3       0   "q" to quit
aioserve    1671306   1.0   0.1 root
oracle       794768   0.6   6.8 oracle
oracle      1212514   0.6  15.5 oracle
oracle       868558   0.6   6.8 oracle
Here the lrud kernel thread shows very high CPU usage, and every disk is above 95% utilization. This basically rules out Oracle itself, so the effort can be focused on the operating-system level, or on the area where the OS and Oracle interact.
root@srdbhv05:/oracle> vmstat -v
              4194304 memory pages
              4008879 lruable pages
               140234 free pages
                    3 memory pools
               266928 pinned pages
                 80.1 maxpin percentage
                  5.0 minperm percentage
                 80.0 maxperm percentage
                 79.6 numperm percentage
              3194452 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 79.9 numclient percentage
                 80.0 maxclient percentage
              3206731 client pages
                    0 remote pageouts scheduled
                    0 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2740 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                    0 external pager filesystem I/Os blocked with no fsbuf


root@srdbhv05:/oracle> vmo -x
cpu_scale_memp,8,8,8,1,64,,B,
data_stagger_interval,161,161,161,0,4095,4KB pages,D,lgpg_regions
defps,1,1,1,0,1,boolean,D,
force_relalias_lite,0,0,0,0,1,boolean,D,
framesets,2,2,2,1,10,,B,
htabscale,n/a,-1,-1,-4,0,,B,
kernel_heap_psize,4096,4096,4096,4096,16777216,bytes,B,lgpg_size
large_page_heap_size,0,0,0,0,9223372036854775807,bytes,B,lgpg_size
lgpg_regions,0,0,0,0,,,B,lgpg_size
lgpg_size,0,0,0,0,16777216,bytes,B,lgpg_regions
low_ps_handling,1,1,1,1,2,,D,
lru_file_repage,1,1,1,0,1,boolean,D,
lru_poll_interval,10,0,10,0,60000,milliseconds,D,
lrubucket,131072,131072,131072,65536,,4KB pages,D,
maxclient%,80,80,80,1,100,% memory,D,maxperm%
maxfree,1088,128,1088,16,204800,4KB pages,D,minfree memory_frames
maxperm,3207102,,3207102,,,,S,
maxperm%,80,80,80,1,100,% memory,D,minperm% maxclient%
maxpin,3355444,,3355444,,,,S,
maxpin%,80,80,80,1,99,% memory,D,pinnable_frames memory_frames
mbuf_heap_psize,4096,4096,4096,4096,16777216,bytes,B,
memory_affinity,1,1,1,0,1,boolean,B,
memory_frames,4194304,,4194304,,,4KB pages,S,
mempools,1,1,1,1,256,,B,
minfree,960,1080,960,8,204800,4KB pages,D,maxfree memory_frames
minperm,200443,,200443,,,,S,
minperm%,5,20,5,1,100,% memory,D,maxperm%
nokilluid,0,0,0,0,4294967295,uid,D,
npskill,13824,13824,13824,1,1769471,4KB pages,D,
npswarn,55296,55296,55296,0,1769471,4KB pages,D,
num_spec_dataseg,0,0,0,0,,,B,
numpsblks,1769472,,1769472,,,4KB blocks,S,
pagecoloring,n/a,0,0,0,1,boolean,B,
pinnable_frames,3927417,,3927417,,,4KB pages,S,
pta_balance_threshold,n/a,50,50,0,99,% pta segment,R,
relalias_percentage,0,0,0,0,32767,,D,
soft_min_lgpgs_vmpool,0,0,0,0,90,%,D,lgpg_size
spec_dataseg_int,512,512,512,0,,,B,
strict_maxclient,1,1,1,0,1,boolean,D,
strict_maxperm,0,0,0,0,1,boolean,D,
v_pinshm,0,0,0,0,1,boolean,D,
vmm_fork_policy,0,0,0,0,1,boolean,D,
The final solution was to enable CIO (concurrent I/O), which resolved the problem.
Currently it is running very smoothly - we mounted the filesystems with the "cio" option and set "FILESYSTEMIO_OPTIONS=setall" in the init.ora, as advised in some of the tuning guides, and now it uses all the available AIO servers it can get, 2000 (5 CPUs now and 400 maxservers). Almost no blocked kthreads, low I/O wait, and lrud has nothing to do. Strange, but it works. It seems that more parallelisation of the I/O is the answer, though we don't know yet why this all happened. It must have been something other than just putting some filesystems on another storage system.
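
For reference, a minimal sketch of the change described above. The mount point /oradata is only a placeholder, and using chfs to add the option is an assumption; the post itself only says the filesystems were mounted with "cio" and FILESYSTEMIO_OPTIONS was set to setall:

# Add the cio mount option to the Oracle data filesystem (JFS2) and remount it
# /oradata is a hypothetical mount point; the database should be down for the remount
chfs -a options=cio /oradata
umount /oradata
mount /oradata

# Check the AIO subsystem; on this AIX level maxservers is per logical CPU,
# which matches the "2000 (5 CPUs now and 400 maxservers)" figure in the post
lsattr -El aio0

# Inside the Oracle instance, enable direct and asynchronous I/O as the post describes:
#   ALTER SYSTEM SET filesystemio_options=SETALL SCOPE=SPFILE;
# (the parameter is static, so it takes effect after the instance is restarted)
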
Further reading: http://download.oracle.com/docs/cd/B19306_01/server.102/b15658/appa_aix.htm#sthref643

-END-

PS: Recently I also ran into a problem where a CIO setting caused a build to fail. If I find the time, I will write that up in a separate post.

From the "ITPUB blog"; link: http://blog.itpub.net/40239/viewspace-624275/. Please credit the source when reposting.
