A case study: system slowness during an RMAN backup and how it was resolved

Posted by kewin on 2010-01-04
I stumbled across this while searching for material on a foreign forum, and I am recording it here for reference.
Original source: http://www.tek-tips.com/viewthread.cfm?qid=1428879&page=41
To save myself some work I did not translate it; it is quoted below as copied (Ctrl+C, Ctrl+V).
The problem shows up as following, when they do an Online Backup of their DBs:
- Up to 50 kthreads in the r-queue in vmstat, while up to about 5-7 are in the b-queue.
- No paging space in/out occurs.
- minperm% is 5 all the time
- With lru_file_repage=0, lrud takes about 25%-60% of CPU, blocking the rest of the system so severely that typing in a terminal/shell is nearly impossible or lags very badly.
- With lru_file_repage=1 and maxperm% at 50% or 80%, numperm% climbs to whichever limit is actually set, and lrud gets just as busy as described above for lru_file_repage=0.
- The fre column in vmstat is about 300k pages while the problem is occurring, so minfree and maxfree are not being hit, I think.
- Traffic on the disks is up to 30 MB/sec and doesn't look bad.
- The disk arrays themselves are in good shape.
- The filesystems are JFS2 and have neither cio nor dio activated.
- There is a veeery slow rate 
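
For reference, the VMM tunables mentioned in the list above (lru_file_repage, minperm%, maxperm%) are inspected and changed with vmo on AIX. A minimal sketch, using only the values quoted in the post; everything else (the grep filter, whether to add -p) is just illustration:

# Show the current values of the tunables discussed above
vmo -a | grep -E 'lru_file_repage|minperm%|maxperm%|maxclient%'

# Toggle lru_file_repage for the running system (add -p to persist across reboots)
vmo -o lru_file_repage=0

# Set the file-cache limits mentioned in the post
vmo -o minperm%=5 -o maxperm%=80 -o maxclient%=80
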
root@srdbhv05:/oracle> vmstat -t -I 1
System Configuration: lcpu=4 mem=16384MB
  kthr      memory             page              faults          cpu        time
-------- ----------- ------------------------ ------------ ----------- --------
 r  b  p   avm   fre  fi  fo  pi  po  fr  sr   in   sy  cs us sy id wa hr mi se
 6  2  0 847720 139620 1016 998   0   0 1665 5203 854 10912 1151  3 20 69  8 17:25:53
13  0  0 847664 139775 1165 1367   0   0 2462 4877 1085 9277 1075  2 80  6 12 17:25:54
 2  6  0 847668 139770 1029 875   0   0 2207 3606 774 23026 848  3 52  1 44 17:25:55
34  1  0 847630 139920 1163 858   0   0 1132 1912 772 7819 541  1 80  1 18 17:25:57
38  0  0 849085 138478 517 282   0   0 1093 1706 730 14130 656  1 74 17  8 17:25:58
 4  1  0 848920 138638 259 792   0   0 901 1445 688 3528 477  1 70 19 11 17:25:59
17  6  0 847644 139918 1887 1970   0   0 3738 6398 1158 21265 1925  5 75  7 12 17:26:00
 2  2  0 847806 139436   2 549   0   0 804 1405 667 10602 623  1 59 16 24 17:26:01
19  1  0 848161 139395 519 415   0   0 772 1351 574 8230 466  2 55 22 20 17:26:02
45  2  0 848396 138791 1208 1364   0   0 2664 6237 827 14631 989  4 91  1  4 17:26:03
11  6  0 848015 139544 1026 802   0   0 1673 4141 766 10809 628  2 62  5 32 17:26:04
 3  2  0 847649 139692 132 790   0   0 1217 2296 682 21417 722  4 62  0 34 17:26:05


Topas Monitor for host:    srdbhv05             EVENTS/QUEUES    FILE/TTY
Wed Nov 21 17:27:44 2007   Interval:  1         Cswitch    1019  Readch  7337.8K
                                                Syscall    7906  Writech 6134.5K
Kernel   74.2   |#####################       |  Reads       251  Rawin         0
User      5.1   |##                          |  Writes      196  Ttyout      632
Wait     19.7   |######                      |  Forks         7  Igets         0
Idle      1.0   |#                           |  Execs         5  Namei       315
                                                Runqueue    7.4  Dirblk        0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   1.6
en0     151.4    161.8   155.4    66.8    84.6
lo0       2.5     15.3    15.3     1.3     1.3  PAGING           MEMORY
en1       0.3      3.8     1.3     0.2     0.1  Faults     2076  Real,MB   16383
                                                Steals     4512  % Comp     21.0
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  PgspIn        0  % Noncomp  76.6
skpower0 99.3   6792.4   300.6  6527.4   265.0  PgspOut       0  % Client   76.9
hdisk27  98.0   4484.1   205.1  4346.5   137.6  PageIn     1956
skpower2 95.5   3913.4     3.8     0.0  3913.4  PageOut    2357  PAGING SPACE
hdisk0   95.5     25.5     6.4     0.0    25.5  Sios       4771  Size,MB    6912
hdisk38  95.5   2935.0    24.2     0.0  2935.0                   % Used      0.7
                                                NFS (calls/sec)  % Free     99.2
Name            PID  CPU%  PgSp Owner           ServerV2       0
lrud         143430  45.7   0.1 root            ClientV2       0   Press:
oracle      1163354  22.0   6.8 oracle          ServerV3       0   "h" for help
sshd        2404354   1.0   0.7 root            ClientV3       0   "q" to quit
aioserve    1671306   1.0   0.1 root
oracle       794768   0.6   6.8 oracle
oracle      1212514   0.6  15.5 oracle
oracle       868558   0.6   6.8 oracle
Here the lrud kernel thread shows very high CPU usage, and every disk is above 95% utilization. This basically rules out Oracle itself, so the effort can be focused on the operating-system level, or on the area where the OS and Oracle interact.
root@srdbhv05:/oracle> vmstat -v
              4194304 memory pages
              4008879 lruable pages
               140234 free pages
                    3 memory pools
               266928 pinned pages
                 80.1 maxpin percentage
                  5.0 minperm percentage
                 80.0 maxperm percentage
                 79.6 numperm percentage
              3194452 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 79.9 numclient percentage
                 80.0 maxclient percentage
              3206731 client pages
                    0 remote pageouts scheduled
                    0 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2740 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                    0 external pager filesystem I/Os blocked with no fsbuf


root@srdbhv05:/oracle> vmo -x
cpu_scale_memp,8,8,8,1,64,,B,
data_stagger_interval,161,161,161,0,4095,4KB pages,D,lgpg_regions
defps,1,1,1,0,1,boolean,D,
force_relalias_lite,0,0,0,0,1,boolean,D,
framesets,2,2,2,1,10,,B,
htabscale,n/a,-1,-1,-4,0,,B,
kernel_heap_psize,4096,4096,4096,4096,16777216,bytes,B,lgpg_size
large_page_heap_size,0,0,0,0,9223372036854775807,bytes,B,lgpg_size
lgpg_regions,0,0,0,0,,,B,lgpg_size
lgpg_size,0,0,0,0,16777216,bytes,B,lgpg_regions
low_ps_handling,1,1,1,1,2,,D,
lru_file_repage,1,1,1,0,1,boolean,D,
lru_poll_interval,10,0,10,0,60000,milliseconds,D,
lrubucket,131072,131072,131072,65536,,4KB pages,D,
maxclient%,80,80,80,1,100,% memory,D,maxperm%
maxfree,1088,128,1088,16,204800,4KB pages,D,minfree memory_frames
maxperm,3207102,,3207102,,,,S,
maxperm%,80,80,80,1,100,% memory,D,minperm% maxclient%
maxpin,3355444,,3355444,,,,S,
maxpin%,80,80,80,1,99,% memory,D,pinnable_frames memory_frames
mbuf_heap_psize,4096,4096,4096,4096,16777216,bytes,B,
memory_affinity,1,1,1,0,1,boolean,B,
memory_frames,4194304,,4194304,,,4KB pages,S,
mempools,1,1,1,1,256,,B,
minfree,960,1080,960,8,204800,4KB pages,D,maxfree memory_frames
minperm,200443,,200443,,,,S,
minperm%,5,20,5,1,100,% memory,D,maxperm%
nokilluid,0,0,0,0,4294967295,uid,D,
npskill,13824,13824,13824,1,1769471,4KB pages,D,
npswarn,55296,55296,55296,0,1769471,4KB pages,D,
num_spec_dataseg,0,0,0,0,,,B,
numpsblks,1769472,,1769472,,,4KB blocks,S,
pagecoloring,n/a,0,0,0,1,boolean,B,
pinnable_frames,3927417,,3927417,,,4KB pages,S,
pta_balance_threshold,n/a,50,50,0,99,% pta segment,R,
relalias_percentage,0,0,0,0,32767,,D,
soft_min_lgpgs_vmpool,0,0,0,0,90,%,D,lgpg_size
spec_dataseg_int,512,512,512,0,,,B,
strict_maxclient,1,1,1,0,1,boolean,D,
strict_maxperm,0,0,0,0,1,boolean,D,
v_pinshm,0,0,0,0,1,boolean,D,
vmm_fork_policy,0,0,0,0,1,boolean,D,
The final solution was to enable CIO (concurrent I/O), which resolved the problem.
Currently it is running very smoothly - we mounted the filesystems with the "cio" option and set "FILESYSTEMIO_OPTIONS=setall" in the init.ora, as advised in some of the tuning guides, and now it uses all the available AIO servers it can get, 2000 (5 CPUs now and 400 maxservers). Almost no blocked kthreads, low I/O wait, and lrud has nothing to do. Strange, but it works. It seems that more parallelisation of the I/O is the answer, though we don't know yet why this all happened. It must have been something other than just putting some filesystems on another storage system.
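
For reference, a minimal sketch of the change described above. The mount point /oradata is only a placeholder, and using chfs to add the option is an assumption; the post itself only says the filesystems were mounted with "cio" and FILESYSTEMIO_OPTIONS was set to setall:

# Add the cio mount option to the Oracle data filesystem (JFS2) and remount it
# /oradata is a hypothetical mount point; the database should be down for the remount
chfs -a options=cio /oradata
umount /oradata
mount /oradata

# Check the AIO subsystem; on this AIX level maxservers is per logical CPU,
# which matches the "2000 (5 CPUs now and 400 maxservers)" figure in the post
lsattr -El aio0

# Inside the Oracle instance, enable direct and asynchronous I/O as the post describes:
#   ALTER SYSTEM SET filesystemio_options=SETALL SCOPE=SPFILE;
# (the parameter is static, so it takes effect after the instance is restarted)
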
Further reading: http://download.oracle.com/docs/cd/B19306_01/server.102/b15658/appa_aix.htm#sthref643

-END-

PS: Recently I also ran into a problem where a CIO setting caused a build to fail. If I find the time, I will write that up in a separate post.

From the "ITPUB blog"; link: http://blog.itpub.net/40239/viewspace-624275/. Please credit the source when reposting.
