記一次開啟資料庫慢原因分析過程

pingdanorcale發表於2023-03-13

最近在測試生產資料庫時發現資料庫在open過程很長慢,大概持續20分鐘左右,有時持續1-2個小時。日誌如下:

Fri Nov 23 10:53:51 2018 
SMON: enabling tx recovery 
Database Characterset is AL32UTF8 
Fri Nov 23 10:54:02 2018 
Decreasing number of real time LMS from 2 to 0 
Fri Nov 23 10:54:05 2018 
No Resource Manager plan active 
Fri Nov 23 10:54:47 2018 
Starting background process GTX0 
Fri Nov 23 10:54:47 2018 
GTX0 started with pid=41, OS id=14549130 
  Starting background process RCBG 
Fri Nov 23 10:54:48 2018<================== 
RCBG started with pid=43, OS id=14811234 
  replication_dependency_tracking turned off (no async multimaster replication found) 
Fri Nov 23 14:07:01 2018<====================
從RCBG程式啟動到QMNC程式啟動之間間隔了180分鐘左右,而其他open過程佔用一分鐘左右。 
Starting background process QMNC 
Fri Nov 23 14:07:01 2018 
QMNC started with pid=31, OS id=28246190 
  Fri Nov 23 14:07:57 2018 
Completed: alter database open 

為了進一步分析 RCBG QMNC 啟動慢的原因,做如下日誌日誌跟蹤:

1. 做oracel hanganalyze

資料庫啟動到mount 狀態

在看到alert log列印如下字樣時,在另外的sqlplus視窗下收集3次,間隔2~3分鐘收集一次:

步驟如下:

sqlplus / as sysdba

oradebug setmypid

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

等2~3分鐘

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

等2~3分鐘

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

oradebug tracefile_name

alter 日誌產生如下檔案

No Resource Manager plan active 
Fri Nov 23 15:09:25 2018 
Starting background process GTX0 
Fri Nov 23 15:09:25 2018 
GTX0 started with pid=41, OS id=8585312 
  Starting background process RCBG 
Fri Nov 23 15:09:26 2018 
RCBG started with pid=43, OS id=18809034 
  replication_dependency_tracking turned off (no async multimaster replication found) 
Fri Nov 23 15:39:33 2018 
System State dumped to trace file /u01/app/oracle/diag/rdbms/erpdb/ erpdb 1/trace/erpdb 1_diag_31850618.trc

從檔案上看oracle 在open過程中等待如下

1.   
-- Trace File -- 
Starting Systemstate 1 
......................................... 
Ass.Awk Version 1.0.31 
~~~~~~~~~~~~~~~~~~~~~~ 
Source file : symbols1_diag_31850618.trc 

System State 1 (2018-11-23 15:39:32.932) 
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~ 
1: [DEAD] 
2: waiting for 'pmon timer' 
3: waiting for 'rdbms ipc message' 
4: waiting for 'VKTM Logical Idle Wait' 
5: waiting for 'rdbms ipc message' 
6: 
7: waiting for 'rdbms ipc message' 
8: waiting for 'PING' 
9: waiting for 'rdbms ipc message' 
10: waiting for 'DIAG idle wait' 
11: waiting for 'rdbms ipc message' 
12: waiting for 'ges remote message'<===lmd 
13: waiting for 'gcs remote message'<===lms 
14: waiting for 'gcs remote message'<===lms 
15: waiting for 'rdbms ipc message' 
16: waiting for 'GCR sleep' 
17: waiting for 'rdbms ipc message' 
18: waiting for 'rdbms ipc message' 
19: waiting for 'rdbms ipc message' 
20: waiting for 'rdbms ipc message' 
21: waiting for 'rdbms ipc message' 
22: waiting for 'smon timer' 
23: waiting for 'rdbms ipc message' 
24: waiting for 'rdbms ipc message' 
25: waiting for 'ASM background timer' 
26: waiting for 'rdbms ipc message' 
27: waiting for 'rdbms ipc message' 
28: 
29: 
30: waiting for 'wait for unread message on broadcast channel' 
31: waiting for 'class slave wait' 
32: waiting for 'rdbms ipc message' 
33: waiting for 'rdbms ipc message' 
34: waiting for 'db file scattered read' (0x4f,0xa2766,0xb) 
35: waiting for 'rdbms ipc reply' (from_process=6)[IPC 6] 
37: waiting for 'rdbms ipc message' 
38: waiting for 'rdbms ipc message' 
39: waiting for 'rdbms ipc message' 
40: waiting for 'rdbms ipc message' 
41: waiting for 'rdbms ipc message' 
43: waiting for 'wait for unread message on broadcast channel' 

2.     

flags2: (0x409) -/-/INC 
DID: , short-term DID: 
txn branch: 0x0 
edition#: 0 oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS 
ksuxds FALSE at location: 0 
service name: SYS$BACKGROUND 
Current Wait Stack: 
0: waiting for 'gcs remote message' 
waittime=0x1e, poll=0x0, event=0x0 
wait_id=51479 seq_num=51480 snap_id=1 
wait times: snap=0.000577 sec, exc=0.000577 sec, total=0.000577 sec 
wait times: max=0.030000 sec, heur=31 min 42 sec<==============
應該是例項啟動到dump時總共等待時間,這兒說明私網gcs訊息傳輸可能有問題。 

 

為了排除私有網路卡的問題部署oswatch,從oswatch收集的資訊看,網路和資源都沒有問題。

 2. 為了進一步確定原因

truss -aefdo /tmp/truss.out sqlplus / as sysdba

alter database open;

從日誌上分析可以看到

14876838: 56950981: 2.0001: kread(9, "01 -\0\006\0\0\0\0\011 i".., 8208) (sleeping...) 
21954596: 92471451: 2.0001: kread(0, "\n l t e r d a t a b a".., 4096) (sleeping...) 
21954596: 92471451: kread(0, " e x i t\n d a t a b a".., 4096) = 5 
21954596: 92471451: 1344.6921: kfcntl(1, F_GETFL, 0x0000000000000008) = 67110914<=== 
從它前面的kread讀取的終端輸入是exit來看應該是已經完成了open執行推出之後的os call了,所以上面的1344.6921秒是不能計入open過程的,排除這個之後來看,主要問題就是aio了, listio64 數目高達123644次,而單次等待從0.0003到0.5271s不等 
14876838: 56950981: kpread(24, 0x07000107C5755000, 4096, 0x0000004FBAD2A000, 0x000000000000007D) = 4096 
14876838: 56950981: kpread(24, 0x07000107C5E14000, 4096, 0x000000E41E85F000, 0x000000000000007D) = 4096 
14876838: 56950981: 0.5271: listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFCC1E0, 0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x00000001116E89A0, 0x0000000000000040) = 0x0000000000000000


---Continue--- 
  有一點還是很奇怪,按理open的時候已經完成了crash recovery了,此時對資料檔案只有讀沒有寫了,但為何您這裡出來這麼多 kpread,還是令人不解。 為什麼從RCBG程式啟動到QMNC程式啟動之間間隔很長時間那,而其他後臺程式open過程佔用一分鐘左右。
從上邊的分析,問題又回到了原點。難道是oracle bug,

為了進一步確認問題,在mos搜到如下一篇文章:

Procedure to Manually Coalesce All the IOTs / Indexes Associated with Advanced Queueing Tables to Maintain Enqueue / Dequeue Performance; Reduce QMON CPU Usage and Redo Generation ( 檔案 ID 271855.1)

為了驗證文章的可靠性,透過此文章提供的指令碼,進行修復,資料庫能正常OPEN,並且很快。日誌如下:

Database Characterset is AL32UTF8

No Resource Manager plan active

Starting background process GTX0

Fri Nov 23 16:39:33 2018 

GTX0 started with pid=36, OS id=16056540

Starting background process RCBG

Fri Nov 23 16:41:23 2018 

RCBG started with pid=37, OS id=26214482

replication_dependency_tracking turned off (no async multimaster replication found)

Starting background process QMNC

Fri Nov 23 16:42:23 2018 

QMNC started with pid=39, OS id=14287094

Fri Nov 23 16:43:33 2018 

Completed: ALTER DATABASE OPEN

 

這個案例的後續是,如果經常使用AQ,資料庫可以不定期做碎片整理。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/10201716/viewspace-2939378/,如需轉載,請註明出處,否則將追究法律責任。

相關文章