最近在測試生產資料庫時發現資料庫在open過程很長慢，大概持續20分鐘左右，有時持續1-2個小時。日誌如下：

Fri Nov 23 10:53:51 2018
SMON: enabling tx recovery
Database Characterset is AL32UTF8
Fri Nov 23 10:54:02 2018
Decreasing number of real time LMS from 2 to 0
Fri Nov 23 10:54:05 2018
No Resource Manager plan active
Fri Nov 23 10:54:47 2018
Starting background process GTX0
Fri Nov 23 10:54:47 2018
GTX0 started with pid=41, OS id=14549130
Starting background process RCBG
Fri Nov 23 10:54:48 2018<==================
RCBG started with pid=43, OS id=14811234
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 14:07:01 2018<==================== 從RCBG程式啟動到QMNC程式啟動之間間隔了180分鐘左右，而其他open過程佔用一分鐘左右。
Starting background process QMNC
Fri Nov 23 14:07:01 2018
QMNC started with pid=31, OS id=28246190
Fri Nov 23 14:07:57 2018
Completed: alter database open 。

為了進一步分析 RCBG 到 QMNC 啟動慢的原因，做如下日誌日誌跟蹤：

1. 做oracel hanganalyze

資料庫啟動到mount 狀態

在看到alert log列印如下字樣時，在另外的sqlplus視窗下收集3次，間隔2~3分鐘收集一次：

步驟如下：

sqlplus / as sysdba

oradebug setmypid

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

等2~3分鐘

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

等2~3分鐘

oradebug -G all dump systemstate 267

oradebug -G all hanganalyze 2

oradebug tracefile_name

alter 日誌產生如下檔案

No Resource Manager plan active
Fri Nov 23 15:09:25 2018
Starting background process GTX0
Fri Nov 23 15:09:25 2018
GTX0 started with pid=41, OS id=8585312
Starting background process RCBG
Fri Nov 23 15:09:26 2018
RCBG started with pid=43, OS id=18809034
replication_dependency_tracking turned off (no async multimaster replication found)
Fri Nov 23 15:39:33 2018
System State dumped to trace file /u01/app/oracle/diag/rdbms/erpdb/ erpdb 1/trace/erpdb 1_diag_31850618.trc

從檔案上看oracle 在open過程中等待如下

1.
-- Trace File --
Starting Systemstate 1
.........................................
Ass.Awk Version 1.0.31
~~~~~~~~~~~~~~~~~~~~~~
Source file : symbols1_diag_31850618.trc

System State 1 (2018-11-23 15:39:32.932)
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
1: [DEAD]
2: waiting for 'pmon timer'
3: waiting for 'rdbms ipc message'
4: waiting for 'VKTM Logical Idle Wait'
5: waiting for 'rdbms ipc message'
6:
7: waiting for 'rdbms ipc message'
8: waiting for 'PING'
9: waiting for 'rdbms ipc message'
10: waiting for 'DIAG idle wait'
11: waiting for 'rdbms ipc message'
12: waiting for 'ges remote message'<===lmd
13: waiting for 'gcs remote message'<===lms
14: waiting for 'gcs remote message'<===lms
15: waiting for 'rdbms ipc message'
16: waiting for 'GCR sleep'
17: waiting for 'rdbms ipc message'
18: waiting for 'rdbms ipc message'
19: waiting for 'rdbms ipc message'
20: waiting for 'rdbms ipc message'
21: waiting for 'rdbms ipc message'
22: waiting for 'smon timer'
23: waiting for 'rdbms ipc message'
24: waiting for 'rdbms ipc message'
25: waiting for 'ASM background timer'
26: waiting for 'rdbms ipc message'
27: waiting for 'rdbms ipc message'
28:
29:
30: waiting for 'wait for unread message on broadcast channel'
31: waiting for 'class slave wait'
32: waiting for 'rdbms ipc message'
33: waiting for 'rdbms ipc message'
34: waiting for 'db file scattered read' (0x4f,0xa2766,0xb)
35: waiting for 'rdbms ipc reply' (from_process=6)[IPC 6]
37: waiting for 'rdbms ipc message'
38: waiting for 'rdbms ipc message'
39: waiting for 'rdbms ipc message'
40: waiting for 'rdbms ipc message'
41: waiting for 'rdbms ipc message'
43: waiting for 'wait for unread message on broadcast channel'

2.

flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: 0x0
edition#: 0 oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
0: waiting for 'gcs remote message'
waittime=0x1e, poll=0x0, event=0x0
wait_id=51479 seq_num=51480 snap_id=1
wait times: snap=0.000577 sec, exc=0.000577 sec, total=0.000577 sec
wait times: max=0.030000 sec, heur=31 min 42 sec<============== 應該是例項啟動到dump時總共等待時間，這兒說明私網gcs訊息傳輸可能有問題。

為了排除私有網路卡的問題部署oswatch，從oswatch收集的資訊看，網路和資源都沒有問題。

2. 為了進一步確定原因

truss -aefdo /tmp/truss.out sqlplus / as sysdba

alter database open;

從日誌上分析可以看到

14876838: 56950981: 2.0001: kread(9, "01 -\0\006\0\0\0\0\011 i".., 8208) (sleeping...)
21954596: 92471451: 2.0001: kread(0, "\n l t e r d a t a b a".., 4096) (sleeping...)
21954596: 92471451: kread(0, " e x i t\n d a t a b a".., 4096) = 5
21954596: 92471451: 1344.6921: kfcntl(1, F_GETFL, 0x0000000000000008) = 67110914<===
從它前面的kread讀取的終端輸入是exit來看應該是已經完成了open執行推出之後的os call了，所以上面的1344.6921秒是不能計入open過程的，排除這個之後來看，主要問題就是aio了， listio64 數目高達123644次，而單次等待從0.0003到0.5271s不等
14876838: 56950981: kpread(24, 0x07000107C5755000, 4096, 0x0000004FBAD2A000, 0x000000000000007D) = 4096
14876838: 56950981: kpread(24, 0x07000107C5E14000, 4096, 0x000000E41E85F000, 0x000000000000007D) = 4096
14876838: 56950981: 0.5271: listio64(0x0000000010000004, 0x000000000FFFFFFF, 0x00000000FFFCC1E0, 0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x00000001116E89A0, 0x0000000000000040) = 0x0000000000000000 ，

---Continue---
有一點還是很奇怪，按理open的時候已經完成了crash recovery了，此時對資料檔案只有讀沒有寫了，但為何您這裡出來這麼多 kpread，還是令人不解。為什麼從RCBG程式啟動到QMNC程式啟動之間間隔很長時間那，而其他後臺程式open過程佔用一分鐘左右。
從上邊的分析，問題又回到了原點。難道是oracle bug，

為了進一步確認問題，在mos搜到如下一篇文章：

Procedure to Manually Coalesce All the IOTs / Indexes Associated with Advanced Queueing Tables to Maintain Enqueue / Dequeue Performance; Reduce QMON CPU Usage and Redo Generation ( 檔案 ID 271855.1)

為了驗證文章的可靠性，透過此文章提供的指令碼，進行修復，資料庫能正常OPEN,並且很快。日誌如下：

Database Characterset is AL32UTF8

No Resource Manager plan active

Starting background process GTX0

Fri Nov 23 16:39:33 2018

GTX0 started with pid=36, OS id=16056540

Starting background process RCBG

Fri Nov 23 16:41:23 2018

RCBG started with pid=37, OS id=26214482

replication_dependency_tracking turned off (no async multimaster replication found)

Starting background process QMNC

Fri Nov 23 16:42:23 2018

QMNC started with pid=39, OS id=14287094

Fri Nov 23 16:43:33 2018

Completed: ALTER DATABASE OPEN

這個案例的後續是，如果經常使用AQ，資料庫可以不定期做碎片整理。

記一次開啟資料庫慢原因分析過程

相關文章