【DBA】log file switch (checkpoint incomplete) - 容易被誤診的event
By:
Shunlun Pan
首先來看如下的一份AWR,12分鐘的取樣,DB Time 105分鐘。
DB Name DB Id Instance Inst num Startup Time Release RAC
R11204 2114874159 R11204 1 23-Oct-17 10:10 11.2.0.4.0 NO
Host Name Platform CPUs Cores Sockets Memory (GB)
nascds18 Linux x86 64-bit 2 2 1 11.64
Snap Id Snap Time Sessions Cursors/Session
Begin Snap: 3 23-Oct-17 10:55:46 37 2.5
End Snap: 4 23-Oct-17 11:08:27 53 2.3
Elapsed: 12.67 (mins)
DB Time: 105.90 (mins)
Top event發現buffer busy waits和log file switch (checkpoint incomplete)幾乎佔用了所有的DB Time
Top 10 Foreground Events by Total Wait Time
Event Waits Total Wait Time (sec) Wait Avg(ms) % DB time Wait Class
buffer busy waits 11 3310.6 300960 52.1 Concurrency
log file switch (checkpoint incomplete) 10 3034.8 303479 47.8 Configuration
DB CPU 5.5 .1
log file sync 28 2.3 82 .0 Commit
log buffer space 24 .8 32 .0 Configuration
透過ASH不難發現buffer busy waits被log file switch (checkpoint incomplete)阻塞,而log file switch (checkpoint incomplete)又被LGWR的control file sequential read阻塞。
2017-10-23 11:05:31.563 1 37 perl@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 150
2017-10-23 11:05:31.563 1 150 sqlplus@nascds18 (TNS V1-V3) buffer busy waits WAITING 1 141
2017-10-23 11:05:31.563 1 141 OMS log file switch (checkpoint incomplete) WAITING 1 130
2017-10-23 11:05:31.563 1 130 oracle@nascds18 (LGWR) control file sequential read WAITING NO HOLDER
接下來再看一眼AWR的iostat,很快發現LGWR有大量的讀取,幾乎所有的I/O都是LGWR發起的。
IOStat by Function summary
Function Name Reads: Data
LGWR 1.5G
Others 210M
DBWR 0M
Buffer Cache Reads 10M
Direct Writes 0M
TOTAL: 1.7G
結合ASH中的LGWR是最終holder,並且LGWR等待control file sequential read的事實,我們可能很快將LGWR列為重點懷疑物件,如果真是這樣,那就跑偏了。
此時不妨停下來思考一下,什麼是log file switch (checkpoint incomplete)?
log file switch (checkpoint incomplete)指的是當redo需要向下一組redo group切換的時候,發現下組日誌是active的,也就是說下組日誌中對應的一些buffer cache中的髒塊尚未寫入到資料檔案中,因此必須等待這些髒塊被完畢後,才可以複用下一組redo group。
接下來再思考一下,哪個程式負責將髒塊寫入到資料檔案?答案是DBWn,因此我們更需要關注的是DBWn。分析一下OSWatcher不難發現DBW0的狀態是D,man一下ps命令: D指的是Uninterruptible sleep (usually IO),也就是說DBW0 hang在I/O上了,這才是問題的根本原因。
那麼為什麼LGWR會執行如此多的control file sequential read ?答案是DBWn出現問題導致 LGWR不斷查詢control 檔案獲取redo 狀態(看它有沒有切換成功)導致大量的control file sequential read
下面透過實驗來演示這個現象:
Test on 12.1.0.2
Session 1: 建立一個表,並且update一下讓其產生redo
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select sid from v$mystat where rownum=1;
SID
----------
11 <<<< session 1的 sid 為11
SQL> create table t as select * from dba_objects;
Table created.
SQL> update t set object_name=object_name;
93841 rows updated.
Session 2: 用於檢視redo的切換狀態
SQL> set line 200 pages 1000
SQL> select group#,thread#,status from v$log;
GROUP# THREAD# STATUS
---------- ---------- ----------------
1 1 INACTIVE
2 1 INACTIVE
3 1 CURRENT
SQL> alter system switch logfile;
System altered.
SQL> alter system switch logfile;
System altered.
SQL> select group#,thread#,status from v$log;
GROUP# THREAD# STATUS
---------- ---------- ----------------
1 1 ACTIVE <<<<<< 讓下一組是active的
2 1 ACTIVE
3 1 CURRENT
Session 3: 確定DBWn以及LGWR的spid,並且透過oradebug將DBW0 suspend住讓其不工作
SQL> set line 200 pages 1000
SQL> select spid,program from v$process where program like '%DBW%' or program like '%LG%';
SPID PROGRAM
------------------------ ----------------------------------------------------------------
5768 ORACLE.EXE (DBW0)
7248 ORACLE.EXE (LGWR)
6384 ORACLE.EXE (LG00)
6308 ORACLE.EXE (LG01)
SQL> oradebug setospid 5768
Oracle pid: 11, Windows thread id: 5768, image: ORACLE.EXE (DBW0)
SQL> oradebug suspend
Statement processed.
接下來在Session 1中再執行幾次update讓其自動切換日誌:
Session 1:
SQL> update t set object_name=object_name;
93841 rows updated.
SQL> /
93841 rows updated.
SQL> /
93841 rows updated.
SQL> / ---- 在這裡hang住了
Session 4: 觀察等待事件
SQL> set line 200 pages 1000
SQL> col program for a15
SQL> col event for a40
SQL> select sid,serial#,program,event,state from v$session where program like '%sqlplus%';
SID SERIAL# PROGRAM EVENT STATE
---------- ---------- --------------- ---------------------------------------- -------------------
10 33682 sqlplus.exe SQL*Net message from client WAITING
11 48189 sqlplus.exe log file switch (checkpoint incomplete) WAITING
129 25471 sqlplus.exe SQL*Net message to client WAITED SHORT TIME
130 64963 sqlplus.exe SQL*Net message from client WAITING
接下來我們對LGWR做10046 trace,觀察其等待: 可見反覆的 control file sequential read
SQL> oradebug setospid 7248
Oracle pid: 12, Windows thread id: 7248, image: ORACLE.EXE (LGWR)
SQL> oradebug event 10046 trace name context forever, level 8
Statement processed.
SQL> oradebug tracefile_name
D:\ORACLE\diag\rdbms\r12102\r12102\trace\r12102_lgwr_7248.trc
tail -f D:\ORACLE\diag\rdbms\r12102\r12102\trace\r12102_lgwr_7248.trc
WAIT #0: nam='LGWR all worker groups' ela= 72 p1=0 p2=0 p3=0 obj#=-1 tim=25178622234
WAIT #0: nam='control file sequential read' ela= 407 file#=0 block#=1 blocks=1 obj#=-1 tim=25178622880
WAIT #0: nam='control file sequential read' ela= 262 file#=1 block#=1 blocks=1 obj#=-1 tim=25178623344
WAIT #0: nam='control file sequential read' ela= 717 file#=0 block#=15 blocks=1 obj#=-1 tim=25178624315
WAIT #0: nam='control file sequential read' ela= 1774 file#=0 block#=17 blocks=1 obj#=-1 tim=25178626427
WAIT #0: nam='control file sequential read' ela= 311 file#=0 block#=19 blocks=1 obj#=-1 tim=25178627527
WAIT #0: nam='control file sequential read' ela= 269 file#=0 block#=284 blocks=1 obj#=-1 tim=25178627983
WAIT #0: nam='control file sequential read' ela= 238 file#=0 block#=22 blocks=1 obj#=-1 tim=25178628363
WAIT #0: nam='LGWR all worker groups' ela= 51 p1=0 p2=0 p3=0 obj#=-1 tim=25178628590
WAIT #0: nam='control file sequential read' ela= 503 file#=0 block#=1 blocks=1 obj#=-1 tim=25178629320
WAIT #0: nam='control file sequential read' ela= 322 file#=1 block#=1 blocks=1 obj#=-1 tim=25178630389
WAIT #0: nam='control file sequential read' ela= 276 file#=0 block#=15 blocks=1 obj#=-1 tim=25178630864
WAIT #0: nam='control file sequential read' ela= 253 file#=0 block#=17 blocks=1 obj#=-1 tim=25178631286
WAIT #0: nam='control file sequential read' ela= 250 file#=0 block#=19 blocks=1 obj#=-1 tim=25178631696
WAIT #0: nam='control file sequential read' ela= 658 file#=0 block#=284 blocks=1 obj#=-1 tim=25178632935
WAIT #0: nam='control file sequential read' ela= 303 file#=0 block#=22 blocks=1 obj#=-1 tim=25178633812
......
最後我們將DBWn恢復,結束這個無限的control file sequential read等待。
回到Session 3,resume DBW0:
SQL> oradebug resume
Statement processed.
只要理解了原理,這個問題本來不難診斷,但是ASH和AWR的資訊會讓DBA錯誤的認為是LGWR的等待control file sequential read導致的問題。
https://blogs.oracle.com/database4cn/log-file-switch-checkpoint-incomplete-%e5%ae%b9%e6%98%93%e8%a2%ab%e8%af%af%e8%af%8a%e7%9a%84event
https://blogs.oracle.com/database4cn/log-file-switch-checkpoint-incomplete-%e5%ae%b9%e6%98%93%e8%a2%ab%e8%af%af%e8%af%8a%e7%9a%84event
來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29487349/viewspace-2146345/,如需轉載,請註明出處,否則將追究法律責任。
相關文章
- log file switch (checkpoint incomplete)等待事件事件
- 【等待事件】log file switch (checkpoint incomplete)事件
- 關於log file switch and checkpoint機制
- log file switch
- LOG FILE SWITCH等待事件事件
- log file switch相關等待事件事件
- alter system switch log file 與 archive log current/all 區別Hive
- PostgreSQL DBA(140) - PG 12(Don't log incomplete startup packet if it's empty)SQL
- Checkpoint log:invalid bitmap page錯誤修復
- oracle switch logfile日誌切換及alter system checkpoint作了什麼Oracle
- oracle core-log file sync-dba enterpriseOracle
- [Logmnr]使用logminer找回被誤刪的資料
- 條款18 讓介面容易被正確使用,不易被誤用
- MySQL中的redo log和checkpointMySql
- Stream -- Node.js中最好的卻最容易被誤解的部分Node.js
- PostgreSQL xlog格式之checkpointSQL
- Oracle log_checkpoint_interval和Oracle
- log file sync 和 log file parallel writeParallel
- Query the duration of log switch
- Vue中那些容易被忽略的~Vue
- Web開發和設計上容易被忽視的8個錯誤Web
- PostgreSQL DBA(168) - pgAdmin(config:checkpoint)SQL
- PostgreSQL DBA(66) - 配置引數(checkpoint_flush_after)SQL
- zt_oracle診斷事件event列表Oracle事件
- 【ASK_ORACLE】檢查點錯誤“Cannot allocate new log”和“Checkpoint not complete”Oracle
- RedoLog Checkpoint 和 SCN關係
- [Oracle Script] Log switch statusOracle
- redo_log_switch_date
- SQL Server 容易忽略的錯誤SQLServer
- PostgreSQL DBA(64) - checkpoint_completion_target引數解析SQL
- RAC生產庫出現嚴重row cache lock和log file switch(archiving need)
- Could not find first log file name in binary log index fileIndex
- 前端常見bug系列1:容易被誤解的:last-child和:first-child前端AST
- Switch log 掛起的小問題
- Archive Log模式下Redo Log、Check Point和Switch LogHive模式
- 記Promise一個容易被忽略的特性Promise
- alter system archive log current / all / switch logfileHive
- redo的等待log file sync和log file parallel write和redo size設定Parallel