[20211214]檢視檢視V$ACTIVE_SESSION_HISTORY遇到奇怪問題.txt

lfree發表於2021-12-15

[20211214]檢視檢視V$ACTIVE_SESSION_HISTORY遇到奇怪問題.txt

--//生產系統遇到無法生成awr的問題,做一個分析:

1.環境:
xxxxxx> @ver1
PORT_STRING                    VERSION        BANNER
------------------------------ -------------- ----------------------------------------------------------------------------
x86_64/Linux 2.4.xx            11.2.0.3.0     Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production

2.分析:
xxxxxx> select min(sample_time),max(sample_time),sysdate from V$ACTIVE_SESSION_HISTORY ;
MIN(SAMPLE_TIME)        MAX(SAMPLE_TIME)        SYSDATE
----------------------- ----------------------- -------------------
2012-10-26 08:19:21.466 2012-10-26 08:19:55.536 2021-12-14 17:20:18
--//MAX(SAMPLE_TIME) = 2012-10-26 08:19:55.536 , 而當前時間是2021-12-14 17:20:18。

xxxxxx> @ bg mm
ADDR                 PID SPID   PROGRAM
---------------- ------- ------ --------------------------
00000000BD799A00       9 3618   oracle@xhserver (MMAN)
00000000BD79FE20      15 3630   oracle@xhserver (MMON)
00000000BD7A0ED0      16 3632   oracle@xhserver (MMNL)
--//很明顯收集ash停止收集資訊,什麼原因造成的問題呢?

xxxxxx> SELECT *  FROM DBA_HIST_SNAPSHOT WHERE snap_id IN (SELECT MAX (snap_id) FROM DBA_HIST_SNAPSHOT)
  2  @pr
==============================
SNAP_ID                       : 60089
DBID                          : 252262230
INSTANCE_NUMBER               : 1
STARTUP_TIME                  : 2012-10-26 08:06:15.000
BEGIN_INTERVAL_TIME           : 2021-12-05 12:00:21.235
END_INTERVAL_TIME             : 2021-12-05 13:00:24.355
FLUSH_ELAPSED                 : +00000 00:00:00.1
SNAP_LEVEL                    : 1
ERROR_COUNT                   : 0
SNAP_FLAG                     : 0
SNAP_TIMEZONE                 : +0 08:00:00
PL/SQL procedure successfully completed.
--// 2021-12-05 13:00:24.355停止了,從來沒有遇到這個問題。為什麼呢?
--// 再次看STARTUP_TIME發現一個奇特現象時間指向了2012-10-26 08:06:15.000,不應該這麼長時間啟動資料庫而沒有關庫。
--// 也就伺服器發生了重啟,並且伺服器的時鐘走的不對.

# hwclock ;date
Sun 04 Nov 2012 10:36:20 AM CST  -0.000513 seconds
Tue Dec 14 17:30:23 CST 2021

--//機器硬體的時間與伺服器存在巨大差異。檢查alert*.log 發現:(補充實際上重啟了2次)
--//實際上第1次發生在2021/8/12
Thu Aug 12 11:59:35 2021
LNS: Standby redo logfile selected for thread 1 sequence 16387 for destination LOG_ARCHIVE_DEST_2
Fri Oct 26 08:05:58 2012                                         --//第1次重啟資料庫。
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 1522 MB (100%)

Large Pages used by this instance: 761 (1522 MB)
Large Pages unused system wide = 39 (78 MB) (alloc incr 16 MB)
Large Pages configured system wide = 800 (1600 MB)
Large Page size = 2048 KB
***********************************************************
...
Fri Oct 26 08:10:46 2012
SMCO started with pid=85, OS id=5133
Thu Aug 12 15:18:04 2021
VKTM detected a time drift. Please check trace file for more details.
VKTM detected a time drift. Please check trace file for more details.
Thu Aug 12 15:19:16 2021
Starting background process CJQ0


Sun Dec 05 11:44:42 2021
~~~~~~~~~~~~~~~~~~~~~~~~
LNS: Standby redo logfile selected for thread 1 sequence 17237 for destination LOG_ARCHIVE_DEST_2
Fri Oct 26 08:19:10 2012                                         --//第2次重啟資料庫。
~~~~~~~~~~~~~~~~~~~~~~~
Starting ORACLE instance (normal)
****************** Large Pages Information *****************

Total Shared Global Region in Large Pages = 1522 MB (100%)

Large Pages used by this instance: 761 (1522 MB)
Large Pages unused system wide = 39 (78 MB) (alloc incr 16 MB)
Large Pages configured system wide = 800 (1600 MB)
Large Page size = 2048 KB
***********************************************************
.......
Fri Oct 26 08:19:50 2012
db_recovery_file_dest_size of 41220 MB is 0.00% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Setting Resource Manager plan SCHEDULER[0x318F]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Fri Oct 26 08:19:53 2012
Starting background process VKRM
Fri Oct 26 08:19:53 2012
VKRM started with pid=20, OS id=3802
Sun Dec 05 15:13:58 2021
VKTM detected a time drift. Please check trace file for more details.
VKTM detected a time drift. Please check trace file for more details.
Sun Dec 05 15:14:25 2021

--//順便說一下,我沒有使用ntp服務,而是在crond裡面加入:
*/20 * * * * root /usr/sbin/ntpdate 192.168.xxx.xx > /dev/null 2>&1

--//但是VKTM到Sun Dec 05 15:13:58 2021才報VKTM detected a time drift. Please check trace file for more details.

# uptime
 17:36:38 up 9 days,  2:25,  1 user,  load average: 0.05, 0.11, 0.09

xxxxxx> select sysdate-9-17/24-25/24/60 from dual ;
SYSDATE-9-2/24-25/2
-------------------
2021-12-05 15:12:44
--//也就是大概2021-12-05 15:12:44上下, 伺服器重啟.
--//檢視/var/log/messages.1 檔案,也可以
Dec  5 04:03:03 xhserver syslogd 1.4.1: restart.
Oct 26 08:18:41 xhserver syslogd 1.4.1: restart.
Oct 26 08:18:42 xhserver kernel: klogd 1.4.1, log source = /proc/kmsg started.
...

--//可以大致猜測應該2021/12/5 某個時刻當機了,或者重啟伺服器.大致到了2021-12-05 15:13:XX重啟動資料庫。
--//我估計是異常重啟,沒有正常關閉資料庫。

xxxxxx> select * from DBA_HIST_DATABASE_INSTANCE
  2  @ pr
==============================
DBID                          : 252262230
INSTANCE_NUMBER               : 1
STARTUP_TIME                  : 2021-03-13 10:31:36.000
PARALLEL                      : NO
VERSION                       : 11.2.0.3.0
DB_NAME                       : HRP430
INSTANCE_NAME                 : hrp430
HOST_NAME                     : xhserver
LAST_ASH_SAMPLE_ID            : 215975423
PLATFORM_NAME                 : Linux x86 64-bit
==============================
DBID                          : 252262230
INSTANCE_NUMBER               : 1
STARTUP_TIME                  : 2012-10-26 08:06:15.000
PARALLEL                      : NO
VERSION                       : 11.2.0.3.0
DB_NAME                       : HRP430
INSTANCE_NAME                 : hrp430
HOST_NAME                     : xhserver
LAST_ASH_SAMPLE_ID            : 0
PLATFORM_NAME                 : Linux x86 64-bit
==============================
DBID                          : 252262230
INSTANCE_NUMBER               : 1
STARTUP_TIME                  : 2012-10-26 08:19:40.000
PARALLEL                      : NO
VERSION                       : 11.2.0.3.0
DB_NAME                       : HRP430
INSTANCE_NAME                 : hrp430
HOST_NAME                     : xhserver
LAST_ASH_SAMPLE_ID            : 0
PLATFORM_NAME                 : Linux x86 64-bit
PL/SQL procedure successfully completed.
--//再次檢視alert確實重啟了2次。
--//結合前面的檢查,可以發現2021-12-05 13:00:24.355 - 2021-12-05 15:13:58 (Sun Dec 05 15:13:58 2021) 之間當機或者重啟了服
--//務器以及資料庫.

--//可以大致猜測幾種情況
--//1.伺服器的cmos電池已經沒電了,導致伺服器時間不正常.
--//2.我記憶裡linux伺服器正常關閉,應該會執行一步伺服器當前時間同步硬體時鐘的操作,很明顯沒有這步.也就是當機的可能性很大,
--//  關於一些細節要跟管理伺服器的同事確認.
--//3.我並不排除有人進入BIOS SETUP介面修改時間的可能,不過這個可能性很小。

--//剩下的如何解決這些問題:
xxxxxx> @bgx mm
PROGRAM                    MODULE       ACTION    SID     PID SPID
-------------------------- ------------ ------------- ------- ------
oracle@xhserver (MMAN)                            417       9 3618
oracle@xhserver (MMNL)                              4      16 3632
oracle@xhserver (MMON)                           1246      15 3630

# ps -elf | grep -E "UI[D]|361[8]|363[0]|363[2]"
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle    3618     1  0  78   0 - 58663 -      Dec05 ?        00:00:00 ora_mman_hrp430
0 S oracle    3630     1  0  75   0 - 60125 -      Dec05 ?        00:00:00 ora_mmon_hrp430
0 S oracle    3632     1  0  75   0 - 58957 -      Dec05 ?        00:00:00 ora_mmnl_hrp430

# hwclock --systohc --debug
hwclock from util-linux 2.13-pre7
Using /dev interface to clock.
Last drift adjustment done at 1457145517 seconds after 1969
Last calibration done at 1457145517 seconds after 1969
Hardware clock is on UTC time
Assuming hardware clock is kept in UTC time.
Waiting for clock tick...
...got clock tick
Time read from Hardware Clock: 2012/11/04 17:38:38
Hw clock time : 2012/11/04 17:38:38 = 1352050718 seconds since 1969
Time elapsed since reference time has been 0.516304 seconds.
Delaying further to reach the new time.
Setting Hardware Clock to 00:32:42 = 1639528362 seconds since 1969
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ioctl(RTC_SET_TIME) was successful.
Not adjusting drift factor because it has been less than a day since the last calibration.
--//注:加入--debug 主要看裡面執行的細節。下劃線表示什麼不懂。視乎裡面的32表示32天。

xxxxxx> select sysdate - timestamp'2012-11-04 17:38:38'  from dual ;
SYSDATE-TIMESTAMP'2012-11-0417:38:38'
---------------------------------------------------------------------------
+000003327 14:58:35.000000000

# hwclock ; date
Wed 15 Dec 2021 08:37:40 AM CST  -0.000560 seconds
Wed Dec 15 08:37:40 CST 2021
--//現在時間已經同步。

--//kill mmon,mmnl程式。
xxxxxx> alter system enable restricted session;
System altered.

# ps -elf | grep -E "UI[D]|361[8]|363[0]|363[2]"
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
0 S oracle    3618     1  0  78   0 - 58663 -      Dec05 ?        00:00:00 ora_mman_hrp430

xxxxxx> alter system disable restricted session;
System altered.

xxxxxx> @ bgx mm
PROGRAM                    MODULE       ACTION   SID     PID SPID
-------------------------- ------------ ------- ---- ------- ------
oracle@xhserver (MMAN)                           417       9 3618
oracle@xhserver (MMNL)                             4      16 18968
oracle@xhserver (MMON)                          1246      15 18966

xxxxxx> select min(sample_time),max(sample_time),sysdate from V$ACTIVE_SESSION_HISTORY ;
MIN(SAMPLE_TIME)        MAX(SAMPLE_TIME)        SYSDATE
----------------------- ----------------------- -------------------
2012-10-26 08:19:21.466 2021-12-15 08:41:19.991 2021-12-15 08:41:20

--//MAX(SAMPLE_TIME) = 2021-12-15 08:41:19.991 ,說明開始能收集ash資訊。問題解決。

xxxxxx> select count(*) from V$ACTIVE_SESSION_HISTORY where sample_time>trunc(sysdate);
  COUNT(*)
----------
       737

--//繼續分析:
xxxxxx> select  sql_opname,sql_id,count(*) from V$ACTIVE_SESSION_HISTORY where sample_time<=trunc(sysdate) group by sql_id,sql_opname order by 3 desc;
SQL_OPNAME      SQL_ID          COUNT(*)
--------------- ------------- ----------
                                      32
ALTER DATABASE  1h50ks4ncswfn         24
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SELECT          f6cz4n8y72xdc          6
SELECT          g0jvz8csyrtcf          3
ALTER DATABASE  92b382ka0qgdt          3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PL/SQL EXECUTE  02577v815yp77          2
SELECT          05uqdabhzncdc          1
SELECT          836b98xx998x3          1
PL/SQL EXECUTE  cnq31548hb8un          1
SELECT          1p36ta7p3d7fu          1
PL/SQL EXECUTE  49s332uhbnsma          1
SELECT          a1xgxtssv5rrp          1
SELECT          9dhn1b8d88dpf          1
13 rows selected.
--//sql_id=1h50ks4ncswfn出現多次,而且SQL_OPNAME 是ALTER DATABASE。

xxxxxx> @ sql_id 1h50ks4ncswfn
no rows selected
view dba_hist_sqltext
no rows selected

--//上網直接查詢1h50ks4ncswfn,
*** 2019-01-25 14:15:19.247
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=1h50ks4ncswfn) -----
ALTER DATABASE OPEN

--//視乎這個命令對應的就是ALTER DATABASE OPEN。

$ sql_id.sh "ALTER DATABASE OPEN\0"
sql_text = ALTER DATABASE OPEN\0
full_hash_value(16) = 2560c35ddf0cebbf181412c128cc71d4
hash_value(10) = 684487124
sql_id(32) = 1h50ks4ncswfn
sql_id(32) = 1h50ks4ncswfn
sql_id(32) = 1h50ks4ncswfn
--//正確。

--//查詢92b382ka0qgdt,視乎對應ALTER DATABASE MOUNT,我的測試對不上。
 $ sql_id.sh "ALTER DATABASE MOUNT\0"
sql_text = ALTER DATABASE MOUNT\0
full_hash_value(16) = c74c9f84f1d0a9d569fd9ec039ac9989
hash_value(10) = 967612809
sql_id(32) = 6mzcys0wut6c9
sql_id(32) = 6mzcys0wut6c9
sql_id(32) = 6mzcys0wut6c9

--//終於找到一個連結
LibraryObjectLock:  Address=f6edd440 Handle=f69b40e0 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1         
        User=f7bdefd0 Session=f7bdefd0 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=556c8d0e
      LibraryHandle:  Address=f69b40e0 Hash=940b3db9 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
        ObjectName:  Name=ALTER DATABASE   MOUNT         --//這裡的MOUNT前面有3個空格。
          FullHashValue=3b4263a262865c96912c6814940b3db9 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=2483764665 OwnerIdn=0
        Statistics:  InvalidationCount=0 ExecutionCount=1 LoadCount=2 ActiveLocks=1 TotalLockCount=1 TotalPinCount=1
        Counters:  BrokenCount=1 RevocablePointer=1 KeepDependency=1 BucketInUse=0 HandleInUse=0 HandleReferenceCount=0
        Concurrency:  DependencyMutex=f69b4190(0, 1, 0, 0) Mutex=f69b4210(0, 20, 0, 0)
        Flags=RON/PIN/TIM/PN0/DBN/[10012841]
        WaitersLists:  
          Lock=f69b4170[f69b4170,f69b4170]
          Pin=f69b4150[f69b4150,f69b4150]
          LoadLock=f69b41c8[f69b41c8,f69b41c8]
        Timestamp:  Current=06-01-2015 12:49:18
        HandleReference:  Address=f69b4290 Handle=0 Flags=[00]
        LibraryObject:  Address=eeff20b0 HeapMask=0000-0001-0001-0000 Flags=EXS[0000] Flags2=[0000] PublicFlags=[0000]
          ChildTable:  size='16'
            Child:  id='0' Table=eeff2f60 Reference=eeff29b8 Handle=f69b3e90
        NamespaceDump:  
          Parent Cursor:  sql_id=92b382ka0qgdt parent=eeff2150 maxchild=1 plk=y ppn=n         

$ sql_id.sh "ALTER DATABASE   MOUNT\0"
sql_text = ALTER DATABASE   MOUNT\0
full_hash_value(16) = 3b4263a262865c96912c6814940b3db9
hash_value(10) = 2483764665
sql_id(32) = 92b382ka0qgdt
sql_id(32) = 92b382ka0qgdt
sql_id(32) = 92b382ka0qgdt
--//終於對上了。          
--//實際上自己使用10046跟蹤估計也可以獲得,有計劃嘗試看看。

3.總結:
--//很明顯我們團隊許多人根本不看awr報表。
--//另外該硬體應該到了替換的時間。


來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/267265/viewspace-2847778/,如需轉載,請註明出處,否則將追究法律責任。

相關文章