Case Study: Incident Report on the p570a Host Hang

Posted by paulyibinyi on 2010-11-30

Environment: Oracle 11.2.0.1 + RAC + AIX 6.1, hosting two databases

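At around 15:00 on 2010-11-29, the p570a host stopped accepting telnet logins and applications could no longer open new connections, which seriously affected the business. I arrived at the customer site at 16:00 and began emergency handling.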

The emergency handling and the problem analysis are summarized below:


After logging in to the p570a host through the HMC console, every command failed with an out-of-memory error, as shown below:

root@p570a:/> errpt|more

ksh: 0403-031 The fork function failed. There is not enough memory available.

ksh: 0403-031 The fork function failed. There is not enough memory available.

root@p570a:/> ps -ef | grep  LOCAL=NO|wc -l

ksh: 0403-031 The fork function failed. There is not enough memory available.

root@p570a:/> ls

ksh: 0403-031 The fork function failed. There is not enough memory available.

 

With the customer's consent, the p570a host was restarted from the HMC console.

Fault analysis

p570a@root#errpt|more

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION

A6DF45AA   1129164210 I O RMCdaemon      The daemon is started.

EC0BCCD4   1129164110 T H ent1           ETHERNET DOWN

67145A39   1129163910 U S SYSDUMP        SYSTEM DUMP

F48137AC   1129163810 U O minidump       COMPRESSED MINIMAL DUMP

1104AA28   1129163810 T S SYSPROC        SYSTEM RESET INTERRUPT RECEIVED

9DBCFDEE   1129164110 T O errdemon       ERROR LOGGING TURNED ON

B6267342   1126235510 P H hdisk3         DISK OPERATION ERROR

B6267342   1125235510 P H hdisk3         DISK OPERATION ERROR

C5C09FFA   1125062110 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1125051010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED

 

p570a@root#errpt -aj C5C09FFA |more

---------------------------------------------------------------------------

LABEL:          PGSP_KILL

IDENTIFIER:     C5C09FFA

 

Date/Time:       Thu Nov 25 06:21:13 BEIST 2010

Sequence Number: 99122

Machine Id:      00C6E9C54C00

Node Id:         p570a

Class:           S

Type:            PERM

WPAR:            Global

Resource Name:   SYSVMM         

 

Description

SOFTWARE PROGRAM ABNORMALLY TERMINATED

 

Probable Causes

SYSTEM RUNNING OUT OF PAGING SPACE

 

Failure Causes

INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM

PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE

 

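The PGSP_KILL entries start on Nov 24 and report that the system was running out of paging space, which means physical memory had already been exhausted well before the hang.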
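To see when the paging-space kills began, the `errpt` summary can be grouped by day. The following is a minimal Python sketch, not part of the original analysis. It assumes the summary timestamp column uses the MMDDhhmmYY layout, and the function names `parse_errpt_line` and `count_by_day` are illustrative.

```python
from collections import Counter
from datetime import datetime


def parse_errpt_line(line):
    """Split one errpt summary line into its six columns."""
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    # The summary timestamp is assumed to be MMDDhhmmYY, e.g. 1124144010.
    when = datetime.strptime(stamp, "%m%d%H%M%y")
    return {
        "id": ident,
        "time": when,
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description.strip(),
    }


def count_by_day(lines, identifier):
    """Count entries with the given error identifier per calendar day."""
    days = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank lines, the header row, and other identifiers.
        if len(parts) < 6 or parts[0] != identifier:
            continue
        days[parse_errpt_line(line)["time"].date().isoformat()] += 1
    return days


# A few lines copied from the errpt output above.
sample = [
    "IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION",
    "67145A39   1129163910 U S SYSDUMP        SYSTEM DUMP",
    "C5C09FFA   1125062110 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED",
    "C5C09FFA   1125051010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED",
    "C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED",
]

for day, count in sorted(count_by_day(sample, "C5C09FFA").items()):
    print(day, count)
```

Fed the full `errpt` output, the same grouping shows the burst of C5C09FFA entries at 14:40 on Nov 24.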

From Nov 24 onward, alert_gzjb1.log contains a large number of errors like the following:

Wed Nov 24 22:36:15 2010

ORA-27302: failure occurred at: skgpspawn3

ORA-27301: OS failure message: Not enough space

ORA-27300: OS system dependent operation:fork failed with status: 12

Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:

Process startup failed, error stack:

 

Thu Nov 25 02:56:24 2010

Process q000 died, see its trace file

Thu Nov 25 02:56:13 2010

ORA-27302: failure occurred at: skgpspawn3

ORA-27301: OS failure message: Not enough space

ORA-27300: OS system dependent operation:fork failed with status: 12

Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_psp0_352314.trc:

Process startup failed, error stack:

 

Instance terminated by USER, pid = 144242

USER (ospid: 144242): terminating the instance due to error 443

Process LMHB died, see its trace file

ORA-27302: failure occurred at: skgpspawn3

ORA-27301: OS failure message: Not enough space

ORA-27300: OS system dependent operation:fork failed with status: 12

Errors in file /oracle/app/oracle/diag/rdbms/gdjb/gdjb1/trace/gdjb1_ora_144242.trc:

 

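The database on node p570a went down because physical memory and paging space were both exhausted, so new memory requests (including the fork of new processes) could not be satisfied.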
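Status 12 in the ORA-27300 lines is the operating system's errno for "out of memory" (ENOMEM), which AIX prints as "Not enough space". The sketch below is a hypothetical helper, not something used in the original investigation. It pulls the fork failures out of an alert log excerpt and assumes the timestamp lines look like `Wed Nov 24 22:36:15 2010` and are parsed under the default C locale.

```python
import errno
import re
from datetime import datetime

FORK_FAIL = re.compile(r"ORA-27300: .*fork failed with status: (\d+)")
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Wed Nov 24 22:36:15 2010"


def fork_failures(lines):
    """Return (timestamp, status, errno name) for every failed fork."""
    current = None
    hits = []
    for line in lines:
        text = line.strip()
        try:
            # A line that parses as a date starts a new alert log entry.
            current = datetime.strptime(text, STAMP_FORMAT)
            continue
        except ValueError:
            pass
        match = FORK_FAIL.search(text)
        if match:
            status = int(match.group(1))
            name = errno.errorcode.get(status, "unknown")
            hits.append((current, status, name))
    return hits


# Lines copied from the alert log excerpt above.
excerpt = [
    "Wed Nov 24 22:36:15 2010",
    "ORA-27302: failure occurred at: skgpspawn3",
    "ORA-27301: OS failure message: Not enough space",
    "ORA-27300: OS system dependent operation:fork failed with status: 12",
    "Thu Nov 25 02:56:13 2010",
    "ORA-27300: OS system dependent operation:fork failed with status: 12",
]

for when, status, name in fork_failures(excerpt):
    print(when, status, name)
```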

 

 

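TNS-12500: TNS:listener failed to start a dedicated server process

 TNS-12540: TNS:internal limit restriction exceeded

  TNS-12560: TNS:protocol adapter error

   TNS-00510: Internal limit restriction exceeded

    IBM/AIX RISC System/6000 Error: 12: Not enough space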

 

The listener log shows the same condition: the listener could not service incoming connection requests.

Memory parameters

Physical memory

p570a

AIX

System Model: IBM,9117-MMA

Machine Serial Number: 066E9C5

Processor Type: PowerPC_POWER6

Processor Implementation Mode: POWER 6

Processor Version: PV_6_Compat

Number Of Processors: 8

Processor Clock Speed: 3504 MHz

CPU Type: 64-bit

Kernel Type: 64-bit

LPAR Info: 1 06-6E9C5

Memory Size: 15232 MB

Good Memory Size: 15232 MB

Platform Firmware level: EM350_038

Firmware Version: IBM,EM350_038

Console Login: enable

Auto Restart: true

Full Core: false

Total physical memory is therefore about 15 GB.

 

Database A

SQL> show sga

 

Total System Global Area 2137886720 bytes

Fixed Size                  2208496 bytes

Variable Size            1207962896 bytes

Database Buffers          922746880 bytes

Redo Buffers                4968448 bytes

SQL> show parameter sga

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

lock_sga                             boolean     FALSE

pre_page_sga                         boolean     FALSE

sga_max_size                         big integer 2G

sga_target                           big integer 2G

 

SQL> show parameter pga

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

pga_aggregate_target                 big integer 1G

 

SQL> show parameter instance_name

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

instance_name                        string      gd1

 

Database A is therefore configured to use about 3 GB of physical memory (2 GB SGA plus a 1 GB PGA target).

 

Database B

SQL> show sga

 

Total System Global Area 8551575552 bytes

Fixed Size                  2223904 bytes

Variable Size            1778385120 bytes

Database Buffers         6761218048 bytes

Redo Buffers                9748480 bytes

SQL> show parameter sga

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

lock_sga                             boolean     FALSE

pre_page_sga                         boolean     FALSE

sga_max_size                         big integer 8G

sga_target                           big integer 8G

SQL> show parameter instance_name

 

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------

instance_name                        string      gd2

SQL> show parameter pga

 

NAME                                 TYPE             VALUE

pga_aggregate_target                 big integer       2G

 

 

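Database B is therefore configured to use about 10 GB of physical memory (8 GB SGA plus a 2 GB PGA target), a large share of the host's total memory.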

 

 

 

 

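Of the 15 GB of physical memory, 13 GB is allocated to the two databases, leaving only about 2 GB for the operating system. As the number of application connections grows, or when connections are not released, physical memory and paging space are easily exhausted, which brings down the database and hangs the host.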
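The arithmetic behind this can be sketched in a few lines of Python. The function name `os_headroom_mb` and the instance labels are illustrative, and the calculation treats `pga_aggregate_target` as if it were a hard reservation. It is only a target, and each dedicated server process also consumes memory outside it, so the real headroom is smaller than the figure printed here.

```python
MB_PER_GB = 1024


def os_headroom_mb(total_mb, instances):
    """Memory left for the OS after each instance's SGA and PGA target.

    `instances` maps a label to (sga_gb, pga_target_gb).
    """
    reserved_mb = sum((sga + pga) * MB_PER_GB for sga, pga in instances.values())
    return total_mb - reserved_mb


# Values taken from the host's "Memory Size" line and the
# `show parameter` output for the two instances.
current = {"A": (2, 1), "B": (8, 2)}
print(os_headroom_mb(15232, current))  # 1920
```

Roughly 1.9 GB has to cover the AIX kernel, the file cache, the clusterware stack, and every Oracle server process, which matches the paging-space exhaustion seen in the error log.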

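1) The Oracle memory parameters of the gzcdc database are set too high and should be reduced. After discussion with the developer and the customer, it was agreed to set the gzcdc SGA to 5 GB and its PGA target to 1 GB, which leaves about 6 GB for the operating system (15 GB total, minus 3 GB for database A and 6 GB for gzcdc).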
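As a rough check of the proposed settings, the sketch below compares how much of the host's 15232 MB is committed before and after the change. It assumes gzcdc is the instance listed above as database B and that database A keeps its 2 GB SGA and 1 GB PGA target. The helper name `committed_mb` is illustrative.

```python
TOTAL_MB = 15232  # "Memory Size" reported by the host


def committed_mb(settings_gb):
    """Sum of SGA plus PGA target, in MB, over (sga_gb, pga_gb) pairs."""
    return sum(sga + pga for sga, pga in settings_gb) * 1024


before = committed_mb([(2, 1), (8, 2)])  # database A, database B today
after = committed_mb([(2, 1), (5, 1)])   # database A, database B as proposed

for label, used in (("before", before), ("after", after)):
    share = used / TOTAL_MB
    print(f"{label}: {used} MB committed ({share:.0%}), {TOTAL_MB - used} MB left")
```

Under these assumptions the proposed values leave 6016 MB, a little under 6 GB, outside the SGA and PGA targets.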

Source: "ITPUB Blog", link: http://blog.itpub.net/7199859/viewspace-680613/. If you reproduce this article, please credit the source; otherwise legal liability will be pursued.
