oracle不定期的出現當機的問題診斷

rainbowbridg發表於2007-05-17

環境:

AIX5L+HACMP+ORACLE9I


分析alert,沒有發現錯誤:

Fri Apr 27 17:08:05 2007
Thread 1 advanced to log sequence 612
Current log# 2 seq# 612 mem# 0: /oracle2/oracle/oradata/mssdb2/redo02.log
Fri Apr 27 17:08:05 2007
ARC0: Evaluating archive log 1 thread 1 sequence 611
ARC0: Beginning to archive log 1 thread 1 sequence 611
Creating archive destination LOG_ARCHIVE_DEST_1: '/oracle2/oracle/oradata/oraclearchivelog/1_611.dbf'
ARC0: Completed archiving log 1 thread 1 sequence 611
Fri Apr 27 17:18:52 2007
Shutting down instance: further logons disabled
Shutting down instance (immediate)
License high water mark = 117
Waiting for dispatcher 'D000' to shutdown
All dispatchers and shared servers shutdown
Fri Apr 27 17:20:49 2007
ALTER DATABASE CLOSE NORMAL
Fri Apr 27 17:20:49 2007
SMON: disabling tx recovery
SMON: disabling cache recovery
Fri Apr 27 17:20:49 2007
Shutting down archive processes
Archiving is disabled
Fri Apr 27 17:20:49 2007
ARCH shutting down
Fri Apr 27 17:20:49 2007
ARCH shutting down
Fri Apr 27 17:20:49 2007
ARC1: Archival stopped
Fri Apr 27 17:20:49 2007
ARC0: Archival stopped
Fri Apr 27 17:20:49 2007
Thread 1 closed at log sequence 612
Successful close of redo thread 1.
Fri Apr 27 17:20:50 2007
Completed: ALTER DATABASE CLOSE NORMAL
Fri Apr 27 17:20:50 2007
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archiving is disabled
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archiving is disabled
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Fri Apr 27 17:22:39 2007

檢查hacmp的日誌,發現出錯時刻hacmp正在執行切換:

hacmp的out檔案:
:server_restart[144] :server_restart[144] odmget -qmonitor=ora2_db and name=NOTIFY_METHOD HACMPmonitor
:server_restart[144] cut -d" -f2
:server_restart[144] grep value =
NOTIFY=
:server_restart[146] :server_restart[146] cut -d -f1
:server_restart[146] echo
NOTIFY_SCRIPT=
:server_restart[146] [[ -n ]]
:server_restart[162] :server_restart[162] odmget -qmonitor=ora2_db and name=CLEANUP_METHOD HACMPmonitor
:server_restart[162] cut -d" -f2
:server_restart[162] egrep value =
CLEANUP=/hacmpscript/ora2stop-db.sh
:server_restart[163] :server_restart[163] cut -d -f1
:server_restart[163] echo /hacmpscript/ora2stop-db.sh
CLEANUP_SCRIPT=/hacmpscript/ora2stop-db.sh
:server_restart[163] [[ -n /hacmpscript/ora2stop-db.sh ]]
:server_restart[163] [[ -x /hacmpscript/ora2stop-db.sh ]]
:server_restart[167] dspmsg scripts.cat 9334 server_restart: Calling user specified cleanup method /hacmpscript/ora2stop-db.shn server_restart /hacmpscript/ora2stop-db.sh
server_restart: Calling user specified cleanup method /hacmpscript/ora2stop-db.sh
:server_restart[168] /hacmpscript/ora2stop-db.sh
:server_restart[168] ODMDIR=/etc/objrepos

SQL*Plus: Release 9.2.0.1.0 - Production on Wed May 9 16:26:13 2007

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

SQL> Connected.
SQL> Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> Disconnected from Oracle9i Enterprise Edition Release 9.2.0.1.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.1.0 - Production

檢查hacmp的test指令碼,沒有發現問題:

test-db-mssdb2.sh的內容:

#!/bin/sh
#
# Copyright (c) MDCL Inc. All Rights Reserved.
# Created by Li Han Guang 2006.01.16
#
# Should return 0 if the service was start succesfully; otherwise,
# return non-zero
#
# check the following process:
# ora_smon_$sid
# ora_pmon_$sid
# ora_dbw0_$sid
# ora_lgwr_$sid
# ora_reco_$sid
# database connection

ORACLE_SID="mssdb2"
ORACLE_HOME=/oracle2/oracle/product/9.2.0
ORA_USER="oracle2"


#ISQL=svrmgrl
ISQL="sqlplus /nolog"

LOG_FILE="/tmp/test_mssdb2.log"
#
# log messages, need var $LOG_FILE
#
log()
{
echo [`date +"%Y-%m-%d %H:%M:%S"`] $1 >> $LOG_FILE
echo [`date +"%Y-%m-%d %H:%M:%S"`] $1
}

get_status()
{
str=$1

# ps -ef | cut -c52- | grep $str | grep -v grep > /dev/null 2>&1
ps -ef | grep $str | grep -v grep > /dev/null 2>&1
STATUS=$?
case $STATUS in
0) return 0;;
*) return 1;;
esac

}

#
#checks for smon process
#
check_smon()
{
sid=$1
get_status ora_smon_$sid
return $?
}



#
#checks for pmon process
#
check_pmon()
{
sid=$1
get_status ora_pmon_$sid
return $?
}



#
#checks for dbw0 process
#
check_dbw0()
{
sid=$1
get_status ora_dbw0_$sid
return $?
}

#checks for lgwr process
#
check_lgwr()
{
sid=$1
get_status ora_lgwr_$sid
status=$?
return ${status}
}

#
#checks for reco process
#
check_reco()
{
sid=$1
get_status ora_reco_$sid
return $?
}

####################################
# start to check oracle processes
####################################
#checks for database connection
#
su - $ORA_USER -c "${ISQL} > /tmp/$$.out 2> /tmp/$$.err " < connect / as sysdba;
select sysdate from dual;
exit;
EOF
grep SYSDATE /tmp/$$.out >/dev/null 2>&1
if test "$?" -ne 0
then
# cat /tmp/$$.*
log "ORACLE: Cannot connect to database for ORACLE_SID=${ORACLE_SID}."
exit 1
else
log "Oracle instance ${ORACLE_SID} can be connected"
fi
rm -f /tmp/$$.*


check_smon ${ORACLE_SID}
if test "$?" -ne 0
then
log "ORACLE: ora_smon_${ORACLE_SID} is NOT running."
exit 1
else
log "ORACLE: ora_smon_${ORACLE_SID} is running"
fi

check_pmon ${ORACLE_SID}
if test "$?" -ne 0
then
log "ORACLE: ora_pmon_${ORACLE_SID} is NOT running."
exit 1
else
log "ORACLE: ora_pmon_${ORACLE_SID} is running"
fi

check_dbw0 ${ORACLE_SID}
if test "$?" -ne 0
then
log "ORACLE: ora_dbw0_${ORACLE_SID} is NOT running."
exit 1
else
log "ORACLE: ora_dbw0_${ORACLE_SID} is running"
fi

check_lgwr ${ORACLE_SID}
if test "$?" -ne 0
then
log "ORACLE: ora_lgwr_${ORACLE_SID} is NOT running."
exit 1
else
log "ORACLE: ora_lgwr_${ORACLE_SID} is running"
fi

check_reco ${ORACLE_SID}
if test "$?" -ne 0
then
log "ORACLE: ora_reco_${ORACLE_SID} is NOT running."
exit 1
else
log "ORACLE: ora_reco_${ORACLE_SID} is running"
fi

exit 0

根據hacmp的test指令碼檢查test指令碼對應的日誌778500.out發現連線資料庫存在問題:

SQL*Plus: Release 9.2.0.1.0 - Production on Fri Apr 27 17:12:32 2007

Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.

SQL> ERROR:
ORA-12540: TNS:internal limit restriction exceeded

SQL> SP2-0640: Not connected
SQL>

繼續檢查ORACLE資料庫的監聽日誌LISTERNER.LOG,發現如下錯誤:
27-APR-2007 17:10:33 * (CONNECT_DATA=(SID=mssdb2)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.192.139.194)(PORT=40707)) * establish * mssdb2 * 12500
TNS-12500: TNS:listener failed to start a dedicated server process
TNS-12540: TNS:internal limit restriction exceeded
TNS-12560: TNS:protocol adapter error
TNS-00510: Internal limit restriction exceeded
IBM/AIX RISC System/6000 Error: 11: Resource temporarily unavailable
27-APR-2007 17:12:25 * 12502
TNS-12502: TNS:listener received no CONNECT_DATA from client
No longer listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.192.139.202)(PORT=1521)))
27-APR-2007 17:14:43 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=Unicom_ora2)(USER=oracle2))(COMMAND=stop)(ARGUMENTS=64)(SERVICE=LISTENER)(VERSION=153092352)) * stop * 0

TNSLSNR for IBM/AIX RISC System/6000: Version 9.2.0.1.0 - Production on 27-APR-2007 17:22:47

Copyright (c) 1991, 2002, Oracle Corporation. All rights reserved.

最後分析該問題引起的可能原因:
一、 AIX作業系統引數maxuproc:

執行相應的檢查,發現該引數使用的預設值,需要調整:
$ lsattr -El sys0|grep maxuproc
maxuproc 128 Maximum number of PROCESSES allowed per user True

二、 ORACLE資料庫的引數PROCESSES和PGA_AGGREGATE_TARGET

執行相應的檢查,發現該2個引數都為預設值,需要調整:
SQL> show parameter processes

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
aq_tm_processes integer 1
db_writer_processes integer 1
job_queue_processes integer 10
log_archive_max_processes integer 2
processes integer 150

SQL> show parameter pga

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target big integer 25165824

彙總建議,提交客戶。

ref: http://zhouwf0726.itpub.net/post/9689/207598

[@more@]

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/7916042/viewspace-915337/,如需轉載,請註明出處,否則將追究法律責任。

相關文章