Linux主機記憶體溢位導致oracle的SYS使用者無法正常登陸

清風艾艾發表於2015-08-13
    一般情況下,ORACLE DBA看到如下情況的第一反應是,資料庫例項沒有啟動或者是資料庫環境變數沒有設定正確,今天遇到的情況均不是以上兩種情況,有點特別,且來看看為哪般。
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:31:58 2015
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to an idle instance.
SQL> quit
Disconnected.
    檢查資料庫例項是否啟動
oracle@POC-SV12-I2KDB:~> ps -ef|grep ora_
oracle    8799     1  0  2011 ?        02:27:30 ora_pmon_inomc
oracle    8801     1  0  2011 ?        00:00:13 ora_vktm_inomc
oracle    8805     1  0  2011 ?        00:00:26 ora_diag_inomc
oracle    8807     1  0  2011 ?        00:00:53 ora_dbrm_inomc
oracle    8809     1  0  2011 ?        00:07:28 ora_psp0_inomc
oracle    8811     1  0  2011 ?        1-22:14:21 ora_dia0_inomc
oracle    8813     1  0  2011 ?        00:00:22 ora_mman_inomc
oracle    8815     1  0  2011 ?        01:09:25 ora_dbw0_inomc
oracle    8817     1  0  2011 ?        00:39:51 ora_lgwr_inomc
oracle    8819     1  0  2011 ?        04:23:53 ora_ckpt_inomc
oracle    8821     1  0  2011 ?        01:35:38 ora_smon_inomc
oracle    8823     1  0  2011 ?        00:00:05 ora_reco_inomc
oracle    8825     1  0  2011 ?        00:16:42 ora_mmon_inomc
oracle    8827     1  0  2011 ?        02:24:41 ora_mmnl_inomc
oracle    8829     1  0  2011 ?        00:00:07 ora_d000_inomc
oracle    8831     1  0  2011 ?        00:00:07 ora_s000_inomc
oracle    8875     1  0  2011 ?        00:03:26 ora_arc0_inomc
oracle    8877     1  0  2011 ?        00:03:27 ora_arc1_inomc
oracle    8879     1  0  2011 ?        00:03:28 ora_arc2_inomc
oracle    8881     1  0  2011 ?        00:00:16 ora_arc3_inomc
oracle    8883     1  0  2011 ?        00:00:16 ora_fbda_inomc
oracle    8885     1  0  2011 ?        00:00:10 ora_qmnc_inomc
oracle    8901     1  0  2011 ?        00:00:05 ora_q000_inomc
oracle    8903     1  0  2011 ?        00:08:33 ora_q001_inomc
oracle   10895     1  0  2011 ?        01:02:28 ora_cjq0_inomc
oracle   10965     1  0  2011 ?        00:00:26 ora_smco_inomc
oracle   23221 22576  0 11:31 pts/2    00:00:00 grep ora_
    檢查資料庫監聽狀態是否正常
oracle@POC-SV12-I2KDB:~> lsnrctl status
LSNRCTL for Linux: Version 11.1.0.7.0 - Production on 13-AUG-2015 11:31:34
Copyright (c) 1991, 2008, Oracle.  All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 11.1.0.7.0 - Production
Start Date                17-SEP-2011 14:59:03
Uptime                    431 days 15 hr. 36 min. 44 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /opt/oracle/app/product/11g/db/network/admin/listener.ora
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=POC-SV12-I2KDB)(PORT=1521)))
Services Summary...
Service "inomc" has 1 instance(s).
  Instance "inomc", status READY, has 1 handler(s) for this service...
Service "inomcXDB" has 1 instance(s).
  Instance "inomc", status READY, has 1 handler(s) for this service...
Service "inomc_XPT" has 1 instance(s).
  Instance "inomc", status READY, has 1 handler(s) for this service...
The command completed successfully
    看吧,資料庫例項狀態、監聽狀態都是正常的,但是ORACLE資料庫使用者SYS登入就是不能一如往常登入進去。
    再排除一下環境變數設定不當(如果同一資料庫安裝多個例項,ORACLE環境變數又ORACLE_SID配置不當是有可能出現SYS登入空閒例項的)的問題
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db> pwd
/opt/oracle/app/product/11g/db
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db> ps -ef|grep smon
oracle    8821     1  0  2011 ?        01:35:38 ora_smon_inomc
oracle   32267 22576  0 11:36 pts/2    00:00:00 grep smon
oracle@POC-SV12-I2KDB:~> echo $ORACLE_HOME
/opt/oracle/app/product/11g/db
oracle@POC-SV12-I2KDB:~> echo $ORACLE_SID
inomc
    資料庫環境變數也都是正常配置了的,那我接下來嘗試使用業務使用者在本地登入看資料庫報什麼錯誤:
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db/network/admin> sqlplus imap/imap
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:42:21 2015
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
ERROR:
ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 819200
Additional information: 8
Process ID: 0
Session ID: 0 Serial number: 0
    這回很明顯了,資料庫報記憶體溢位了,接下來再看下當前資料庫連線到資料庫例項的程式,由於當前資料庫SYS無法登入,只能ps -ef|grep LOCAL了,結果發現有300個之多。
oracle@POC-SV12-I2KDB:~> ps -ef|grep LOCAL
oracle   28688     1  0  2011 ?        00:03:05 oracleinomc (LOCAL=NO)
oracle   28690     1  0  2011 ?        00:00:00 oracleinomc (LOCAL=NO)
oracle   28737     1  0  2011 ?        00:00:00 oracleinomc (LOCAL=NO)
oracle   28762     1  0  2011 ?        00:00:00 oracleinomc (LOCAL=NO)
oracle   28782     1  0  2011 ?        00:00:16 oracleinomc (LOCAL=NO)
oracle   28785     1  0  2011 ?        00:00:16 oracleinomc (LOCAL=NO)
oracle   28791     1  0  2011 ?        00:00:16 oracleinomc (LOCAL=NO)
oracle   28795     1  0  2011 ?        00:00:16 oracleinomc (LOCAL=NO)
oracle   28798     1  0  2011 ?        00:00:16 oracleinomc (LOCAL=NO)
oracle   28815     1  0  2011 ?        00:00:00 oracleinomc (LOCAL=NO)
oracle   29217     1  0  2011 ?        00:11:59 oracleinomc (LOCAL=NO)
oracle   29226     1  0  2011 ?        00:09:16 oracleinomc (LOCAL=NO)
.
.
.
    再看下資料庫伺服器的記憶體狀態發現,16G的記憶體確實已經用完了,SWAP分割槽也已經使用殆盡了。
oracle@POC-SV12-I2KDB:~> free -t
             total       used       free     shared    buffers     cached
Mem:       8096276    8039268      57008          0     195744    1732592
-/+ buffers/cache:    6110932    1985344
Swap:     16779852   16779844          8
Total:    24876128   24819112      57016
    與現場業務、系統管理員方溝通後決定殺掉LOCAL=NO的程式釋放記憶體。
oracle@POC-SV12-I2KDB:~> kill -9 28688 
oracle@POC-SV12-I2KDB:~> kill -9 28690 
oracle@POC-SV12-I2KDB:~> kill -9 28737 
oracle@POC-SV12-I2KDB:~> kill -9 28762 
oracle@POC-SV12-I2KDB:~> kill -9 28782 
oracle@POC-SV12-I2KDB:~> kill -9 28785 
oracle@POC-SV12-I2KDB:~> kill -9 28791 
oracle@POC-SV12-I2KDB:~> kill -9 28795 
oracle@POC-SV12-I2KDB:~> kill -9 28798 
oracle@POC-SV12-I2KDB:~> kill -9 28815 
oracle@POC-SV12-I2KDB:~> free
             total       used       free     shared    buffers     cached
Mem:       8096276    8044760      51516          0     195652    1730480
-/+ buffers/cache:    6118628    1977648
Swap:     16779852   16754328      25524
oracle@POC-SV12-I2KDB:~> free
             total       used       free     shared    buffers     cached
Mem:       8096276    8033788      62488          0     194856    1686568
-/+ buffers/cache:    6152364    1943912
Swap:     16779852   16753524      26328
oracle@POC-SV12-I2KDB:~> free
             total       used       free     shared    buffers     cached
Mem:       8096276    8028440      67836          0     194856    1686568
-/+ buffers/cache:    6147016    1949260
Swap:     16779852   16753524      26328
oracle@POC-SV12-I2KDB:~> free
             total       used       free     shared    buffers     cached
Mem:       8096276    8006900      89376          0     194856    1686568
-/+ buffers/cache:    6125476    1970800
Swap:     16779852   16753524      26328
    釋放完部分記憶體,再嘗試oracle管理賬戶SYS本地登入,發現SYS使用者能夠正常登入了:
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:48:29 2015
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> 
    後續分析,真的很奇怪,以前遇見過會話數超、程式數超、記憶體溢位SYS無法登入的,但是現狀也只是到如下就卡主(這是比較明顯的,要麼是像業務使用者本地登入一樣直接報記憶體溢位)
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:48:29 2015
Copyright (c) 1982, 2008, Oracle.  All rights reserved.
Connected to:
    由於查殺程式前無法登入資料庫,無法獲知告警日誌的確切位置,殺過程式後,檢視告警日誌,有大量如下報錯:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process 
Errors in file /opt/oracle/app/diag/rdbms/inomc/inomc/trace/inomc_cjq0_10895.trc:
.
.
.
Thu Aug 13 11:45:52 2015
Process W000 died, see its trace file
Process W000 died, see its trace file
Process W000 died, see its trace file
    檢視kkjcre1p: unable to spawn jobq slave process 的官方解釋,說可能是資料庫程式數超了,但是資料庫告警日誌沒有報,並且檢視資料庫程式數設定有1000個之多,會話數設定有1101個之多,根據現有資訊分析,不是資料庫程式數超或會話數超引起的。
    總結:由於資料庫伺服器記憶體小,而資料庫會話數、程式數設定過大,當資料庫程式數急劇增多時,資料庫伺服器記憶體耗盡,導致資料庫管理賬戶SYS無法登陸;比較奇怪的是11G的SYS無法登陸不報伺服器記憶體溢位卻連線空閒例項,而使用業務使用者登入時卻報了明顯的記憶體溢位。



來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/29357786/viewspace-1770478/,如需轉載,請註明出處,否則將追究法律責任。

相關文章