Why an EM Agent Failed to Start: Cause and Analysis
Yesterday I received an alert text message, roughly as follows:
Agent is Unreachable(REASON=javax.net.ssl.SSLPeerUnverifiedException:xxxx.com:cn=xxxxx).Host is unreachable.
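Judging from the message, the agent had stopped working, possibly because of a network problem.
Before long a colleague called and asked me to check whether something was wrong.
I logged in to the server hosting the agent and found that the agent process still existed.
An agent upload attempt failed at this point, so I decided to restart the agent, but the restart attempt reported the following error.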
$ ./emctl start agent
Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.
Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved.
Starting agent ....... failed.
Failed to start HTTP listener.
Consult the log files in: /U01/app/oracle/product/agent10g/sysman/log
I checked the logs in the directory indicated. They show the process exiting on its own, without giving any clear explanation.
41921 :: Wed Sep 30 09:07:34 2015::AgentLifeCycle.pm: Exited loop with retCode=1
9380 :: Wed Sep 30 10:06:27 2015::AgentLifeCycle.pm: Processing status agent
9380 :: Wed Sep 30 10:06:27 2015::AgentStatus.pm:Processing status agent
9380 :: Wed Sep 30 10:06:27 2015::AgentStatus.pm:emdctl status returned 1
22086 :: Wed Sep 30 11:06:50 2015::AgentLifeCycle.pm: Processing status agent
22086 :: Wed Sep 30 11:06:50 2015::AgentStatus.pm:Processing status agent
22086 :: Wed Sep 30 11:06:50 2015::AgentStatus.pm:emdctl status returned 1
861 :: Wed Sep 30 11:40:27 2015::AgentLifeCycle.pm: Processing start agent
861 :: Wed Sep 30 11:40:27 2015::AgentLifeCycle.pm: EMHOME is /U01/app/oracle/product/agent10g
861 :: Wed Sep 30 11:40:27 2015::AgentLifeCycle.pm: service name is
861 :: Wed Sep 30 11:40:28 2015::AgentLifeCycle.pm:status agent returned with retCode=1
861 :: Wed Sep 30 11:40:32 2015::AgentLifeCycle.pm:Watch dog processs id: 906 exited with an exit code of 55
861 :: Wed Sep 30 11:40:32 2015::AgentLifeCycle.pm: Exited loop with retCode=1
The older logs show that this agent had not been in use for a long time, so the problem might not be as simple as I had imagined. Call it a legacy issue.
----- Sat Jun 27 15:09:49 2015::Checking status of EMAgent : 24232 -----
-- Timestamp (2014,07,01,14,53,38) of file /U01/app/oracle/product/agent10g/sysman/emd/upload/rawdata4.dat is more than 24 hours old. Current Time is Sat Jun 27 15:09:52 2015
----- Sat Jun 27 15:09:52 2015::Received restart request from EMAgent : 24232 -----
----- Sat Jun 27 15:09:52 2015::Stopping EMAgent : 24232 -----
----- Sat Jun 27 15:10:02 2015::Failed to reap child process EMAgent : 24232 -----
----- Sat Jun 27 15:10:02 2015::Agent Launched with PID 20174 at time Sat Jun 27 15:10:02 2015 -----
----- Sat Jun 27 15:10:02 2015::Execing EMAgent process is taking longer than expected 120 secs. -----
----- Sat Jun 27 15:10:02 2015::Time elapsed between Launch of Watchdog process and execing EMAgent is 15996749 secs -----
(pid=20174): starting emagent version 10.2.0.5.0
(pid=20174): emagent now exiting abnormally - initialization failure. Consult '.trc' and '.log' files.
----- Sat Jun 27 15:10:05 2015::Checking status of EMAgent : 20174 -----
----- Sat Jun 27 15:10:05 2015::EMAgent exited at Sat Jun 27 15:10:05 2015 with return value 55. -----
----- Sat Jun 27 15:10:05 2015::EMAgent has exited due to initialization failure. -----
----- Sat Jun 27 15:10:05 2015::Stopping other components. -----
----- Sat Jun 27 15:10:05 2015::Commiting Process death. -----
----- Sat Jun 27 15:10:05 2015::Exiting watchdog loop
Next, the most recent logs:
--- Standalone agent
----- Wed Sep 30 11:44:40 2015::Agent Launched with PID 3354 at time Wed Sep 30 11:44:40 2015 -----
----- Wed Sep 30 11:44:40 2015::Time elapsed between Launch of Watchdog process and execing EMAgent is 0 secs -----
(pid=3354): starting emagent version 10.2.0.5.0
(pid=3354): emagent now exiting abnormally - initialization failure. Consult '.trc' and '.log' files.
----- Wed Sep 30 11:44:43 2015::Checking status of EMAgent : 3354 -----
----- Wed Sep 30 11:44:43 2015::EMAgent exited at Wed Sep 30 11:44:43 2015 with return value 55. -----
----- Wed Sep 30 11:44:43 2015::EMAgent has exited due to initialization failure. -----
----- Wed Sep 30 11:44:43 2015::Stopping other components. -----
----- Wed Sep 30 11:44:43 2015::Commiting Process death. -----
----- Wed Sep 30 11:44:43 2015::Exiting watchdog loop
-----
--- Standalone agent
----- Wed Sep 30 11:51:57 2015::Agent Launched with PID 6076 at time Wed Sep 30 11:51:57 2015 -----
----- Wed Sep 30 11:51:57 2015::Time elapsed between Launch of Watchdog process and execing EMAgent is 1 secs -----
(pid=6076): starting emagent version 10.2.0.5.0
(pid=6076): emagent now exiting abnormally - initialization failure. Consult '.trc' and '.log' files.
----- Wed Sep 30 11:52:00 2015::Checking status of EMAgent : 6076 -----
----- Wed Sep 30 11:52:00 2015::EMAgent exited at Wed Sep 30 11:52:00 2015 with return value 55. -----
----- Wed Sep 30 11:52:00 2015::EMAgent has exited due to initialization failure. -----
----- Wed Sep 30 11:52:00 2015::Stopping other components. -----
----- Wed Sep 30 11:52:00 2015::Commiting Process death. -----
----- Wed Sep 30 11:52:00 2015::Exiting watchdog loop
Looking through the other related logs, I found this section.
2015-09-30 13:53:08,557 Thread-37161840 ERROR engine: [oracle_emd,sadb.com:3872,EMDUploadStats] : nmeegd_GetMetricData failed : em_error=failed to get upload statistics: Inappropriate ioctl for device
Error connecting to />
2015-09-30 13:53:08,557 Thread-37161840 WARN collector: <nmecmc.c> Error exit. Error message: em_error=failed to get upload statistics: Inappropriate ioctl for device
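Error connecting to />
This part of the log was about the only one that was both understandable and relevant. So what was going on, and why did the connection attempt fail?
I checked network connectivity on the EM server side, and it was fine.
So let's go step by step and see why this address is unreachable, starting with the old standby, ping.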
# ping sadb.com
PING sadb.com (10.127.xxx.34) 56(84) bytes of data.
64 bytes from sadb.cyou.com (10.127.xxx.34): icmp_seq=1 ttl=64 time=0.185 ms
64 bytes from sadb.cyou.com (10.127.xxx.34): icmp_seq=2 ttl=64 time=0.182 ms
Check the IP configured on the network interface:
# ifconfig
eth0 Link encap:Ethernet HWaddr C8:1F:66:B8:97:xxxx
inet addr:10.127.xxx.71 Bcast:10.127.xxx.255 Mask:255.255.255.0
inet6 addr: fe80::ca1f:66ff:feb8:975a/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:7609183417 errors:0 dropped:0 overruns:0 frame:0
TX packets:6700541034 errors:0 dropped:162 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:957747424255 (891.9 GiB) TX bytes:1573971126949 (1.4 TiB)
Interrupt:35
This is where it gets strange: the hostname being pinged should be the local machine, so why does ping return a different IP?
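A check like this can also be scripted. Below is a minimal Python sketch, not part of the original troubleshooting: it compares the address a hostname resolves to against the addresses configured on the machine. The function name `resolves_to_local`, the `resolve` parameter, and the example hostname and IPs are all made up for illustration; the demo uses a fake resolver so nothing touches the real network.

```python
import socket


def resolves_to_local(hostname, local_addrs, resolve=socket.gethostbyname):
    """Return (ok, resolved_ip); ok is True if hostname resolves to a local address.

    resolve is injectable so the check can be exercised without real lookups.
    """
    resolved = resolve(hostname)
    return resolved in set(local_addrs), resolved


# Hypothetical data mirroring the situation above: the name resolves to .34
# while the interface carries .71.
fake_lookup = {"db.example": "192.0.2.34"}.__getitem__
ok, ip = resolves_to_local("db.example", ["192.0.2.71"], resolve=fake_lookup)
print(ok, ip)  # prints: False 192.0.2.34
```

On a real host you would drop the `resolve` argument and pass in the addresses reported by ifconfig.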
Let's see what the configuration file says.
# cat /etc/hosts
#127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.11.154.95 xxx.no.xxxxx.com
10.127.xxxx.34 sadb.com
127.0.0.1 localhost localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
10.127.xxx.134 adb.com
10.127.xxx.88 sadb3.com
10.127.2.128 sadb2.com
10.127.xxx.71 sadb.com
10.127.2.85 newcomm.com
10.25.36.117 xxx.no.xxx.com
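This listing is quite a test of eyesight.
A sharp eye will spot the configuration problem: sadb.com is mapped to two IPs, one ending in xxx.34 and the other in xxx.71.
It is a very ordinary mistake, but in the context of a specific incident it can easily send the analysis in the wrong direction.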
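Spotting this by eye is error-prone, so here is a small Python sketch (my own illustration, not something used in the original investigation) that lists every hostname mapped to more than one IP in hosts-file text. `find_conflicting_hosts` is a hypothetical helper name; the sample reuses a few of the masked lines from the file above, so the addresses are treated as opaque strings rather than validated.

```python
from collections import defaultdict


def find_conflicting_hosts(hosts_text):
    """Return {hostname: [ips...]} for names that appear with more than one IP."""
    seen = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop comments, then split into the address and its names.
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2:
            continue
        ip, names = entry[0], entry[1:]
        for name in names:
            if ip not in seen[name]:
                seen[name].append(ip)
    return {name: ips for name, ips in seen.items() if len(ips) > 1}


# A few lines copied from the masked file shown above.
SAMPLE = """\
10.127.xxxx.34 sadb.com
127.0.0.1 localhost localhost.localdomain localhost
10.127.xxx.134 adb.com
10.127.xxx.71 sadb.com
"""

for name, ips in find_conflicting_hosts(SAMPLE).items():
    print(name, ips)
```

A name such as localhost that legitimately appears on both an IPv4 and an IPv6 line would also be reported, so the result is a list of candidates to review rather than a verdict.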
Once the hosts file is corrected, ping again and the hostname now resolves to the right address.
# ping sadb.cyou.com
PING sadb.com (10.127.xxx.71) 56(84) bytes of data.
64 bytes from sadb.com (10.127.xxx.71): icmp_seq=1 ttl=64 time=0.026 ms
I did not try to start the agent again at this point. A simple check showed that upload was working again.
$ ./emctl status agent
Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.
Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved.
---------------------------------------------------------------
Agent Version : 10.2.0.5.0
OMS Version : 10.2.0.5.0
Protocol Version : 10.2.0.5.0
Agent Home : /U01/app/oracle/product/agent10g
Agent binaries : /U01/app/oracle/product/agent10g
Agent Process ID : 24232
Parent Process ID : 39174
Agent URL : />
Repository URL : />
Started at : 2015-06-27 13:09:43
Started by user : oracle
Last Reload : 2015-06-27 13:09:43
Last successful upload : 2015-09-30 14:08:02
Total Megabytes of XML files uploaded so far : 897.22
Number of XML files pending upload : 0
Size of XML files pending upload(MB) : 0.00
Available disk space on upload filesystem : 72.83%
Last successful heartbeat to OMS : 2015-09-30 08:17:43
---------------------------------------------------------------
Agent is Running and Ready
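Once the cause was clear, it turned out to be a rather basic mistake. But why did it happen? A failover had been performed and the primary and standby were switched, and the entries in /etc/hosts were only appended to rather than updated.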
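Latent problems like this are small, but when something goes wrong they can easily mislead the investigation, so they deserve particular attention.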
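One way to avoid this kind of leftover is to replace a hostname's mapping instead of appending a new line. The Python sketch below is only an illustration under that assumption; `set_host_ip` is a hypothetical helper, the hostnames and addresses are invented, and it works on text in memory rather than editing /etc/hosts itself.

```python
def set_host_ip(hosts_text, hostname, new_ip):
    """Return hosts_text with hostname mapped only to new_ip."""
    out = []
    for line in hosts_text.splitlines():
        body, sep, comment = line.partition("#")
        fields = body.split()
        if len(fields) >= 2 and hostname in fields[1:]:
            # Remove the name from its old line, keeping any other aliases.
            remaining = [n for n in fields[1:] if n != hostname]
            if not remaining:
                continue  # the line only mapped this hostname, so drop it
            line = " ".join([fields[0]] + remaining)
            if sep:
                line += " " + sep + comment
        out.append(line)
    out.append(f"{new_ip} {hostname}")
    return "\n".join(out) + "\n"


# Hypothetical before/after: the stale .34 entry is replaced, not shadowed.
before = "192.0.2.34 db.example\n127.0.0.1 localhost\n"
print(set_host_ip(before, "db.example", "192.0.2.71"), end="")
```

Commented-out lines are left untouched, since their text before the `#` is empty and never matches.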
Source: ITPUB blog, link: http://blog.itpub.net/23718752/viewspace-1813099/. If you reproduce this article, please credit the source; otherwise legal action will be pursued.