Description of Important Components / Threads in EM GC Agent_1101615.1

Posted by rongshiyuan on 2014-08-10

Description of Important Components / Threads in Enterprise Manager Grid Control Agent (Doc ID 1101615.1)


In this Document

Purpose
Scope
Details
  1. Data Collection Components
  2. Communication Related Components
  3. Control / Infrastructure Related
  4. Miscellaneous
  Agent Threads
References


APPLIES TO:

Enterprise Manager Base Platform - Version 10.1.0.2 to 11.1.0.1 [Release 10.1 to 11.1]
Information in this document applies to any platform.
Checked for relevance on 18-Aug-2013

PURPOSE

This document provides details about the important Components / Threads that are part of the 10g Grid Agent.

SCOPE

For Grid Control Administrators / Users who want to have an overview of the components in a 10g Grid Agent. Understanding the functionality of these sub-systems will be useful in reading the Agent log/trace files.

DETAILS

The 10g Enterprise Manager Agent is written in C and consists of multiple components that work together to provide the monitoring and management capabilities. These components can be classified according to the activities they perform, namely:

1. Data Collection: Metric Engine, Target Manager, Fetchlet Manager, Recvlet Manager, Collection Manager, Scheduler.
2. Communication: HTTP Listener and client, Ping Manager, Upload Manager, EMDClient Interface, Request Dispatcher
3. Control / Infrastructure Related: Reload Manager, emctl, emwd.
4. Miscellaneous: Blackout Manager, Job Engine, Job Auxiliary Programs, Cluster Manager, HealthMonitor

1. Data Collection Components

  • Metric Engine
    • The Metric Engine loads the metric definitions (metadata) of the various target types from the /sysman/admin/metadata directory.
    • For Metrics which are collected via Fetchlets, the Metric Engine is responsible for initiating the metric collection. It identifies the type of fetchlet that needs to be used for that metric collection from the metadata and contacts the Fetchlet Manager to spawn that particular Fetchlet.
    • Includes the Metric Cache, which is responsible for caching the last-fetched data for a metric. This data is used to compute expressions faster. For example, the Host level 'Load' metric uses values from the previous collection to compute values for 'rate' measurements such as 'Total Disk I/O Per Second'.
    • The metric cache value can be queried using:

      emctl status agent mcache ,
    • Tracing parameters in the Agent's emd.properties file:

      tracelevel.engine=WARN
      tracelevel.metadata=WARN
    • Sample entries from the emagent.trc file:
      Thread-3832 WARN engine: File=file:/D:/oracle/gridagent/agent10g/sysman/admin/metadata/oracle_ovf_ivr.xml,line=398: is not valid element, will be ignored
      Thread-3572 ERROR engine: [[oracle_database,V920_agentmachine.domain,db_recTablespaceSettings] : nmeegd_GetMetricData failed : ORA-12541: TNS:no listener
      Thread-3572 ERROR engine: [oracle_database,V920_agentmachine.domain,db_recSegmentSettings] : nmeegd_GetMetricData failed : ORA-12541: TNS:no listener
      ...
      Thread-38669232 ERROR engine: [oracle_ias,EnterpriseManager0.omsmachine.domain,Response] : nmeegd_GetMetricData failed : Exception : Error finding JNI wrapper class definition
      ..
      Thread-2838883232 ERROR engine: [host,agentmachine.domain,UDM] : nmeegd_GetMetricData failed : LOG: Local Authentication Failed...Attempt PAM authentication...PAM failed with error: ERROR: Invalid username and/or password
    • Sample entries from emagent_perl.trc for metric execution:

      emrepresp.pl: : DEBUG: Connectdescriptor (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=em11gc.idc.oracle.com)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=emrep.oracle.com)))
      emrepresp.pl: : DEBUG: sysmantest, /tmp/sysmantest_emrepresp
      emrepresp.pl: : DEBUG: data from repository... 1 1 0 -.0005
      emrepresp.pl: : DEBUG: closing connection
      emrepresp.pl: : DEBUG: emrepresp: Time in emrepresp: 58.9389801025391
      emrepresp.pl: : DEBUG: Total OMSs=1; Active OMSs=1
      lsnrresp.pl: : DEBUG: lsnrdebug :: LSNR_PORT is 1521
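
The Metric Cache idea described above (keeping the last-fetched value so that 'rate' measurements can be derived from two consecutive collections) can be illustrated with a minimal Python sketch. This is not the Agent's implementation, which is written in C; the class name, the key layout and the sample numbers are assumptions made purely for illustration.

```python
class MetricCache:
    """Keeps the last-fetched (timestamp, counter) pair per metric key."""

    def __init__(self):
        self._last = {}

    def rate_per_second(self, key, timestamp, counter):
        """Return the per-second rate since the previous sample.

        Returns None for the first sample, because a rate needs two points.
        """
        previous = self._last.get(key)
        self._last[key] = (timestamp, counter)
        if previous is None:
            return None
        prev_timestamp, prev_counter = previous
        elapsed = timestamp - prev_timestamp
        if elapsed <= 0:
            return None
        return (counter - prev_counter) / elapsed


cache = MetricCache()
# Hypothetical key layout: (target type, target name, metric, column).
key = ("host", "agentmachine.domain", "Load", "total_disk_io")
first = cache.rate_per_second(key, timestamp=0.0, counter=1000)
second = cache.rate_per_second(key, timestamp=300.0, counter=4000)
print(first, second)
```

With the two samples above, the first call has nothing to compare against, and the second call divides the counter difference (3000) by the elapsed time (300 seconds).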
  • Target Manager
    • The Target Manager obtains details about the targets discovered by the Agent from the /sysman/emd/targets.xml file.
    • It encrypts any Credential-related properties in the targets.xml file.
    • Based on the metadata of that target type that has been loaded by the Metric Engine already, the Target Manager computes the applicable Dynamic Properties of the target. 
      For example, for a oracle_database target type, one of the dynamic properties is the version of the database. This property defines the list of metrics that get executed for this particular database, as the same metrics may not be applicable to all versions of the database target.
    • If the dynamic property evaluation fails, then the target can be marked as broken, with errors such as:
      • Required properties not provided
      • Dynamic properties take too long to compute
    • The Target Manager is also responsible for adding, changing and removing targets in the targets.xml file when requested to do so from the console, during installation, or when the agentca command is run.
    • The list of targets being monitored by the Agent can be queried by using:

      emctl config agent listtargets
    • Tracing parameters in the emd.properties file:

      tracelevel.TargetManager=WARN
      tracelevel.targets=WARN
    • Sample entries from the emagent.trc file:

      Thread-2668 INFO TargetManager: save to targets.xml success
      Thread-2668 INFO TargetManager: save to targets.xml success
      Thread-5532 DEBUG TargetManager: No change in metadata/targets.xml/emd.properties: No op reload
      Thread-5532 INFO TargetManager: Trying to set the Thread cap based on number of targets
      ...
      Thread-2855091104 DEBUG targets: In nmedt_addRefCount aa31498 : (oracle_database, orcl.domain), counter: 2, caller: nmeetm.c
      Thread-2855091104 DEBUG targets: nmedt.c : nmedt_Target_getHostTarget : Target orcl.domain Remote Host target agentmachine.domain
      Thread-2855091104 DEBUG targets: In nmedt_addRefCount a41b0d8 : (host, agentmachine.domain), counter: 18, caller: nmedt.c
      Thread-2855091104 DEBUG targets: nmedt.c : Entering function - nmedt_Target_releaseObject (host, agentmachine.domain), addr: a41b0d8, counter: 18, caller: nmeeps.c
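
The role of a dynamic property such as the database version in selecting applicable metrics can be sketched as follows. This is only an illustration in Python: the metric names and version ranges in the catalogue are hypothetical and do not come from any real target-type metadata file.

```python
def parse_version(text, width=5):
    """Turn '10.2.0.5' into a fixed-width tuple such as (10, 2, 0, 5, 0)."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


# Hypothetical catalogue: metric name -> (minimum version, maximum version or None).
METRIC_VERSIONS = {
    "metric_all_versions": ("8.1.7", None),
    "metric_10g_and_later": ("10.1.0", None),
    "metric_pre_10g_only": ("8.1.7", "9.2.0.8"),
}


def applicable_metrics(db_version, catalogue=METRIC_VERSIONS):
    """Return the metrics whose version range contains db_version."""
    version = parse_version(db_version)
    selected = []
    for name, (low, high) in catalogue.items():
        if version < parse_version(low):
            continue
        if high is not None and version > parse_version(high):
            continue
        selected.append(name)
    return sorted(selected)


print(applicable_metrics("9.2.0.1"))
print(applicable_metrics("10.2.0.5"))
```

If the version could not be computed at all, there would be nothing to filter on, which is consistent with the target being marked as broken when dynamic property evaluation fails.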
  • Fetchlet Manager
    • Fetchlets are specialized data access mechanisms used to retrieve data from the monitored targets. The Fetchlet Manager maintains a list of all the fetchlets registered with the agent.
    • When contacted by the Metric Engine with details of the target-type metadata, target properties, fetchlet type, etc., the Fetchlet Manager spawns that particular fetchlet, which evaluates the metric and returns the results to the Metric Engine.
    • The type of fetchlet used for a particular metric can be seen in the target-type specific metadata file in the /sysman/admin/metadata directory. 
      For steps to identify the fetchlet, refer to Note.435975.1: How a particular metric is evaluated?
    • Fetchlets are mostly implemented as C libraries or in Java, depending on the target against which the fetchlet is used. The Java fetchlets are used for collecting metrics from Targets which are themselves written in Java, for example the Oracle Application Server, Oracle Applications Targets, etc.
    • Types of Fetchlets:
      • OS fetchlets: Allow collection of metric data by executing OS commands (either individually or from scripts) that return a standard out (stdout) data stream. They are further classified as:
        • OS Fetchlet
        • OSLine Fetchlet
        • OSLineToken
        • UDM: for User Defined Metric
      • SQL fetchlet: Executes a given SQL statement on a given database as a given user and returns the table result.
      • URL fetchlets
        • HTTP data: Obtains the contents of a given URL and returns them as data.
        • URLTiming: Gets the contents of a given URL, timing not only the base page source but any frames or images in the page as well.
      • For more details on Fetchlets, refer to Oracle Enterprise Manager Extensibility Guide, Fetchlets
    • Tracing parameters in the emd.properties file for C Fetchlets:

      tracelevel.fetchlets=WARN - for Fetchlet Manager
      tracelevel.fetchlets.os=WARN - for OS Commands fetchlet
      tracelevel.fetchlets.sql=WARN - for SQL fetchlet
      tracelevel.fetchlets.UDM=WARN - for User Defined Metrics
      tracelevel.fetchlets.url=WARN - for URL/HTTP operations
      tracelevel.fetchlets.URLTiming=WARN - for URL/HTTP Timings,etc
    • Tracing parameter in the emagentlogging.properties file for Java Fetchlets:

      log4j.rootCategory=WARN, emagentlogAppender, emagenttrcAppender
    • Sample entries from the emagent.trc:

      Thread-81361824 ERROR fetchlets.UDM: Process exitcode = 0, stdout = LOG: Local Authentication Failed...Attempt PAM authentication...PAM failed with error: 
      Thread-81361824 ERROR fetchlets.UDM: Process exitcode = 0, stderr = ERROR: Invalid username and/or password 
      ..
      Thread-99834800 ERROR fetchlets: Exception : Error finding JNI wrapper class definition
      ...
      Thread-54418352 ERROR fetchlets.dms: DMS: Schema - No response from :0
      Thread-35519408 ERROR fetchlets.dms: DMS: Schema - No response from :0
    • Sample entries from the emagentfetchlet.trc:

      [Thread-60816] ERROR ias.ResponseMetric getResponseMetric.153 - Unable to compute application server status
      oracle.sysman.emSDK.emd.fetchlet.FetchletException: oracle.sysman.emSDK.emd.comm.CommException: IOException in sending Request :: Connection refused at oracle.sysman.ias.ias.ResponseMetric.getResponseMetric
      (ResponseMetric.java:107)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ............
      [Thread-2833] WARN ias.util getMetricResultUtil.311 - Error getting metric opmn_process_info Error getting metric opmn_process_info rows is null
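
As a rough illustration of the Fetchlet Manager keeping a registry of fetchlets and invoking one by type, the Python sketch below registers a simplified line-and-token style OS fetchlet. The class and function names, the "em_result=" line prefix and the "|" delimiter are assumptions chosen for the example; the real fetchlets are C libraries or Java classes configured through the target-type metadata.

```python
import subprocess
import sys


def os_line_token_fetchlet(command, delimiter="|", prefix="em_result="):
    """Run an OS command and turn prefixed stdout lines into rows of tokens."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    rows = []
    for line in completed.stdout.splitlines():
        if line.startswith(prefix):
            rows.append(line[len(prefix):].split(delimiter))
    return rows


class FetchletManager:
    """Keeps a registry of fetchlets and invokes the one requested by type."""

    def __init__(self):
        self._fetchlets = {}

    def register(self, fetchlet_type, fetchlet):
        self._fetchlets[fetchlet_type] = fetchlet

    def fetch(self, fetchlet_type, **properties):
        if fetchlet_type not in self._fetchlets:
            raise LookupError(f"no fetchlet registered for {fetchlet_type!r}")
        return self._fetchlets[fetchlet_type](**properties)


manager = FetchletManager()
manager.register("OSLineToken", os_line_token_fetchlet)

# A stand-in for a monitoring script: two result lines and one line of noise.
script = "print('em_result=/dev/sda1|42'); print('noise'); print('em_result=/dev/sdb1|7')"
rows = manager.fetch("OSLineToken", command=[sys.executable, "-c", script])
print(rows)
```

Lines that do not carry the prefix are ignored, so a script can print diagnostics without corrupting the result table.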
  • Recvlet Manager
    • A receivelet (recvlet) is a library that allows the Agent to receive external notifications sent by the managed Targets. These are notifications that are sent asynchronously by the Targets without any request from the Agent. It allows the Target to push data to the Agent, rather than be polled. The Recvlet Manager maintains a list of all the receivelets registered with the agent.
    • Each recvlet on initialization, spawns at least one thread and possibly one per target, which waits for some kind of input from the particular target and forwards the same to the Recvlet Manager.
    • A receivelet may be tightly coupled to a particular type of managed target, or may be useful to a broad range of potential targets. The following receivelets are offered with the Agent:
      • SNMP: Allows the Agent to receive SNMP trap notifications from the SNMP Agent of a third-party network element.
      • Advanced Queue: Allows the Agent to receive notifications from Oracle Databases of version 10g and above. These databases internally monitor and apply thresholds to many of their own performance metrics (called Server-generated Alerts). When the Agent connects to such a Database, it registers itself as a subscriber to the Advanced Queue; thereafter, a copy of each threshold alert is preserved for the registered Agent.
      • HTTP: Allows the Agent to receive notifications from targets that communicate over HTTP or HTTPS protocol.
      • For more details, refer to Oracle Enterprise Manager Extensibility Guide, Receivelets

    • Tracing parameter in the emd.properties:

      tracelevel.recvlets=WARN
      tracelevel.recvlets.snmp=WARN, etc
    • Sample entries from the emagent.trc:

      Thread-68156320 WARN recvlets.aq: no matching registered alert type for instance_efficiency.pxdwngrdserial_pt
      Thread-68156320 WARN recvlets.aq: no matching registered alert type for wait_bottlenecks.userio_wait_cnt
      Thread-68156320 WARN recvlets.aq: no matching registered alert type for wait_bottlenecks.other_wait_cnt
      ...
      Thread-81361824 ERROR recvlets.aq: duplicate registration of metric instance_throughput for target emrep.oracle.com oracle_database
      Thread-81361824 ERROR recvlets.aq: Unable to add metric instance_throughput to AQDatabase [oracle_database emrep.oracle.com] for oracle_database emrep.oracle.com
      Thread-81361824 ERROR recvlets: Error adding metric instance_throughput, target emrep.oracle.com oracle_database, to recvlet AQMetrics
      Thread-81361824 ERROR recvlets.aq: duplicate registration of metric wait_bottlenecks for target emrep.oracle.com oracle_database
      Thread-81361824 ERROR recvlets.aq: Unable to add metric wait_bottlenecks to AQDatabase [oracle_database emrep.oracle.com] for oracle_database emrep.oracle.com
      ...
      Thread-96144288 ERROR recvlets.snmp: Registration oid does not match with Varbind oid
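
The push model described above, where each recvlet runs a thread that waits for input from a target and forwards it to the Recvlet Manager, can be sketched in Python as follows. The queue stands in for whatever channel a target would really use (SNMP trap, Advanced Queue, HTTP), and all names are hypothetical.

```python
import queue
import threading


class RecvletManager:
    """Collects notifications forwarded by the recvlet threads."""

    def __init__(self):
        self.received = []
        self._lock = threading.Lock()

    def deliver(self, recvlet_name, notification):
        with self._lock:
            self.received.append((recvlet_name, notification))


_STOP = object()  # sentinel used to shut the listener down


def start_recvlet(name, source, manager):
    """Spawn a thread that blocks on the source and forwards what arrives."""

    def listen():
        while True:
            item = source.get()  # blocks until the target pushes something
            if item is _STOP:
                break
            manager.deliver(name, item)

    thread = threading.Thread(target=listen, name=f"recvlet-{name}", daemon=True)
    thread.start()
    return thread


manager = RecvletManager()
target_channel = queue.Queue()
listener = start_recvlet("demo", target_channel, manager)

# The "target" pushes notifications without being asked.
target_channel.put("alert: tablespace 91% full")
target_channel.put("alert: tablespace 97% full")
target_channel.put(_STOP)
listener.join(timeout=5)
print(manager.received)
```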
  • Collection Manager
    • The Collection Manager is responsible for applying the thresholds on the metric data that is collected by the Fetchlets/Recvlets and generating the alerts, if any.
    • Input for this component is the collection definition file for the target and the metric data. The collection definition file is an xml file which has the metric threshold and collection interval details.
      • By default, these files are defined one per target type and can be found in /sysman/admin/default_collection/.xml
      • If the Metric collections are manually modified for a target from the Grid Console, then the new settings will be stored in file specific to that target in 
        /sysman/emd/collection/_.xml
    • CollectionItem is the basic unit of scheduled collection. Multiple metrics collected from the same target at the same interval can be collected together.
    • Once data is collected for a CollectionItem, it is evaluated against the thresholds and an alert is generated if required. The Alert can have three states: Clear, Warning, Critical
    • The last evaluated Condition state is stored in /sysman/emd/state/*.dlt (delta files)
    • Tracing parameters in the emd.properties:

      tracelevel.collector=WARN
    • Sample entries from the emagent.trc:

      Thread-54418352 WARN collector: Error exit. Error message: ORA-12505: TNS:listener does not currently know of SID given in connect descriptor
      Thread-60849072 WARN collector: CollectorItem 8ac31f8 : is attached to collection 8b69b30 - current collection is - 8b7ddf8
      Thread-54418352 WARN collector: CollectorItem has been unscheduled but is still being evaluated
      Thread-38669232 WARN collector: Error exit. Error message: Exception : Error finding JNI wrapper class definition
      Thread-54111152 WARN collector: Error exit. Error message: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/log4j/Category
      Thread-99834800 WARN collector: Error exit. Error message: Exception : Error finding JNI wrapper class definition
      Thread-99834800 WARN collector: Error exit. Error message: Exception : Error finding JNI wrapper class definition
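
The threshold evaluation step can be sketched as below. This is a simplified Python illustration that assumes a "greater than or equal" comparison and numeric thresholds; real collection definitions can use other operators, and the function names are hypothetical.

```python
def evaluate(value, warning=None, critical=None):
    """Map a metric value to one of the three alert states."""
    if critical is not None and value >= critical:
        return "Critical"
    if warning is not None and value >= warning:
        return "Warning"
    return "Clear"


def next_state(previous_state, value, warning=None, critical=None):
    """Return (new state, changed?) given the last evaluated state."""
    current = evaluate(value, warning, critical)
    return current, current != previous_state


state = "Clear"
history = []
for sample in (70, 85, 97, 60):
    state, changed = next_state(state, sample, warning=80, critical=95)
    history.append((sample, state, changed))
print(history)
```

Comparing against the previously stored state is what makes it possible to raise an alert only on a transition, which is the reason a last-evaluated state needs to be persisted between collections.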

  • Scheduler
    • The Scheduler maintains a queue sorted by time. The Collection Manager, Ping Manager and Blackout Manager components contact it to enqueue individual records into the queue.
    • When the time comes for the evaluation of a metric, the scheduler dequeues the record and returns the activity to the corresponding sub-system that is mentioned in the record. For example, a record enqueued by the Collection Manager will be dequeued and sent to the Metric Engine.
    • The interval can be of different formats: 
      • Once: happens only once
      • Interval: happens every n minutes/hours/days
      • Week: happens on certain day of week
      • Month: happens on certain day of month
    • When the record is enqueued for the first time, the begin time/end time can be specified for entries which are repetitive, so that the Scheduler itself enqueues the record into the queue depending on the interval. Hence, the Collection Manager / Ping Manager / Blackout Manager will determine the interval and contact the Scheduler only at the time of Agent startup.
    • The HealthMonitor thread checks periodically whether the Scheduler is working or not.
    • The list of scheduled metric collections can be viewed by using:

      emctl status agent scheduler
    • Tracing parameter in the emd.properties:

      tracelevel.scheduler=WARN
    • Sample entries from the emagent.trc file:
      Thread-50863008 INFO scheduler: Scheduler wake up ...
      Thread-50863008 INFO scheduler: Scheduler will wait for 1 seconds ...
      Thread-2816842656 INFO scheduler: nmesse_executeWork: executing Upload Manager
      ....
      Thread-2816842656 INFO scheduler: nmessr_scheduleEntry: schedule first time = 2010-05-14 10:02:40
      Thread-2816842656 INFO scheduler: Added New Schedule entry id=727 
      Thread-2816842656 DEBUG scheduler: DELETEING SCHEDULE 80733d0
      Thread-2816842656 DEBUG scheduler: nmesse_executeWork: done executing Upload Manager
      Thread-2816842656 DEBUG scheduler: nmesse_executeWork: delete Upload Manager
      Thread-2816842656 DEBUG scheduler: DELETEING SCHEDULE ab86790
      ..
      Thread-69761952 DEBUG scheduler: nmesse_executeWork: done executing oracle_database:emrep.oracle.com:health_check
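
A time-ordered queue in which repeating entries are re-enqueued by the scheduler itself can be sketched with a binary heap. This Python sketch is only a model of the behaviour described above; the class, the tuple layout and the use of plain numbers for time are assumptions.

```python
import heapq
import itertools


class Scheduler:
    """Time-ordered queue; repeating records are re-enqueued automatically."""

    def __init__(self):
        self._queue = []
        self._ids = itertools.count(1)  # tie-breaker for equal due times

    def enqueue(self, due, owner, activity, interval=None, end=None):
        record = (due, next(self._ids), owner, activity, interval, end)
        heapq.heappush(self._queue, record)

    def run_until(self, now):
        """Dequeue every record that is due and return what was executed."""
        executed = []
        while self._queue and self._queue[0][0] <= now:
            due, _, owner, activity, interval, end = heapq.heappop(self._queue)
            executed.append((due, owner, activity))
            if interval is not None:
                next_due = due + interval
                if end is None or next_due <= end:
                    self.enqueue(next_due, owner, activity, interval, end)
        return executed


scheduler = Scheduler()
# Enqueued once by the owning component; repetition is handled by the scheduler.
scheduler.enqueue(0, "Collection Manager", "host:Load", interval=300)
scheduler.enqueue(60, "Ping Manager", "heartbeat")
executed = scheduler.run_until(600)
print(executed)
```

Because the repeating record carries its own interval, the owning component only needs to contact the scheduler once, which matches the note that these components contact the Scheduler only at Agent startup.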

2. Communication Related Components

  • HTTP Server/Listener and Client
    • All communication to and from the Grid Agent happens via http(s) protocol. 
    • The Agent has an in-built (not Apache) HTTP Listener component which is responsible for handling all the communication to and from the Agent. This HTTP listener listens to connection requests in a persistent thread and spawns new threads to handle incoming connections. 
    • Other agent components (upload, job, ping) can start outgoing (client) connections themselves, as required.
    • Three URL definitions specified in the emd.properties file control the manner in which the HTTP listener will communicate with the OMS:
      • EMD_URL: used to identify the Agent. The port of this URL will be used to listen for incoming requests coming from the OMS. This URL can be defined with either HTTPS or HTTP, depending on whether the Agent is secured or not. 
      • REPOSITORY_URL: main URL used by the Agent to talk to the OMS. It is used to send all upload files and Agent heartbeat information. This URL can be defined with either HTTPS or HTTP, depending on whether the OMS / Agent is secured or not. 
      • emdWalletSrcURL: used to get the wallet with the SSL key from the OMS, when securing the Agent. Once the Agent is secured, there is no need for the Agent to use this URL anymore, unless the Agent needs to be re-secured.
    • Tracing parameter in the emd.properties:

      tracelevel.http=WARN
      tracelevel.http.client=WARN
      tracelevel.ssl=WARN
      tracelevel.ssl.io=WARN
    • Sample entries from the emagent.trc:

      Thread-37206944 DEBUG http: nmehl_httpListener: Registered HTTPListener activity
      Thread-3086857920 INFO http: nmehl_connect_internal: connected to (omsmachine.domain:1159). fd = 6
      Thread-3086857920 DEBUG http: 6: Initializing SSL connection for HTTPS
      Thread-3086857920 DEBUG http: 6: --> sending request headers
      Thread-3086857920 DEBUG http: 6: sent header, length = 401: "GET /em/upload?ACTION=FIRST_HEARTBEAT&EMD_URL=https%3a%2f%2fomsmachine%2edomain%3a1830%2femd%2fmain%2f&HEARTBEAT_TIME=2010-05-14+09%3a03%3a15&OUTSTANDING_SEVS=TRUE&EMD_UPTIME=2010-05-14+09%3a03%3a15&OLDEST_COLL_TIME=2010-05-14+09%3a03%3a15&INSTALL_TYPE=agent&AGENT_TZ=Asia%2fCalcutta&BOUNCE_CTR=224&X-ORCL-EMOV=4%2e0%2e0&X-ORCL-EMCV=10%2e2%2e0%2e5%2e0&X-ORCL-EMSV=10%2e2%2e0%2e5%2e0 HTTP/1.1"
      Thread-3086857920 DEBUG http: 6: sent header, length = 24: "Connection: Keep-Alive"
      Thread-3086857920 DEBUG http: 6: sent header, length = 29: "Host: omsmachine.domain"
      Thread-3086857920 DEBUG http: 6: sent header, length = 22: "Expect: 100-continue"
      Thread-3086857920 DEBUG http: 6: sent header, length = 59: "X-ORCL-EMUR: https://agentmachine.domain:1830/emd/main/"
      Thread-3086857920 DEBUG http: 6: sent header, length = 47: "X-ORCL-EMAK: E813B89BA9C60B50257D573AFBD70567"
      Thread-3086857920 DEBUG http: 6: Sent empty line header
      Thread-3086857920 DEBUG http: 6: Thread-3086857920 DEBUG http: 6: read line, length = 15: "HTTP/1.1 200 OK"
      Thread-3086857920 DEBUG http: 6: Response Code = 200: "OK"
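
The URL-encoded heartbeat request in the trace above is easier to read once decoded. The Python sketch below uses only the standard library to split an abridged copy of that request line into its parameters; the helper name is hypothetical and the request line is shortened from the sample entry.

```python
from urllib.parse import parse_qs, urlsplit

# Abridged from the "sent header" trace entry shown above.
request_line = (
    "GET /em/upload?ACTION=FIRST_HEARTBEAT"
    "&EMD_URL=https%3a%2f%2fomsmachine%2edomain%3a1830%2femd%2fmain%2f"
    "&HEARTBEAT_TIME=2010-05-14+09%3a03%3a15"
    "&AGENT_TZ=Asia%2fCalcutta&BOUNCE_CTR=224"
    "&X-ORCL-EMCV=10%2e2%2e0%2e5%2e0 HTTP/1.1"
)


def decode_request_line(line):
    """Split an HTTP request line and percent-decode its query parameters."""
    method, target, protocol = line.split(" ")
    parts = urlsplit(target)
    params = {name: values[0] for name, values in parse_qs(parts.query).items()}
    return method, parts.path, params


method, path, params = decode_request_line(request_line)
for name, value in params.items():
    print(f"{name} = {value}")
```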
  • Ping Manager
    • The Ping Manager sends an HTTP heartbeat periodically to the OMS using the URL specified by the REPOSITORY_URL parameter in the emd.properties.
    • If an OMS is NOT receiving these heartbeats in a timely fashion from each of its agents, it will try to reverse ‘ping’ the Agent. If that ping fails, the OMS will mark this Agent and all of its targets with status: ‘Agent Unreachable’.
    • Initial Heartbeat: The very first heartbeat an Agent sends after its startup is treated differently. During this initial exchange of information, the OMS and Agent will negotiate the versions of the protocol to be used based on the version of the Agent and the OMS. This information is stored in the /sysman/emd/protocol.ini file.
    • During the ‘heartbeat’ message, additional information such as the Agent time zone (TZ) setting is also exchanged.
    • The last known heartbeat timestamp is recorded in the file /sysman/emd/agntstmp.txt. This file contains the timestamp as well as the EMD_URL of the Agent, which serves as the main identifier of the Agent.

      Note: Manually deleting this file in a 10.2.0.5 Agent home and re-starting the agent will result in the Agent being blocked from further uploads, in the Grid Console.
    • Tracing parameter in the emd.properties:

      tracelevel.pingManager=WARN
    • Sample entries from the emagent.trc:
      Thread-81361824 DEBUG pingManager: nmepm_pingReposURL: Oldest Response Coll : 2010-05-14 09:04:33
      ...
      Thread-81361824 DEBUG pingManager: nmepm_pingReposURL: HTTP Response code = 200: "OK"
      Thread-81361824 INFO pingManager: nmepm_pingReposURL: X-ORCL-EMSV=10.2.0.5.0
      Thread-81361824 INFO pingManager: nmepm_pingReposURL: HTTP Response = 60
      Thread-81361824 DEBUG pingManager: nmeuvr.c: verStr is 10.2.0.5.0,
      Thread-81361824 DEBUG pingManager: nmeuvr.c: nmeuvr_version_compare returned 6 for 10.2.0.5.0, 4.0.0.0.0
      Thread-81361824 DEBUG pingManager: nmeuvr.c: nmeuvr_version_compare returned 6 for 10.2.0.5.0, 4.0.1.0.0
      ..
      Thread-81361824 DEBUG pingManager: Updating the agent time stamp file ...
      ..
      Thread-81361824 DEBUG pingManager: nmepm.c: lastPingTime is 78, currTime is 91
      Thread-81361824 DEBUG pingManager: Ping Manager status is up and running
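
The OMS-side reaction to missing heartbeats described above (reverse ping first, then 'Agent Unreachable') can be modelled as a small decision function. This Python sketch is illustrative only: the function name, the max_silence parameter and the 'Reachable' label for the non-failing case are assumptions, not OMS internals.

```python
def agent_status(now, last_heartbeat, max_silence, reverse_ping):
    """Decide how an OMS-like monitor would classify an agent.

    reverse_ping is a callable returning True if the agent answered.
    It is only invoked when the heartbeat is overdue.
    """
    if now - last_heartbeat <= max_silence:
        return "Reachable"
    if reverse_ping():
        return "Reachable"
    return "Agent Unreachable"


on_time = agent_status(now=100, last_heartbeat=60, max_silence=120,
                       reverse_ping=lambda: False)
late_but_alive = agent_status(now=400, last_heartbeat=60, max_silence=120,
                              reverse_ping=lambda: True)
late_and_silent = agent_status(now=400, last_heartbeat=60, max_silence=120,
                               reverse_ping=lambda: False)
print(on_time, late_but_alive, late_and_silent)
```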
  • Upload Manager
    • The Upload Manager is responsible for uploading all the target related data collected by the Agent to the OMS.
    • As soon as the /sysman/emd/protocol.ini file is created and populated with all the information, the Upload Manager will get enabled and the Agent will be able to upload new data to the OMS. Without this file, the upload manager is disabled and NO information will get loaded to the OMS.
    • The XML files are sent to the OMS as HTTP requests. There are 3 forms of data generated by the Agent:
      • Metadata: Definitions of what it is monitoring and how (A-files)
      • State information: Condition of the targets and current state of all the metrics (B-files)
      • Metric Data: All datapoints generated by the metrics (C-, D- and E- files)
    • The Upload information is maintained in /sysman/emd/lastupld.xml
    • Tracing parameter in the emd.properties:

      tracelevel.upload=WARN
    • Sample entries from the emagent.trc:
      Thread-2864540576 INFO upload: Upload manager reload
      Thread-2864540576 INFO upload: Upload Manager Starting upload
      Thread-2864540576 DEBUG upload: Value of omsRecvDir: ""
      Thread-2864540576 INFO upload: Upload interval set at 15 minutes
      Thread-2864540576 INFO upload: Upload retry interval set at 5 seconds
      Thread-2864540576 INFO upload: Upload file recount interval set at 60 minutes
      Thread-2864540576 INFO upload: Upload Timeout Set at 1800 seconds
      Thread-2864540576 INFO upload: Upload Max Time Set at 3600 minutes
      ...
      Thread-2855091104 DEBUG upload: FxferSend: received SUCCESS in header from repository URL: https://em11gc.idc.oracle.com:1159/em/upload
      Thread-2855091104 DEBUG upload: nmehum_uploadOneXMLFile Closed file A0000016.xml, ret = 0
      Thread-2855091104 INFO upload: Successfully upload to https://em11gc.idc.oracle.com:1159/em/upload: direct-load : A0000016.xml, size = 12863, time(milliseconds) = 194, rate(kilobytes/second) = 64.750121
      Thread-2855091104 DEBUG upload: nmehum_uploadDataFiles : Uploading data files asynchronously
      Thread-2855091104 DEBUG upload: Done uploading. retcode = 0
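
The file-prefix convention and the upload rate reported in the trace can be checked with a few lines of Python. The sketch assumes the first letter of the file name identifies the kind of data, as listed above, and that one kilobyte is 1024 bytes; the function names are hypothetical.

```python
FILE_KINDS = {
    "A": "Metadata",
    "B": "State information",
    "C": "Metric Data",
    "D": "Metric Data",
    "E": "Metric Data",
}


def classify_upload_file(filename):
    """Map an upload file name to the kind of data it carries."""
    return FILE_KINDS.get(filename[:1].upper(), "Unknown")


def upload_rate_kb_per_second(size_bytes, elapsed_ms):
    """Kilobytes per second, assuming 1 KB = 1024 bytes."""
    return (size_bytes / 1024) / (elapsed_ms / 1000)


kind = classify_upload_file("A0000016.xml")
# Values taken from the "Successfully upload" trace entry above.
rate = upload_rate_kb_per_second(size_bytes=12863, elapsed_ms=194)
print(kind, round(rate, 6))
```

With size = 12863 bytes and time = 194 milliseconds, the computed rate agrees with the 64.750121 kilobytes/second shown in the sample entry.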


For more details on the above three communication components, also refer to 
Note.1084777.1: Description of Important Communication Components in a 10g Enterprise Manager Grid Control Agent 

  • EMDClient / Remote API
    • The OMS sends requests to the Agent in the form of XML elements over HTTP via the EMDClient API. For example: manual addition of a target in the console.
    • The agent processes the requests and sends the responses via the Dispatcher layer.
    • Tracing parameter in the emd.properties:

      tracelevel.dispatcher=WARN
  • Request Dispatcher
    • When a request comes to the Agent, the HTTP Listener launches a thread and within this thread calls the Request Dispatcher component.
    • It parses the incoming XML data (EMDRequest) using the EMDClient API and identifies the sub-component, to which the request has to be sent. 
      For example: details related to target addition will go to the Target Manager, Blackout details will go to the Blackout Manager
    • This is also responsible for sending back the response from the Agent component to the OMS. For example, once the target addition to the targets.xml file is done, this sends back the confirmation of the activity to the OMS in the form of xml data (EMDResponse)
    • Tracing parameter in the emd.properties:

      tracelevel.dispatcher=WARN
    • Sample entries from the emagent.nohup: 
      Request TS:
      Thread - 131128224

      From emagent.trc, checking for the above thread ID : 131128224:

       Thread-131128224 INFO Dispatcher: nmemdisp.c: Entering nmemdisp_Dispatcher_main
      Thread-131128224 INFO Dispatcher: nmemdisp_Dispatcher_main: Request Api number: 10640 ID: 344544 Tag Name RemoteStreamOperationReq 
      Thread-131128224 DEBUG Dispatcher: Request ID = 344544, type = 10640, Timeout = -1
      Thread-131128224 DEBUG Dispatcher: Adding wrapper context for request ID = 344544, batch = 0, type = 10640
      Thread-131128224 DEBUG Dispatcher: nmemdisp_Dispatcher_startActivity: Registering Activity RemoteStreamOperationReq
      Thread-131128224 DEBUG Dispatcher: nmemdisp_Dispatcher_startActivity: Started Activity RemoteStreamOperationReq
      Thread-131128224 DEBUG Dispatcher: nmemdisp.c: using delimitedIS = delimitedIS
      Thread-131128224 INFO Dispatcher: nmemdisp.c: Entering nmemdisp_StreamOpReq
      Thread-131128224 DEBUG Dispatcher: nmemdisp_Dispatcher_endActivity: End Activity RemoteStreamOperationReq
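
Routing a request to a sub-component based on the XML it carries can be sketched as follows. In this Python illustration the request tag AddTargetReq, the EMDResponse element layout and the handler are all hypothetical; only the general pattern (parse, identify the sub-component, send back a response) follows the description above.

```python
import xml.etree.ElementTree as ET


class RequestDispatcher:
    """Routes a request to the handler registered for its root tag."""

    def __init__(self):
        self._handlers = {}

    def register(self, tag, handler):
        self._handlers[tag] = handler

    def dispatch(self, request_xml):
        root = ET.fromstring(request_xml)
        handler = self._handlers.get(root.tag)
        response = ET.Element("EMDResponse")  # hypothetical response layout
        if handler is None:
            response.set("status", "error")
            response.text = f"no handler for {root.tag}"
        else:
            response.set("status", "ok")
            response.text = handler(root)
        return ET.tostring(response, encoding="unicode")


def add_target(request):
    """Stand-in for the Target Manager handling a target addition."""
    return f"added {request.get('type')} {request.get('name')}"


dispatcher = RequestDispatcher()
dispatcher.register("AddTargetReq", add_target)  # hypothetical request tag

ok = dispatcher.dispatch('<AddTargetReq name="orcl.domain" type="oracle_database"/>')
unknown = dispatcher.dispatch("<UnknownReq/>")
print(ok)
print(unknown)
```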

3. Control / Infrastructure Related

  • Reload Manager
    • Is responsible for reloading any changed configuration data without requiring an Agent bounce.
    • Handles all reload requests from command-line, Remote API, or Agent component
    • The operation is Serialized, so only one thread is reloading the particular Component at any time.
    • The Agent Components are always reloaded in the same order as they are initialized at Agent startup. This helps ensure that all dependent components are also reloaded correctly.
    • The Reload manager checks the filesystem timestamps of the files managed by a particular component to verify whether that component needs to be reloaded. 
      For example, the Target Manager will be reloaded if the targets.xml file changes.
    • Tracing parameter in the emd.properties:

      tracelevel.reload=WARN
    • Sample entries from the emagent.trc:
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading ping manager
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading metric engine
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading target manager
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading coll manager
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading recvlet manager
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading blackout manager
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading lctx in subsystems
      Thread-2810174368 DEBUG reload: nmermgr_reload : Reloading upload manager
      Thread-2810174368 DEBUG reload: nmermgr_reload: updating lastStatus_nmerctx to 0
      Thread-2810174368 DEBUG reload: nmermgr_reload: returning 0
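
The timestamp-based reload check can be sketched in Python as below. The sketch assumes one managed file per component, walks the components in registration (initialization) order and uses a lock so that only one reload pass runs at a time; the class and the temporary file are illustrative only.

```python
import os
import tempfile
import threading


class ReloadManager:
    """Reloads components, in registration order, when their file changes."""

    def __init__(self):
        self._components = []          # kept in initialization order
        self._lock = threading.Lock()  # serializes reload passes

    def register(self, name, path, reload_callback):
        self._components.append(
            {"name": name, "path": path, "reload": reload_callback,
             "mtime": os.path.getmtime(path)}
        )

    def reload_changed(self):
        reloaded = []
        with self._lock:
            for component in self._components:
                current = os.path.getmtime(component["path"])
                if current != component["mtime"]:
                    component["reload"]()
                    component["mtime"] = current
                    reloaded.append(component["name"])
        return reloaded


with tempfile.TemporaryDirectory() as workdir:
    targets_file = os.path.join(workdir, "targets.xml")
    with open(targets_file, "w") as handle:
        handle.write("<Targets/>")

    reload_log = []
    manager = ReloadManager()
    manager.register("target manager", targets_file,
                     lambda: reload_log.append("target manager"))

    unchanged = manager.reload_changed()
    stamp = os.path.getmtime(targets_file) + 10
    os.utime(targets_file, (stamp, stamp))  # simulate an edit to the file
    changed = manager.reload_changed()

print(unchanged, changed)
```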
  • Control Utility
    • 'emctl' is the command-line interface to control the Agent operations.
    • The emctl (shell script) launches emctl.pl (Perl script), which in turn launches emdctl (executable).
    • For more details, refer to Note 397228.1 : Details About 'emctl' Script and Steps to Enable Tracing 
    • Sample entries in the emdctl.trc:

      Thread-3086857920 WARN http: nmehl_connect_internal: connect failed to (agentmachine.domain:1830): Connection refused (error = 111)
      Thread-3086857920 ERROR main: nmectla_agentctl: Error connecting to https://agentmachine.domain:1830/emd/main/. Returning status code 1

      The above entries usually occur at Agent startup time, when the Agent watchdog is already pinging the Agent URL while the Agent is not fully initialized yet. These entries are normal and do not indicate any problem.
    • Sample entries from emctl.log:

      21817 :: ::AgentLifeCycle.pm: Processing start agent
      21817 :: ::AgentLifeCycle.pm: EMHOME is /home/oracle/OracleHomes/agent10g
      21817 :: ::AgentLifeCycle.pm: service name is
      21817 :: ::AgentLifeCycle.pm:status agent returned with retCode=1
      21817 :: ::AgentLifeCycle.pm: Exited loop with retCode=3
      27814 :: ::AgentLifeCycle.pm: Processing status agent
      27814 :: ::AgentStatus.pm:Processing status agent
  • Watchdog script (emwd)
    • This is a watchdog process started by the emdctl executable.
    • The watchdog is responsible for starting the Agent (emagent) process.
    • It then periodically checks:
      • Agent process exists
      • Agent responds to ‘emctl status agent’
    • If the emagent process exits / crashes without a clean exit status, the watchdog script restarts the Agent.
    • If the emagent process is alive but is not responding, the watchdog kills (after taking cores/dumps if possible) and restarts the Agent.
    • In a Unix / Linux OS, the details of this process can be seen using the 'ps' command:

      $ ps -ef | grep emwd
      em 6544 1 0 Oct01 ? 00:00:03 /home/em/oracle/gc102/agent10g/perl/bin/perl /home/em/oracle/gc102/agent10g/bin/emwd.pl agent /home/em/oracle/gc102/agent10g/sysman/log/emagent.nohup
      em 6706 6677 0 09:13 pts/3 00:00:00 grep emwd
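
The watchdog's periodic checks amount to a small decision table, sketched here in Python. The function and its return strings are hypothetical; the real emwd is a Perl script, and the assumption that a clean exit is left alone is inferred from the statement that only an unclean exit triggers a restart.

```python
def watchdog_action(process_exists, responds, clean_exit=False):
    """Return the action a watchdog would take for the observed agent state."""
    if not process_exists:
        # A deliberate, clean shutdown is left alone; a crash is not.
        return "none" if clean_exit else "restart"
    if not responds:
        # The process is alive but hung: collect diagnostics, then recycle it.
        return "dump, kill and restart"
    return "none"


healthy = watchdog_action(process_exists=True, responds=True)
crashed = watchdog_action(process_exists=False, responds=False)
stopped = watchdog_action(process_exists=False, responds=False, clean_exit=True)
hung = watchdog_action(process_exists=True, responds=False)
print(healthy, crashed, stopped, hung, sep=" | ")
```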

4. Miscellaneous

  • Blackout Manager
    • Manages (adds, updates, deletes) the Blackout information stored in /sysman/emd/blackouts.xml. This file is read at the time of agent start (and on reload if the blackout sub-system is modified).
    • A Blackout can be immediate or scheduled. If scheduled, they can be one-time or repeated at intervals. Also, the Node-level blackout affects all targets monitored by the agent.
    • The Scheduler consults the Blackout Manager and, if the target is currently blacked-out, the collection is skipped.
    • When the blackout manager receives details for a scheduled blackout with start and end time, it contacts the scheduler and enqueues 2 records into its queue, one each for start and end times. Hence, it is possible that the "emctl status agent scheduler" still shows the metric collection as scheduled even though there is a blackout.
    • When the scheduler dequeues the 'start blackout' record, it identifies that this record is from the blackout manager and contacts it for details. The actual 'start blackout' is implemented at this time and the scheduler skips all metric collections for the target, during the blackout period.
    • When the blackout receives a forceful 'Stop blackout' command, then it contacts the scheduler to identify the end-blackout record and remove it from the queue. The blackout end details are then updated in the blackouts.xml file.
    • For Immediate blackouts, the start blackout is started by the blackout manager without adding an entry in the scheduler.
    • The status of the blackouts can be viewed by using:

      emctl status blackout
    • Tracing parameter in the emd.properties:

      tracelevel.blackout=WARN
    • Sample entries from the emagent.trc:
      Thread-2810174368 INFO blackouts: blackout scheduled for -test with sched id 1064
      Thread-2810174368 INFO blackouts: scheduler returned 1 for -test
      Thread-2810174368 INFO blackouts: Returned from BLACKOUT RELOAD
      Thread-2810174368 DEBUG blackouts: Uploading information for cli blackout: -test
      Thread-2810174368 INFO blackouts: nmebb: updating blackout info for tgtName= agentmachine.domain, tgtType = host, to 1, ds=0
      ..
      Thread-88169376 INFO blackouts: nmebmgr: Removing blackout 0A699D1C843C23446B8C8F91F0070D94
      Thread-88169376 INFO blackouts: nmebmgr:Removing Reference in Targets table for 0A699D1C843C23446B8C8F91F0070D94
      Thread-88169376 INFO blackouts: Unscheduling the entries for blackout -test 
      Thread-88169376 DEBUG blackouts: DELETEING SCHEDULE 9fd06f8
      Thread-88169376 DEBUG blackouts: Entering parallel upload of tgt state
      Thread-2810174368 INFO blackouts: nmebb: updating blackout info for tgtName= agentmachine.domain, tgtType = host, to 0, ds=1
      Thread-2810174368 INFO blackouts: nmebb: discarding severity state for tgtName= agentmachine.domain, tgtType = host, 
      Thread-2810174368 DEBUG blackouts: Exiting nmeupe_doTask doUploadState counter=0
      Thread-88169376 DEBUG blackouts: Exiting parallel upload of tgt state
      Thread-88169376 DEBUG blackouts: DELETEING SCHEDULE ad6f7c8
      Thread-88169376 INFO blackouts: nmebmgr: Removing blackout 0A699D1C843C23446B8C8F91F0070D94
      Thread-88169376 DEBUG blackouts: Blackout 0A699D1C843C23446B8C8F91F0070D94 does not exist 
      Thread-88169376 DEBUG blackouts: nmebb_schedule : Remove blackout -test as the schedules have expired
      Thread-88169376 INFO blackouts: Returned from BLACKOUT RELOAD
      Thread-88169376 DEBUG blackouts: Uploading information for cli blackout: -test
      Thread-88169376 INFO blackouts: nmebb: updating blackout info for tgtName= agentmachine.domain, tgtType = host, to 0, ds=1
      Thread-88169376 INFO blackouts: nmebb: discarding severity state for tgtName= agentmachine.domain, tgtType = host, 
      Thread-88169376 INFO blackouts: save to blackouts.xml succeeded
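
The interaction between the Blackout Manager and the Scheduler described above can be sketched roughly as follows. This is an illustrative model only, under stated assumptions: the class names, method names and the integer timestamps are hypothetical and do not reflect the Agent's internal (C) implementation or the format of blackouts.xml.

```python
import heapq
import itertools


class SchedulerQueue:
    """Toy time-ordered queue standing in for the Agent Scheduler."""

    def __init__(self):
        self._heap = []
        self._ids = itertools.count(1)
        self._cancelled = set()

    def enqueue(self, when, kind, name):
        """Add a record and return its schedule id."""
        sched_id = next(self._ids)
        heapq.heappush(self._heap, (when, sched_id, kind, name))
        return sched_id

    def cancel(self, sched_id):
        """Mark a record as removed; it is skipped when dequeued."""
        self._cancelled.add(sched_id)

    def pop_due(self, now):
        """Return (kind, name) for every live record with time <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, sched_id, kind, name = heapq.heappop(self._heap)
            if sched_id not in self._cancelled:
                due.append((kind, name))
        return due


class BlackoutManager:
    """Tracks active blackouts and the end records it has scheduled."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.active = set()
        self._end_ids = {}

    def schedule(self, name, start, end):
        # A scheduled blackout enqueues two records: one start, one end.
        self.scheduler.enqueue(start, "start", name)
        self._end_ids[name] = self.scheduler.enqueue(end, "end", name)

    def start_immediate(self, name):
        # Immediate blackouts start without a 'start' scheduler entry.
        self.active.add(name)

    def handle(self, kind, name):
        """Called when the scheduler dequeues a blackout record."""
        if kind == "start":
            self.active.add(name)
        else:
            self.active.discard(name)
            self._end_ids.pop(name, None)

    def force_stop(self, name):
        # A forced stop removes the pending end record and ends the blackout.
        end_id = self._end_ids.pop(name, None)
        if end_id is not None:
            self.scheduler.cancel(end_id)
        self.active.discard(name)
```

Note that in this model a scheduled blackout only becomes active when its start record is dequeued, which mirrors why the scheduler status can still list a collection as scheduled before that point.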
  • Job System / Job Engine
    • When the OMS needs to send the EM Job step details to the Agent, it sends EMDClient request via the HTTP Listener, which forwards the xml data to the Request Dispatcher. The Request Dispatcher parses this data and then forwards the details to the Job Engine.
    • The Job Engine is responsible for executing the command details sent by the OMS.
    • All commands need ‘authentication’ for the Agent to be able to execute them. Hence, the job details will include the target credentials which need to be used.
    • The result from the command execution is not saved on the Agent filesystem but is sent back immediately to the OMS via the Request Dispatcher.
    • For long-running jobs, the Agent will send periodic updates to the OMS to report the status of the job, and push any new output the command has produced. The output will be sent to the OMS in pieces.
    • It is also possible for two Agents to communicate with each other via the EMD_URL, to execute a job or exchange data. For example: patching of a RAC setup from the console, Cloning operations, Data Guard setup, etc.
    • Tracing parameter in the emd.properties:

      tracelevel.Authentication=DEBUG
      tracelevel.command=DEBUG
      tracelevel.Dispatcher=DEBUG


      For Authentication on Windows NT, the following can also be used:

      nmotracing=TRUE
    • Sample entries from the emagent.nohup:

      Request TS: 2010-05-18 14:01:43
      Thread - 97618848

      The emagent.nohup shows the job request sent by the OMS with the details of the command to be executed.
    • Sample entries in the emagent.trc for the above Thread ID: 97618848
      Thread-97618848 INFO command: launched process 15791 with 3 pipes attempting to run perl
      Thread-97618848 INFO command: Child named perl (former pid=15791) has exited with status=0
      Thread-97618848 INFO Dispatcher: nmemdisp.c: Entering nmemdisp_Dispatcher_main
      Thread-97618848 INFO Dispatcher: nmemdisp_Dispatcher_main: Request Api number: 10640 ID: 344808 Tag Name RemoteStreamOperationReq 
      Thread-97618848 DEBUG Dispatcher: Request ID = 344808, type = 10640, Timeout = -1
      Thread-97618848 DEBUG Dispatcher: Adding wrapper context for request ID = 344808, batch = 0, type = 10640
      ...
      ....
      Thread-97618848 INFO Authentication: Default nmo binary has setuid permissions
      Thread-97618848 DEBUG Authentication: nmejcap.c: argv = 
      Thread-97618848 DEBUG Authentication: nmejcap.c: 0: /bin/sh
      Thread-97618848 DEBUG Authentication: nmejcap.c: 1: -c
      Thread-97618848 DEBUG Authentication: nmejcap.c: 2: ls
      Thread-97618848 DEBUG Authentication: nmejcap.c :_adjustArgsForPDP args[0] = /bin/sh
      Thread-97618848 DEBUG Authentication: nmejcap.c :_adjustArgsForPDP args[1] = -c
      Thread-97618848 DEBUG Authentication: nmejcap.c :_adjustArgsForPDP args[2] = ls
      Thread-97618848 DEBUG Authentication: nmejcap_Process_remote_new: nmejcap: Local process
      Thread-97618848 DEBUG command: created extra pipe 45:46
      Thread-97618848 INFO command: launched process 15810 with 3 pipes attempting to run nmo
      Thread-97618848 DEBUG command: Closing fd = 46, ret = 0
      Thread-97618848 DEBUG Authentication: nmejcap_Process_run : PROCESS_TYPE (LOCAL_RUN) RETURN_FROM_PROCESS_RUN (0)
      ..
      Thread-3073325984 DEBUG Dispatcher: nmemdisp_addAbortCallback 344808 - abort1 = ac41238, acb=a14ae0
      Thread-3073325984 DEBUG Dispatcher: Adding abort context for ID = 344808
      Thread-3073325984 DEBUG command: Added abort callback successfully for ID = 344808, Synchronous = 0
      Thread-3073325984 DEBUG Authentication: nmejcap.c: buf='agntstmp.txt
      bcn_cleanLogs.log
      blackouts.xml
      chronos
      ...
      upload
      '
      ...
      Thread-3073325984 DEBUG command: nmejcn: HTTP response code 200: "OK" from https://omsmachine.domain:1159/em/upload
      Thread-3073325984 DEBUG command: nmejcn: received SUCCESS in header from repository at https://omsmachine.domain:1159/em/upload; notification delivered
      Thread-3073325984 DEBUG Authentication: nmejcap.c: buf=''
      ..
      Thread-3073325984 INFO command: Child named nmo (former pid=15810) has exited with status=0
      ..
      Thread-97618848 INFO command: launched process 15871 with 3 pipes attempting to run perl
      Thread-97618848 INFO command: Child named perl (former pid=15871) has exited with status=0

  • Job Auxiliary Programs

    These are the programs used by the Job engine for the actual command execution:
    • nmo: Executable launched by Agent, responsible for authenticating OS user credentials, impersonating user, and running command.
    • nmosudo: Acts as a proxy for the Agent to ensure that all commands that are run using sudo/pbrun are 'trusted', that is, they came from the Agent binary and not from someone running the command from the command line.
    • nmocat: Program which can transfer the content of a file to or from the Agent in case of communication between two Agents.
  • Cluster Manager
    • Used for composite targets, where the member targets can be present on multiple hosts.
      Examples are: OS level cluster target, RAC Database, Oracle Application Server, etc.
    • In this case, the composite target, which is common across all the nodes, needs to be monitored by only one particular Agent at any point in time. This Agent is known as the Master Agent; the remaining Agents are known as standby Agents. The Master Agent monitors all the targets on its own host plus the composite target.
    • Methods available for choosing the Master Agent:
      • AgentMediated - The Agents decide among themselves which Agent will be responsible for monitoring the composite target. The Cluster Manager tracks whether this Agent is currently the Master Agent. The Master Agent is identified by the "IS_MASTER=TRUE" entry in the Agent's targets.xml file.
      • OMSMediated - OMS decides which Agent becomes the Master Agent based on a pre-determined algorithm.
  • HealthMonitor
    • The Health Monitor provides functionality to call other modules back if they have not completed an operation within a certain timeframe. It periodically checks whether all the active / dynamic threads of the Agent are working. If it notices any abnormality, it tries to ping that particular thread.
    • The Health Monitor does not have knowledge of the other subsystems.
      • The other components subscribe to the Health Monitor, telling it to perform certain actions if the operation has not completed in the time specified.
      • If there is no response after a pre-determined number of attempts, the Health Monitor initiates a kill of the emagent process. The Agent watchdog (emwd) then spawns a new emagent process, as the earlier process ended abruptly.
    • Components that use the Health Monitor include:
      • The metric engine (for metric timeouts)
      • EMDClient API
    • Sample entries from the emagent.trc:
      2010-05-14 21:08:40,161 Thread-2851941280 DEBUG HealthMonitor: Added new HealthMonitor entry id=4574, msg= ping 
      2010-05-14 21:08:40,161 Thread-34057120 DEBUG HealthMonitor: HealthMonitor wakeup by notify ...
      2010-05-14 21:08:40,161 Thread-34057120 DEBUG HealthMonitor: there are 1 items in queue
      2010-05-14 21:08:40,161 Thread-34057120 DEBUG HealthMonitor: HealthMonitor will sleep for 900 seconds ...
      2010-05-14 21:08:40,222 Thread-2851941280 DEBUG HealthMonitor: nmeshm_deregisterEntry: 4574
      2010-05-14 21:08:40,222 Thread-34057120 DEBUG HealthMonitor: HealthMonitor wakeup by notify ...
      2010-05-14 21:08:40,222 Thread-34057120 DEBUG HealthMonitor: there are 0 items in queue
      2010-05-14 21:08:40,222 Thread-34057120 DEBUG HealthMonitor: HealthMonitor will sleep until there is task..
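
The subscribe / deregister pattern described above can be sketched as a small registry. This is a minimal sketch under assumptions, not the Agent's implementation: the class and method names are hypothetical, the monitor only invokes a callback on timeout (it does not ping threads or kill a process), and the clock is injectable purely so the behaviour can be exercised without waiting.

```python
import time


class HealthMonitor:
    """Registry of operations that must finish before a deadline."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}   # entry id -> (deadline, callback, message)
        self._next_id = 1

    def register(self, timeout_seconds, on_timeout, msg=""):
        """Subscribe an operation; returns the id used to deregister it."""
        entry_id = self._next_id
        self._next_id += 1
        deadline = self._clock() + timeout_seconds
        self._entries[entry_id] = (deadline, on_timeout, msg)
        return entry_id

    def deregister(self, entry_id):
        """Called by the subscriber when its operation completed in time."""
        self._entries.pop(entry_id, None)

    def check(self):
        """Fire callbacks for every expired entry; return their ids."""
        now = self._clock()
        expired = [eid for eid, (deadline, _, _) in self._entries.items()
                   if deadline <= now]
        for eid in expired:
            _, on_timeout, msg = self._entries.pop(eid)
            on_timeout(eid, msg)
        return expired

    def seconds_until_next(self):
        """Time until the earliest deadline, or None if the queue is empty."""
        if not self._entries:
            return None
        earliest = min(deadline for deadline, _, _ in self._entries.values())
        return max(0.0, earliest - self._clock())
```

The `seconds_until_next` helper corresponds loosely to the "will sleep for 900 seconds" and "will sleep until there is task" messages in the trace sample: the monitor sleeps until the nearest deadline, or indefinitely when nothing is registered.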

As seen from the above, the Agent log trace will have entries from many threads interspersed with each other. Not every line in the log/trace file is an error message or indicative of a problem. There are a lot of informational messages which are logged to depict the Agent operations. When reading the Agent log / trace, it is necessary to know the operation / activity for which the entries are being checked.
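
Because entries from different threads are interleaved, it can help to regroup a trace by thread ID before reading it. The following is a small sketch that assumes only the "Thread-<id>" token visible in the samples above; the function name is hypothetical, and lines without a thread token (such as multi-line output) are attached to the most recently seen thread.

```python
import re
from collections import defaultdict

# Matches the thread token seen in emagent.trc samples, e.g. "Thread-97618848".
_THREAD_RE = re.compile(r"\bThread-(\d+)\b")


def group_by_thread(lines):
    """Group trace lines by thread id, preserving the original order."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        match = _THREAD_RE.search(line)
        if match:
            current = match.group(1)
        if current is not None:
            # Continuation lines inherit the last thread id that was seen.
            groups[current].append(line.rstrip("\n"))
    return dict(groups)
```

For example, `group_by_thread(open("emagent.trc"))["97618848"]` would return only the lines belonging to that thread, assuming the file is in the current directory.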

Agent Threads

Most of the above components run inside their own thread; for the other components, threads are spawned as needed.

Threads which are always active:

  • Main thread: Core of the Agent.
  • One HTTP Listener thread, which will accept all requests coming in from the OMS
  • One Scheduler thread, which will make sure the Uploads, Heartbeats and metric execution operations are executed on time.
  • One Job thread, which will create a new job thread for each job the Agent has to execute
  • HealthMonitor thread for monitoring all the above threads.

Additional threads:

  • One thread for every 10g database. This is the AQ recvlet thread, which will listen for incoming RDBMS generated alerts for this database.
  • A new thread is created by the Scheduler for each metric that needs to get executed. The number of threads allowed for metrics is controlled by the ThreadPoolModel.
  • The Job thread will also create new threads for each job that needs to get executed.
  • The HTTP listener will also create a new thread for each request coming in.
  • Reload Manager thread, etc.
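
The idea of capping the number of metric threads can be illustrated with a bounded thread pool. This is a generic sketch only: it does not reproduce the Agent's ThreadPoolModel settings, and the function name and `max_threads` parameter are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_collections(collections, max_threads):
    """Run zero-argument callables on at most max_threads worker threads.

    Results are returned in the same order as the input callables.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(fn) for fn in collections]
        return [future.result() for future in futures]
```

With a pool like this, extra collections simply wait for a free worker instead of each getting a new thread.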

REFERENCES

NOTE:1082009.1 - Master Note for 10g Grid Control Agent Process Control (Start, Stop & Status) & Configuration
NOTE:1084777.1 - Description of Important Communication Components in a 10g Enterprise Manager Grid Control Agent
NOTE:1087997.1 - Master Note for 10g Enterprise Manager Grid Control Agent Performance & Core Dump issues
NOTE:229624.1 - Enterprise Manager Grid Control Agent 11g - Locate and Manage the Log and Trace Files
NOTE:234872.1 - Understanding the Enterprise Manager 10g and 11g Grid Control Management Agent, Directory Structure and Key Configuration Files.
NOTE:235290.1 - Understanding the Enterprise Manager Management Agent 10g 'emd.properties' File
NOTE:397228.1 - Details About 'emctl' Script and Steps to Enable Tracing
NOTE:435975.1 - How a Metrics Are Evaluated


Source: "ITPUB Blog", link: http://blog.itpub.net/17252115/viewspace-1248729/. Please credit the source when reposting; otherwise legal liability will be pursued.