【MOS】Cluster Health Monitor (CHM) FAQ (文件 ID 1328466.1 ID 2062234.1)

lhrbest發表於2017-01-12

11gR2 新特性:Oracle Cluster Health Monitor(CHM)簡介


Cluster Health Monitor(以下簡稱CHM)是一個Oracle提供的工具,用來自動收集作業系統的資源(CPU、記憶體、SWAP、程式、I/O以及網路等)的使用情況。CHM會每秒收集一次資料。

   這些系統資源資料對於診斷叢集系統的節點重啟、Hang、例項驅逐(Eviction)、效能問題等是非常有幫助的。另外,使用者可以使用CHM來及早發現一些系統負載高、記憶體異常等問題,從而避免產生更嚴重的問題。

CHM會自動安裝在下面的軟體:
    11.2.0.2 及更高版本的 Oracle Grid Infrastructure for Linux (不包括Linux Itanium) 、Solaris (Sparc 64 和 x86-64)
    11.2.0.3 及更高版本 Oracle Grid Infrastructure for AIX 、 Windows (不包括Windows Itanium)。

    在叢集中,可以通過下面的命令檢視CHM對應的資源(ora.crf)的狀態:
    $ crsctl stat res -t -init
    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                   STATE_DETAILS       Cluster Resources
ora.crf        ONLINE  ONLINE       rac1


CHM主要包括兩個服務:
    1). System Monitor Service(osysmond):這個服務在所有節點都會執行,osysmond會將每個節點的資源使用情況傳送給cluster logger service,後者將會把所有節點的資訊都接收並儲存到CHM的資料庫。
      $ ps -ef|grep osysmond
       root      7984     1  0 Jun05 ?        01:16:14 /u01/app/11.2.0/grid/bin/osysmond.bin

    2). Cluster Logger Service(ologgerd):在一個叢集中的,ologgerd 會有一個主機點(master),還有一個備節點(standby)。當ologgerd在當前的節點遇到問題無法啟動後,它會在備用節點啟用。

     主節點:
     $ ps -ef|grep ologgerd
       root      8257     1  0 Jun05 ?        00:38:26 /u01/app/11.2.0/grid/bin/ologgerd -M -d      /u01/app/11.2.0/grid/crf/db/rac2

     備節點:
      $ ps -ef|grep ologgerd
       root      8353     1  0 Jun05 ?        00:18:47 /u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d
/u01/app/11.2.0/grid/crf/db/rac1

CHM Repository:用於存放收集到資料,預設情況下,會存在於Grid Infrastructure home 下 ,需要1 GB 的磁碟空間,每個節點大約每天會佔用0.5GB的空間。 您可以使用OCLUMON來調整它的存放路徑以及允許的空間大小(最多隻能儲存3天的資料)。

下面的命令用來檢視它當前設定:

     $ oclumon manage -get reppath
       CHM Repository Path = /u01/app/11.2.0/grid/crf/db/rac2
       Done

     $ oclumon manage -get repsize
       CHM Repository Size = 68082 <====單位為秒
       Done

     修改路徑:
     $ oclumon manage -repos reploc /shared/oracle/chm

     修改大小:
     $ oclumon manage -repos resize 68083 <==在3600(小時) 到 259200(3天)之間
      rac1 --> retention check successful
      New retention is 68083 and will use 1073750609 bytes of disk space
      CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
      Done

獲得CHM生成的資料的方法有兩種:
     1. 一種是使用Grid_home/bin/diagcollection.pl:
        1). 首先,確定cluster logger service的主節點:
         $ oclumon manage -get master
         Master = rac2

        2).用root身份在主節點rac2執行下面的命令:
         # /bin/diagcollection.pl -collect -chmos -incidenttime inc_time -incidentduration duration
         inc_time是指從什麼時間開始獲得資料,格式為MM/DD/YYYY24HH:MM:SS, duration指的是獲得開始時間後多長時間的資料。

         比如:# diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome  /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201215:30:00 -incidentduration 00:05

       3).執行這個命令之後,CHM的資料會生成在檔案chmosData_rac2_20120615_1537.tar.gz。

    2. 另外一種獲得CHM生成的資料的方法為oclumon:
        $oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]

        -s表示開始時間,-e表示結束時間
       $ oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt

       $ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00" >/tmp/chm1.txt
       $ oclumon dumpnodeview -allnodes -last "00:15:00" >/tmp/chm1.txt


下面是/tmp/chm1.txt中的部分內容:
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo:168880
----------------------------------------

SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 mcache: 1064024 swapfree: 3988376 swaptotal: 4192956 ior: 57 io
w: 59 ios: 10 swpin: 0 swpout: 0 pgin: 57 pgout: 59 netr: 65.767 netw: 34.871 procs: 183 rtprocs: 10 #fds: 4902 #sysfdlimit: 6815744
 #disks: 4 #nics: 3  nicErrors: 0

TOP CONSUMERS:
topcpu: 'mrtg(32385) 64.70' topprivmem: 'ologgerd(8353) 84068' topshm: 'oracle(8760) 329452' topfd: 'ohasd.bin(6627) 720' topthread:
 'crsd.bin(8235) 44'

PROCESSES:

name: 'mrtg' pid: 32385 #procfdlimit: 65536 cpuusage: 64.70 privmem: 1160 shm: 1584 #fd: 5 #threads: 1 priority: 20 nice: 0
name: 'oracle' pid: 32381 #procfdlimit: 65536 cpuusage: 0.29 privmem: 1456 shm: 12444 #fd: 32 #threads: 1 priority: 15 nice: 0
...
name: 'oracle' pid: 8756 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2892 shm: 24356 #fd: 47 #threads: 1 priority: 16 nice: 0

----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo:168878
----------------------------------------

SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072 physmemtotal: 2065856 mcache: 1005636 swapfree: 3991808 swaptotal: 4192956 ior: 54 io
w: 104 ios: 11 swpin: 0 swpout: 0 pgin: 54 pgout: 104 netr: 77.817 netw: 33.008 procs: 178 rtprocs: 10 #fds: 4948 #sysfdlimit: 68157
44 #disks: 4 #nics: 4  nicErrors: 0

TOP CONSUMERS:
topcpu: 'orarootagent.bi(8490) 1.59' topprivmem: 'ologgerd(8257) 83108' topshm: 'oracle(8873) 324868' topfd: 'ohasd.bin(6744) 720' t
opthread: 'crsd.bin(8362) 47'

PROCESSES:

name: 'oracle' pid: 9040 #procfdlimit: 65536 cpuusage: 0.19 privmem: 6040 shm: 121712 #fd: 33 #threads: 1 priority: 16 nice: 0
...


  關於CHM的更多解釋,請參考Oracle官方文件:
  http://docs.oracle.com/cd/E11882_01/rac.112/e16794/troubleshoot.htm#CWADD92242
  Oracle? Clusterware Administration and Deployment Guide
  11g Release 2 (11.2)
  Part Number E16794-17

  或者 My Oracle Support文件:
  Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)





Cluster Health Monitor (CHM) FAQ (文件 ID 1328466.1)

In this Document

Purpose
Questions and Answers
  What is the Cluster Health Monitor?
  What is the purpose of the Cluster Health Monitor?
  What platform does Cluster Health Monitor support and where can I get the Cluster Health Monitor?
  What is the resource name for Cluster Health Monitor in 11.2.0.2 or higher?
  Is stop/start ora.crf affecting clusterware function or cluster database function?
  Can the Cluster Health Monitor be installed on a single node, non-RAC server?
  Do Engineered Systems like Exadata have a default usage with CHM and if so, any specific version??
  Where is oclumon?
  How do I collect the Cluster Health Monitor data?
  Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?
  How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?
  What is IPD/OS?
  How is the Cluster Health Monitor different from OSWatcher?
  Is the Cluster Health Monitor replacing OSWatcher?
  How much of overhead does the Cluster Health Monitor cause?
  Does CHM on Multiple Node configurations (e.g. 4 to 8 nodes) have scaling concerns?
  Will CDB and PDB result in any new information or special conditions using CHM?
  How much of disk space is needed for the Cluster Health Monitor?
  How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?
  How can I increase the size of the Cluster Health Monitor repository ?
  What platforms can I run the Cluster Health Monitor?
  What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?
  Where does the Cluster Health Monitor from OTN installed in Linux?
  What logs and data should I gather before logging a SR for the Cluster Health Monitor error?
  How do I increase the trace level the Cluster Health Monitor?
  Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?
  What are the processes and components for the Cluster Health Monitor?
  What is oclumon?
  What is definition of some of the files like *.bdb, _db.* , *.ldb , log.* files created by tool in the BDB (Berkeley Database) location directory ?
  Because it takes many days / weeks to resolve a problem like the node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed any time later when needed ?
  Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?
  How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?
  Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?
  Where is the trace file for client like oclumon? How do I increase the trace level for oclumon?
  Can the Directory path to the CHM Repository be same on all nodes if shared storage is used?
  How much of data (how long in time) does the node store CHM data locally when it cannot communicate with the master?
  How often does CHM collect the system metric data? Can this be changed?
  What is the default CHM retention time? 
  How can you reduce the size of bdb file that became big for any reason?
  Can you set up CHM to run locally on each node?
  Can CHM be used on a single node non-RAC server?
  How to start and stop CHM that is installed as a part of GI in 11.2 and higher?
  Database - RAC/Scalability Community
References


APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 12.1.0.2 [Release 10.1 to 12.1]
Information in this document applies to any platform.

PURPOSE


 

The Cluster Health Monitor FAQ is an evolving document that answers common questions about the Cluster Health Monitor

QUESTIONS AND ANSWERS

What is the Cluster Health Monitor?

The Cluster Health Monitor collects OS statistics (system metrics) such as memory and swap space usage, processes, IO usage, and network related data. The Cluster Health Monitor collects information in real time and usually once a second. The Cluster Health Monitor collects OS statistics using OS API to gain performance and reduce the CPU usage overhead. The Cluster Health Monitor collects as much of system metrics and data as feasible that is restricted by the acceptable level of resource consumption by the tool.

What is the purpose of the Cluster Health Monitor?

The Cluster Health Monitor is developed to provide system metrics and data for troubleshooting many different types of problems such as node reboot and hang, instance eviction and hang, severe performance degradation, and any other problems that need the system metrics and data. 

By monitoring the data constantly, users can use the Cluster Health Monitor detect potential problem areas such as CPU load, memory constraints, and spinning processes before the problem causes an unwanted outage.

What platform does Cluster Health Monitor support and where can I get the Cluster Health Monitor?

The Cluster Health Monitor is NOT supported on Linux Itanium,  and IBM Linux Z and HP-UX.

The Cluster Health Monitor is integrated part of 11.2.0.2 Oracle Grid Infrastructure for Linux (not on Linux Itanium and IBM Linux Z) and Solaris (Sparc 64 and x86-64 only), so installing 11.2.0.2 Oracle Grid Infrastructure on those platforms will automatically install the Cluster Health Monitor. AIX will have the Cluster Health Monitor starting from 11.2.0.3. The Cluster Health Monitor is also enabled for Windows (except Windows Itanium) in 11.2.0.3.

Prior to 11.2.0.2 on Linux (not on Linux Itanium and IBM Linux Z), the Cluster Health Monitor can be downloaded from OTN.

http://www-content.oracle.com/technetwork/products/clustering/downloads/ipd-download-homepage-087212.html

The OTN version for Windows is not available.  Please upgrade to 11.2.0.3 if you need CHM for Windows.

What is the resource name for Cluster Health Monitor in 11.2.0.2 or higher?

ora.crf is the Cluster Health Monitor resource name that ohasd manages. Issue “crsctl stat res –t –init” to check the current status of the Cluster Health Monitor.

Is stop/start ora.crf affecting clusterware function or cluster database function?

No, stop/start ora.crf resource will stop and start Cluster Health Monitor and its data collection, it will not affect clusterware or database functionality.

Can the Cluster Health Monitor be installed on a single node, non-RAC server?

The Cluster Health Monitor Standalone for LINUX x86 and x86-64 can be downloaded from OTN, it can be installed on a single node, non-RAC server without the need to install Grid Infrastructure or CRS. For other platform, it is required to install Grid Infrastructure or CRS to get CHM/OS.

Do Engineered Systems like Exadata have a default usage with CHM and if so, any specific version??

Engineered systems use the default GI stack that includes CHM functionality as shipped on standard platforms. At this time there are no specific extensions for Engineered systems but this may change in future releases.

Where is oclumon?

If the CHM is installed as a part of 11.2 installation on the supported platform, then the location of oclumon is in GI_HOME/bin directory. 

If the CHM is manually installed using the CHM file from OTN, then the location of oclumon is in: 
Linux : /usr/lib/oracrf/bin 
Windows : C:\Program Files\oracrf\bin

How do I collect the Cluster Health Monitor data?

As grid user, using command “/bin/diagcollection.pl --collect --chmos” will produce output for all data that is collected in the repository. There may be too much data and may take long time, so the suggestion is limit the query to an interesting time interval.

For example, issue “/bin/diagcollection.pl --collect --crshome $ORA_CRS_HOME --chmos --incidenttime --incidentduration 05:00”

The above outputs the report that covers 5 hours from the time specified by incidenttime.
The incidenttime must be in MM/DD/YYYYHH:MN:SS where MM is month, DD is date, YYYY is year, HH is hour in 24 hour format, MN is minute, and SS is second. For example, if you want to put the incident time to start from 10:15 PM on June 01, 2011, the incident time is 06/01/201122:15:00. The incidenttime and incidentduration can be changed to capture more data.

Alternatively, ‘oclumon dumpnodeview -allnodes -v -last "11:59:59" > your-filename’ if diagcollection.pl fails with any reason. This will generate a report from the repository up to last 12 hours. The -last value can be changed to get more or less data.

Another example of using oclumon is 'oclumon dumpnodeview -allnodes -v -s "2012-06-01 22:15:00" -e "2012-06-02 03:15:00" > /tmp/chm.log '.  The difference in this command is that it specifies the start (-s flag) and end time (-e flag).
In this case, the time format used is "YYYY-MM-DD HH24:MI:SS" like "2007-11-12 23:05:00".

Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?

This is due to bug 10048487 that affects 11.2.0.2. As a result, the bug in the script causes the diagcollection.pl to never be able to retrieve the master node.

The workaround for this is to issue 
oclumon dumpnodeview -allnodes -v -last “amount of data needed”
For example, oclumon dumpnodeview -allnodes -v -last “01:00:00”
will provide last one hour of data from all nodes.

How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?

Issue “/bin/diagcollection.pl –h” and “oclumon –h”. You may need to drill down further to get information for different options.

What is IPD/OS?

The IPD/OS is an old name for the Cluster Health Monitor. The names can be used interchangeably although Oracle now calls the tool Cluster Health Monitor.

How is the Cluster Health Monitor different from OSWatcher?

OSWatcher collects OS statistics by running regular unix commands such as vmstat, top, ps, iostat, netstat, mpstat, and meminfo. The private.net file can be configured in OSWatcher to issue traceroute command over the private interconnect to test the private interconnect. OSWatcher also runs in a user priority, so OSWatcher often cannot run when CPU load is heavy.

Is the Cluster Health Monitor replacing OSWatcher?

The Cluster Health Monitor has many advantages over OSWatcher, and the most significant is that the Cluster Health Monitor runs in real time and usually once a second, so the Cluster Health Monitor will collect data even when OSWatcher cannot. However, there are some information such as top, traceroute, and netstat that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor while running OSWatcher is ideal. Both tools complement each other rather than supplement.
On the other hand, if only one of the tools can be used, then Oracle recommends that the Cluster Health Monitor is used.

How much of overhead does the Cluster Health Monitor cause?

In today's server environment, the Cluster Health Monitor uses approximately less than 3% of the server's capacity for CPU. The overhead of using the Cluster Health Monitor is minimal.  However. CHM on the server with large number of disks or IO devices and more CPUs/memory would use more CPU than CHM on a server that does not have many disks and CPUs/memory.

Does CHM on Multiple Node configurations (e.g. 4 to 8 nodes) have scaling concerns?

CHM functionality is designed to scale automatically with the cluster. While each node hosts an osysmond daemon, the ologgerd daemon services multiple osysmonds. Should a cluster grow large enough another ologgerd daemon is spawned to manage the increased load. The user is responsible for increasing the CHM data repository size as nodes are added to ensure sufficient retention time is maintained. This is recommended to be 72 hours.

Will CDB and PDB result in any new information or special conditions using CHM?

As CHM is collecting OS metrics there currently are no CDB or PDB specific metrics collected. There are currently no special conditions that are triggered when hosting a multitenant (CDB) database.

How much of disk space is needed for the Cluster Health Monitor?

The Cluster Health Monitor takes up 1GB space by default on all nodes in the cluster. The approximate amount of data collected is 0.5 GB per node per day. The size of the repository can increase to collect and save data up to 3 days, and this will increase the disk usage appropriately.

How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?

“oclumon manage -get repsize” will show the size in seconds.
To estimate the space required, use the following formula:

# of nodes * 720MB * 3 = Size required for 3 days retention  
eg. for 4 node cluster: 4 * 720 * 3 = 8,640MB (8.4GB)

How can I increase the size of the Cluster Health Monitor repository ?

“oclumon manage -repos resize ”. Setting the value to 259200 will collect and save the data for 72 hours (3 days). It is recommended to set 72 hours of retention based on above formula. This space needs to be available on all node in the cluster. Please resize the repositories or moving them if necessary in order to achieve 72 hours of retention.

What platforms can I run the Cluster Health Monitor?

11.2.0.1 and earlier: Linux only (download from OTN) 
11.2.0.2: Solaris (Sparc 64 and x86-64 only), and Linux. 
11.2.0.3: AIX, Solaris (Sparc 64 and x86-64 only), Linux, and Windows.

Cluster Health Monitor is NOT available for any Itanium platform such as Linux Itanium and Windows Itanium.

What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?

Remove the Cluster Health Monitor from OTN before upgrading the CRS or installing Grid Infrastructure.

Where does the Cluster Health Monitor from OTN installed in Linux?

$CRF_HOME is set /usr/lib/oracrf on Linux by default if the Cluster Health Monitor is from OTN. This is the Cluster Health Monitor home location.

What logs and data should I gather before logging a SR for the Cluster Health Monitor error?

1) provide 3-4 pstack outputs over a minute for osysmond.bin
2) output of strace -v for osysmond.bin about 2 minutes.
3) strace -cp for about 2 min
4) oclumon dumpnodeview -v output for that node for 2 min.
5) output of "uname -a"
6) outpuft of "ps -eLf | grep osysmond.bin"
7) The ologgerd and sysmond log files in the CRS_HOME/log/ directory from all nodes

How do I increase the trace level the Cluster Health Monitor?

Increase the log level for the daemons using,
oclumon debug log all allcomp:

Higher the trace level, more detailed tracing is done, so do not forget to reset the trace level back to 1 (the default trace level when the CHM is first installed) by issuing "oclumon debug log all allcomp:1"

Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?

Procwatcher version 030810 can now be used to monitor IPD procs. Just add the proc names to the CLUSTERPROCS list. The change is that Procwatcher is now smarter about picking the path of the executable so now it can find the IPD daemons if it is looking for them.

What are the processes and components for the Cluster Health Monitor?

Cluster Logger Service (Ologgerd) – there is a master ologgerd that receives the data from other nodes and saves them in the repository (Berkeley database). It compresses the data before persisting to save the disk space. In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd is not running. The master ologgerd will sync the data with replica ologgerd by sending the data to the replica ologgerd. The replica ologgerd takes over if the master ologgerd dies. A new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster. 

System Monitor Service (Sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects the system statistics including CPU, memory usage, platform info, disk info, nic info, process info, and filesystem info.

To find the master olggerd, one can use the following command:
oclumon manage -get master

What is oclumon?

OCLUMON command-line tool - use oclumon command line to query the CHM repository to display node-specific metrics for a specified time period. 

You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period. These states are based on predefined thresholds for each resource metric and are denoted as red, orange, yellow, and green, indicating decreasing order of criticality.

What is definition of some of the files like *.bdb, _db.* , *.ldb , log.* files created by tool in the BDB (Berkeley Database) location directory ?

*.bdb & _db.* - These are files created for the berkeley db which stores the data collected.

log.* - These are berkeley bdb logfiles which preserve changes before making them to the db files. We have checkpointing setup and it reuses the log files.

*.ldb - This is the local logging file and MUST be present on all servers.

Do not delete above files except in case of trying to reduce the size of bdb file that get grow to a large size.  To reduce the size of bdb file, refer to the question "How can you reduce the size of bdb file that became big for any reason?" in this document.

Because it takes many days / weeks to resolve a problem like the node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed any time later when needed ?

The Cluster Health Monitor is designed to store data up to 3 days as best as it can by increasing the size of the repository up to 2GB. If you want to store data more than that, one way is to zip the output from ‘oclumon dumpnodeviews’ or ‘diagcollection’ regularly (like every hour).

Before 12.1.0.2, another way is to archive the whole BDB regularly (like every day) by making a copy of BDB file in the BDB location directory.

The way that CHMOS reads archived BDB is to start it in debug mode. It starts by using
ologdbg -d
After it starts, issue the oclumon dumpnodeview to get the data from the archived BDB.
For example, issue
oclumon dumpnodeview -n -s -e -v

Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?

Check directory /usr/lib/oracrf/log/* for the alert.log and other subdir for each daemons (SYSMOND, LOGGERD, OPROXYD) log.

How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?

The time in the repository is in UTC, and by default, oclumon shows the time in UTC. Check README, it shows UTC if ORACRF_TZ not set. Setting ORACRF_TZ should fix the time zone issue.

Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?

You cannot install CHM from OTN if there is any conflicting install, so installing CHM from OTN on servers that has 11.2.0.2 Grid Infrastructure will not work.  Disabling CHM resource (ora.crf) on 11.2.0.2 will still keep the installation; hence, OTN install will fail.

Where is the trace file for client like oclumon? How do I increase the trace level for oclumon?

The 'log' file for oclumon is in log//clients/oclumon.log.

Generally its not generated because, at the log level 0, there is no log data. 
To see logs at higher log level one needs to do the following
1. oclumon [Enter the interactive mode]
2. query> debug log all allcomp:3

After this, any command execution will produce finer logs in oclumon.log

Can the Directory path to the CHM Repository be same on all nodes if shared storage is used?

One can set CHM repository at a shared storage under the same directory although it is recommended not to do so. One reason is the performance issue. In such a case, each node’s repository location is under the directory named as its hostname.

How much of data (how long in time) does the node store CHM data locally when it cannot communicate with the master?

The local repository size is small for nodes that need to save the local CHM data when it cannot communicate with the master.

With a sampling interval of 1 second, ideally it will be around 1 hour of data. With 11.2.0.3, we have moved to sampling interval of 5 seconds, hence, in that case the data that can be retained is 4-5 hours of data.

How often does CHM collect the system metric data? Can this be changed?

In pre-11.2.0.3, the CHM collection interval is usually once a second, but this can change depending on the the amount of data getting collected.  In 11.2.0.3, the CHM collection interval is changed to once every 5 seconds.

Currently, the collection interval can not be changed.

What is the default CHM retention time? 

In pre-11.2.0.2 CHM available from OTN, the default data retention time was 24 hours. 

In 11.2.0.2, the retention time is determined by the size.   The default size has changed to 1GB. Depending on how large the cluster is, the default retention time is different.  For example, it is usually 6.9 hours for a one-node cluster when sampling interval is 1 second.   Please issue "oclumon manage -get repsize" to find out the retention time of your cluster.  The output is in seconds.

With sampling interval moving to 5 seconds in 11.2.0.3, the retention time becomes 5 times retention time with sampling interval 1 second.

It is recommended to set 72hours retention time.

How can you reduce the size of bdb file that became big for any reason?

You can manage repository size in terms of space using below command. This feature is present from 11.2.0.3.

oclumon manage -repos changesize .

As a temporary work around, you can kill ologgerd and delete the contents in the BDB directory. osysmond should respawn ologgerd and new bdb file will get created. The past data is lost when this is done.

Please note the minimum size must be >= 1024 MB (1 GB), otherwise CRS-9100 "Error setting Cluster Health Monitor repository size" will be reported.

Can you set up CHM to run locally on each node?

On OTN, one can do that by installing CHM on each node independently although it is not recommended.

The Cluster Health Monitor that comes with the Grid Infrastructure install image must run with only one master ologgerd, so it can not be set  up to run locally on each node.

Can CHM be used on a single node non-RAC server?

The CHM available on OTN can be used on a single node non-RAC server, but only Linux and Windows version of CHM is available from OTN.  The CHM that comes with GI in 11.2 and higher must run with GI (RAC)

How to start and stop CHM that is installed as a part of GI in 11.2 and higher?

The ora.crf resource in 11.2 GI (and higher) is the resource for CHM, and the ora.crf resource is managed by ohasd. Starting and stopping ora.crf resource starts and stops CHM. 

To stop CHM (or ora.crf resource managed by ohasd) 
$GRID_HOME/bin/crsctl stop res ora.crf -init 

To start CHM (or ora.crf resource managed by ohasd) 
$GRID_HOME/bin/crsctl start res ora.crf -init


 

Database - RAC/Scalability Community

To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Database - RAC/Scalability Community

 


 

How to relocate CHM repository and increase retention time (文件 ID 2062234.1)

In this Document

Goal
Solution
  11.2
  12.1
References


APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

GOAL

Often CHM data ages out when if not collected on time, this note provides steps to increase the retention time which is strongly recommended.

SOLUTION

11.2

In 11.2, the repository of CHM is in Grid home, to change the retention time: 

/bin/oclumon manage -repos resize 259200
racnode1 --> retention check successful
racnode2 --> retention check successful
New retention is 259200 and will use 4525424640 bytes of disk space

CRS-9115-Cluster Health Monitor repository size change completed on all nodes.

Done

Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days.

 

In case there's insufficient amount of space in Grid home, relocate CHM data with the following command:

$ /bin/oclumon manage -repos reploc /home/grid/chm
racnode1 --> Ready to commit new location
racnode2 --> Ready to commit new location
New retention is 259200 and will use 4525424640 bytes of disk space

CRS-9113-Cluster Health Monitor repository location change completed on all nodes. Restarting Loggerd.

Done

12.1

In 12c, the repository of CHM is GIMR which is a database, only retention time can be changed. To change the retention time: 

1. Check how much space is needed for the expected retention time:

$ /bin/oclumon manage -repos checkretentiontime 259200
The Cluster Health Monitor repository is too small for the desired retention. Please first resize the repository to 3896 MB

Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days. The output tells that the repository needs to be at least 3896 MB for 3 days.

 

2. Change the repository size: 

$ /bin/oclumon manage -repos changerepossize 3896 
The Cluster Health Monitor repository was successfully resized.The new retention is 259200 seconds.

 

 

REFERENCES

NOTE:1589394.1 - How to Move/Recreate GI Management Repository to Different Shared Storage (Diskgroup, CFS or NFS etc)







About Me

...............................................................................................................................

● 本文整理自網路

● 本文在itpub(http://blog.itpub.net/26736162)、部落格園(http://www.cnblogs.com/lhrbest)和個人微信公眾號(xiaomaimiaolhr)上有同步更新

● 本文itpub地址:http://blog.itpub.net/26736162/abstract/1/

● 本文部落格園地址:http://www.cnblogs.com/lhrbest

● 本文pdf版及小麥苗雲盤地址:http://blog.itpub.net/26736162/viewspace-1624453/

● 資料庫筆試面試題庫及解答:http://blog.itpub.net/26736162/viewspace-2134706/

● QQ群:230161599     微信群:私聊

● 聯絡我請加QQ好友(646634621),註明新增緣由

● 於 2017-06-02 09:00 ~ 2017-06-30 22:00 在魔都完成

● 文章內容來源於小麥苗的學習筆記,部分整理自網路,若有侵權或不當之處還請諒解

● 版權所有,歡迎分享本文,轉載請保留出處

...............................................................................................................................

拿起手機使用微信客戶端掃描下邊的左邊圖片來關注小麥苗的微信公眾號:xiaomaimiaolhr,掃描右邊的二維碼加入小麥苗的QQ群,學習最實用的資料庫技術。

【MOS】Cluster Health Monitor (CHM) FAQ (文件 ID 1328466.1  ID 2062234.1)
DBA筆試面試講解
歡迎與我聯絡

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/26736162/viewspace-2132364/,如需轉載,請註明出處,否則將追究法律責任。

相關文章