AIX: Top Things to DO NOW to Stabilize 11gR2 GI/RAC Cluster (Doc ID 1427855.1)

Posted by xychong123 on 2017-11-08

A. Required OS Technology Level and Service Pack, and Recommended VM Setting

  • The AIX level should be equal to or higher than the following (run "/bin/oslevel -s" to confirm):
AIX 7.1 TL 00 SP1 ("7100-00-01"), 64-bit kernel
AIX 6.1 TL 02 SP1 ("6100-02-01"), 64-bit kernel
AIX 5.3 TL 09 SP1 ("5300-09-01"), 64-bit kernel
  • Recommended Virtual Memory setting:
maxperm%=90
minperm%=3
maxclient%=90
strict_maxperm=0
strict_maxclient=1
lru_file_repage=0
page_steal_method=1     ### (change requires a reboot to take effect)
  • vpm_xvcpus should never be set to 0; set it to 2 or higher so that at least 2 processors remain unfolded.
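
The minimum levels and VM settings above can be checked with a small script. The following is a minimal sketch, not an Oracle or IBM tool: the function names are hypothetical, the values are copied from this section, and it assumes that vmo prints tunables as "name = value" lines and that the -F flag is needed on AIX 6.1 and later to display restricted tunables.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; values come from section A.

# Minimum "oslevel -s" prefix for each release listed in section A.
min_level_for() {
    case "$1" in
        7100-*) echo "7100-00-01" ;;
        6100-*) echo "6100-02-01" ;;
        5300-*) echo "5300-09-01" ;;
        *) return 1 ;;
    esac
}

# Turn "7100-00-01-1037" into the comparable number 71000001 (release, TL, SP).
level_number() {
    printf '%s\n' "$1" | awk -F- '{ printf "%d%02d%02d\n", $1, $2, $3 }'
}

# Succeed when the given level is at or above the minimum for its release.
# Returns 2 for a release that section A does not list.
check_oslevel() {
    min=$(min_level_for "$1") || return 2
    [ "$(level_number "$1")" -ge "$(level_number "$min")" ]
}

# Read "name = value" lines on stdin and report every recommended
# tunable that is missing or differs; exit status is non-zero on any finding.
check_vm_tunables() {
    awk '
        BEGIN {
            bad = 0
            want["maxperm%"] = 90
            want["minperm%"] = 3
            want["maxclient%"] = 90
            want["strict_maxperm"] = 0
            want["strict_maxclient"] = 1
            want["lru_file_repage"] = 0
            want["page_steal_method"] = 1
        }
        $2 == "=" && ($1 in want) {
            seen[$1] = 1
            if ($3 != want[$1]) {
                printf "%s: expected %s, found %s\n", $1, want[$1], $3
                bad = 1
            }
        }
        END {
            for (n in want)
                if (!(n in seen)) { printf "%s: not reported\n", n; bad = 1 }
            exit bad
        }'
}

# Usage on an AIX node, as root (assumed invocation):
#   check_oslevel "$(/bin/oslevel -s)" || echo "OS level below minimum or unknown"
#   vmo -F -a | check_vm_tunables
```

Keeping the comparison logic separate from the oslevel and vmo calls means it can be exercised with saved command output from any node.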

 

 

B. USLA heap fix to reduce memory footprint for Oracle Server processes

 

  • For AIX 6.1 TL07 SP02/AIX 7.1 TL01 SP02 or later, apply 
  • For AIX 6.1 TL07 or AIX 7.1 TL01, install AIX 6.1 TL-07 APAR IV09580, AIX 7.1 TL-01 APAR IV09541, and apply 
  • For other AIX level, apply , this will disable Oracle's online patching mechanism

 

  • Note: as of 06/21/2012, fix for  or  are not included in any PSU and the interim patch is needed. Interim  exists on top of most PSU, and  on top of 11.2.0.3 does not conflict with 11.2.0.3.1 PSU and can be applied on top of both 11.2.0.3 base and 11.2.0.3.1 PSU.

 

  • New connection can be slow to establish without fix for  which is fixed in 11.2.0.4
@ if online patch exists, process startup/new connection will be slower: , duplicate  

C. Other recommended OS fixes

  • Note 1528452.1 - AIX 6.1 TL8 or 7.1 TL2: 11gR2 GI Second Node Fails to Join the Cluster as CRSD and EVMD are in INTERMEDIATE State
     
  • Paging space growth leads to node failure/eviction:
64K paging can take place even when free system RAM is available; the fix avoids unexpected paging space growth and node failure. Below is a matrix of APARs for the various TL/SP levels:

6100 TL5         6100-05            IZ71603
6100 TL4 SP4     6100-04-04-1014    IZ71191
6100 TL3 SP4     6100-03-04-1014    IZ72031
6100 TL2 SP7     6100-02-07-1014    IZ71850
6100 TL1 SP8     6100-01-08-1014    IZ71987
5300 TL12        5300-12            IZ71460
5300 TL11 SP4    5300-11-04-1015    IZ73687
5300 TL10 SP4    5300-10-04-1015    IZ73754
5300 TL9 SP7     5300-09-07-1015    IZ73864
5300 TL8 SP10    5300-08-10-1015    IZ67445

For more info, refer to
  • gc block lost or IPC send timeout or instance eviction
The VIOS server does not forward traffic from its VIO clients to the external network because interrupts do not reach the trunk adapter; the fix avoids the SEA/VIO client hang. Below is a matrix of APARs for the various TL/SP levels:
 
7100 TL0 SP3     7100-00-03-1115    IZ97035
6100 TL6 SP5     6100-06-05-1115    IZ96155
6100 TL5 SP6     6100-05-06-1119    IZ97457
6100 TL4 SP10    6100-04-10-1119    IZ97605
5300 TL12 SP4    5300-12-04-1119    IZ98126
5300 TL11 SP7    5300-11-07-1119    IZ98424

For more info, refer to
  • Other kernel hang fix
* IZ91983 lockl performance issue, hang

For more info, refer to


* IV04047: shlap64 unable to process Oracle request leading to kernel hang

For more info, refer to

 

  • Excessive CPU usage in LPAR in shared processor mode

If the LPAR is in shared processor mode and the following fix is not applied, the LPAR may see excessive CPU usage:

APARs for WAITPROC IDLE LOOPING CONSUMES CPU:

IV01111 AIX 6.1 TL05 if before SP08 (fixed in SP08)
IV06197 AIX 6.1 TL06 if before SP07 (fixed in SP07)
IV10172 AIX 6.1 TL07 if before SP02 (fixed in SP02)
IV09133 AIX 7.1 TL00 if before SP05 (fixed in SP05)
IV10484 AIX 7.1 TL01 if before SP02 (fixed in SP02)

This problem can affect POWER7 systems running any level of Ax720 firmware prior to Ax720_101, but updating to the latest available firmware is recommended. If required, AIX and firmware fixes can be obtained from IBM Support Fix Central:

 

  • Crash in netinfo_unixdomnlist while running netstat

6100 TL6 SP6     6100-06-06-1140    IZ97166
6100 TL5 SP7     6100-05-07-1140    IZ97353
6100 TL4 SP11    6100-04-11-1140    IV00634

For more info, refer to 

  • Note 2237498.1 - ALERT: Database Corruption ORA-600 ORA-7445 errors after applying AIX SP patches - AIX 6.1.9.8 or AIX 7.1.3.8 or AIX 7.1.4.3 or AIX 7.2.0.3 or AIX 7.2.1.0, 01
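
The APAR matrices in this section can be turned into a lookup keyed on the technology level reported by "oslevel -s". The following is a minimal sketch: the function names and the issue labels (paging, sea, waitproc, netstat) are hypothetical, and the rows are transcribed from the tables above. It assumes "instfix -ik" is used afterwards to query whether each APAR is installed; a node already at or beyond the listed service pack would normally include the fix.

```shell
#!/bin/sh
# Sketch only: function names and issue labels are hypothetical.

# APAR matrix transcribed from section C: issue label, release-TL prefix, APAR.
apar_matrix() {
    cat <<'EOF'
paging 6100-05 IZ71603
paging 6100-04 IZ71191
paging 6100-03 IZ72031
paging 6100-02 IZ71850
paging 6100-01 IZ71987
paging 5300-12 IZ71460
paging 5300-11 IZ73687
paging 5300-10 IZ73754
paging 5300-09 IZ73864
paging 5300-08 IZ67445
sea 7100-00 IZ97035
sea 6100-06 IZ96155
sea 6100-05 IZ97457
sea 6100-04 IZ97605
sea 5300-12 IZ98126
sea 5300-11 IZ98424
waitproc 6100-05 IV01111
waitproc 6100-06 IV06197
waitproc 6100-07 IV10172
waitproc 7100-00 IV09133
waitproc 7100-01 IV10484
netstat 6100-06 IZ97166
netstat 6100-05 IZ97353
netstat 6100-04 IV00634
EOF
}

# Print "issue APAR" for every matrix row matching the release and TL
# of an "oslevel -s" string such as 6100-04-04-1014.
apars_for_level() {
    tl=$(printf '%s\n' "$1" | cut -d- -f1-2)
    apar_matrix | awk -v tl="$tl" '$2 == tl { print $1, $3 }'
}

# Usage on an AIX node (assumed invocation):
#   apars_for_level "$(/bin/oslevel -s)" | while read issue apar; do
#       echo "== $issue"; instfix -ik "$apar"
#   done
```

IZ91983 and IV04047 are left out because the text above does not tie them to a specific TL.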

 

D. Apply the latest GI PSU to avoid known high resource consumption bugs

If you are running 11.2.0.3, apply 11.2.0.3 GI PSU8 ()

For 11.2.0.3, applying the above PSU fixes the following known bugs (note: it does not fix the bugs in Section D1):

  • Note 1062676.1 - ORAAGENT or ORAROOTAGENT High Resource (CPU, Memory etc) Usage
Except , all others have been fixed in 11.2.0.2 

 is fixed in 11.2.0.2 GI PSU6, 11.2.0.3 GI PSU2, 11.2.0.4 and 12.1, interim  exists on top of 11.2.0.3.1 GI PSU

 

  • Note 1287709.1 - ocssd.bin High CPU Usage, Instance Crashes With ORA-29702 or ORA-29770 or ORA-29701 With "gipcWait failed with 16"
This note talks about  which is fixed in 11.2.0.2 GI PSU3, 11.2.0.3

 

This note talks about the following bugs:
, fixed in 11.2.0.2 GI PSU2, 11.2.0.3 and above
, fixed in 11.2.0.2 GI PSU4, 11.2.0.3 and above

 

  • Note 1348202.1 - 11gR2 Grid Infrastructure CRSD High CPU Usage or Slow Command Response
This note talks about the following bugs:
 is fixed in 11.2.0.2 GI PSU3, 11.2.0.3 and above
 is fixed in 11.2.0.2 GI PSU4, 11.2.0.3 and above
 is fixed in 11.2.0.2 GI PSU4, 11.2.0.3 and above
  • Note 1455973.1 - 11gR2 Grid Infrastructure High CPU Usage by crsd.bin, ocssd.bin, evmd.bin gipcd.bin etc due to GIPC
 is fixed in 11.2.0.4, please request interim  if it's not available
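
Several entries in this section have the form "fixed in 11.2.0.2 GI PSU3, 11.2.0.3 and above". One way to apply such a statement is to compare five-part version strings, treating the fifth field as the PSU level (as in 11.2.0.3.1 for PSU1). The following is a minimal sketch with a hypothetical function name; it assumes the caller already knows the installed five-part version, for example from the OPatch inventory, and does not show how to extract it.

```shell
#!/bin/sh
# Sketch only: version_ge is a hypothetical helper.

# Succeed when dotted version $1 is greater than or equal to dotted version $2.
# Missing trailing fields are treated as 0, so 11.2.0.3 equals 11.2.0.3.0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "[.]")
        nb = split(b, y, "[.]")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Example: a fix listed as "11.2.0.2 GI PSU3, 11.2.0.3 and above"
# is covered when the installed version is at least 11.2.0.2.3:
#   version_ge "11.2.0.3.8" "11.2.0.2.3" && echo "fix included"
```

The comparison is numeric per field, so 11.2.0.2.10 correctly sorts above 11.2.0.2.9.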

 

E. ASM/Database fixes

  •  - diag0 high memory usage, fixed in 11.2.0.4, interim  exists on top of certain patchset/PSU
Refer to Note 1376981.1 for more information
  •  - high "log file sync" or "asynch descriptor resize" wait , fixed in 11.2.0.4, interim  exists on top of most patchset/PSU
  •  - higher CPU usage in 11.2 on AIX , fixed in 11.2.0.4. Interim  on top of 11.2.0.3 does not conflict with 11.2.0.3.1 PSU and can be applied on top of both 11.2.0.3 base and 11.2.0.3.1 PSU
  •  - instance hangs; fixed in 11.2.0.2 DB PSU4, 11.2.0.3
Refer to Note 1348264.1 for more information.
Please check the interim patch available status for your release.

  

F. CSSD fix to avoid node eviction/reboot related issues

  •  - Threads do not always inherit the parent process's real-time priority
  •  - 11.2.0.3 GI node reboot if only one voting file exists
    Refer to Note 1466639.1 for more information

  • Unpublished Bug 17733927 - CSS CLIENTS TIMEOUT UNDER HEAVY CONNECTIVITY LOADS ON AIX
    Refer to Note 1953101.1 AIX: High CPU utilization for CSSD

G. EM agent high memory consumption on AIX (likely node will be rebooted)

  • Note 1530102.1 - EM 12c: Agent emdprocstats.pl Consuming High Memory

 

Appendix A: Data gathering


If the issue still happens after the above recommendations are in place, collect the output of the following commands from all nodes as the root user:

# svmon -P -O unit=MB -O segment=category
# svmon -U -O unit=MB -O segment=category 
# ps -elf
# vmstat 5 3

Also collect all other files requested in the following note:

Note 289690.1 - Data Gathering for Troubleshooting Oracle Clusterware (CRS or GI) and RAC Issues
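
The four commands listed in this appendix can be wrapped in a small script so that each node produces the same set of output files. The following is a minimal sketch: collect and gather_all are hypothetical helper names and the output file names are arbitrary. It is meant to be run as root on every node.

```shell
#!/bin/sh
# Sketch only: collect and gather_all are hypothetical helper names.

# Run one command and save its stdout and stderr to <outdir>/<name>.out.
# If the command is not installed, record that instead of failing.
collect() {
    outdir=$1
    name=$2
    shift 2
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$outdir/$name.out" 2>&1
    else
        echo "skipped: $1 not found" > "$outdir/$name.out"
    fi
}

# Run the four commands from Appendix A and store each result in its own file.
gather_all() {
    dir=$1
    mkdir -p "$dir" || return 1
    collect "$dir" svmon_P svmon -P -O unit=MB -O segment=category
    collect "$dir" svmon_U svmon -U -O unit=MB -O segment=category
    collect "$dir" ps_elf ps -elf
    collect "$dir" vmstat vmstat 5 3
}

# Usage (as root, on each node):
#   gather_all "/tmp/gi_diag_$(hostname)"
```

Writing one file per command keeps the outputs easy to compare across nodes before they are uploaded with the other requested files.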

Appendix B: Reference


Note 282036.1 - Minimum Software Versions and Patches Required to Support Oracle Products on IBM Power Systems

Note 811293.1 - RAC and Oracle Clusterware Best Practices and Starter Kit (AIX)

Note 169706.1 - Oracle Database on AIX,HP-UX,Linux,Solaris.. Installation and Configuration Requirements Quick Reference



Note 1379753.1 - AIX: Link/Relink/Make Fails With: ld: 0711-780 SEVERE ERROR: Symbol .ksmpfpva (entry 58964) in object libserver11.a[ksmp.o]
    
Note 1470654.1 - Understanding Processor Utilization with IBM PowerVM

Note 1530943.1 - AIX: VIP and SCAN VIP fails to failover to other node after pulled cable on public network if LHEA is being used
  

From "ITPUB Blog", link: http://blog.itpub.net/20747382/viewspace-2146977/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
