ASM REACTING TO PARTITION ERRORS [ID 1062954.1]

--------------------------------------------------------------------------------

Modified 10-AUG-2011 Type PROBLEM Status PUBLISHED

In this Document
Symptoms
Cause
Solution

--------------------------------------------------------------------------------

Applies to:
Oracle Server - Enterprise Edition - Version: 10.1.0.4 to 11.2.0.1.0 - Release: 10.1 to 11.2
Linux x86
Haansoft Linux x86-64

Symptoms
Disks that belong to ASM disk groups randomly show as PROVISIONED, or at times as CANDIDATE, in v$asm_disk.header_status. Once dismounted, disk groups containing those disks will not mount again.

From ASM alert log:

ERROR: diskgroup was not mounted
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup ""

Or

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "<disk number here>" is missing
ORA-15063: ASM discovered an insufficient number of disks for diskgroup ""

This seems to occur when new LUNs are added or configured on the cluster. The behavior has occurred several times (10+), on more than one cluster and across separate data centers.

At times v$asm_disk.header_status for the disks is MEMBER, yet the disk groups still will not mount when a re-mount is attempted.

While troubleshooting, it was noticed that the OS partition table on the devices used by the ASM disks had been wiped out (it no longer exists). The issue can be reproduced by using dd to wipe the devices, although this does not explain why the disks sometimes show v$asm_disk.header_status=MEMBER and still cannot be mounted.
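
To check whether a device still carries a partition table, one option is to read its first sector and look for the classic MBR layout. The following is a minimal read-only sketch, assuming the devices used MBR (msdos) partitioning rather than GPT; the function name inspect_mbr and any device path passed to it are hypothetical, and reading a real block device normally requires root.

```python
SECTOR_SIZE = 512
MBR_SIGNATURE = b"\x55\xaa"      # boot signature stored at byte offsets 510-511
PARTITION_TABLE_OFFSET = 446     # start of the four 16-byte partition entries
ENTRY_SIZE = 16


def inspect_mbr(path):
    """Return (has_signature, used_entries) for the first sector of `path`.

    `path` may be a block device or an image file. Nothing is written.
    """
    with open(path, "rb") as dev:
        sector = dev.read(SECTOR_SIZE)
    if len(sector) < SECTOR_SIZE:
        return False, 0

    has_signature = sector[510:512] == MBR_SIGNATURE
    used_entries = 0
    for i in range(4):
        start = PARTITION_TABLE_OFFSET + i * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        # Byte 4 of each entry is the partition type; 0 means "unused".
        if entry[4] != 0:
            used_entries += 1
    return has_signature, used_entries
```

A result of (False, 0) is what a zeroed first sector looks like, which matches the dd reproduction described above. It does not by itself say anything about the state of the ASM disk header.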

Cause
It turns out that the inq.Linux command incorrectly writes to the /dev/sd device, which wipes out the partition table. Where the corruption is noticed depends on which mpath device /dev/sd is part of.

This is caused by an EMC bug in older versions (prior to 6.3.0.0-771) of the Linux inq utility/command. The eNav utility calls inq.Linux.

Details from EMC bug:
1) Older versions of this command scanned all devices in /dev, not just SCSI disks, and so included /dev/kmsg.
2) Older versions of this command incorrectly matched /dev/kmsg and /dev/sd, treating them as multiple paths to the same device when they are not.
3) Older versions of this command allocated a 216-byte inquiry buffer. This was apparently sufficient for EMC devices, but was too small for certain other disks. The SCSI layer would return an error if the buffer was undersized.

Together, these three conditions caused the errors to be routed to /dev/sd instead of /dev/kmsg, which wiped out the corresponding partition tables. All three conditions are fixed in versions of the inq command after 6.3.0.0-771.

It is assumed that any installation running the eNav utility with more than 500 SCSI disks attached, enough for /dev/sd to exist, was at risk of similar corruption. The count includes multiple paths to the same disk via multipathing, so with two paths per LUN the threshold is only about 250 LUNs, minus the number of locally attached disks.
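
The arithmetic behind that estimate can be sketched as follows. This is only an illustration of the reasoning above: the function name luns_at_threshold is hypothetical, and the 500-device figure is the approximate value quoted in this note, not an exact limit.

```python
def luns_at_threshold(device_threshold, paths_per_lun, local_disks=0):
    """Estimate how many multipathed LUNs reach a given count of sd devices.

    Each LUN contributes one sd device per path, and locally attached
    disks also consume sd device names.
    """
    if paths_per_lun < 1:
        raise ValueError("paths_per_lun must be at least 1")
    return (device_threshold - local_disks) // paths_per_lun


# Two paths per LUN and no local disks: about 250 LUNs.
print(luns_at_threshold(500, 2))
# Two paths per LUN and two local disks: slightly fewer.
print(luns_at_threshold(500, 2, local_disks=2))
```

With four paths per LUN the same estimate drops to roughly 125 LUNs, so heavily multipathed environments reach the at-risk range sooner.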

Note: Verified/confirmed by the customer and by sources outside Oracle (RedHat, EMC, Maryville). However, no further details, such as the EMC bug number or any other additional information, were provided.

Solution
An immediate workaround is to comment out the call to the inq.Linux command, to prevent the problem from occurring across your environment. To do this, contact either Maryville support (vendor of eNav) or EMC.

Another appropriate solution is to upgrade to a newer version of inq.Linux.
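
When auditing hosts, it can help to compare an installed inq version string against the 6.3.0.0-771 level named in this note. The sketch below assumes the version is already available as a string in the form N.N.N.N-BUILD; how to obtain it from the utility is not covered here, and the function names are hypothetical. It treats only versions strictly older than 6.3.0.0-771 as affected. Because this note says both "prior to" and "after" that version, confirm with EMC whether 6.3.0.0-771 itself contains the fix.

```python
import re

BASELINE = "6.3.0.0-771"  # version level quoted in this note


def parse_inq_version(text):
    """Turn a string such as '6.3.0.0-771' into a tuple of integers."""
    parts = re.split(r"[.-]", text.strip())
    return tuple(int(part) for part in parts)


def is_older_than_baseline(version, baseline=BASELINE):
    """Return True if `version` sorts strictly before `baseline`."""
    return parse_inq_version(version) < parse_inq_version(baseline)


for candidate in ("6.2.0.0-900", "6.3.0.0-770", "6.3.0.0-771", "7.1.0.0-5"):
    print(candidate, is_older_than_baseline(candidate))
```

Comparing integer tuples avoids the usual pitfall of string comparison, where a component such as "10" would sort before "9".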
