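After a network change, an NFS file system on one of our database hosts became inaccessible (this NFS file system has nothing to do with Oracle). As a result the database itself became unreachable, and even sqlplus could not log in.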

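In RAC cluster databases, 10gR2 deployments sometimes use NFS as a shared file system for archived logs. When the operating system on one node hangs, the shared NFS mount causes the other node to run into the same problem, and the whole cluster ends up affected.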
wps1:/oracle/app/oracle/admin/mss/bdump$sqlplus '/as sysdba'
SQL*Plus: Release 10.2.0.4.0 - Production on Sat Nov 19 10:03:49 2011
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
NFS server 130.34.3.102 not responding still trying

To investigate, trace the sqlplus startup with truss:
wps1:/oracle$truss -D sqlplus '/as sysdba'
0.0000:        execve("/oracle/niyl/bin/sqlplus", 0x2FF22964, 0x2FF22970) argc: 4
0.0226:        execve("/oracle/app/oracle/product/10.2.0/db/bin/sqlplus", 0x2FF22964, 0x2FF22970) argc: 2
0.0611:        kusla(2, 0x09FFFFFFF000C490)     = -1
0.0041:        thread_init(0x0900000000739020, 0x09001000A0860350) =
0.0004:        sbrk(0x0000000000000000)         = 0x00000001100ED448
0.0002:        vmgetinfo(0x0FFFFFFFFFFFF570, 7, 16) = 0
0.0003:        sbrk(0x0000000000000000)         = 0x00000001100ED448
......
0.0002:        kioctl(7, 22528, 0x0000000000000000, 0x0000000000000000) = -1
kread(7, "\0 DA '\014\0\001\002\0".., 4096)    = 164
0.0002:        close(7)                         = 0
0.0002:        __libc_sbrk(0x0000000000010020) = 0x000000011041D9A0
0.0002:        kioctl(1, 22528, 0x0000000000000000, 0x0000000000000000) = 0

kwrite(1, "\n", 1) = 1
SQL*Plus: Release 10.2.0.4.0 - Production on Sat Nov 19 10:59:07 2011
kwrite(1, " S Q L * P l u s : R e".., 70) = 70

kwrite(1, "\n", 1) = 1
Copyright (c) 1982, 2007, Oracle. All Rights Reserved.
kwrite(1, " C o p y r i g h t ( c".., 56) = 56

kwrite(1, "\n", 1)                              = 1
0.0002:        kfcntl(1, F_GETFL, 0x0000000000000008) = 2
0.0002:        lseek(4, 512, 0)                 = 512
kread(4, "17A5\0\0\0\0\0\0\0\0\0\0".., 512)     = 512
0.0002:        lseek(4, 1024, 0)                = 1024
kread(4, "\016\0 *\0 R\0 h\081\09E".., 512)     = 512
0.0002:        lseek(4, 4608, 0)                = 4608
kread(4, "\00F\0A0\0\0\0 b\0A1\0\0".., 512)     = 512
0.0003:        __libc_sbrk(0x0000000000010020) = 0x000000011042D9C0
0.0003:        statx(".", 0x0FFFFFFFFFFF7170, 176, 010) = 0
0.0002:        kopen(".", O_RDONLY)             = 7
0.0002:        getdirent64(7, 0x0000000110434810, 4096) = 680
0.0002:        klseek(7, 0, 0, 0x0FFFFFFFFFFF7070) = 0
0.0002:        kfcntl(7, F_GETFD, 0x00000001100F34B8) = 0
0.0002:        kfcntl(7, F_SETFD, 0x0000000000000001) = 0
0.0002:        close(7)                         = 0
0.0002:        statx("/", 0x0FFFFFFFFFFF7390, 176, 020) = 0
0.0002:        statx("./", 0x0FFFFFFFFFFF7390, 176, 020) = 0
0.0002:        statx("./../", 0x0FFFFFFFFFFF7170, 176, 010) = 0
0.0002:        kopen("./../", O_RDONLY)         = 7
0.0002:        getdirent64(7, 0x0000000110434810, 4096) = 1776
0.0002:        klseek(7, 0, 0, 0x0FFFFFFFFFFF7070) = 0
0.0002:        kfcntl(7, F_GETFD, 0x00000001100F34B8) = 0
0.0002:        kfcntl(7, F_SETFD, 0x0000000000000001) = 0
0.0002:        fstatx(7, 0x0FFFFFFFFFFF7390, 176, 020) = 0
0.0002:        getdirent64(7, 0x0000000110434810, 4096) = 1776
0.0002:        statx("./../.", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../..", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.TTauthority", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.Xauthority", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.bash_history", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.dt", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.dtprofile", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.java", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.mozilla", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.profile", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.rhosts", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.sh_history", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.topasrecrc", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.vi_history", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.vnc", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../.wmrc", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../IBM", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../TT_DB", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../admin", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../aixfix", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../audit", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../bin", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../bpmdata", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../dbscripts", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../dev", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../esa", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../etc", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../filedata", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0512:        statx("./../ha_script", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../home", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../ihsGSKitUpgradeLog.txt", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../isoxlc", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0003:        statx("./../lib", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../lost+found", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../lpp", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../ls.sh", 0x0FFFFFFFFFFF7440, 176, 021) = 0
0.0002:        statx("./../mnt", 0x0FFFFFFFFFFF7440, 176, 021) = 0
2.0001:        statx("./../nbu_nfs_102", 0x0FFFFFFFFFFF7440, 176, 021) (sleeping...)
NFS server 130.34.3.102 not responding still trying

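When an Oracle process is spawned, its behavior is affected by the operating system, and different operating systems behave differently. This problem occurs on AIX; it does not appear on other platforms.
According to document 1316251.1: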
Oracle code calls a Unix system call, 'getcwd' to get the current working directory. Then, after that, all the control reverts over to the operating system. From what we can see, the function 'getcwd' calls 'getwd' which in turn calls 'stat'. Once 'stat' is entered it starts processing directory entries in the order shown below by performing a 'statx' call for each entry.
Once the root directory is reached then 'lstat' calls 'statx' for each entry in the directory. Oracle has no control over this processing and there is nothing we can do to prevent it (it is all at the OS level at this point).
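
The walk described in that quote can be modelled in a few lines of Python. This is only a rough sketch of a classic getcwd-style algorithm, not AIX's actual libc code; the function name legacy_getcwd and its return shape are made up for illustration. It shows why every entry in a parent directory may be stat'ed while looking for the one that matches the current directory.

```python
import os


def legacy_getcwd(start="."):
    """Rebuild the absolute path of `start` by climbing towards the root.

    Returns (path, probed); `probed` lists every directory entry that had
    to be stat'ed along the way.
    """
    names = []    # path components, collected from the deepest one upwards
    probed = []   # every entry examined while searching parent directories
    current = start
    while True:
        cur = os.stat(current)
        parent = os.path.join(current, "..")
        par = os.stat(parent)
        if (cur.st_dev, cur.st_ino) == (par.st_dev, par.st_ino):
            break  # "." and ".." are the same directory: this is the root
        for name in os.listdir(parent):
            candidate = os.path.join(parent, name)
            probed.append(candidate)
            try:
                # A call like this is where the process would sleep if
                # `candidate` were an unreachable hard-mounted NFS mount point.
                st = os.lstat(candidate)
            except OSError:
                continue
            if (st.st_dev, st.st_ino) == (cur.st_dev, cur.st_ino):
                names.append(name)
                break
        else:
            raise FileNotFoundError("no entry in %r matches %r" % (parent, current))
        current = parent
    return "/" + "/".join(reversed(names)), probed


if __name__ == "__main__":
    path, probed = legacy_getcwd(".")
    print(path)
    print("entries examined:", len(probed))
```

Because the sketch uses lstat, a symbolic link sitting in the parent directory is examined without being followed, which is consistent with the symlink workaround described later in this post.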

What statx does can be seen from its definition in /usr/include/sys/stat.h: at this step it retrieves information about the corresponding file or directory, including:
device id
file serial number
user id
group id
Time of last access
Time of last data modification
Time of last file status change
Type of fs

......
and so on.

In the hung process shown here, statx hangs because the NFS file system is unreachable, which is why the connection cannot be established.
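
To check for this condition without hanging the checking shell itself, the stat call can be pushed into a child process and given a deadline. The following is a minimal sketch under stated assumptions: the function names probe and probe_root_entries are hypothetical, the 5 second timeout is arbitrary, and a child that is blocked inside the kernel on a hard mount may not go away even after kill().

```python
import os
import subprocess
import sys
import time


def probe(path, timeout=5.0):
    """stat() `path` in a child process; return "ok", "error" or "timeout"."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(0.05)
    # Deliberately no wait() here: waiting on a child stuck in the kernel
    # would make this process hang as well.
    child.kill()
    return "timeout"


def probe_root_entries(timeout=5.0):
    """Probe every entry directly under / and return {path: status}."""
    return {
        os.path.join("/", name): probe(os.path.join("/", name), timeout)
        for name in sorted(os.listdir("/"))
    }


if __name__ == "__main__":
    for path, status in probe_root_entries().items():
        if status != "ok":
            print(path, status)
```

Any entry reported as "timeout" is a candidate for the kind of dead mount point that blocked sqlplus in the trace above.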

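Solution:
Fix the network problem so that the NFS file system becomes accessible again.
Or
Reboot the host and leave the NFS file system unmounted for the time being.

To keep this from happening again, follow the approach given in document 1316251.1 and do not mount NFS directly under the root directory /: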
mkdir /nfs
mkdir /nfs/nbu_nfs_102_mount
mount /nfs/nbu_nfs_102_mount
ln -s /nfs/nbu_nfs_102_mount /nbu_nfs_102
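
On a host with many mounts it can help to list which ones still sit directly under /. Here is a small sketch, assuming the list of NFS mount points is supplied by hand (how to obtain it is platform-specific and not covered here); the function names and the /nfs holder directory are illustrative choices, not part of either Oracle document.

```python
import posixpath


def mounts_directly_under_root(mount_points):
    """Return the mount points whose parent directory is / itself."""
    risky = []
    for mount_point in mount_points:
        norm = posixpath.normpath(mount_point)
        if norm != "/" and posixpath.dirname(norm) == "/":
            risky.append(norm)
    return risky


def suggested_layout(mount_point, holder="/nfs"):
    """Return (new mount point one level down, symlink path to keep)."""
    norm = posixpath.normpath(mount_point)
    return posixpath.join(holder, posixpath.basename(norm)), norm


if __name__ == "__main__":
    # Hypothetical input: replace with the NFS mount points of the real host.
    for mp in mounts_directly_under_root(["/nbu_nfs_102", "/nfs/backup"]):
        new_mount, link = suggested_layout(mp)
        print("move", mp, "to", new_mount, "and link", link, "->", new_mount)
```

The sketch only prints a proposal; the actual unmount, mount and ln -s steps still have to be carried out by an administrator.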

See the documents themselves for details:
When NFS Server Is Down, Oracle Server Freezes With No Errors In Alert Log File (Doc ID 1316251.1)
Disconnected NFS Mount Point Causes Instance to Hang on AIX (Doc ID 1445600.1)
For readers without an Oracle Support account, the contents of both documents are pasted below.

When NFS Server Is Down, Oracle Server Freezes With No Errors In Alert Log File (Doc ID 1316251.1)

In this Document
Symptoms
Changes
Cause
Solution

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.4 and later
IBM AIX on POWER Systems (64-bit)

SYMPTOMS

Each of the Oracle instances on AIX has a NFS mount point for backup purposes. It is mounted with following options:

bg,hard,intr,rsize=32768,wsize=32768,sec=sys,noac,rw

When the NFS server is down, the Oracle RDBMS freezes with no errors in alert log file. When the NFS server is up again, the database is working without problems.

CHANGES

No changes on the environment, just lost NAS connectivity (to NFS server), so the remote directories are not available.

CAUSE

From the uploaded truss output of sqlplus and df command, we can see the statx command is hanging at /backup , i.e. the NFS mounted drive:

462940: statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) (sleeping...)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000360, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000320, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000310, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000310, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000320, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ? ? J ?\0\0\0\0\0\0\010".., 64) (sleeping...)
462940: statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../usr", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../lib", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../audit", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../dev", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../etc", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../u", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../lpp", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../mnt", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../proc", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../sbin", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../bin", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../oracle", 0x0FFFFFFFFFFF5980, 176, 021) = 0

The problem comes from:

statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) (sleeping...)

Oracle code calls a Unix system call, 'getcwd' to get the current working directory. Then, after that, all the control reverts over to the operating system. From what we can see, the function 'getcwd' calls 'getwd' which in turn calls 'stat'. Once 'stat' is entered it starts processing directory entries in the order shown below by performing a 'statx' call for each entry:

./
./..
./../..
./../../.. (this goes on until the root directory is reached)

Once the root directory is reached then 'lstat' calls 'statx' for each entry in the directory. Oracle has no control over this processing and there is nothing we can do to prevent it (it is all at the OS level at this point).

SOLUTION

From one similar issue, IBM has suggested the following action plan to avoid this issue. The answer from IBM is:

Here's a solution to avoid the problem described by Oracle:
DO NOT have the NFS mounts directly under /, but put them one level lower. Then, we can use symbolic links to them.

NFS mount point on node /nfs/backup (/nfs is a directory we'll create, it can have any name) and create a softlink /backup -> /nfs/backup.

$ ln -s /nfs/backup /backup

This will avoid the statx problem without having to make changes in the setup (because /backup is still there).

Additionally you can ask IBM about APAR # IZ85027, IZ85029, IZ85032, IZ86102, IZ87374, IZ90533.

Check with IBM which one applies to your configuration.

Disconnected NFS Mount Point Causes Instance to Hang on AIX (Doc ID 1445600.1)

In this Document
Symptoms
Changes
Cause
Solution
References

APPLIES TO:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 and later [Release: 10.2 and later ]
IBM AIX on POWER Systems (64-bit)
IBM AIX on POWER Systems (32-bit)

SYMPTOMS

An NFS-mounted file system was unavailable, causing the database instance to hang until the mount point was restored. Clients cannot log on to the database instance. However, this file system is not used within the database.

CHANGES

The remote file system has become unavailable.

CAUSE

This is an issue with the way in which the system call getcwd is implemented within AIX.

SOLUTION

As long as the NFS mount point has at least one other parent directory besides the root directory, this problem will not occur, regardless of whether the remote file system is reachable or not.

For example, suppose that currently, the NFS mount point is called /faraway_files. The fix would be to rename the mount point to something like /my_mounts/faraway_files:

# unmount /faraway_files
# mkdir /my_mounts

# mv /faraway_files /my_mounts

# mount remhost01:/documents /my_mounts/faraway_files

Be sure to make a similar configuration change within smit, so that it will survive a reboot.
