A severe problem caused by enabling asynchronous I/O on VxFS

Posted by dbhelper on 2014-11-26
During today's data migration we hit a severe problem: the data load hung completely, and in the end we had no choice but to roll back.

The system runs on the VxFS filesystem. About a month before the production upgrade we had done a small-scale data migration; reviewing AWR and ASH at the time, ADDM's recommendation suggested the slow load speed was mainly an asynchronous I/O issue, and indeed the production database did not have async I/O enabled (filesystemio_options was set to none). After confirmation, the customer made this configuration change during a routine maintenance window about two weeks ago. After the change we noticed that iowait increased noticeably, but since there was no obvious performance problem we did not analyze it further; we assumed the higher iowait simply reflected more efficient I/O handling with async I/O enabled, and even took it as an improvement.
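For reference, the parameter can be checked and, if needed, reverted from a quick SQL*Plus session; a minimal sketch (run as sysdba; filesystemio_options is a static parameter, so a change only takes effect after an instance restart):

sqlplus -S / as sysdba <<'EOF'
-- current value: none / directio / asynch / setall
show parameter filesystemio_options
-- example of reverting to the previous value; requires a restart to take effect
alter system set filesystemio_options=none scope=spfile;
EOF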
Two days before the upgrade, as part of the pre-migration checks, I did a simple I/O analysis, which included a small dd test.
The I/O throughput turned out to be very poor, a big gap compared with the test environment.

time dd if=/dev/zero bs=1M count=204 of=direct_200M

213909504 bytes (214 MB) copied, 5.27558 seconds, 40.5 MB/s

real    0m5.284s
user    0m0.003s
sys     0m0.031s

Here is the sar record taken at the time.
07:00:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
09:20:01 AM       all     10.48      0.11      1.76      2.89      0.00     84.76
09:30:01 AM       all     10.59      0.10      1.81      2.45      0.00     85.04
09:40:01 AM       all      7.91      0.18      1.61      3.20      0.00     87.10
09:50:01 AM       all      7.26      0.07      1.66      3.23      0.00     87.78
10:00:01 AM       all      7.54      0.13      1.53      3.67      0.00     87.13
10:10:01 AM       all      7.78      0.09      1.76      3.92      0.00     86.45
10:20:01 AM       all      8.24      0.09      2.27      3.98      0.00     85.43
10:30:01 AM       all      7.38      0.08      1.79      5.18      0.00     85.57
10:40:01 AM       all      8.14      0.16      2.01      6.36      0.00     83.33
10:50:02 AM       all      7.05      0.10      1.74      4.83      0.00     86.29
11:00:01 AM       all      7.61      0.09      2.04      5.43      0.00     84.83
11:10:01 AM       all      7.22      0.09      1.70      6.22      0.00     84.76
11:20:01 AM       all      6.71      0.12      2.10      7.35      0.00     83.72
11:30:01 AM       all      9.36      0.10      2.87      5.03      0.00     82.63
11:40:01 AM       all      7.26      0.25      1.76      6.08      0.00     84.65
11:50:01 AM       all      7.17      0.12      2.40      5.24      0.00     85.07
12:00:01 PM       all      6.30      0.10      2.64      5.27      0.00     85.69
Average:          all     10.36      0.26      1.14      3.40      0.00     84.83
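For reference, these figures can be replayed from the sysstat archive or sampled live; a small sketch (assuming the default /var/log/sa location):

# replay today's CPU utilization from the daily sysstat archive
sar -u -f /var/log/sa/sa$(date +%d)
# or sample live: one reading every 10 seconds, 6 samples
sar -u 10 6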
 
Figures from a month earlier:
Production statistics 20-June-14:
 
204+0 records in
204+0 records out
213909504 bytes (214 MB) copied, 1.44182 seconds, 148 MB/s
real    0m1.445s
user    0m0.001s
sys     0m0.039s
 

Test environment
TEST machine statistics:

204+0 records in
204+0 records out
213909504 bytes (214 MB) copied, 0.550607 seconds, 388 MB/s

real    0m0.595s
user    0m0.001s
sys     0m0.072s
 

Another data migration server
TEST2 machine statistics:

213909504 bytes (214 MB) copied, 0.320128 seconds, 668 MB/s

real    0m0.43s
user    0m0.01s
sys     0m0.42s


In the end this problem was still unresolved before the upgrade, and during the data migration we ultimately had to roll back.

While running the data merge we forced parallel execution, but top showed pitifully low CPU usage.
A quick dd test now dropped to as low as roughly 15 MB/s.
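To separate page-cache effects from the raw device speed, the buffered run can be compared with a direct one; a minimal sketch (the target directory under the VxFS mount is only an example path):

# buffered write, same as the earlier tests
time dd if=/dev/zero bs=1M count=204 of=/path/to/vxfs_mount/buffered_200M
# direct write: bypass the page cache so the figure reflects the I/O path itself
time dd if=/dev/zero bs=1M count=204 oflag=direct of=/path/to/vxfs_mount/direct_200M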

Here is the top output from that point.
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31130 root 10 -5 0 0 0 S 20.5 0.0 1352:56 [vxfs_thread]
17285 oraccbs1 16 0 18.3g 114m 35m S 12.9 0.0 4:56.41 ora_p026_PRODB
18568 oraccbs1 16 0 18.3g 50m 22m D 7.3 0.0 2:40.81 ora_p056_PRODB
18580 oraccbs1 16 0 18.2g 42m 21m D 4.6 0.0 2:24.26 ora_p062_PRODB
 7846 oraccbs1 16 0 18.5g 315m 47m S 4.0 0.1 0:12.23 oraclePRODB  (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
18576 oraccbs1 16 0 18.2g 42m 21m D 3.6 0.0 2:25.63 ora_p060_PRODB
11334 tivoliam 15 0 820m 89m 14m S 3.3 0.0 341:18.97 /opt/app/IBM/ITM/lx8266/lz/bin/klzagent
18570 oraccbs1 16 0 18.2g 42m 21m D 3.3 0.0 2:25.69 ora_p057_PRODB
18578 oraccbs1 16 0 18.2g 42m 21m D 3.0 0.0 2:23.12 ora_p061_PRODB

A little later it was clear that the parallel slaves were struggling to ramp up; only after a while did a few of the related processes show up.
  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31130 root 10 -5 0 0 0 S 5.6 0.0 1371:05 [vxfs_thread]
11334 tivoliam 15 0 820m 89m 14m S 3.3 0.0 342:47.68 /opt/app/IBM/ITM/lx826
 1388 root 18 0 59116 784 568 S 1.6 0.0 0:13.80 sadc 5 1001 -z
 4661 oraccbs1 15 0 18.2g 24m 4272 S 1.3 0.0 23:40.35 ora_dbw2_PRODB
27545 oraccbs1 16 0 13428 1844 816 R 1.0 0.0 0:02.35 top -c 
 2833 root 16 0 127m 71m 3520 S 0.7 0.0 81:28.54 vxconfigd -x syslog
 4653 oraccbs1 18 0 18.3g 130m 14m S 0.7 0.0 221:18.30 ora_dia0_PRODB
 4663 oraccbs1 15 0 18.2g 24m 3464 S 0.7 0.0 23:48.27 ora_dbw3_PRODB 
 2598 root 15 0 0 0 0 S 0.3 0.0 5:03.01 [dmp_daemon]
 4878 oraccbs1 15 0 18.2g 7140 4396 S 0.3 0.0 67:17.54 ora_mmnl_PRODB 
 5016 root 10 -5 0 0 0 S 0.3 0.0 0:14.14 [kjournald]
 7334 root 18 0 280m 21m 7824 S 0.3 0.0 26:27.83 /opt/VRTSob/bin/vxsvc
19215 root 15 0 37872 9464 2264 S 0.3 0.0 69:47.85 /opt/VRTSvcs/bin/Mount
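Besides top, the slave allocation can also be confirmed from inside the database; a generic sketch via SQL*Plus, not tied to this particular MERGE:

sqlplus -S / as sysdba <<'EOF'
-- requested vs. granted degree for each active parallel coordinator session
select qcsid, count(*) as slaves, max(req_degree) as req_degree, max(degree) as degree
  from v$px_session
 group by qcsid;
EOF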

In the end, a colleague on the company's unix team judged this to be a VxFS bug that requires a patch. The work still has to get done, so we will see how things go tonight. His notes are pasted below.

The first two are major

VXFS version

 

We had IO performance issues with the very same version of VXFS installed in TRUE 6.0.100.000

Eventually we found we were hitting the following bug which is fixed with version 6.0.3 

This happened at that site, even though it was a fresh install and NOT an upgrade, as indicated below.

We did see the very same issues of performance degrading when removing the direct mount option.

 

Hence we recommend installing this patch

 

SYMPTOM:

Performance degradation is seen after upgrade from SF 5.1SP1RP3 to SF 6.0.1 on Linux.

DESCRIPTION:

The degradation in performance is seen because the I/Os are not unplugged before getting delivered to lower layers in the I/O path. These I/Os are unplugged by the OS after a default time of 3 milliseconds, which results in additional overhead in the completion of I/Os.

RESOLUTION:

Code changes were made to explicitly unplug the I/Os before sending them to the lower layer.

* 3254132 (Tracking ID: 3186971)
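For reference (this is not part of the colleague's notes), the installed VxFS level can be confirmed before requesting the patch; a sketch, assuming the usual Veritas package and module names on Linux:

# installed Veritas File System package
rpm -qa | grep -i VRTSvxfs
# version of the loaded vxfs kernel module
modinfo vxfs | grep -i version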

 

Power management

Servers should have power management savings disabled and the power profile set to high performance.

Make sure C-states are disabled (i.e. set to C0).

This is configured at the BIOS level and requires a reboot.
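The BIOS setting can be cross-checked from the OS side; a sketch (sysfs paths vary by kernel and idle driver):

# deepest C-state the intel_idle driver may use (0 effectively disables C-states)
cat /sys/module/intel_idle/parameters/max_cstate
# idle states currently exposed for CPU 0
ls /sys/devices/system/cpu/cpu0/cpuidle/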

From the ITPUB blog, link: http://blog.itpub.net/8494287/viewspace-1347068/. Please credit the source when reprinting; otherwise legal liability may be pursued.
