OCFS, OCFS2, ASM, RAW Discussion Thread Compilation (Repost), Part 2

Posted by zhouwf0726 on 2019-03-29
HEARTBEAT
# How does the disk heartbeat work?
Every node writes every two secs to its block in the heartbeat system file. The block offset is equal to its global node number. So node 0 writes to the first block, node 1 to the second, etc. All the nodes also read the heartbeat sysfile every two secs. As long as the timestamp is changing, that node is deemed alive.
# When is a node deemed dead?
An active node is deemed dead if it does not update its timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops. Once a node is deemed dead, the surviving node that manages to cluster lock the dead node's journal recovers it by replaying the journal.
# What about self fencing?
A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration. If the next timestamp is written within that duration, as expected, it first cancels that timer before setting up a new one. This ensures the system will self-fence if, for some reason, the [o2hb-xx] kernel thread is unable to update the timestamp and the node is thus deemed dead by the other nodes in the cluster.
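With the default O2CB_HEARTBEAT_THRESHOLD of 7, this works out to (7 - 1) * 2 = 12 secs, the 12-sec window referred to later in this FAQ.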
# How can one change the parameter value of O2CB_HEARTBEAT_THRESHOLD?
This parameter value can be changed by adding it to /etc/sysconfig/o2cb and RESTARTING the O2CB cluster. The value should be the SAME on ALL the nodes in the cluster.
# What should one set O2CB_HEARTBEAT_THRESHOLD to?
It should be set to the timeout value of the io layer. Most multipath solutions have a timeout ranging from 60 secs to 120 secs. For 60 secs, set it to 31. For 120 secs, set it to 61.

O2CB_HEARTBEAT_THRESHOLD = (((timeout in secs) / 2) + 1)
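For example (an illustrative sketch, assuming an io layer timeout of 120 secs), one would set the following in /etc/sysconfig/o2cb on every node:

O2CB_HEARTBEAT_THRESHOLD=61

and then restart the cluster, e.g.:

# /etc/init.d/o2cb offline
# /etc/init.d/o2cb unload
# /etc/init.d/o2cb load
# /etc/init.d/o2cb online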

# How does one check the current active O2CB_HEARTBEAT_THRESHOLD value?

# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
7

# What if a node umounts a volume?
During umount, the node will broadcast to all the nodes that have mounted that volume to drop it from their node maps. As the journal is shut down before this broadcast, any crash of that node after this point is ignored, as there is no need for recovery.
# I encounter "Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing" whenever I run a heavy io load?
We have encountered a bug with the default CFQ io scheduler which causes a process doing heavy io to temporarily starve out other processes. While this is not fatal for most environments, it is for OCFS2, as we expect the hb thread to read/write the hb area at least once every 12 secs (default). A bug with the fix has been filed with Red Hat, which is expected to include the fix in the RHEL4 U4 release. SLES9 SP3 kernel 2.6.5-7.257 includes this fix. For the latest status, refer to the tracker bug filed on bugzilla. Until this issue is resolved, one is advised to use the DEADLINE io scheduler. To use it, add "elevator=deadline" to the kernel command line as follows:

* For SLES9, edit the command line in /boot/grub/menu.lst.

title Linux 2.6.5-7.244-bigsmp (with deadline)
kernel (hd0,4)/boot/vmlinuz-2.6.5-7.244-bigsmp root=/dev/sda5
vga=0x314 selinux=0 splash=silent resume=/dev/sda3 elevator=deadline showopts console=tty0 console=ttyS0,115200 noexec=off
initrd (hd0,4)/boot/initrd-2.6.5-7.244-bigsmp

* For RHEL4, edit the command line in /boot/grub/grub.conf:

title Red Hat Enterprise Linux AS (2.6.9-22.EL) (with deadline)
root (hd0,0)
kernel /vmlinuz-2.6.9-22.EL ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off
initrd /initrd-2.6.9-22.EL.img

To see the current kernel command line, do:

# cat /proc/cmdline
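If the deadline elevator is active, the output will include elevator=deadline, e.g. (illustrative output only):

ro root=LABEL=/ console=ttyS0,115200 console=tty0 elevator=deadline noexec=off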

QUORUM AND FENCING
# What is a quorum?
A quorum is a designation given to a group of nodes in a cluster which are still allowed to operate on shared storage. It comes up when there is a failure in the cluster which breaks the nodes up into groups which can communicate in their groups and with the shared storage but not between groups.
# How do OCFS2's cluster services define a quorum?
The quorum decision is made by a single node based on the number of other nodes that are considered alive by heartbeating and the number of other nodes that are reachable via the network.
A node has quorum when:

* it sees an odd number of heartbeating nodes and has network connectivity to more than half of them.
OR,
* it sees an even number of heartbeating nodes and has network connectivity to at least half of them *and* has connectivity to the heartbeating node with the lowest node number.
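For example (a sketch of how these rules play out): if the network link between the two nodes of a 2-node cluster fails while both keep heartbeating to disk, each node sees an even number of heartbeating nodes but has lost connectivity to the other. In effect, only the node with the lowest node number still passes the second test, so it keeps quorum while the other node fences itself.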

# What is fencing?
Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes will not get stuck trying to access its resources. Currently OCFS2 will panic the machine when it realizes it has to fence itself off from the cluster. As described above, it will do this when it sees more nodes heartbeating than it has connectivity to and fails the quorum test.
# How does a node decide that it has connectivity with another?
When a node sees another come to life via heartbeating, it will try to establish a TCP connection to that newly live node. It considers that other node connected as long as the TCP connection persists and the connection is not idle for 10 seconds. Once that TCP connection is closed or becomes idle, it will not be reestablished until heartbeat thinks the other node has died and come back alive.
# How long does the quorum process take?
First a node will realize that it doesn't have connectivity with another node. This can happen immediately if the connection is closed but can take a maximum of 10 seconds of idle time. Then the node must wait long enough to give heartbeating a chance to declare the node dead. It does this by waiting two iterations longer than the number of iterations needed to consider a node dead (see the Heartbeat section of this FAQ). The current default of 7 iterations of 2 seconds results in waiting for 9 iterations or 18 seconds. By default, then, a maximum of 28 seconds can pass from the time a network fault occurs until a node fences itself.
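Expressed as a single worked equation using the defaults above:

max time from network fault to fence = (max idle time) + (O2CB_HEARTBEAT_THRESHOLD + 2) * 2 secs
                                     = 10 + (7 + 2) * 2 = 28 secs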
# How can one prevent a node from panicking when the other node in a 2-node cluster is shut down?
This typically means that the network is being shut down before all the OCFS2 volumes have been umounted. Ensure the ocfs2 init script is enabled. This script ensures that the OCFS2 volumes are umounted before the network is shut down. To check whether the service is enabled, do:

# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off
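If the service is not enabled, it can be added with:

# chkconfig --add ocfs2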

# How does one list out the startup and shutdown ordering of the OCFS2 related services?

* To list the startup order for runlevel 3 on RHEL4, do:

# cd /etc/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S10network S24o2cb S25ocfs2

* To list the shutdown order on RHEL4, do:

# cd /etc/rc6.d
# ls K*ocfs2* K*o2cb* K*network*
K19ocfs2 K20o2cb K90network

* To list the startup order for runlevel 3 on SLES9, do:

# cd /etc/init.d/rc3.d
# ls S*ocfs2* S*o2cb* S*network*
S05network S07o2cb S08ocfs2

* To list the shutdown order on SLES9, do:

# cd /etc/init.d/rc3.d
# ls K*ocfs2* K*o2cb* K*network*
K14ocfs2 K15o2cb K17network

Please note that the default ordering in the ocfs2 scripts only includes the network service and not any shared-device-specific service, like iscsi. If one is using iscsi or any shared device requiring a service to be started and shut down, please ensure that that service starts before and shuts down after the ocfs2 init service (see the example below).
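For example, on RHEL4 with an iscsi init service (script name illustrative), one could verify the relative ordering in the same way as above; the iscsi S-number should be lower, and its K-number higher, than those of o2cb and ocfs2:

# cd /etc/rc3.d
# ls S*iscsi* S*o2cb* S*ocfs2*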

NOVELL SLES9
# Why are OCFS2 packages for SLES9 not made available on oss.oracle.com?
OCFS2 packages for SLES9 are available directly from Novell as part of the kernel. The same is true for the various Asianux distributions and for Ubuntu. As OCFS2 is now part of the mainline kernel, we expect more distributions to bundle the product with the kernel.
# What versions of OCFS2 are available with SLES9 and how do they match with the Red Hat versions available on oss.oracle.com?
As both Novell and Oracle ship OCFS2 on different schedules, the package versions do not match. We expect this to resolve itself over time as the number of patch fixes decreases. Novell is shipping two SLES9 releases, viz., SP2 and SP3.

* The latest kernel with the SP2 release is 2.6.5-7.202.7. It ships with OCFS2 1.0.8.
* The latest kernel with the SP3 release is 2.6.5-7.257. It ships with OCFS2 1.2.1.

RELEASE 1.2
# What is new in OCFS2 1.2?
OCFS2 1.2 has two new features:

* It is endian-safe. With this release, one can mount the same volume concurrently on x86, x86-64, ia64 and the big-endian architectures ppc64 and s390x.
* It supports readonly mounts. The fs uses this feature to automatically remount read-only when it encounters an on-disk corruption (instead of panicking).

# Do I need to re-make the volume when upgrading?
No. OCFS2 1.2 is fully on-disk compatible with 1.0.
# Do I need to upgrade anything else?
Yes, the tools need to be upgraded to ocfs2-tools 1.2. ocfs2-tools 1.0 will not work with OCFS2 1.2, nor will the 1.2 tools work with the 1.0 modules.

UPGRADE TO THE LATEST RELEASE
# How do I upgrade to the latest release?

* Download the latest ocfs2-tools and ocfs2console for the target platform and the appropriate ocfs2 module package for the kernel version, flavor and architecture. (For more, refer to the "Download and Install" section above.)

* Umount all OCFS2 volumes.

# umount -at ocfs2

* Shutdown the cluster and unload the modules.

# /etc/init.d/o2cb offline
# /etc/init.d/o2cb unload

* If required, upgrade the tools and console.

# rpm -Uvh ocfs2-tools-1.2.1-1.i386.rpm ocfs2console-1.2.1-1.i386.rpm

* Upgrade the module.

# rpm -Uvh ocfs2-2.6.9-22.0.1.ELsmp-1.2.2-1.i686.rpm

* Ensure init services ocfs2 and o2cb are enabled.

# chkconfig --add o2cb
# chkconfig --add ocfs2

* To check whether the services are enabled, do:

# chkconfig --list o2cb
o2cb 0:off 1:off 2:on 3:on 4:on 5:on 6:off
# chkconfig --list ocfs2
ocfs2 0:off 1:off 2:on 3:on 4:on 5:on 6:off

* At this stage one could either reboot the node or simply restart the cluster and mount the volume.

# Can I do a rolling upgrade from 1.0.x/1.2.x to 1.2.2?
A rolling upgrade to 1.2.2 is not recommended. Shut down the cluster on all nodes before upgrading the nodes.
# After the upgrade I get the following error on mount: "mount.ocfs2: Invalid argument while mounting /dev/sda6 on /ocfs".
Do "dmesg | tail". If you see the error:

ocfs2_parse_options:523 ERROR: Unrecognized mount option "heartbeat=local" or missing value

it means that you are trying to use the 1.2 tools with the 1.0 modules. Ensure that you have unloaded the 1.0 modules and installed and loaded the 1.2 modules. Use modinfo to determine the version of the module installed and/or loaded.
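For example, to check the version of the installed ocfs2 module, one could do something like:

# modinfo ocfs2 | grep -i version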
# The cluster fails to load. What do I do?
Check "demsg | tail" for any relevant errors. One common error is as follows:

SELinux: initialized (dev configfs, type configfs), not configured for labeling
audit(1139964740.184:2): avc: denied { mount } for ...

The above error indicates that you have SELinux activated. A bug in SELinux does not allow configfs to mount. Disable SELinux by setting "SELINUX=disabled" in /etc/selinux/config. The change takes effect on reboot.
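The currently active SELinux mode can be checked with, for example:

# getenforce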

[ This post was last edited by nntp on 2006-9-1 00:00 ]


vecentli replied on 2006-08-31 22:02:14

PROCESSES
# List and describe all OCFS2 threads?

[o2net]
One per node. Is a workqueue thread started when the cluster is brought online and stopped when offline. It handles the network communication for all threads. It gets the list of active nodes from the o2hb thread and sets up tcp/ip communication channels with each active node. It sends regular keepalive packets to detect any interruption on the channels.
[user_dlm]
One per node. Is a workqueue thread started when dlmfs is loaded and stopped on unload. (dlmfs is an in-memory file system which allows user space processes to access the dlm in kernel to lock and unlock resources.) Handles lock downconverts when requested by other nodes.
[ocfs2_wq]
One per node. Is a workqueue thread started when ocfs2 module is loaded and stopped on unload. Handles blockable file system tasks like truncate log flush, orphan dir recovery and local alloc recovery, which involve taking dlm locks. Various code paths queue tasks to this thread. For example, ocfs2rec queues orphan dir recovery so that while the task is kicked off as part of recovery, its completion does not affect the recovery time.
[o2hb-14C29A7392]
One per heartbeat device. Is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every 2 secs to its block in the heartbeat region to indicate to the other nodes that this node is alive. It also reads the region to maintain a nodemap of live nodes. It notifies o2net and dlm of any changes in the nodemap.
[ocfs2vote-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. It downgrades locks when requested by other nodes in response to blocking ASTs (BASTs). It also fixes up the dentry cache in response to files unlinked or renamed on other nodes.
[dlm_thread]
One per dlm domain. Is a kernel thread started when a dlm domain is created and stopped when destroyed. This is the core dlm which maintains the list of lock resources and handles the cluster locking infrastructure.
[dlm_reco_thread]
One per dlm domain. Is a kernel thread which handles dlm recovery whenever a node dies. If the node is the dlm recovery master, it remasters all the locks owned by the dead node.
[dlm_wq]
One per dlm domain. Is a workqueue thread. o2net queues dlm tasks on this thread.
[kjournald]
One per mount. Is used as OCFS2 uses JBD for journalling.
[ocfs2cmt-0]
One per mount. Is a kernel thread started when a volume is mounted and stopped on umount. Works in conjunction with kjournald.
[ocfs2rec-0]
Is started whenever another node needs to be recovered. This could be either on mount, when it discovers a dirty journal, or during operation, when hb detects a dead node. ocfs2rec handles the file system recovery and it runs after the dlm has finished its recovery.
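To see which of these threads are currently running on a node, one could, for example, do:

# ps -e -o pid,comm | egrep 'o2net|o2hb|user_dlm|ocfs2|dlm|kjournald'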


vecentli replied on 2006-08-31 22:02:47

url:

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#O2CB


nntp replied on 2006-09-01 00:44:25

Everyone, I have merged the main threads on this board discussing ocfs, ocfs2, ASM and raw into this one; please continue the discussion here.


nntp replied on 2006-09-01 03:05:31

If you are deploying RAC, need to get it done quickly and lack experience in this area, the "Oracle Validated Configurations" that Oracle provides are the best helper.
When Oracle first introduced OVC I thought it was extremely good; even for someone very familiar with Linux/Oracle/RAC it is a tool that greatly reduces the workload.

Friends who cannot quite figure out what is going on and are pressed by work deadlines can follow OVC exactly to get the job done; those who have already built RAC and run into faults can also use OVC as a troubleshooting reference.

Oracle Validated Configurations
http://www.oracle.com/technology/tech/linux/validated-configurations/index.html


nntp replied on 2006-09-01 03:46:54

http://forums.oracle.com/forums/thread.jspa?messageID=1337838
A very worthwhile Q&A discussion on the Oracle Forum. My view is basically the same as that of the last few posters there, especially the point one fellow made about the convenient conversion between ASM and RAW.
It also briefly touches on the question about the placement of the voting disk and OCR that I answered for someone earlier in this thread without giving much of the reasoning at the time.


vecentli replied on 2006-09-01 10:07:53

Quote: Originally posted by nntp on 2006-8-31 18:01



Single instance or RAC? If it is RAC, ASM can handle that situation even if power is lost. Do you subscribe to Oracle Magazine? There was an issue late last year describing a similar situation.




I am quite interested in that article. Could you provide a URL? :D

If one had to recover from this, I think it would be quite difficult... after all, there is not much material on ASM's internal I/O mechanisms.

[ This post was last edited by vecentli on 2006-9-1 10:10 ]


blue_stone replied on 2006-09-01 12:01:34

Could Red Hat's gfs and IBM's gpfs also be discussed here together?
Could someone compare gfs, gpfs, ocfs and ocfs2?
Use cases, reliability, availability, performance, stability and so on.


nntp replied on 2006-09-01 16:13:41

gfs and ocfs2 are the same kind of thing; they are not the same kind of thing as ocfs or gpfs. ocfs is unlike any of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They use an ordinary network to synchronize the file system caches on the different nodes, use cluster locks to eliminate the possibility of different applications on multiple nodes corrupting a file by racing to operate on it, and exchange node heartbeat status over an ordinary network. That is the functional similarity. In terms of maturity and performance, ocfs2 is still far from being in the same league as gfs: anywhere ocfs2 can be used, gfs can be used instead, but not the other way around. In an HA cluster environment, gfs acts as a "cheap, cut-down" polyserv. At least for now, my personal view is that gfs leads ocfs2 by roughly 3 years in technology, maturity, development effort invested and performance, and this gap may widen further.

ocfs can only be used for Oracle; it was also the first release in which Oracle brought a cluster file system into its development roadmap. As I said earlier, that version was never positioned as a general-purpose cluster file system, and whether on quality, performance or stability, negative opinions were in the majority within the Oracle user community.

Even today, at the ocfs2 stage, the Oracle mailing lists and forums are flooded with complaints about ocfs2's quality, performance and reliability.

ASM is the new-generation storage management system that Oracle has adopted on Linux, HP-UX, Solaris and several other commercial high-end Unix platforms. In terms of its standing among Oracle's products, the development effort invested, the user base, and the levels and domains where it applies, the ocfs2 project cannot compare with it.
Functionally, ASM is roughly equivalent to RAW + LVM. It also performs very well as data volume and access volume grow linearly; in real-world test environments ASM's performance is basically close to RAW, and since there is still volume-management overhead, a small performance cost is easy to understand. In linear-scaling tests, the performance of CLVM + OCFS2 is clearly lower than that of ASM and RAW. The day before yesterday a friend sent me the slides he presented at an annual meeting of a European high-energy physics laboratory; their lab's IT department counted that, adding up all the single databases and clusters across the lab, they now have more than 540 TB of data running on ASM, and after heavy use and testing they are quite satisfied with how ASM performs. Most of their systems are IA64 + Linux and AMD Opteron + Linux. If I have time, I will post some of their tests and conclusions here.

[ This post was last edited by nntp on 2006-9-1 16:30 ]


myprotein replied on 2006-09-15 09:14:06

nntp, you are really impressive!
One thing I don't understand: for lvm + ocfs2 you said lvm is not cluster-aware, but to my limited knowledge it seems AIX can create concurrent VGs? Is such a concurrent VG cluster-aware?


blue_stone replied on 2006-09-15 10:18:33

Quote: Originally posted by myprotein on 2006-9-15 09:14
nntp, you are really impressive!
One thing I don't understand: for lvm + ocfs2 you said lvm is not cluster-aware, but to my limited knowledge it seems AIX can create concurrent VGs? Is such a concurrent VG cluster-aware?



Neither lvm nor lvm2 is cluster-aware; the cluster-aware volume manager on Linux is clvm.
The concurrent VG in AIX is cluster-aware.


myprotein replied on 2006-09-15 10:47:13

Many thanks, boss.


king3171 replied on 2006-09-19 17:14:25

Quote: Originally posted by nntp on 2006-9-1 16:13
gfs and ocfs2 are the same kind of thing; they are not the same kind of thing as ocfs or gpfs. ocfs is unlike any of the others.

gfs/ocfs2 make it possible for multiple nodes to access the same location on shared storage. They use an ordinary network to synchronize the file system caches on the different nodes, u ...



I have read every reply in this thread and benefited a great deal. I am very interested in the comparison of these file systems, but I still have doubts. I am fairly familiar with the file systems of Sun Solaris; I know a little about HP-UX and AIX but am not very clear on their file systems. Solaris has a file system called Global File Systems, also referred to as a Cluster file system or Proxy file system, which I assumed was the GFS you were talking about. In Solaris, this Global File System can be accessed simultaneously by multiple nodes in the cluster, but only one node actually controls the reads and writes; the other nodes operate through this primary node, and when the primary node goes down, control transfers to another node. However, Solaris's Global File System is essentially no different from an ordinary UFS file system; the only difference is that the global option is added when mounting the partition that is to serve as the Global File System. Like this:
mount -o global,logging /dev/vx/dsk/nfs-dg/vol-01 /global/nfs
Last year, while building a Sun Cluster running IBM DB2, we ran into some problems when using this Global File System. The vendor's engineer later said that using Global File Systems was not recommended because it was prone to problems, so we removed it, although it was later confirmed that the problems were not caused by the Global File System.
What I would like to know is: is GFS a standard third-party technology that every vendor uses in the same way, or does each vendor go its own way, with technologies that have similar names but actually work on different principles? Please enlighten me!!!

[ This post was last edited by king3171 on 2006-9-19 17:18 ]


nntp replied on 2006-09-19 21:58:47

Sorry, if I may be blunt, your understanding of the Solaris cluster file system is not correct.

Solaris can run a standalone cluster file system product called SUN CFS - Cluster File System. It was bought over from Veritas CFS and O*'d into Sun's own product. In fact HP-UX also has a CFS, likewise O*'d from Veritas CFS. When that CFS was first launched, Sistina's GFS was still in its early infancy, so within the industry Veritas claimed that this CFS could deliver Global File Service.
This is one of the inaccuracies in the information you have. So the Sun/HP CFS claims to deliver Global File Service, but that GFS is not Sistina's "GFS" (Global File System). The one-word difference captures both the similarity and the distinction between the two.

As for the exact principles and internal details of Sun's CFS, you can look up a white paper on the Sun site; as I recall it is a PDF named something like Sun Cluster Software Cluster File System xxxxx. Google it; it gives a detailed introduction to the components, characteristics, principles and basic features of the Sun CFS, and is written quite clearly.

The pinned threads on this board have detailed links and documents about the GFS of Sistina, the company Red Hat acquired. Since you indicated in your post that you want to understand the difference between the two, and it cannot be made clear in a couple of sentences, I suggest you read both products' white papers and specifications in detail; you will then naturally end up with a fairly clear comparison.

Because they are different products, with rather different goals, design characteristics and uses, there is no common functional standard between them. There certainly are standards at the level of underlying coding and design; they are still designed according to the major standards common to the Unix world.


king3171 replied on 2006-09-20 13:33:58

Thanks, I will look into it further. You said Solaris can run a standalone cluster file system product called SUN CFS - Cluster File System. I am not sure whether the product you mean is Sun's Cluster 3.1 product; I think not, because Sun Cluster 3.1 does not mention what you describe, and the global file system I mentioned earlier is a concept within the Cluster 3.1 product. As for a separate cluster file system product, I never heard the Sun engineers mention one in my exchanges with them. I will check again, and if I find anything new or have further thoughts I will come back to discuss them with you.

[ This post was last edited by king3171 on 2006-9-20 13:46 ]


nntp replied on 2006-09-21 18:08:54

Sorry, you should really go and have a look. Heh heh.

From the "ITPUB Blog". Link: http://blog.itpub.net/756652/viewspace-242248/. If reposting, please cite the source; otherwise legal liability will be pursued.
