[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware

Sunil Mushran sunil.mushran at oracle.com
Fri Jan 27 10:41:02 PST 2012


Symmetric clustering works best when the nodes are comparable because 
all nodes have to work in sync. NFS may be more suitable for your needs.

On 01/26/2012 05:51 PM, Jorge Adrian Salaices wrote:
> I have been working on convincing management at work to move from NFS
> to OCFS2 for sharing the application layer of our Oracle EBS
> (E-Business Suite) and for a general-purpose "backup share", but the
> general instability of my setup has dissuaded me from recommending it.
>
> I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and
> something as simple as an umount has triggered random node reboots, even
> on nodes that have other OCFS2 mounts not shared with the rebooting
> nodes. The problem, you see, is that my hardware is disparate, and some
> of these servers are even VMs.
>
> Several documents state that nodes should be roughly equal in power and
> specs, and in my case they never will be.
> Unfortunately, I have also had several other random fencing events that
> common checks could not explain. For example, my network has never been
> the problem, yet one server may see another go away while every other
> service on that node is running perfectly fine. I can only surmise that
> elevated load on the server starved the heartbeat process and prevented
> it from sending network packets to the other nodes.
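>
> For what it's worth, the timeout knobs I have been looking at live in
> /etc/sysconfig/o2cb (the values below are the stock defaults, shown
> only for illustration; I have not settled on better ones):
>
>         # Disk heartbeat: iterations of 2 seconds each before a node
>         # is declared dead.
>         O2CB_HEARTBEAT_THRESHOLD=31
>         # Network idle timeout (ms) before a connection is considered dead.
>         O2CB_IDLE_TIMEOUT_MS=30000
>         # Delay (ms) between network keepalive packets.
>         O2CB_KEEPALIVE_DELAY_MS=2000
>         # Delay (ms) before reconnecting after a dropped connection.
>         O2CB_RECONNECT_DELAY_MS=2000
>
> Presumably, raising O2CB_IDLE_TIMEOUT_MS would make the cluster more
> tolerant of load spikes that starve the heartbeat threads, at the cost
> of slower detection of real failures.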
>
> My config has about 40 nodes. I have 4 or 5 shared LUNs out of our SAN,
> and not all servers share all mounts: only 10 or 12 share one LUN, 8 or
> 9 share another, and 2 or 3 share a third. Unfortunately, the
> complexity is such that a server's set of mounts may intersect with
> some servers' but not all.
> Perhaps splitting my config into separate clusters is the solution, but
> only if a node can be part of multiple clusters:
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.151
>         number = 1
>         name = txri-oprdracdb-1.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 172.20.16.152
>         number = 2
>         name = txri-oprdracdb-2.tomkinsbp.com
>         cluster = ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.172
>         number = 4
>         name = txri-util01.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> node:
>         ip_port = 7777
>         ip_address = 10.30.12.94
>         number = 5
>         name = txri-util02.tomkinsbp.com
>         cluster = ocfs2-util, ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-back
>
> cluster:
>         node_count = 2
>         name = ocfs2-util
> Is this even legal, or can it be done some other way? Or is this
> handled by the different domains that are created once a mount is done?
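>
> (The only cluster-selection knob I have found so far is
> O2CB_BOOTCLUSTER in /etc/sysconfig/o2cb, and it names exactly one
> cluster for the init script to bring online, which makes me suspect the
> answer is one cluster per node. A minimal sketch of what I mean:
>
>         O2CB_ENABLED=true
>         # The init script brings only this single cluster online at boot.
>         O2CB_BOOTCLUSTER=ocfs2-back
>
> But I would be happy to be told otherwise.)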
>
>
> How can I make the cluster more stable? And why does a node fence
> itself even when it holds no locks on the shared LUN? It seems a node
> may be "fenceable" simply by having the OCFS2 services turned on,
> without any mount. Is this correct?
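>
> (To check whether a node is actually heartbeating on a device, I have
> been listing the heartbeat regions in configfs; assuming I read this
> right, a node with o2cb started but nothing mounted shows none:
>
>         # One directory per actively heartbeating device, named by
>         # region UUID; empty when o2cb is started but nothing is mounted.
>         ls /sys/kernel/config/cluster/ocfs2-back/heartbeat/
> )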
>
> One more question: can the fencing method be something other than panic
> or restart? Can a third-party or userland event be triggered to recover
> from what the "heartbeat" or "network tests" construe as a downed node?
>
> Thanks for any help you can give me.
>
>
> --
> Jorge Adrian Salaices
> Sr. Linux Engineer
> Tomkins Building Products
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


