[Ocfs2-users] OCFS2 unstable on Disparate Hardware
Jorge Adrian Salaices
asalaices at tomkinsbp.com
Mon Jan 23 07:50:25 PST 2012
I have been trying to convince management at work to move from NFS to
OCFS2 for sharing the application layer of our Oracle EBS (Enterprise
Business Suite), and for a general "backup share", but instability in
my setup has dissuaded me from recommending it.
I have a mixture of OCFS2 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and
something as simple as an umount has triggered random node reboots, even
on nodes holding other OCFS2 mounts that the rebooting nodes do not share.
Part of my problem is disparate hardware: some of these servers are even
VMs. Several documents state that nodes should be roughly equal in power
and specs, and in my case that will never be true.
Unfortunately, I have had several other random fencing events that
common checks could not explain. For example, my network has never been
the problem, yet one server may see another go away while every other
service on that node is running perfectly well. I can only surmise that
an elevated load on the server starved the heartbeat process and
prevented it from sending network packets to the other nodes.
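If load-induced heartbeat starvation really is the cause, my understanding is that the relevant knobs are the O2CB timeouts in /etc/sysconfig/o2cb (the values shown below are, I believe, the stock defaults, not a recommendation):

```shell
# /etc/sysconfig/o2cb -- stock defaults, shown for reference only
O2CB_ENABLED=true
O2CB_STACK=o2cb
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=31    # disk heartbeat: iterations missed before a node is declared dead
O2CB_IDLE_TIMEOUT_MS=30000     # network: idle time before a connection is considered dead
O2CB_KEEPALIVE_DELAY_MS=2000   # network: delay before a keepalive packet is sent
O2CB_RECONNECT_DELAY_MS=2000   # network: delay between reconnect attempts
```

Raising the threshold and idle timeout might ride out load spikes, at the cost of slower detection of genuinely dead nodes.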
My config has about 40 nodes. I have 4 or 5 shared LUNs out of our SAN,
and not all servers share all mounts: only 10 or 12 share one LUN, 8 or
9 share another, and 2 or 3 share a third. Unfortunately, the layout is
complex enough that a server may intersect with some of the other
servers but not all.
Perhaps splitting my config into separate clusters is the solution, but
only if a node can be part of multiple clusters:
node:
	ip_port = 7777
	ip_address = 172.20.16.151
	number = 1
	name = txri-oprdracdb-1.tomkinsbp.com
	cluster = ocfs2-back

node:
	ip_port = 7777
	ip_address = 172.20.16.152
	number = 2
	name = txri-oprdracdb-2.tomkinsbp.com
	cluster = ocfs2-back

node:
	ip_port = 7777
	ip_address = 10.30.12.172
	number = 4
	name = txri-util01.tomkinsbp.com
	cluster = ocfs2-util, ocfs2-back

node:
	ip_port = 7777
	ip_address = 10.30.12.94
	number = 5
	name = txri-util02.tomkinsbp.com
	cluster = ocfs2-util, ocfs2-back

cluster:
	node_count = 2
	name = ocfs2-back

cluster:
	node_count = 2
	name = ocfs2-util
Is this even legal, or can it be done some other way? Or is this handled
by the different domains that are created once a mount is done?
How can I make the cluster more stable? And why does a node fence itself
even when it does not hold any locks on the shared LUN? It seems a node
may be "fenceable" simply by having the OCFS2 services turned on,
without any mount. Is this correct?
One more question: can the fencing method be anything other than panic
or restart? Can a third-party or userland event be triggered to recover
from what the heartbeat or network tests construe as a downed node?
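As far as I can tell, the only related knob is the o2cb fence_method attribute, which (if I have this right; it may not exist on older stacks) only toggles between reset and panic rather than allowing a userland hook:

```shell
# Assumes the cluster is named "ocfs2" and configfs is mounted; paths may differ.
cat /sys/kernel/config/cluster/ocfs2/fence_method        # "reset" is the default
echo panic > /sys/kernel/config/cluster/ocfs2/fence_method
```

Neither option lets a third party intervene before the node goes down, which is what I am really after.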
Thanks for any help you can give me.
--
Jorge Adrian Salaices
Sr. Linux Engineer
Tomkins Building Products