[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware

Jorge Adrian Salaices asalaices at tomkinsbp.com
Fri Jan 27 12:14:54 PST 2012


Thank you all for your help, Sunil and Sergio.

I am working to normalize all my servers to 5.7+UEK+Powerpath and will 
test stability once that is done.

OCFS2 is really my only solution, as I need very good throughput; NFS 
simply does not cut it for what I need.

I will report my findings once it is all on 5.7+UEK.

Cheers, and keep up the fine work!

Jorge Adrian Salaices
Sr. Linux Administrator
Tomkins Building Products
(817)776-7822

On 01/27/2012 10:54 AM, Sérgio Surkamp wrote:
> Hello Jorge,
>
>> I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and
>> something as simple as an umount has triggered random node reboots,
>> even on nodes that have other OCFS2 mounts not shared by the rebooting
>> nodes. You see, the problem is that I have disparate hardware, and
>> some of these servers are even VMs.
> This is probably the source of your instability: you shouldn't mix
> different versions of the filesystem in the same cluster stack, as there
> *may* be network protocol incompatibilities between the versions. You
> should also not mount the same filesystem with different driver versions.
>
> The nodes that fence without having the OCFS2 filesystem mounted are
> probably hitting an oops inside the o2cb (cluster stack) driver, which
> *could be* triggered by the mix of versions and some protocol
> incompatibility between them.
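>
> As a quick sanity check (just a sketch; adjust to however your modules
> are packaged), you can compare the loaded ocfs2 module version on each
> node before digging further:
>
> # modinfo ocfs2 | grep -i version
>
> If the reported versions differ between nodes of the same cluster, that
> mismatch is the first thing to fix.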
>
>> Several documents state that nodes have to be roughly equal in power
>> and specs, and in my case that will never be true.
>> Unfortunately, I have had several other events of random fencing that
>> common checks cannot explain. For example, my network has never been
>> the problem, yet one server may see another go away while all of the
>> other services on that node are running perfectly fine. I can only
>> surmise that the cause was an elevated load on the server that starved
>> the heartbeat process, preventing it from sending network packets to
>> the other nodes.
>>
>> My config has about 40 nodes in it. I have 4 or 5 different shared
>> LUNs out of our SAN, and not all servers share all mounts: only 10 or
>> 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third.
>> Unfortunately, the complexity is such that a server may intersect with
>> some of the other servers but not all of them. Perhaps a change in my
>> config to create separate clusters may be the solution, but only if a
>> node can be part of multiple clusters:
>>
>> node:
>>           ip_port = 7777
>>           ip_address = 172.20.16.151
>>           number = 1
>>           name = txri-oprdracdb-1.tomkinsbp.com
>>           cluster = ocfs2-back
>>
>> node:
>>           ip_port = 7777
>>           ip_address = 172.20.16.152
>>           number = 2
>>           name = txri-oprdracdb-2.tomkinsbp.com
>>           cluster = ocfs2-back
>>
>> node:
>>           ip_port = 7777
>>           ip_address = 10.30.12.172
>>           number = 4
>>           name = txri-util01.tomkinsbp.com
>>           cluster = ocfs2-util, ocfs2-back
>> node:
>>           ip_port = 7777
>>           ip_address = 10.30.12.94
>>           number = 5
>>           name = txri-util02.tomkinsbp.com
>>           cluster = ocfs2-util, ocfs2-back
>>
>> cluster:
>>           node_count = 2
>>           name = ocfs2-back
>>
>> cluster:
>>           node_count = 2
>>           name = ocfs2-util
>>
>> Is this even legal, or can it be done some other way? Or is this
>> handled based on the different domains that are created once a mount
>> is done?
> That isn't possible. The cluster stack does not support the definition
> of more than one cluster. Take a look at the list archives if you are
> interested in why there cannot be more than one definition.
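>
> Just as an illustration (reusing the node names, addresses and cluster
> name from your own example, so treat them as placeholders), a valid
> /etc/ocfs2/cluster.conf keeps every node in exactly one cluster stanza:
>
> node:
>          ip_port = 7777
>          ip_address = 172.20.16.151
>          number = 1
>          name = txri-oprdracdb-1.tomkinsbp.com
>          cluster = ocfs2-back
>
> node:
>          ip_port = 7777
>          ip_address = 172.20.16.152
>          number = 2
>          name = txri-oprdracdb-2.tomkinsbp.com
>          cluster = ocfs2-back
>
> cluster:
>          node_count = 2
>          name = ocfs2-back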
>
>> How can I make the cluster more stable? And why does a node fence
>> itself even if it does not hold any locks on the shared LUN? It seems
>> that a node may be "fenceable" simply by having the OCFS2 services
>> turned on, without a mount. Is this correct?
>>
>> Another question I have is: can the fencing method be something other
>> than panic or restart? Can a third-party or userland event be
>> triggered to recover from what the "heartbeat" or "network tests" may
>> construe as a downed node?
>>
>> Thanks for any of the help you can give me.
> The server fences because a driver issued a kernel oops due to
> unexpected behaviour, so there is no guarantee that the kernel or the
> driver is still stable when that happens. That's why it is recommended
> that the server restart in this case.
>
> You can disable the automatic fence by setting the sysctl parameter
> kernel.panic_on_oops to 0:
>
> # echo 0 > /proc/sys/kernel/panic_on_oops
>
> To disable the fence permanently, you can add (or modify) the following
> line in your /etc/sysctl.conf:
>
> kernel.panic_on_oops=0
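>
> To apply the change from /etc/sysctl.conf without a reboot, reloading
> it should work:
>
> # sysctl -p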
>
> By doing this, the server will not fence if one of the kernel drivers
> issues an oops; instead, the driver will probably crash and may leave
> your server unstable, or taken down by a kernel panic.
>
> In any case, you should configure a netconsole, so that if any node
> oopses or panics you still get the error messages and stack traces.
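>
> As a rough sketch (the addresses, interface and MAC below are
> placeholders, not taken from your setup), loading the module by hand
> looks something like this:
>
> # modprobe netconsole \
>     netconsole=6666@192.168.1.10/eth0,514@192.168.1.20/00:11:22:33:44:55
>
> The left side is the local port, IP and interface; the right side is
> the port, IP and MAC address of the machine that will collect the
> messages (e.g. a syslog daemon listening on UDP port 514).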
>
> Regards,