<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Thank you all for your help, Sunil and S&eacute;rgio.<br>

    <br>

    I am working to normalize all my servers on EL 5.7 + UEK + PowerPath

    and will test stability once that is done.<br>

    <br>

    OCFS2 is really my only option, as I need very good throughput;

    NFS simply does not cut it for what I need.<br>

    <br>

    I will report my findings once it's all on 5.7 + UEK.<br>

    <br>

    Cheers, and keep up the fine work!<br>

    <br>

    <div class="moz-signature"><b>Jorge Adrian Salaices</b><br>

      Sr. Linux Administrator<br>

      <b>Tomkins Building Products</b><br>

      (817)776-7822<br>

    </div>

    <br>

    On 01/27/2012 10:54 AM, S&eacute;rgio Surkamp wrote:

    <blockquote

      cite="mid:20120127145447.7748501e@icedearth.gruposinternet.com.br"

      type="cite">

      <pre wrap="">Hello Jorge,

</pre>

      <blockquote type="cite">

        <pre wrap="">I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK) and 

something as simple as a umount has triggered random node reboots,

even on nodes that have other OCFS2 mounts not shared by the

rebooting nodes. You see, the problem I have is that I have disparate

hardware, and some of these servers are even VMs.

</pre>

      </blockquote>

      <pre wrap="">

This is probably the source of your instability. You shouldn't mix

different versions of the filesystem in the same cluster stack, as

there *may* be network protocol incompatibilities between the versions.

You also should not mount the same filesystem with different driver

versions. Nodes that fence without having the OCFS2 volume mounted are

probably hitting an oops inside the o2cb (cluster stack) driver that

*could be* triggered by the mix of versions and some protocol

incompatibility between them.

</pre>

      <blockquote type="cite">

        <pre wrap="">Several documents state that nodes have to be roughly equal in power

and specs, and in my case that will never be.

Unfortunately for me, I have had several other events of random

fencing that common checks have not explained.

For example, my network has never been the problem, yet one server may

see another one go away while all of the other services on that node

are running perfectly fine. I can only surmise that an elevated load on

the server starved the heartbeat process, preventing it from sending

network packets to the other nodes.

My config has about 40 nodes in it. I have 4 or 5 different shared

LUNs out of our SAN, and not all servers share all mounts:

only 10 or 12 share one LUN, 8 or 9 share another, and 2 or 3

share a third. Unfortunately, the complexity is such that a server may

intersect with some of the servers but not all.

Perhaps changing my config to create separate clusters may be the

solution, but only if a node can be part of multiple clusters:

node:

         ip_port = 7777

         ip_address = 172.20.16.151

         number = 1

         name = txri-oprdracdb-1.tomkinsbp.com

         cluster = ocfs2-back

node:

         ip_port = 7777

         ip_address = 172.20.16.152

         number = 2

         name = txri-oprdracdb-2.tomkinsbp.com

         cluster = ocfs2-back

node:

         ip_port = 7777

         ip_address = 10.30.12.172

         number = 4

         name = txri-util01.tomkinsbp.com

         cluster = ocfs2-util, ocfs2-back

node:

         ip_port = 7777

         ip_address = 10.30.12.94

         number = 5

         name = txri-util02.tomkinsbp.com

         cluster = ocfs2-util, ocfs2-back

cluster:

         node_count = 2

         name = ocfs2-back

cluster:

         node_count = 2

         name = ocfs2-util

Is this even legal, or can it be done some other way?

Or is this handled by the different domains that are created once

a mount is done?

</pre>

      </blockquote>

      <pre wrap="">

This isn't possible. The cluster configuration does not support

defining more than one cluster. Take a look at the list archives if you

are interested in why there cannot be more than one definition.
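
A minimal sketch of what a single-cluster layout could look like for the
four nodes quoted above, assuming every node joins the one cluster and
that which volumes a node mounts (not its cluster membership) decides
which LUNs it shares. The cluster name "ocfs2" is only a placeholder, and
the stanza layout is copied from the example above:

```
cluster:
         node_count = 4
         name = ocfs2

node:
         ip_port = 7777
         ip_address = 172.20.16.151
         number = 1
         name = txri-oprdracdb-1.tomkinsbp.com
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 172.20.16.152
         number = 2
         name = txri-oprdracdb-2.tomkinsbp.com
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.30.12.172
         number = 4
         name = txri-util01.tomkinsbp.com
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.30.12.94
         number = 5
         name = txri-util02.tomkinsbp.com
         cluster = ocfs2
```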

</pre>

      <blockquote type="cite">

        <pre wrap="">How can I make the cluster more stable? And why does a node fence

itself on the cluster even if it does not have any locks on the

shared LUN? It seems that a node may be "fenceable" simply

by having the OCFS2 services turned on, without a mount.

Is this correct?

Another question I have been having as well is: can the fencing

method be something other than panic or restart? Can a third party or

a userland event be triggered to recover from what may be construed by

the "Heartbeat" or "Network tests" as a downed node?

Thanks for any help you can give me.

</pre>

      </blockquote>

      <pre wrap="">

The server fences because a driver issued a kernel oops due to

unexpected behaviour, so there is no guarantee that the kernel or the

driver is still stable when that happens. That's why it is recommended

that the server restart in this case.

You can disable the automatic fence by setting the sysctl parameter

kernel.panic_on_oops to 0:

# echo 0 &gt; /proc/sys/kernel/panic_on_oops

To disable the fence permanently, add (or modify) the following

line in your /etc/sysctl.conf:

kernel.panic_on_oops=0

With this setting, the server will not fence if a kernel driver issues

an oops. Instead, the driver will probably crash, which may leave your

server unstable or end in a kernel panic anyway.
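
Before changing anything, it may help to see what a node is currently
set to. A minimal sketch, assuming a Linux host with /proc mounted; the
helper name oops_policy and its optional file argument are made up for
illustration and are not part of any OCFS2 tool:

```shell
# Hypothetical helper: report how the kernel reacts to an oops.
# $1 is an optional file to read instead of the live /proc entry.
oops_policy() {
    value=$(cat "${1:-/proc/sys/kernel/panic_on_oops}")
    case "$value" in
        0) echo "continue after oops" ;;
        1) echo "panic on oops" ;;
        *) echo "unexpected value: $value" ;;
    esac
}

# Report the setting of the node this is run on.
oops_policy
```

Running it on each node before and after the change is a quick way to
confirm that all nodes ended up with the same setting.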

In any case, you should configure netconsole, so that if a node oopses

or panics you still get the error messages and stack traces.
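
As a rough sketch of the netconsole setup, assuming the netconsole
module is available for your kernel: the module takes one parameter of
the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. The helper name
netconsole_param, the interface eth0, the log host 172.20.16.10 and the
MAC address below are all hypothetical placeholders; 6665 and 6666 are
the module's default source and target ports:

```shell
# Hypothetical helper: build the netconsole module parameter.
# Arguments: source IP, source interface, log host IP, log host MAC.
netconsole_param() {
    echo "netconsole=6665@$1/$2,6666@$3/$4"
}

# Print the parameter for one node (all addresses are placeholders).
netconsole_param 172.20.16.151 eth0 172.20.16.10 00:11:22:33:44:55
```

As root, the printed string would then be passed to "modprobe
netconsole", with something on the log host listening for UDP on port
6666.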

Regards,

</pre>

    </blockquote>

  </body>

</html>