[Ocfs2-users] Also just a comment to the Oracle guys

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Wed Jan 31 17:07:37 PST 2007


I can only partially agree (some of my information comes from previous
versions; the first OCFSv2 releases rebooted all the time, and while that has
improved dramatically, I am not sure which scenarios still reboot and which
do not).

Two-node clusters are never fully reliable because of the split-brain
problem. OCFSv2 effectively has two heartbeat channels - one over the network
and one over the disk - so it can distinguish a second node that has rebooted
from one that is merely disconnected. Even so, there have been numerous
reports of such double panics; it was definitely the case in early versions
and can still happen in a two-node cluster today. There is nothing wrong with
the recommendation to run a third node where practical - any two-node cluster
without external fencing can experience such double reboots by design (Oracle
RAC is no exception).
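
Adding a third node is, configuration-wise, just one more node: stanza in
/etc/ocfs2/cluster.conf on every node, plus bumping node_count. A rough
example (node names and addresses here are made up, and note that whitespace
in this file is significant):

node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 0
        name = rac1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 1
        name = rac2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.3
        number = 2
        name = rac3
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2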

Now about fencing. If the file system has no pending IO (meaning no locks
held, no unsaved buffers, nothing waiting on locks or reads, and so on - you
know the details better; in short, it is not in use), then fencing does not
make much sense - it is 100% equivalent to a simple remount (after a reboot
the system would attempt the same remount anyway). The current problem is
that if I use OCFS only 10% of the time (to start a server, back up a
database, or store archive logs), there is a chance the system will reboot
itself even though there are no open files on OCFS and therefore no need to
reboot - the system lost its connection to the disk, but it could easily
remount and report 'I have lost my connection, how are you' to the other
nodes.
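
Whether a mount is 'in use' in this sense is trivial to check from user
space; for example (the mount point here is made up):

    fuser -vm /u02/backup

lists every process that still has a file open on that file system.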


There are 3 special cases:
  - All nodes have the same IO problem (for example, the SAN array or a SAN
    switch restarted) - fencing makes no sense; the cluster must wait until
    IO goes through again on at least some nodes.
  - A node has an IO or network problem but can remount the file system
    because it has no pending file system IO (I do not mean the disk
    heartbeat IO) - then it is better to remount.
  - A node experiences a fatal problem, but the OCFS file system on it is
    only a secondary FS - then it is better to FAIL that file system and not
    panic the node.

Examples (a rough code sketch of this policy follows them):

case 1. I restart a NetApp filer serving iSCSI. The systems must wait until
IO comes through again; there is no sense in reboots or fencing.
case 2. I use OCFSv2 for Oracle backups. If OCFS runs into problems in prime
time, it should not panic Oracle.
case 3. The same, but at night: OCFS hits a problem and the backup fails,
but Oracle is not panicked.
case 4. I use OCFS to hold Oracle tablespaces in RAC. If it runs into
problems, it is safer to reboot.
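
To make it concrete, here is roughly the policy I am asking for, written as
code. This is only a sketch of my proposal, not of anything OCFSv2 actually
does today; the Volume class and every name in it are invented:

# Sketch of the failure handling argued for above - NOT current OCFSv2
# behaviour. Everything here is illustrative only.

class Volume:
    def __init__(self, name, pending_io, secondary):
        self.name = name              # mount point / label
        self.pending_io = pending_io  # locks held, dirty buffers, waiters, ...
        self.secondary = secondary    # True for backup/archive areas, False for RAC data

def on_storage_failure(volume, every_node_lost_storage):
    """What a node should do when it loses access to the shared disk."""
    if every_node_lost_storage:
        # Case 1: SAN array or SAN switch restarted - fencing gains nothing,
        # everyone waits until IO goes through again somewhere.
        return "wait for storage"
    if not volume.pending_io:
        # Case 2: the volume is idle - drop it and remount, which is exactly
        # what a reboot would end up doing anyway, without killing the node.
        return "remount"
    if volume.secondary:
        # Case 3: secondary FS (backups, archive logs) - fail the FS and the
        # job using it, but leave Oracle on this node alone.
        return "fail the file system"
    # Case 4: primary data (RAC tablespaces) - self-fencing is the safe
    # answer here, no argument.
    return "self-fence"

# A nightly backup copy in flight when the iSCSI path drops on one node
# should fail the backup, not panic the server:
print(on_storage_failure(Volume("/u02/backup", pending_io=True, secondary=True), False))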


I will retest the failure scenarios for OCFSv2 in the lab again, but
preliminary tests and day-to-day experience (I use it in a few setups) show
that it significantly decreases cluster reliability, especially when mixed
with other cluster stacks (heartbeat or RAC).
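
(For completeness: the heartbeat counter discussed below is
O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb. If I remember correctly, a
node fences itself after roughly (threshold - 1) * 2 seconds without a
successful disk heartbeat, so a setting such as

    O2CB_HEARTBEAT_THRESHOLD=31

allows about 60 seconds of disk outage - enough to ride out the standard
40-second STP reconvergence mentioned below - at the price of slower takeover
when a node really dies.)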


----- Original Message ----- 
From: "Mark Fasheh" <mark.fasheh at oracle.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: "Joel Becker" <Joel.Becker at oracle.com>; "martin sautter"
<martin.a.sautter at oracle.com>; "OCFS2 Users List"
<ocfs2-users at oss.oracle.com>
Sent: Wednesday, January 31, 2007 3:44 PM
Subject: Re: [Ocfs2-users] Also just a comment to the Oracle guys


> On Wed, Jan 31, 2007 at 02:44:31PM -0800, Alexei_Roudnev wrote:
> > If you run OCFS in 2 node configuration, then, when 1 node crashes,
> > second can't resolve split-brain problem so it self-fence if it is not
> > primary node.
> > It makes many scenarios, when (in 2 node OCFS) both nodes crash after
> > some failure (one by itself, second by self fencing).
>
> You are stating as fact, something which is simply not true. We run many
> destructive tests here at Oracle, and so do many of our customers (before
> putting stuff into production). This is simply not normally the case -
> there would be no reason to use a cluster file system (other than
> performance I suppose) if it worked as poorly as you claim. And yes, we
> run two node clusters here all the time.
>
>
> > Real story. We run 2 Oracle RAC nodes in the lab. Each had ASM and
> > OCFSv2. In some point, one of our switches restarted because of short
> > power glitch. It cause interconnect to went down for about 1 minute and
> > it caused some delay in iSCSI disk access. No one normal file system
> > noticed it - all resumed working with a minor error messages. But
> > clusters was another story...
> >
> > Both nodes rebooted - one because 'ASM lost quorum' and second because
> > 'OCFS lost quorum' (and they happen to have a different masters).
>
> Ocfs2 monitors its connection to the disk. You are correct that if that
> connection is severed, the node will reboot. This is done in order to
> maintain cluster integrity. If the node doesn't reboot itself, then the
> surviving nodes cannot safely recover its journal and consider it out of
> the cluster.
>
> The situation you describe actually has very little to do with iscsi - I
> could turn off my fibre channel disk array and cause a fence on each node
> that's mounted to it.
>
> Were you running your ocfs2 communication and your iscsi over the same
> wires? That's the worst of both worlds. If as you describe, the network
> were to go down, not only would the disk heartbeat be lost, but all the
> ocfs2 communication links would go down too. The nodes have no choice then
> but to reboot themselves - they have very little information other than "I
> can't see anything".
>
>
> > iSCSI is another story. OCFSv2 have (HAD? I knew about plans to improve
> > it) a very primitive decision - making _what to do_ if it lost
> > connection to the primary disk storage (for example, it reboots even if
> > it have not outstanding IO commands). So, your must use heartbeat time
> > (counter, in reality) big enough to allow iSCSI IO to be restored after
> > network reconvergence (switch reboot, for example, or STP configuration
> > change - 40 seconds by standard). Increasing it will increase OCFSv2
> > reconvergence time in case of real node failure, so it must be done very
> > careful.
>
> You are correct in that there is definitely some tuning that should be
> done with ocfs2 heartbeat timeouts and iscsi timeouts. I'd also say that
> running iscsi and your Ocfs2 node traffic over separate networks is
> probably a good idea. And yes, we're going to have a configurable network
> timeout soon too.
>
> You are absolutely incorrect however that a lack of pending I/O should be
> a reason to not fence. Let us put aside for a moment that there is
> practically no such state - ocfs2 does a disk read and write every two
> seconds as a method of monitoring the disk connection.
>
> The point of fencing is to ensure that a node is sufficiently isolated
> from the rest of the cluster so that recovery on that node can be
> performed. Whether or not the node is concurrently I/Oing to the disk is
> irrelevant. If it is recovered (so that the cluster can continue), then it
> needs to have its access to the disk shut down (typically this means a
> reboot).
>
> To use a specific example, say an admin accidentally unplugs the disk
> connection from one node. Ocfs2 nodes fence because the journal of the
> misbehaving node needs to be recovered (amongst other reasons). Now say
> that node didn't fence, but its journal was still recovered by the
> remaining nodes so that they could continue.
>
> If the admin notices that the disk cable is unplugged and plugs it back
> in, the node which has been recovered is now out of sync with the other
> nodes.
> This can cause very serious corruption of your file system the _next time_
> that node decides to write to the disk.
> --Mark
>
> --
> Mark Fasheh
> Senior Software Developer, Oracle
> mark.fasheh at oracle.com
>



