[Ocfs2-users] Also just a comment to the Oracle guys

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Wed Jan 31 14:44:31 PST 2007


Not iSCSI, but OCFS.

If you run OCFS in a 2-node configuration, then when one node crashes, the
second cannot resolve the split-brain problem, so it self-fences if it is not
the primary node.
This creates many scenarios in which (in 2-node OCFS) both nodes crash after a
single failure (one by itself, the second by self-fencing).

If you add a third system (doing nothing, just mounting the volume), it plays
the role of an arbiter - the node that can still see it knows it has quorum -
so a single node crash never destroys the whole cluster.

A real story: we ran 2 Oracle RAC nodes in the lab. Each had ASM and OCFSv2.
At some point, one of our switches restarted because of a short power glitch.
It caused the interconnect to go down for about 1 minute and caused some
delay in iSCSI disk access. No normal file system noticed it - all of them
resumed working with a few minor error messages. But the clusters were
another story...

Both nodes rebooted - one because 'ASM lost quorum' and the other because
'OCFS lost quorum' (they happened to have different masters).

Additional instability (for both Oracle RAC and OCFSv2) comes from the fact
that neither of them supports multiple network heartbeat interfaces (and
neither supports a serial heartbeat). This makes them sensitive to almost any
network failure. Bonding is not a full solution because it adds complexity
and instability of its own. Using loopback addresses + OSPF can solve the
problem, but in a rather unusual way (OSPF recovery time is very short, so it
can work well).
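If you do try bonding anyway, the usual approach is an active-backup bond
across two switches - a rough sketch for a Red Hat-style system (interface
names and addresses are examples only, and this is exactly the extra
complexity I mean):

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=active-backup miimon=100

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=10.0.0.1
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth1 (and the same for eth2)
    DEVICE=eth1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none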

iSCSI is another story. OCFSv2 has (HAD? I know there were plans to improve
it) very primitive decision-making about _what to do_ if it loses the
connection to the primary disk storage (for example, it reboots even if it
has no outstanding IO commands). So you must use a heartbeat timeout (a
counter, in reality) big enough to allow iSCSI IO to be restored after
network reconvergence (a switch reboot, for example, or an STP configuration
change - 40 seconds by the standard). Increasing it also increases OCFSv2
reconvergence time in the case of a real node failure, so it must be done
very carefully.
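As a rough illustration (based on my understanding of o2cb - verify against
the docs for your version): the disk heartbeat is written every 2 seconds and
a node self-fences after about (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds of
missed beats, so to ride out a ~40 second STP reconvergence you want a
threshold of roughly 31 or more:

    # /etc/sysconfig/o2cb (example value; older versions default to 7,
    # i.e. only ~12 seconds before self-fencing)
    O2CB_HEARTBEAT_THRESHOLD=31
    # fence time ~= (31 - 1) * 2s = 60s, enough to cover a 40s STP event
    # (restart o2cb / remount for the new value to take effect)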

CHECKLIST for the cluster. What to test:

- run the cluster. POWER RESET node1. Verify that node2 survived. When node1
comes back, do the same with node2.
Repeat in the other order.
- the same, but 'shutdown' the node instead of 'power reset' (power reset
means a hard power reset, not the shutdown button).
- drop the interconnect for 40 seconds, then restore it (a rough sketch of
this step follows the list).
- reboot the Ethernet switch. Verify that the nodes survived (at least one
node).
- reboot your iSCSI system (if it is a cluster, do a takeover and giveback).
Verify that the cluster survived.
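The interconnect-drop test can be scripted; a rough sketch, run on node1,
assuming eth1 is the interconnect interface (adjust to your own setup):

    # take the interconnect down for 40 seconds, then bring it back
    ip link set eth1 down    # or pull the cable / disable the switch port
    sleep 40
    ip link set eth1 up
    # then watch /var/log/messages on both nodes for o2net/o2hb messages
    # and check that neither node self-fenced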

Run every test with and without pending IO on the cluster file system(s). A
good cluster file system should survive any network and infrastructure
failure without self-fencing if it has no pending IO. The system should also
survive with pending IO, as long as the network or second-node failure is
shorter than your timeout (the heartbeat critical time, after which the
system decides that the peer is dead). You must find a balance between this
timeout and the maximum unavailability you can accept on your file system
(because this timeout becomes a service outage if the second node really
crashes).

There are a few catastrophic scenarios which should not happen in normal life
but which you must be aware of. One is _system freeze, then unfreeze a few
minutes later_. I saw it on some Linuxes because of a broken spinlock
somewhere in the memory allocation code (seen on SLES9 SP3 with a badly
configured HUGETLB and an Oracle 10.2.0.2 RAC cluster). The only ways to
prevent such things are external fencing (SLES10 can use it with OCFSv2,
but I never tested it myself), or maybe the Linux watchdog module
(hangcheck-timer). Another bad scenario is _blinking access_. I hit it when
we connected 2 iSCSI initiators with the same ID - OCFSv2 could not recognize
it, iSCSI still provided access, and the systems went crazy and damaged the
file system within a few minutes. I believe no one needs protection against
such errors (it was a primitive human error).
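For reference, the watchdog I mention is usually loaded like this (the 30/180
second values are the ones commonly quoted in Oracle RAC setup notes; tune
them against your own heartbeat timeouts):

    # reboot the node if a kernel timer is delayed by more than
    # hangcheck_margin seconds beyond its hangcheck_tick interval
    modprobe hangcheck-timer hangcheck_tick=30 hangcheck_margin=180
    # make it persistent, e.g. via an options line in modprobe.conf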




----- Original Message ----- 
From: "Joel Becker" <Joel.Becker at oracle.com>
To: "martin sautter" <martin.a.sautter at oracle.com>
Cc: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>; "OCFS2 Users List"
<ocfs2-users at oss.oracle.com>
Sent: Wednesday, January 31, 2007 12:53 PM
Subject: Re: [Ocfs2-users] Also just a comment to the Oracle guys


> On Wed, Jan 31, 2007 at 10:09:31AM +0100, martin sautter wrote:
> > please can anybody explain why iSCSI  requires 3 nodes  for a stable
> > cluster configuration  or which problems I  will have with a 2 node
> > OCFS2 cluster against iSCSI  based storage.
>
> I think he's claiming you'd have two "server nodes" that mount
> the ocfs2 volume, and one "iSCSI node" that actually hosts the disks and
> runs the iSCSI target.
> You can certainly have a different iSCSI target.
>
> Joel
>
> -- 
>
> "The opposite of a correct statement is a false statement. The
>  opposite of a profound truth may well be another profound truth."
>          - Niels Bohr
>
> Joel Becker
> Principal Software Developer
> Oracle
> E-mail: joel.becker at oracle.com
> Phone: (650) 506-8127
>



