[Ocfs2-users] Also just a comment to the Oracle guys

Mark Fasheh mark.fasheh at oracle.com
Wed Jan 31 15:44:51 PST 2007


On Wed, Jan 31, 2007 at 02:44:31PM -0800, Alexei_Roudnev wrote:
> If you run OCFS in a 2-node configuration, then when 1 node crashes, the
> second can't resolve the split-brain problem, so it self-fences if it is
> not the primary node.
> This creates many scenarios where (in 2-node OCFS) both nodes go down
> after some failure (one by itself, the second by self-fencing).

You are stating as fact something which is simply not true. We run many
destructive tests here at Oracle, and so do many of our customers (before
putting stuff into production). This is simply not normally the case - there
would be no reason to use a cluster file system (other than performance I
suppose) if it worked as poorly as you claim. And yes, we run two node
clusters here all the time.


> Real story. We ran 2 Oracle RAC nodes in the lab. Each had ASM and OCFSv2.
> At some point, one of our switches restarted because of a short power
> glitch. It caused the interconnect to go down for about 1 minute, and that
> caused some delay in iSCSI disk access. No ordinary file system noticed
> it - all resumed working with minor error messages. But the clusters were
> another story...
> 
> Both nodes rebooted - one because 'ASM lost quorum' and the second because
> 'OCFS lost quorum' (and they happened to have different masters).

Ocfs2 monitors its connection to the disk. You are correct that if that
connection is severed, the node will reboot. This is done in order to
maintain cluster integrity. If the node doesn't reboot itself, then the
surviving nodes cannot safely recover its journal and consider it out of
the cluster.
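
Schematically, the constraint on the surviving side looks something like
this (a toy sketch with made-up helper names, not the actual ocfs2
recovery code):

    /* A survivor may only replay a dead peer's journal once the disk
     * heartbeat shows the peer has stopped touching the disk, i.e. it
     * has fenced. Replaying the journal underneath a node that might
     * still be writing would corrupt the file system. */
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the o2hb liveness state. */
    static bool peer_heartbeat_stopped(int node)
    {
            (void) node;
            return true;    /* pretend the peer's slot went quiet */
    }

    int main(void)
    {
            int dead_node = 1;

            if (peer_heartbeat_stopped(dead_node))
                    printf("replaying journal of node %d\n", dead_node);
            else
                    printf("node %d may still be writing, must not "
                           "recover it\n", dead_node);
            return 0;
    }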

The situation you describe actually has very little to do with iscsi - I
could turn off my fibre channel disk array and cause a fence on each node
that's mounted to it.

Were you running your ocfs2 communication and your iscsi over the same
wires? That's the worst of both worlds. If, as you describe, the network were
to go down, not only would the disk heartbeat be lost, but all the ocfs2
communication links would go down too. The nodes have no choice then but to
reboot themselves - they have very little information other than "I can't
see anything".
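
For what it's worth, keeping the interconnect on its own network is just a
matter of which addresses you put in /etc/ocfs2/cluster.conf. A minimal
two-node example (the names and addresses here are made up; the point is
only that 192.168.100.x is a dedicated private link, not the network
carrying your iSCSI traffic):

    cluster:
            node_count = 2
            name = mycluster

    node:
            ip_port = 7777
            ip_address = 192.168.100.1
            number = 0
            name = node1
            cluster = mycluster

    node:
            ip_port = 7777
            ip_address = 192.168.100.2
            number = 1
            name = node2
            cluster = mycluster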


> iSCSI is another story. OCFSv2 has (HAD? I know of plans to improve it)
> very primitive decision-making about _what to do_ if it loses the
> connection to the primary disk storage (for example, it reboots even if it
> has no outstanding IO commands). So you must use a heartbeat time (a
> counter, in reality) big enough to allow iSCSI IO to be restored after
> network reconvergence (a switch reboot, for example, or an STP
> configuration change - 40 seconds by the standard). Increasing it will
> increase OCFSv2 reconvergence time in the case of a real node failure, so
> it must be done very carefully.

You are correct in that there is definitely some tuning that should be done
with ocfs2 heartbeat timeouts and iscsi timeouts. I'd also say that running
iscsi and your Ocfs2 node traffic over separate networks is probably a good
idea. And yes, we're going to have a configurable network timeout soon too.
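
To make the arithmetic concrete: the disk heartbeat threshold is set via
O2CB_HEARTBEAT_THRESHOLD (in /etc/sysconfig/o2cb on most distributions;
the exact path varies), and since the heartbeat ticks every two seconds, a
node fences after roughly (threshold - 1) * 2 seconds of missed
heartbeats. To ride out the 40-second STP reconvergence you mention, you
need (t - 1) * 2 > 40, i.e. a threshold of at least 22. For example:

    # Fence window is roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.
    # 31 gives a 60 second window - comfortably above a 40 second STP
    # reconvergence, at the cost of slower detection of a genuinely dead
    # node.
    O2CB_HEARTBEAT_THRESHOLD=31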

You are absolutely incorrect, however, that a lack of pending I/O should be a
reason to not fence. Let us put aside for a moment that there is practically
no such state - ocfs2 does a disk read and write every two seconds as a
method of monitoring the disk connection.
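
In other words, the disk heartbeat is an active protocol, not a passive
check. Schematically (a user-space toy with the shared disk faked by an
array - the real o2hb code lives in the kernel and does real sector I/O):

    /* Each node bumps a sequence number in its own slot every two
     * seconds and watches everyone else's slot. A slot that stops
     * changing for THRESHOLD ticks marks that node as dead. */
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_NODES 2
    #define THRESHOLD 31            /* ticks with no progress */

    static unsigned long slots[NUM_NODES];  /* fake heartbeat area */

    int main(void)
    {
            unsigned long last_seq[NUM_NODES] = { 0 };
            int missed[NUM_NODES] = { 0 };
            int me = 0, i;

            for (;;) {
                    slots[me]++;            /* "write": I am alive */
                    for (i = 0; i < NUM_NODES; i++) {
                            if (i == me)
                                    continue;
                            /* "read": is this peer making progress? */
                            if (slots[i] == last_seq[i]) {
                                    if (++missed[i] == THRESHOLD)
                                            printf("node %d is dead, "
                                                   "recover it\n", i);
                            } else {
                                    missed[i] = 0;
                            }
                            last_seq[i] = slots[i];
                    }
                    sleep(2);               /* the two-second tick */
            }
    }

A node whose own heartbeat writes start failing draws the same conclusion
about itself - that is the self-fencing described above.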

The point of fencing is to ensure that a node is sufficiently isolated from
the rest of the cluster so that recovery on that node can be performed.
Whether or not the node is concurrently doing I/O to the disk is irrelevant.
If it is recovered (so that the cluster can continue), then it needs to have
its access to the disk shut down (typically this means a reboot).

To use a specific example, say an admin accidentally unplugs the disk
connection from one node. Ocfs2 nodes fence because the journal of the
misbehaving node needs to be recovered (amongst other reasons). Now say that
node didn't fence, but its journal was still recovered by the remaining nodes
so that they could continue.

If the admin notices that the disk cable is unplugged and plugs it back in,
the node which has been recovered is now out of sync with the other nodes.
This can cause very serious corruption of your file system the _next time_
that node decides to write to the disk.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com


