[Ocfs2-users] Catatonic nodes under SLES10

Joel Becker Joel.Becker at oracle.com
Tue Apr 10 18:23:51 PDT 2007


On Tue, Apr 10, 2007 at 07:33:01PM +0200, Eckenfels. Bernd wrote:
> > 	By the time you determine you need a node to fence, you do not
> > know what I/O it has in its pipeline.  Any I/O that is below the
> > request queue can't reliably be stopped in Linux.  If that I/O goes
> > out after other nodes have decided the node is gone, it is
> > corruption.  This is why fencing has to be absolute.
> 
> Fencing does not help here, you have a race condition, and the I/O queue
> is only the smallest part of the path where the IO request might get
> stuck. The panic won't stop BH interrupts, nor will it stop the
> controllers and switches. In fact the other node can much better "pull
> the plug" with a SCSI reservation. And especially the panic might not
> work in the worst state, if the node is broken.

	Everyone has this race.  The point is not to assume perfection
(it doesn't exist).  The only guarantee you need is that the other nodes
wait until the dead node has finished I/O before trusting the state of
the disk.  If an I/O is on the controller, a reset may catch it.  If it
is on the switch or the SAN, it'll complete.  If it completes, you can
then continue.
	The danger is not that the failing node has a few writes left in
it.  The danger is that the failing node writes *after* everyone else
has assumed he won't.
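	To make that ordering concrete, here's a rough user-space sketch
(every helper here is invented and stubbed for illustration - this is
not OCFS2 code): recovery of the dead node's state only starts once we
either know the fence happened or have waited out anything he could
still have in flight.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define IO_DRAIN_SECS 60  /* assumed upper bound on in-flight I/O lifetime */

/* Stubs, invented for illustration only. */
static bool heartbeat_gone(int node)  { (void)node; return true; }
static bool fence_confirmed(int node) { (void)node; return false; }
static void replay_journal(int node)  { printf("replaying node %d\n", node); }

/*
 * Never touch the dead node's on-disk state until no write of his can
 * land after ours: either the fence is confirmed, or we wait out the
 * longest anything could sit in a controller, switch, or SAN.
 */
static void recover_node(int node)
{
	if (!heartbeat_gone(node))
		return;                  /* not actually dead */
	if (!fence_confirmed(node))
		sleep(IO_DRAIN_SECS);    /* let in-flight writes complete */
	replay_journal(node);
}

int main(void)
{
	recover_node(3);
	return 0;
}
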
	You are correct, btw, that panic() has a hole here.  Which is why
we've fixed that to be machine_restart().  That triggers the PCI reset
line - about as good as a self-fence can get.
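	For the curious, the self-fence path boils down to something like
the sketch below (simplified, not the actual quorum code):
emergency_restart() goes straight into machine_restart() without trying
to sync or unmount, which is exactly what you want here.

#include <linux/kernel.h>
#include <linux/reboot.h>

/* Simplified self-fence sketch - not the real OCFS2 quorum code. */
static void self_fence(const char *why)
{
	printk(KERN_ERR "cluster: fencing this node: %s\n", why);
	/* No sync, no unmount: hit the reset line so no further I/O
	 * can leave this box after the other nodes move on. */
	emergency_restart();
}
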
	But what about SCSI Reservations, or turning off a fibre port at
the switch?  Those solutions certainly promise to halt I/O from a node
without killing it.  However, they are tricky.  You have to know a
particular switch's access method.  Or make sure SCSI reservations work
with this particular HBA/disk/SAN.  Without a standard way to do this,
we can't promise it works.  It's piecemeal - adding one particular switch,
vendor, HBA, whatever at a time.  We just haven't had time to do that
(or the NDAs, or the source, or the documentation, etc.).  And often,
hardware capable of these things is expensive.
	Heck, STONITH is better than self-fencing.  But that would
require users to invest in and configure STONITH hardware.  Would we
like to do that?  Hey, it adds to our reliability.  But as of right now,
we can't and don't.
 
> There are essentially multiple problems:
> 
> A) the node is well behaving, just loses connection to part of the
> storage - no fencing is needed, you just don't issue more requests and
> remember the new node state, the other nodes might recognize
> recovery/replay is needed. Nobody (especially not the fencing node) can
> know which part of the last IO transaction will reach the device (or
> not) anyway. That's why you have to have IO transactions with atomic
> changes and integrity. Also there is no need to dequeue the last IO
> request for exactly that reason - it must not be harmful anyway.

	As you state, this condition is unrecognizable.  You have to
assume this node will start trying to write at a later date.  It will
have an out-of-date idea of the filesystem and corrupt the disk.

> B) the node is totally broken (memory overwrite, interrupt deadlocks,
> etc). In that case the node might not notice the failure, or it might
> not be able to self-fence. This is a typical case for STONITH.

	Sure.  As I stated above, STONITH is better than self-fencing,
but requires more up-front cost.  Our quorum software is low-level
enough that it should be able to run if anything else is able to run -
thus it should kill itself if the system is alive enough to run
just about anything.

> C) the node is well, just the heartbeat is delayed, or the overall nodes
> are well and the cache is in sync, only a single disk storage fails.
> Those cases should never occur (larger timeouts are only part of the
> solution, a smarter quorum algorithm like the one provided with heartbeat
> or other cluster managers is needed). That case was happening quite
> often, I guess increased timeouts make this a bit better, but can't be
> reliably solved with a cluster framework which considers no more than
> the state of a single network connection.

	The timeouts are one part of the issue, sure.  Regarding a
"single network connection" - if you can't talk to a node, he may as
well be down.  Even if you see him alive on five other channels, you
can't discuss shared state.  So you can't continue as a cluster until
either he can talk again, or he goes offline.  There's no "hope he
doesn't mess with us".
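	Put as a (purely illustrative) decision rule, with invented names
and no claim to match the real quorum code:

#include <stdbool.h>

/*
 * Illustrative only.  A node we can't talk to over the interconnect
 * blocks progress until he either reconnects or stops heartbeating on
 * disk - there is no "hope he doesn't mess with us" state.
 */
static bool may_proceed_without(bool net_reachable, bool disk_heartbeating)
{
	if (net_reachable)
		return true;   /* we can coordinate shared state with him */
	if (!disk_heartbeating)
		return true;   /* he's gone; recover him and move on */
	return false;          /* alive but mute: wait for one or the other */
}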

> One of the problems OCFSv2 has is that condition c) happened too
> often and that condition a) is not recognizable. And of course
> all those conditions are handled in the simplest way, with a panic,
> which is IMHO not really helpful for the above-mentioned reasons.

	Like I said, we're doing machine_restart() now.

Joel
-- 

"In the beginning, the universe was created. This has made a lot 
 of people very angry, and is generally considered to have been a 
 bad move."
        - Douglas Adams

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127


