[Ocfs2-users] Catatonic nodes under SLES10

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Apr 9 15:50:25 PDT 2007


I am saying that, by default (at least on SLES9 SP1), a system panic does not
cause an automatic reboot, so when some applications try to
fence themselves via panic, they stop the system instead of rebooting it. I saw
it on older SLES9 (and because I always set up these 2 variables, I have not
verified how OCFSv2 behaves today).

No matter how OCFSv2 is implemented, it is recommended to set both
variables to 1. Otherwise you can end up with a server blinking its keyboard
lights because of a panic.
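
For example (just a sketch of what I set up on my own boxes; the values are my
habit, not an OCFS2 requirement):

    # reboot 1 second after a panic instead of hanging
    echo 1 > /proc/sys/kernel/panic
    # turn an oops into a panic (and so into a reboot)
    echo 1 > /proc/sys/kernel/panic_on_oops

    # or, to make it persistent across reboots, in /etc/sysctl.conf:
    kernel.panic = 1
    kernel.panic_on_oops = 1

(On SLES you may also need to check that the boot.sysctl service is enabled,
so that /etc/sysctl.conf is applied at boot.)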

I 100% disagree with the OCFSv2 approach to fencing - if the system has no
pending IO operations (or at least has no files open for writing), then
it can simply remount instead of panicking. In some cases the desired
behaviour is to suspend all IO on OCFS but not panic the whole server (for
example, if you use it for backups). Moreover, if you use OCFSv2 for an Oracle
database and the system hits a split brain etc., it is desirable to freeze
file-change operations (while keeping simple writes and reads allowed) so that
the database can still work (but can't extend files, for example), as it
worked in OCFSv1, instead of panicking.

The only case in which a node has no choice other than fencing is:
- the system lost connection to both disk and network, or has not seen the
disk heartbeat for a long time,
AND
- the system has active IO operations, OR at least has files open for
writing.

Many other cases can be handled without fencing, for example:
- the system can't read or write the disk BUT can communicate with the second
node - why not ask that node to do the IO operation instead of panicking? If
node1 has no disk access but node2 does, everything can still work properly;
- if node1 is doing nothing, then fencing doesn't release any resources, so it
makes no sense at all;
- if node1 can't communicate with node2 over eth0, why can't it try eth1 or a
serial connection?
- if all nodes lost disk access, fencing makes no sense until at least one
node gets that access back.

All of these things are already implemented in many other clusters.

----- Original Message ----- 
From: "David Miller" <syslog at d.sparks.net>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: <ocfs2-users at oss.oracle.com>
Sent: Monday, April 09, 2007 2:32 PM
Subject: Re: [Ocfs2-users] Catatonic nodes under SLES10


> Alexei_Roudnev wrote:
> > Did you check
> >
> >  /proc/sys/kernel/panic  /proc/sys/kernel/panic_on_oops
> >
> > system variables?
> >
>
> No.  Maybe I'm missing something here.
>
> Are you saying that a panic/freeze/reboot is the expected/desirable
> behavior?  That nothing more graceful could be done, like to just
> dismount the ocfs2 file systems, or force them to a read-only mount or
> something like that?  We have to reload the kernel?
>
> Thanks,
>
> --- David
>
> > ----- Original Message ----- 
> > From: "David Miller" <syslog at d.sparks.net>
> > To: <ocfs2-users at oss.oracle.com>
> > Sent: Monday, April 02, 2007 9:01 AM
> > Subject: [Ocfs2-users] Catatonic nodes under SLES10
> >
>
> [snip]
>
> > Both servers will be connected to a dual-host external RAID system.
> > I've setup ocfs2 on a couple of test systems and everything appears to
> > work fine.
> >
> > Until, that is, one of the systems loses network connectivity.
> >
> > When the systems can't talk to each other anymore, but the disk
> > heartbeat is still alive, the high numbered node goes catatonic.  Under
> > SLES 9 it fenced itself off with a kernel panic; under 10 it simply
> > stops responding to network or console.  A power cycling is required to
> > bring it back up.
> >
> > The desired behavior would be for the higher numbered node to lose
> > access to the ocfs2 file system(s).  I don't really care whether it
> > would simply timeout ala stale NFS mounts, or immediately error like
> > access to non-existent files.
> >
> >
>
>



