[Ocfs2-users] Catatonic nodes under SLES10

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Apr 9 20:47:42 PDT 2007


That is all true when the system _reads_, but not when it just holds all its
buffers and sits quiet.

I mean that it is possible to find quiet states in which the cluster can be
remounted without any harm. Even with reads going on, but more likely when
there has been no activity on the file system for a while.

Of course it is a challenge, and of course you pay for the cluster with
stability. That is why I listed 'reliability' improvements first (such as
multiple heartbeats and possibly proxying IO when a node can't reach the disk
directly), and counted only the '100% quiet' state as _easy to implement in
some way_ (even 'FS as application home' is not such a quiet state, because it
means mapping the exec files from disk into memory and so requires locks). The
'read only' state is not so simple - nodes really do take locks to read inodes
and block lists (but an RO file system can be remounted, at least in theory -
I am quite sure you don't hold locks the whole time a file is in read access,
only while the file is open or actively being read).

Btw, why didn't OCFSv1 have such problems? It sacrificed functionality (it
worked with Oracle only), but it bought so much stability that such a mode
could be extremely useful for OCFSv2 too.


----- Original Message ----- 
From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: "David Miller" <syslog at d.sparks.net>; <ocfs2-users at oss.oracle.com>
Sent: Monday, April 09, 2007 6:55 PM
Subject: Re: [Ocfs2-users] Catatonic nodes under SLES10


> While a fs node may only be reading, that does not mean the metadata
> on disk is not being updated by some other node. Meaning, it needs to
> take appropriate locks to perform the read... meaning it needs to have
> a lock on the mastered lock resource... meaning it needs to be part of
> the active cluster. The dlm is not only keeping track of the locks but
> also of the inode (lvb). Bottom line, either you are in a cluster or you
> are not. There is no middle ground. And if not, then all you are doing
> is dirty reads.
>
> In ocfs2, we have a readonly and a hard-ro state. As in, we go into the
> hard-ro state only if the device is truly ro. In the hard ro, we don't
> start a heartbeat nor create a dlm domain. Normal readonly is like
> a rw mount, with the exception that the userspace cannot write.
> As in, hb is started and the dlm domain is created.
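
For illustration, a minimal userspace sketch of the distinction being
described - a hypothetical helper, not part of ocfs2-tools. Only when the
block device itself is read-only does the mount take the hard-ro path (no
heartbeat, no dlm domain); a plain 'mount -o ro' still joins the cluster:

    # Hypothetical check; assumes a device name such as "sdb".
    def device_is_hard_ro_candidate(dev):
        # /sys/block/<dev>/ro is 1 when the block layer itself refuses writes;
        # per the explanation above, only that case gives the hard-ro mount
        # with no heartbeat and no dlm domain.
        with open("/sys/block/%s/ro" % dev) as f:
            return f.read().strip() == "1"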
>
> Alexei_Roudnev wrote:
> > Of course it is a cluster operation.
> >
> > As I said, the cluster has clients such as the FS. A client can be in 3 modes:
> > - passive (no reason to fence, just don't allow it to switch modes)
> > - active read-only
> > - active write
> >
> > Active write requires fencing in all cases, active read status can't
> > transition into active write if the cluster is not connected, and passive
> > mode never requires fencing (at least until the FS wants to switch modes).
> > An FS in passive mode can re-initialize without fencing and with zero risk
> > of corruption (because the server state after the reboot is exactly the
> > same as before the reboot).
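
As code, that policy might look roughly like this (a sketch of the proposal
with invented names; nothing like it exists in ocfs2 today):

    # Sketch of the proposed policy; the mode names are invented for illustration.
    PASSIVE, ACTIVE_RO, ACTIVE_RW = "passive", "active-ro", "active-rw"

    def must_fence_on_disconnect(mode):
        # Active write still fences as today; active read must merely refuse
        # promotion to write while disconnected; passive never needs fencing.
        return mode == ACTIVE_RW

    def may_promote_to_rw(cluster_connected):
        # An active-read client may only start writing while the cluster is
        # connected, so a disconnected reader can never corrupt anything.
        return cluster_connected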
> >
> > Of course, the client must make the transitions itself (all writes
> > completed 30 seconds ago, except the disk heartbeat - switch to passive
> > mode and inform the cluster manager).
> >
> > In addition, you can't fully separate the cluster manager and the FS,
> > because the FS has its own heartbeats and network connections.
> >
> > I think that the only way to improve the behavior without grand changes
> > (or the risk of corruption) is to monitor the FS mode and switch it to
> > passive when possible (no activity for some time and all buffers flushed
> > out, or at least written). The existing implementation cannot be used in
> > many cases (as I described in another mail) because it dramatically
> > decreases cluster reliability.
> >
> > In addition, if all nodes lose IO access to the disks it doesn't make
> > sense to fence either, until at least one node regains access.
> >
> > PS. I was able to fool the existing OCFSv2, with all its fencing, simply
> > by assigning 2 servers the same iSCSI ID. So no cluster system can protect
> > against all possible failures anyway. And reboots on every sneeze cause
> > more problems than they bring benefits (except when OCFSv2 is used for
> > critical data that is being written 100% of the time).
> >
> > ----- Original Message ----- 
> > From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> > To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> > Cc: "David Miller" <syslog at d.sparks.net>; <ocfs2-users at oss.oracle.com>
> > Sent: Monday, April 09, 2007 4:26 PM
> > Subject: Re: [Ocfs2-users] Catatonic nodes under SLES10
> >
> >
> >
> >> Fencing is not a fs operation but a cluster operation. The fs is only a
> >> client of the cluster stack.
> >>
> >> Alexei_Roudnev wrote:
> >>
> >>> It all depends on the usage scenario.
> >>>
> >>> Typical usage is, for example:
> >>>
> >>> (1) Shared application home. Writes happen once a week during
> >>> maintenance; the rest of the time files are opened for reading only. The
> >>> few logfiles can be redirected elsewhere if required.
> >>>
> >>> So, when the server sees a problem, it has had NO pending IO for 3 days -
> >>> so what is the purpose of a reboot? It knows 100% that NO IO is pending,
> >>> and the other nodes have no pending IO either.
> >>>
> >>> (2) Backup storage for RAC. The FS is not open 90% of the time. At
> >>> night, one node opens it and creates a few files. The other node has no
> >>> pending IO on this FS. Fencing the passive node (which doesn't run any
> >>> backup) is not useful because it has had NO PENDING IO for a few hours.
> >>>
> >>> (3) Web server. 10 nodes, only 1 of which makes updates. The same - most
> >>> nodes have no pending IO.
> >>>
> >>> Of course there is always a risk of FS corruption in clusters. Any layer
> >>> can keep pending IO forever (I have seen the Linux kernel keep it for 10
> >>> minutes). The problem is that in such cases software fencing can't help
> >>> either, because the node is half-dead and can't detect its own status.
> >>>
> >>> So, the key point here is not _fence on every sneeze_ but _keep the
> >>> system without pending writes as long as possible and make clean
> >>> transitions between the active-write / active-read / passive states_.
> >>> Then you can avoid self-fencing in 90% of cases (because the server will
> >>> be in the passive or active-read state). I mount the FS but don't cd into
> >>> it, or just cd but don't read - passive state. I read a file - active
> >>> read for 1 minute, then flush buffers so that it is in passive mode
> >>> again. I begin to write - switch the system to write mode. I have not
> >>> written blocks for 1 minute - flush everything, wait 1 more minute and
> >>> switch to passive mode.
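
The transition rules described here could be sketched roughly as follows
(made-up names, the 1-minute timeouts from the text, and callbacks standing in
for real flushing and cluster notification; not an existing ocfs2 mechanism):

    import time

    PASSIVE, ACTIVE_RO, ACTIVE_RW = "passive", "active-ro", "active-rw"
    IDLE_SECS = 60  # "no activity for 1 minute" from the description above

    class FsModeTracker(object):
        """Tracks the proposed passive / active-read / active-write mode."""

        def __init__(self):
            self.mode = PASSIVE      # mounted but untouched: passive
            self.last_read = 0.0
            self.last_write = 0.0

        def on_read(self):
            self.last_read = time.time()
            if self.mode == PASSIVE:
                self.mode = ACTIVE_RO

        def on_write(self):
            self.last_write = time.time()
            self.mode = ACTIVE_RW

        def tick(self, flush_buffers, notify_cluster_manager):
            """Called periodically; demotes the mode after a quiet interval."""
            now = time.time()
            if self.mode == ACTIVE_RW and now - self.last_write > IDLE_SECS:
                flush_buffers()        # write out everything still dirty
                self.mode = ACTIVE_RO  # then wait one more quiet interval
            elif self.mode == ACTIVE_RO and \
                    now - max(self.last_read, self.last_write) > IDLE_SECS:
                self.mode = PASSIVE    # nothing pending: no need to fence
                notify_cluster_manager(self.mode)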
> >>>
> >>>
> >>>
> >>>
> >>> ----- Original Message ----- 
> >>> From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> >>> To: "David Miller" <syslog at d.sparks.net>
> >>> Cc: <ocfs2-users at oss.oracle.com>
> >>> Sent: Monday, April 09, 2007 3:18 PM
> >>> Subject: Re: [Ocfs2-users] Catatonic nodes under SLES10
> >>>
> >>>
> >>>
> >>>
> >>>> For io fencing to be graceful, one requires better hardware. Read:
> >>>> expensive. As in, switches where one can choke off all the ios to the
> >>>> storage from a specific node.
> >>>>
> >>>> Read the following for a discussion on force umounts. In short, not
> >>>> possible as yet.
> >>>> http://lwn.net/Articles/192632/
> >>>>
> >>>> Readonly does not work wrt io fencing. As in, ro only stops new
> >>>> userspace writes but cannot stop pending writes, and writes could be
> >>>> lodged in any io layer. A reboot is the cheapest way to avoid corruption.
> >>>> (While a reboot is painful, it is much less painful than a corrupted fs.)
> >>>>
> >>>> With 1.2.5 you should be able to increase the network timeouts and
> >>>> hopefully avoid
> >>>> the problem.
> >>>>
> >>>> David Miller wrote:
> >>>>
> >>>>
> >>>>> Alexei_Roudnev wrote:
> >>>>>
> >>>>>
> >>>>>> Did you check the
> >>>>>>
> >>>>>>  /proc/sys/kernel/panic  /proc/sys/kernel/panic_on_oops
> >>>>>>
> >>>>>> system variables?
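
For reference: kernel.panic holds the number of seconds the kernel waits
before rebooting after a panic (0 means it stays hung), and
kernel.panic_on_oops=1 turns an oops into a panic. A minimal sketch for
inspecting them:

    # Print the two sysctls mentioned above (same as cat'ing the files).
    for name in ("panic", "panic_on_oops"):
        with open("/proc/sys/kernel/" + name) as f:
            print("kernel.%s = %s" % (name, f.read().strip()))
    # kernel.panic = N        -> reboot N seconds after a panic (0 = stay hung)
    # kernel.panic_on_oops = 1 -> treat a kernel oops as a panic

With panic_on_oops=1 and a nonzero kernel.panic, a node that oopses or panics
will at least reboot on its own instead of sitting catatonic, which is
presumably the point of the question above.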
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> No.  Maybe I'm missing something here.
> >>>>>
> >>>>> Are you saying that a panic/freeze/reboot is the expected/desirable
> >>>>> behavior?  That nothing more graceful could be done, like just
> >>>>> dismounting the ocfs2 file systems, or forcing them to a read-only
> >>>>> mount or something like that?  We have to reload the kernel?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> --- David
> >>>>>
> >>>>>
> >>>>>
> >>>>>> ----- Original Message ----- From: "David Miller" <syslog at d.sparks.net>
> >>>>>> To: <ocfs2-users at oss.oracle.com>
> >>>>>> Sent: Monday, April 02, 2007 9:01 AM
> >>>>>> Subject: [Ocfs2-users] Catatonic nodes under SLES10
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> [snip]
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Both servers will be connected to a dual-host external RAID system.
> >>>>>> I've set up ocfs2 on a couple of test systems and everything appears
> >>>>>> to work fine.
> >>>>>>
> >>>>>> Until, that is, one of the systems loses network connectivity.
> >>>>>>
> >>>>>> When the systems can't talk to each other anymore, but the disk
> >>>>>> heartbeat is still alive, the higher-numbered node goes catatonic.
> >>>>>> Under SLES 9 it fenced itself off with a kernel panic; under 10 it
> >>>>>> simply stops responding to network or console.  A power cycle is
> >>>>>> required to bring it back up.
> >>>>>>
> >>>>>> The desired behavior would be for the higher-numbered node to lose
> >>>>>> access to the ocfs2 file system(s).  I don't really care whether it
> >>>>>> would simply time out like stale NFS mounts, or immediately error like
> >>>>>> access to non-existent files.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> _______________________________________________
> >>>>> Ocfs2-users mailing list
> >>>>> Ocfs2-users at oss.oracle.com
> >>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>>>
> >>>>>
> >>>> _______________________________________________
> >>>> Ocfs2-users mailing list
> >>>> Ocfs2-users at oss.oracle.com
> >>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>>
> >>>>
> >>>>
> >>>
> >
> >
>



