[Ocfs2-users] OCFS2 Fencing, then panic

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Wed Apr 11 11:05:27 PDT 2007


Disk controller or network controller?

For the network, check the duplex mode and interface error counters, or try
a separate crossover-cable connection for the heartbeat.

For the disk, you can configure the timeout (the number of heartbeat ticks
that may be missed before the system fences itself out of the cluster).
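
To make that timeout concrete, here is a minimal sketch of the arithmetic,
assuming the commonly documented relationship that the disk heartbeat fires
every 2 seconds and a node fences after roughly (O2CB_HEARTBEAT_THRESHOLD - 1)
missed ticks; the threshold is typically set in /etc/sysconfig/o2cb, but
verify both the formula and the file location against your o2cb version:

    # Helper for reasoning about the O2CB disk heartbeat threshold.
    # Assumption: heartbeat period is 2 seconds and a node fences after
    # (O2CB_HEARTBEAT_THRESHOLD - 1) missed ticks, i.e.
    # timeout_seconds ~= (threshold - 1) * 2.

    HEARTBEAT_INTERVAL_S = 2  # assumed disk heartbeat period

    def threshold_for_timeout(timeout_s):
        """Smallest threshold that tolerates at least timeout_s seconds."""
        return timeout_s // HEARTBEAT_INTERVAL_S + 1

    def timeout_for_threshold(threshold):
        """Approximate fencing timeout implied by a given threshold."""
        return (threshold - 1) * HEARTBEAT_INTERVAL_S

    # Example: to survive a ~60 second SAN controller failover:
    print(threshold_for_timeout(60))   # -> 31
    print(timeout_for_threshold(31))   # -> 60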

----- Original Message ----- 
From: "Andrew Phillips" <Andrew.Phillips at betfair.com>
To: "enohi ibekwe" <enohiaghe at hotmail.com>
Cc: <Alexei_Roudnev at exigengroup.com>; <jeffm at suse.com>;
<Sunil.Mushran at oracle.com>; <ocfs2-users at oss.oracle.com>
Sent: Wednesday, April 11, 2007 2:43 AM
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic


> Do you see anything else odd in your system logs? For example "losing
> too many ticks"? We've traced our problem, which may be similar to
> yours, to a disk controller/firmware/driver that was blocking
> interrupts for varying periods of time. We've tried a variety of ways
> to get it to play nicely, but without much luck. If the system is
> unresponsive, or unable to handle packet transmission or reception,
> for 10 seconds (unless you use the 1.2.5 release), then you'll trigger
> the o2net_idle_timer shutdown.
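
As a rough illustration of the mechanism Andy describes, here is a toy idle
timer in Python; the names and structure are illustrative only and do not
come from the ocfs2 sources, but the idea is the same: the timer is re-armed
on every packet, and if nothing re-arms it for 10 seconds the connection is
torn down.

    import threading
    import time

    IDLE_TIMEOUT_S = 10  # the value the pre-1.2.5 releases hard-code

    class IdleWatchdog:
        """Tears a connection down if no traffic resets the timer in time."""

        def __init__(self, on_idle):
            self._on_idle = on_idle
            self._timer = None
            self.reset()

        def reset(self):
            # Called from the send/receive path on every packet.  If
            # interrupts are blocked so long that no packet is processed,
            # nothing resets the timer and it fires.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(IDLE_TIMEOUT_S, self._on_idle)
            self._timer.start()

    def shutdown_connection():
        print("connection idle for %ds, shutting it down" % IDLE_TIMEOUT_S)

    watchdog = IdleWatchdog(shutdown_connection)
    time.sleep(IDLE_TIMEOUT_S + 1)  # simulate 10+ seconds with no packets
    # Without a watchdog.reset() in that window, shutdown_connection() runs.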
>
> Andy
>
> On Wed, 2007-04-11 at 09:13 +0000, enohi ibekwe wrote:
> > Thanks for your help so far.
> >
> > My issue is the frequency at which node 0 gets fenced; it has
> > happened at least once a day over the last two days.
> >
> > More details:
> >
> > I am attempting to add a node (node 2) to an existing two-node
> > (node 0 and node 1) cluster. All nodes are currently running SLES9
> > (2.6.5-7.283-bigsmp i686) plus OCFS2 1.2.1-4.2, the OCFS2 package
> > that ships with SLES9. Node 2 is not part of the RAC cluster yet; I
> > have only installed OCFS2 on it. I can mount the OCFS2 file system
> > on all nodes, and it is accessible from all of them.
> >
> > Node 0 is always the node that gets fenced, and it gets fenced very
> > frequently. Before I added the kernel.panic parameter, node 0 would
> > get fenced, panic and hang; only a power cycle would make it
> > responsive again.
> >
> > This is what happened this morning.
> >
> > I was remotely connected to node 0 via ssh. Then I suddenly lost the
> > connection. I tried to ssh again but node 0 refused the connection.
> >
> > Checking node 1 dmesg I found:
> > ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> > o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been
> > idle for 10 seconds, shutting it down.
> > (0,3):o2net_idle_timer:1310 here are some times that might help
> > debug the situation: (tmr 1176207822.713473 now 1176207832.712008
> > dr 1176207822.713466 adv 1176207822.713475:1176207822.713476 func
> > (1459c2a9:504) 1176196519.600486:1176196519.600489)
> > o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
> >
> > Checking node 2 dmesg I found:
> > ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> > o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been
> > idle for 10 seconds, shutting it down.
> > (0,0):o2net_idle_timer:1310 here are some times that might help
> > debug the situation: (tmr 1176207823.774296 now 1176207833.772712
> > dr 1176207823.774293 adv 1176207823.774297:1176207823.774297 func
> > (1459c2a9:504) 1176196505.704238:1176196505.704240)
> > o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
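
If the timestamp fields are read as "tmr" being the moment the idle timer was
last re-armed and "now" the moment it fired (an interpretation of the message,
not something stated in it), subtracting the two values from the node 1 log
above confirms roughly ten seconds of silence from node 0:

    # Values copied from the node 1 o2net_idle_timer message above.
    tmr = 1176207822.713473   # last time the timer was (re)armed
    now = 1176207832.712008   # time the timer fired
    print(now - tmr)          # ~9.9985 seconds with no traffic from node 0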
> >
> > Since I had set reboot-on-panic (kernel.panic) on both nodes, node 0
> > restarted. Checking /var/log/messages I found:
> > Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR:
> > fencing this node because it is only connected to 1 nodes and 2 is
> > needed to make a quorum out of 3 heartbeating nodes
> > Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
> > stopping heartbeat on all active regions.
> > Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be
> > fencing this system by panicing.
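
The quorum arithmetic behind that message is simple majority voting; this
sketch mirrors the numbers in the log, though it ignores details the real
o2quo code may apply, such as a lowest-node-number tie-break:

    def quorum_needed(heartbeating_nodes):
        """Smallest majority of the nodes currently heartbeating on disk."""
        return heartbeating_nodes // 2 + 1

    heartbeating = 3   # nodes 0, 1 and 2 were all heartbeating on disk
    connected = 1      # node 0 could only count itself after the network cut
    print(quorum_needed(heartbeating))               # -> 2
    print(connected >= quorum_needed(heartbeating))  # -> False, so it fences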
> >
> >
> >
> >
> > ----Original Message Follows----
> > From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> > To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe"
<enohiaghe at hotmail.com>
> > CC: <ocfs2-users at oss.oracle.com>
> > Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> > Date: Mon, 9 Apr 2007 11:00:30 -0700
> >
> > It's not just an issue; it is a real OCFSv2 killer:
> > - In 99% of cases it is not a split-brain condition but just a short
> > (20 - 30 second) network interruption. In most cases the systems can
> > still see each other over the network or through the voting disk, so
> > they can communicate one way or another.
> > - In 90% of cases the system has no pending IO activity, so it has
> > no reason to fence itself, at least until some IO happens on the
> > OCFSv2 file system. For example, if OCFSv2 is used for backups, it
> > is only active for 3 hours at night plus during restores, and the
> > server could remount it without any fencing if it lost consensus.
> > - The timeouts and other fencing parameters are badly designed,
> > which makes the problem worse. It cannot work out of the box on most
> > SAN networks, where reconfiguration timeouts default to roughly 30
> > seconds to 1 minute. For example, a NetApp cluster takeover takes
> > about 20 seconds and a giveback about 40 seconds, which kills OCFSv2
> > with the default settings, guaranteed. The STP timeout (in classic
> > mode) is 40 seconds, which kills OCFSv2 just as surely. Network
> > switch reboot time is about 1 minute for most switches, which again
> > kills OCFSv2. The result: if I reboot a staging network switch, all
> > standalone servers keep working, all RAC clusters keep working, all
> > other servers keep working, and all the OCFSv2 clusters fence
> > themselves (see the comparison sketched below).
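
The comparison is easy to make mechanical. The sketch below uses the outage
durations quoted above and the old 10-second default network idle timeout;
the figures are the ones given in this mail, not measurements of any
particular setup:

    DEFAULT_NET_IDLE_TIMEOUT_S = 10  # pre-1.2.5 default network idle timeout

    outage_seconds = {
        "NetApp cluster takeover": 20,
        "NetApp cluster giveback": 40,
        "classic STP reconvergence": 40,
        "network switch reboot": 60,
    }

    for event, seconds in outage_seconds.items():
        verdict = "fences" if seconds > DEFAULT_NET_IDLE_TIMEOUT_S else "survives"
        print("%-26s ~%2ds -> %s" % (event, seconds, verdict))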
> >
> > For my part, I have banned OCFSv2 from any use except backups and
> > archive logs, and only with a crossover cable for the heartbeat.
> > All other scenarios are catastrophic (they cause overall cluster
> > failure in many cases), and all because of this fencing behavior.
> >
> > PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one
> > well-known problem in buffer use: it does not release small buffers
> > after a file is created/deleted, so if you run a create-file /
> > remove-file loop for a long time you will deplete system memory in
> > roughly a few days. This is not an issue if the files are big enough
> > (Oracle backups, Oracle archive logs, application home), but it must
> > be taken into account if you have more than 100,000 - 1,000,000
> > files on your OCFSv2 file system(s).
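
A trivial way to reproduce the create/delete pattern Alexei describes is a
loop like the one below; the mount point and iteration count are
placeholders, and whether memory actually depletes depends on hitting the
buffer-release bug he mentions:

    import os

    MOUNTPOINT = "/mnt/ocfs2"   # example mount point, adjust to your setup

    for i in range(1000000):
        path = os.path.join(MOUNTPOINT, "leaktest_%d.tmp" % i)
        with open(path, "w") as f:
            f.write("x")
        os.remove(path)
        # With the buffer-release bug, kernel memory is not fully returned
        # here, so usage slowly grows over hours or days of this workload.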
> >
> > But the fencing problem exists in all versions (it is a little
> > better in modern ones, because the developers added a configurable
> > network timeout). If you add to that the _one heartbeat interface
> > only_ design and the _no serial heartbeat_ design, it really becomes
> > a problem, and that is why I was thinking about testing OCFSv2 on
> > SLES10 with heartbeat2 (heartbeat2 has a very reliable heartbeat and
> > external fencing, but unfortunately SLES10 is not yet production
> > ready for us, de facto).
> >
> >
> >
> > ----- Original Message -----
> > From: "Jeff Mahoney" <jeffm at suse.com>
> > To: "enohi ibekwe" <enohiaghe at hotmail.com>
> > Cc: <ocfs2-users at oss.oracle.com>
> > Sent: Saturday, April 07, 2007 12:06 PM
> > Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> >
> >
> >  > enohi ibekwe wrote:
> >  > > Is this also an issue on SLES9?
> >  > >
> >  > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I
see
> >  > > the error on the same box on the cluster.
> >  >
> >  > I'm not sure what you mean by "issue." This is designed behavior.
When
> >  > the cluster ends up in a split condition, one or more nodes will
fence
> >  > themselves.
> >  >
> >  > -Jeff
> >  >
> >  > --
> >  > Jeff Mahoney
> >  > SUSE Labs
> >  >
> >  > _______________________________________________
> >  > Ocfs2-users mailing list
> >  > Ocfs2-users at oss.oracle.com
> >  > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >  >
> >
> --
> Andy Phillips
> Systems Architecture Manager, Betfair.com
>
> Office: 0208 8348436
>
> Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP Company No.
> 5140986
>



