[Ocfs2-users] OCFS2 Fencing, then panic

Wed Apr 11 02:43:47 PDT 2007

Do you see anything else odd in your system logs? For example, "losing
too many ticks"?
We've traced our problem, which may be similar to yours, to a disk
controller/firmware/driver that was blocking interrupts for various
periods of time. We've tried a variety of ways to get it to play nice,
but without much luck. If the system is unresponsive, or unable to
handle packet transmission or reception, for 10s (unless you use the
1.2.5 release) then you'll trigger the o2net_idle_timer shutdown.
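
To see how long a connection had actually been idle, the interval can be
pulled out of the o2net_idle_timer line in the quoted dmesg output below.
A minimal Python sketch, assuming (my reading, not confirmed in this
thread) that "tmr" is when the idle timer was last armed and "now" is
when it fired. The names idle_seconds and LOG_NODE1 are made up for the
illustration:

```python
import re
from decimal import Decimal

# Start of the node 1 dmesg line quoted below, rejoined onto one line.
LOG_NODE1 = (
    "(0,3):o2net_idle_timer:1310 here are some times that might help debug "
    "the situation: (tmr 1176207822.713473 now 1176207832.712008 "
    "dr 1176207822.713466"
)


def idle_seconds(line):
    """Return now - tmr from an o2net_idle_timer message, or None if absent.

    Assumption: tmr = timer last armed, now = timer fired.
    """
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        return None
    # Decimal keeps the microsecond digits exact; floats would round them.
    tmr, now = (Decimal(g) for g in m.groups())
    return now - tmr


print(idle_seconds(LOG_NODE1))  # 9.998535
```

For the node 1 line this gives 9.998535 seconds, which matches the "idle
for 10 seconds" message.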

Andy 

On Wed, 2007-04-11 at 09:13 +0000, enohi ibekwe wrote:
> Thanks for your help so far.
> 
> My issue is the frequency at which node 0 gets fenced: it has happened at
> least once a day in the last 2 days.
> 
> More details:
> 
> I am attempting to add a node (node 2) to an existing 2-node (node 0 and
> node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
> i686) + ocfs 1.2.1-4.2. This is the ocfs package that ships with SLES9. Node
> 2 is not part of the RAC cluster yet, I have only installed ocfs on it. I
> can mount the ocfs file system on all nodes, and the ocfs file system is
> accessible from all nodes.
> 
> Node 0 is always the node that gets fenced, and very frequently. Before I
> added the kernel.panic parameter, node 0 would get fenced, panic and hang.
> Only a power reboot would make it responsive again.
> 
> This is what happened this morning.
> 
> I was remotely connected to node 0 via ssh. Then I suddenly lost the
> connection. I tried to ssh again but node 0 refused the connection.
> 
> Checking node 1 dmesg I found :
> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
> 10 seconds, shutting it down.
> (0,3):o2net_idle_timer:1310 here are some times that might help debug the
> situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
> adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
> 1176196519.600486:1176196519.600489)
> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
> 
> Checking node 2 dmesg I found:
> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
> 10 seconds, shutting it down.
> (0,0):o2net_idle_timer:1310 here are some times that might help debug the
> situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
> adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
> 1176196505.704238:1176196505.704240)
> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
> 
> Since I had reboot on panic set on node 0, node 0 restarted. Checking
> /var/log/messages I found:
> Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
> this node because it is only connected to 1 nodes and 2 is needed to make a
> quorum out of 3 heartbeating nodes
> Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
> stopping heartbeat on all active regions.
> Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
> this system by panicing.
> 
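The numbers in the o2quo_make_decision message above are consistent with
a plain majority check. A simplified Python sketch, assuming the
"connected" count includes the node itself and that the cluster has an
odd number of heartbeating nodes; it is not the actual o2quo code, and
any special handling of ties in even-sized clusters is not modelled. The
function names are hypothetical:

```python
def needed_for_quorum(heartbeating):
    """Smallest strict majority of the heartbeating nodes."""
    return heartbeating // 2 + 1


def should_fence(connected, heartbeating):
    """True if a node that can reach `connected` nodes (itself included,
    by assumption) falls short of a majority and would fence itself."""
    return connected < needed_for_quorum(heartbeating)


# Node 0 in the log: 3 heartbeating nodes, connected only to itself.
print(needed_for_quorum(3))   # 2
print(should_fence(1, 3))     # True
# Nodes 1 and 2 still see each other, so they keep quorum.
print(should_fence(2, 3))     # False
```

This is why node 0 panics while nodes 1 and 2 only log the lost
connection.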
> 
> 
> 
> ----Original Message Follows----
> From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe" <enohiaghe at hotmail.com>
> CC: <ocfs2-users at oss.oracle.com>
> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> Date: Mon, 9 Apr 2007 11:00:30 -0700
> 
> It's not just an issue; it is a real OCFSv2 killer:
> - in 99% of cases, it is not a split-brain condition but just a short
> (20 - 30 seconds) network interruption. The systems can (in most cases)
> still see each other over the network or through the voting disk, so they
> can communicate one way or another;
> - in 90% of cases the system has no pending IO activity, so it has no
> reason to fence itself, at least until some IO happens on the OCFSv2 file
> system. For example, if OCFSv2 is used for backups, it is active only for
> 3 hours at night and during restores, and the server could remount it
> without any fencing if it lost consensus;
> - timeouts and other fencing parameters are badly designed, which makes the
> problem worse. It can't work out of the box on most SAN networks (where
> reconfiguration timeouts are all about 30 seconds - 1 minute by default).
> For example, a NetApp cluster takeover takes about 20 seconds, and a
> giveback about 40 seconds, which kills OCFSv2 for 100% sure (with default
> settings). The STP timeout (in classical mode) is 40 seconds, which kills
> OCFSv2 for 100% sure. Network switch reboot time is about 1 minute for most
> switches, which kills OCFSv2 for 100% sure. Result: if I reboot the staging
> network switch, all stand-alone servers keep working, all RAC clusters keep
> working, all other servers keep working, and all OCFSv2 clusters fence
> themselves.
> 
> For my part, I have banned OCFSv2 from any use except backups and archive
> logs, and only with a cross-connect cable for the heartbeat.
> All other scenarios are catastrophic (they cause overall cluster failure in
> many cases). And all because of this fencing behavior.
> 
> PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known
> problem in buffer use: it doesn't release small buffers after a file is
> created/deleted (so if you run a create file / remove file loop for a long
> time, you will deplete system memory in approximately a few days). This is
> not an issue if the files are big enough (Oracle backups, Oracle archive
> logs, application home) but must be taken into account if you have more than
> 100,000 - 1,000,000 files on OCFSv2 file system(s).
> 
> But the fencing problem exists in all versions (it is a little better in
> recent ones, because the developers added a configurable network timeout).
> If you add the _one heartbeat interface only_ design and the _no serial
> heartbeat_ design, it really becomes a problem, and that's why I was
> thinking about testing OCFSv2 on SLES10 with heartbeat2 (heartbeat2 has a
> very reliable heartbeat and has external fencing, but unfortunately SLES10
> is not production ready for us yet, de facto).
> 
> 
> 
> ----- Original Message -----
> From: "Jeff Mahoney" <jeffm at suse.com>
> To: "enohi ibekwe" <enohiaghe at hotmail.com>
> Cc: <ocfs2-users at oss.oracle.com>
> Sent: Saturday, April 07, 2007 12:06 PM
> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> 
> 
>  > -----BEGIN PGP SIGNED MESSAGE-----
>  > Hash: SHA1
>  >
>  > enohi ibekwe wrote:
>  > > Is this also an issue on SLES9?
>  > >
>  > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
>  > > the error on the same box on the cluster.
>  >
>  > I'm not sure what you mean by "issue." This is designed behavior. When
>  > the cluster ends up in a split condition, one or more nodes will fence
>  > themselves.
>  >
>  > - -Jeff
>  >
>  > - --
>  > Jeff Mahoney
>  > SUSE Labs
>  > -----BEGIN PGP SIGNATURE-----
>  > Version: GnuPG v1.4.6 (GNU/Linux)
>  > Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org
>  >
>  > iD8DBQFGF+vDLPWxlyuTD7IRAuNPAJ9lZPLSaH7nOCNammYyW3bwC2Wj5wCgomUp
>  > zcRzcaedVAmk+AaJ/OFeddE=
>  > =8e6c
>  > -----END PGP SIGNATURE-----
>  >
>  > _______________________________________________
>  > Ocfs2-users mailing list
>  > Ocfs2-users at oss.oracle.com
>  > http://oss.oracle.com/mailman/listinfo/ocfs2-users
>  >
> 
--
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP Company No.
5140986