[Ocfs2-users] OCFS2 Fencing, then panic

Sunil Mushran Sunil.Mushran at oracle.com
Wed Apr 11 10:04:24 PDT 2007


Are you using a private or a public network?
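
For reference, o2net talks over whatever addresses are listed in
/etc/ocfs2/cluster.conf, so that file decides whether the cluster interconnect
traffic rides a dedicated private network or the public LAN. A minimal sketch
of the file with the nodes on a private subnet; every name and address below
is illustrative only, not taken from your setup:

    node:
            ip_port = 7777
            ip_address = 192.168.100.10
            number = 0
            name = nodeA
            cluster = ocfs2

    (... one node: stanza per node ...)

    cluster:
            node_count = 3
            name = ocfs2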

enohi ibekwe wrote:
> Thanks for your help so far.
>
> My issue is the frequency at which node 0 gets fenced; it has happened 
> at least once a day over the last two days.
>
> More details:
>
> I am attempting to add a node (node 2) to an existing two-node (node 0 and
> node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
> i686) + OCFS2 1.2.1-4.2, which is the OCFS2 package that ships with SLES9.
> Node 2 is not part of the RAC cluster yet; I have only installed OCFS2 on
> it. I can mount the OCFS2 file system on all nodes, and the file system is
> accessible from all nodes.
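>
> (For completeness: to register the new node, its stanza has to be added to
> /etc/ocfs2/cluster.conf on every node and node_count bumped; a rough sketch
> of pushing it into the running cluster, assuming the o2cb_ctl tool from
> ocfs2-tools, with a placeholder name, number, and address:)
>
>     # run on every node that already has the cluster online;
>     # -i registers the new node with the live cluster as well
>     o2cb_ctl -C -i -n newnode -t node \
>         -a number=2 -a ip_address=192.168.100.12 -a ip_port=7777 -a cluster=ocfs2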
>
> Node 0 is always the node that gets fenced, and it gets fenced very
> frequently. Before I added the kernel.panic parameter, node 0 would get
> fenced, panic, and hang; only a power cycle would make it responsive again.
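>
> (Side note on that parameter: a minimal sketch of the setting, assuming the
> standard Linux sysctl interface. kernel.panic is the number of seconds the
> kernel waits after a panic before rebooting; the default of 0 means it stays
> hung forever, which matches the behaviour described above.)
>
>     # reboot 30 seconds after a kernel panic
>     echo "kernel.panic = 30" >> /etc/sysctl.conf
>     sysctl -p
>
>     # or change it on the running kernel only
>     sysctl -w kernel.panic=30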
>
> This is what happened this morning.
>
> I was remotely connected to node 0 via ssh. Then I suddenly lost the
> connection. I tried to ssh again but node 0 refused the connection.
>
> Checking node 1 dmesg I found:
> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
> (0,3):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466 adv 1176207822.713475:1176207822.713476 func (1459c2a9:504) 1176196519.600486:1176196519.600489)
> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>
> Checking node 2 dmesg I found:
> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
> (0,0):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293 adv 1176207823.774297:1176207823.774297 func (1459c2a9:504) 1176196505.704238:1176196505.704240)
> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>
> Since I had enabled reboot on panic on both nodes, node 0 restarted. Checking
> /var/log/messages I found:
> Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR: stopping heartbeat on all active regions.
> Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing.
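>
> (Reading that first message literally: node 0 could reach only 1 node over
> o2net, while a majority of the 3 heartbeating nodes requires
> floor(3/2) + 1 = 2. Nodes 1 and 2 still had each other, so they formed the
> two-node majority and stayed up, and node 0 fenced itself. That is just my
> reading of the log text above.)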
>
>
>
>
> ----Original Message Follows----
> From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe" 
> <enohiaghe at hotmail.com>
> CC: <ocfs2-users at oss.oracle.com>
> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> Date: Mon, 9 Apr 2007 11:00:30 -0700
>
> It's not just an issue; it is really an OCFSv2 killer:
> - In 99% of cases it is not a split-brain condition but just a short (20 - 30
> second) network interruption. The systems can (in most cases) still see each
> other over the network or through the voting disk, so they can communicate
> one way or another;
> - In 90% of cases the system has no pending IO activity, so it has no reason
> to fence itself, at least until some IO happens on the OCFSv2 file system.
> For example, if OCFSv2 is used for backups, it is active for 3 hours at night
> plus restore time only, and the server could remount it without any fencing
> if it lost consensus.
> - The timeouts and other fencing parameters are badly designed, and that
> makes the problem worse. It can't work out of the box on most SAN networks
> (where reconfiguration timeouts are all around 30 seconds to 1 minute by
> default). For example, a NetApp cluster takeover takes about 20 seconds and
> giveback about 40 seconds, which kills OCFSv2 with 100% certainty (with
> default settings). The STP timeout (in classical mode) is 40 seconds, which
> kills OCFSv2 with 100% certainty. Network switch reboot time is about
> 1 minute for most switches, which again kills OCFSv2 with 100% certainty.
> The result: if I reboot the staging network switch, I have all stand-alone
> servers working, all RAC clusters working, all other servers working, and
> all the OCFSv2 clusters fencing themselves (see the timeout sketch below).
>
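> (To put some numbers on "badly designed": in later OCFS2 1.2.x releases the
> cluster stack's timeouts became tunable in /etc/sysconfig/o2cb (or via
> "service o2cb configure"). A minimal sketch with values sized to survive a
> roughly one-minute switch or filer event; the numbers are illustrative, not
> recommendations, and the 1.2.1 build on SLES9 discussed here does not expose
> the network knobs at all:)
>
>     # /etc/sysconfig/o2cb fragment -- must be identical on all nodes,
>     # and the cluster must be restarted before it takes effect
>     O2CB_HEARTBEAT_THRESHOLD=31     # disk heartbeat: (31-1)*2s = 60s until a node is declared dead
>     O2CB_IDLE_TIMEOUT_MS=60000      # o2net idle timeout (the "idle for 10 seconds" seen in the logs)
>     O2CB_KEEPALIVE_DELAY_MS=2000    # delay before sending a keepalive probe
>     O2CB_RECONNECT_DELAY_MS=2000    # delay between reconnect attempts
>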
> For my part, I have banned OCFSv2 from any usage except backups and archive
> logs, and only with a cross-connect cable for the heartbeat. All other
> scenarios are catastrophic (they cause overall cluster failure in many
> cases), and all because of this fencing behavior.
>
> PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known problem
> in buffer use: it doesn't release small buffers after a file is
> created/deleted (so if you run a create-file/remove-file loop for a long
> time, you will deplete system memory in approximately a few days). This is
> not an issue if the files are big enough (Oracle backups, Oracle archive
> logs, application home), but it must be taken into account if you have more
> than 100,000 - 1,000,000 files on your OCFSv2 file system(s).
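>
> (A minimal way to watch for that leak while a create/delete loop runs,
> assuming the usual /proc/slabinfo interface; the exact cache names differ
> between OCFS2 versions, so treat the grep pattern as illustrative:)
>
>     # poll the ocfs2/dlm kernel slab caches once a minute
>     watch -n 60 'grep -iE "ocfs2|dlm" /proc/slabinfo'
>
>     # or interactively, sorted by total cache size
>     slabtop -s c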
>
> But the fencing problem exists in all versions (it is a little better in
> modern ones, because the developers added a configurable network timeout).
> If you add the _one heartbeat interface only_ design and the _no serial
> heartbeat_ design on top of this, it really becomes a problem, and that is
> why I was thinking about testing OCFSv2 on SLES10 with heartbeat2
> (heartbeat2 has a very reliable heartbeat and has external fencing, but
> unfortunately SLES10 is, de facto, not yet production ready for us).
>
>
>
> ----- Original Message -----
> From: "Jeff Mahoney" <jeffm at suse.com>
> To: "enohi ibekwe" <enohiaghe at hotmail.com>
> Cc: <ocfs2-users at oss.oracle.com>
> Sent: Saturday, April 07, 2007 12:06 PM
> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>
>
> >
> > enohi ibekwe wrote:
> > > Is this also an issue on SLES9?
> > >
> > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
> > > the error on the same box in the cluster.
> >
> > I'm not sure what you mean by "issue." This is designed behavior. When
> > the cluster ends up in a split condition, one or more nodes will fence
> > themselves.
> >
> > -Jeff
> >
> > --
> > Jeff Mahoney
> > SUSE Labs
> >
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


