[Ocfs2-users] OCFS2 Fencing, then panic

enohi ibekwe enohiaghe at hotmail.com
Wed Apr 11 13:54:20 PDT 2007


The IP address in the cluster.conf file is the public IP address of the 
nodes.
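
A node stanza in /etc/ocfs2/cluster.conf typically looks something like the
sketch below (the real file uses tab-indented key = value lines). Node 0's
name, number, address, and port are the ones that show up in the logs further
down; the cluster name and node_count here are just illustrative defaults:

cluster:
        node_count = 3
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 10.12.1.34
        number = 0
        name = ora1
        cluster = ocfs2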

----Original Message Follows----
From: Sunil Mushran <Sunil.Mushran at oracle.com>
To: enohi ibekwe <enohiaghe at hotmail.com>
CC: Alexei_Roudnev at exigengroup.com, jeffm at suse.com, 
ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Wed, 11 Apr 2007 10:04:24 -0700

Are you using a private or a public network?

enohi ibekwe wrote:
>Thanks for your help so far.
>
>My issue is the frequency at which node 0 gets fenced; it has happened at
>least once a day for the last two days.
>
>More details:
>
>I am attempting to add a node (node 2) to an existing two-node (node 0 and
>node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
>i686) + OCFS2 1.2.1-4.2, which is the OCFS2 package that ships with SLES9.
>Node 2 is not part of the RAC cluster yet; I have only installed OCFS2 on it.
>I can mount the OCFS2 file system on all nodes, and it is accessible from
>all of them.
>
>Node 0 is always the node that gets fenced, and it gets fenced very
>frequently. Before I added the kernel.panic parameter, node 0 would get
>fenced, panic, and hang. Only a power cycle would make it responsive again.
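>
>For reference, kernel.panic is the sysctl that controls how many seconds the
>kernel waits after a panic before rebooting (0 means it just hangs); I set it
>along these lines, the exact value being only an example:
>
>    # /etc/sysctl.conf   (or at runtime: sysctl -w kernel.panic=30)
>    kernel.panic = 30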
>
>This is what happened this morning.
>
>I was remotely connected to node 0 via ssh. Then I suddenly lost the
>connection. I tried to ssh again but node 0 refused the connection.
>
>Checking node 1's dmesg I found:
>ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
>10 seconds, shutting it down.
>(0,3):o2net_idle_timer:1310 here are some times that might help debug the
>situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
>adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
>1176196519.600486:1176196519.600489)
>o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>
>Checking node 2's dmesg I found:
>ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
>10 seconds, shutting it down.
>(0,0):o2net_idle_timer:1310 here are some times that might help debug the
>situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
>adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
>1176196505.704238:1176196505.704240)
>o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
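>
>In both traces "now" is roughly 10 seconds past "tmr" (for example
>1176207823.774296 vs 1176207833.772712 on node 2), i.e. nodes 1 and 2 each
>went the full 10-second idle timeout without hearing from node 0 at almost
>the same moment, which suggests the problem is on node 0's side or its
>network path rather than between the other two nodes.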
>
>Since I had set reboot on panic on both nodes, node 0 restarted. Checking
>/var/log/messages I found:
>Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
>this node because it is only connected to 1 nodes and 2 is needed to make a
>quorum out of 3 heartbeating nodes
>Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
>stopping heartbeat on all active regions.
>Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
>this system by panicing.
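>
>As I read that message: with 3 nodes heartbeating on the disk, a node must be
>network-connected to a majority of them (floor(3/2) + 1 = 2) to stay in the
>cluster; node 0 could only count 1, so it fenced itself by panicking rather
>than risk a split brain.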
>
>
>
>
>----Original Message Follows----
>From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
>To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe" <enohiaghe at hotmail.com>
>CC: <ocfs2-users at oss.oracle.com>
>Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>Date: Mon, 9 Apr 2007 11:00:30 -0700
>
>It's not just an issue; it is really an OCFSv2 killer:
>- in 99% of cases it is not a split-brain condition but just a short (20-30
>second) network interruption. The systems can (in most cases) still see each
>other over the network or through the voting disk, so they can communicate
>one way or another;
>- in 90% of cases the system has no pending IO activity, so it has no reason
>to fence itself, at least until some IO happens on the OCFSv2 file system.
>For example, if OCFSv2 is used for backups it is active for 3 hours at night
>plus during restores only, and the server could remount it without any
>fencing if it lost consensus;
>- the timeouts and other fencing parameters are badly designed, which makes
>the problem worse. It can't work out of the box on most SAN networks (whose
>reconfiguration timeouts are all around 30 seconds to 1 minute by default).
>For example, a NetApp cluster takeover takes about 20 seconds and a giveback
>about 40 seconds, which kills OCFSv2 for sure (with the default settings).
>The STP timeout (in classic mode) is 40 seconds, which kills OCFSv2 for
>sure. The reboot time of most network switches is about 1 minute, which
>kills OCFSv2 for sure. The result: if I reboot the staging network switch,
>all the standalone servers keep working, all the RAC clusters keep working,
>all the other servers keep working, and all the OCFSv2 clusters fence
>themselves (see the note on O2CB_HEARTBEAT_THRESHOLD below).
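>
>If I remember right, the disk heartbeat window can at least be widened via
>O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb - the fence window is roughly
>(threshold - 1) * 2 seconds, so something like the line below should ride
>out a ~60 second SAN or switch failover (the network idle timeout only
>became tunable in later releases, as I mention further down):
>
>    O2CB_HEARTBEAT_THRESHOLD=31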
>
>As for me, I have banned OCFSv2 from any usage except backups and archive
>logs, and only with a crossover cable for the heartbeat. All other scenarios
>are catastrophic (they cause overall cluster failure in many cases), and all
>because of this fencing behavior.
>
>PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known
>problem in buffer use - it doesn't release small buffers after a file is
>created/deleted (so if you run a create-file/remove-file loop for a long
>time, you will deplete system memory in approximately a few days). It is
>not an issue if the files are big enough (Oracle backups, Oracle archive
>logs, application homes) but must be taken into account if you have more
>than 100,000 - 1,000,000 files on your OCFSv2 file system(s).
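>
>For what it's worth, a loop along these lines on an OCFS2 mount, while
>watching memory/slab usage, is enough to show that behavior (the mount point
>and file count here are made up):
>
>    import os
>    d = "/mnt/ocfs2/leaktest"        # hypothetical OCFS2 mount point
>    os.makedirs(d, exist_ok=True)
>    for i in range(1000000):
>        p = os.path.join(d, "f%07d" % i)
>        open(p, "w").close()         # create a small file...
>        os.unlink(p)                 # ...and delete it again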
>
>But the fencing problem exists in all versions (it is a little better in
>modern ones, because the developers added a configurable network timeout).
>If you add the _one heartbeat interface only_ design and the _no serial
>heartbeat_ design on top of that, it really becomes a problem, and that is
>why I was thinking about testing OCFSv2 on SLES10 with heartbeat2
>(heartbeat2 has a very reliable heartbeat and has external fencing, but
>unfortunately SLES10 is not production ready for us yet, de facto).
>
>
>
>----- Original Message -----
>From: "Jeff Mahoney" <jeffm at suse.com>
>To: "enohi ibekwe" <enohiaghe at hotmail.com>
>Cc: <ocfs2-users at oss.oracle.com>
>Sent: Saturday, April 07, 2007 12:06 PM
>Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>
>
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > enohi ibekwe wrote:
> > > Is this also an issue on SLES9?
> > >
> > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
> > > the error on the same box on the cluster.
> >
> > I'm not sure what you mean by "issue." This is designed behavior. When
> > the cluster ends up in a split condition, one or more nodes will fence
> > themselves.
> >
> > - -Jeff
> >
> > - --
> > Jeff Mahoney
> > SUSE Labs
> > -----BEGIN PGP SIGNATURE-----
> > Version: GnuPG v1.4.6 (GNU/Linux)
> > Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org
> >
> > iD8DBQFGF+vDLPWxlyuTD7IRAuNPAJ9lZPLSaH7nOCNammYyW3bwC2Wj5wCgomUp
> > zcRzcaedVAmk+AaJ/OFeddE=
> > =8e6c
> > -----END PGP SIGNATURE-----
> >
>
