[Ocfs2-users] OCFS2 Fencing, then panic

Sunil Mushran Sunil.Mushran at oracle.com
Wed Apr 11 14:14:41 PDT 2007


Use private.

enohi ibekwe wrote:
> The IP address on the cluster.conf file is the public IP address for 
> the nodes.
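For reference, the interconnect address is what appears in /etc/ocfs2/cluster.conf; a private-network layout looks roughly like the sketch below. Only the name ora1 and port 7777 come from this thread; the 192.168.x addresses and second node name are placeholders, and any edit must be made identically on every node before the cluster is brought back online.

```
cluster:
        node_count = 3
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = ora1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.2
        number = 1
        name = ora2
        cluster = ocfs2
```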
>
> ----Original Message Follows----
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> To: enohi ibekwe <enohiaghe at hotmail.com>
> CC: Alexei_Roudnev at exigengroup.com, jeffm at suse.com, 
> ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> Date: Wed, 11 Apr 2007 10:04:24 -0700
>
> Are you using a private or a public network?
>
> enohi ibekwe wrote:
>> Thanks for your help so far.
>>
>> My issue is the frequency at which node 0 gets fenced; it has happened
>> at least once a day for the last two days.
>>
>> More details:
>>
>> I am attempting to add a node (node 2) to an existing two-node (node 0
>> and node 1) cluster. All nodes are currently running SLES9
>> (2.6.5-7.283-bigsmp i686) + ocfs 1.2.1-4.2, the OCFS2 package that ships
>> with SLES9. Node 2 is not part of the RAC cluster yet; I have only
>> installed OCFS2 on it. I can mount the OCFS2 file system on all nodes,
>> and it is accessible from all nodes.
>>
>> Node 0 is always the node that gets fenced, and it gets fenced very
>> frequently. Before I added the kernel.panic parameter, node 0 would get
>> fenced, panic, and hang. Only a power cycle would make it responsive
>> again.
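The kernel.panic parameter mentioned above is the standard Linux sysctl that turns a post-panic hang into an automatic reboot after N seconds; a typical /etc/sysctl.conf fragment (30 is an example value, not from the thread):

```
# /etc/sysctl.conf: reboot 30 seconds after a kernel panic
kernel.panic = 30
```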
>>
>> This is what happened this morning.
>>
>> I was remotely connected to node 0 via ssh. Then I suddenly lost the
>> connection. I tried to ssh again but node 0 refused the connection.
>>
>> Checking node 1's dmesg I found:
>> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
>> (0,3):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466 adv 1176207822.713475:1176207822.713476 func (1459c2a9:504) 1176196519.600486:1176196519.600489)
>> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>>
>> Checking node 2's dmesg I found:
>> ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>> o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for 10 seconds, shutting it down.
>> (0,0):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293 adv 1176207823.774297:1176207823.774297 func (1459c2a9:504) 1176196505.704238:1176196505.704240)
>> o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>>
>> Since I had set reboot-on-panic on node 0, it restarted. Checking
>> /var/log/messages I found:
>> Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
>> Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR: stopping heartbeat on all active regions.
>> Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing.
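The quorum math behind that o2quo message is simple majority, as a sketch (the "counting itself" detail is an assumption consistent with the log, which shows node 0 connected to only 1 node, i.e. itself):

```shell
# Simple-majority quorum: floor(n/2) + 1 nodes must be mutually
# connected (a node counts itself). With 3 heartbeating nodes that
# is 2; node 0 could only see 1, so it fenced itself.
nodes=3
quorum=$(( nodes / 2 + 1 ))
echo "need $quorum of $nodes nodes for quorum"
```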
>>
>>
>>
>>
>> ----Original Message Follows----
>> From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
>> To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe" 
>> <enohiaghe at hotmail.com>
>> CC: <ocfs2-users at oss.oracle.com>
>> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>> Date: Mon, 9 Apr 2007 11:00:30 -0700
>>
>> It's not just an issue; it is really an OCFSv2 killer:
>> - In 99% of cases it is not a split-brain condition but just a short
>> (20-30 second) network interruption. The systems can in most cases still
>> see each other over the network or through the voting disk, so they can
>> communicate one way or another.
>> - In 90% of cases the system has no pending I/O activity, so it has no
>> reason to fence itself, at least until some I/O happens on the OCFSv2
>> file system. For example, if OCFSv2 is used for backups, it is active for
>> 3 hours at night plus restore time only, and the server could remount it
>> without any fencing if it lost consensus.
>> - Timeouts and other fencing parameters are badly designed, which makes
>> the problem worse. It can't work out of the box on most SAN networks
>> (where reconfiguration timeouts default to roughly 30 seconds to 1
>> minute). For example, a NetApp cluster takeover takes about 20 seconds
>> and a giveback about 40 seconds, which kills OCFSv2 100% of the time
>> (with default settings). The STP timeout (in classic mode) is 40 seconds,
>> which kills OCFSv2 100% of the time. Switch reboot time is about 1 minute
>> for most switches, which also kills OCFSv2 100% of the time. The result:
>> if I reboot a staging network switch, all standalone servers keep
>> working, all RAC clusters keep working, all other servers keep working,
>> and all OCFSv2 clusters fence themselves.
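Those outage figures can be sanity-checked against the o2hb disk-heartbeat window. The (threshold - 1) * 2-second formula and the default threshold of 7 are assumptions drawn from OCFS2 1.2-era defaults, not from this thread; verify them against your version's o2cb settings before relying on the numbers.

```shell
# Compare the outage lengths cited above (NetApp takeover ~20s,
# classic STP ~40s, switch reboot ~60s) against the heartbeat
# window; formula and default threshold are assumptions for
# OCFS2 1.2 and should be checked against your installation.
threshold=7
window=$(( (threshold - 1) * 2 ))
for outage in 20 40 60; do
    if [ "$outage" -gt "$window" ]; then
        echo "${outage}s outage exceeds the ${window}s window: node fences"
    else
        echo "${outage}s outage fits inside the ${window}s window"
    fi
done
```

With these defaults every listed outage exceeds the window, which matches the behavior described above; raising the threshold widens the window at the cost of slower failure detection.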
>>
>> For my part, I banned OCFSv2 from any usage except backups and archive
>> logs, and only with a cross-connect cable for the heartbeat. All other
>> scenarios are catastrophic (they cause overall cluster failure in many
>> cases). And all because of this fencing behavior.
>>
>> PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known
>> problem in buffer use: it doesn't release small buffers after a file is
>> created/deleted (so if you run a create-file/remove-file loop for a long
>> time, you will deplete system memory in approximately a few days). This
>> is not an issue if the files are big enough (Oracle backups, Oracle
>> archive logs, application homes) but must be taken into account if you
>> have more than 100,000 - 1,000,000 files on your OCFSv2 file system(s).
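The workload described above is easy to reproduce: a long create/remove loop of small files. A minimal sketch follows; the temp directory is only there to keep it self-contained, so point DIR at an OCFS2 mount (and raise the iteration count) to actually observe the buffer growth.

```shell
# Repeatedly create and delete a small file; on the affected
# OCFS2 build, free memory shrinks over a long run of this loop.
DIR=${DIR:-$(mktemp -d)}
i=0
while [ "$i" -lt 1000 ]; do
    : > "$DIR/f"    # create an empty file
    rm "$DIR/f"     # remove it again
    i=$((i + 1))
done
echo "completed $i create/remove cycles"
```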
>>
>> But the fencing problem exists in all versions (it is a little better in
>> modern ones, because the developers added a configurable network
>> timeout). Combine the single-heartbeat-interface design with the lack of
>> a serial heartbeat and it really becomes a problem, which is why I was
>> thinking about testing OCFSv2 on SLES10 with heartbeat2 (heartbeat2 has a
>> very reliable heartbeat and external fencing, but unfortunately SLES10 is
>> de facto not yet production-ready for us).
>>
>>
>>
>> ----- Original Message -----
>> From: "Jeff Mahoney" <jeffm at suse.com>
>> To: "enohi ibekwe" <enohiaghe at hotmail.com>
>> Cc: <ocfs2-users at oss.oracle.com>
>> Sent: Saturday, April 07, 2007 12:06 PM
>> Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>>
>>
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> > enohi ibekwe wrote:
>> > > Is this also an issue on SLES9?
>> > >
>> > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. 
>> I see
>> > > the error on the same box on the cluster.
>> >
>> > I'm not sure what you mean by "issue." This is designed behavior. When
>> > the cluster ends up in a split condition, one or more nodes will fence
>> > themselves.
>> >
>> > - -Jeff
>> >
>> > - --
>> > Jeff Mahoney
>> > SUSE Labs
>> > -----BEGIN PGP SIGNATURE-----
>> > Version: GnuPG v1.4.6 (GNU/Linux)
>> > Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org
>> >
>> > iD8DBQFGF+vDLPWxlyuTD7IRAuNPAJ9lZPLSaH7nOCNammYyW3bwC2Wj5wCgomUp
>> > zcRzcaedVAmk+AaJ/OFeddE=
>> > =8e6c
>> > -----END PGP SIGNATURE-----
>> >
>> > _______________________________________________
>> > Ocfs2-users mailing list
>> > Ocfs2-users at oss.oracle.com
>> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
>> >
>>
>>
>
>
>


