[Ocfs2-users] OCFS2 Fencing, then panic

enohi ibekwe enohiaghe at hotmail.com
Wed Apr 18 06:34:23 PDT 2007


Please ignore the last message; it was all messed up due to cut and paste. The
message should be as below:

I cannot do much about checking for network issues because the network is not
under my control.

We are thinking about taking node 0, the node that always gets fenced, out of
the cluster, so I tried the test below just to check that we will not have
issues with nodes 1 and 2.

When I tried to use the private IP, the situation got worse. I have documented
what I did below, and I am hoping that somebody will be able to figure out
what is happening.

I took node 0 out of the picture, i.e. shut it down (I have not removed it
from the RAC or OCFS2 cluster yet).


Sequence of events with Public IP
1. stop all ocfs services/cluster on node 1 and 2 (ok)
2. unmount ocfs fs on node 1 and 2 (ok)
3. verify that the IP in cluster.conf is the public address (ok; see the cluster.conf sketch after this list)
4. start all ocfs services/cluster on node 1 and 2 (ok)
5. On node 2:
    - mount -at ocfs2 (ok)
    - df (shows the mounted ocfs2 fs)
6. On node 1:
    - mount -at ocfs2 (ok, but see the error below)
    ora2:~ # mount -at ocfs2
          mount.ocfs2: Device or resource busy while mounting /dev/sdb1 on 
/u02/oradata/orcl
    ora2:~ # df (shows the mounted ocfs2 fs)
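
For reference, a rough sketch of what /etc/ocfs2/cluster.conf should contain
for the public-IP test is below. The node names, numbers, and 10.12.1.x
addresses are taken from the o2net lines in the dmesg output further down;
node 0's entry is assumed to still be present since it has not been removed
from the cluster yet, and the cluster name is assumed to be the default
"ocfs2":

    cluster:
            node_count = 3
            name = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.12.1.34
            number = 0
            name = ora1
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.12.1.36
            number = 1
            name = ora2
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.12.1.37
            number = 2
            name = ora3
            cluster = ocfs2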

dmesg so far for node 2
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
o2net: connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2_dlm: Node 1 joins domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2

dmesg so far for node 1
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ora3 (num 2) at 10.12.1.37:7777
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 1)


Sequence of events with Private IP
1. unmount ocfs fs on node 1 and 2 (ok)
2. stop all ocfs services/cluster on node 1 and 2 (ok)
3. change the IP in cluster.conf to the private address (ok; see the changed entries sketched after this list)
4. verify you can ping the private IP from/to node 1 and node 2 (ok)
5. start all ocfs services/cluster on node 1 and 2 (ok)
6. On node 2:
    - mount -at ocfs2 (ok)
    - df (shows the mounted ocfs2 fs)
7. On node 1:
     ora2:~ # mount -at ocfs2
         mount.ocfs2: Transport endpoint is not connected while mounting 
/dev/sdb1 on /u02/oradata/orcl
         mount.ocfs2: Transport endpoint is not connected while mounting 
/dev/sdb1 on /u02/oradata/orcl
         ora2:~ # df (shows no ocfs2 fs mounted)
8. unmount the ocfs2 fs on node 2 (ok)
9. mount the ocfs2 fs on node 1 (ok, but see the message below)
       ora2:~ # mount -at ocfs2
       mount.ocfs2: Device or resource busy while mounting /dev/sdb1 on 
/u02/oradata/orcl
       ora2:~ # df (shows the mounted ocfs2 fs)
10. mount the ocfs2 fs on node 2 (NOT OK)
        ora3:~ # mount -at ocfs2
        mount.ocfs2: Transport endpoint is not connected while mounting 
/dev/sdb1 on /u02/oradata/orcl
        ora3:~ # df (shows no ocfs2 fs mounted)
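
For the private-IP run, only the ip_address values change; node names, numbers,
and the port stay the same. Going by the addresses that show up in the dmesg
below, the changed entries for nodes 1 and 2 would look roughly like this
(node 0's private address does not appear in the logs, so it is left out of
the sketch):

    node:
            ip_port = 7777
            ip_address = 193.168.2.2
            number = 1
            name = ora2
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 193.168.2.3
            number = 2
            name = ora3
            cluster = ocfs2

Note that cluster.conf is only re-read when the cluster stack is brought
online, so the full stop/start in steps 1, 2, and 5 is required for the new
addresses to take effect.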


dmesg now on node 2
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
o2net: connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2_dlm: Node 1 joins domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
o2net: no longer connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 
193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 
193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 1 after 10 seconds, giving up and returning errors.
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 
193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 
193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 1 after 10 seconds, giving up and returning errors.
ocfs2: Unmounting device (8,17) on (node 2)
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 
193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 
193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 1 after 10 seconds, giving up and returning errors.
(21431,0):dlm_request_join:786 ERROR: status = -107
(21431,0):dlm_try_to_join_domain:934 ERROR: status = -107
(21431,0):dlm_join_domain:1186 ERROR: status = -107
(21431,0):dlm_register_domain:1379 ERROR: status = -107
(21431,0):ocfs2_dlm_init:2007 ERROR: status = -107
(21431,0):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 2)


dmesg now on node 1
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ora3 (num 2) at 10.12.1.37:7777
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 1)
ocfs2_dlm: Node 2 leaves domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1
o2net: no longer connected to node ora3 (num 2) at 10.12.1.37:7777
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 2 after 10 seconds, giving up and returning errors.
(19360,1):dlm_request_join:786 ERROR: status = -107
(19360,1):dlm_try_to_join_domain:934 ERROR: status = -107
(19360,1):dlm_join_domain:1186 ERROR: status = -107
(19360,1):dlm_register_domain:1379 ERROR: status = -107
(19360,1):ocfs2_dlm_init:2007 ERROR: status = -107
(19360,1):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 1)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 2 after 10 seconds, giving up and returning errors.
(19409,0):dlm_request_join:786 ERROR: status = -107
(19409,0):dlm_try_to_join_domain:934 ERROR: status = -107
(19409,0):dlm_join_domain:1186 ERROR: status = -107
(19409,0):dlm_register_domain:1379 ERROR: status = -107
(19409,0):ocfs2_dlm_init:2007 ERROR: status = -107
(19409,0):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 1)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 0)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with 
node 2 after 10 seconds, giving up and returning errors.
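
The bind failures on node 2 above are the key symptom: errno -99 is
EADDRNOTAVAIL, i.e. the kernel refused to bind() the o2net socket to
193.168.2.3, which normally means that address was not configured on any local
interface on node 2 at the time. The later status = -107 (ENOTCONN) errors
from the DLM and the failed mounts on both nodes are just fallout from the
missing connection. A quick sanity check, sketched below (command availability
and interface names will vary), would confirm whether the cluster.conf
addresses are really local. It may also be worth double-checking the addresses
themselves: 193.168.2.x is not an RFC 1918 private range and could be a typo
for 192.168.2.x, in which case a bind to 193.168.2.3 would fail in exactly
this way.

    # on each node, confirm the address listed for it in cluster.conf is local
    ora3:~ # ip addr show | grep 193.168.2
    ora2:~ # ip addr show | grep 193.168.2

    # from the peer, confirm the o2net port is reachable on the private address
    # (only meaningful while the cluster stack is online on the target node)
    ora2:~ # telnet 193.168.2.3 7777
    ora3:~ # telnet 193.168.2.2 7777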







----Original Message Follows----
From: Sunil Mushran <Sunil.Mushran at oracle.com>
To: enohi ibekwe <enohiaghe at hotmail.com>
CC: Alexei_Roudnev at exigengroup.com, jeffm at suse.com, 
ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Wed, 11 Apr 2007 14:14:41 -0700

Use private.

enohi ibekwe wrote:
>The IP address on the cluster.conf file is the public IP address for the 
>nodes.
>
>----Original Message Follows----
>From: Sunil Mushran <Sunil.Mushran at oracle.com>
>To: enohi ibekwe <enohiaghe at hotmail.com>
>CC: Alexei_Roudnev at exigengroup.com, jeffm at suse.com, 
>ocfs2-users at oss.oracle.com
>Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>Date: Wed, 11 Apr 2007 10:04:24 -0700
>
>Are you using a private or a public network?
>
>enohi ibekwe wrote:
>>Thanks for your help so far.
>>
>>My issue is the frequency at which node 0 gets fenced; it has happened at
>>least once a day in the last two days.
>>
>>More details:
>>
>>I am attempting to add a node (node 2) to an existing two-node cluster (node 0
>>and node 1). All nodes are currently running SLES9 (2.6.5-7.283-bigsmp i686) +
>>ocfs2 1.2.1-4.2, which is the OCFS2 package that ships with SLES9. Node 2 is
>>not part of the RAC cluster yet; I have only installed OCFS2 on it. I can
>>mount the OCFS2 file system on all nodes, and it is accessible from all nodes.
>>
>>Node 0 is the node that always gets fenced, and it gets fenced very
>>frequently. Before I added the kernel.panic parameter, node 0 would get
>>fenced, panic, and hang; only a power cycle would make it responsive again.
>>
>>This is what happened this morning.
>>
>>I was remotely connected to node 0 via ssh. Then I suddenly lost the
>>connection. I tried to ssh again but node 0 refused the connection.
>>
>>Checking node 1's dmesg I found:
>>ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>>o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
>>10 seconds, shutting it down.
>>(0,3):o2net_idle_timer:1310 here are some times that might help debug the
>>situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
>>adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
>>1176196519.600486:1176196519.600489)
>>o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>>
>>Checking node 2's dmesg I found:
>>ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
>>o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
>>10 seconds, shutting it down.
>>(0,0):o2net_idle_timer:1310 here are some times that might help debug the
>>situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
>>adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
>>1176196505.704238:1176196505.704240)
>>o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
>>
>>Since I had reboot on panic set on node 0, node 0 restarted. Checking
>>/var/log/messages I found:
>>Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
>>this node because it is only connected to 1 nodes and 2 is needed to make a
>>quorum out of 3 heartbeating nodes
>>Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
>>stopping heartbeat on all active regions.
>>Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
>>this system by panicing.
>>
>>
>>
>>
>>----Original Message Follows----
>>From: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
>>To: "Jeff Mahoney" <jeffm at suse.com>,"enohi ibekwe" <enohiaghe at hotmail.com>
>>CC: <ocfs2-users at oss.oracle.com>
>>Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>>Date: Mon, 9 Apr 2007 11:00:30 -0700
>>
>>It's not just an issue; it is really an OCFSv2 killer:
>>- in 99% of cases it is not a split-brain condition but just a short (20 - 30
>>second) network interruption. The systems can (in most cases) still see each
>>other over the network or through the voting disk, so they can communicate
>>one way or another;
>>- in 90% of cases the system has no pending IO activity, so it has no reason
>>to fence itself, at least until some IO happens on the OCFSv2 file system.
>>For example, if OCFSv2 is used only for backups, it is active for 3 hours at
>>night plus during restores, and the server could remount it without any
>>fencing if it lost consensus;
>>- the timeouts and other fencing parameters are badly designed, which makes
>>the problem worse. It can't work out of the box on most SAN networks (where
>>reconfiguration timeouts are all around 30 seconds - 1 minute by default).
>>For example, a NetApp cluster takeover takes about 20 seconds and a giveback
>>about 40 seconds, which kills OCFSv2 with 100% certainty (with default
>>settings). The STP timeout (in classical mode) is 40 seconds, which kills
>>OCFSv2 with 100% certainty. Network switch reboot time is about 1 minute for
>>most switches, which kills OCFSv2 with 100% certainty. The result: if I
>>reboot a staging network switch, all the standalone servers keep working, all
>>the RAC clusters keep working, all the other servers keep working, and all
>>the OCFSv2 clusters fence themselves.
>>
>>As for me, I have banned OCFSv2 from any usage except backups and archive
>>logs, and only with a cross-connect cable for the heartbeat. All other
>>scenarios are catastrophic (they cause overall cluster failure in many
>>cases), and all because of this fencing behavior.
>>
>>PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known problem
>>in buffer use - it doesn't release small buffers after a file is
>>created/deleted (so if you run a create-file / remove-file loop for a long
>>time, you will deplete system memory in approximately a few days). This is
>>not an issue if the files are big enough (Oracle backups, Oracle archive
>>logs, application home), but it must be taken into account if you have more
>>than 100,000 - 1,000,000 files on your OCFSv2 file system(s).
>>
>>But the fencing problem exists in all versions (it is a little better in
>>modern ones, because the developers added a configurable network timeout). If
>>you add the _one heartbeat interface only_ design and the _no serial
>>heartbeat_ design on top of that, it really becomes a problem, and that is
>>why I was thinking about testing OCFSv2 on SLES10 with heartbeat2 (heartbeat2
>>has a very reliable heartbeat and external fencing, but unfortunately SLES10
>>is not production ready for us yet, de facto).
>>
>>
>>
>>----- Original Message -----
>>From: "Jeff Mahoney" <jeffm at suse.com>
>>To: "enohi ibekwe" <enohiaghe at hotmail.com>
>>Cc: <ocfs2-users at oss.oracle.com>
>>Sent: Saturday, April 07, 2007 12:06 PM
>>Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
>>
>>
>> > enohi ibekwe wrote:
>> > > Is this also an issue on SLES9?
>> > >
>> > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
>> > > the error on the same box in the cluster.
>> >
>> > I'm not sure what you mean by "issue." This is designed behavior. When
>> > the cluster ends up in a split condition, one or more nodes will fence
>> > themselves.
>> >
>> > -Jeff
>> >
>> > --
>> > Jeff Mahoney
>> > SUSE Labs
>> >
>>
>




