[Ocfs2-users] ocfs2 keeps fencing all my nodes

John Lange j.lange at epic.ca
Fri Jan 19 14:19:56 PST 2007


I just want to confirm for the benefit of the list archives that
downgrading the SUSE kernel to 2.6.16.21-0.25-smp did solve the fencing
problem.
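
For anyone verifying the same fix: after rebooting into the downgraded
kernel, a quick sanity check on each node before bringing o2cb back up
is simply:

  # uname -r
  2.6.16.21-0.25-smp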

Thank you.

John

On Thu, 2007-01-18 at 16:57 -0500, Charlie Sharkey wrote:
> 
> It may be a problem with SLES10. It looks like the latest SLES10
> kernel patch (2.6.16.27-0.6) has this problem.
> 
> Here is the problem as reported by someone earlier:
> http://oss.oracle.com/pipermail/ocfs2-users/2007-January/001181.html
> http://oss.oracle.com/pipermail/ocfs2-users/2007-January/001182.html
> 
> Here is the Bugzilla entry:
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=835
> 
> 
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of John Lange
> Sent: Thursday, January 18, 2007 4:03 PM
> To: ocfs2-users
> Subject: [Ocfs2-users] ocfs2 keeps fencing all my nodes
> 
> I have a 4-node SLES 10 cluster with all nodes attached to a SAN via
> fiber.
> 
> The SAN has an EVMS volume formatted with ocfs2; my
> /etc/ocfs2/cluster.conf is included below.
> 
> I can mount the volume on any single node but as soon as I mount it on
> the second node, it fences one of the nodes. There is never more than
> one node active at a time.
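> 
> For reference, each node mounts the volume with a plain ocfs2 mount
> along these lines (the EVMS device path and mount point here are
> illustrative, not my exact ones):
> 
>   # mount -t ocfs2 /dev/evms/sanvol /mnt/shared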
> 
> When I check the status of the nodes (quickly, before they get fenced),
> the status shows they are heartbeating.
> 
> # /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
> Checking O2CB heartbeat: Active
> 
> ======== 
> 
> Here are the logs from 2 machines (note: these are the logs from both
> machines over the same interval, captured via remote syslog on a 3rd
> machine) showing what happens when node vs2 is already running and node
> vs3 joins the cluster (mounts the ocfs2 file system). In this instance
> vs3 gets fenced.
> 
> Jan 18 14:52:41 vs2 kernel: o2net: accepted connection from node vs3 (num 2) at 10.1.1.13:7777
> Jan 18 14:52:41 vs3 kernel: o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
> Jan 18 14:52:45 vs3 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Node 2 joins domain 89FC5CB6C98B43B998AB8492874EA6CA
> Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2
> Jan 18 14:52:45 vs3 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2
> Jan 18 14:52:45 vs3 kernel: kjournald starting.  Commit interval 5 seconds
> Jan 18 14:52:45 vs3 kernel: ocfs2: Mounting device (253,13) on (node 2, slot 0)
> Jan 18 14:52:45 vs3 udevd-event[5542]: run_program: ressize 256 too short
> Jan 18 14:52:51 vs2 kernel: o2net: connection to node vs3 (num 2) at 10.1.1.13:7777 has been idle for 10 seconds, shutting it down.
> Jan 18 14:52:51 vs2 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 adv 1169153566.98039:1169153566.98040 func (09ab0f3c:504) 1169153565.211482:1169153565.211485)
> Jan 18 14:52:51 vs3 kernel: o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:7777
> Jan 18 14:52:51 vs2 kernel: o2net: no longer connected to node vs3 (num 2) at 10.1.1.13:7777
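> 
> The "has been idle for 10 seconds" message is o2net's network idle
> timer firing, and that disconnect is what precedes the fence. As far
> as I know the only related tunable in this tools version is the disk
> heartbeat dead threshold in /etc/sysconfig/o2cb (the 10-second network
> idle timeout itself is hardcoded here), so checking it is diagnostic
> rather than a fix:
> 
>   # grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb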
> 
> ==========
> 
> I previously configured ocfs2 for userspace heartbeating but couldn't
> get that working, so I reconfigured for disk-based heartbeating. Could
> that be the cause of this problem?
> 
> Where do the nodes write the heartbeats? I see nothing on the mounted
> ocfs2 filesystem.
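> 
> My guess from the docs is that the heartbeat goes to a hidden system
> file on the volume itself rather than anywhere in the mounted tree; if
> so, something like this should show it among the system files (device
> path illustrative). Is that right?
> 
>   # debugfs.ocfs2 -R "ls -l //" /dev/evms/sanvol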
> 
> Also, the /config directory mentioned in the docs does not exist on my
> nodes. Is that normal?
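> 
> As far as I can tell, configfs is mounted under /sys/kernel/config on
> this kernel rather than at the /config path the docs mention, so that
> may just be a stale path in the docs:
> 
>   # mount | grep configfs
>   configfs on /sys/kernel/config type configfs (rw)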
> 
> Here is /etc/ocfs2/cluster.conf
> 
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.11
>         number = 0
>         name = vs1
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.12
>         number = 1
>         name = vs2
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.13
>         number = 2
>         name = vs3
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.14
>         number = 3
>         name = vs4
>         cluster = ocfs2
> 
> cluster:
>         node_count = 4
>         name = ocfs2
> 
> 
> Any tips on how I can go about diagnosing this problem?
> 
> Thanks,
> John Lange
> 
> 
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
