[Ocfs2-users] ocfs2 keeps fencing all my nodes

Sunil Mushran Sunil.Mushran at oracle.com
Thu Jan 18 14:09:46 PST 2007


1. In SLES10, the /config mount point has moved to /sys/kernel/config. That's 
how it is on mainline.
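You can check where configfs is mounted on your kernel with something like the 
following (paths per the SLES10/mainline layout; once the cluster is online you 
should see the cluster name, ocfs2 here, under .../cluster):
# mount -t configfs
# ls /sys/kernel/config/cluster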

2. To monitor heartbeat do:
# watch -d -n2 debugfs.ocfs2 -R "hb" /dev/sdX
This command will work if you have ocfs2-tools 1.2.2. (Not sure whether 
SLES10 ships with 1.2.2 or 1.2.1.) If 1.2.1, do:
# watch -d -n2 "echo \"hb\" | debugfs.ocfs2 -n /dev/sdX | grep -v 
\"0000000000000000 0000000000000000 00000000\""

3. Configure netconsole to catch any oops stack trace.
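A minimal netconsole setup is just the module with a target parameter. The IPs, 
interface and MAC below are placeholders for your environment. On the node you 
want to watch (vs3, say):
# modprobe netconsole netconsole=6666@10.1.1.13/eth0,6666@10.1.1.99/00:11:22:33:44:55
and on the log collector (10.1.1.99 in this example), listen on UDP 6666, e.g. 
(syntax depends on your netcat flavor):
# nc -u -l -p 6666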

4. From the looks of it, the issue is related to the disk heartbeat (hb) timeout.
Check the FAQ on increasing it to 60 secs from the default of 14 secs.
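If your setup follows the FAQ, the knob is O2CB_HEARTBEAT_THRESHOLD in 
/etc/sysconfig/o2cb; the effective timeout is roughly (threshold - 1) * 2 
seconds, so 31 gives about 60 secs. Confirm the exact file and formula against 
the FAQ for your tools version. On every node set:
O2CB_HEARTBEAT_THRESHOLD=31
then unmount the volume and restart the o2cb stack (/etc/init.d/o2cb) on each 
node so they all run with the same threshold.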

John Lange wrote:
> I have a 4 node SLES 10 cluster with all nodes attached to a SAN via
> fiber.
>
> The SAN has an EVMS volume formatted with ocfs2. Below is my cluster.conf.
>
> I can mount the volume on any single node but as soon as I mount it on
> the second node, it fences one of the nodes. There is never more than
> one node active at a time.
>
> When I check the status of the nodes (quickly, before they get fenced),
> the status shows they are heartbeating.
>
> # /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
> Checking O2CB heartbeat: Active
>
> ======== 
>
> Here are the logs from 2 machines (NOTE: these are the logs from both
> machines at the same time, captured via remote syslog on a 3rd machine),
> showing what happens when the node vs2 is already running and node vs3
> joins the cluster (mounts the ocfs2 file system). In this instance vs3
> gets fenced.
>
> Jan 18 14:52:41 vs2 kernel: o2net: accepted connection from node vs3 (num 2) at 10.1.1.13:7777
> Jan 18 14:52:41 vs3 kernel: o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
> Jan 18 14:52:45 vs3 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
> Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Node 2 joins domain 89FC5CB6C98B43B998AB8492874EA6CA
> Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2 
> Jan 18 14:52:45 vs3 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2 
> Jan 18 14:52:45 vs3 kernel: kjournald starting.  Commit interval 5 seconds
> Jan 18 14:52:45 vs3 kernel: ocfs2: Mounting device (253,13) on (node 2, slot 0)
> Jan 18 14:52:45 vs3 udevd-event[5542]: run_program: ressize 256 too short
> Jan 18 14:52:51 vs2 kernel: o2net: connection to node vs3 (num 2) at 10.1.1.13:7777 has been idle for 10 seconds, shutting it down.
> Jan 18 14:52:51 vs2 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 adv 1169153566.98039:1169153566.98040 func (09ab0f3c:504) 1169153565.211482:1169153565.211485)
> Jan 18 14:52:51 vs3 kernel: o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:7777
> Jan 18 14:52:51 vs2 kernel: o2net: no longer connected to node vs3 (num 2) at 10.1.1.13:7777
>
> ==========
>
> I previously had configured ocfs2 for userspace heartbeating but
> couldn't get that running, so I reconfigured for disk-based heartbeat.
> Could that now be the cause of this problem?
>
> Where do the nodes write the heartbeats? I see nothing on the ocfs2
> file system.
>
> Also, I have no /config directory, which the docs mention. Is that
> normal?
>
> Here is /etc/ocfs2/cluster.conf
>
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.11
>         number = 0
>         name = vs1
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.12
>         number = 1
>         name = vs2
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.13
>         number = 2
>         name = vs3
>         cluster = ocfs2
>
> node:
>         ip_port = 7777
>         ip_address = 10.1.1.14
>         number = 3
>         name = vs4
>         cluster = ocfs2
>
> cluster:
>         node_count = 4
>         name = ocfs2
>
>
> Any tips on how I can go about diagnosing this problem?
>
> Thanks,
> John Lange
>
>
>


