[Ocfs2-users] ocfs2 keeps fencing all my nodes

John Lange j.lange at epic.ca
Thu Jan 18 13:03:19 PST 2007


I have a 4 node SLES 10 cluster with all nodes attached to a SAN via
fiber.

The SAN has an EVMS volume formatted with ocfs2. My cluster.conf is below.

I can mount the volume on any single node but as soon as I mount it on
the second node, it fences one of the nodes. There is never more than
one node active at a time.

When I check the status of the nodes (quickly, before they get fenced)
the status shows they are heartbeating.

# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Checking O2CB heartbeat: Active

======== 

Here are the logs from two machines (note that they are interleaved
because they were captured at the same time via remote syslog on a third
machine) showing what happens when node vs2 is already running and node
vs3 joins the cluster (mounts the ocfs2 file system). In this instance
vs3 gets fenced.

Jan 18 14:52:41 vs2 kernel: o2net: accepted connection from node vs3 (num 2) at 10.1.1.13:7777
Jan 18 14:52:41 vs3 kernel: o2net: connected to node vs2 (num 1) at 10.1.1.12:7777
Jan 18 14:52:45 vs3 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Node 2 joins domain 89FC5CB6C98B43B998AB8492874EA6CA
Jan 18 14:52:45 vs2 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2 
Jan 18 14:52:45 vs3 kernel: ocfs2_dlm: Nodes in domain ("89FC5CB6C98B43B998AB8492874EA6CA"): 1 2 
Jan 18 14:52:45 vs3 kernel: kjournald starting.  Commit interval 5 seconds
Jan 18 14:52:45 vs3 kernel: ocfs2: Mounting device (253,13) on (node 2, slot 0)
Jan 18 14:52:45 vs3 udevd-event[5542]: run_program: ressize 256 too short
Jan 18 14:52:51 vs2 kernel: o2net: connection to node vs3 (num 2) at 10.1.1.13:7777 has been idle for 10 seconds, shutting it down.
Jan 18 14:52:51 vs2 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 adv 1169153566.98039:1169153566.98040 func (09ab0f3c:504) 1169153565.211482:1169153565.211485)
Jan 18 14:52:51 vs3 kernel: o2net: no longer connected to node vs2 (num 1) at 10.1.1.12:7777
Jan 18 14:52:51 vs2 kernel: o2net: no longer connected to node vs3 (num 2) at 10.1.1.13:7777
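
The timestamps in the o2net_idle_timer line can be turned into elapsed
times to see how long the connection had really been quiet. The sketch
below is only a reading aid: it assumes each stamp is printed as
"seconds.microseconds" with the microsecond part not zero-padded (so
".99906" would mean 0.099906 s), and the helper names to_usec and
age_at_now are made up for illustration.

```python
import re

# The timestamp portion of the o2net_idle_timer line logged on vs2.
LINE = ("(tmr 1169153561.99906 now 1169153571.93951 dr 1169153566.98030 "
        "adv 1169153566.98039:1169153566.98040 "
        "func (09ab0f3c:504) 1169153565.211482:1169153565.211485)")


def to_usec(stamp):
    """Convert 'sec.usec' to integer microseconds.

    Assumption: the part after the dot is an unpadded microsecond count,
    not a decimal fraction.
    """
    sec, usec = stamp.split(".")
    return int(sec) * 1_000_000 + int(usec)


def age_at_now(line):
    """Return microseconds between the 'tmr' and 'dr' stamps and 'now'."""
    labelled = dict(re.findall(r"(tmr|now|dr) (\d+\.\d+)", line))
    now = to_usec(labelled["now"])
    return {name: now - to_usec(stamp)
            for name, stamp in labelled.items() if name != "now"}


if __name__ == "__main__":
    for name, age in age_at_now(LINE).items():
        print(f"{name}: {age / 1_000_000:.6f} s before 'now'")
```

Under that assumption "tmr" is about 9.99 s before "now", which lines up
with the "idle for 10 seconds" message, while "dr" is only about 5 s
before "now".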

==========

I had previously configured ocfs2 for userspace heartbeating but
couldn't get that running, so I reconfigured it for disk-based
heartbeating. Could that now be the cause of this problem?

Where do the nodes write the heartbeats? I see nothing on the ocfs2
file system.

Also, I have no /config directory that is mentioned in the docs. Is that
normal?

Here is /etc/ocfs2/cluster.conf

node:
        ip_port = 7777
        ip_address = 10.1.1.11
        number = 0
        name = vs1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.12
        number = 1
        name = vs2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.13
        number = 2
        name = vs3
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.14
        number = 3
        name = vs4
        cluster = ocfs2

cluster:
        node_count = 4
        name = ocfs2
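
As a basic sanity check on a file in this layout, the stanzas can be
parsed and compared against each other. This is a rough sketch, not what
the o2cb tools themselves do: the function names parse_cluster_conf and
check_cluster_conf are hypothetical, and the checks (node_count matches
the number of node stanzas, no duplicate number, name or ip_address) are
just the obvious ones to try.

```python
SAMPLE = """\
node:
        ip_port = 7777
        ip_address = 10.1.1.11
        number = 0
        name = vs1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.12
        number = 1
        name = vs2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.13
        number = 2
        name = vs3
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.1.14
        number = 3
        name = vs4
        cluster = ocfs2

cluster:
        node_count = 4
        name = ocfs2
"""


def parse_cluster_conf(text):
    """Parse 'kind:' headers followed by 'key = value' lines into a list."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            current[1][key] = value
    return stanzas


def check_cluster_conf(stanzas):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    nodes = [attrs for kind, attrs in stanzas if kind == "node"]
    clusters = [attrs for kind, attrs in stanzas if kind == "cluster"]
    for cluster in clusters:
        members = [n for n in nodes if n.get("cluster") == cluster.get("name")]
        if int(cluster.get("node_count", "-1")) != len(members):
            problems.append(
                f"cluster {cluster.get('name')}: node_count is "
                f"{cluster.get('node_count')} but {len(members)} nodes listed")
    for field in ("number", "name", "ip_address"):
        values = [n.get(field) for n in nodes]
        if len(set(values)) != len(values):
            problems.append(f"duplicate {field} among node stanzas")
    return problems


if __name__ == "__main__":
    found = check_cluster_conf(parse_cluster_conf(SAMPLE))
    print(found if found else "no problems found")
```

Running the same check against the copy of the file on each node would
also show whether the copies have drifted apart.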


Any tips on how I can go about diagnosing this problem?

Thanks,
John Lange




