[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
Gareth Bult
gareth at encryptec.net
Sat Feb 9 10:22:58 PST 2008
Hi,
I've seen a number of people with this problem (me too!) but nobody seems to have a solution, any help would be greatly appreciated.
Two nodes work fine with DRBD/OCFS2, but when I add a third using GNBD, I seem to run into problems...
I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 version 1.3.3 - tools 1.2.4.
I have three nodes with the following config;
node:
	ip_port = 7777
	ip_address = 10.0.0.1
	number = 0
	name = nodea
	cluster = ocfs2
node:
	ip_port = 7777
	ip_address = 10.0.0.2
	number = 1
	name = nodeb
	cluster = ocfs2
node:
	ip_port = 7777
	ip_address = 10.0.0.20
	number = 3
	name = mgm
	cluster = ocfs2
cluster:
	node_count = 3
	name = ocfs2
nodea is running a 400G filesystem on /dev/drbd1
nodeb is running a 400G filesystem on /dev/drbd2 (mirroring drbd1 using drbd 8)
I can boot nodes a and b and things look fine: both systems mount their respective drbd devices and everything works, no problem.
I then run gnbd_serv on both machines and export the drbd devices.
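For reference, the export/import steps look roughly like this (the export name "brick0" is illustrative; treat this as a sketch of what I'm doing rather than the exact commands):

```shell
# On nodea and nodeb: start the GNBD server and export the local drbd device
gnbd_serv
gnbd_export -d /dev/drbd1 -e brick0

# On mgm: load the client module and import the exports from each server,
# which creates the /dev/gnbd* devices
modprobe gnbd
gnbd_import -i nodea
gnbd_import -i nodeb
```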
On booting "mgm", I load drbd-client, then /etc/init.d/o2cb, so far so good;
root at mgm:~# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 7
Network idle timeout: 10000
Network keepalive delay: 5000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
root at mgm:~# mounted.ocfs2 -f
Device FS Nodes
/dev/gnbd0 ocfs2 nodea, nodeb
/dev/gnbd1 ocfs2 nodea, nodeb
root at mgm:~# mounted.ocfs2 -d
Device FS UUID Label
/dev/gnbd0 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
/dev/gnbd1 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
Slots;
Slot# Node#
0 0
1 1
Slot# Node#
0 0
1 1
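If it's useful, I can also pull the number of node slots each filesystem was formatted with (the slot maps above only show two); something like this should show it, assuming the debugfs.ocfs2/tunefs.ocfs2 utilities from ocfs2-tools:

```shell
# Read the superblock stats from the shared device (read-only query)
debugfs.ocfs2 -R "stats" /dev/gnbd0 | grep -i "slots"

# If the filesystems were formatted with only 2 node slots, more can be
# added with the filesystems unmounted, e.g.:
#   tunefs.ocfs2 -N 4 /dev/drbd1
```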
Now I come to try and mount a device on host "mgm";
mount -t ocfs2 /dev/gnbd0 /cluster
In the kernel log on nodea I see;
Feb 9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
Feb 9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
On nodeb I see;
Feb 9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
Feb 9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
And within 10 seconds or so both machines fence themselves off and reboot.
It "seems" as though mgm is not recognising that slots 0 and 1 are already taken, but everything looks OK to me.
Can anyone spot any glaring mistakes or suggest a way I can debug this or provide more information to the list?
Many thanks,
Gareth.