[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
Gareth Bult
gareth at encryptec.net
Sat Feb 9 10:22:58 PST 2008
Hi,
I've seen a number of people with this problem (me too!) but nobody seems to have a solution, any help would be greatly appreciated.
Two nodes work fine with DRBD/OCFS2, but when I add a third using GNBD, I seem to run into problems...
I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 version 1.3.3 - tools 1.2.4.
I have three nodes with the following config;
node:
	ip_port = 7777
	ip_address = 10.0.0.1
	number = 0
	name = nodea
	cluster = ocfs2
node:
	ip_port = 7777
	ip_address = 10.0.0.2
	number = 1
	name = nodeb
	cluster = ocfs2
node:
	ip_port = 7777
	ip_address = 10.0.0.20
	number = 3
	name = mgm
	cluster = ocfs2
cluster:
	node_count = 3
	name = ocfs2
nodea is running a 400G filesystem on /dev/drbd1
nodeb is running a 400G filesystem on /dev/drbd2 (mirroring drbd1 using drbd 8)
I can boot nodes a and b and things look fine: both systems mount their respective drbd devices and everything works, no problem.
I then run gnbd_serv on both machines and export the drbd devices.
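For reference, the export/import steps look roughly like this (the export name "brick0" is illustrative; treat this as a sketch of what I'm doing rather than the exact commands):

```shell
# On nodea and nodeb: start the GNBD server and export the local drbd device
gnbd_serv
gnbd_export -d /dev/drbd1 -e brick0

# On mgm: load the client module and import the exports from each server,
# which creates the /dev/gnbd* devices
modprobe gnbd
gnbd_import -i nodea
gnbd_import -i nodeb
```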
On booting "mgm", I load drbd-client, then /etc/init.d/o2cb, so far so good;
root at mgm:~# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 7
Network idle timeout: 10000
Network keepalive delay: 5000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
root at mgm:~# mounted.ocfs2 -f
Device FS Nodes
/dev/gnbd0 ocfs2 nodea, nodeb
/dev/gnbd1 ocfs2 nodea, nodeb
root at mgm:~# mounted.ocfs2 -d
Device FS UUID Label
/dev/gnbd0 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
/dev/gnbd1 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
Slots;
Slot# Node#
0 0
1 1
Slot# Node#
0 0
1 1
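If it's useful, I can also pull the number of node slots each filesystem was formatted with (the slot maps above only show two); something like this should show it, assuming the debugfs.ocfs2/tunefs.ocfs2 utilities from ocfs2-tools:

```shell
# Read the superblock stats from the shared device (read-only query)
debugfs.ocfs2 -R "stats" /dev/gnbd0 | grep -i "slots"

# If the filesystems were formatted with only 2 node slots, more can be
# added with the filesystems unmounted, e.g.:
#   tunefs.ocfs2 -N 4 /dev/drbd1
```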
Now I come to try and mount a device on host "mgm";
mount -t ocfs2 /dev/gnbd0 /cluster
In the kernel log on nodea I see;
Feb 9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
Feb 9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
On nodeb I see;
Feb 9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
Feb 9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
And within 10 seconds or so both machines fence themselves off and reboot.
It "seems" as though mgm is not recognising that slots 0 and 1 are already taken, but everything looks OK to me.
Can anyone spot any glaring mistakes or suggest a way I can debug this or provide more information to the list?
Many thanks,
Gareth.