[Ocfs2-users] Re; another node is heartbeating in our slot! (when starting 3rd node)

Gareth Bult gareth at encryptec.net
Mon Feb 11 01:11:31 PST 2008


Urm,

3 nodes, set number = 2 .. if this works, it would make the ocfs2 documentation the world's worst.

>Compare the /etc/ocfs2/cluster.conf on all nodes and make sure it's ...

They are all the same.
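
One quick way to confirm that, for anyone else hitting this (assuming ssh access between the boxes, node names as in my config below), is to checksum the file on each node:

for h in nodea nodeb mgm; do ssh $h md5sum /etc/ocfs2/cluster.conf; done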

>Also confirm that 'hostname -s' returns the output of 

They are all correct.

>I'm sure you'll resolve this very quickly.

On the contrary, my config has been checked and rechecked.

My solution is to dump ocfs2 in its current configuration. There is obviously an issue with it running on Xen in combination with GNBD, and there are too few people using it for anyone to know what's going on. Looking in /config I have three nodes, all with different node IDs, and all block devices have the same UUID; given that the slot offset is taken from the device based on the node number, this error should be impossible .. according to the documentation.
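
For what it's worth, the kernel's view of the node numbers can be cross-checked straight from configfs (mounted at /config here), and the on-disk slot map can be dumped with debugfs.ocfs2 from ocfs2-tools - roughly like this (the device path is just an example):

# each configured node, with the number and IP the kernel holds for it
for n in /config/cluster/ocfs2/node/*; do
    echo "$(basename "$n"): num=$(cat "$n"/num) ip=$(cat "$n"/ipv4_address)"
done

# slot map as recorded on the shared device
debugfs.ocfs2 -R "slotmap" /dev/gnbd0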

>Nobody?? :)

It does seem that way!

Gareth.

----- Original Message -----
From: "Saar Maoz" <Saar.Maoz at oracle.com>
To: "Gareth Bult" <gareth at encryptec.net>
Cc: ocfs2-users at oss.oracle.com
Sent: 11 February 2008 08:20:05 (GMT) Europe/London
Subject: Re: [Ocfs2-users] Re; another node is heartbeating in our slot! (when starting 3rd node)


> I've seen a number of people with this problem (me too!) but nobody seems to have a solution,

Nobody?? :)

Compare the /etc/ocfs2/cluster.conf on all nodes and make sure it's 
identical.  I would suggest changing:

     number = 3

to number = 2, since you only have 3 nodes and they start from zero.  The 
error is simply saying that the node can't join because another node is 
using its slot; this can easily happen if these files are out of sync 
across the nodes.  Also confirm that 'hostname -s' matches the value of 
"name=" in that file.  I'm sure you'll resolve this very quickly.

Good luck,

Saar.

--
Saar Maoz
Consulting Software Engineer, Oracle Corporation
Saar.Maoz at oracle.com
Share your knowledge with others and compete with yourself


On Sat, 9 Feb 2008, Gareth Bult wrote:

> Date: Sat, 9 Feb 2008 18:22:58 +0000 (GMT)
> From: Gareth Bult <gareth at encryptec.net>
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] Re; another node is heartbeating in our slot! (when
>     starting 3rd node)
> 
> Hi,
>
> I've seen a number of people with this problem (me too!) but nobody seems to have a solution, any help would be greatly appreciated.
>
> Two nodes work fine with DRBD/OCFS2, but when I load a third using GNBD, I seem to run into problems...
>
> I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 version 1.3.3 - Tools 1.2.4.
>
> I have two nodes with the following config;
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.1
> number = 0
> name = nodea
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.2
> number = 1
> name = nodeb
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.20
> number = 3
> name = mgm
> cluster = ocfs2
>
> cluster:
> node_count = 3
> name = ocfs2
>
> nodea is running a 400G filesystem on /drbd1
> nodeb is running a 400G filesystem on /drbd2 (mirroring drbd1 using drbd 8)
>
> I can load nodes a and b, things look fine and work without problems; both systems can mount their respective drbd devices and it all seems to work.
>
> I then run gnbd_serv on both machines and export the drbd devices.
>
> On booting "mgm", I load drbd-client, then /etc/init.d/o2cb, so far so good;
>
> root at mgm:~# /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
> Heartbeat dead threshold = 7
> Network idle timeout: 10000
> Network keepalive delay: 5000
> Network reconnect delay: 2000
> Checking O2CB heartbeat: Not active
>
> root at mgm:~# mounted.ocfs2 -f
> Device FS Nodes
> /dev/gnbd0 ocfs2 nodea, nodeb
> /dev/gnbd1 ocfs2 nodea, nodeb
>
> root at mgm:~# mounted.ocfs2 -d
> Device FS UUID Label
> /dev/gnbd0 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
> /dev/gnbd1 ocfs2 35fff639-0ec2-4a8d-8849-2b9ef078a40a brick
>
> Slots;
> Slot# Node#
> 0 0
> 1 1
>
> Slot# Node#
> 0 0
> 1 1
>
> Now .. I come to try and mount a device on host "mgm";
>
> mount -t ocfs2 /dev/gnbd0 /cluster
>
> In the kernel log on nodea I see;
> Feb 9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
> Feb 9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
>
> On nodeb I see;
> Feb 9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
> Feb 9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
>
> And within 10 seconds or so both machines fence themselves off and reboot.
>
> It "seems" as tho' mgm is not recognising that slots 0 and 1 are already taken .. but everything "look" Ok to me.
> Can anyone spot any glaring mistakes or suggest a way I can debug this or provide more information to the list?
>
> Many thanks,
> Gareth.
>
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


