[Ocfs2-users] Re; another node is heartbeating in our slot! (when starting 3rd node)

Saar Maoz Saar.Maoz at oracle.com
Mon Feb 11 00:20:05 PST 2008


> I've seen a number of people with this problem (me too!) but nobody seems to have a solution,

Nobody?? :)

Compare the /etc/ocfs2/cluster.conf on all nodes and make sure it's 
identical.  I would suggest changing:

     number = 3

to number = 2, since you only have 3 nodes and they are numbered from 
zero.  The error simply means the node can't join because another node 
is using its slot, which can easily happen if these files are out of 
sync across the nodes.  Also confirm that 'hostname -s' on each node 
matches the "name =" value for that node in the file; a quick way to 
check both is sketched below.  I'm sure you'll resolve this very quickly.

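For example, based on the config you posted below, the third stanza would 
then read (everything except the number unchanged from your mail):

     node:
         ip_port = 7777
         ip_address = 10.0.0.20
         number = 2
         name = mgm
         cluster = ocfs2

And one quick way to compare the files and hostnames from a single box, 
assuming you have ssh access to all three nodes (hostnames taken from 
your config):

     for h in nodea nodeb mgm; do
         ssh $h 'hostname -s; md5sum /etc/ocfs2/cluster.conf'
     done

If the checksums differ, or a short hostname doesn't match its "name =" 
line, that's the place to start.
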
Good luck,

Saar.

--
   Saar Maoz, Consulting Software Engineer, Oracle Corporation
   WK: 510.222.4224   Saar.Maoz at oracle.com
   Share your knowledge with others and compete with yourself


On Sat, 9 Feb 2008, Gareth Bult wrote:

> Date: Sat, 9 Feb 2008 18:22:58 +0000 (GMT)
> From: Gareth Bult <gareth at encryptec.net>
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] Re; another node is heartbeating in our slot! (when
>     starting 3rd node)
> 
> Hi,
>
> I've seen a number of people with this problem (me too!) but nobody seems to have a solution; any help would be greatly appreciated.
>
> Two nodes work fine with DRBD/OCFS2, but when I add a third using GNBD, I seem to run into problems...
>
> I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 version 1.3.3, tools 1.2.4.
>
> I have two nodes with the following config;
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.1
> number = 0
> name = nodea
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.2
> number = 1
> name = nodeb
> cluster = ocfs2
>
> node:
> ip_port = 7777
> ip_address = 10.0.0.20
> number = 3
> name = mgm
> cluster = ocfs2
>
> cluster:
> node_count = 3
> name = ocfs2
>
> nodea is running a 400G filesystem on /drbd1
> nodeb is running a 400G filesystem on /drbd2 (mirroring drbd1 using drbd 8)
>
> I can bring up nodes a and b and everything works with no problem; both systems can mount their respective drbd devices and it all seems fine.
>
> I then run gnbd_serv on both machines and export the drbd devices.
>
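For reference, the gnbd export/import side of this is roughly as follows 
(the export name "brick0" and the device path are placeholders, not taken 
verbatim from Gareth's mail):

     # on nodea and nodeb, with gnbd_serv already running
     gnbd_export -v -e brick0 -d /dev/drbd1

     # on mgm, import everything exported by a given server
     modprobe gnbd
     gnbd_import -v -i nodea

The imported exports then show up on the client as the gnbd block devices 
(/dev/gnbd0 and /dev/gnbd1 below).
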
> On booting "mgm", I load drbd-client, then /etc/init.d/o2cb, so far so good;
>
> root at mgm:~# /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster ocfs2: Online
> Heartbeat dead threshold = 7
> Network idle timeout: 10000
> Network keepalive delay: 5000
> Network reconnect delay: 2000
> Checking O2CB heartbeat: Not active
>
> root at mgm:~# mounted.ocfs2 -f
> Device      FS     Nodes
> /dev/gnbd0  ocfs2  nodea, nodeb
> /dev/gnbd1  ocfs2  nodea, nodeb
>
> root at mgm:~# mounted.ocfs2 -d
> Device      FS     UUID                                  Label
> /dev/gnbd0  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick
> /dev/gnbd1  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick
>
> Slots;
> Slot# Node#
> 0 0
> 1 1
>
> Slot# Node#
> 0 0
> 1 1
>
> Now .. I come to try and mount a device on host "mgm";
>
> mount -t ocfs2 /dev/gnbd0 /cluster
>
> In the kernel log on nodea I see;
> Feb 9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
> Feb 9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
>
> On nodeb I see;
> Feb 9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
> Feb 9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
>
> And within 10 seconds or so both machines fence themselves off and reboot.
>
> It "seems" as tho' mgm is not recognising that slots 0 and 1 are already taken .. but everything "looks" OK to me.
> Can anyone spot any glaring mistakes or suggest a way I can debug this or provide more information to the list?
>
> Many thanks,
> Gareth.
>
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

