[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Wed Oct 22 17:51:47 PDT 2008

Are you mounting the snapshotted LUN on more than one node? If not, then
use tunefs.ocfs2 to also make it mount local. That is, do it at the same
time you change the label and uuid. This avoids the problem because the fs
does not start hb for local mounts.
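
As a rough sketch of that single tunefs.ocfs2 call: the function name, the
device path and the label below are made up for illustration, and the flag
spellings (-L for label, -U for uuid reset, -M local for mount type) are
what I'd expect from the tunefs.ocfs2 man page. Check the man page for your
ocfs2-tools version before running it for real. By default it only prints
the command.

```shell
# Hypothetical helper: relabel a snapshot, reset its uuid and mark it as a
# local mount in one tunefs.ocfs2 call. Flags are assumed from the man page.
# DRY_RUN=1 (the default here) only prints the command instead of running it.
retag_snapshot_local() {
    dev="$1"
    label="$2"
    set -- tunefs.ocfs2 -L "$label" -U -M local "$dev"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `retag_snapshot_local /dev/mapper/snap1 snap_oct22` prints the
command; set DRY_RUN=0 to execute it while the volume is unmounted.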

However, this just avoids the issue.

To resolve it I'll need more info. For starters, walk through the process
and upload the messages file. Indicate the device you are snapshotting, etc.
In your description you mention assuming you were snapshotting a particular
device. Don't assume, because I don't know what to make of it.
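
One way to stop guessing which map a dm-N name refers to is to match the
minor number against the device-mapper table. The sketch below is only an
illustration: the function name is made up, and it assumes `dmsetup ls`
prints lines like "name (major, minor)" or "name (major:minor)", which
varies between versions.

```shell
# Hypothetical helper: read `dmsetup ls` style lines on stdin and print the
# map name whose minor number equals the argument (28 for dm-28).
# Assumed input format: "mpath0  (253, 28)" or "mpath0  (253:28)".
dm_name_for_minor() {
    awk -v want="$1" '
        {
            spec = $0
            sub(/^[^(]*\(/, "", spec)   # drop everything up to "("
            sub(/\).*$/, "", spec)      # drop ")" and anything after it
            n = split(spec, parts, "[:,] *")
            if (n == 2 && parts[2] + 0 == want + 0) print $1
        }'
}
```

Usage would be along the lines of `dmsetup ls | dm_name_for_minor 28`.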

The ocfs2 heartbeat is designed to start on mount and stop on umount.
But it may not work out that way. One handy command to use is:

$ ocfs2_hb_ctl -I -d /dev/dm-X

This will tell you the number of hb references on the device. If it is
zero, then o2hb is not heartbeating on that device.
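
If you want to script that check, something like the following could pull
out the count. It is a sketch only: the function name is made up, and it
assumes the command prints a line of the form "<UUID>: <N> refs", so verify
that against what your version actually prints.

```shell
# Hypothetical helper: extract the hb reference count from the output of
# `ocfs2_hb_ctl -I`, read on stdin.
# Assumed line format: "<UUID>: <N> refs".
hb_refs() {
    awk 'NF >= 2 && $NF == "refs" && $(NF-1) ~ /^[0-9]+$/ { print $(NF-1) }'
}
```

Usage: `ocfs2_hb_ctl -I -d /dev/dm-X | hb_refs`. Anything other than 0
after umount means a reference was left behind.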

I see an ocfs2_hb_ctl segfault. Is that consistent? If so, it could indicate
that the heartbeat stop did not succeed, in which case the above command
would return 1 hb reference. If it does, then that's your likely problem.
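
To see whether the segfault is consistent, you could count the matching
kernel lines in the log. This is a minimal sketch: the function name is made
up, and the pattern is taken from the single segfault line in your mail.

```shell
# Hypothetical helper: count kernel log lines (read on stdin) that report
# an ocfs2_hb_ctl segfault. Prints 0 when there are none.
count_hb_ctl_segfaults() {
    grep -c 'ocfs2_hb_ctl\[[0-9]*\]: segfault' || true
}
```

Usage: `count_hb_ctl_segfaults < /var/log/messages` (the log path is an
assumption; use wherever your syslog writes kernel messages). Compare the
count with the number of snapshot umounts you have done.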

Sunil

Daniel Keisling wrote:
> Greetings,
>  
> Last night I manually unpresented and deleted a LUN (a SAN snapshot)
> that was presented to one node in a four node RAC environment running
> OCFS2 v1.4.1-1.  The system then rebooted with the following error:
>  
> Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_write_timeout:166 ERROR:
> Heartbeat write timeout to device dm-24 after 120000 milliseconds
> Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_stop_all_regions:1873
> ERROR: stopping heartbeat on all active regions.
>
> I'm assuming that dm-24 was the LUN that was deleted.  Looking back in
> the syslog, I see many of these errors since the time the snapshot was
> taken until the reboot:
>  
> Oct 21 16:42:54 ausracdb03 kernel: (6624,2):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-24": another node is heartbeating in our slot!
>
>  
> The errors stopped when the node came back up.  However, after another
> snapshot was taken, the errors are back, and I'm afraid a node will
> reboot again when the LUN snapshot gets unpresented.  Here are the steps
> that happen to generate the errors:
>  
> After unmounting and deleting the LUN that contains the snapshot, I
> receive:
>  
> Oct 22 03:15:43 ausracdb03 multipathd: dm-28: umount map (uevent)
> Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at
> 0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
> Oct 22 03:15:44 ausracdb03 kernel: ocfs2: Unmounting device (253,28) on
> (node 2)
>
> The kernel will then sense that all SCSI paths to the device are gone,
> and multipathd will then mark all paths as down, which seems correct
> behavior.
>  
> After creating and presenting a new snapshot, multipath will now see the
> paths reappear, which also seems normal behavior:
>  
> Oct 22 03:16:06 ausracdb03 multipathd: sdcj: tur checker reports path is
> up
> Oct 22 03:16:06 ausracdb03 multipathd: 69:112: reinstated
> Oct 22 03:16:06 ausracdb03 multipathd: mpath0: queue_if_no_path enabled
> Oct 22 03:16:06 ausracdb03 multipathd: mpath0: Recovered to normal mode
> Oct 22 03:16:06 ausracdb03 multipathd: mpath0: remaining active paths: 1
> Oct 22 03:16:06 ausracdb03 multipathd: dm-27: add map (uevent)
> Oct 22 03:16:06 ausracdb03 multipathd: dm-27: devmap already registered
>
> However, I then get this message:
>  
> Oct 22 03:16:06 ausracdb03 kernel: (13210,2):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:06 ausracdb03 kernel: (8605,4):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
>
> I'm assuming dm-28 is the old snapshot as now there is no dm-28 in the
> multipath map (multipath -ll | grep dm-28).  The new snapshot has the
> device map name of "dm-29."
>  
> I then mount the snapshot LUN (after changing the UUID and label):
>  
> Oct 22 03:16:30 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:30 ausracdb03 kernel: ocfs2_dlm: Nodes in domain
> ("BCF5F59FF88A4BE0A75BC1491A021664"): 2
> Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_find_slot:249 slot 0
> is already allocated to this node!
> Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_check_volume:1745 File
> system was not unmounted cleanly, recovering volume.
> Oct 22 03:16:30 ausracdb03 kernel: kjournald starting.  Commit interval
> 5 seconds
> Oct 22 03:16:30 ausracdb03 kernel: ocfs2: Mounting device (253,28) on
> (node 2, slot 0) with ordered data mode.
> Oct 22 03:16:30 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> Recovering node 0 from slot 3 on device (253,28)
> Oct 22 03:16:32 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:34 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:36 ausracdb03 kernel: kjournald starting.  Commit interval
> 5 seconds
> Oct 22 03:16:36 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> Recovering node 1 from slot 2 on device (253,28)
> Oct 22 03:16:36 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:38 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:40 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:41 ausracdb03 kernel: kjournald starting.  Commit interval
> 5 seconds
> Oct 22 03:16:41 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> Recovering node 3 from slot 1 on device (253,28)
> Oct 22 03:16:42 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:44 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:46 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> ERROR: Device "dm-28": another node is heartbeating in our slot!
> Oct 22 03:16:47 ausracdb03 kernel: kjournald starting.  Commit interval
> 5 seconds
>
> We upgraded to v1.4.1-1 on Sunday, up from 1.2.8 and never received
> these errors under v1.2.8.  Once again, these snapshot LUNs are only
> presented to one node in a four node cluster.
>  
> How do I prevent this behavior?  Should I be flushing the multipath
> mapping ("multipath -F", and perhaps restarting multipathd) after
> deleting the LUN?  How do I tell OCFS2 to stop looking at the old device
> for the heartbeat?  How do I tell OCFS2 to ignore read/write timeouts to
> LUNs that are unmounted and unpresented so that it won't fence itself?
>  
> Any insight would be greatly appreciated.
>  
> TIA,
>  
> Daniel
>