[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Wed Oct 22 09:22:11 PDT 2008


Greetings,
 
Last night I manually unpresented and deleted a LUN (a SAN snapshot)
that was presented to one node in a four-node RAC environment running
OCFS2 v1.4.1-1.  The node then rebooted with the following error:
 
Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_write_timeout:166 ERROR:
Heartbeat write timeout to device dm-24 after 120000 milliseconds
Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_stop_all_regions:1873
ERROR: stopping heartbeat on all active regions.
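
For reference on the 120000 ms: if I understand the o2cb tunables
right, the disk heartbeat write timeout is (O2CB_HEARTBEAT_THRESHOLD
- 1) * 2000 ms, so that figure would correspond to a threshold of 61
in /etc/sysconfig/o2cb.  Illustrative, not copied from our actual
config:

# /etc/sysconfig/o2cb
O2CB_HEARTBEAT_THRESHOLD=61   # (61 - 1) * 2000 ms = 120000 ms timeout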

I'm assuming that dm-24 was the deleted LUN.  Looking back in the
syslog, I see many of these errors from the time the snapshot was
taken until the reboot:
 
Oct 21 16:42:54 ausracdb03 kernel: (6624,2):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-24": another node is heartbeating in our slot!
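
As far as I know, dm-24 is just the device-mapper node with minor
number 24, so a dm-N name can be tied back to a multipath map with
something like:

dmsetup info -c    # the Min column is the N in dm-N for each map
multipath -ll      # lists each mpath map with its underlying sd paths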

 
The errors stopped when the node came back up.  However, after another
snapshot was taken, the errors returned, and I'm afraid the node will
reboot again when the snapshot LUN gets unpresented.  Here are the
steps that generate the errors:
 
After unmounting and deleting the LUN that contains the snapshot, I
receive:
 
Oct 22 03:15:43 ausracdb03 multipathd: dm-28: umount map (uevent)
Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
Oct 22 03:15:44 ausracdb03 kernel: ocfs2: Unmounting device (253,28) on
(node 2)
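
Note the ocfs2_hb_ctl segfault.  Since ocfs2_hb_ctl is what stops the
disk heartbeat at unmount, I suspect the heartbeat region for the
snapshot was never torn down.  If I read ocfs2_hb_ctl(8) right,
checking and stopping it by hand would go something like:

ocfs2_hb_ctl -I -d /dev/dm-28   # show heartbeat references for device
ocfs2_hb_ctl -K -d /dev/dm-28   # stop the heartbeat on that region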

The kernel then senses that all SCSI paths to the device are gone, and
multipathd marks all paths as down, which seems like correct behavior.
 
After a new snapshot is created and presented, multipathd sees the
paths reappear, which also seems like normal behavior:
 
Oct 22 03:16:06 ausracdb03 multipathd: sdcj: tur checker reports path is
up
Oct 22 03:16:06 ausracdb03 multipathd: 69:112: reinstated
Oct 22 03:16:06 ausracdb03 multipathd: mpath0: queue_if_no_path enabled
Oct 22 03:16:06 ausracdb03 multipathd: mpath0: Recovered to normal mode
Oct 22 03:16:06 ausracdb03 multipathd: mpath0: remaining active paths: 1
Oct 22 03:16:06 ausracdb03 multipathd: dm-27: add map (uevent)
Oct 22 03:16:06 ausracdb03 multipathd: dm-27: devmap already registered

However, I then get these messages:
 
Oct 22 03:16:06 ausracdb03 kernel: (13210,2):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:06 ausracdb03 kernel: (8605,4):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!

I'm assuming dm-28 is the old snapshot, since there is no longer a
dm-28 in the multipath map (multipath -ll | grep dm-28).  The new
snapshot has the device map name of "dm-29."
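
Is there a way to list the regions the heartbeat thread is still
watching?  I would guess something like this via configfs (assuming
the cluster is named "ocfs2" per cluster.conf; <region-uuid> is a
placeholder):

ls /sys/kernel/config/cluster/ocfs2/heartbeat/
cat /sys/kernel/config/cluster/ocfs2/heartbeat/<region-uuid>/dev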
 
I then mount the snapshot LUN (after changing its UUID and label; a
sketch of that change follows the log below):
 
Oct 22 03:16:30 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:30 ausracdb03 kernel: ocfs2_dlm: Nodes in domain
("BCF5F59FF88A4BE0A75BC1491A021664"): 2
Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_find_slot:249 slot 0
is already allocated to this node!
Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_check_volume:1745 File
system was not unmounted cleanly, recovering volume.
Oct 22 03:16:30 ausracdb03 kernel: kjournald starting.  Commit interval
5 seconds
Oct 22 03:16:30 ausracdb03 kernel: ocfs2: Mounting device (253,28) on
(node 2, slot 0) with ordered data mode.
Oct 22 03:16:30 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
Recovering node 0 from slot 3 on device (253,28)
Oct 22 03:16:32 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:34 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:36 ausracdb03 kernel: kjournald starting.  Commit interval
5 seconds
Oct 22 03:16:36 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
Recovering node 1 from slot 2 on device (253,28)
Oct 22 03:16:36 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:38 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:40 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:41 ausracdb03 kernel: kjournald starting.  Commit interval
5 seconds
Oct 22 03:16:41 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
Recovering node 3 from slot 1 on device (253,28)
Oct 22 03:16:42 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:44 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:46 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!
Oct 22 03:16:47 ausracdb03 kernel: kjournald starting.  Commit interval
5 seconds
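
For reference, the UUID and label change mentioned above was along
these lines (/dev/mapper/snap_mpath is a placeholder, and the exact
flags depend on the ocfs2-tools version; see tunefs.ocfs2(8)):

tunefs.ocfs2 -U /dev/mapper/snap_mpath              # new volume UUID
tunefs.ocfs2 -L snap_backup /dev/mapper/snap_mpath  # distinct label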

We upgraded from v1.2.8 to v1.4.1-1 on Sunday and never received these
errors under v1.2.8.  Once again, these snapshot LUNs are presented to
only one node in the four-node cluster.
 
How do I prevent this behavior?  Should I be flushing the multipath
mapping ("multipath -F", and perhaps restarting multipathd) after
deleting the LUN?  How do I tell OCFS2 to stop looking at the old device
for the heartbeat?  How do I tell OCFS2 to ignore read/write timeouts to
LUNs that are unmounted and unpresented so that it won't fence itself?
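
If flushing is the right approach, I imagine the teardown before
unpresenting would look something like this (mpathN and the sd names
are placeholders; the path list comes from multipath -ll):

umount /mnt/snapshot                    # placeholder mount point
multipath -f mpathN                     # flush just that map
                                        # (-F flushes all unused maps)
echo 1 > /sys/block/sdX/device/delete   # remove each SCSI path
echo 1 > /sys/block/sdY/device/delete   # that belonged to the map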
 
Any insight would be greatly appreciated.
 
TIA,
 
Daniel
