[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Fri Oct 24 08:50:17 PDT 2008


See my answers inline... 

> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Wednesday, October 22, 2008 7:52 PM
> To: Daniel Keisling
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> 
> Are you mounting the snapshotted lun on more than one node? If not,
> then use tunefs.ocfs2 to also make it a local mount. That is, do it at
> the same time you are changing the label and uuid. This will avoid the
> problem, as the fs will not start heartbeating for local mounts.

Yes, I am mounting the snapshot on a single node.  I'll use tunefs.ocfs2
to change the mount type for now.
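
For the archives, here is roughly what I plan to run when refreshing the
snapshot, based on my reading of the tunefs.ocfs2 man page (-U resets
the uuid, -L sets the label, -M local marks the volume as a local
mount); the label and mount point below are just placeholders:

# on the single node that mounts the snapshot, with the volume unmounted:
tunefs.ocfs2 -U -L snapvol -M local /dev/dm-29
mount -t ocfs2 /dev/dm-29 /mnt/snap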

> 
> However, this just avoids the issue.
> 
> To resolve it, I'll need more info. For starters, walk through the
> process and upload the messages file. Indicate the device you are
> snapshotting, etc. In your description you mention assuming you were
> snapshotting a particular device. Don't assume... because I don't know
> what to make of it.

I'm positive I'm snapshotting the same device because I reference the
LUN via the SAN, not the host. My original email explained the process
and the relevant syslog entries.  If you still need them, please let me
know.

> 
> The ocfs2 heartbeat is designed to start on mount and stop on umount.
> But it may not work out that way. One handy command to use is:
> 
> $ ocfs2_hb_ctl -I -d /dev/dm-X
> 
> This will tell you the number of hb references on the device. If it is
> zero, then o2hb is not heartbeating on that device.

Oct 23 08:53:21 ausracdb03 kernel: (2410,3):o2hb_do_disk_heartbeat:770
ERROR: Device "dm-28": another node is heartbeating in our slot!

[root at ausracdb03 ~]# ocfs2_hb_ctl -I -d /dev/dm-28
289FD533334645C5A88FD715FC0EEF85: 1 refs
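
So the stale region is apparently still holding a reference even though
the old LUN is gone.  If that lone reference is what keeps o2hb bound to
the dead device, I'm assuming I can drop the region by hand with
ocfs2_hb_ctl's kill option, addressing it by uuid since the device node
is no longer valid (please correct me if that is unsafe):

# stop the stale heartbeat region by its uuid
ocfs2_hb_ctl -K -u 289FD533334645C5A88FD715FC0EEF85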


> 
> I see an ocfs2_hb_ctl segfault. Is that consistent? If so, it could
> indicate that stopping the heartbeat was not successful and that the
> above command would return 1 hb reference. If so, then that's your
> likely problem.


Yes, it segfaults every night (I take two snapshots per night):

[root at ausracdb03 log]# grep segfault /var/log/messages
Oct 21 03:15:47 ausracdb03 kernel: ocfs2_hb_ctl[4197]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fffefd623e8 error 4
Oct 21 03:17:43 ausracdb03 kernel: ocfs2_hb_ctl[8002]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff1a8f9318 error 4
Oct 21 16:43:30 ausracdb03 kernel: ocfs2_hb_ctl[16933]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff816aa558 error 4
Oct 21 16:43:31 ausracdb03 kernel: ocfs2_hb_ctl[16950]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fffcb162b88 error 4
Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
Oct 22 03:17:46 ausracdb03 kernel: ocfs2_hb_ctl[11294]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff85549f68 error 4
Oct 23 03:15:51 ausracdb03 kernel: ocfs2_hb_ctl[32555]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff8fefe498 error 4
Oct 23 03:17:40 ausracdb03 kernel: ocfs2_hb_ctl[3756]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff99bb25d8 error 4
Oct 24 03:15:47 ausracdb03 kernel: ocfs2_hb_ctl[15664]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007ffff4254aa8 error 4
Oct 24 03:17:43 ausracdb03 kernel: ocfs2_hb_ctl[18029]: segfault at
0000000000000000 rip 0000000000428fa0 rsp 00007fff75055a78 error 4


This began when I upgraded to v1.4.1-1 from v1.2.8.
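
One more data point: the faulting rip is identical in every occurrence
(0000000000428fa0), so it looks like the same code path each time.
Assuming the binary is /sbin/ocfs2_hb_ctl and its debug symbols are
installed, something like this should map the address to a function for
you:

# resolve the constant faulting address against the binary
addr2line -f -e /sbin/ocfs2_hb_ctl 0x428fa0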

Thanks,

Daniel



> 
> Sunil
> 
> 
> Daniel Keisling wrote:
> > Greetings,
> >  
> > Last night I manually unpresented and deleted a LUN (a SAN snapshot)
> > that was presented to one node in a four-node RAC environment running
> > OCFS2 v1.4.1-1.  The system then rebooted with the following error:
> >  
> > Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_write_timeout:166 ERROR:
> > Heartbeat write timeout to device dm-24 after 120000 milliseconds
> > Oct 21 16:45:34 ausracdb03 kernel: (27,1):o2hb_stop_all_regions:1873
> > ERROR: stopping heartbeat on all active regions.
> >
> > I'm assuming that dm-24 was the LUN that was deleted.  Looking back
> > in the syslog, I see many of these errors from the time the snapshot
> > was taken until the reboot:
> >  
> > Oct 21 16:42:54 ausracdb03 kernel: (6624,2):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-24": another node is heartbeating in our slot!
> >
> >  
> > The errors stopped when the node came back up.  However, after
> > another snapshot was taken, the errors are back, and I'm afraid a
> > node will reboot again when the LUN snapshot gets unpresented.  Here
> > are the steps that generate the errors:
> >  
> > After unmounting and deleting the LUN that contains the snapshot, I
> > receive:
> >  
> > Oct 22 03:15:43 ausracdb03 multipathd: dm-28: umount map (uevent)
> > Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at
> > 0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
> > Oct 22 03:15:44 ausracdb03 kernel: ocfs2: Unmounting device (253,28) on
> > (node 2)
> >
> > The kernel will then sense that all SCSI paths to the device are gone,
> > and multipathd will then mark all paths as down, which seems correct
> > behavior.
> >  
> > After creating and presenting a new snapshot, multipath will now see
> > the paths reappear, which also seems normal behavior:
> >  
> > Oct 22 03:16:06 ausracdb03 multipathd: sdcj: tur checker reports path is up
> > Oct 22 03:16:06 ausracdb03 multipathd: 69:112: reinstated
> > Oct 22 03:16:06 ausracdb03 multipathd: mpath0: queue_if_no_path enabled
> > Oct 22 03:16:06 ausracdb03 multipathd: mpath0: Recovered to normal mode
> > Oct 22 03:16:06 ausracdb03 multipathd: mpath0: remaining active paths: 1
> > Oct 22 03:16:06 ausracdb03 multipathd: dm-27: add map (uevent)
> > Oct 22 03:16:06 ausracdb03 multipathd: dm-27: devmap already registered
> >
> > However, I then get this message:
> >  
> > Oct 22 03:16:06 ausracdb03 kernel: (13210,2):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:06 ausracdb03 kernel: (8605,4):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> >
> > I'm assuming dm-28 is the old snapshot, as there is now no dm-28 in
> > the multipath map (multipath -ll | grep dm-28).  The new snapshot has
> > the device map name of "dm-29."
> >  
> > I then mount the snapshot LUN (after changing the UUID and label):
> >  
> > Oct 22 03:16:30 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:30 ausracdb03 kernel: ocfs2_dlm: Nodes in domain
> > ("BCF5F59FF88A4BE0A75BC1491A021664"): 2
> > Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_find_slot:249 slot 0
> > is already allocated to this node!
> > Oct 22 03:16:30 ausracdb03 kernel: (9860,1):ocfs2_check_volume:1745 File
> > system was not unmounted cleanly, recovering volume.
> > Oct 22 03:16:30 ausracdb03 kernel: kjournald starting.  Commit interval
> > 5 seconds
> > Oct 22 03:16:30 ausracdb03 kernel: ocfs2: Mounting device (253,28) on
> > (node 2, slot 0) with ordered data mode.
> > Oct 22 03:16:30 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> > Recovering node 0 from slot 3 on device (253,28)
> > Oct 22 03:16:32 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:34 ausracdb03 kernel: (9861,2):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:36 ausracdb03 kernel: kjournald starting.  Commit interval
> > 5 seconds
> > Oct 22 03:16:36 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> > Recovering node 1 from slot 2 on device (253,28)
> > Oct 22 03:16:36 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:38 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:40 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:41 ausracdb03 kernel: kjournald starting.  Commit interval
> > 5 seconds
> > Oct 22 03:16:41 ausracdb03 kernel: (9939,1):ocfs2_replay_journal:1076
> > Recovering node 3 from slot 1 on device (253,28)
> > Oct 22 03:16:42 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:44 ausracdb03 kernel: (9861,1):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:46 ausracdb03 kernel: (9861,3):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-28": another node is heartbeating in our slot!
> > Oct 22 03:16:47 ausracdb03 kernel: kjournald starting.  Commit interval
> > 5 seconds
> >
> > We upgraded from v1.2.8 to v1.4.1-1 on Sunday and never received
> > these errors under v1.2.8.  Once again, these snapshot LUNs are only
> > presented to one node in a four-node cluster.
> >  
> > How do I prevent this behavior?  Should I be flushing the multipath
> > mapping ("multipath -F", and perhaps restarting multipathd) after
> > deleting the LUN?  How do I tell OCFS2 to stop looking at the old
> > device for the heartbeat?  How do I tell OCFS2 to ignore read/write
> > timeouts to LUNs that are unmounted and unpresented so that it won't
> > fence itself?
> >  
> > Any insight would be greatly appreciated.
> >  
> > TIA,
> >  
> > Daniel
> >   
> 
> 
