[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Fri Nov 21 11:38:41 PST 2008


Sorry for the delay in getting back to you.

The segfault during umount has never produced a core file.

I tried to stop the heartbeat again and the command completed
successfully, but the messages are still appearing in syslog.  Running
the command a second time returns:

[root@ausracdbd01 ~]# ocfs2_hb_ctl -K -d /dev/dm-30 o2cb
ocfs2_hb_ctl: Unable to access cluster service while stopping heartbeat


Please see attached for the log you requested.
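
For anyone else collecting the same data, the capture loop Sunil suggested
can be wrapped in a small shell function. This is only a sketch: the
function name collect_hb and the HB_CMD override are made up for
illustration and are not part of ocfs2-tools; the only real command it
runs is debugfs.ocfs2 -R "hb" from the quoted message below.

```shell
#!/bin/sh
# Sketch: append a timestamp and a heartbeat dump to a file, once per interval.
# collect_hb and HB_CMD are hypothetical names used only for this example.
collect_hb() {
    device=$1               # e.g. the dm device named in the syslog error
    count=${2:-30}          # number of samples (the original loop used 30)
    out=${3:-/tmp/hb.out}   # output file (the original loop used /tmp/hb.out)
    interval=${4:-1}        # seconds between samples
    i=0
    while [ "$i" -lt "$count" ]; do
        date >>"$out"
        # HB_CMD lets the dump command be swapped out for a dry run
        ${HB_CMD:-debugfs.ocfs2} -R "hb" "$device" >>"$out" 2>&1
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Usage would be along the lines of: collect_hb /dev/dm-30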

Daniel


> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Thursday, October 30, 2008 5:42 PM
> To: Daniel Keisling
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> 
> So manually stopping the heartbeat worked.
> 
> Did you catch a coredump for the segfault during umount?
> 
> There is a small difference between the heartbeat stop that is called
> as part of umount and the one called by hand, but I have not
> been able to figure out the source of the segfault.
> 
> The above coredump will help.
> 
> The other thing you could do is run the following command when you
> see the "another node heartbeating..." message.
> $ for i in `seq 30` ; do date >>/tmp/hb.out ; debugfs.ocfs2 -R "hb" /dev/dmX >>/tmp/hb.out ; sleep 1; done;
> 
> Replace the device name with the one that is in the logs. Email me the
> output.
> 
> Sunil
> 
> Daniel Keisling wrote:
> > Sunil,
> >
> > I manually killed the heartbeat via 'ocfs2_hb_ctl -K -d ... o2cb' but it
> > did not generate a core.  It did, however, stop the "ERROR: Device
> > "dm-30": another node is heartbeating in our slot!" messages.
> >
> > After remounting the device, the "ERROR: Device "dm-30": another node is
> > heartbeating in our slot!" messages return to syslog.  Unmounting the
> > device brings a segfault:
> >
> > Oct 30 10:59:13 ausracdbd01 multipathd: dm-30: umount map (uevent)
> > Oct 30 10:59:14 ausracdbd01 kernel: ocfs2_hb_ctl[11351]: segfault at
> > 0000000000000000 rip 0000000000428fa0 rsp 00007fffe6710138 error 4
> > Oct 30 10:59:14 ausracdbd01 kernel: ocfs2: Unmounting device (253,30) on
> > (node 0)
> >
> > The "ERROR: Device "dm-30": another node is heartbeating in our slot!"
> > messages keep flowing through syslog until I manually remove the
> > heartbeat.
> >
> > I can reproduce this over and over.
> >
> > TIA,
> >
> > Daniel
> >
> >   
> >> -----Original Message-----
> >> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> >> Sent: Friday, October 24, 2008 2:46 PM
> >> To: Daniel Keisling
> >> Cc: ocfs2-users at oss.oracle.com
> >> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> >>
> >> So that's the problem. The heartbeat is not stopping because of the
> >> segfault. I reviewed the code change in this tool (1.2.7 to 1.4.1)
> >> and it is quite limited. As in, I have no idea as to why it is
> >> segfaulting.
> >>
> >> Now you could stop the heartbeat manually. You have to be careful
> >> though because stopping it for a mounted volume will be very
> >> problematic to say the least. But one good reason to do it manually
> >> would be to catch the coredump.... which could help tell us what the
> >> problem is.
> >>
> >> If you are up for it, do:
> >> $ ulimit -c unlimited
> >> $ ocfs2_hb_ctl -K -d /dev/sdX o2cb
> >>
> >> Run it on an unmounted volume that has a reference left over. Use -I
> >> to view the number of references.
> >>
> >> Sunil
> >>
> >> Daniel Keisling wrote:
> >>> Oct 23 08:53:21 ausracdb03 kernel: (2410,3):o2hb_do_disk_heartbeat:770
> >>> ERROR: Device "dm-28": another node is heartbeating in our slot!
> >>>
> >>> [root@ausracdb03 ~]# ocfs2_hb_ctl -I -d /dev/dm-28
> >>> 289FD533334645C5A88FD715FC0EEF85: 1 refs
> >>>   
> >>> Yes, it segfaults every night (I take two snapshots per night):
> >>>
> >>> [root@ausracdb03 log]# grep segfault /var/log/messages
> >>> Oct 21 03:15:47 ausracdb03 kernel: ocfs2_hb_ctl[4197]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fffefd623e8 error 4
> >>> Oct 21 03:17:43 ausracdb03 kernel: ocfs2_hb_ctl[8002]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff1a8f9318 error 4
> >>> Oct 21 16:43:30 ausracdb03 kernel: ocfs2_hb_ctl[16933]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff816aa558 error 4
> >>> Oct 21 16:43:31 ausracdb03 kernel: ocfs2_hb_ctl[16950]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fffcb162b88 error 4
> >>> Oct 22 03:15:44 ausracdb03 kernel: ocfs2_hb_ctl[7721]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff88a7efb8 error 4
> >>> Oct 22 03:17:46 ausracdb03 kernel: ocfs2_hb_ctl[11294]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff85549f68 error 4
> >>> Oct 23 03:15:51 ausracdb03 kernel: ocfs2_hb_ctl[32555]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff8fefe498 error 4
> >>> Oct 23 03:17:40 ausracdb03 kernel: ocfs2_hb_ctl[3756]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff99bb25d8 error 4
> >>> Oct 24 03:15:47 ausracdb03 kernel: ocfs2_hb_ctl[15664]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007ffff4254aa8 error 4
> >>> Oct 24 03:17:43 ausracdb03 kernel: ocfs2_hb_ctl[18029]: segfault at
> >>> 0000000000000000 rip 0000000000428fa0 rsp 00007fff75055a78 error 4
> >>>
> >>>
> >>> This began when I upgraded to v1.4.1-1 from v1.2.8.
> >>>
> >>> Thanks,
> >>>
> >>> Daniel
> >>>   
> >>>       
> >>
> >>     
> >
> > ______________________________________________________________________
> > This email transmission and any documents, files or previous email
> > messages attached to it may contain information that is confidential or
> > legally privileged. If you are not the intended recipient or a person
> > responsible for delivering this transmission to the intended recipient,
> > you are hereby notified that you must not read this transmission and
> > that any disclosure, copying, printing, distribution or use of this
> > transmission is strictly prohibited. If you have received this transmission
> > in error, please immediately notify the sender by telephone or return email
> > and delete the original transmission and its attachments without reading
> > or saving in any manner.
> >
> >   
> 
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hb.out
Type: application/octet-stream
Size: 5820 bytes
Desc: hb.out
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20081121/38997bd5/attachment.obj 