[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
Daniel Keisling
daniel.keisling at austin.ppdi.com
Thu Dec 4 10:15:09 PST 2008
I've restarted the box and the heartbeat threads and messages are now
gone. I've taken six snapshots and unmounted the filesystems several
times and the segmentation faults do not occur.
Thank you so much for looking into this, finding the problem, and
getting me a fix. I look forward to the 1.4.2 release.
Daniel
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Thursday, December 04, 2008 11:45 AM
> To: Daniel Keisling
> Cc: Joel Becker
> Subject: Re: [Ocfs2-users] Another node is heartbeating in
> our slot! errors with LUN removal/addition
>
> These could be hb threads that were not killed when you
> umounted those volumes. Have you restarted the box
> since you cleaned out those devices?
>
> Daniel Keisling wrote:
> > Sunil,
> >
> > I edited /dev/sdo and /dev/sdr and the rest of the corrupted devices
> > disappeared, so there are no more corrupted OCFS2 filesystems when
> > doing a 'mounted.ocfs2 -f'. However, the 'heartbeating in our slot'
> > error messages are still coming. The devices in question are not in
> > the device-mapper maps and are not mounted, but do appear in
> > mounted.ocfs2. Do I need to do the same procedure and wipe out the
> > signature?
> >
> > Dec 4 10:29:35 ausracdbd01 kernel: (26064,2):o2hb_do_disk_heartbeat:770
> > ERROR: Device "dm-43": another node is heartbeating in our slot!
> >
> > [root@ausracdbd01 ~]# multipath -ll | grep dm-43
> > [root@ausracdbd01 ~]#
> >
> > [root@ausracdbd01 ~]# mounted.ocfs2 -f | grep dm-43
> > /dev/dm-43 ocfs2 ausracdbd01
> >
> > [root@ausracdbd01 ~]# mounted.ocfs2 -d | grep dm-43
> > /dev/dm-43 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> >
> > [root@ausracdbd01 ~]# mounted.ocfs2 -d | grep ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdw1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdat1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdbq1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdcn1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sddk1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdeh1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdfe1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/sdgb1 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> > /dev/dm-43 ocfs2 ce7c5099-145f-457b-9644-923202450f31
> >
> > Daniel
> >
> >
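The manual grep by UUID above can be folded into one filter. This is only a minimal sketch: it assumes the three-column "DEVICE FS UUID" layout shown in the listing above, and same_uuid_devices is a hypothetical helper name, not part of ocfs2-tools. It reads text from stdin and never touches a device.

```shell
# List every device that shares its UUID with the given device.
# Input on stdin: lines of "DEVICE FS UUID", as in the listing above.
# Lines whose second column is not "ocfs2" (e.g. a header) are skipped.
same_uuid_devices() {
    awk -v target="$1" '
        $2 == "ocfs2" { uuid[$1] = $3; order[++n] = $1 }
        END {
            # Fail if the device we were asked about never appeared.
            if (!(target in uuid)) exit 1
            for (i = 1; i <= n; i++)
                if (uuid[order[i]] == uuid[target]) print order[i]
        }
    '
}
```

Possible usage: `mounted.ocfs2 -d | same_uuid_devices /dev/dm-43` would print the sd paths that carry the same UUID as dm-43, in the order they were listed.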
> >> -----Original Message-----
> >> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> >> Sent: Wednesday, December 03, 2008 3:01 PM
> >> To: Daniel Keisling
> >> Cc: Joel Becker; Sunil Mushran
> >> Subject: Re: [Ocfs2-users] Another node is heartbeating in
> >> our slot! errors with LUN removal/addition
> >>
> >> OK... so now I know what the problem is. Filed a bugzilla for this.
> >> http://oss.oracle.com/bugzilla/show_bug.cgi?id=1053
> >>
> >> Instead of waiting for the fix, it may be quicker if you fix this
> >> by hand.
> >>
> >> Do you have a binary editor? While we could script this, it will be
> >> safer if you _fix_ this manually.
> >>
> >> Say you had bvi. The steps for a 4K blocksize fs would be:
> >>
> >> $ bvi -b 8192 -s 512 /dev/sdo
> >>
> >> You will see the OCFSV2 signature at the very start. Edit 4F (O) to
> >> 00 (.), or to anything other than 4F. In short, we want to clobber
> >> the signature. This needs to be repeated for each volume below. If
> >> you don't see the signature, abort. It means the blocksize is less
> >> than 4K... say 2K. In that case, the command becomes
> >> "bvi -b 4096 -s 512 DEVICE".
> >>
> >> You will know it is fixed when "mounted.ocfs2 -d" does not show any
> >> of these volumes.
> >>
> >> Sunil
> >>
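Before opening a binary editor, it may help to confirm read-only where the signature actually sits. The sketch below assumes the signature is the six ASCII bytes OCFSV2 at byte offset 2 x blocksize, which matches the 8192 (4K) and 4096 (2K) offsets in the bvi commands above. The function names are hypothetical, and dd is only ever given if=, so nothing is written to the device.

```shell
# Print up to 6 bytes found at byte offset 2 * BLOCKSIZE of a device or
# image file. NUL bytes are dropped so the shell can compare the result.
ocfs2_sig_at() {
    dd if="$1" bs=1 skip=$(($2 * 2)) count=6 2>/dev/null | tr -d '\000'
}

# Succeed if the OCFSV2 signature is present for the given block size.
has_ocfs2_sig() {
    [ "$(ocfs2_sig_at "$1" "$2")" = "OCFSV2" ]
}

# Print the first candidate block size whose offset holds the signature.
# The candidate list is an assumption of this sketch.
find_ocfs2_sig() {
    for bs in 512 1024 2048 4096; do
        if has_ocfs2_sig "$1" "$bs"; then
            echo "$bs"
            return 0
        fi
    done
    return 1
}
```

Possible usage: `find_ocfs2_sig /dev/sdo` would print the block size to use when computing the bvi offset. Once the first byte has been clobbered it should print nothing and return non-zero, which should agree with the device no longer appearing in "mounted.ocfs2 -d".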
> >> Daniel Keisling wrote:
> >>
> >>> [root@ausracdbd01 ~]# debugfs.ocfs2 -R "stat //heartbeat" /dev/sdo
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>>
> >>> [root@ausracdbd01 ~]# mount -t debugfs debugfs /debug
> >>> [root@ausracdbd01 ~]# debugfs.ocfs2 -R "stat //heartbeat" /dev/sdo
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>>
> >>> [root@ausracdbd01 ~]# for d in o r al ao bi bl cf ci dc df dz ec ew ez ft fw ; do
> >>> > echo Device /dev/sd${d} ;
> >>> > debugfs.ocfs2 -R "stat //heartbeat" /dev/sd${d} ;
> >>> > done ;
> >>> Device /dev/sdo
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdr
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdal
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdao
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdbi
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdbl
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdcf
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdci
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sddc
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sddf
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sddz
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdec
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdew
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdez
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdft
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> Device /dev/sdfw
> >>> stat: OCFS2 directory corrupted '//heartbeat'
> >>> [root@ausracdbd01 ~]#
> >>>
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> >>>> Sent: Wednesday, December 03, 2008 1:07 PM
> >>>> To: Daniel Keisling
> >>>> Subject: Re: [Ocfs2-users] Another node is heartbeating in
> >>>> our slot! errors with LUN removal/addition
> >>>>
> >>>> I think I know what the issue is.
> >>>>
> >>>> Can you run the following on your box?
> >>>> $ debugfs.ocfs2 -R "stat //heartbeat" /dev/sdo
> >>>>
> >>>> Email me the output.
> >>>>
> >>>> While we are at it, why don't you run this script as it may save
> >>>> us a roundtrip.
> >>>>
> >>>> $ for d in o r al ao bi bl cf ci dc df dz ec ew ez ft fw ; do
> >>>> echo Device /dev/sd${d} ;
> >>>> debugfs.ocfs2 -R "stat //heartbeat" /dev/sd${d} ;
> >>>> done ;
> >>>>
> >>>> All this does is dump the inode of the heartbeat file. I suspect
> >>>> these devices. Meaning no writing... only reading.
> >>>>
> >>>> Sunil
> >>>>
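With this many devices, the loop above could also be wrapped so that it prints one verdict per device. This is a rough sketch only: it assumes a damaged volume makes the tool print a line containing "corrupted", as in the output earlier in this thread. The DEBUGFS variable and the heartbeat_report name are inventions of the sketch, not anything the tool reads or provides. Like the loop above, it only reads.

```shell
# Tool to invoke; overridable so the loop can be dry-run with a stub.
# (This variable is an assumption of the sketch.)
DEBUGFS=${DEBUGFS:-debugfs.ocfs2}

# Print "DEVICE corrupted" or "DEVICE ok" for each /dev/sdX suffix given.
heartbeat_report() {
    for d in "$@"; do
        dev=/dev/sd$d
        # Capture both stdout and stderr, since the error may go to either.
        if "$DEBUGFS" -R "stat //heartbeat" "$dev" 2>&1 | grep -q "corrupted"; then
            echo "$dev corrupted"
        else
            echo "$dev ok"
        fi
    done
}
```

Possible usage: `heartbeat_report o r al ao` would check /dev/sdo, /dev/sdr, /dev/sdal and /dev/sdao in turn. An "ok" here only means the word "corrupted" did not appear in the output, so the raw output is still worth reading.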
> >>>> Daniel Keisling wrote:
> >>>>
> >>>>
> >>>>> Yes, please do. I have development time on the machine for the
> >>>>> next couple of days.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> >>>>>> Sent: Tuesday, December 02, 2008 8:16 PM
> >>>>>> To: Daniel Keisling
> >>>>>> Cc: ocfs2-users at oss.oracle.com
> >>>>>> Subject: Re: [Ocfs2-users] Another node is heartbeating in
> >>>>>> our slot! errors with LUN removal/addition
> >>>>>>
> >>>>>> Yes. Your diagnosis is correct.
> >>>>>>
> >>>>>> The ocfs2_hb_ctl segfault is not making any sense. The coredump
> >>>>>> has not been helpful. I may have to send you a debug build.
> >>>>>> strace also led me down a blind alley.
> >>>>>>
> >>>>>> Let me know if you will be willing to copy a debug build of the
> >>>>>> ocfs2_hb_ctl util. The coredump from that should help us nail
> >>>>>> down this issue.
> >>>>>>
> >>>>>> Sunil
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>> ______________________________________________________________________
> >>> This email transmission and any documents, files or previous email
> >>> messages attached to it may contain information that is confidential or
> >>> legally privileged. If you are not the intended recipient or a person
> >>> responsible for delivering this transmission to the intended recipient,
> >>> you are hereby notified that you must not read this transmission and
> >>> that any disclosure, copying, printing, distribution or use of this
> >>> transmission is strictly prohibited. If you have received this transmission
> >>> in error, please immediately notify the sender by telephone or return email
> >>> and delete the original transmission and its attachments without reading
> >>> or saving in any manner.
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> >
>
>
>