[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Brian Kroth bpkroth at gmail.com
Fri Dec 5 06:11:25 PST 2008


Just for clarity, can you post the proper sequence you're now using to
take SAN-based snapshots?  I'd like to try this on a new cluster I'm
setting up.

Thanks,
Brian

Daniel Keisling <daniel.keisling at austin.ppdi.com> 2008-12-04 12:15:
> I've restarted the box and the heartbeat threads and messages are now
> gone.  I've taken six snapshots and unmounted the filesystems several
> times and the segmentation faults do not occur.  
> 
> Thank you so much for looking into this, finding the problem, and
> getting me a fix.  I look forward to the 1.4.2 release.
> 
> Daniel
> 
> > -----Original Message-----
> > From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> > Sent: Thursday, December 04, 2008 11:45 AM
> > To: Daniel Keisling
> > Cc: Joel Becker
> > Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> > 
> > These could be hb threads that were not killed when you
> > umounted those volumes. Have you restarted the box
> > since you cleaned out those devices?
> > 
> > Daniel Keisling wrote:
> > > Sunil,
> > >
> > > I edited /dev/sdo and /dev/sdr, and the rest of the corrupted devices
> > > disappeared, so there are no more corrupted OCFS2 filesystems when
> > > doing a 'mounted.ocfs2 -f'.  However, the 'heartbeating in our slot'
> > > error messages are still coming.  The devices in question are not in
> > > the device-mapper maps and are not mounted, but do appear in
> > > mounted.ocfs2.  Do I need to do the same procedure and wipe out the
> > > signature?
> > >
> > > Dec  4 10:29:35 ausracdbd01 kernel: (26064,2):o2hb_do_disk_heartbeat:770
> > > ERROR: Device "dm-43": another node is heartbeating in our slot!
> > >
> > > [root at ausracdbd01 ~]# multipath -ll | grep dm-43
> > > [root at ausracdbd01 ~]#
> > >
> > > [root at ausracdbd01 ~]# mounted.ocfs2 -f | grep dm-43
> > > /dev/dm-43            ocfs2  ausracdbd01
> > >
> > > [root at ausracdbd01 ~]# mounted.ocfs2 -d | grep dm-43
> > > /dev/dm-43            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > >
> > > [root at ausracdbd01 ~]# mounted.ocfs2 -d | grep ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdw1             ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdat1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdbq1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdcn1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sddk1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdeh1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdfe1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/sdgb1            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > > /dev/dm-43            ocfs2  ce7c5099-145f-457b-9644-923202450f31
> > >
> > > Daniel 
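The listing above shows one UUID turning up on nine device nodes. With many volumes it can help to group the "mounted.ocfs2 -d" output by UUID so such duplicates are visible at a glance. A minimal sketch, assuming the three-column layout shown above (device, fs type, UUID); group_by_uuid is a made-up helper name, not something shipped with ocfs2-tools:

```shell
# group_by_uuid (hypothetical helper): reads "mounted.ocfs2 -d" style lines
# on stdin and prints one line per UUID followed by every device carrying it.
# Assumes the columns are: device, fs type, UUID (as in the output above).
group_by_uuid() {
    awk '$2 == "ocfs2" { devs[$3] = devs[$3] " " $1 }
         END { for (u in devs) print u ":" devs[u] }'
}

# Intended use on a node with ocfs2-tools installed (not run here):
#   mounted.ocfs2 -d | group_by_uuid
```

A UUID listed with a dm-NN node that "multipath -ll" no longer knows about would be a candidate for the stale heartbeat region described in this thread. The helper only reads text; it does not touch any device.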
> > >
> > >   
> > >> -----Original Message-----
> > >> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> > >> Sent: Wednesday, December 03, 2008 3:01 PM
> > >> To: Daniel Keisling
> > >> Cc: Joel Becker; Sunil Mushran
> > >> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> > >>
> > >> OK... so now we know what the problem is. Filed a bugzilla for this.
> > >> http://oss.oracle.com/bugzilla/show_bug.cgi?id=1053
> > >>
> > >> Instead of waiting for the fix, it may be quicker if you fix this
> > >> by hand.
> > >>
> > >> Do you have a binary editor? While we could script this, it will be
> > >> safer if you _fix_ this manually.
> > >>
> > >> Say you had bvi. The steps for a 4K blocksize fs would be:
> > >>
> > >> $ bvi -b 8192 -s 512 /dev/sdo
> > >>
> > >> You will see the OCFSV2 signature at the very start. Edit 4F (O) to
> > >> 00 (.), or to something other than O. In short, we want to clobber
> > >> the signature. This needs to be repeated for each volume below. If
> > >> you don't see the signature, abort. It means the blocksize is less
> > >> than 4K... say 2K. In that case, it will become
> > >> "bvi -b 4096 -s 512 DEVICE".
> > >>
> > >> You will know it is fixed when "mounted.ocfs2 -d" does not show any
> > >> of these volumes.
> > >>
> > >> Sunil
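Before opening a device in a binary editor, it may be worth confirming which offset actually holds the signature. The following is only a read-only sketch of that check, using the two offsets from the steps above (8192 for a 4K blocksize, 4096 for 2K). find_ocfs2_sig is a hypothetical helper name, not part of ocfs2-tools, and other blocksizes are not covered:

```shell
# find_ocfs2_sig (hypothetical helper): prints the byte offset at which the
# 6-byte "OCFSV2" signature is found, or returns non-zero if it is absent.
# It only reads from the device; nothing is written.
find_ocfs2_sig() {
    dev=$1
    # Offsets taken from the thread: 8192 (4K blocksize), 4096 (2K blocksize).
    for off in 8192 4096; do
        # Read 6 bytes at the offset; drop NUL bytes so the shell can hold the result.
        sig=$(dd if="$dev" bs=1 skip="$off" count=6 2>/dev/null | tr -d '\000')
        if [ "$sig" = "OCFSV2" ]; then
            echo "$off"
            return 0
        fi
    done
    return 1
}

# Intended use as root on a suspect device (not run here):
#   find_ocfs2_sig /dev/sdo && echo "signature still present"
```

If the helper prints 8192, the "bvi -b 8192" form applies; if it prints 4096, the "bvi -b 4096" form does. After the signature has been clobbered the helper should return non-zero, which is consistent with the device dropping out of "mounted.ocfs2 -d".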
> > >>
> > >> Daniel Keisling wrote:
> > >>     
> > >>> [root at ausracdbd01 ~]# debugfs.ocfs2 -R "stat //heartbeat" /dev/sdo
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>>
> > >>> [root at ausracdbd01 ~]# mount -t debugfs debugfs /debug
> > >>> [root at ausracdbd01 ~]# debugfs.ocfs2 -R "stat //heartbeat" /dev/sdo
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>>
> > >>> [root at ausracdbd01 ~]# for d in o r al ao bi bl cf ci dc df dz ec ew ez ft fw ; do
> > >>> > echo Device /dev/sd${d} ;
> > >>> > debugfs.ocfs2 -R "stat //heartbeat" /dev/sd${d} ;
> > >>> > done ;
> > >>> Device /dev/sdo
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdr
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdal
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdao
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdbi
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdbl
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdcf
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdci
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sddc
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sddf
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sddz
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdec
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdew
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdez
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdft
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> Device /dev/sdfw
> > >>> stat: OCFS2 directory corrupted '//heartbeat'
> > >>> [root at ausracdbd01 ~]#
> > >>>
> > >>>   
> > >>>       
> > >>>> -----Original Message-----
> > >>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> > >>>> Sent: Wednesday, December 03, 2008 1:07 PM
> > >>>> To: Daniel Keisling
> > >>>> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> > >>>>
> > >>>> I think I know what the issue is.
> > >>>>
> > >>>> Can you run the following on your box?
> > >>>> $ debugfs.ocfs2 -R "stat //heartbeat"  /dev/sdo
> > >>>>
> > >>>> Email me the output.
> > >>>>
> > >>>> While we are at it, why don't you run this script as it may save
> > >>>> us a roundtrip.
> > >>>>
> > >>>> $ for d in o r al ao bi bl cf ci dc df dz ec ew ez ft fw ; do
> > >>>>   echo Device /dev/sd${d} ; 
> > >>>>   debugfs.ocfs2 -R "stat //heartbeat" /dev/sd${d} ;
> > >>>>   done ;
> > >>>>
> > >>>> All this does is dump the inode of the heartbeat file. I suspect
> > >>>> these devices. Meaning no writing... only reading.
> > >>>>
> > >>>> Sunil
> > >>>>
> > >>>> Daniel Keisling wrote:
> > >>>>     
> > >>>>         
> > >>>>> Yes, please do.  I have development time on the machine for the
> > >>>>> next couple of days.
> > >>>>>
> > >>>>>   
> > >>>>>       
> > >>>>>           
> > >>>>>> -----Original Message-----
> > >>>>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> > >>>>>> Sent: Tuesday, December 02, 2008 8:16 PM
> > >>>>>> To: Daniel Keisling
> > >>>>>> Cc: ocfs2-users at oss.oracle.com
> > >>>>>> Subject: Re: [Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition
> > >>>>>>
> > >>>>>> Yes. Your diagnosis is correct.
> > >>>>>>
> > >>>>>> ocfs2_hb_ctl segfault is not making any sense. The coredump has not
> > >>>>>> been helpful. I may have to send you a debug build. strace also led
> > >>>>>> me down a blind alley.
> > >>>>>>
> > >>>>>> Let me know if you will be willing to copy a debug build of the
> > >>>>>> ocfs2_hb_ctl util. The coredump from that should help us nail down
> > >>>>>> this issue.
> > >>>>>>
> > >>>>>> Sunil
> > >>>>>>     
> > >>>>>>         
> > >>>>>>             
> > >>>>     
> > >>>>         
> > >>>       
> > >>> ______________________________________________________________________
> > >>> This email transmission and any documents, files or previous email
> > >>> messages attached to it may contain information that is confidential or
> > >>> legally privileged. If you are not the intended recipient or a person
> > >>> responsible for delivering this transmission to the intended recipient,
> > >>> you are hereby notified that you must not read this transmission and
> > >>> that any disclosure, copying, printing, distribution or use of this
> > >>> transmission is strictly prohibited. If you have received this transmission
> > >>> in error, please immediately notify the sender by telephone or return email
> > >>> and delete the original transmission and its attachments without reading
> > >>> or saving in any manner.
> > >>>
> > >>>   
> > >>>       
> > >>
> > >>     
> > >
> > >
> > >   
> > 
> > 
> > 
> 