[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Tue Dec 2 09:32:34 PST 2008


Yes, I can reboot the box, and the stale heartbeats do go away.
However, the problem reappears when the next snapshot is taken.  Even
though I can take new snapshots and mount them, what are the side
effects of this happening?  So far it just seems to be flooding my
messages file and not allowing me to snapshot newly-presented
filesystems without a reboot.

The steps you asked for:
- unmount the snapshot dir

Running 'umount', i.e.:
[root@ausracdbd01 bin]# umount /orasnapbackup/cret/bkup

This triggers the segfault:
Dec  2 11:22:51 ausracdbd01 kernel: ocfs2_hb_ctl[3074]: segfault at 0000000000000000 rip 0000000000428fa0 rsp 00007fff9fd1c738 error 4

- unmap the snapshot lun
I use a utility to talk to the SAN that removes the SCSI ID presentation
from the server.

- present snapshot lun (same SCSI ID/WWNN) back to server
I use the same utility to talk to the SAN that presents the new snapshot
with the old SCSI ID to the server.  I then do a SCSI bus rescan using
the HP SCSI utilities (/opt/hp/hp_fibreutils/hp_rescan -a).

The time between unmapping the LUN, taking a snapshot, and presenting
the new snapshot back is less than 30 seconds.
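
To spot the stale heartbeat regions before the next snapshot cycle, a check
along the following lines may help. This is only a minimal sketch under
stated assumptions, not a tested procedure: it lists heartbeat regions under
the cluster's configfs directory whose UUID does not appear anywhere in saved
'mounted.ocfs2 -d' output. The function names (normalize, stale_regions) are
made up for illustration, and the sketch assumes the UUIDs printed by
mounted.ocfs2 may be lower-case or contain dashes, so both sides are
normalized before comparing.

```shell
#!/bin/sh
# Sketch: list o2hb heartbeat regions that no longer match any OCFS2 device.
# normalize and stale_regions are hypothetical helper names, not OCFS2 tools.

# Upper-case the input and strip dashes so differently formatted UUIDs
# compare equal (assumption: output may use lower-case or dashed UUIDs).
normalize() {
    tr -d '-' | tr '[:lower:]' '[:upper:]'
}

# stale_regions HB_DIR MOUNTED_FILE
#   HB_DIR       e.g. /sys/kernel/config/cluster/racdbd/heartbeat
#   MOUNTED_FILE saved output of 'mounted.ocfs2 -d'
# Prints "UUID device" for every region whose UUID is absent from the file.
stale_regions() {
    hb_dir=$1
    mounted_file=$2
    known=$(normalize < "$mounted_file")
    for region in "$hb_dir"/*/; do
        # Only region directories are of interest; skip anything else.
        [ -d "$region" ] || continue
        uuid=$(basename "$region" | normalize)
        case "$known" in
            *"$uuid"*) ;;   # UUID still belongs to a visible device
            *) printf '%s %s\n' "$uuid" "$(cat "$region/dev" 2>/dev/null)" ;;
        esac
    done
}
```

Possible usage on the node from this thread, after saving the tool output
with 'mounted.ocfs2 -d >/tmp/mounted.out':
stale_regions /sys/kernel/config/cluster/racdbd/heartbeat /tmp/mounted.out
The sketch only reports candidates; it does not try to stop any heartbeat.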

Daniel

> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Monday, December 01, 2008 6:00 PM
> To: Daniel Keisling
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
> our slot! errors with LUN removal/addition
> 
> The reason it is unable to stop hb by uuid is that none of the devices
> have that uuid.
> 
> So lookup by uuid fails because it cannot match the uuid to a device.
> 
> And shutdown by device name fails because it sees a different uuid
> on that device. So ocfs2_hb_ctl -K -d /dev/dm-36 o2cb does nothing.
> (Use o2cb as the service.)
> 
> The question is: can you reboot this box? If not, I could look into
> providing a procedure that involves hand-editing the superblock. Fun! :)
>
> Getting back to how this could have happened: can you provide the commands
> for steps 1, 2 and 4? I want to make sure I understand what you are doing.
> 
> - unmount the snapshot dir
> - unmap the snapshot lun
> - take a SAN-based snapshot
> - present snapshot lun (same SCSI ID/WWNN) back to server
> - force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
> - change the label with tunefs.ocfs2 on the snapshot filesystem
> - fsck the snapshot filesystem
> - mount the snapshot filesystem
> 
> Sunil
> 
> Daniel Keisling wrote:
> > [root@ausracdbd01 tmp]# uname -a
> > Linux ausracdbd01.austin.ppdi.com 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> >
> > [root@ausracdbd01 tmp]# rpm -qa | grep ocfs2
> > ocfs2console-1.4.1-1.el5
> > ocfs2-2.6.18-53.el5-1.2.8-2.el5
> > ocfs2-tools-1.4.1-1.el5
> > ocfs2-2.6.18-92.1.13.el5-1.4.1-1.el5
> >
> > [root@ausracdbd01 tmp]# rpm -qf `which ocfs2_hb_ctl`
> > ocfs2-tools-1.4.1-1.el5
> >
> > [root@ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> > dm-36
> >
> > [root@ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> > 5C81428158004C66B8AD4011D023E7F9: 1 refs
> >
> > The kill syntax you gave me for devices needs the service name. I
> > assume o2hb?
> >
> > [root@ausracdbd01 tmp]# ocfs2_hb_ctl -K -d /dev/dm-36 o2hb
> > [root@ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> > 5C81428158004C66B8AD4011D023E7F9: 0 refs
> >
> > However, this did not kill the thread or remove any references from
> > /sys/kernel/config/cluster/racdbd/heartbeat/:
> >
> > [root@ausracdbd01 tmp]# ps -ef | grep F5F0
> > root       620   169  0 Nov29 ?        00:00:31 [o2hb-F5F0522D39]
> > root     14914 11922  0 15:03 pts/4    00:00:00 grep F5F0
> >
> > [root@ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> > dm-36
> >
> >
> > FWIW, the UUID 5C81428158004C66B8AD4011D023E7F9 does not exist in
> > /sys/kernel/config/cluster/racdbd/heartbeat but does appear in the
> > output of 'mounted.ocfs2 -d'.
> >
> >> -----Original Message-----
> >> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> >> Sent: Monday, December 01, 2008 2:41 PM
> >> To: Daniel Keisling
> >> Cc: ocfs2-users at oss.oracle.com
> >> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
> >> our slot! errors with LUN removal/addition
> >>
> >> So the problem you are encountering is killing via uuid. You could
> >> kill by device name too.
> >>
> >> By now you have the list of heartbeat regions. To get the device name
> >> for a region, do:
> >>
> >> $ cat /sys/kernel/config/cluster/CLUSTERNAME/heartbeat/C43CB881C2C84B09BAC14546BF6DCAD9/dev
> >>
> >> sdf1
> >>
> >> $ ocfs2_hb_ctl -K -d /dev/sdf1
> >>
> >> Now make sure that this device is not mounted. It should not be. If it
> >> is, then you probably have used force-uuid-reset to change the uuid of
> >> an active device. In that case, I see no solution other than a node reset.
> >>
> >> But before you do this, I would like some more info.
> >>
> >> 1. strace -o /tmp/hbctl.out ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
> >> 2. uname -a
> >> 3. rpm -qa | grep ocfs2
> >> 4. rpm -qf `which ocfs2_hb_ctl`
> >> 5. mounted.ocfs2 -d >/tmp/mounted.out
> >>
> >> Thanks
> >> Sunil
> >>
> >> Daniel Keisling wrote:
> >>> I wrote a script to easily get the heartbeats that should have been
> >>> killed.  However, I get a segmentation fault every time I try to kill
> >>> the "dead" heartbeats:
> >>>
> >>> [root@ausracdbd01 tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
> >>> 0
> >>>
> >>> [root@ausracdbd01 tmp]# ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
> >>> Segmentation fault (core dumped)
> >>>
> >>> The process is still active:
> >>>
> >>> [root@ausracdbd01 tmp]# ps -ef | grep -i f5f0
> >>> root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
> >>> root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0
> >>>
> >>> Attached is the core.
> >>>
> >>> While I can create and mount snapshot filesystems on my development
> >>> node, a dead heartbeat on one of my production nodes is not letting me
> >>> mount the snapshot for a newly presented filesystem (thus causing our
> >>> backups to fail).  What else can I do?  I really don't want to open an
> >>> SR with Oracle...
> >>>
> >>> Thanks,
> >>>
> >>> Daniel
> >
> > ______________________________________________________________________
> > This email transmission and any documents, files or previous email
> > messages attached to it may contain information that is confidential or
> > legally privileged. If you are not the intended recipient or a person
> > responsible for delivering this transmission to the intended recipient,
> > you are hereby notified that you must not read this transmission and
> > that any disclosure, copying, printing, distribution or use of this
> > transmission is strictly prohibited. If you have received this transmission
> > in error, please immediately notify the sender by telephone or return email
> > and delete the original transmission and its attachments without reading
> > or saving in any manner.




