[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Daniel Keisling daniel.keisling at austin.ppdi.com
Mon Dec 1 13:12:14 PST 2008


[root at ausracdbd01 tmp]# uname -a
Linux ausracdbd01.austin.ppdi.com 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

[root at ausracdbd01 tmp]# rpm -qa | grep ocfs2
ocfs2console-1.4.1-1.el5
ocfs2-2.6.18-53.el5-1.2.8-2.el5
ocfs2-tools-1.4.1-1.el5
ocfs2-2.6.18-92.1.13.el5-1.4.1-1.el5

[root at ausracdbd01 tmp]# rpm -qf `which ocfs2_hb_ctl`
ocfs2-tools-1.4.1-1.el5





[root at ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
dm-36

[root at ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
5C81428158004C66B8AD4011D023E7F9: 1 refs

The kill syntax you gave me for devices also requires a service name.
I assume it is o2hb?

[root at ausracdbd01 tmp]# ocfs2_hb_ctl -K -d /dev/dm-36 o2hb
[root at ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
5C81428158004C66B8AD4011D023E7F9: 0 refs

However, this neither killed the heartbeat thread nor removed the region
from /sys/kernel/config/cluster/racdbd/heartbeat/:

[root at ausracdbd01 tmp]# ps -ef | grep F5F0
root       620   169  0 Nov29 ?        00:00:31 [o2hb-F5F0522D39]
root     14914 11922  0 15:03 pts/4    00:00:00 grep F5F0

[root at ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
dm-36
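
For anyone repeating the lookup above for every region at once, here is
a minimal sketch. It assumes each region directory under the heartbeat
directory has a "dev" attribute, as the one shown above does. The
function name list_hb_regions is made up for illustration, and the
heartbeat directory is passed as an argument rather than hard-coded.

```shell
#!/bin/sh
# Print "<region UUID> /dev/<device>" for every o2hb region found under
# the heartbeat directory given as $1.
# list_hb_regions is a hypothetical helper name, not part of ocfs2-tools.
list_hb_regions() {
    hb_root=$1
    for region in "$hb_root"/*/; do
        # Skip entries without a dev attribute; this also covers the
        # unexpanded glob when the directory holds no regions.
        [ -f "${region}dev" ] || continue
        printf '%s /dev/%s\n' "$(basename "$region")" "$(cat "${region}dev")"
    done
}
```

On this node it would be called as
list_hb_regions /sys/kernel/config/cluster/racdbd/heartbeat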


FWIW, the UUID 5C81428158004C66B8AD4011D023E7F9 does not exist in
/sys/kernel/config/cluster/racdbd/heartbeat but does appear in the
output of 'mounted.ocfs2 -d'.
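
The opposite check (regions that are still heartbeating but no longer
show up in mounted.ocfs2) could be sketched roughly as below. This is
only an illustration of the grep quoted further down, run over every
region. The function name find_stale_regions is hypothetical. Because
the exact UUID formatting in the mounted.ocfs2 output is an assumption
here, the sketch strips dashes and ignores case instead of relying on
column positions.

```shell
#!/bin/sh
# Print "<region UUID> /dev/<device>" for every region under the
# heartbeat directory ($1) whose UUID does not occur anywhere in a saved
# copy of `mounted.ocfs2 -d` output ($2).
# find_stale_regions is a hypothetical helper name.
find_stale_regions() {
    hb_root=$1
    mounted_out=$2
    for region in "$hb_root"/*/; do
        [ -f "${region}dev" ] || continue
        uuid=$(basename "$region")
        # Assumption: UUIDs may be printed with dashes or in lower case,
        # so remove dashes and match case-insensitively.
        if ! sed 's/-//g' "$mounted_out" | grep -qiF "$uuid"; then
            printf '%s /dev/%s\n' "$uuid" "$(cat "${region}dev")"
        fi
    done
}
```

With the output saved as in item 5 of the quoted list, the call would be
find_stale_regions /sys/kernel/config/cluster/racdbd/heartbeat /tmp/mounted.out
Anything it prints is only a candidate for ocfs2_hb_ctl -K; the device
should still be checked by hand before killing its heartbeat.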





> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Monday, December 01, 2008 2:41 PM
> To: Daniel Keisling
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
> our slot! errors with LUN removal/addition
> 
> So the problem you are encountering is killing via uuid. You could
> kill by device name too.
> 
> By now you have the list of heartbeat regions. To get the device name
> for a region, do:
> 
> $ cat /sys/kernel/config/cluster/CLUSTERNAME/heartbeat/C43CB881C2C84B09BAC14546BF6DCAD9/dev
> 
> sdf1
> 
> $ ocfs2_hb_ctl -K -d /dev/sdf1
> 
> Now make sure that the device is not mounted. It should not be. If it
> is, then you have probably used force-uuid-reset to change the uuid of
> an active device. In that case, I see no solution other than a node
> reset.
> 
> But before you do this, I would like some more info.
> 
> 1. strace -o /tmp/hbctl.out ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
> 2. uname -a
> 3. rpm -qa | grep ocfs2
> 4. rpm -qf `which ocfs2_hb_ctl`
> 5. mounted.ocfs2 -d >/tmp/mounted.out
> 
> Thanks
> Sunil
> 
> Daniel Keisling wrote:
> > I wrote a script to easily get the heartbeats that should have been
> > killed.  However, I get a segmentation fault every time I try to kill
> > the "dead" heartbeats:
> >
> > [root at ausracdbd01 tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
> > 0
> >
> > [root at ausracdbd01 tmp]# ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
> > Segmentation fault (core dumped)
> >
> >
> >
> > The process is still active:
> >
> > [root at ausracdbd01 tmp]# ps -ef | grep -i f5f0
> > root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
> > root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0
> >
> > Attached is the core.
> >
> > While I can create and mount snapshot filesystems on my development
> > node, a dead heartbeat on one of my production nodes is not letting me
> > mount the snapshot for a newly presented filesystem (thus causing our
> > backups to fail).  What else can I do?  I really don't want to open an
> > SR with Oracle...
> >
> > Thanks,
> >
> > Daniel
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mounted.out
Type: application/octet-stream
Size: 11809 bytes
Desc: mounted.out
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20081201/656de4b8/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hbctl.out
Type: application/octet-stream
Size: 33549 bytes
Desc: hbctl.out
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20081201/656de4b8/attachment-0003.obj 