[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran sunil.mushran at oracle.com
Mon Dec 1 12:40:32 PST 2008


So the problem you are encountering is with killing via uuid. You could
kill by device name instead.

By now you have the list of heartbeat regions. To get the device name for
a region, do:

$ cat /sys/kernel/config/cluster/CLUSTERNAME/heartbeat/C43CB881C2C84B09BAC14546BF6DCAD9/dev
sdf1

$ ocfs2_hb_ctl -K -d /dev/sdf1

Now make sure that device is not mounted. It should not be. If it is,
then you probably used force-uuid-reset to change the uuid of an active
device. In that case, I see no solution other than a node reset.
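As a sketch of the lookup above: a small shell function can walk the
configfs tree and print the device behind every heartbeat region, so you
can see at a glance what `ocfs2_hb_ctl -K -d` would target. This is not
part of ocfs2-tools; the `list_hb_regions` name and the optional base
directory argument (for testing outside configfs) are mine.

```shell
# list_hb_regions: print "<uuid> -> /dev/<device>" for each heartbeat
# region under a configfs cluster directory. Defaults to the real
# /sys/kernel/config/cluster; pass another base dir for testing.
# Hypothetical helper, not shipped with ocfs2-tools.
list_hb_regions() {
    base="${1:-/sys/kernel/config/cluster}"
    # one pass per cluster, one per region; trailing / keeps dirs only
    for region in "$base"/*/heartbeat/*/; do
        [ -f "${region}dev" ] || continue   # skip non-region entries
        printf '%s -> /dev/%s\n' \
            "$(basename "$region")" "$(cat "${region}dev")"
    done
}
```

Once you have the device name, `grep` it in /proc/mounts before running
`ocfs2_hb_ctl -K -d` to confirm it is not an actively mounted volume.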

But before you do this, I would like some more info.

1. strace -o /tmp/hbctl.out ocfs2_hb_ctl -K -u 
F5F0522D39FC4EB2824C3E68C0B1D589
2. uname -a
3. rpm -qa | grep ocfs2
4. rpm -qf `which ocfs2_hb_ctl`
5. mounted.ocfs2 -d >/tmp/mounted.out

Thanks
Sunil

Daniel Keisling wrote:
> I wrote a script to easily get the heartbeats that should have been
> killed.  However, I get a segmentation fault every time I try to kill
> the "dead" heartbeats:
>
> [root@ausracdbd01 tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
> 0
>
> [root@ausracdbd01 tmp]# ocfs2_hb_ctl -K -u
> F5F0522D39FC4EB2824C3E68C0B1D589
> Segmentation fault (core dumped)
>
>
>
> The process is still active:
>
> [root@ausracdbd01 tmp]# ps -ef | grep -i f5f0
> root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
> root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0
>
> Attached is the core.
>
> While I can create and mount snapshot filesystems on my development
> node, a dead heartbeat on one of my production nodes is not letting me
> mount the snapshot for a newly presented filesystem (thus causing our
> backups to fail).  What else can I do?  I really don't want to open an
> SR with Oracle...
>
> Thanks,
>
> Daniel
