[Ocfs2-users] Another node is heartbeating in our slot! errors with LUN removal/addition

Sunil Mushran sunil.mushran at oracle.com
Mon Dec 1 15:59:45 PST 2008


The reason it is unable to stop the heartbeat by uuid is that none of the
devices have that uuid.

So lookup by uuid fails because it cannot match the uuid to a device.

And shutdown by device name fails because it sees a different uuid
on that device. So ocfs2_hb_ctl -K -d /dev/dm-36 o2cb does nothing.
(Use o2cb as the service name, not o2hb.)
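The region-to-device cross-check described above can be scripted. A minimal
sketch, assuming the standard configfs mount point; the function name and the
example values are mine, and the configfs base and cluster name are passed in
because both are site-specific:

```shell
#!/bin/sh
# list_hb_regions: print "<uuid> -> /dev/<device>" for every o2cb
# heartbeat region registered under a configfs tree. A region whose
# uuid no longer matches any device's on-disk uuid is the stale one.
list_hb_regions() {
    base=$1      # e.g. /sys/kernel/config
    cluster=$2   # e.g. racdbd
    for region in "$base/cluster/$cluster/heartbeat"/*/; do
        [ -f "${region}dev" ] || continue   # skip non-region entries
        echo "$(basename "$region") -> /dev/$(cat "${region}dev")"
    done
}

# e.g.: list_hb_regions /sys/kernel/config racdbd
```

Comparing that output against mounted.ocfs2 -d shows which configfs regions
no longer correspond to any mounted uuid.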

The question is: can you reboot this box? If not, I could look into providing
a procedure that involves hand-editing the superblock. Fun! :)

Getting back to how this could have happened: can you provide the commands
for steps 1, 2, and 4? I want to make sure I understand what you are doing.

- unmount the snapshot dir
- unmap the snapshot lun
- take a SAN-based snapshot
- present snapshot lun (same SCSI ID/WWNN) back to server
- force a uuid reset with tunefs.ocfs2 on the snapshot filesystem
- change the label with tunefs.ocfs2 on the snapshot filesystem
- fsck the snapshot filesystem
- mount the snapshot filesystem
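For reference, the cycle above might look like the following dry-run sketch.
It only echoes the commands, and the tunefs.ocfs2 flag spellings (-U for a
uuid reset, -L for the label) are from memory of the 1.4 tools; check
tunefs.ocfs2's help on your build before running anything:

```shell
#!/bin/sh
# snapshot_cycle: echo the command sequence for re-presenting a
# SAN snapshot of an ocfs2 volume. Dry-run only; nothing is executed.
snapshot_cycle() {
    dev=$1; mnt=$2; label=$3
    cat <<EOF
umount $mnt
# ... unmap the LUN, take the SAN snapshot, re-present the LUN ...
tunefs.ocfs2 -U $dev          # force a fresh uuid on the clone
tunefs.ocfs2 -L $label $dev   # give the clone its own label
fsck.ocfs2 -y $dev
mount -t ocfs2 $dev $mnt
EOF
}

# e.g.: snapshot_cycle /dev/dm-36 /snapshots/orcl orcl_snap
```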

Sunil

Daniel Keisling wrote:
> [root at ausracdbd01 tmp]# uname -a
> Linux ausracdbd01.austin.ppdi.com 2.6.18-92.1.13.el5 #1 SMP Thu Sep 4
> 03:51:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> [root at ausracdbd01 tmp]# rpm -qa | grep ocfs2
> ocfs2console-1.4.1-1.el5
> ocfs2-2.6.18-53.el5-1.2.8-2.el5
> ocfs2-tools-1.4.1-1.el5
> ocfs2-2.6.18-92.1.13.el5-1.4.1-1.el5
>
> [root at ausracdbd01 tmp]# rpm -qf `which ocfs2_hb_ctl`
> ocfs2-tools-1.4.1-1.el5
>
>
>
>
>
> [root at ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> dm-36
>
> [root at ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> 5C81428158004C66B8AD4011D023E7F9: 1 refs
>
> The kill syntax you gave me for devices needs the service name...I
> assume o2hb?
>
> [root at ausracdbd01 tmp]# ocfs2_hb_ctl -K -d /dev/dm-36 o2hb
> [root at ausracdbd01 tmp]# ocfs2_hb_ctl -I -d /dev/dm-36
> 5C81428158004C66B8AD4011D023E7F9: 0 refs
>
> However, this did not kill the thread or remove any references out of
> /sys/kernel/config/cluster/racdbd/heartbeat/:
>
> [root at ausracdbd01 tmp]# ps -ef | grep F5F0
> root       620   169  0 Nov29 ?        00:00:31 [o2hb-F5F0522D39]
> root     14914 11922  0 15:03 pts/4    00:00:00 grep F5F0
>
> [root at ausracdbd01 tmp]# cat /sys/kernel/config/cluster/racdbd/heartbeat/F5F0522D39FC4EB2824C3E68C0B1D589/dev
> dm-36
>
>
> FWIW, the UUID 5C81428158004C66B8AD4011D023E7F9 does not exist in
> /sys/kernel/config/cluster/racdbd/heartbeat but does in 'mounted.ocfs2
> -d.'
>
>
>
>
>
>   
>> -----Original Message-----
>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
>> Sent: Monday, December 01, 2008 2:41 PM
>> To: Daniel Keisling
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] Another node is heartbeating in 
>> our slot! errors with LUN removal/addition
>>
>> So the problem you are encountering is killing via uuid. You could
>> kill by device name too.
>>
>> By now you have the list of heartbeat regions. To get the device name
>> for a region, do:
>>
>> $ cat /sys/kernel/config/cluster/CLUSTERNAME/heartbeat/C43CB881C2C84B09BAC14546BF6DCAD9/dev
>>
>> sdf1
>>
>> $ ocfs2_hb_ctl -K -d /dev/sdf1
>>
>> Now make sure that device is not mounted. It should not be. If it is,
>> then you probably have used force-uuid-reset to change the uuid of an
>> active device. In that case, I see no solution other than a node reset.
>>
>> But before you do this, I would like some more info.
>>
>> 1. strace -o /tmp/hbctl.out ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
>> 2. uname -a
>> 3. rpm -qa | grep ocfs2
>> 4. rpm -qf `which ocfs2_hb_ctl`
>> 5. mounted.ocfs2 -d >/tmp/mounted.out
>>
>> Thanks
>> Sunil
>>
>> Daniel Keisling wrote:
>>     
>>> I wrote a script to easily get the heartbeats that should have been
>>> killed.  However, I get a segmentation fault every time I try to kill
>>> the "dead" heartbeats:
>>>
>>> [root at ausracdbd01 tmp]# mounted.ocfs2 -d | grep -i f5f0 | wc -l
>>> 0
>>>
>>> [root at ausracdbd01 tmp]# ocfs2_hb_ctl -K -u F5F0522D39FC4EB2824C3E68C0B1D589
>>> Segmentation fault (core dumped)
>>>
>>>
>>>
>>> The process is still active:
>>>
>>> [root at ausracdbd01 tmp]# ps -ef | grep -i f5f0
>>> root       620   169  0 Nov29 ?        00:00:30 [o2hb-F5F0522D39]
>>> root     22608 18491  0 14:07 pts/4    00:00:00 grep -i f5f0
>>>
>>> Attached is the core.
>>>
>>> While I can create and mount snapshot filesystems on my development
>>> node, a dead heartbeat on one of my production nodes is not letting me
>>> mount the snapshot for a newly presented filesystem (thus causing our
>>> backups to fail).  What else can I do?  I really don't want to open an
>>> SR with Oracle...
>>>
>>> Thanks,
>>>
>>> Daniel
>