[Ocfs2-users] Unable to stop cluster as heartbeat region still active

Sunil Mushran sunil.mushran at oracle.com
Sun Oct 23 07:57:05 PDT 2011


I think it stops by uuid. So try doing this the next time.
You are encountering some issue that we have not seen before.
ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
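
A minimal sketch for checking which heartbeat regions are still registered before killing one by UUID. It assumes the configfs layout shown in the listings quoted below (one directory per region UUID, each containing a dev file). The function name list_hb_regions is made up for illustration and is not part of ocfs2-tools.

```shell
#!/bin/sh
# Hypothetical helper: list heartbeat regions found under an o2cb
# configfs heartbeat directory, one "UUID device" pair per line.
# Usage: list_hb_regions /sys/kernel/config/cluster/CLUSTER/heartbeat
list_hb_regions() {
    hb_dir=$1
    for region in "$hb_dir"/*/; do
        # Skip the unexpanded glob when no region directories exist.
        [ -d "$region" ] || continue
        uuid=$(basename "$region")
        # The dev file holds the block device name the region was started on.
        dev=$(cat "$region/dev" 2>/dev/null)
        printf '%s %s\n' "$uuid" "${dev:-unknown}"
    done
    return 0
}
```

For example, list_hb_regions /sys/kernel/config/cluster/CLUSTER/heartbeat prints one line per region; each UUID can then be passed to ocfs2_hb_ctl -I -u to check its reference count.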

On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:
> Hi Sunil,
> Sorry for my late reply; I only had time today to start from scratch
> and test.
> I rebuilt my environment (2 nodes connected to a SAN via
> iSCSI + multipath). I still have the issue that the heartbeat is active
> after I umount my ocfs2 volume.
> /etc/init.d/o2cb stop
> Stopping O2CB cluster CLUST: Failed
> Unable to stop cluster as heartbeat region still active
>
> ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>
> After I manually kill the ref (ocfs2_hb_ctl -K -d
> /dev/mapper/volgr1-lvol0 ocfs2), I can stop o2cb successfully. I can
> live with that, but why doesn't it stop automatically? As I understand it,
> the heartbeat should be started and stopped when the volume is
> mounted/unmounted.
>
> br,
> Laurentiu.
>
> On 10/19/2011 02:28, Sunil Mushran wrote:
>> Manual delete will only work if there are no references. In your case
>> there are references.
>>
>> You may want to start both nodes from scratch. Do not start/stop
>> heartbeat manually. Also, do not force-format.
>>
>> On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
>>> OK, I rebooted one of the nodes (both had similar issues). But
>>> something is still fishy.
>>> - I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
>>> - I unmounted it: umount /mnt/tmp/
>>> - I tried to stop o2cb: /etc/init.d/o2cb stop
>>> Stopping O2CB cluster CLUSTER: Failed
>>> Unable to stop cluster as heartbeat region still active
>>> - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>>> -  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>> - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/:
>>> total 0
>>> drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
>>> -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
>>>
>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
>>> total 0
>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
>>> -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
>>>
>>> - I cannot manually delete
>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
>>>
>>> PS: I'm going to sleep now; I have to be up in a few hours. We can
>>> continue tomorrow if it's OK with you.
>>> Thank you for your help.
>>>
>>> Laurentiu.
>>>
>>> On 10/19/2011 01:33, Sunil Mushran wrote:
>>>> One way this can happen is if one starts the hb manually and then force
>>>> formats that volume. The format generates a new uuid. Once that
>>>> happens, the hb tool cannot map the region to the device and thus fails
>>>> to stop it. Right now the easiest option on this box is resetting it.
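
The mismatch described above can be checked with a short script. This is only a sketch under stated assumptions: it takes the heartbeat directory in configfs and the UUIDs currently reported for the volumes (for example by mounted.ocfs2 -d) as arguments, and prints any region UUID that matches none of them. The name find_stale_regions is hypothetical, not an ocfs2-tools command.

```shell
#!/bin/sh
# Hypothetical helper: print heartbeat region UUIDs that match none of
# the volume UUIDs given on the command line.
# Usage: find_stale_regions <heartbeat_dir> <uuid> [<uuid> ...]
find_stale_regions() {
    hb_dir=$1
    shift
    for region in "$hb_dir"/*/; do
        # Skip the unexpanded glob when no region directories exist.
        [ -d "$region" ] || continue
        region_uuid=$(basename "$region")
        stale=yes
        for known in "$@"; do
            if [ "$region_uuid" = "$known" ]; then
                stale=no
                break
            fi
        done
        # A region whose UUID no volume carries cannot be mapped to a device.
        if [ "$stale" = yes ]; then
            printf '%s\n' "$region_uuid"
        fi
    done
    return 0
}
```

Run as find_stale_regions /sys/kernel/config/cluster/CLUSTER/heartbeat 0C4AB55FE9314FA5A9F81652FDB9B22D, it would report a leftover region such as 918673F06F8F4ED188DDCE14F39945F6 if that directory is still present.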
>>>>
>>>> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>>>>> Yes, I did reformat it (more than once, I think, last week).
>>>>> This is a pre-production system and I'm trying various options
>>>>> before moving into real life.
>>>>>
>>>>>
>>>>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>>>>> Did you reformat the volume recently? or, when did you format last?
>>>>>>
>>>>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>>>>> Well, this is weird:
>>>>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>>>> 918673F06F8F4ED188DDCE14F39945F6  dead_threshold
>>>>>>>
>>>>>>> Looks like we have different UUIDs. Where is this coming from?
>>>>>>>
>>>>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>>>>
>>>>>>>
>>>>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>>>>> Let's do it by hand.
>>>>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>
>>>>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>>>>>  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>>>>
>>>>>>>>> No improvement :(
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>>>>> See if this cleans it up.
>>>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>
>>>>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>> Device                    FS     Stack  UUID                              Label
>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>>>>
>>>>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>>>>> Device                    FS     Nodes
>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
>>>>>>>>>>>>>
>>>>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> By the way, there is no /dev/dm-2:
>>>>>>>>>>>>>  ls /dev/dm-*
>>>>>>>>>>>>> /dev/dm-0  /dev/dm-1
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>>>>> could be the same that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do:
>>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Be careful before killing. We want to be sure that dev 
>>>>>>>>>>>>>>>> is not mounted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>> Again, the outputs:
>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>>>>> ---> This should be volgr1-lvol0, I guess?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think I have to enable debug first somehow?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root    0 Oct 11 20:23 node
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>> I have a 2-node ocfs2 cluster running UEK
>>>>>>>>>>>>>>>>>>>>> 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>>>>> My problem is that every time I try to run
>>>>>>>>>>>>>>>>>>>>> /etc/init.d/o2cb stop
>>>>>>>>>>>>>>>>>>>>> it fails with this error:
>>>>>>>>>>>>>>>>>>>>>       Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>>>>       Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop the heartbeat
>>>>>>>>>>>>>>>>>>>>> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
>>>>>>>>>>>>>>>>>>>>> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>>>>> But even if the refs number is zero, the "heartbeat region still
>>>>>>>>>>>>>>>>>>>>> active" error occurs.
>>>>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>>>>> Laurentiu.
