[Ocfs2-users] Unable to stop cluster as heartbeat region still active

Sunil Mushran sunil.mushran at oracle.com
Sun Oct 23 12:49:20 PDT 2011


Are you sure you have ocfs2-tools-1.6.3? I remember we had an
issue with this with an earlier release... 1.6.1/.2.

On 10/23/2011 10:43 AM, Laurentiu Gosu wrote:
> hmm..
> # ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
> BUT:
> # ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
> I can still kill the ref using the device name (-d).
>
> On 10/23/2011 17:57, Sunil Mushran wrote:
>> I think it stops by uuid. So try doing this the next time.
>> You are encountering some issue that we have not seen before.
>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
>>
>> On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:
>>> Hi Sunil,
>>> Sorry for my late reply; I only had time today to start from scratch
>>> and test.
>>> I rebuilt my environment (2 nodes connected to a SAN via
>>> iSCSI+multipath). The heartbeat is still active after I umount
>>> my ocfs2 volume.
>>> /etc/init.d/o2cb stop
>>> Stopping O2CB cluster CLUST: Failed
>>> Unable to stop cluster as heartbeat region still active
>>>
>>> ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>>>
>>> After I manually kill the ref
>>> (ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2), I can stop o2cb
>>> successfully. I can live with that, but why doesn't it stop
>>> automatically? As I understand it, the heartbeat should be started and
>>> stopped when the volume is mounted and unmounted.
>>>
>>> br,
>>> Laurentiu.
>>>
>>> On 10/19/2011 02:28, Sunil Mushran wrote:
>>>> Manual delete will only work if there are no references. In your case
>>>> there are references.
>>>>
>>>> You may want to start both nodes from scratch. Do not start/stop
>>>> heartbeat manually. Also, do not force-format.
>>>>
>>>> On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
>>>>> OK, I rebooted one of the nodes (both had similar issues), but
>>>>> something is still fishy.
>>>>> - I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
>>>>> - I unmounted it: umount /mnt/tmp/
>>>>> - I tried to stop o2cb: /etc/init.d/o2cb stop
>>>>> Stopping O2CB cluster CLUSTER: Failed
>>>>> Unable to stop cluster as heartbeat region still active
>>>>> - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>>>>> -  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>> - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/:
>>>>> total 0
>>>>> drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
>>>>>
>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
>>>>> total 0
>>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
>>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
>>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
>>>>> -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
>>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
>>>>>
>>>>> - I cannot manually delete
>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
>>>>>
>>>>> PS: I'm going to sleep now; I have to be up in a few hours. We can
>>>>> continue tomorrow if that's OK with you.
>>>>> Thank you for your help.
>>>>>
>>>>> Laurentiu.
>>>>>
>>>>> On 10/19/2011 01:33, Sunil Mushran wrote:
>>>>>> One way this can happen is if one starts the hb manually and then
>>>>>> force-formats that volume. The format generates a new uuid. Once
>>>>>> that happens, the hb tool cannot map the region to the device and
>>>>>> thus fails to stop it. Right now the easiest option on this box is
>>>>>> resetting it.
>>>>>>
>>>>>> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>>>>>>> Yes, I did reformat it (more than once, I think, last week).
>>>>>>> This is a pre-production system and I'm trying various options
>>>>>>> before moving into real life.
>>>>>>>
>>>>>>>
>>>>>>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>>>>>>> Did you reformat the volume recently? or, when did you format last?
>>>>>>>>
>>>>>>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>>>>>>> Well, this is weird:
>>>>>>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>>>>>> 918673F06F8F4ED188DDCE14F39945F6  dead_threshold
>>>>>>>>>
>>>>>>>>> It looks like we have different UUIDs. Where is this one coming from?
>>>>>>>>>
>>>>>>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>>>>>>> Let's do it by hand.
>>>>>>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>
>>>>>>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>>>>>>
>>>>>>>>>>> No improvement :(
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>>>>>>> See if this cleans it up.
>>>>>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>>> Device                    FS     Stack  UUID                              Label
>>>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>>>>>>> Device                    FS     Nodes
>>>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By the way, there is no /dev/dm-2
>>>>>>>>>>>>>>>  ls /dev/dm-*
>>>>>>>>>>>>>>> /dev/dm-0  /dev/dm-1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>>>>>>> could be the same one that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Do:
>>>>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Be careful before killing. We want to be sure that 
>>>>>>>>>>>>>>>>>> dev is not mounted.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>> Here are the outputs:
>>>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>>>>>>> ---> this should be volgr1-lvol0, I guess?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I think I have to enable debug first somehow?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root    0 Oct 11 20:23 node
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>> I have a 2-node ocfs2 cluster running UEK
>>>>>>>>>>>>>>>>>>>>>>> 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>>>>>>> My problem is that every time I try to run
>>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/o2cb stop, it fails with this error:
>>>>>>>>>>>>>>>>>>>>>>>       Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>>>>>>       Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop
>>>>>>>>>>>>>>>>>>>>>>> the heartbeat with
>>>>>>>>>>>>>>>>>>>>>>> "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2"
>>>>>>>>>>>>>>>>>>>>>>> (after finding the refs number with
>>>>>>>>>>>>>>>>>>>>>>> "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>>>>>>> But even when the refs number is zero, the
>>>>>>>>>>>>>>>>>>>>>>> "heartbeat region still active" error occurs.
>>>>>>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>>>>>>> Laurentiu.

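The mismatch diagnosed in the thread (mounted.ocfs2 -d reports UUID
0C4AB55FE9314FA5A9F81652FDB9B22D while configfs still holds a heartbeat
region named 918673F06F8F4ED188DDCE14F39945F6) can be spotted with a short
script. The following is a minimal sketch, not part of ocfs2-tools: the
function name stale_hb_regions and its arguments are hypothetical, and it
assumes only the configfs layout shown in the listings above (one directory
per region, each containing a dev file). It only reads files and changes
nothing.

```shell
# Hypothetical helper (not part of ocfs2-tools): print heartbeat regions whose
# UUID is not among the UUIDs given on the command line.
# Usage: stale_hb_regions HEARTBEAT_DIR [KNOWN_UUID...]
stale_hb_regions() {
    hb_dir=$1
    shift
    for region in "$hb_dir"/*/; do
        # If there are no region directories the glob stays literal; skip it.
        [ -d "$region" ] || continue
        uuid=$(basename "$region")
        known=no
        for k in "$@"; do
            if [ "$k" = "$uuid" ]; then
                known=yes
            fi
        done
        if [ "$known" = no ]; then
            # The region's dev file names the block device (e.g. dm-2).
            dev=$(cat "${region}dev" 2>/dev/null)
            echo "$uuid ${dev:-unknown}"
        fi
    done
}

# Example call, with the UUID taken from "mounted.ocfs2 -d":
# stale_hb_regions /sys/kernel/config/cluster/CLUSTER/heartbeat \
#     0C4AB55FE9314FA5A9F81652FDB9B22D
```

An empty result suggests that every active region belongs to a known volume.
Any line printed names a region the tools may no longer be able to map back
to a device, which matches the situation described after the force-format.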