[Ocfs2-users] Unable to stop cluster as heartbeat region still active

Laurentiu Gosu lg at easic.ro
Sun Oct 23 05:32:04 PDT 2011


Hi Sunil,
Sorry for my late reply, I just had time today to start from scratch and 
test.
I rebuilt my environment (2 nodes connected to a SAN via 
iSCSI+multipath). I still have the issue that the heartbeat is active 
after I umount my ocfs2 volume.
/etc/init.d/o2cb stop
Stopping O2CB cluster CLUST: Failed
Unable to stop cluster as heartbeat region still active

ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs

After I manually kill the ref (ocfs2_hb_ctl -K -d 
/dev/mapper/volgr1-lvol0 ocfs2) I can stop o2cb successfully. I can 
live with that, but why doesn't it stop automatically? As I understand 
it, the heartbeat should be started and stopped when the volume is 
mounted/unmounted.

br,
Laurentiu.
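
For reference, the manual workaround described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names hb_refs and stop_o2cb_clean are made up for illustration and are not part of ocfs2-tools, and the commands and the trailing "ocfs2" argument are copied from this thread rather than from the tool's documentation. It does not explain why the reference is left behind after umount.

```shell
#!/bin/sh
# Sketch only: hb_refs and stop_o2cb_clean are hypothetical helper names.

# Print the reference count from `ocfs2_hb_ctl -I` output read on stdin.
# Expects a line of the form "<32-hex-digit UUID>: <N> refs", as seen in this thread.
# Prints nothing if no such line is found.
hb_refs() {
    sed -n 's/^[0-9A-Fa-f]\{32\}: \([0-9][0-9]*\) refs$/\1/p' | head -n 1
}

# If a heartbeat reference is still held on DEV, stop it, then stop o2cb.
# Call this only after confirming the volume is not mounted on this node.
stop_o2cb_clean() {
    dev=$1
    refs=$(ocfs2_hb_ctl -I -d "$dev" | hb_refs)
    if [ -z "$refs" ]; then
        echo "could not read heartbeat refs for $dev" >&2
        return 1
    fi
    if [ "$refs" -gt 0 ]; then
        # Same command as the one run by hand in this message.
        ocfs2_hb_ctl -K -d "$dev" ocfs2 || return 1
    fi
    /etc/init.d/o2cb stop
}
```

On the node in question this would be invoked as stop_o2cb_clean /dev/mapper/volgr1-lvol0, after the umount.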

On 10/19/2011 02:28, Sunil Mushran wrote:
> Manual delete will only work if there are no references. In your case
> there are references.
>
> You may want to start both nodes from scratch. Do not start/stop
> heartbeat manually. Also, do not force-format.
>
> On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
>> OK, I rebooted one of the nodes (both had similar issues). But 
>> something is still fishy.
>> - I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
>> - I unmounted it: umount /mnt/tmp/
>> - I tried to stop o2cb: /etc/init.d/o2cb stop
>> Stopping O2CB cluster CLUSTER: Failed
>> Unable to stop cluster as heartbeat region still active
>> - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>> -  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>> - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
>> /sys/kernel/config/cluster/CLUSTER/heartbeat/:
>> total 0
>> drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
>> -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
>>
>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
>> total 0
>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
>> -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
>>
>> - I cannot manually delete 
>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
>>
>> PS: I'm going to sleep now, I have to be up in a few hours. We can 
>> continue tomorrow if it's OK with you.
>> Thank you for your help.
>>
>> Laurentiu.
>>
>> On 10/19/2011 01:33, Sunil Mushran wrote:
>>> One way this can happen is if one starts the hb manually and then
>>> force-formats that volume. The format will generate a new uuid. Once that
>>> happens, the hb tool cannot map the region to the device and thus fails
>>> to stop it. Right now the easiest option on this box is resetting it.
>>>
>>> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>>>> Yes, I did reformat it (even more than once, I think, last week). 
>>>> This is a pre-production system and I'm trying various options 
>>>> before moving into real life.
>>>>
>>>>
>>>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>>>> Did you reformat the volume recently? or, when did you format last?
>>>>>
>>>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>>>> well..this is weird
>>>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>>> 918673F06F8F4ED188DDCE14F39945F6  dead_threshold
>>>>>>
>>>>>> looks like we have different UUIDs. Where is this coming from??
>>>>>>
>>>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>>>
>>>>>>
>>>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>>>> Let's do it by hand.
>>>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>
>>>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>>>>  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>>>
>>>>>>>> No improvement :(
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>>>> See if this cleans it up.
>>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>
>>>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>
>>>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>> Device                    FS     Stack  UUID                              Label
>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>>>
>>>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>>>> Device                    FS     Nodes
>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
>>>>>>>>>>>>
>>>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>>>
>>>>>>>>>>>> By the way, there is no /dev/dm-2
>>>>>>>>>>>>  ls /dev/dm-*
>>>>>>>>>>>> /dev/dm-0  /dev/dm-1
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>>>> could be the same that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do:
>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Be careful before killing. We want to be sure that dev 
>>>>>>>>>>>>>>> is not mounted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>> Again, the outputs:
>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>>>> ---> here should be volgr1-lvol0, I guess?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think I have to enable debug first somehow?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root    0 Oct 11 20:23 node
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I have a 2-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>>>> My problem is that every time I try to run 
>>>>>>>>>>>>>>>>>>>> /etc/init.d/o2cb stop
>>>>>>>>>>>>>>>>>>>> it fails with this error:
>>>>>>>>>>>>>>>>>>>>       Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>>>       Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop the heartbeat
>>>>>>>>>>>>>>>>>>>> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
>>>>>>>>>>>>>>>>>>>> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>>>> But even if the refs number is zero, the "heartbeat region still
>>>>>>>>>>>>>>>>>>>> active" error occurs.
>>>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>>>> Laurentiu.

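The diagnosis reached in the quoted thread (a heartbeat region left under configfs whose UUID no longer matches the volume after a reformat) can be checked with a small helper. The sketch below is an illustration only: stale_hb_regions is a hypothetical name, and it assumes the layout shown in the ls -lR output above, where each region is a directory named after its UUID and dead_threshold is a plain file. The UUIDs to compare against would come from mounted.ocfs2 -d.

```shell
#!/bin/sh
# Sketch only: stale_hb_regions is a hypothetical helper, not part of ocfs2-tools.

# Usage: stale_hb_regions HB_DIR UUID...
# Print the names of region directories under HB_DIR that match none of the
# given volume UUIDs. On a live node HB_DIR would be the cluster's heartbeat
# directory under /sys/kernel/config/cluster/.
stale_hb_regions() {
    hb_dir=$1
    shift
    for region in "$hb_dir"/*; do
        # Skip plain files such as dead_threshold, and the unmatched glob itself.
        [ -d "$region" ] || continue
        name=$(basename "$region")
        stale=yes
        for uuid in "$@"; do
            [ "$name" = "$uuid" ] && stale=no
        done
        [ "$stale" = yes ] && echo "$name"
    done
    return 0
}
```

With the state shown in the thread, stale_hb_regions /sys/kernel/config/cluster/CLUSTER/heartbeat 0C4AB55FE9314FA5A9F81652FDB9B22D would list 918673F06F8F4ED188DDCE14F39945F6 as a region that no current volume accounts for. The helper only reports; it does not try to remove anything.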