[Ocfs2-users] Unable to stop cluster as heartbeat region still active

Sunil Mushran sunil.mushran at oracle.com
Tue Oct 18 16:28:18 PDT 2011


Manual delete will only work if there are no references. In your case
there are references.

You may want to start both nodes from scratch. Do not start/stop
heartbeat manually. Also, do not force-format.
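
To illustrate the point about references: the sketch below lists the heartbeat regions registered under configfs and extracts the reference count from a line of `ocfs2_hb_ctl -I` output. It assumes the "UUID: N refs" format quoted further down in this thread. The helper names `hb_list_regions` and `hb_refs_from_info` are hypothetical and not part of ocfs2-tools.

```shell
# Hypothetical helpers; not part of ocfs2-tools.

# Print the names of the heartbeat regions (UUID directories) found under
# a configfs heartbeat directory such as
# /sys/kernel/config/cluster/CLUSTER/heartbeat.
hb_list_regions() {
    for entry in "$1"/*; do
        # Regions are directories; dead_threshold is a plain file.
        if [ -d "$entry" ]; then
            basename "$entry"
        fi
    done
    return 0
}

# Print the reference count from one line of `ocfs2_hb_ctl -I` output,
# assuming the form "<32 hex digits>: <N> refs" seen in this thread.
# Prints nothing when the line does not match (for example, an error).
hb_refs_from_info() {
    printf '%s\n' "$1" |
        sed -n 's/^[0-9A-Fa-f]\{32\}: \([0-9][0-9]*\) refs$/\1/p'
}

# Possible use on a cluster node (left commented out here):
#   for uuid in $(hb_list_regions /sys/kernel/config/cluster/CLUSTER/heartbeat); do
#       hb_refs_from_info "$(ocfs2_hb_ctl -I -u "$uuid")"
#   done
```

A region that still reports one or more refs is one that a manual delete would not remove.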

On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
> OK, I rebooted one of the nodes (both had similar issues), but something is still fishy.
> - I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
> - I unmounted it: umount /mnt/tmp/
> - I tried to stop o2cb: /etc/init.d/o2cb stop
> Stopping O2CB cluster CLUSTER: Failed
> Unable to stop cluster as heartbeat region still active
> - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
> -  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
> - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
> /sys/kernel/config/cluster/CLUSTER/heartbeat/:
> total 0
> drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
> -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
>
> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
> total 0
> -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
> -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
> -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
> -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
> -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
>
> - i cannot manually delete /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
>
> PS: i'm going to sleep now, i have to be up in a few hours. We can continue tomorrow if it's ok with you.
> Thank you for your help.
>
> Laurentiu.
>
> On 10/19/2011 01:33, Sunil Mushran wrote:
>> One way this can happen is if one starts the hb manually and then
>> force-formats that volume. The format generates a new uuid. Once that
>> happens, the hb tool cannot map the region to the device and thus fails
>> to stop it. Right now the easiest option on this box is resetting it.
>>
>> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>>> Yes, I did reformat it (more than once, I think, last week). This is a pre-production system and I'm trying various options before moving into real life.
>>>
>>>
>>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>>> Did you reformat the volume recently? or, when did you format last?
>>>>
>>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>>> well..this is weird
>>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>> 918673F06F8F4ED188DDCE14F39945F6  dead_threshold
>>>>>
>>>>> looks like we have different UUIDs. Where is this coming from??
>>>>>
>>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>>
>>>>>
>>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>>> Let's do it by hand.
>>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>
>>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>>>  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>>
>>>>>>> No improvement :(
>>>>>>>
>>>>>>>
>>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>>> See if this cleans it up.
>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>
>>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>
>>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>> Device                FS     Stack  UUID                              Label
>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>>
>>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>>> Device                FS     Nodes
>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
>>>>>>>>>>>
>>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>>
>>>>>>>>>>> By the way, there is no /dev/dm-2
>>>>>>>>>>>  ls /dev/dm-*
>>>>>>>>>>> /dev/dm-0  /dev/dm-1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>>> could be the same that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>>
>>>>>>>>>>>> Do:
>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>
>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>
>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Be careful before killing. We want to be sure that dev is not mounted.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>> Again, the outputs:
>>>>>>>>>>>>>>>  cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>>> --->here should be volgr1-lvol0 i guess?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think i have to enable debug first somehow..?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root    0 Oct 11 20:23 node
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> I have a 2-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>>> My problem is that every time I try to run /etc/init.d/o2cb stop
>>>>>>>>>>>>>>>>>>> it fails with this error:
>>>>>>>>>>>>>>>>>>>       Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>>       Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop the heartbeat
>>>>>>>>>>>>>>>>>>> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
>>>>>>>>>>>>>>>>>>> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>>> But even when the refs number is zero, the "heartbeat region still
>>>>>>>>>>>>>>>>>>> active" error occurs.
>>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>>> Ocfs2-users mailing list
>>>>>>>>>>>>>>>>>>> Ocfs2-users at oss.oracle.com
>>>>>>>>>>>>>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
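
The diagnosis in this thread (a heartbeat region whose UUID no longer matches any volume after a reformat) can be sketched as a small check. This is only an illustration: it assumes the `mounted.ocfs2 -d` column layout quoted above (Device, FS, Stack, UUID, Label), and the function names `uuids_from_mounted` and `stale_hb_regions` are hypothetical, not part of ocfs2-tools.

```shell
# Hypothetical helpers; not part of ocfs2-tools.

# Read `mounted.ocfs2 -d` output on stdin and print the UUID column.
# Assumes a header line followed by rows of: Device FS Stack UUID Label.
uuids_from_mounted() {
    awk 'NR > 1 && NF >= 4 { print $4 }'
}

# Print every heartbeat region whose name matches none of the given UUIDs.
# $1: configfs heartbeat directory; remaining arguments: known volume UUIDs.
stale_hb_regions() {
    hb_dir=$1
    shift
    for entry in "$hb_dir"/*; do
        # Skip plain files such as dead_threshold.
        [ -d "$entry" ] || continue
        region=${entry##*/}
        stale=yes
        for uuid in "$@"; do
            if [ "$uuid" = "$region" ]; then
                stale=no
            fi
        done
        if [ "$stale" = yes ]; then
            printf '%s\n' "$region"
        fi
    done
    return 0
}

# Possible use on a cluster node (left commented out here):
#   stale_hb_regions /sys/kernel/config/cluster/CLUSTER/heartbeat \
#       $(mounted.ocfs2 -d | uuids_from_mounted)
```

Any region this prints has a UUID that no detected volume carries, which is the situation where the hb tool cannot map the region back to a device.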
