[Ocfs2-users] Unable to stop cluster as heartbeat region still active

Laurentiu Gosu lg at easic.ro
Tue Oct 18 15:54:47 PDT 2011


OK, I rebooted one of the nodes (both had similar issues), but 
something is still fishy.
- I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
- I unmounted it: umount /mnt/tmp/
- I tried to stop o2cb: /etc/init.d/o2cb stop
Stopping O2CB cluster CLUSTER: Failed
Unable to stop cluster as heartbeat region still active
- ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
-  ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
- ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
/sys/kernel/config/cluster/CLUSTER/heartbeat/:
total 0
drwxr-xr-x 2 root root    0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
-rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold

/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
total 0
-rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
-rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
-rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
-r--r--r-- 1 root root 4096 Oct 19 01:50 pid
-rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block

- I cannot manually delete 
/sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
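
Each heartbeat region directory listed above has a dev attribute, and further down the thread that attribute named a device (dm-2) that did not exist under /dev. A minimal sketch of that check follows. The function name check_region_dev and its arguments are hypothetical, not part of ocfs2-tools, and the directories are passed in so it can be tried against a scratch tree first.

```shell
#!/bin/sh
# Sketch: report which device a heartbeat region claims to use and whether
# a matching node exists. Hypothetical helper, not part of ocfs2-tools.
#
# $1: heartbeat dir (e.g. /sys/kernel/config/cluster/CLUSTER/heartbeat)
# $2: region UUID
# $3: device dir (normally /dev)
check_region_dev() {
    attr="$1/$2/dev"
    if [ ! -r "$attr" ]; then
        echo "no readable dev attribute for region $2"
        return 2
    fi
    # The attribute holds a bare device name such as "dm-2".
    name=$(cat "$attr")
    if [ -e "$3/$name" ]; then
        echo "region $2 -> $3/$name (present)"
    else
        echo "region $2 -> $3/$name (missing)"
        return 1
    fi
}
```

On a live node this might be called as: check_region_dev /sys/kernel/config/cluster/CLUSTER/heartbeat 0C4AB55FE9314FA5A9F81652FDB9B22D /dev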

PS: I'm going to sleep now; I have to be up in a few hours. We can 
continue tomorrow if that's OK with you.
Thank you for your help.

Laurentiu.
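
The explanation quoted below (a reformat gives the volume a new UUID, so the old heartbeat region can no longer be mapped to a device) suggests comparing the region names in configfs with the UUIDs of the volumes the node can currently see. A rough sketch under that assumption follows. The function names and the "known UUIDs" file are hypothetical, not part of ocfs2-tools.

```shell
#!/bin/sh
# Sketch: list heartbeat regions whose UUID matches no known OCFS2 volume.
# Hypothetical helpers, not part of ocfs2-tools.

# Print the UUID-named region directories under a heartbeat directory.
# $1: heartbeat dir (e.g. /sys/kernel/config/cluster/CLUSTER/heartbeat)
list_hb_regions() {
    for entry in "$1"/*; do
        # Regions are directories; attributes like dead_threshold are files.
        [ -d "$entry" ] && basename "$entry"
    done
    return 0
}

# Print every region under $1 that is not listed in $2,
# a file holding one known volume UUID per line.
find_orphan_regions() {
    list_hb_regions "$1" | while read -r uuid; do
        grep -qxF "$uuid" "$2" || echo "$uuid"
    done
}
```

The known-UUID file could be built from the UUID column of "mounted.ocfs2 -d", for example with awk 'NR > 1 { print $4 }', assuming the column layout shown in this thread. Any UUID the sketch prints is a region that the tools can probably no longer stop by device lookup.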

On 10/19/2011 01:33, Sunil Mushran wrote:
> One way this can happen is if one starts the hb manually and then force
> formats that volume. The format will generate a new uuid. Once that
> happens, the hb tool cannot map the region to the device and thus fails
> to stop it. Right now the easiest option on this box is resetting it.
>
> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>> Yes, I did reformat it (more than once, I think, last week). This 
>> is a pre-production system and I'm trying various options before 
>> moving into real life.
>>
>>
>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>> Did you reformat the volume recently? or, when did you format last?
>>>
>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>> Well, this is weird:
>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>> 918673F06F8F4ED188DDCE14F39945F6  dead_threshold
>>>>
>>>> Looks like we have different UUIDs. Where is this one coming from?
>>>>
>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>
>>>>
>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>> Let's do it by hand.
>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>
>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>
>>>>>> No improvement :(
>>>>>>
>>>>>>
>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>> See if this cleans it up.
>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>
>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>
>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>> Device                    FS     Stack  UUID                              Label
>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>
>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>> Device                    FS     Nodes
>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  ro02xsrv001
>>>>>>>>>>
>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>
>>>>>>>>>> By the way, there is no /dev/dm-2:
>>>>>>>>>> ls /dev/dm-*
>>>>>>>>>> /dev/dm-0  /dev/dm-1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>> could be the same that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>
>>>>>>>>>>> Do:
>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>
>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>> total 0
>>>>>>>>>>>>
>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>> total 0
>>>>>>>>>>>>
>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>
>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>
>>>>>>>>>>>>> Be careful before killing. We want to be sure that dev is 
>>>>>>>>>>>>> not mounted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>> Again, the outputs:
>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>> ---> this should be volgr1-lvol0, I guess?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I have to enable debug first somehow?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root    0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root    0 Oct 11 20:23 node
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root    0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> I have a two-node ocfs2 cluster running UEK 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>> My problem is that every time I try to run
>>>>>>>>>>>>>>>>>> /etc/init.d/o2cb stop
>>>>>>>>>>>>>>>>>> it fails with this error:
>>>>>>>>>>>>>>>>>>       Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>       Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop the heartbeat
>>>>>>>>>>>>>>>>>> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
>>>>>>>>>>>>>>>>>> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>> But even when the refs number is zero, the "heartbeat region still
>>>>>>>>>>>>>>>>>> active" error occurs.
>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> Ocfs2-users mailing list
>>>>>>>>>>>>>>>>>> Ocfs2-users at oss.oracle.com
>>>>>>>>>>>>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
