[Ocfs2-users] Unable to stop cluster as heartbeat region still active
Laurentiu Gosu
lg at easic.ro
Sun Oct 23 10:43:52 PDT 2011
hmm..
#ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
But:
#ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
I can still kill the ref using device name (-d).
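For anyone hitting the same thing, the by-device workaround could be wrapped roughly like this. It is only a sketch: the function names are made up for illustration, the device path is the one from my setup, and it assumes that a single -K call is enough, as it was in the 1-ref case above. It uses only the ocfs2_hb_ctl invocations already shown in this thread.

```shell
#!/bin/sh
# Sketch only. hb_refs_from_line and stop_o2cb_by_device are hypothetical
# helper names; they are not part of ocfs2-tools.

# Print N from a line of the form "<32-hex-digit UUID>: N refs".
# Prints nothing if the line has any other shape (e.g. an error message).
hb_refs_from_line() {
    printf '%s\n' "$1" |
        sed -n 's/^[0-9A-Fa-f]\{32\}: \([0-9][0-9]*\) refs$/\1/p'
}

# Drop a leftover heartbeat reference by device name, then stop o2cb.
stop_o2cb_by_device() {
    dev=$1
    refs=$(hb_refs_from_line "$(ocfs2_hb_ctl -I -d "$dev")")
    # Treat unparseable output as 0 refs and go straight to the stop attempt.
    if [ "${refs:-0}" -gt 0 ]; then
        ocfs2_hb_ctl -K -d "$dev" ocfs2 || return 1
    fi
    /etc/init.d/o2cb stop
}

# Example (as root, and only after confirming the volume is not mounted):
# stop_o2cb_by_device /dev/mapper/volgr1-lvol0
```

This only hides the symptom; it does not explain why umount leaves the reference behind.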
On 10/23/2011 17:57, Sunil Mushran wrote:
> I think it stops by uuid. So try doing this the next time.
> You are encountering some issue that we have not seen before.
> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D ocfs2
>
> On 10/23/2011 05:32 AM, Laurentiu Gosu wrote:
>> Hi Sunil,
>> Sorry for my late reply, I just had time today to start from scratch
>> and test.
>> I rebuilt my environment (2 nodes connected to a SAN via
>> iSCSI+multipath). I still have the issue that the heartbeat is active
>> after I umount my ocfs2 volume.
>> /etc/init.d/o2cb stop
>> Stopping O2CB cluster CLUST: Failed
>> Unable to stop cluster as heartbeat region still active
>>
>> ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0
>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>>
>> After I manually kill the ref (ocfs2_hb_ctl -K -d
>> /dev/mapper/volgr1-lvol0 ocfs2) I can stop o2cb successfully. I can
>> live with that, but why doesn't it stop automatically? As I
>> understand it, the heartbeat should be started and stopped when the
>> volume gets mounted/unmounted.
>>
>> br,
>> Laurentiu.
>>
>> On 10/19/2011 02:28, Sunil Mushran wrote:
>>> Manual delete will only work if there are no references. In your case
>>> there are references.
>>>
>>> You may want to start both nodes from scratch. Do not start/stop
>>> heartbeat manually. Also, do not force-format.
>>>
>>> On 10/18/2011 03:54 PM, Laurentiu Gosu wrote:
>>>> OK, I rebooted one of the nodes (both had similar issues). But
>>>> something is still fishy.
>>>> - I mounted the device: mount -t ocfs2 /dev/volgr1/lvol0 /mnt/tmp/
>>>> - I unmounted it: umount /mnt/tmp/
>>>> - tried to stop o2cb: /etc/init.d/o2cb stop
>>>> Stopping O2CB cluster CLUSTER: Failed
>>>> Unable to stop cluster as heartbeat region still active
>>>> - ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 1 refs
>>>> - ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>> - ls -Rl /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/:
>>>> total 0
>>>> drwxr-xr-x 2 root root 0 Oct 19 01:50 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:40 dead_threshold
>>>>
>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D:
>>>> total 0
>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 block_bytes
>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 blocks
>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 dev
>>>> -r--r--r-- 1 root root 4096 Oct 19 01:50 pid
>>>> -rw-r--r-- 1 root root 4096 Oct 19 01:50 start_block
>>>>
>>>> - I cannot manually delete
>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D/
>>>>
>>>> PS: I'm going to sleep now, I have to be up in a few hours. We can
>>>> continue tomorrow if it's ok with you.
>>>> Thank you for your help.
>>>>
>>>> Laurentiu.
>>>>
>>>> On 10/19/2011 01:33, Sunil Mushran wrote:
>>>>> One way this can happen is if one starts the hb manually and then
>>>>> force-formats that volume. The format will generate a new uuid. Once
>>>>> that happens, the hb tool cannot map the region to the device and thus
>>>>> fails to stop it. Right now the easiest option on this box is resetting it.
>>>>>
>>>>> On 10/18/2011 03:24 PM, Laurentiu Gosu wrote:
>>>>>> Yes, I did reformat it (even more than once, I think, last week).
>>>>>> This is a pre-production system and I'm trying various options
>>>>>> before moving into real life.
>>>>>>
>>>>>>
>>>>>> On 10/19/2011 01:19, Sunil Mushran wrote:
>>>>>>> Did you reformat the volume recently? or, when did you format last?
>>>>>>>
>>>>>>> On 10/18/2011 03:13 PM, Laurentiu Gosu wrote:
>>>>>>>> well..this is weird
>>>>>>>> ls /sys/kernel/config/cluster/CLUSTER/heartbeat/
>>>>>>>> 918673F06F8F4ED188DDCE14F39945F6 dead_threshold
>>>>>>>>
>>>>>>>> looks like we have different UUIDs. Where is this coming from??
>>>>>>>>
>>>>>>>> ocfs2_hb_ctl -I -u 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>> 918673F06F8F4ED188DDCE14F39945F6: 1 refs
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/19/2011 01:04, Sunil Mushran wrote:
>>>>>>>>> Let's do it by hand.
>>>>>>>>> rm -rf /sys/kernel/config/cluster/.../heartbeat/0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>
>>>>>>>>> On 10/18/2011 02:52 PM, Laurentiu Gosu wrote:
>>>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>> ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat
>>>>>>>>>>
>>>>>>>>>> No improvement :(
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/19/2011 00:50, Sunil Mushran wrote:
>>>>>>>>>>> See if this cleans it up.
>>>>>>>>>>> ocfs2_hb_ctl -K -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>
>>>>>>>>>>> On 10/18/2011 02:44 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>> 0C4AB55FE9314FA5A9F81652FDB9B22D: 0 refs
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/19/2011 00:43, Sunil Mushran wrote:
>>>>>>>>>>>>> ocfs2_hb_ctl -I -u 0C4AB55FE9314FA5A9F81652FDB9B22D
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/18/2011 02:40 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>> Device                    FS     Stack  UUID                              Label
>>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0  ocfs2  o2cb   0C4AB55FE9314FA5A9F81652FDB9B22D  ocfs2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> mounted.ocfs2 -f
>>>>>>>>>>>>>> Device FS Nodes
>>>>>>>>>>>>>> /dev/mapper/volgr1-lvol0 ocfs2 ro02xsrv001
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ro02xsrv001 = the other node in the cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the way, there is no /dev/dm-2
>>>>>>>>>>>>>> ls /dev/dm-*
>>>>>>>>>>>>>> /dev/dm-0 /dev/dm-1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/19/2011 00:37, Sunil Mushran wrote:
>>>>>>>>>>>>>>> So it is not mounted. But we still have a hb thread because
>>>>>>>>>>>>>>> hb could not be stopped during umount. The reason for that
>>>>>>>>>>>>>>> could be the same that causes ocfs2_hb_ctl to fail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do:
>>>>>>>>>>>>>>> mounted.ocfs2 -d
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/18/2011 02:32 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>> /sys/kernel/debug/ocfs2:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>> /sys/kernel/debug/o2dlm:
>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>> ocfs2_hb_ctl: Device name specified was not found while reading uuid
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is no /dev/dm-2 mounted.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 10/19/2011 00:27, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>> mount -t debugfs debugfs /sys/kernel/debug
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then list that dir.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>> ocfs2_hb_ctl -I -d /dev/dm-2
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Be careful before killing. We want to be sure that dev
>>>>>>>>>>>>>>>>> is not mounted.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 10/18/2011 02:23 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>> Again the outputs:
>>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>> dm-2
>>>>>>>>>>>>>>>>>> ---> this should be volgr1-lvol0, I guess?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/ocfs2: No such file or directory
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>> ls: /sys/kernel/debug/o2dlm: No such file or directory
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think I have to enable debug first somehow?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Laurentiu.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 10/19/2011 00:17, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>> cat /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6/dev
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, do:
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/ocfs2
>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/debug/o2dlm
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:14 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>> Here is the output:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 19 00:12 CLUSTER
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 fence_method
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 3 root root 0 Oct 19 00:12 heartbeat
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 idle_timeout_ms
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 keepalive_delay_ms
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 4 root root 0 Oct 11 20:23 node
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 reconnect_delay_ms
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 918673F06F8F4ED188DDCE14F39945F6
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dead_threshold
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/heartbeat/918673F06F8F4ED188DDCE14F39945F6:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 block_bytes
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 blocks
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 dev
>>>>>>>>>>>>>>>>>>>> -r--r--r-- 1 root root 4096 Oct 19 00:12 pid
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 start_block
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv001
>>>>>>>>>>>>>>>>>>>> drwxr-xr-x 2 root root 0 Oct 19 00:12 ro02xsrv002
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv001:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> /sys/kernel/config/cluster/CLUSTER/node/ro02xsrv002:
>>>>>>>>>>>>>>>>>>>> total 0
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_address
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 ipv4_port
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 local
>>>>>>>>>>>>>>>>>>>> -rw-r--r-- 1 root root 4096 Oct 19 00:12 num
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 10/19/2011 00:12, Sunil Mushran wrote:
>>>>>>>>>>>>>>>>>>>>> ls -lR /sys/kernel/config/cluster
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> What does this return?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 10/18/2011 02:05 PM, Laurentiu Gosu wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> I have a 2-node ocfs2 cluster running UEK
>>>>>>>>>>>>>>>>>>>>>> 2.6.32-100.0.19.el5,
>>>>>>>>>>>>>>>>>>>>>> ocfs2console-1.6.3-2.el5, ocfs2-tools-1.6.3-2.el5.
>>>>>>>>>>>>>>>>>>>>>> My problem is that every time I try to run
>>>>>>>>>>>>>>>>>>>>>> /etc/init.d/o2cb stop
>>>>>>>>>>>>>>>>>>>>>> it fails with this error:
>>>>>>>>>>>>>>>>>>>>>> Stopping O2CB cluster CLUSTER: Failed
>>>>>>>>>>>>>>>>>>>>>> Unable to stop cluster as heartbeat region still active
>>>>>>>>>>>>>>>>>>>>>> There is no active mount point. I tried to manually stop the heartbeat
>>>>>>>>>>>>>>>>>>>>>> with "ocfs2_hb_ctl -K -d /dev/mapper/volgr1-lvol0 ocfs2" (after finding
>>>>>>>>>>>>>>>>>>>>>> the refs number with "ocfs2_hb_ctl -I -d /dev/mapper/volgr1-lvol0").
>>>>>>>>>>>>>>>>>>>>>> But even when the refs number is zero, the "heartbeat region still
>>>>>>>>>>>>>>>>>>>>>> active" error occurs.
>>>>>>>>>>>>>>>>>>>>>> How can I fix this?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>>>>>>>>>> Laurentiu.