[Ocfs-users] Hard system restart when DRBD connection fails while in use

Sun Sep 7 17:55:16 PDT 2008

I realise the timeout is configurable - how will the cluster hang for
two days? I don't understand.

If one node in the (2-node) cluster dies, surely the other one should
just be able to continue? When the failed node comes back, DRBD will
overwrite its shared block device (the OCFS2 drive) with the contents of
the active host.
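
For reference, a minimal sketch of how a disk heartbeat timeout maps onto
the configured value. It assumes the relationship given in the OCFS2 1.2
FAQ, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds; the function
names are made up for illustration, and where the variable is set
(commonly /etc/default/o2cb on Debian/Ubuntu) should be checked against
your own install.

```shell
#!/bin/sh
# Sketch only. Assumes: timeout_seconds = (threshold - 1) * 2,
# as described in the OCFS2 1.2 FAQ. Function names are hypothetical.

# Convert a desired timeout in seconds to an O2CB_HEARTBEAT_THRESHOLD value.
threshold_for_timeout() {
    echo $(( $1 / 2 + 1 ))
}

# Convert an O2CB_HEARTBEAT_THRESHOLD value back to a timeout in seconds.
timeout_for_threshold() {
    echo $(( ($1 - 1) * 2 ))
}

# A 60-second timeout versus the 2-day timeout discussed in this thread.
echo "60 s      -> threshold $(threshold_for_timeout 60)"
echo "2 days    -> threshold $(threshold_for_timeout $(( 2 * 24 * 60 * 60 )))"
```

Under that assumption, a 2-day timeout means the surviving node would
also wait 2 days before treating its peer as dead.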

Sunil Mushran wrote:
> The fencing mechanism is meant to avoid disk corruptions. If you extend
> the disk heartbeat to 2 days, then if a node dies, the cluster will hang
> for 2 days. The timeout is configurable. Details are in the 1.2 FAQ and
> the 1.4 user's guide.
>
> Henri Cook wrote:
>> Dear Sunil,
>>
>> It is OCFS2. I found the code: it's the self-fencing mechanism that
>> simply reboots the node. If I alter the OCFS2 timeout, the reboot is
>> delayed by that many seconds. It's a real shame; I'm going to have to
>> try to work with it, probably by extending the node timeout to 2 days
>> or something. With DRBD I don't see the need for OCFS2 to be rebooting
>> or anything really, as DRBD takes care of block device synchronisation.
>> I just wish this behaviour was configurable!
>>
>> Henri
>>
>> Sunil Mushran wrote:
>>  
>>> Repeat the test. This time run the following on Node A
>>> after you have killed Node B.
>>>
>>> $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
>>>
>>> If we are lucky we'll get to see where that process is waiting.
>>>
>>> Henri Cook wrote:
>>>    
>>>> Hi all,
>>>>
>>>> I have two nodes (A+B) sharing a DRBD device, formatted with OCFS2
>>>> and mounted on /shared.
>>>>
>>>> If I start, say, an FTP file transfer to my DRBD /shared directory
>>>> on node A, then reboot node B (the other machine in a Primary-Primary
>>>> DRBD configuration) while the transfer is in progress, node A stops
>>>> at about the same time that DRBD notices the connection with node B
>>>> has been lost, crippling both machines for the time it takes to
>>>> reboot. If the drive is inactive (i.e. nothing is being written to
>>>> it), this does not occur.
>>>>
>>>> My question, then: could the OCFS2 tools be the source of these
>>>> reboots? Is there any such default action configured? If so, how
>>>> would I go about investigating or altering it? There are no log
>>>> entries about the reboot to speak of.
>>>>
>>>> The OS is Ubuntu Server 8.04 (Hardy) with ocfs2-tools 1.3.9-0ubuntu1.
>>>>
>>>> Thanks in advance,
>>>>
>>>> Henri