[Ocfs-users] Hard system restart when DRBD connection fails while in use

Sun Sep 7 18:05:33 PDT 2008

What's the ps output?

My suspicion is that DRBD is blocking the I/Os, including the
disk heartbeat I/O, which leads to the fence.
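
A rough way to narrow the ps output down to processes stuck in
uninterruptible sleep (STAT starting with "D", which usually means
waiting on I/O). This is only a sketch assuming procps ps and awk;
blocked_procs is a made-up helper name, not an OCFS2 or DRBD tool.

```shell
#!/bin/sh
# Hypothetical helper: reads ps output on stdin and keeps the header
# line plus every process whose STAT column (field 2) begins with "D".
blocked_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Usage, on Node A after Node B has been killed:

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | blocked_procs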

Henri Cook wrote:
> I realise the timeout is configurable, but how will the cluster hang
> for two days? I don't understand.
>
> If one node in the (2-node) cluster dies, the other one should surely
> just be able to continue? When the other node comes back, its shared
> block device (the OCFS2 drive) will be overwritten by DRBD with the
> contents of the active host.
>
> Sunil Mushran wrote:
>
>> The fencing mechanism is meant to avoid disk corruption. If you
>> extend the disk heartbeat timeout to 2 days, then if a node dies,
>> the cluster will hang for 2 days. The timeout is configurable.
>> Details are in the 1.2 FAQ and the 1.4 user's guide.
>>
>> Henri Cook wrote:
>>
>>> Dear Sunil,
>>>
>>> It is OCFS2. I found the code: it's the self-fencing mechanism that
>>> simply reboots the node. If I alter the OCFS2 timeout, the reboot is
>>> delayed by that many seconds. It's a real shame; I'm going to have
>>> to try to work with it, probably by extending the node timeout to
>>> 2 days or something. With DRBD I don't see the need for OCFS2 to be
>>> rebooting or anything really, as DRBD takes care of block device
>>> synchronisation. I just wish this behaviour was configurable!
>>>
>>> Henri
>>>
>>> Sunil Mushran wrote:
>>>
>>>> Repeat the test. This time run the following on Node A
>>>> after you have killed Node B.
>>>>
>>>> $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
>>>>
>>>> If we are lucky we'll get to see where that process is waiting.
>>>>
>>>> Henri Cook wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have two nodes (A and B) running a DRBD-backed file system
>>>>> (OCFS2) on /shared.
>>>>>
>>>>> If I start, say, an FTP file transfer to my DRBD /shared directory
>>>>> on node A, then reboot node B (the other machine in a
>>>>> Primary-Primary DRBD configuration) while the transfer is in
>>>>> progress, node A stops at around the same time that DRBD notices
>>>>> the connection with node B has been lost, crippling both machines
>>>>> for the time it takes to reboot. If the drive is inactive (i.e.
>>>>> nothing is being written to it), this does not occur.
>>>>>
>>>>> My question, then: could the OCFS2 tools be the source of these
>>>>> reboots? Is there any such default action configured? If so, how
>>>>> would I go about investigating or altering it? There are no log
>>>>> entries about the reboot to speak of.
>>>>>
>>>>> OS is Ubuntu Hardy (Server) 8.04, with ocfs2-tools 1.3.9-0ubuntu1.
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Henri
>>>>>
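
For the configurable timeout discussed above, the rule as I recall it
from the 1.2 FAQ is threshold = (timeout in seconds / 2) + 1, set as
O2CB_HEARTBEAT_THRESHOLD in the o2cb configuration. A minimal sketch of
that arithmetic follows; hb_threshold is a hypothetical helper name,
and the formula should be checked against the FAQ for the version in
use.

```shell
#!/bin/sh
# Hypothetical helper: convert a desired disk heartbeat timeout in
# seconds to an O2CB_HEARTBEAT_THRESHOLD value, assuming the rule
# threshold = timeout / 2 + 1 (one heartbeat every 2 seconds).
hb_threshold() {
    echo $(( $1 / 2 + 1 ))
}

hb_threshold 60        # 60-second timeout -> prints 31
hb_threshold 172800    # 2 days in seconds -> prints 86401
```

A 2-day value would delay fencing, but as noted in the thread, the
cluster would then hang for that long whenever a node really dies.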