[Ocfs-users] Hard system restart when DRBD connection fails while in use

Sunil Mushran sunil.mushran at oracle.com
Sun Sep 7 20:46:45 PDT 2008


60 secs is the current default for the disk heartbeat timeout. It has
been like that for a long time now.
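
For reference, the timeout is derived from O2CB_HEARTBEAT_THRESHOLD as
(threshold - 1) * 2 seconds, so the default threshold of 31 gives the
60 secs above. A minimal sketch of checking and raising it on an Ubuntu
node (the cluster name "ocfs2" is an assumption, substitute your own;
unmount the ocfs2 volume before restarting the stack):

$ # threshold currently in effect on the running cluster stack
$ cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold

$ # persist a larger value, e.g. 61 for a ~120 sec timeout, assuming
$ # the variable is not already set in /etc/default/o2cb
$ echo 'O2CB_HEARTBEAT_THRESHOLD=61' | sudo tee -a /etc/default/o2cb
$ sudo /etc/init.d/o2cb restart

$ # alternatively, the interactive configurator prompts for the threshold
$ sudo /etc/init.d/o2cb configure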

Henri Cook wrote:
> So my timeout was 7 seconds before, which means my node A shuts down very
> quickly after node B - it's now 30 seconds, so after I've shut down B,
> once A has noticed that node B is gone - that's when I'd run that command,
> e.g. within the 30-second timeout?
>
> It's interesting to note that if I simply reboot node B with a long
> timeout (e.g. 30 seconds), normal operation resumes when it comes back -
> which is what led me to believe we could extend this to a couple of days
> or more.
>
> Sunil Mushran wrote:
>> What's the ps output?
>>
>> My suspicion is that drbd is blocking the I/Os, including the
>> disk heartbeat I/O, leading to the fence.
>>
>> Henri Cook wrote:
>>> I realise the timeout is configurable - how will the cluster hang for
>>> two days? I don't understand.
>>>
>>> If one node in the (2-node) cluster dies, surely the other one should
>>> just be able to continue? When the other node comes back, its shared
>>> block device (the ocfs2 drive) will be overwritten by DRBD with the
>>> contents of the active host.
>>>
>>> Sunil Mushran wrote:
>>>> The fencing mechanism is meant to avoid disk corruption. If you
>>>> extend the disk heartbeat to 2 days, then if a node dies, the cluster
>>>> will hang for 2 days. The timeout is configurable. Details are in the
>>>> 1.2 FAQ and the 1.4 user's guide.
>>>>
>>>> Henri Cook wrote:
>>>>> Dear Sunil,
>>>>>
>>>>> It is OCFS2 - I found the code; it's the self-fencing mechanism that
>>>>> simply reboots the node. If I alter the OCFS2 timeout, the reboot is
>>>>> delayed by that many seconds. It's a real shame; I'm going to have to
>>>>> try to work with it - probably by extending the node timeout to 2 days
>>>>> or something. With DRBD I don't see the need for OCFS2 to be rebooting
>>>>> or anything really, as DRBD takes care of block device
>>>>> synchronisation - I just wish this behaviour was configurable!
>>>>>
>>>>> Henri
>>>>>
>>>>> Sunil Mushran wrote:
>>>>>> Repeat the test. This time run the following on Node A
>>>>>> after you have killed Node B.
>>>>>>
>>>>>> $ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN
>>>>>>
>>>>>> If we are lucky, we'll get to see where that process is waiting.
>>>>>>
>>>>>> Henri Cook wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have two nodes (A+B) running an OCFS2 file system on a DRBD
>>>>>>> device, mounted at /shared.
>>>>>>>
>>>>>>> If I start, say, an FTP file transfer to my DRBD /shared directory
>>>>>>> on node A, then reboot node B (the other machine in the
>>>>>>> Primary-Primary DRBD configuration) while the transfer is in
>>>>>>> progress, node A stops at about the same time that DRBD notices the
>>>>>>> connection with node B has been lost (hence crippling both machines
>>>>>>> for the time it takes to reboot). If the drive is inactive (i.e.
>>>>>>> nothing is being written to it) then this does not occur.
>>>>>>>
>>>>>>> My question, then, is: could the OCFS2 tools be the source of these
>>>>>>> reboots? Is there any such default action configured? If so, how
>>>>>>> would I go about investigating/altering it? There are no log
>>>>>>> entries about the reboot to speak of.
>>>>>>>
>>>>>>> OS is Ubuntu Hardy (Server) 8.04 and ocfs2-tools 1.3.9-0ubuntu1
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> Henri
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Ocfs-users mailing list
>>>>>>> Ocfs-users at oss.oracle.com
>>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs-users



