[Ocfs2-users] Failover testing problem and a heartbeat question

Daniel McDonald wasade at gmail.com
Wed May 26 13:39:09 PDT 2010


> When a node dies, the cluster ops pause for the node to be first
> declared dead followed by recovery. Threshold governs the time
> it takes to declare the node dead. The higher the value, the longer
> the pause.

Okay, thank you.

> ocfs2 does not reset without a log message. Do you have netconsole
> set up? Messages logged a tick before reset can only be captured by
> netconsole/kdump etc.

Unfortunately, no. Here are the two lines in /var/log/messages prior to the
unintended reboot and the subsequent syslog restart:

May 25 22:26:03 ST2540_X4450_1 kernel: ocfs2_dlm: Nodes in domain ("7CCC109F8F16433DB7DB79526A29375A"): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
May 26 04:05:27 ST2540_X4450_1 init: Trying to re-exec init
May 26 11:49:31 ST2540_X4450_1 syslogd 1.4.1: restart.
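For next time, capturing those final pre-reset messages would require netconsole pointed at a remote log host. A minimal sketch, noting that the interface name (eth0), target IP (192.168.1.10), and target MAC below are placeholders rather than values from our setup:

```shell
# Stream kernel messages over UDP to a remote log host.
# Parameter format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
modprobe netconsole netconsole=@/eth0,514@192.168.1.10/00:11:22:33:44:55

# On the log host, capture the stream with a simple UDP listener, e.g.:
# nc -l -u 514
```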

At approximately 11:46:34, a fibre cable was intentionally pulled out of the
SAN. Prior to that, all 15 OCFS2 nodes were performing I/O operations
on OCFS2 volumes on the SAN. Eight or so nodes fenced, but two simply
rebooted.

Any ideas? I'm curious whether you believe this reboot could be attributed
to OCFS2 or to a separate issue. I was surprised to see some machines
log fencing messages while these two logged nothing.

FYI, the same test, when performed with a disk heartbeat threshold of 61,
did not result in any nodes dropping off.
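For anyone tuning the same knob: the threshold is an iteration count, and per the OCFS2 FAQ it maps to a dead-node timeout of (threshold - 1) * 2 seconds, so 31 is roughly 60 s and 61 roughly 120 s. A sketch of the arithmetic and where the value lives, assuming the usual /etc/sysconfig/o2cb location on EL-based distros:

```shell
# O2CB_HEARTBEAT_THRESHOLD is set in /etc/sysconfig/o2cb (path may vary):
O2CB_HEARTBEAT_THRESHOLD=61

# Seconds of missed heartbeats before a node is declared dead:
echo $(( (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 ))   # 61 -> 120; the default 31 -> 60
```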

Thank you for the quick response
-Daniel

> On 05/26/2010 12:53 PM, Daniel McDonald wrote:
>> We have a setup with 15 hosts fibre-attached via a switch to a common
>> SAN. Each host has a single fibre port; the SAN has two controllers,
>> each with two ports. The SAN exposes four OCFS2 v1.4.2 volumes. While
>> performing a failover test, we observed 8 hosts fence and 2 reboot
>> _without_ fencing. The OCFS2 FAQ recommends a disk heartbeat threshold
>> of 31 - 61 loops for multipath I/O users. Our initial thought was to
>> increase the default from 31 to 61.
>> 
>> I have two hopefully simple questions. First, is there any reason why we would not want to increase the threshold to 61? Performance or otherwise?
>> 
>> Second, is there any scenario in which, during I/O operations, the failure of a single fibre path (out of four) would cause an OCFS2 node to reset itself without _any_ kernel log message?
>> 
>> Thank you for your time
>> -Daniel
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>   
> 



