[Ocfs2-users] Failover testing problem and a heartbeat question

Wed May 26 13:22:36 PDT 2010

When a node dies, cluster operations pause while the node is first
declared dead and then recovered. The heartbeat threshold governs the
time it takes to declare the node dead: the higher the value, the
longer the pause.
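
To put numbers on that pause, here is a rough sketch of the arithmetic.
It assumes the relationship given in the OCFS2 FAQ, where the timeout
is (threshold - 1) * 2 seconds because the disk heartbeat runs every
2 seconds. The function name and the interval default are only
illustrative; check them against the FAQ for your release.

```python
def heartbeat_timeout_seconds(threshold: int, interval_s: int = 2) -> int:
    """Approximate time before a silent node is declared dead.

    Assumption (from the OCFS2 FAQ, not from this thread): a node is
    declared dead after (threshold - 1) missed heartbeat iterations,
    with one iteration every `interval_s` seconds.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_s


# Compare the two values discussed in this thread.
for threshold in (31, 61):
    seconds = heartbeat_timeout_seconds(threshold)
    print(f"threshold={threshold}: about {seconds} s before the node is declared dead")
```

Under that assumption, going from 31 to 61 roughly doubles the time the
surviving nodes wait before recovery can start.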

ocfs2 does not reset a node without logging a message. Do you have
netconsole set up? Messages logged a tick before the reset can only be
captured by netconsole, kdump, or a similar mechanism.
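
On the receiving side, netconsole output is just UDP datagrams, so any
listener on a separate log host can capture it. A minimal sketch of
such a listener follows. The port 6666 is netconsole's usual default
target port, but treat it as an assumption and match it to whatever
the sending nodes are configured with; the function names are made up
for this example.

```python
import socket


def open_listener(port: int = 6666, host: str = "0.0.0.0") -> socket.socket:
    """Bind a UDP socket for incoming netconsole datagrams.

    Port 6666 is assumed here; use the target port configured on the
    cluster nodes.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_messages(sock, max_messages=None, timeout=None):
    """Yield (sender_ip, text) pairs until the limit or timeout is hit."""
    sock.settimeout(timeout)  # None means block until a datagram arrives
    count = 0
    while max_messages is None or count < max_messages:
        try:
            data, (sender_ip, _port) = sock.recvfrom(65535)
        except socket.timeout:
            return
        count += 1
        # Kernel messages are normally ASCII; replace anything undecodable.
        yield sender_ip, data.decode("utf-8", errors="replace").rstrip("\n")
```

To use it, run something like
"for ip, line in read_messages(open_listener()): print(ip, line)" on
the log host and leave it running during the failover test, so the
last messages from a node that resets are kept off-box.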

On 05/26/2010 12:53 PM, Daniel McDonald wrote:
> We have a setup with 15 hosts fibre attached via a switch to a common SAN. Each host has a single fibre port, the SAN has two controllers each with two ports. The SAN is exposing four OCFS2 v1.4.2 volumes. While performing a failover test, we observed 8 hosts fence and 2 reboot _without_ fencing. The OCFS2 FAQ recommends a default disk heartbeat of 31 - 61 loops for multipath io users. Our initial thought was to increase the default from 31 to 61.
>
> I have two hopefully simple questions. First, is there any reason why we would not want to increase the threshold to 61? Performance or otherwise?
>
> Second, is there any circumstance in which an OCFS2 node, during IO operations and while experiencing a single fibre path (out of 4) failure, would reset itself without _any_ kernel log message?
>
> Thank you for your time
> -Daniel