[Ocfs2-users] Heartbeat Timeout Threshold

Luis Freitas lfreitas34 at yahoo.com
Mon Aug 10 13:15:54 PDT 2009


Brett,

   Here are my two cents on this subject. If I am not mistaken, there are two heartbeats, one on the network and one on the disk. A failure of either one will cause a node to be evicted.
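
   If I remember correctly, both timeouts are tunable in /etc/sysconfig/o2cb, and the disk heartbeat timeout works out to (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the default of 31 gives about 60 seconds and Brett's 61 gives about 120 seconds. Something along these lines (parameter names from memory, so check the file shipped with your o2cb version):

        # /etc/sysconfig/o2cb (illustrative values)
        O2CB_HEARTBEAT_THRESHOLD=61   # disk heartbeat: (61 - 1) * 2 = 120 s
        O2CB_IDLE_TIMEOUT_MS=30000    # network heartbeat idle timeout, 30 s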

    If you have network bonding, then depending on your configuration and the network topology, when one path fails the switches might have to relearn the path to that MAC address, and that can take some time. This can be reduced by forcing ARP broadcasts on the new path, so that the network equipment between your servers reconfigures itself faster.
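
    With active-backup bonding the Linux bonding driver normally sends a gratuitous ARP when it fails over to the backup slave, and on newer kernels you can raise the number it sends. A minimal sketch, with illustrative addresses, and with num_grat_arp as an optional parameter that older bonding drivers may not have (on older RHEL/CentOS releases the bonding options go in /etc/modprobe.conf instead of BONDING_OPTS):

        # /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative only)
        DEVICE=bond0
        ONBOOT=yes
        BOOTPROTO=static
        IPADDR=192.168.10.11
        NETMASK=255.255.255.0
        # mode=active-backup fails over between slaves; miimon polls link
        # state every 100 ms; num_grat_arp (newer drivers only) sends
        # extra gratuitous ARPs after a failover
        BONDING_OPTS="mode=active-backup miimon=100 num_grat_arp=5"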

    For the disk heartbeat, assuming that:

- You have dual FC cards on each server
- You have dual FC switches connected to each other
- You have a storage array with two or more FC ports, connected to the switches.

   You have an FC card timeout (probably set in the HBA firmware or in the driver), a multipath timeout, and a storage controller timeout.

   Each of these needs to be smaller than your cluster heartbeat timeout in order for all the nodes to survive a component failure. For example, an EMC array has two internal storage processors (SPs), and an SP failover can take on the order of two minutes.

   During this time the LUNs routed through the failed SP will be unresponsive, and the FC card in your server will not report this to the OS until its own timeout is reached. After the failover inside the EMC array, the multipath software on your server also needs to establish a new path to the LUN through the surviving SP.
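
   Just as a rough illustration (the exact parameter names depend on your HBA driver and multipath stack, so treat these as examples rather than recommendations), on a Linux host with QLogic HBAs and dm-multipath the knobs involved look something like this:

        # /etc/modprobe.conf - QLogic example: how many times I/O to a
        # downed port is retried before the error is passed up to the
        # multipath layer
        options qla2xxx qlport_down_retry=10

        # /etc/multipath.conf - how long dm-multipath keeps queueing I/O
        # while it waits for a usable path to come back
        defaults {
                polling_interval  5
                no_path_retry     12    # 12 retries * 5 s = 60 s
        }

   Each of these intervals has to fit inside the cluster heartbeat timeout, otherwise a node will fence itself while the storage is still in the middle of its own failover.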

Best Regards,
Luis Freitas



--- On Fri, 8/7/09, Brett Worth <brett at worth.id.au> wrote:

> From: Brett Worth <brett at worth.id.au>
> Subject: [Ocfs2-users] Heartbeat Timeout Threshold
> To: ocfs2-users at oss.oracle.com
> Date: Friday, August 7, 2009, 9:29 PM
> I've been using OCFS2 on a 3-way
> CentOS 5.2 Xen cluster for a while now, using it to share
> the VM disk images.  This way I can have
> live and transparent VM migration.
> 
> I'd been having intermittent (every 2-3 weeks) incidents
> where a server would self-fence.
> After configuring netconsole I managed to see that the
> fencing was due to a heartbeat
> threshold timeout, so I have now increased all three servers
> to a threshold of 61, i.e.
> 2 minutes, from the default of 31, i.e. 1 minute.  So far
> there have been no panics.
> 
> I do have a couple of questions though:
> 
> 1. To get this timeout applied I had to have a complete
> cluster outage so that I could
> make all the changes simultaneously.  Making the
> change on a single node prevented it from
> joining in the fun.  Do all parameters really need to
> match before you can join?  The
> timeout threshold seems to be one that could differ from
> node to node.
> 
> 2. Even though this appears to have fixed the problem, 2
> minutes is a long time to wait
> for a heartbeat.  Even one minute seems like a very
> long time.  I assume that missing a
> heartbeat would be a symptom of a very busy filesystem, but
> for a packet to take over a
> minute to get over the wire is odd.  Or is it that the
> heartbeats are actually being lost
> for an extended period?  Is this a network
> problem?  All my nodes communicate heartbeat on
> a dedicated VLAN.
> 
> Regards
> Brett
> PS:  If anyone is planning to do Xen like this, my main
> piece of advice is that you must
> put a ceiling on how much RAM the Dom0 domain can
> use.  If you don't, it will expand to use
> all non-VM memory for buffer cache, so that when you try to
> migrate a VM to it there is
> no RAM left.
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
