[Ocfs2-users] Heartbeat Timeout Threshold

Raheel Akhtar rakhtar at ryerson.ca
Tue Aug 11 05:37:45 PDT 2009


Hi,

I am also facing the same problem: once or twice a week a node reboots, and it is hard to catch the reason, even though I have SAN storage with full redundancy.
We are now moving to NFS because we could not find out why it reboots, even with the heartbeat threshold increased to 61 on all nodes.
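
For reference, as far as I understand it the o2cb disk heartbeat timeout
works out to roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds:

    threshold 31 (the default)  ->  about 60 seconds
    threshold 61                ->  about 120 seconds

so even at 61 a node will still fence itself if the disk heartbeat stalls
for more than about two minutes.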

Raheel

----- Original Message -----
From: Luis Freitas <lfreitas34 at yahoo.com>
Date: Monday, August 10, 2009 4:16 pm
Subject: Re: [Ocfs2-users] Heartbeat Timeout Threshold
To: ocfs2-users at oss.oracle.com, brett at worth.id.au


> Brett,
>  
>     These are my two cents on this subject. If I am not mistaken, 
> there are two heartbeats, one on the network and one on the disk. 
> Failure on either one of them will cause a node to be evicted.
>  
>      If you have network bonding, depending on your configuration and 
> the network topology, when one path fails the switches might have to 
> reconfigure the path to that MAC address, and that could take some 
> time. This can be reduced by forcing ARP broadcasts on the new path, so 
> the network equipment between your servers can reconfigure itself faster.
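>  
>      Just as an illustration (the exact option names depend on your 
> kernel and bonding mode, so treat this as a sketch), gratuitous ARP on 
> failover for an active-backup bond might be configured roughly like 
> this in /etc/modprobe.conf:
> 
>     alias bond0 bonding
>     # ask the bonding driver to send a few gratuitous ARPs after a
>     # failover so the switches relearn the MAC on the new port quickly
>     options bonding mode=active-backup miimon=100 num_grat_arp=3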
>  
>      For the disk heartbeat, assuming that:
>  
>  - You have dual FC cards on each server
>  - You have dual FC switches connected to each other
>  - You have a storage array with two or more FC ports, connected to the switches.
>  
>     You have an FC card timeout (probably set in the HBA firmware or in 
> the driver), a multipath timeout, and a storage controller timeout.
>  
>     Each of these needs to be smaller than your cluster heartbeat 
> timeout in order for all the nodes to survive a component failure. For 
> example, an EMC storage array has two internal storage processors 
> (SPs), and an SP failover can take on the order of two minutes.
>  
>     During this time the LUNs that are routed through the failed SP 
> will be unresponsive, and the FC card on your server will not report 
> this to the O/S until its timeout is reached. After the failover 
> inside the EMC storage, the multipath software on your server will 
> also need to establish the new path to the LUN using the surviving SP. 
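>  
>      Just as a rough sketch (parameter names vary between HBA drivers 
> and multipath versions, so these are placeholders), the knobs involved 
> are usually along these lines:
> 
>     # /etc/multipath.conf - how long I/O is queued while all paths are down
>     defaults {
>         polling_interval  5
>         no_path_retry     12   # 12 retries x 5s polling = roughly 60s
>     }
> 
>     # /etc/modprobe.conf - QLogic example: how many times a command is
>     # retried against a port reporting PORT-DOWN before the I/O fails over
>     options qla2xxx qlport_down_retry=10
> 
>      Each of these has to expire, and the path fail over, before the 
> o2cb disk heartbeat gives up, otherwise the node fences itself even 
> though the storage eventually recovers.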
> 
>  
>  Best Regards,
>  Luis Freitas
>  
>  
>  
>  --- On Fri, 8/7/09, Brett Worth <brett at worth.id.au> wrote:
>  
>  > From: Brett Worth <brett at worth.id.au>
>  > Subject: [Ocfs2-users] Heartbeat Timeout Threshold
>  > To: ocfs2-users at oss.oracle.com
>  > Date: Friday, August 7, 2009, 9:29 PM
>  > I've been using OCFS2 on a 3-way
>  > CentOS 5.2 Xen cluster for a while now, using it to share
>  > the VM disk images.  This way I can have
>  > live and transparent VM migration.
>  > 
>  > I'd been having intermittent (every 2-3 weeks) incidents
>  > where a server would self-fence.
>  > After configuring netconsole I managed to see that the
>  > fencing was due to a heartbeat
>  > threshold timeout, so I have now increased all three servers
>  > to a threshold of 61 (i.e.
>  > 2 minutes) from the default of 31 (i.e. 1 minute).  So far
>  > there have been no panics.
>  > 
>  > I do have a couple of questions though:
>  > 
>  > 1. To get this timeout applied I had to have a complete
>  > cluster outage so that I could
>  > make all the changes simultaneously.  Making the
>  > change on a single node prevented it from
>  > joining in the fun.  Do all parameters really need to
>  > match before you can join?  The
>  > timeout threshold seems to be one that could differ from
>  > node to node.
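>  > 
>  > (For anyone following along: as far as I can tell the setting lives in 
>  > /etc/sysconfig/o2cb, and o2cb has to be restarted with the OCFS2 
>  > volumes unmounted for it to take effect, roughly:
>  > 
>  >     # /etc/sysconfig/o2cb
>  >     O2CB_HEARTBEAT_THRESHOLD=61
>  > 
>  >     # then, with the OCFS2 filesystems on that node unmounted:
>  >     /etc/init.d/o2cb restart
>  > 
>  > which is part of why changing it node by node was awkward.)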
>  > 
>  > 2. Even though this appears to have fixed the problem, 2
>  > minutes is a long time to wait
>  > for a heartbeat.  Even one minute seems like a very
>  > long time.  I assume that missing a
>  > heartbeat would be a symptom of a very busy filesystem, but
>  > for a packet to take over a
>  > minute to get over the wire is odd.  Or is it that the
>  > heartbeats are actually being lost
>  > for an extended period?  Is this a network
>  > problem?  All my nodes communicate heartbeat on
>  > a dedicated VLAN.
>  > 
>  > Regards
>  > Brett
>  > PS: If anyone is planning to do Xen like this, my main
>  > piece of advice is that you must
>  > put a ceiling on how much RAM the Dom0 domain can
>  > use.  If you don't, it will expand to use
>  > all non-VM memory for buffer cache, so that when you try to
>  > migrate a VM to that host there is
>  > no RAM left.
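>  > 
>  > (A sketch of how that ceiling can be set - the exact syntax depends on 
>  > your Xen version, so treat the values as placeholders:
>  > 
>  >     # /boot/grub/grub.conf - cap Dom0's memory at boot
>  >     kernel /xen.gz dom0_mem=1024M
>  > 
>  >     # /etc/xen/xend-config.sxp - don't let xend balloon Dom0 below this
>  >     (dom0-min-mem 1024)
>  > )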
>  > 
>  
>  _______________________________________________
>  Ocfs2-users mailing list
>  Ocfs2-users at oss.oracle.com
>  http://oss.oracle.com/mailman/listinfo/ocfs2-users
>  


