[Ocfs2-users] Heartbeat Timeout Threshold
Raheel Akhtar
rakhtar at ryerson.ca
Tue Aug 11 05:37:45 PDT 2009
Hi,
I am also facing the same problem: once or twice a week a node reboots, and it's hard to catch the reason, even though I have SAN storage with full redundancy.
We are now moving to NFS because we couldn't find the reason for the reboots, even after the heartbeat threshold was increased to 61 on all nodes.
Raheel
----- Original Message -----
From: Luis Freitas <lfreitas34 at yahoo.com>
Date: Monday, August 10, 2009 4:16 pm
Subject: Re: [Ocfs2-users] Heartbeat Timeout Threshold
To: ocfs2-users at oss.oracle.com, brett at worth.id.au
> Brett,
>
> These are my two cents on this subject. If I am not mistaken,
> there are two heartbeats, one on the network and one on the disk.
> Failure on either one of them will cause a node to be evicted.
>
> If you have network bonding then, depending on your configuration and
> the network topology, when one path fails the switches might have to
> relearn the path to that MAC address, and that can take some
> time. This delay can be reduced by forcing ARP broadcasts on the new
> path, so that the network equipment between your servers can
> reconfigure itself faster.
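>
> For example, with Linux active-backup bonding something along these
> lines in /etc/modprobe.conf makes the driver announce the new active
> slave with several gratuitous ARPs after a failover. This is only a
> sketch: num_grat_arp is a stock bonding module option, but check that
> the bonding driver in your kernel supports it, and the device name is
> just an example.
>
>     # /etc/modprobe.conf (fragment)
>     alias bond0 bonding
>     # active-backup mode with MII link monitoring every 100 ms;
>     # num_grat_arp=5 sends 5 gratuitous ARPs on the new active slave
>     options bond0 mode=active-backup miimon=100 num_grat_arp=5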
>
> For the disk heartbeat, assuming that:
>
> - You have dual FC cards on each server
> - You have dual FC switches connected to each other
> - You have a storage array with two or more FC ports, connected to the switches.
>
> You have an FC card timeout (probably set in the HBA firmware or in
> the driver), a multipath timeout, and a storage controller timeout.
>
> Each of these needs to be smaller than your cluster heartbeat
> timeout in order for all the nodes to survive a component failure. For
> example, an EMC storage array has two internal controllers (SPs), and
> an SP failover takes on the order of two minutes.
>
> During this time the LUNs that are routed through the failed SP
> will be unresponsive, and the FC card on your server will not report
> this to the O/S until its timeout is reached. After the failover
> inside the EMC storage, the multipath software on your server will
> also need to establish the new path to the LUN using the surviving SP.
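>
> As a rough illustration on Linux (the sysfs path and the values are
> examples, not recommendations), you can read how long the FC layer
> waits before declaring a remote port lost, and cap how long multipath
> queues I/O before failing a path, so the total stays below the O2CB
> timeout:
>
>     # seconds the FC transport waits before reporting a lost port
>     cat /sys/class/fc_remote_ports/rport-0:0-1/dev_loss_tmo
>
>     # /etc/multipath.conf (fragment)
>     defaults {
>         polling_interval 5    # path checker runs every 5 seconds
>         no_path_retry    12   # ~12 checks (~60 s), then fail the path
>     }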
>
>
> Best Regards,
> Luis Freitas
>
>
>
> --- On Fri, 8/7/09, Brett Worth <brett at worth.id.au> wrote:
>
> > From: Brett Worth <brett at worth.id.au>
> > Subject: [Ocfs2-users] Heartbeat Timeout Threshold
> > To: ocfs2-users at oss.oracle.com
> > Date: Friday, August 7, 2009, 9:29 PM
> > I've been using OCFS2 on a 3-way
> > CentOS 5.2 Xen cluster for a while now, using it to share
> > the VM disk images. This way I can have
> > live and transparent VM migration.
> >
> > I'd been having intermittent (every 2-3 weeks) incidents
> > where a server would self-fence.
> > After configuring netconsole I managed to see that the
> > fencing was due to a heartbeat
> > threshold timeout, so I have now increased all three servers
> > to a threshold of 61 (i.e.
> > 2 minutes), up from the default of 31 (i.e. 1 minute). So far
> > there have been no panics.
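> >
> > For reference, the change itself is just the standard o2cb setting;
> > the effective timeout works out to (threshold - 1) * 2 seconds, so
> > 61 gives 120 seconds:
> >
> >     # /etc/sysconfig/o2cb (fragment)
> >     # (61 - 1) * 2 = 120 seconds before a node is declared dead
> >     O2CB_HEARTBEAT_THRESHOLD=61
> >
> >     # then restart the cluster stack on each node
> >     service o2cb restart
> >
> > And for anyone wanting to capture the panic messages the same way,
> > netconsole takes its target on the module command line; the
> > interface, port and receiving host below are made up:
> >
> >     modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/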
> >
> > I do have a couple of questions though:
> >
> > 1. To get this timeout applied I had to have a complete
> > cluster outage so that I could
> > make all the changes simultaneously. Making the
> > change on a single node prevented it from
> > joining the cluster. Do all parameters really need to
> > match before a node can join? The
> > timeout threshold seems like one that could differ from
> > node to node.
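> >
> > If it helps, the value the running cluster is actually using can be
> > read back through configfs. Assuming the default cluster name
> > "ocfs2" (the path may differ between versions):
> >
> >     # should print 61 on every node once the change has taken effect
> >     cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold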
> >
> > 2. Even though this appears to have fixed the problem, 2
> > minutes is a long time to wait
> > for a heartbeat. Even one minute seems like a very
> > long time. I assume that missing a
> > heartbeat would be a symptom of a very busy filesystem, but
> > for a packet to take over a
> > minute to get over the wire is odd. Or are the
> > heartbeats actually being lost
> > for an extended period? Is this a network
> > problem? All my nodes exchange heartbeats on
> > a dedicated VLAN.
> >
> > Regards
> > Brett
> > PS: If anyone is planning to do Xen like this, my main
> > piece of advice is that you must
> > put a ceiling on how much RAM the Dom0 domain can
> > use. If you don't, it will expand to use
> > all non-VM memory for buffer cache, so that when you try to
> > migrate a guest to that host there is
> > no RAM left.
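> >
> > A minimal sketch of one way to do that on a CentOS 5 Xen host:
> > dom0_mem is the standard hypervisor boot option, the 1024M figure is
> > only an example, and enable-dom0-ballooning exists only on newer
> > xend versions.
> >
> >     # /boot/grub/grub.conf: cap Dom0's memory at boot
> >     kernel /xen.gz dom0_mem=1024M
> >
> >     # /etc/xen/xend-config.sxp: keep Dom0 from ballooning back up
> >     (dom0-min-mem 1024)
> >     (enable-dom0-ballooning no)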
> >