[Ocfs2-users] Heartbeat Timeout Threshold

Fri Aug 7 17:29:49 PDT 2009

I've been using OCFS2 on a three-node CentOS 5.2 Xen cluster for a while now to share
the VM disk images.  This gives me live and transparent VM migration.

I'd been having intermittent incidents (every 2-3 weeks) where a server would self-fence.
After configuring netconsole I could see that the fencing was due to a heartbeat threshold
timeout, so I have now raised the threshold on all three servers from the default of 31
(1 minute) to 61 (2 minutes).  So far there have been no panics.
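
For anyone checking the arithmetic, here is a minimal sketch of how I understand the
threshold maps to a timeout.  It assumes one heartbeat every 2 seconds and the formula
timeout = (threshold - 1) * 2 seconds, which is how I read the OCFS2 FAQ; the function
names are just made up for illustration.

```python
# Assumption: a node heartbeats once every 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_to_timeout(threshold: int) -> int:
    """Seconds a node may go without heartbeating before it is fenced."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_to_threshold(timeout_secs: int) -> int:
    """Inverse mapping: threshold value needed for a desired timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


if __name__ == "__main__":
    for threshold in (31, 61):
        print(threshold, "->", threshold_to_timeout(threshold), "seconds")
```

Under those assumptions 31 works out to 60 seconds and 61 to 120 seconds, which matches the
1 and 2 minute figures above.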

I do have a couple of questions though:

1. To get this timeout applied I had to have a complete cluster outage so that I could
make all the changes simultaneously.  Making the change on a single node prevented it from
joining in the fun.  Do all parameters really need to match before a node can join?  The
timeout threshold seems to be one that could differ from node to node.

2. Even though this appears to have fixed the problem, 2 minutes is a long time to wait
for a heartbeat.  Even one minute seems like a very long time.  I assume that missing a
heartbeat would be a symptom of a very busy filesystem, but it seems odd for a packet to
take over a minute to get over the wire.  Or is it that the heartbeats are actually being lost
for an extended period?  Is this a network problem?  All my nodes communicate heartbeat on
a dedicated VLAN.

Regards
Brett
PS:  If anyone is planning to run Xen like this, my main piece of advice is that you must
put a ceiling on how much RAM the Dom0 domain can use.  If you don't, it will expand to use
all non-VM memory for buffer cache, so that when you try to migrate a VM to it there is
no RAM left.