[Ocfs2-users] update on o2net_idle_timer

Thu Jan 4 03:55:37 PST 2007

Hello,

   I've made some progress with the o2net_idle_timer issue. Various
people occasionally report instability and faults where the
following message is generated:

(From Andrew Brunton)
Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to
node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been idle
for 10 seconds, shutting it down.

(From Peter Santos)
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at
192.168.134.141:7777 has been idle for 10 seconds, shutting it down.

And from me:
Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
172.16.6.10:7777 has been idle for 10 seconds, shutting it down.

I've tried unsuccessfully to replicate the issue on my testbed
environment. The problem stems from the o2net layer function
'o2net_idle_timer' firing when no valid packet has been received for
O2NET_IDLE_TIMEOUT_SECS, which is defined to be 10 seconds in
ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h. This then causes the rest
of the code to fall over in a heap once the underlying socket goes.
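
A minimal sketch of that idle-timer behaviour, in Python rather than the
kernel's C. The helper name and the sample packet times are made up for
illustration; only the 10 second value comes from tcp_internal.h, and
this is not how o2net itself is written.

```python
O2NET_IDLE_TIMEOUT_SECS = 10  # value quoted from tcp_internal.h in ocfs2-1.2.3


def first_idle_shutdown(arrival_times, timeout=O2NET_IDLE_TIMEOUT_SECS, start=0):
    """Return the time at which an idle timer would fire, or None.

    arrival_times: sorted timestamps (seconds) of valid packets.
    The timer is re-armed on every packet and fires if `timeout`
    seconds pass with nothing received.
    Hypothetical helper for illustration, not part of o2net.
    """
    deadline = start + timeout
    for t in arrival_times:
        if t >= deadline:
            return deadline  # timer expired before this packet arrived
        deadline = t + timeout  # packet arrived in time: re-arm the timer
    return None  # no long enough gap among the observed packets


# Made-up traffic: a packet every 2 seconds, then a 12 second stall
# (for example a driver hogging interrupts).
healthy = [2, 4, 6, 8]
stalled = [2, 4, 6, 18, 20]

print(first_idle_shutdown(healthy))  # None
print(first_idle_shutdown(stalled))  # 16
```

In the stalled case the last packet arrives at 6 seconds, so the timer
fires at 16 seconds, two seconds before traffic would have resumed.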

It turns out that it's very likely not a bug in ocfs2.

This code is doing what it's supposed to do. Others will argue (and
have argued) that the network timeout is too low - see any and all posts
by Alexei to this list. Leaving that aside, along with the idea that the
network layer should make an attempt at reconnecting before killing the
entire machine, I'll focus on the causes we've found here of this
problem which are not spanning tree related.

One common thread is that people finding this are on EM64T or Opteron
based systems. There are various bugs reported against Red Hat Linux
(and probably SuSE as well) for the kernels before RHAS 4.4.

e.g. page 16 of this document - "lost ticks" Message Under Stress With
Non Uniform Memory Access Enabled on AMD Processor-Based Systems
http://support.dell.com/support/edocs/software/osrhel4/en/INT/HJ834A00.pdf

Or Oracle bug 4593892, referenced in:
http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dell6850-rhel4-cx500-1_1.html

We were also seeing messages of the form:

Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.
Dec 18 10:35:44 gs2dwdb02 kernel: Your time source seems to be instable
or some driver is hogging interupts
(sic)

Our problem seems to have been at least partially down to dodgy AMI
megaraid firmware for the system disks. We were getting messages from
the megaraid driver module on the console, which correlated with dropped
packets as logged by Oracle RAC's cssd.log. 

So given the above NUMA and driver/hardware errors, it's likely that
ocfs2 was going for periods as long as 10 seconds without receiving a
packet, and failing accordingly.

Ocfs2 was hit the worst, as it has the finest trigger on lost packets.
The heartbeat failure times for RAC are over 60 seconds. The o2cb
heartbeat threshold is set to 61 for us, which is about 120 seconds
IIRC, and that is enough to cover SAN interruptions and multipathing
failovers.
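
For the arithmetic behind that figure, here is a small sketch assuming
the relation given in the OCFS2 FAQ, timeout = (threshold - 1) * 2
seconds. The function names are made up for illustration; check the
formula against the documentation for your own release.

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumed from the OCFS2 FAQ formula


def threshold_to_seconds(threshold):
    """Disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD.

    Assumes timeout = (threshold - 1) * 2 seconds.
    """
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def seconds_to_threshold(seconds):
    """Inverse of the above: threshold needed for a given timeout."""
    return seconds // HEARTBEAT_INTERVAL_SECS + 1


print(threshold_to_seconds(61))   # 120
print(seconds_to_threshold(120))  # 61
```

Compare that 120 seconds with the fixed 10 seconds of the o2net idle
timer.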

We're planning an upgrade to RHAS 4.4, which apparently has fixed
several of these bugs, and would recommend that others with this problem
check carefully for signs of driver misbehaviour, particularly lost
ticks messages. If you're running a large AMD box with more than a
couple of sockets, then turning NUMA off seems to be a way of making
things more stable according to some PDFs.
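
One way to do that check is to scan the kernel log for both kinds of
message. This is a rough sketch only: the regular expressions are
written against the sample lines quoted earlier in this mail and may
need adjusting for other kernel versions, and the function name is made
up.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this mail.
LOST_TICKS = re.compile(r"warning: many lost ticks")
O2NET_IDLE = re.compile(
    r"connection to node (?P<node>\S+) \(num \d+\) at (?P<addr>\S+) "
    r"has been idle for (?P<secs>\d+) seconds"
)


def scan(lines):
    """Count lost-ticks warnings and o2net idle shutdowns in log lines.

    Returns (counts, idle_peers), where idle_peers is a list of
    (node, address) pairs taken from the o2net messages.
    """
    counts = Counter()
    idle_peers = []
    for line in lines:
        if LOST_TICKS.search(line):
            counts["lost_ticks"] += 1
        match = O2NET_IDLE.search(line)
        if match:
            counts["o2net_idle"] += 1
            idle_peers.append((match.group("node"), match.group("addr")))
    return counts, idle_peers


sample = [
    "Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.",
    "Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) "
    "at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down.",
    "an unrelated line, made up for this example",
]
counts, idle_peers = scan(sample)
print(counts["lost_ticks"], counts["o2net_idle"], idle_peers)
```

Since scan() accepts any iterable of lines, an open file handle on the
system log (often /var/log/messages) can be passed in directly.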

Sunil, I think that 10 seconds is too low for this timeout. Please
consider making this tunable, in the way that O2CB_HEARTBEAT_THRESHOLD
is tunable in /etc/sysconfig/o2cb. It can kill the box, and it's a bit
counterintuitive to have the documented O2CB_HEARTBEAT_THRESHOLD
effectively ignored when it comes to the network heartbeat. Having this
in 1.2.4 would be ideal. Please.

This is the point where Alexei can jump in and tell us all that he told
us so. He has a point about network spanning tree convergence, even
though most sensible designs for heartbeat networks would never allow
that to happen. I hope I've made it clear that this is a somewhat
different problem. 

What we're planning to do next, once we've confirmed that our new disk
firmware has eliminated the problem, is to test with numa=off, and
eventually upgrade. We're also looking at trying to simulate a bad
driver blocking interrupts in the kernel for configurable periods to
confirm that this diagnosis is correct.

I hope this somewhat long-winded message is of use to people.

Andy

-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP 
Company No. 5140986 