[Ocfs2-users] update on o2net_idle_timer

Thu Jan 4 15:32:04 PST 2007

That and also we've seen similar issues with Broadcom TG3 drivers. We use
Intel E1000 mostly and thus did not experience the same issue.

As for the configurable net timeouts, the patch was added into mainline
on Dec 4th, so it will be available with ocfs2 1.4. We are still working
out whether we have the bandwidth to backport it to 1.2.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=fs/ocfs2/cluster/tcp.c;h=ae4ff4a6636b23759522994898a95c148a4401f1;hb=HEAD

commit 828ae6afbef03bfe107a4a8cc38798419d6a2765
Author: Andrew Beekhof <abeekhof at suse.de>
Date:   Mon Dec 4 14:04:55 2006 +0100

    [patch 3/3] OCFS2 Configurable timeouts - Protocol changes

    Modify the OCFS2 handshake to ensure essential timeouts are configured
    identically on all nodes.

    Only allow changes when there are no connected peers

    Improves the logic in o2net_advance_rx() which broke now that
    sizeof(struct o2net_handshake) is greater than sizeof(struct o2net_msg)

    Included is the field for userspace-heartbeat timeout to avoid the need
    for further protocol changes.

    Uses a global spinlock to ensure the decisions to update configfs entries
    are made on the correct value.  The region covered by the spinlock when
    incrementing the counter is much larger as this is the more critical case.

    Small cleanup contributed by Adrian Bunk <bunk at stusta.de>

    Signed-off-by: Andrew Beekhof <abeekhof at suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>

commit b5dd80304da482d77b2320e1a01a189e656b9770
Author: Jeff Mahoney <jeffm at suse.de>
Date:   Mon Dec 4 14:04:54 2006 +0100

    [patch 2/3] OCFS2 Configurable timeouts

    Allow configuration of OCFS2 timeouts from userspace via configfs

    Signed-off-by: Andrew Beekhof <abeekhof at suse.de>
    Signed-off-by: Mark Fasheh <mark.fasheh at oracle.com>
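
The two rules described in patch 3/3 (timeouts must match on every node,
and may only change while no peers are connected) can be modelled roughly
as below. This is a toy sketch in Python, not the kernel code: the class
names, field names and error handling are all hypothetical and chosen only
for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutConfig:
    # Hypothetical field names, used only for this illustration.
    idle_timeout_ms: int
    keepalive_delay_ms: int
    reconnect_delay_ms: int


def handshake_ok(local: TimeoutConfig, remote: TimeoutConfig) -> bool:
    """Accept a peer only if every timeout matches ours exactly."""
    return local == remote


class ClusterModel:
    """Toy model: the config may change only while no peers are connected."""

    def __init__(self, config: TimeoutConfig):
        self.config = config
        self.connected_peers = 0

    def connect(self, remote: TimeoutConfig) -> bool:
        # Refuse the connection when the peer's timeouts differ from ours.
        if not handshake_ok(self.config, remote):
            return False
        self.connected_peers += 1
        return True

    def set_config(self, new_config: TimeoutConfig) -> None:
        # Mirrors "Only allow changes when there are no connected peers".
        if self.connected_peers > 0:
            raise RuntimeError("cannot change timeouts with connected peers")
        self.config = new_config
```

In this model a node that raises its idle timeout must do so before any
peer connects, and a peer with a different value is refused at handshake.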

Andy Phillips wrote:
> Hello,
>
>    I've made some progress with the o2net_idle_timer issue. Various
> people seem to occasionally report instability and faults where the
> following message is generated;
>
> (From Andrew Brunton)
> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to
> node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been idle
> for 10 seconds, shutting it down.
>
> (From Peter Santos)
> Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at
> 192.168.134.141:7777 has been idle for 10 seconds, shutting it down.
>
> And from me;
> Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
>
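For anyone wanting to check their own logs for the three messages quoted
above, a minimal Python sketch follows. The regular expression is derived
only from those three samples, so other ocfs2 versions may word the
message differently; the function name and the returned keys are my own.

```python
import re

# Matches both variants quoted above: with and without the
# "(0,0):o2net_idle_timer:1310" prefix. \s+ tolerates re-wrapped lines.
IDLE_RE = re.compile(
    r"connection\s+to\s+node\s+(?P<node>\S+)\s+\(num\s+(?P<num>\d+)\)\s+at\s+"
    r"(?P<addr>[\d.]+):(?P<port>\d+)\s+has\s+been\s+idle\s+for\s+"
    r"(?P<secs>\d+)\s+seconds"
)


def parse_idle_timeout(line):
    """Return a dict describing the idle timeout, or None if no match."""
    m = IDLE_RE.search(line)
    if m is None:
        return None
    return {
        "node": m.group("node"),
        "num": int(m.group("num")),
        "addr": m.group("addr"),
        "port": int(m.group("port")),
        "idle_secs": int(m.group("secs")),
    }
```

Feeding each line of /var/log/messages through parse_idle_timeout() and
keeping the non-None results gives a list of affected peers and times.
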
> I've tried unsuccessfully to replicate the issue on my testbed
> environment. The problem stems from the o2net layer function
> 'o2net_idle_timer' firing after no valid packet has been received for
> O2NET_IDLE_TIMEOUT_SECS, which is defined to be 10 seconds in
> ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h. This then causes the rest
> of the code to fall over in a heap, once the underlying socket goes.
>
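The behaviour described above (every valid packet re-arms the timer; a
full timeout of silence shuts the connection down) can be sketched as a
toy model. This is Python for illustration, not the kernel implementation;
the class and method names are hypothetical, and time is passed in
explicitly so the logic is easy to follow.

```python
class IdleWatchdog:
    """Toy idle timer: a received packet pushes the deadline out."""

    def __init__(self, timeout_secs=10.0, now=0.0):
        self.timeout_secs = timeout_secs  # stands in for O2NET_IDLE_TIMEOUT_SECS
        self.last_rx = now
        self.connected = True

    def packet_received(self, now):
        # A valid packet re-arms the timer while the connection is up.
        if self.connected:
            self.last_rx = now

    def tick(self, now):
        # Called periodically; shuts the connection down once the peer
        # has been silent for the full timeout. Returns the link state.
        if self.connected and now - self.last_rx >= self.timeout_secs:
            self.connected = False
        return self.connected
```

The point of the model is that nothing distinguishes a dead peer from a
local machine that simply failed to process packets for 10 seconds.
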
> It turns out that it's very likely not a bug in ocfs2.
>
> This code is doing what it's supposed to do. Others will argue (and have
> argued) that the network timeout is too low - see any and all posts by
> Alexei to this list. Leaving that aside, along with the idea that the
> network layer should make an attempt at reconnecting before killing the
> entire machine, I'll focus on the causes we've found here of this
> problem which are not spanning tree related.
>
> One common thread is that people finding this are on EM64T or Opteron
> based systems. There are various bugs reported against RedHat Linux (and
> probably SuSE as well) for the kernels before RHAS 4.4. 
>
> e.g. page 16 of this document - "lost ticks" Message Under Stress With
> Non Uniform Memory Access Enabled on AMD Processor-Based Systems
> http://support.dell.com/support/edocs/software/osrhel4/en/INT/HJ834A00.pdf
>
> Or oracle bug 4593892 referenced in;
> http://www.oracle.com/technology/tech/linux/validated-configurations/html/vc_dell6850-rhel4-cx500-1_1.html
>
> We were also seeing messages of the form;
>
> Dec 18 10:35:44 gs2dwdb02 kernel: warning: many lost ticks.
> Dec 18 10:35:44 gs2dwdb02 kernel: Your time source seems to be instable
> or some driver is hogging interupts
> (sic)
>
> Our problem seems to have been at least partially down to dodgy AMI
> megaraid firmware for the system disks. We were getting messages from
> the megaraid driver module on the console, which correlated with dropped
> packets as logged by Oracle RAC's cssd.log. 
>
> So given the above NUMA and driver/hardware errors, it's likely that
> ocfs2 was going for periods as long as 10 seconds without receiving a
> packet, and failing accordingly.
>
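One way to test the correlation described above is to line up the
timestamps of suspect events (lost ticks warnings, megaraid messages)
against the idle timeouts. The Python sketch below assumes the timestamps
have already been converted to seconds; the function name and the default
10 second window are my own choices, not anything from ocfs2.

```python
from bisect import bisect_left, bisect_right


def correlate(suspect_times, timeout_times, window_secs=10.0):
    """Map each timeout to the suspect events in the window before it."""
    suspects = sorted(suspect_times)
    result = {}
    for t in timeout_times:
        # Suspect events with t - window_secs <= time <= t.
        lo = bisect_left(suspects, t - window_secs)
        hi = bisect_right(suspects, t)
        result[t] = suspects[lo:hi]
    return result
```

A timeout that maps to an empty list had no suspect event shortly before
it, which would argue against the driver explanation for that incident.
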
> Ocfs2 was hit the worst, as it has the most sensitive trigger on lost
> packets. The heartbeat failure times for RAC are over 60 seconds. Our
> o2cb heartbeat threshold is set to 61, which is about 120 seconds IIRC,
> and that is fine for riding out SAN/multipathing failover interruptions.
>
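The "61 is about 120 seconds" figure is consistent with the commonly
documented rule that the disk heartbeat timeout is (threshold - 1) * 2
seconds. A small Python sketch of that arithmetic, assuming a 2 second
heartbeat interval (the helper names are hypothetical):

```python
HEARTBEAT_INTERVAL_SECS = 2  # assumption: one disk heartbeat every 2 seconds


def heartbeat_timeout_secs(threshold):
    """Timeout implied by an O2CB_HEARTBEAT_THRESHOLD value."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(secs):
    """Smallest threshold whose timeout is at least `secs` (integer seconds)."""
    # -(-a // b) is integer ceiling division.
    return -(-secs // HEARTBEAT_INTERVAL_SECS) + 1
```

Under that assumption a threshold of 61 gives exactly 120 seconds, far
above the fixed 10 second network idle timeout discussed here.
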
> We're planning an upgrade to 4.4 which apparently has fixed several of
> these bugs, and would recommend that others with this problem check
> carefully for signs of driver misbehaviour, particularly lost ticks
> messages. If you're running a large AMD box with more than a couple of
> sockets, then turning NUMA off seems to be a way of making things more
> stable according to some PDFs.
>
> Sunil, I think that 10 seconds is too low for this timeout. Please
> consider making this tunable, in the way that O2CB_HEARTBEAT_THRESHOLD
> is tunable in /etc/sysconfig/o2cb. It can kill the box, and it's a bit
> counter-intuitive to have the documented o2cb_heartbeat_threshold
> effectively ignored when it comes to the network heartbeat. Having this
> in 1.2.4 would be ideal. Please.
>
> This is the point where Alexei can jump in and tell us all that he told
> us so. He has a point about network spanning tree convergence, even
> though most sensible designs for heartbeat networks would never allow
> that to happen. I hope I've made it clear that this is a somewhat
> different problem.
>
> What we're planning to do next, once we've confirmed that our new disk
> firmware has eliminated the problem, is to test with numa=off, and
> eventually upgrade. We're also looking at trying to simulate a bad
> driver blocking interrupts in the kernel for configurable periods to
> confirm that this diagnosis is correct.
>
> I hope this somewhat long-winded message is of use to people.
>
> Andy
>    
>