[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Thu Aug 3 10:20:24 PDT 2006

Hello,

  It's doubly odd, then. We'll need to schedule an upgrade to 1.2.3.
In the meantime, we've scheduled a cron job that touches a file
on each ocfs2 file system every 3 seconds. This should ensure a
constant flow of traffic, assuming metadata updates travel across
the interconnect.
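
  Standard cron only schedules down to one-minute granularity, so a
3-second touch has to be a loop that cron (or init) starts. A minimal
sketch of such a loop, in Python, is below. The function names and the
file paths in the usage note are hypothetical, not taken from our setup,
and whether a touch really produces interconnect traffic is the same
assumption as above.

```python
import os
import time


def touch(path):
    """Create the file if it is missing and bump its mtime, like touch(1)."""
    with open(path, "a"):
        os.utime(path, None)


def keep_busy(paths, interval=3.0, rounds=None):
    """Touch every path once per interval.

    rounds=None loops until the process is killed; a number stops after
    that many passes, which is handy for a dry run.
    """
    done = 0
    while rounds is None or done < rounds:
        for path in paths:
            touch(path)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return done
```

  Something like keep_busy(["/ocfs2/vol1/.keepalive"]) would then run
until stopped, with one hypothetical path per mounted ocfs2 volume.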

  I've noticed that one other person seems to have seen this problem -
http://oss.oracle.com/pipermail/ocfs2-users/2006-July/000612.html
but they were on an older version of the kernel and fs code. Any idea
as to what the underlying cause may be, if it's not a dropped packet?
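
  One way to narrow this down is to difference the timestamps in the
o2net_idle_timer line from the original report (quoted below). The
sketch here only does the arithmetic: it assumes that "now" is the
moment the idle timer fired and attaches no other meaning to the fields.
The dictionary keys such as func_start are labels made up for this
example.

```python
import re
from decimal import Decimal

# Timer debug text copied from the Aug 2 19:06:27 kernel log entry.
LOG = ("(tmr 1154545576.798263 now 1154545586.796978 dr 1154545576.798238 "
       "adv 1154545576.798291:1154545576.798293 func (06aac8a1:1) "
       "1154545566.800782:1154545566.800787)")

TS = r"(\d+\.\d+)"
PATTERN = re.compile(
    rf"tmr {TS} now {TS} dr {TS} adv {TS}:{TS} func \(\w+:\d+\) {TS}:{TS}"
)
NAMES = ("tmr", "now", "dr", "adv_start", "adv_stop", "func_start", "func_stop")


def parse(line):
    """Return the seven timestamps as exact Decimals keyed by label."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an o2net_idle_timer debug line")
    return dict(zip(NAMES, (Decimal(g) for g in match.groups())))


def seconds_before_now(stamps):
    """How long before 'now' each of the other timestamps was taken."""
    now = stamps["now"]
    return {name: now - value for name, value in stamps.items() if name != "now"}


stamps = parse(LOG)
elapsed = seconds_before_now(stamps)
for name, delta in elapsed.items():
    print(f"{name:10s} {delta} s before now")
```

  On these numbers the tmr, dr and adv stamps all fall just under 10
seconds before "now", and the func pair just under 20 seconds before it.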

  Would you also mind letting me know what those two one-line changes
were, just out of interest?

   Thanks for the quick response.

       Andy

On Thu, 2006-08-03 at 09:44 -0700, Sunil Mushran wrote:
> 1. o2net talks tcp. It should be able to handle this.
> 2. If the cluster is active and the nodes are communicating,
> the keepalive packet is rarely sent. It only sends the packet
> if it does not hear from the other node for 5 secs.
> 3. Try the same with 1.2.3. (We made 2 important 1 line fixes.)
> 4. If this does happen again, and you are interested, we
> could always give you a drop that dumps the stack of
> all the procs, to get a better feel for the situation.
> 
> Andy Phillips wrote:
> > Hello,
> >
> >    Apologies for following up on myself.
> >
> > in ocfs2/cluster/tcp_internal.h
> > #define O2NET_KEEPALIVE_DELAY_SECS      5
> > #define O2NET_IDLE_TIMEOUT_SECS         10
> >
> >
> >    Is this really sensible? Given a small variance in system
> > clocks, and assuming that o2net_sc_send_keep_req is the only thing
> > keeping the connection alive, the loss of a single keepalive packet
> > could cause a node to self-fence and reboot.
> >
> >    Would
> > #define O2NET_KEEPALIVE_DELAY_SECS      5
> > #define O2NET_IDLE_TIMEOUT_SECS         20
> >
> >    Cause any problems?
> >
> >    Andy
> >
> >
> >
> > On Thu, 2006-08-03 at 12:41 +0100, Andy Phillips wrote:
> >   
> >> Hello,
> >>
> >>    I've a two node 10gR2 rac cluster on a pair of sun opteron boxes.
> >> Redhat AS 4.3 2.6.9-34.0.1.ELsmp x86_64. ocfs 1.2.2. RAC is using 
> >> ASM to talk to the data files, but we have 3 ocfs2 filesystems up
> >> to share dba files, and the usual bits and bobs. 
> >>
> >>    Things were fine until, on a mostly idle system, this happened
> >> out of the blue:
> >>
> >> Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> >> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
> >> Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
> >> times that might help debug the situation: (tmr 1154545576.798263 now
> >> 1154545586.796978 dr 1154545576.798238 adv
> >> 1154545576.798291:1154545576.798293 func (06aac8a1:1)
> >> 1154545566.800782:1154545566.800787)
> >> Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
> >> (num 0) at 172.16.6.10:7777
> >> Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
> >> fencing this node because it is connected to
> >> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
> >> active node 0
> >> Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
> >> stopping heartbeat on all active regions.
> >>
> >>    And the node then halted. 
> >>
> >>    Barney is node 0. The systems were idle. We've hammered the ocfs2 
> >> file systems, and set o2cb_heartbeat_threshold to 61. All is good and
> >> stable under heavy i/o.
> >>    
> >>    The interconnect is a bonded interface, with two gig cards, each
> >> connected (with flow control on) to two separate FESX424 switches.
> >> The switches don't register any problems at this time, nor does Linux
> >> register any interface issues.
> >>
> >>    I'm looking at the source code at the moment, but nothing is leaping
> >> out at me. Any ideas? Do the timer debug lines above mean anything to
> >> anyone?
> >>
> >>   Thanks
> >>    Andy 
-- 
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com   

Direct Line: 0208 834 8436
