[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Andy Phillips Andrew.Phillips at betfair.com
Thu Aug 3 05:24:44 PDT 2006


Hello,

   Apologies for following up on myself.

   In ocfs2/cluster/tcp_internal.h:
#define O2NET_KEEPALIVE_DELAY_SECS      5
#define O2NET_IDLE_TIMEOUT_SECS         10


   Is this really sensible? Given a small variance in system clocks,
and assuming that o2net_sc_send_keep_req is the only thing keeping the
connection alive, the loss of a single keepalive packet could
potentially cause a node to self-fence and reboot.

   Would
#define O2NET_KEEPALIVE_DELAY_SECS      5
#define O2NET_IDLE_TIMEOUT_SECS         20

   cause any problems?
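
   To make the arithmetic behind the question concrete, here is a rough
sketch. It is not o2net code: both helper functions are hypothetical,
and the model assumes that keepalives are the only traffic on the
connection, that one is sent every O2NET_KEEPALIVE_DELAY_SECS after the
last message received, and that the idle timer only resets when a
message actually arrives. A keepalive due exactly at the timeout is
counted as too late, since a small clock variance could let the timer
fire first.

```c
#include <assert.h>

/* Values quoted above from ocfs2/cluster/tcp_internal.h. */
#define O2NET_KEEPALIVE_DELAY_SECS 5
#define O2NET_IDLE_TIMEOUT_SECS    10

/*
 * Hypothetical helper (not part of o2net): number of keepalives that
 * are due strictly before the idle timer expires, in the simplified
 * model where one is sent at delay, 2*delay, 3*delay, ... seconds
 * after the last message was received.
 */
int keepalives_before_timeout(int delay_secs, int timeout_secs)
{
    assert(delay_secs > 0 && timeout_secs > 0);
    return (timeout_secs - 1) / delay_secs;
}

/*
 * Hypothetical helper: how many of those keepalives can be lost while
 * still leaving one that arrives before the idle timer fires.
 */
int tolerable_keepalive_losses(int delay_secs, int timeout_secs)
{
    int due = keepalives_before_timeout(delay_secs, timeout_secs);
    return due > 0 ? due - 1 : 0;
}
```

   Under those assumptions, tolerable_keepalive_losses(5, 10) is 0 and
tolerable_keepalive_losses(5, 20) is 2, so the current values leave no
margin for a single lost packet, while a 20 second timeout would
tolerate two.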

   Andy



On Thu, 2006-08-03 at 12:41 +0100, Andy Phillips wrote:
> Hello,
> 
>    I've a two-node 10gR2 RAC cluster on a pair of Sun Opteron boxes.
> Red Hat AS 4.3, 2.6.9-34.0.1.ELsmp x86_64, ocfs2 1.2.2. RAC is using
> ASM to talk to the data files, but we have 3 ocfs2 filesystems up
> to share dba files, and the usual bits and bobs.
> 
>    Things were fine until, on a mostly idle system, this happened out
> of the blue:
> 
> Aug  2 19:06:27 fred kernel: o2net: connection to node barney (num 0) at
> 172.16.6.10:7777 has been idle for 10 seconds, shutting it down.
> Aug  2 19:06:27 fred kernel: (0,7):o2net_idle_timer:1309 here are some
> times that might help debug the situation: (tmr 1154545576.798263 now
> 1154545586.796978 dr 1154545576.798238 adv
> 1154545576.798291:1154545576.798293 func (06aac8a1:1)
> 1154545566.800782:1154545566.800787)
> Aug  2 19:06:27 fred kernel: o2net: no longer connected to node barney
> (num 0) at 172.16.6.10:7777
> Aug  2 19:08:33 fred kernel: (25,7):o2quo_make_decision:143 ERROR:
> fencing this node because it is connected to
> a half-quorum of 1 out of 2 nodes which doesn't include the lowest
> active node 0
> Aug  2 19:08:33 fred kernel: (25,7):o2hb_stop_all_regions:1908 ERROR:
> stopping heartbeat on all active regions.
> 
>    And the node then halted. 
> 
>    Barney is node 0. The systems were idle. We've hammered the ocfs2 
> file systems, and set o2cb_heartbeat_threshold to 61. All is good and
> stable under heavy i/o.
>    
>    The interconnect is a bonded interface, with two gig cards, each
> connected (with flow control on) to two separate FESX424 switches.
> The switches don't register any problems at this time, nor does linux
> register any interface issues.
> 
>    I'm looking at the source code at the moment, but nothing is leaping
> out at me. Any ideas? Do the timer debug lines above mean anything to
> anyone?
> 
>   Thanks
>    Andy 
> 
-- 
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com   

Direct Line: 0208 834 8436




