[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Tue Aug 8 03:29:20 PDT 2006

Did you have elevator=deadline on the kernel commandline?

I'm with you on the timeouts. It's absolutely crazy to have a
10 second timeout, given spanning tree reconvergence times.

I'd love to know the reason those values were chosen. But then again,
it seems that in general the ocfs2 timers are way too short. Otherwise,
why would it be "standard practice" to always raise the o2cb_heartbeat
threshold?

I can understand having short timers to flush out problems in a lab
testing environment. The production branch of the code should have
somewhat more realistic values.
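
As a rough illustration of the spanning tree point, here is a minimal
sketch that compares the 10 second idle timeout from the subject line with
the reconvergence time of classic (non-rapid) spanning tree. It assumes the
IEEE 802.1D default timers (max age 20 s, forward delay 15 s); real switches
may be tuned differently. The function and constant names are made up for
this example and are not part of OCFS2 or o2net.

```python
# Assumed IEEE 802.1D default timers, in seconds (switches may be tuned differently).
MAX_AGE = 20
FORWARD_DELAY = 15

# Idle timeout taken from the o2net message in the subject line, in seconds.
O2NET_IDLE_TIMEOUT = 10


def stp_reconvergence(max_age=MAX_AGE, forward_delay=FORWARD_DELAY,
                      direct_failure=True):
    """Return the seconds before a previously blocked port forwards again.

    The port spends forward_delay in listening and again in learning.
    If the failed link is not directly attached, the bridge first waits
    max_age for its stored root information to expire.
    """
    listening_and_learning = 2 * forward_delay
    if direct_failure:
        return listening_and_learning
    return max_age + listening_and_learning


def survives(timeout, outage):
    """True if a connection with this idle timeout outlasts the outage."""
    return timeout > outage


if __name__ == "__main__":
    for direct in (True, False):
        outage = stp_reconvergence(direct_failure=direct)
        kind = "direct" if direct else "indirect"
        verdict = "survives" if survives(O2NET_IDLE_TIMEOUT, outage) else "is fenced"
        print(f"{kind} link failure: {outage}s outage, "
              f"{O2NET_IDLE_TIMEOUT}s idle timeout {verdict}")
```

Under these assumed defaults the outage is 30 to 50 seconds, so any idle
timeout below that range cannot ride out a topology change.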

Andy

On Mon, 2006-08-07 at 12:01 -0700, Alexei_Roudnev wrote:
> I do not have many details for now.
> 
> The story was simple.
> 
> (1) We are running an Oracle 10.2.0.1 RAC cluster in the lab.
> The cluster includes:
> - 2 x DELL 2850 PC servers;
> - SLES9 SP3 kernel 255 (or 257, must check);
> - iSCSI SAN network;
> - 3 interfaces per server - 1 private, 1 public and 1 SAN;
> - we use raw devices and ASM for the Oracle database and tried to use
> OCFSv2 for backups and archive logs.
> 
> After some time, we got a pretty stable configuration (iSCSI, OCFS and
> RAC alike) which was able to run for weeks without problems and survived
> heavy TPC-C testing (it is a lab, so the system is used for a few
> development databases with a low performance load).
> 
> Then we decided to upgrade to the 10.2.0.2 patchset.
> 
> After the upgrade, OCFSv2 began to die every few days with the same message
> as you had. I increased timeouts, added a third node, etc. - nothing helped.
> Then we decided to investigate how the system behaves under load, and found
> that, since the 10.2.0.2 upgrade, the system froze for 30 - 60 seconds every
> few hours. The reason for the freezes remained unknown, but it looked
> related to HugeTLB or some other memory resource lockout.
> 
> We downgraded back to 10.2.0.1 and have not had any problems with OCFSv2
> since ;-). It shows me now, for example:
> 
>     alex@testrac11:~> uptime
>      11:58am  up 68 days 17:12,  2 users,  load average: 0.08, 0.12, 0.12
> 
> The good point was that it was not an OCFSv2 problem but a system problem.
> The bad point is that OCFSv2 was in a passive state (at that point it was
> used for backups only), so I would have liked to see it disconnected,
> frozen, etc., but not causing the whole OS to reboot.
> 
> 
> I plan to re-investigate this case (why the 10.2.0.2 patchset killed the
> whole cluster) but have not done it yet.
> 
> On the other hand, if I analyze the network (all over) I can see that the
> 10 second requirement CANNOT BE SATISFIED at all. The lowest timer which I
> can guarantee to work in case of any system reconvergence (such as switch /
> trunk failover, router reboots, failer failover and so on) is 40 - 60
> seconds. For example, Ethernet spanning tree has a MINIMUM (!)
> reconvergence time of 30 or 40 seconds, unless you use some fast method
> (which is not supported on many switches), and even this will not bring the
> time below 20 - 30 seconds.
> 
> If OCFSv2 requires a FAST idle timeout, then it should support multiple
> interfaces (so that the cluster keeps working if any of the public, private
> or interconnect links is UP). If it cannot support multiple interfaces,
> then it must survive network delays of up to 1 minute (as all the other
> clusters around do, even those which support multi-interface connections,
> such as Linux cluster or Veritas cluster).
> 
> 
> ----- Original Message ----- 
> From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> Cc: "Andy Phillips" <Andrew.Phillips at betfair.com>; "ocfs2-users"
> <ocfs2-users at oss.oracle.com>
> Sent: Monday, August 07, 2006 11:01 AM
> Subject: Re: [Ocfs2-users] o2net: connect to node has been idle for 10 secs
> 
> 
> > Alexei_Roudnev wrote:
> > > In my case, after spending a few days, I found that my HugeTLB setting
> > > (in Oracle) caused a long kernel loop, which forced OCFSv2 to reboot
> > > because of the lost connection.
> > >
> > I am keen to hear more about this. Please could you elaborate.
> >
> >
> 
> 
-- 
Andy Phillips, FRAS
Systems Architect, Information Systems.
Betfair.com   

Direct Line: 0208 834 8436