[Ocfs2-users] o2net: connect to node has been idle for 10 secs

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Aug 7 12:01:22 PDT 2006


I do not have many details for now.

The story is simple.

(1) We are running an Oracle 10.2.0.1 RAC cluster in the lab.
The cluster includes:
- 2 x DELL 2850 PC servers;
- SLES9 SP3 kernel 255 (or 257, must check);
- iSCSI SAN network;
- 3 interfaces per server - 1 private, 1 public and 1 SAN;
- we use raw devices and ASM for the Oracle database and tried to use OCFSv2
for backup and archive logs.

After some time, we got a pretty stable configuration (iSCSI, OCFS and RAC)
which was able to run for weeks without problems and survived heavy TPC-C
testing (it is a lab, so the system is used for a few development databases
with low performance load).

Then we decided to upgrade to the 10.2.0.2 patchset.

After the upgrade, OCFSv2 began to die every few days with the same message
as you had. I increased timeouts, added a third node, etc. - nothing helped.
Then we decided to investigate how the system behaves under load, and found
that, since the 10.2.0.2 upgrade, the system froze for 30 - 60 seconds every
few hours. The reason for the freeze remained unknown, but it looked related
to HugeTLB or some other memory resource lockout.
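
A minimal sketch of how such freezes could be caught and timestamped, assuming
Python 3 is available on the node. The names find_stalls and watch and the
5-second threshold are illustrative assumptions, not part of any OCFS2 or
Oracle tooling. The idea is to sleep for a fixed interval and report whenever
the wake-up arrives much later than requested:

```python
import time

def find_stalls(timestamps, interval=1.0, threshold=5.0):
    """Return (start, extra_delay) for every gap between consecutive
    timestamps that overran the expected interval by more than threshold."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls

def watch(interval=1.0, threshold=5.0):
    """Loop forever; print a line whenever a sleep overruns (hypothetical helper)."""
    prev = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        for _, extra in find_stalls([prev, now], interval, threshold):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} stalled for about {extra:.1f} s", flush=True)
        prev = now
```

Running watch() on each node and comparing its output with the o2net messages
would show whether the idle timeouts coincide with whole-system freezes.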

We downgraded back to 10.2.0.1 and have not had any problems with OCFSv2
since then ;-). It shows me now, for example:

    alex@testrac11:~> uptime
     11:58am  up 68 days 17:12,  2 users,  load average: 0.08, 0.12, 0.12

The good point is that it was not an OCFSv2 problem but a system problem. The
bad point is that OCFSv2 was in a passive state (it was used only for backups
at that point), so I'd like to see it disconnected, frozen, etc., but not
causing the whole OS to reboot.


I plan to re-investigate this case (why the 10.2.0.2 patchset killed the whole
cluster) but have not done it yet.

On the other hand, if I analyze the network (end to end), I can see that the
10-second requirement CAN NOT BE SATISFIED at all. The lowest timer which I
can guarantee to work in case of any system reconvergence (such as switch /
trunk failover, router reboots, failer failover and so on) is 40 - 60 seconds.
For example, Ethernet spanning tree has a MINIMUM (!) reconvergence time of 30
or 40 seconds, unless you use some fast method (which is not supported on
many switches), and even this will not bring the time below 20 - 30 seconds.
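
The spanning tree figure can be checked with a little arithmetic. The sketch
below assumes the default IEEE 802.1D timers (max age 20 s, forward delay
15 s); the actual values on a given switch may differ, and the 10-second
constant is simply the idle timeout from the message in the subject line:

```python
# Default IEEE 802.1D spanning tree timers in seconds (assumed, not measured).
MAX_AGE = 20
FORWARD_DELAY = 15

# Idle timeout taken from the o2net message in the subject line.
O2NET_IDLE_TIMEOUT = 10

# A port passes through the listening and learning states,
# each lasting FORWARD_DELAY, before it forwards traffic again.
direct_failure = 2 * FORWARD_DELAY

# After an indirect failure the stale information must first age out.
indirect_failure = MAX_AGE + 2 * FORWARD_DELAY

for name, outage in [("direct link failure", direct_failure),
                     ("indirect failure", indirect_failure)]:
    verdict = "survives" if outage < O2NET_IDLE_TIMEOUT else "exceeds idle timeout"
    print(f"{name}: {outage} s -> {verdict}")
```

With these defaults both cases (30 s and 50 s) are several times longer than a
10-second idle timeout, so a single topology change is enough to trigger it.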

If OCFSv2 requires a FAST idle timeout, then it should support multiple
interfaces (so that the cluster keeps working if any of the public, private or
interconnect links is UP). If it cannot support multiple interfaces, then it
must survive network delays of up to 1 minute (as we have with all other
clusters around, even those which support multi-interface connections, such
as Linux cluster or Veritas cluster).
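
The policy proposed above can be sketched as follows. This is a hypothetical
illustration only, not how o2net actually decides anything; the function name
should_fence and the link names are assumptions made for the example. A node
is declared dead only when every configured link has been silent for longer
than the timeout:

```python
def should_fence(idle_secs_per_link, idle_timeout):
    """Return True only if ALL links to the peer have been idle
    for at least idle_timeout seconds (hypothetical policy)."""
    return all(idle >= idle_timeout for idle in idle_secs_per_link.values())

# Multi-interface: the private link is down for 45 s, the public link is fine,
# so a fast 10 s timeout is safe and the node stays in the cluster.
multi = {"private": 45, "public": 1}
print(should_fence(multi, idle_timeout=10))

# Single interface: the same 45 s outage kills the node with a 10 s timeout,
# but is survived with a 60 s timeout.
single = {"private": 45}
print(should_fence(single, idle_timeout=10))
print(should_fence(single, idle_timeout=60))
```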


----- Original Message ----- 
From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: "Andy Phillips" <Andrew.Phillips at betfair.com>; "ocfs2-users"
<ocfs2-users at oss.oracle.com>
Sent: Monday, August 07, 2006 11:01 AM
Subject: Re: [Ocfs2-users] o2net: connect to node has been idle for 10 secs


> Alexei_Roudnev wrote:
> > In my case, after spending few days, I find that my HugeTLB setting (in
> > Oracle) caused long kernel loop and it forced OCFSv2 to reboot because
> > of losing connection.
> >
> I am keen to hear more about this. Please could you elaborate.
>
>



