<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML DIR=ltr><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"></HEAD><BODY><DIV><FONT face='Arial' color=#000000 size=2>Hello again. Our production cluster has begun
experiencing some vicious slowdowns that may (or may not) be related to the
filesystems. When the problem occurs, the load average on the servers
jumps up to 30 or higher. Usually one node will climb while the other
drops, then they will switch places a few minutes later. At one point, we
had one node's load average up over 300. Our site activity has been on the
rise, and the problems usually occur during peak mid-day hours.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Under normal conditions, "top" shows the CPUs
spending most of their time waiting on the very busy fibre channel. During
the slowdowns, the processors are mostly busy with system calls. Traffic
over both the fibre channel and gigabit interconnect seems to drop off
considerably at the same time.</FONT></DIV>
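<DIV><FONT face=Arial size=2> </FONT></DIV>
<DIV><FONT face=Arial size=2>In case anyone wants to compare numbers, here is a rough sketch of how the I/O-wait versus system-time split could be logged alongside the load average. It assumes a Linux kernel whose /proc/stat aggregate "cpu" line includes an iowait column (older kernels only report the first four fields, which the code tolerates). The function names are made up for illustration and this is not what "top" itself runs.</FONT></DIV>
<PRE>
```python
import time

# Field order of the aggregate "cpu" line in Linux /proc/stat (see proc(5)).
# Assumption: kernels without iowait accounting simply report fewer fields.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(stat_text):
    """Return cumulative tick counts from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:])))
    raise ValueError("no aggregate cpu line found")


def cpu_shares(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {
        name: after[name] - before[name]
        for name in FIELDS
        if name in before and name in after
    }
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


def sample(interval=5.0):
    """Take two /proc/stat readings and return (load average line, shares)."""
    with open("/proc/stat") as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_cpu_line(f.read())
    with open("/proc/loadavg") as f:
        load = f.read().strip()
    return load, cpu_shares(before, after)
```
</PRE>
<DIV><FONT face=Arial size=2>Calling sample() in a loop on both nodes and logging the result with a timestamp should show whether the "system" share rises at the same moment the "iowait" share falls, which is the pattern described above.</FONT></DIV>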
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I've got a TAR open, but the support people are
still in the very preliminary stages (for example, we just installed a switch
between the two nodes because a crossover cable is apparently not
supported). There doesn't seem to be any good indication of what's going
on. We suspected the interconnect, but the private interfaces seem to
behave normally while Oracle is grinding to a halt.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>After 10-30 minutes, the problem will fade away on
its own. I'm inclined to blame something in the RAC inter-node
communications code, but I was wondering whether this situation resembles any
OCFS problem anyone here has seen. These servers are still on OCFS 1.0.9-12, with
plans to go to 1.0.12 soon after this issue is resolved.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV>Derek</DIV></BODY></HTML>