[Ocfs2-users] RAC and OCFS2 timeout interaction issues

Mon Nov 22 16:11:19 PST 2010

Here are some hints:

-       Starting with version 1.2.5, the OCFS2 network timeout is
configurable.

-       If using network bonding, you should set the network idle timeout to
at least 30 seconds.

-       Set O2CB_IDLE_TIMEOUT_MS to at least 30000. If the problem persists,
set it to 60000.
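
As a minimal sketch of the last hint: the fragment below assumes the o2cb
settings live in /etc/sysconfig/o2cb, the usual location on RHEL-style
systems (the path is an assumption, not stated in this thread). The network
timeouts generally need to match on every node, so the same value would go
on all three servers before the cluster stack is restarted.

```shell
# /etc/sysconfig/o2cb (assumed location; adjust for your distribution)

# Network idle timeout in milliseconds. 30000 is the suggested minimum
# when the interconnect runs over bonded NICs.
O2CB_IDLE_TIMEOUT_MS=30000

# If nodes still fence on loss of interconnect, try the larger value:
# O2CB_IDLE_TIMEOUT_MS=60000
```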

From: ocfs2-users-bounces at oss.oracle.com
[mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Kolstee, Ronald A
(Tony)
Sent: Monday, November 22, 2010 11:57 PM
To: ocfs2-users at oss.oracle.com
Subject: [Ocfs2-users] RAC and OCFS2 timeout interaction issues

We have a RAC cluster as follows:

*	3 nodes with RHEL5, Oracle 11.1.0.7, and OCFS2 1.4.2 
*	Voting and OCR are on OCFS2, all other shared storage is on ASM 
*	Storage hardware is provided by a fibre-channel SAN fabric 
*	Interconnect uses two bonded NICs per server, connected to different
blades on a single switch

Previously we had an issue where all three nodes would reboot if one node
had problems. This could be caused by one node crashing completely (OS
crash), or losing interconnect. For testing purposes, we've been simulating
OS crashes by suddenly resetting the server without a graceful shutdown, and
simulating loss of interconnect by using ifconfig down on both interfaces in
the bond.

From what we've seen, it appears that the issue is an interaction between
the O2CB and CRS timeout values - namely that CRS is self-fencing on the
surviving nodes before O2CB has a chance to time out and recover the dead
node.

By adjusting O2CB's timeout (O2CB_HEARTBEAT_THRESHOLD x 2 seconds) lower
than the CRS disktimeout value, we were able to configure the cluster so
that the "OS Crash" scenario is properly handled. The other two nodes will
survive when we completely reset the third. 
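
The comparison described above can be sketched as a small shell check. The
OCFS2 FAQ gives the disk heartbeat timeout as (O2CB_HEARTBEAT_THRESHOLD - 1)
x 2 seconds, which is close to the "x 2" approximation used here. The
function names are hypothetical, and the sample values (a threshold of 31
and a CRS disktimeout of 200 seconds) are illustrative assumptions rather
than this cluster's actual settings.

```shell
#!/bin/sh

# Hypothetical helper: disk heartbeat timeout in seconds for a given
# O2CB_HEARTBEAT_THRESHOLD, using the (threshold - 1) * 2 relationship.
o2cb_disk_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

# Hypothetical check: succeeds when O2CB would declare a node dead
# before the CRS disktimeout (in seconds) expires.
o2cb_below_crs() {
    threshold=$1
    crs_disktimeout=$2
    [ "$(o2cb_disk_timeout_secs "$threshold")" -lt "$crs_disktimeout" ]
}

# Example values only (assumptions, not read from a live cluster).
if o2cb_below_crs 31 200; then
    echo "O2CB times out before CRS"
else
    echo "CRS may self-fence before O2CB recovers the dead node"
fi
```

The actual CRS value would need to be read from the cluster itself; this
sketch only covers the arithmetic for the disk heartbeat side and says
nothing about the network idle timeout.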

However, we haven't solved the loss of interconnect scenario and believe the
problem to be similar. I'd rather not get bogged down in the specifics,
logfiles, etc. at this point in time, as we have to apply these concepts to
other environments as well, and there is still some tweaking of these
configurations to be done. 

Can anyone please provide a generic conceptual overview of how the CRS and
O2CB timeout values interact in this scenario?  Does
O2CB_HEARTBEAT_THRESHOLD come into play at all when dealing with loss of
interconnect, and how does it interact with the value of
O2CB_IDLE_TIMEOUT_MS? How does this correlate with the timeouts on the CRS
side of the equation?

I appreciate any help that anyone can offer.

Thanks in advance,

Tony Kolstee

Sr. Systems Engineer

Aetna

