[Ocfs2-users] RAC and OCFS2 timeout interaction issues

Kolstee, Ronald A (Tony) KolsteeR at aetna.com
Mon Nov 22 13:56:42 PST 2010


We have a RAC cluster as follows:

 *   3 nodes with RHEL5, Oracle 11.1.0.7, and OCFS2 1.4.2
 *   Voting and OCR are on OCFS2, all other shared storage is on ASM
 *   Storage hardware is provided by a fibre-channel SAN fabric
 *   Interconnect uses two bonded NICs per server, connected to different blades on a single switch

Previously we had an issue where all three nodes would reboot if one node had problems. This could be triggered by one node either crashing completely (OS crash) or losing its interconnect. For testing purposes, we've been simulating OS crashes by suddenly resetting the server without a graceful shutdown, and simulating loss of interconnect by running ifconfig down on both interfaces in the bond.

From what we've seen, it appears that the issue is an interaction between the O2CB and CRS timeout values - namely that CRS is self-fencing on the surviving nodes before O2CB has a chance to time out and recover the dead node.

By setting O2CB's disk heartbeat timeout ((O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds) lower than the CRS disktimeout value, we were able to configure the cluster so that the "OS Crash" scenario is properly handled. The other two nodes now survive when we completely reset the third.
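As a rough illustration of the comparison described above, here is a minimal sketch. It assumes the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds, as the OCFS2 documentation describes it. The function names and the sample values are hypothetical and are not taken from our cluster.

```python
# Sketch: check whether O2CB would declare a node dead before CRS disktimeout expires.
# Assumption: disk heartbeat timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds.

HEARTBEAT_INTERVAL_SECONDS = 2  # O2CB writes a disk heartbeat every two seconds


def o2cb_disk_timeout_seconds(heartbeat_threshold: int) -> int:
    """Return the O2CB disk heartbeat timeout in seconds for a given threshold."""
    return (heartbeat_threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def o2cb_times_out_first(heartbeat_threshold: int, crs_disktimeout_seconds: int) -> bool:
    """True if the O2CB timeout is strictly lower than the CRS disktimeout."""
    return o2cb_disk_timeout_seconds(heartbeat_threshold) < crs_disktimeout_seconds


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    threshold = 31
    crs_disktimeout = 200
    print("O2CB disk timeout (s):", o2cb_disk_timeout_seconds(threshold))
    print("O2CB times out before CRS:", o2cb_times_out_first(threshold, crs_disktimeout))
```

The strict comparison is deliberate: if both timeouts are equal, there is no guarantee that O2CB finishes recovery before CRS self-fences.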

However, we haven't solved the loss-of-interconnect scenario, and we believe the problem is similar. I'd rather not get bogged down in the specifics, logfiles, etc. at this point, as we have to apply these concepts to other environments as well, and there is still some tweaking of these configurations to be done.

Can anyone please provide a generic conceptual overview of how the CRS and O2CB timeout values interact in this scenario? Does O2CB_HEARTBEAT_THRESHOLD come into play at all when dealing with loss of interconnect, and how does it interact with the value of O2CB_IDLE_TIMEOUT_MS? How do these correlate with the timeouts on the CRS side of the equation?
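To make the two O2CB settings easier to compare against the CRS values, the sketch below reads them from o2cb configuration text and converts both to seconds. It does not answer the question of how they interact. The file location (/etc/sysconfig/o2cb is typical on RHEL), the sample values, and the function names are assumptions for illustration.

```python
# Sketch: extract the two O2CB timeouts named above and express them in seconds.
# The sample config values are illustrative, not taken from the cluster in this post.

SAMPLE_O2CB_CONFIG = """\
# Typically found at /etc/sysconfig/o2cb on RHEL (assumed location)
O2CB_ENABLED=true
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
"""


def parse_o2cb_config(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def o2cb_timeouts_seconds(settings: dict) -> dict:
    """Convert the disk heartbeat and network idle settings to seconds."""
    return {
        # Assumption: disk timeout = (threshold - 1) * 2 seconds.
        "disk_heartbeat": (int(settings["O2CB_HEARTBEAT_THRESHOLD"]) - 1) * 2,
        "network_idle": int(settings["O2CB_IDLE_TIMEOUT_MS"]) / 1000,
    }


if __name__ == "__main__":
    print(o2cb_timeouts_seconds(parse_o2cb_config(SAMPLE_O2CB_CONFIG)))
```

With both values in seconds, they can be lined up directly against whatever timeouts are configured on the CRS side.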

I appreciate any help that anyone can offer.

Thanks in advance,
Tony Kolstee
Sr. Systems Engineer
Aetna



