<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.6036" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>We have an Oracle RAC
cluster configured as follows:</FONT></SPAN></DIV>
<UL>
<LI><SPAN class=925474420-22112010><FONT face=Arial size=2>3 nodes with RHEL5,
Oracle 11.1.0.7, and OCFS2 1.4.2</FONT></SPAN>
<LI><SPAN class=925474420-22112010><FONT face=Arial size=2>Voting disks and OCR are
on OCFS2; all other shared storage is on ASM</FONT></SPAN>
<LI><SPAN class=925474420-22112010><FONT face=Arial size=2>Storage hardware is
provided by a fibre-channel SAN fabric</FONT></SPAN>
<LI><FONT face=Arial><FONT size=2><SPAN class=925474420-22112010>Interconnect
uses two bonded NICs per server, connected to different blades on a single
switch</SPAN></FONT></FONT></LI></UL>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>Previously, all
three nodes would reboot whenever a single node had problems, whether that node
crashed completely (OS crash) or lost its interconnect.
</FONT></SPAN><SPAN class=925474420-22112010><FONT face=Arial size=2>For testing
purposes, we simulate an OS crash by resetting the server without a graceful
shutdown, and we simulate loss of interconnect by running ifconfig down on both
interfaces in the bond.</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>From what we've
seen, the issue appears to be an interaction between the O2CB and CRS
timeout values: CRS self-fences on the surviving nodes before
O2CB has a chance to time out and recover the dead node.</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>By setting O2CB's
disk heartbeat timeout ((O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds) lower than
the CRS disktimeout value, we were able to configure the cluster so that the
"OS crash" scenario is handled properly: the other two nodes survive when we
hard-reset the third.</FONT></SPAN></DIV>
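<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>A minimal sketch of
the arithmetic behind that comparison, assuming the relation given in the OCFS2
documentation (disk heartbeat timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) x 2
seconds). The function names and the sample values are hypothetical and are not
taken from the cluster described here:</FONT></SPAN></DIV>
<PRE>
```python
# Sketch only: compare the O2CB disk heartbeat timeout with the CSS disktimeout.
# Assumption: O2CB_HEARTBEAT_THRESHOLD = (timeout_seconds / 2) + 1,
# so timeout_seconds = (threshold - 1) * 2.


def o2cb_disk_timeout_seconds(heartbeat_threshold: int) -> int:
    """Seconds of missed disk heartbeats before O2CB declares a node dead."""
    if not heartbeat_threshold >= 2:
        raise ValueError("O2CB_HEARTBEAT_THRESHOLD must be at least 2")
    return (heartbeat_threshold - 1) * 2


def o2cb_times_out_first(heartbeat_threshold: int, css_disktimeout: int) -> bool:
    """True when O2CB gives up on the dead node before CSS disktimeout expires."""
    return css_disktimeout > o2cb_disk_timeout_seconds(heartbeat_threshold)


if __name__ == "__main__":
    # Illustrative values only; read the real ones from /etc/sysconfig/o2cb
    # and from "crsctl get css disktimeout" on the cluster being checked.
    css_disktimeout = 200
    for threshold in (31, 61, 121):
        seconds = o2cb_disk_timeout_seconds(threshold)
        ok = o2cb_times_out_first(threshold, css_disktimeout)
        print(f"threshold={threshold} o2cb_timeout={seconds}s o2cb_first={ok}")
```
</PRE>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>With these sample
values, thresholds of 31 and 61 let O2CB time out before CSS does, while 121
does not.</FONT></SPAN></DIV>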
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>However, we haven't
solved the loss-of-interconnect scenario, and we believe the cause is similar.
I'd rather not get bogged down in specifics such as log files at this point,
because we have to apply these concepts to other environments as well and
these configurations still need some tuning.
</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT
face=Arial size=2>Can anyone please provide a generic conceptual overview of how
the CRS and O2CB timeout values interact in this scenario? Does
O2CB_HEARTBEAT_THRESHOLD come into play at all when the interconnect is lost,
and how does it interact with the value of O2CB_IDLE_TIMEOUT_MS?
How do these relate to the timeouts on the CRS side of the
equation?</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>I appreciate any
help that anyone can offer.</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>Thanks in
advance,</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>Tony
Kolstee</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial size=2>Sr. Systems
Engineer</FONT></SPAN></DIV>
<DIV><SPAN class=925474420-22112010><FONT face=Arial
size=2>Aetna</FONT></SPAN></DIV>
</BODY></HTML>