<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<META content="MSHTML 6.00.2800.1586" name=GENERATOR></HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT size=2>Absolutely. I know how redo and RAC interact; you are correct.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>Sometimes CSSD reboots one node and that's all - good luck. Sometimes OCFS
reboots one node and CSSD reboots another node - bad luck. That's why it is important not to
mix different cluster managers on the same servers, or at least to let them interact and
reach the same decision about _who is master today_ (i.e., who survives a split-brain
situation).</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>RAC is the simpler case because Oracle is usually the primary service - so
if it decides to reboot, that is a reasonable decision. OCFSv2 is another story - sometimes
it is a _secondary_ service (for example, used for backups only), and if it is secondary it
would be better for it to stop working than to reboot the node.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>This reveals 2 big problems (both Oracle and OCFSv2 are
affected):</FONT></DIV>
<DIV><FONT size=2>- A single-interface heartbeat is not reliable. You CAN NOT build a
reliable cluster on a single heartbeat channel. Classical clusters (Veritas VCS) use 2 - 4
different heartbeat media (we use 2 independent Ethernet hubs and 2 generic Ethernet LANs in
Veritas, 2 Ethernets + 1 serial link in Linux clusters, 2 Ethernets + 1 serial link in a
Cisco PIX cluster, and so on). Neither OCFS nor Oracle RAC can use more than one (to be
precise, you can configure several interfaces for the RAC interconnect in the SPFile, but
that does not affect CSSD). In addition, the OCFS defaults are strange and unrealistic -
Ethernet and FC cannot guarantee heartbeat times better than about 1 minute (I mean that
during any network reconfiguration the heartbeat will see a 30 - 50 second delay, so if you
configure a 12-second timeout /the default in OCFSv2/ you are naive at best). See the
configuration sketch after this list.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>- _Self fencing_ is triggered too easily. Again, if an OCFS node loses its
connection to the disk, it should not self-fence - it can send data to the other nodes (or
request data from them), it can unmount the file system and try to remount it, it can
release control and resume operations later. Immediate fencing is necessary in SOME cases,
but not in all of them. If the file system has no pending operations, fencing by reboot
achieves little more than a simple _remount_ would. It is not as simple as I describe here,
but the truth is that the fencing decisions are not flexible enough and decrease reliability
dramatically (I posted a list of scenarios in which fencing should not happen). A rough
sketch of the _remount instead of reboot_ idea follows below.</FONT></DIV>
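<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>For reference, this is roughly where the single channel and the timeouts
live. The addresses and values below are only an illustration (modelled on the log further
down in this thread; the second node's address is made up), and the O2CB_*_MS network
timeouts exist only in the newer 1.2.x releases - on older ones only the disk heartbeat
threshold is tunable:</FONT></DIV>
<PRE>
# /etc/ocfs2/cluster.conf - note: exactly ONE ip_address per node, i.e. a
# single heartbeat / interconnect channel.  (The real file is picky about
# whitespace - parameters are indented with a tab.)
cluster:
        node_count = 2
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 10.254.32.111
        number = 0
        name = testrac11
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.254.32.112
        number = 1
        name = testrac12
        cluster = ocfs2

# /etc/sysconfig/o2cb - the disk heartbeat threshold.  Fencing fires after
# roughly (threshold - 1) * 2 seconds, so the default of 7 gives the ~12
# seconds mentioned above; 31 gives about 60 seconds.
O2CB_HEARTBEAT_THRESHOLD=31

# Newer 1.2.x releases also expose the network timeouts (the "idle for 10
# seconds" message in the log below); older releases hard-code them.
O2CB_IDLE_TIMEOUT_MS=60000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
</PRE>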
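<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>And to illustrate the point about passive nodes: when the volume is idle,
what an administrator would do by hand is much closer to a remount than to a reboot. A rough
sketch (the mount point and device names are made up):</FONT></DIV>
<PRE>
# Hypothetical manual recovery on a node whose OCFS2 volume is idle
# (/backups and /dev/sdb1 are made-up names).  In practice the umount itself
# can block if the DLM is already stuck, which is part of why the current
# o2cb code does not try this - but for a passive backup volume it is
# usually enough.
if ! fuser -m /backups; then
    umount /backups                      # nothing open - release it cleanly
    mount -t ocfs2 /dev/sdb1 /backups    # rejoin once the interconnect is back
fi
</PRE>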
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>In addition, I have noticed other problems with OCFSv2 (such as excessive
CPU usage in some cases).</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>I use OCFSv2, even in production. But I do it with a grain of salt, keep a
backup plan for _how to run without it_, and don't use it for heavily loaded file systems
with millions of files (for those I use Heartbeat, reiserfs and APC switch fencing - with 3
independent heartbeat links and a 40-second timeout). So far I have had one glitch with
OCFSv2 (it remounted read-only on one node) and that's all - no other problems in production
(OCFSv2 is used only during starts/stops, so it is safe). But I have run stress tests in the
lab, and I am running it in lab clusters now (including RAC), and the conclusion is simple -
as a cluster, it is not reliable; as a file system, it may have hidden bugs, so be extra
careful with it.</FONT></DIV>
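<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>For comparison, that setup looks roughly like this in Linux-HA's
/etc/ha.d/ha.cf (interface and device names are placeholders, and the exact STONITH plugin
name and parameters depend on the APC model - "stonith -L" lists what is
available):</FONT></DIV>
<PRE>
# /etc/ha.d/ha.cf - three independent heartbeat paths and a 40-second deadtime
keepalive 2              # heartbeat interval, seconds
deadtime 40              # declare a node dead only after 40 seconds of silence
warntime 20
initdead 80              # be extra patient right after boot
udpport 694
bcast eth1               # link 1: dedicated heartbeat LAN
bcast eth2               # link 2: second, independent LAN
serial /dev/ttyS0        # link 3: null-modem cable
baud 19200
auto_failback off
node nodea
node nodeb

# Power fencing through the APC switch instead of self-fencing (plugin name
# and arguments are illustrative - check the plugin's own help output).
stonith_host * apcmaster apc-switch apc-login apc-password
</PRE>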
<DIV> </DIV>
<DIV><FONT size=2>PS. Good point - it improves every month. Some problems are in
the past already.</FONT> </DIV>
<DIV><FONT size=2></FONT> </DIV>
<DIV><FONT size=2>PPS. All these lab reboots were caused by extremely heavy load or by
hardware failures (simulated or real). It works better in real life. But my experience tells
me that if I can break something in the lab in 3 days, it is a matter of a few months before
it breaks in production.</FONT></DIV>
<DIV><FONT size=2></FONT> </DIV>
<BLOCKQUOTE
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=lfreitas34@yahoo.com href="mailto:lfreitas34@yahoo.com">Luis
Freitas</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=Ocfs2-users@oss.oracle.com
href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Saturday, February 10, 2007 4:52
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> Re: Hmm,here is an example. Re:
[Ocfs2-users] Also just a comment to theOracle guys</DIV>
<DIV><BR></DIV>
<DIV>Alexei,</DIV>
<DIV> </DIV>
<DIV> Actually your log seems to show that CSSD (Oracle
CRS) rebooted the node before OCFS2 got a chance to do it.</DIV>
<DIV> </DIV>
<DIV>    On a RAC cluster, if the interconnect is interrupted, all the nodes hang until
split-brain resolution is complete and the recovery of all the crashed nodes is finished.
This is needed because every read of an Oracle data block needs a ping to the other
nodes. </DIV>
<DIV> </DIV>
<DIV>    The view of the data must be consistent: when one node reads a particular data
block, the Oracle Database first pings the other nodes to check whether any of them has
modified the block and not yet flushed it to disk. Another node may even forward a reply
with the block itself, avoiding the disk access (Cache Fusion). </DIV>
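<DIV> </DIV>
<DIV>    For reference, this ping traffic is visible in the global cache statistics on a
running cluster. A rough sketch, using the 10g statistic names (9i calls them "global
cache ..."):</DIV>
<PRE>
# Show how many blocks each instance has received over the interconnect
# via Cache Fusion (run as a DBA on any node; 10g statistic names).
echo "select inst_id, name, value from gv\$sysstat
      where name in ('gc cr blocks received', 'gc current blocks received');" \
  | sqlplus -s / as sysdba
</PRE>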
<DIV> </DIV>
<DIV>    When a split brain occurs, the blocks not yet flushed to disk are lost, and they
are rebuilt using the redo threads of the nodes that crashed. During this interval all the
database instances "freeze", since until node recovery is complete there is no way to
guarantee that a block read from disk has not been altered on a crashed node.</DIV>
<DIV> </DIV>
<DIV>    So fencing is needed even if there is no disk activity, as the entire cluster hangs
the moment the interconnect is down. And the fencing timeout must be as small as possible to
prevent a long cluster reconfiguration delay. Of course, the timeout must be tuned to be
larger than Ethernet switch failovers and storage controller or disk multipath failovers -
or, if possible, those failover times should be reduced.</DIV>
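<DIV> </DIV>
<DIV>    A crude way to pick that number is to measure how long the interconnect actually
goes quiet during a controlled switch failover, and then set the fencing timeout comfortably
above the longest gap. For example (the peer address is the one from the log below):</DIV>
<PRE>
# While someone fails over the switch, watch for the gap in the timestamps;
# -D prints a Unix timestamp per reply (Linux iputils ping).
ping -D -i 0.2 10.254.32.111 | tee /tmp/interconnect-failover.log
</PRE>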
<DIV> </DIV>
<DIV>    Now, on the other hand, I too am having problems with OCFS2. It seems much less
robust than ASM and the previous version, OCFS, especially under heavy disk activity. But I
do expect these problems to be solved in the near future, as the 2.4 kernel VM problems
were.</DIV>
<DIV> </DIV>
<DIV>Regards,</DIV>
<DIV>Luis</DIV>
<DIV><BR><B><I>Alexei_Roudnev <Alexei_Roudnev@exigengroup.com></I></B>
wrote:</DIV>
<BLOCKQUOTE class=replbq
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #1010ff 2px solid">
<META content="MSHTML 6.00.2800.1561" name=GENERATOR>
<STYLE></STYLE>
<DIV><FONT face=Arial size=2>Additional info - the node did not have ANY active OCFSv2
operations (OCFSv2 is used for backups only, and only from another node). So if the system
had just SUSPENDED all FS operations and tried to rejoin the cluster, everything could have
kept working (moreover, the connection to the disk system was intact, so it could have
closed the file system gracefully).</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>It reveals 3 problems at once:</FONT></DIV>
<DIV><FONT face=Arial size=2>- a single heartbeat link (instead of multiple
links);</FONT></DIV>
<DIV><FONT face=Arial size=2>- a timeout that is too short (Ethernet can't guarantee 10
seconds; 1 minute is the realistic minimum it can guarantee);</FONT></DIV>
<DIV><FONT face=Arial size=2>- fencing even when the system is passive and could remount /
reconnect instead of rebooting.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>All we did in the lab was _disconnect one of the trunks between
the switches for a few seconds, then plug it back into the socket_. No other application
failed (including the heartbeat clusters). The database cluster was not doing anything on
OCFS at the time of the failure (not even backups).</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I will try heartbeat between loopback interfaces (and the OCFS
protocol) next time (I am just curious whether it can ride out a 10-second network
reconfiguration).</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial color=#0000ff size=2><STRONG>...</STRONG></FONT></DIV>
<DIV><FONT face=Arial color=#0000ff size=2><STRONG>Feb 1 12:19:13
testrac12 kernel: o2net: connection to node testrac11 (num 0) at
10.254.32.111:7777 has been idle for <FONT color=#ff0000>10 seconds</FONT>,
shutting it down. <BR>Feb 1 12:19:13 testrac12 kernel:
(13,3):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1170361135.521061 now 1170361145.520476 dr 1170361141.852795
adv 1170361135.521063:1170361135.521064 func (c4378452:505)
1170361067.762941:1170361067.762967) <BR>Feb 1 12:19:13 testrac12
kernel: o2net: no longer connected to node testrac11 (num 0) at
10.254.32.111:7777 <BR>Feb 1 12:19:13 testrac12 kernel:
(1855,3):dlm_send_remote_convert_request:398 ERROR: status = -107 <BR>Feb
1 12:19:13 testrac12 kernel: (1855,3):dlm_wait_for_node_death:371
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death
of node 0 <BR>Feb 1 12:19:13 testrac12 kernel:
(1855,1):dlm_send_remote_convert_request:398 ERROR: status = -107 <BR>Feb
1 12:19:13 testrac12 kernel: (1855,1):dlm_wait_for_node_death:371
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death
of node 0 <BR>Feb 1 12:22:22 testrac12 kernel:
(1855,2):dlm_send_remote_convert_request:398 ERROR: status = -107 <BR>Feb
1 12:22:22 testrac12 kernel: (1855,2):dlm_wait_for_node_death:371
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death
of node 0 <BR>Feb 1 12:22:27 testrac12 kernel:
(13,3):o2quo_make_decision:144 ERROR: fencing this node because it is
connected to a half-quorum of 1 out of 2 nodes which doesn't include the
lowest active node 0 <BR>Feb 1 12:22:27 testrac12 kernel:
(13,3):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active
regions. <BR>Feb 1 12:22:27 testrac12 kernel: Kernel panic: ocfs2 is
very sorry to be fencing this system by panicing <BR>Feb 1 12:22:27
testrac12 kernel: <BR>Feb 1 12:22:28 testrac12 su: pam_unix2: session
finished for user oracle, service su <BR>Feb 1 12:22:29 testrac12
logger: Oracle CSSD failure. Rebooting for cluster integrity. <BR>Feb
1 12:22:32 testrac12 su: pam_unix2: session finished for user oracle,
service su
<BR>...</STRONG></FONT></DIV></BLOCKQUOTE><BR>
<HR>
<P></P>_______________________________________________<BR>Ocfs2-users mailing
list<BR>Ocfs2-users@oss.oracle.com<BR>http://oss.oracle.com/mailman/listinfo/ocfs2-users</BLOCKQUOTE></BODY></HTML>