Laurent,<br><br>&nbsp; I am not a developer and I am not very familiar with the inner workings of OCFS2, so I am assuming some of the things below based on generic cluster design.<br><br>&nbsp; There are two heartbeats, the network heartbeat and the disk heartbeat.<br><br>&nbsp;&nbsp; If I am not mistaken, the disk heartbeat is done on the block device that is mounted as an OCFS2 filesystem. So you can decide which node will fence by cutting its access to the disk device. <br><br>&nbsp;&nbsp; When using a SAN this is fairly simple, since there is an external disk device and one node eventually locks the device and forces the other node to be evicted.<br><br>&nbsp;&nbsp;&nbsp; Since you are using DRBD, you need to make sure that the node that your cluster manager evicts cannot access the DRBD device any longer. As there are two paths to the DRBD device on each node (one local device and one remote device), I am not exactly sure how you will accomplish this or if DRBD already

 has this kind of control to prevent a split brain, but what you need to do is block access to the shared disk device on the evicted node before the OCFS2 timeout.<br><br>Regards,<br>Luis<br><br><b><i>Laurent Neiger &lt;Laurent.Neiger@grenoble.cnrs.fr&gt;</i></b> wrote:<blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;">      <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">  Hi guys,<br> <br> I'm keeping you posted on my issue...<br> <br> <br> Luis Freitas wrote: <blockquote cite="mid:481106.68258.qm@web51410.mail.re2.yahoo.com" type="cite">Laurent,<br>   <br> &nbsp;&nbsp;&nbsp; What you need to be able to decide is which node still has network connectivity. If both have network connectivity you could fence either of them. If both lost connectivity (someone turned the switch off), then you are in trouble.<br>   <br> &nbsp;&nbsp; You will need to plug the backend network into a switch and

 monitor the interface status, so when one machine is shut down or you disconnect its network cable, you still get the up status on the other machine. If you don't want to use two switches, plug them into the same switch and use different VLANs.<br> </blockquote> <br> Yes, I managed to do that. In my cluster manager, I can tell which node is still up before the ocfs2 timers fence<br> all nodes but the lowest-numbered one, even if it is node0 that is off the network and node1 that is still connected.<br> <br> <blockquote cite="mid:481106.68258.qm@web51410.mail.re2.yahoo.com" type="cite">&nbsp;&nbsp; To deal with OCFS2 I think the easiest approach is to increase its timeouts to let your cluster manager decide which node will survive before the OCFS2 heartbeat fences the node. I wouldn't be messing with its inner workings, YMMV...<br> </blockquote> <br> I think I managed to give my cluster manager enough time to decide without having to increase the ocfs2 timeouts.<br> <br> But that is not where my problem lies.<br>

 It's _HOW_ to cancel ocfs2 self-fencing on node1 if I work out that node0 has to be fenced and not node1.<br> <br> I tried this:<br> node0 and node1 are OK and in the ocfs2 cluster, the shared disk is mounted, all is fine.<br> I guess both of them are writing their timers every two seconds to their blocks in the "heartbeat system file",<br> as mentioned in the FAQ.<br> <br> But what/where is the "heartbeat system file", BTW?<br> <br> When I unplug node0's network link, both of them say they lost their netcomm to the peer.<br> Within the first five seconds, my cluster manager works out that node0 is off the network<br> and node1 is OK. So the decision is taken to have node0 fenced and to cancel fencing for node1<br> (as node1 would otherwise be fenced, since ocfs2 fences the<br> higher node number and leaves the lower one alive).<br> <br> So the cluster manager runs "ocfs2_hb_ctl -K -d /dev/drbd0", which stops the heartbeat on node1.<br> <br> But this doesn't keep node1 from self-fencing

 28 seconds after netcomm lost, and<br> node0 stays alive with its dead network card. My entire cluster is down. No service<br> or data access remains available.<br> <br> Logical in hindsight: the heartbeat was stopped but the timers kept counting down, since nothing reset them.<br> <br> <blockquote cite="mid:481106.68258.qm@web51410.mail.re2.yahoo.com" type="cite"><b><i>Sunil Mushran <a class="moz-txt-link-rfc2396E" href="mailto:Sunil.Mushran@oracle.com">&lt;Sunil.Mushran@oracle.com&gt;</a></i></b> wrote:   <blockquote class="replbq" style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;">Each of those pings will require a timeout - short timeouts. So short <br> that you<br> may not even be able to distinguish between errors and overloaded run-queue,<br> transmit queue, router, etc.</blockquote> </blockquote> <br> Once more, I think I have achieved that. My problem is how to cancel the self-fencing of node1,<br> not how to decide to do so.<br> <br> <br> I'm sorry to annoy

 you, you might find it trivial but I probably missed something.<br> <br> You wrote "one does not have to have 3 nodes when one only wants 2 nodes".<br> Great, this is fine for me as I don't (and can't) have SANs and drbd allows max 2 nodes<br> for disk-sharing.<br> <br> I also read that fencing all nodes but the lowest-numbered one is the intended behavior<br> of ocfs2.<br> <br> So let me rephrase my question:<br> <br> How can I make a 2-node cluster work with high availability, i.e. still having access to<br> the remaining node in the event of _ANY_ node failure? The cluster will be degraded,<br> with only one node remaining until we repair and power up the node which failed, but with no<br> loss of service.<br> Even if node0 fails, node1 should take over the tasks rather than self-fencing.<br> <br> Once more, thanks a lot for your help.<br> <br> Have a good day,<br> <br> best regards,<br> <br> Laurent.<br> <br> <br> begin:vcard<br>fn:Laurent

 Neiger<br>n:Neiger;Laurent<br>org;quoted-printable:CNRS Grenoble;Centre R=C3=A9seau &amp; Informatique Commun<br>adr:B.P. 166;;25, avenue des Martyrs;Grenoble;;38042;France<br>email;internet:Laurent.Neiger@grenoble.cnrs.fr<br>title;quoted-printable:Administrateur Syst=C3=A8mes &amp; R=C3=A9seaux<br>tel;work:(0033) (0)4 76 88 79 91<br>tel;fax:(0033) (0)4 76 88 12 95<br>note:Certificats : http://igc.services.cnrs.fr/Doc/General/trust.html<br>x-mozilla-html:TRUE<br>url:http://cric.grenoble.cnrs.fr<br>version:2.1<br>end:vcard<br><br>_______________________________________________<br>Ocfs2-users mailing list<br>Ocfs2-users@oss.oracle.com<br>http://oss.oracle.com/mailman/listinfo/ocfs2-users</blockquote><br><p>&#32;
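<p>As a rough illustration of the two points discussed in this thread (deciding which node to evict from link status, and cutting its disk access before the OCFS2 timeout), here is a minimal Python sketch. The function names, the node labels and the 2-second figure for blocking the disk are assumptions made up for the example; the 5-second decision time and the 28-second self-fence delay are the figures Laurent reported. It only models the logic and does not call any OCFS2 or DRBD tool.</p>

```python
from typing import Dict, Optional


def node_to_fence(link_up: Dict[str, bool]) -> Optional[str]:
    """Pick the node to evict in a two-node cluster (hypothetical helper).

    link_up maps a node name to whether its network link is reported up.
    Returns None when no safe choice exists.
    """
    down = sorted(name for name, up in link_up.items() if not up)
    if not down:
        # Both nodes still have connectivity: either could be fenced, so
        # pick the higher-numbered name purely as a convention.
        return max(link_up)
    if len(down) == len(link_up):
        # Every link is down (e.g. the switch was turned off): no safe choice.
        return None
    # Exactly one node lost its link: that is the one to evict.
    return down[0]


def blocked_in_time(decision_s: float, block_s: float, fence_timeout_s: float) -> bool:
    """True if the evicted node loses disk access before the fence timeout."""
    return fence_timeout_s > decision_s + block_s


if __name__ == "__main__":
    # node0 has its cable unplugged, node1 is still connected.
    print(node_to_fence({"node0": False, "node1": True}))
    # 5 s to decide, an assumed 2 s to cut disk access, 28 s self-fence delay.
    print(blocked_in_time(5, 2, 28))
```

<p>The second check is the important one: whatever mechanism ends up blocking the device, the decision time plus the time to block has to stay below the OCFS2 timeout.</p>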
