<HTML>

<HEAD>

<TITLE>Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)</TITLE>

</HEAD>

<BODY>

<FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>I am running a two-node web cluster on OCFS2 over DRBD (v8.3.8) in Primary/Primary mode, managed by Pacemaker. Everything seems to be working well, except when I test hard-boot scenarios.<BR>

<BR>

Whenever I hard-boot one of the nodes, that node is successfully fenced and its disk is marked &#8220;Outdated&#8221; in the status reported by the surviving node:<BR>

<BR>

</SPAN></FONT><UL><LI><SPAN STYLE='font-size:11pt'><FONT FACE="Courier, Courier New">&lt;resource minor=&quot;0&quot; cs=&quot;WFConnection&quot; ro1=&quot;Primary&quot; ro2=&quot;<B>Unknown</B>&quot; ds1=&quot;UpToDate&quot; ds2=&quot;<B>Outdated</B>&quot; /&gt;<BR>

</FONT></SPAN></UL><SPAN STYLE='font-size:11pt'><FONT FACE="Courier, Courier New"><BR>

</FONT><FONT FACE="Calibri, Verdana, Helvetica, Arial">However, this locks up I/O on the still-active node and prevents any operations within the cluster :(<BR>

I have even forced DRBD into StandAlone mode while in this state, but that does not resolve the I/O lock either.<BR>

</FONT><FONT FACE="Courier, Courier New"><BR>

</FONT></SPAN><UL><LI><SPAN STYLE='font-size:11pt'><FONT FACE="Courier, Courier New">&lt;resource minor=&quot;0&quot; cs=&quot;<B>StandAlone</B>&quot; ro1=&quot;<B>Primary</B>&quot; ro2=&quot;Unknown&quot; ds1=&quot;<B>UpToDate</B>&quot; ds2=&quot;Outdated&quot; /&gt;<BR>

</FONT></SPAN></UL><SPAN STYLE='font-size:11pt'><FONT FACE="Courier, Courier New"><BR>

</FONT><FONT FACE="Calibri, Verdana, Helvetica, Arial">The only way I&#8217;ve been able to regain I/O within the cluster is to bring the other node back up. Judging from the logs, it seems to be OCFS2 that takes and releases the lock, and <I>not</I> DRBD at all:<BR>

</FONT></SPAN><BLOCKQUOTE><SPAN STYLE='font-size:11pt'><FONT FACE="Calibri, Verdana, Helvetica, Arial"><BR>

<BR>

Apr &nbsp;1 12:07:19 ubu10a kernel: [ 1352.739777] (ocfs2rec,3643,0):ocfs2_replay_journal:1605 Recovering node 1124116672 from slot 1 on device (147,0)<BR>

Apr &nbsp;1 12:07:19 ubu10a kernel: [ 1352.900874] (ocfs2rec,3643,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1<BR>

Apr &nbsp;1 12:07:19 ubu10a kernel: [ 1352.902509] (ocfs2_wq,1213,0):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1<BR>

<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.423915] block drbd0: Handshake successful: Agreed network protocol version 94<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.433074] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.433083] block drbd0: conn( WFConnection -&gt; WFReportParams )<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.433097] block drbd0: Starting asender thread (from drbd0_receiver [2145])<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.433562] block drbd0: data-integrity-alg: &lt;not-used&gt;<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.434090] block drbd0: drbd_sync_handshake:<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.434094] block drbd0: self FBA98A2F89E05B83:EE17466F4DEC2F8B:6A4CD8FDD0562FA1:EC7831379B78B997 bits:4 flags:0<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.434097] block drbd0: peer EE17466F4DEC2F8A:0000000000000000:6A4CD8FDD0562FA0:EC7831379B78B997 bits:2048 flags:2<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.434099] block drbd0: uuid_compare()=1 by rule 70<BR>

Apr &nbsp;1 12:07:20 ubu10a kernel: [ 1354.434104] block drbd0: peer( Unknown -&gt; Secondary ) conn( WFReportParams -&gt; WFBitMapS )<BR>

Apr &nbsp;1 12:07:21 ubu10a kernel: [ 1354.601353] block drbd0: conn( WFBitMapS -&gt; SyncSource ) pdsk( Outdated -&gt; Inconsistent )<BR>

Apr &nbsp;1 12:07:21 ubu10a kernel: [ 1354.601367] block drbd0: Began resync as SyncSource (will sync 8192 KB [2048 bits set]).<BR>

Apr &nbsp;1 12:07:21 ubu10a kernel: [ 1355.401912] block drbd0: Resync done (total 1 sec; paused 0 sec; 8192 K/sec)<BR>

Apr &nbsp;1 12:07:21 ubu10a kernel: [ 1355.401923] block drbd0: conn( SyncSource -&gt; Connected ) pdsk( Inconsistent -&gt; UpToDate )<BR>

Apr &nbsp;1 12:07:22 ubu10a kernel: [ 1355.612601] block drbd0: peer( Secondary -&gt; Primary )<BR>

<BR>

<BR>

</FONT></SPAN></BLOCKQUOTE><SPAN STYLE='font-size:11pt'><FONT FACE="Calibri, Verdana, Helvetica, Arial">Therefore, my question is whether OCFS2 has an option to remove or prevent this lock, especially since the filesystem runs on top of DRBD. I&#8217;m still new to OCFS2, so I am open to any criticism of my setup or approach, and to any recommendations for keeping the cluster active when the other node is shut down during testing.</FONT></SPAN>

</BODY>

</HTML>