<HTML>

<HEAD>

<TITLE>Re: [Ocfs2-users] Trouble getting node to re-join two node cluster (OCFS2/DRBD Primary/Primary)</TITLE>

</HEAD>

<BODY>

<FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>Interesting observation. Thank you, Sunil.<BR>

<BR>

I should note that I could not figure out how to perform a stack trace from within Pacemaker directly, so I waited for Pacemaker to start O2CB/OCFS2/DLM and then tried to mount manually to get the trace.<BR>

<BR>

I&#8217;ve noticed that as soon as the start fails (via Pacemaker), the DRBD Primary device gets demoted to Secondary. I wonder if my manual mount attempt came too late, with /dev/drbd0 already in the Secondary state? That seems likely, since it would satisfy the first condition: </SPAN></FONT><FONT SIZE="2"><FONT FACE="Consolas, Courier New, Courier"><SPAN STYLE='font-size:10pt'>if (mdev-&gt;state.role != R_PRIMARY) { ...<BR>

</SPAN></FONT></FONT><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'><BR>

I wonder what else I could try (manually or via pacemaker) to help determine what may be at fault here?<BR>
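<BR>
One thing that could be checked right before a manual mount is the local DRBD role. The following is a minimal sketch, assuming the DRBD 8.3 /proc/drbd layout in which each device line carries an "ro:Local/Peer" field. The names drbd_roles, local_is_primary and SAMPLE are hypothetical, and the sample line is illustrative rather than output captured from this cluster.<BR>
<PRE>
```python
import re

# Assumption: DRBD 8.3 prints one line per minor in /proc/drbd that
# contains an "ro:Local/Peer" field, e.g. "ro:Primary/Secondary".
ROLE_FIELD = re.compile(r"^\s*(\d+):.*\bro:(\w+)/(\w+)", re.MULTILINE)


def drbd_roles(status_text, minor=0):
    """Return (local_role, peer_role) for the given minor, or None."""
    for match in ROLE_FIELD.finditer(status_text):
        if int(match.group(1)) == minor:
            return match.group(2), match.group(3)
    return None


def local_is_primary(status_text, minor=0):
    """True only if the local side of the given minor reports Primary."""
    roles = drbd_roles(status_text, minor)
    return roles is not None and roles[0] == "Primary"


# Illustrative sample line, not taken from the cluster in this thread.
SAMPLE = " 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----\n"
```
</PRE>
Passing the contents of /proc/drbd to local_is_primary() just before the mount would show whether the device had already been demoted by the time of the attempt.<BR>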

<BR>

Normally I can set a node to standby and then back online with no issues, but somehow this node will no longer join, even after rebooting both nodes in the cluster.<BR>

<BR>

<HR ALIGN=CENTER SIZE="3" WIDTH="95%"><B>From: </B>Sunil Mushran &lt;<a href="sunil.mushran@oracle.com">sunil.mushran@oracle.com</a>&gt;<BR>

<B>Date: </B>Thu, 15 Sep 2011 13:42:54 -0700<BR>

<B>To: </B>Mike Reid &lt;<a href="mbreid@thepei.com">mbreid@thepei.com</a>&gt;<BR>

<B>Cc: </B>&lt;<a href="ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a>&gt;<BR>

<B>Subject: </B>Re: [Ocfs2-users] Trouble getting node to re-join two node cluster (OCFS2/DRBD Primary/Primary)<BR>

<BR>

&nbsp;&nbsp;&nbsp;</SPAN></FONT><FONT SIZE="2"><FONT FACE="Consolas, Courier New, Courier"><SPAN STYLE='font-size:10pt'>open(&quot;/dev/drbd0&quot;, O_RDONLY|O_DIRECT) = -1 EMEDIUMTYPE (Wrong medium type)<BR>

&nbsp;<BR>

&nbsp;drbd_open()<BR>

&nbsp;...<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (mdev-&gt;state.role != R_PRIMARY) {<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (mode &amp; FMODE_WRITE)<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;rv = -EROFS;<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else if (!allow_oos)<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;rv = -EMEDIUMTYPE;<BR>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<BR>

&nbsp;...<BR>

&nbsp;<BR>

&nbsp;So the failure appears to be emanating from drbd. There seems<BR>

&nbsp;to be an allow_oos module param that appears to be 0. I have no idea<BR>

&nbsp;what this param does. Also, I am reading current mainline. 2.6.35 may<BR>

&nbsp;be different.<BR>
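&nbsp;<BR>
&nbsp;As a reading aid, the branch quoted above can be modelled in a few lines of Python. This is only a sketch of the excerpt, not DRBD code: drbd_open_result and its string roles are hypothetical, and the fallback value 124 for EMEDIUMTYPE assumes Linux.<BR>
<PRE>
```python
import errno

# errno.EROFS is standard; fall back to the Linux value if this Python
# build does not expose EMEDIUMTYPE (assumption: Linux numbering).
EMEDIUMTYPE = getattr(errno, "EMEDIUMTYPE", 124)


def drbd_open_result(role, wants_write, allow_oos):
    """Model of the quoted excerpt: return 0 or a negative errno."""
    rv = 0
    if role != "Primary":
        if wants_write:
            # Write access to a non-Primary device is refused.
            rv = -errno.EROFS
        elif not allow_oos:
            # Read-only access is refused unless allow_oos is set.
            rv = -EMEDIUMTYPE
    return rv
```
</PRE>
&nbsp;With a read-only open, as in the failing open() call, the model returns -EMEDIUMTYPE only when the role is not Primary and allow_oos is 0.<BR>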

&nbsp;<BR>

</SPAN></FONT></FONT><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> On 09/15/2011 01:26 PM, Mike Reid wrote: <BR>

</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> <BR>

Hello all,<BR>

<BR>

** I have also posted this in the pacemaker list, but I have a feeling it's<BR>

more OCFS2 specific **<BR>

<BR>

We have a two-node cluster still in development that has been running fine<BR>

for weeks (little to no traffic). I made some updates to our CIB recently,<BR>

and everything seemed just fine.<BR>

<BR>

Yesterday I attempted to untar ~1.5GB to the OCFS2/DRBD volume, and once it<BR>

was complete one of the nodes had become completely disconnected and I<BR>

haven't been able to reconnect since.<BR>

<BR>

DRBD is working fine, everything is UpToDate and I can get both nodes in<BR>

Primary/Primary, but when it comes down to starting OCFS2 and mounting the<BR>

volume, I'm left with:<BR>

<BR>

&nbsp;<BR>

</SPAN></FONT><BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> <BR>

resFS:0_start_0 (node=node1, call=21, rc=1, status=complete): unknown error<BR>

&nbsp;<BR>

</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> <BR>

<BR>

I am using &quot;pcmk&quot; as the cluster_stack, and letting Pacemaker control<BR>

everything...<BR>

<BR>

The last time this happened the only way I was able to resolve it was to<BR>

reformat the device (via mkfs.ocfs2 -F). I don't think I should have to do<BR>

this: the underlying blocks seem fine, and one of the nodes is running just<BR>

fine. The (currently) unmounted node is staying in sync as far as DRBD is<BR>

concerned.<BR>

<BR>

Here's some detail that hopefully will help, please let me know if there's<BR>

anything else I can provide to help find the best way to get this node back<BR>

&quot;online&quot;:<BR>

<BR>

<BR>

Ubuntu 10.10 / Kernel 2.6.35<BR>

<BR>

Pacemaker 1.0.9.1<BR>

Corosync 1.2.1<BR>

Cluster Agents 1.0.3 (Heartbeat)<BR>

Cluster Glue 1.0.6<BR>

OpenAIS 1.1.2<BR>

<BR>

DRBD 8.3.10<BR>

OCFS2 1.5.0<BR>

<BR>

cat /sys/fs/ocfs2/cluster_stack = pcmk<BR>

<BR>

node1: mounted.ocfs2 -d<BR>

<BR>

Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FS &nbsp;&nbsp;&nbsp;&nbsp;UUID &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Label<BR>

/dev/sda3 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ocfs2 &nbsp;fe4273e1-f866-4541-bbcf-66c5dfd496d6<BR>

<BR>

node2: mounted.ocfs2 -d<BR>

<BR>

Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FS &nbsp;&nbsp;&nbsp;&nbsp;UUID &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Label<BR>

/dev/sda3 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ocfs2 &nbsp;d6f7cc6d-21d1-46d3-9792-bc650736a5ef<BR>

/dev/drbd0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ocfs2 &nbsp;d6f7cc6d-21d1-46d3-9792-bc650736a5ef<BR>

<BR>

* NOTES:<BR>

- Both nodes are identical, in fact one node is a direct mirror (hdd clone)<BR>

- I have attached the CIB (crm configure edit contents) and mount trace<BR>

<BR>

&nbsp;<BR>

<BR>

<BR>


&nbsp;<BR>

</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'> <BR>

&nbsp;<BR>

</SPAN></FONT>

</BODY>

</HTML>