[Ocfs2-users] Trouble getting node to re-join two node cluster (OCFS2/DRBD Primary/Primary)

Thu Sep 15 13:59:35 PDT 2011

Interesting observation. Thank you, Sunil.

I should note that I could not figure out how to capture a trace from
within Pacemaker directly, so I waited for Pacemaker to start
O2CB/OCFS2/DLM and then tried to mount manually to get the trace.

I've noticed that as soon as the mount fails (via Pacemaker), the DRBD Primary
device gets demoted to Secondary. I wonder whether my manual attempt came
too late and /dev/drbd0 was already in Secondary state. That seems likely,
since it would satisfy the first condition: if (mdev->state.role !=
R_PRIMARY) { ...

I wonder what else I could try (manually or via pacemaker) to help determine
what may be at fault here?
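
One manual check that might help is to repeat the exact open() call from the trace right after the node is promoted, and compare the result with the role DRBD reports at that moment. The following is only a minimal sketch: probe_open() is a hypothetical helper name, not part of DRBD, OCFS2, or Pacemaker, and it assumes a Linux/glibc system where O_DIRECT is available.

```c
#define _GNU_SOURCE   /* needed for O_DIRECT on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: attempt the same open() that the trace shows.
 * Returns 0 on success, otherwise the errno value from open(). */
int probe_open(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        int err = errno;   /* save before fprintf can change it */
        fprintf(stderr, "open(%s): %s\n", path, strerror(err));
        return err;
    }
    close(fd);
    return 0;
}
```

Calling probe_open("/dev/drbd0") while the node is still Primary and again after the failed start would show whether EMEDIUMTYPE only appears once the device has been demoted.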

Normally I can set a node to standby and then back online with no
issues, but somehow this node will no longer join, even after rebooting
both nodes in the cluster.

From: Sunil Mushran <sunil.mushran at oracle.com>
Date: Thu, 15 Sep 2011 13:42:54 -0700
To: Mike Reid <mbreid at thepei.com>
Cc: <ocfs2-users at oss.oracle.com>
Subject: Re: [Ocfs2-users] Trouble getting node to re-join two node cluster
(OCFS2/DRBD Primary/Primary)

   open("/dev/drbd0", O_RDONLY|O_DIRECT) = -1 EMEDIUMTYPE (Wrong medium type)

 drbd_open()
 ...
         if (mdev->state.role != R_PRIMARY) {
                 if (mode & FMODE_WRITE)
                         rv = -EROFS;
                 else if (!allow_oos)
                         rv = -EMEDIUMTYPE;
         }
 ...

 So the failure appears to be coming from drbd. There seems
 to be an allow_oos module param that, judging by the !allow_oos
 test above, is 0 here. I have no idea what this param does. Also, I am
 reading current mainline; 2.6.35 may be different.
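
 To spell out the branches in the excerpt above, here is a small user-space sketch of the same decision. It is not DRBD code: enum role and open_result() are hypothetical stand-ins, and only the if/else structure is taken from the quoted drbd_open() fragment.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the role field; not the real DRBD type. */
enum role { ROLE_SECONDARY, ROLE_PRIMARY };

/* Mirrors the branch structure of the quoted excerpt.
 * Returns 0 if the open would be allowed, otherwise a negative errno. */
int open_result(enum role role, bool want_write, bool allow_oos)
{
    int rv = 0;

    if (role != ROLE_PRIMARY) {
        if (want_write)
            rv = -EROFS;        /* writable open on a non-Primary device */
        else if (!allow_oos)
            rv = -EMEDIUMTYPE;  /* read-only open, allow_oos not set */
    }
    return rv;
}
```

 Under these assumptions, a read-only open returns -EMEDIUMTYPE only when the device is not Primary and allow_oos is 0, which matches the O_RDONLY open in the trace.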

 On 09/15/2011 01:26 PM, Mike Reid wrote:
>  
> Hello all,
> 
> ** I have also posted this in the pacemaker list, but I have a feeling it's
> more OCFS2 specific **
> 
> We have a two-node cluster still in development that has been running fine
> for weeks (little to no traffic). I made some updates to our CIB recently,
> and everything seemed just fine.
> 
> Yesterday I attempted to untar ~1.5GB to the OCFS2/DRBD volume, and once it
> was complete one of the nodes had become completely disconnected and I
> haven't been able to reconnect since.
> 
> DRBD is working fine, everything is UpToDate and I can get both nodes in
> Primary/Primary, but when it comes down to starting OCFS2 and mounting the
> volume, I'm left with:
> 
>> resFS:0_start_0 (node=node1, call=21, rc=1, status=complete): unknown error
> 
> I am using "pcmk" as the cluster_stack, and letting Pacemaker control
> everything...
> 
> The last time this happened the only way I was able to resolve it was to
> reformat the device (via mkfs.ocfs2 -F). I don't think I should have to do
> this, underlying blocks seem fine, and one of the nodes is running just
> fine. The (currently) unmounted node is staying in sync as far as DRBD is
> concerned.
> 
> Here's some detail that will hopefully help. Please let me know if there's
> anything else I can provide to help work out the best way to get this node
> back "online":
> 
> 
> Ubuntu 10.10 / Kernel 2.6.35
> 
> Pacemaker 1.0.9.1
> Corosync 1.2.1
> Cluster Agents 1.0.3 (Heartbeat)
> Cluster Glue 1.0.6
> OpenAIS 1.1.2
> 
> DRBD 8.3.10
> OCFS2 1.5.0
> 
> cat /sys/fs/ocfs2/cluster_stack = pcmk
> 
> node1: mounted.ocfs2 -d
> 
> Device                FS     UUID                                  Label
> /dev/sda3             ocfs2  fe4273e1-f866-4541-bbcf-66c5dfd496d6
> 
> node2: mounted.ocfs2 -d
> 
> Device                FS     UUID                                  Label
> /dev/sda3             ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
> /dev/drbd0            ocfs2  d6f7cc6d-21d1-46d3-9792-bc650736a5ef
> 
> * NOTES:
> - Both nodes are identical, in fact one node is a direct mirror (hdd clone)
> - I have attached the CIB (crm configure edit contents) and mount trace
> 
>  
> 
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>  
