[Ocfs2-users] Removing a node from cluster.conf (on a specific node)

Sunil Mushran sunil.mushran at gmail.com
Sun Apr 29 18:54:21 PDT 2012


Online add/remove of nodes and of global heartbeat devices has been in mainline for over a year; I think that means kernel 2.6.38+ and tools 1.8. The ocfs2-tools tree hosted on oss.oracle.com/git has a 1.8.2 tag that can be used safely. It has been fully tested. The user's guide has been moved to man pages bundled with the tools. Run "man ocfs2" after building and installing the tools.

On Apr 29, 2012, at 1:21 PM, Sébastien Riccio <sr at swisscenter.com> wrote:

> Hi dear list,
> 
> I think the subject might already have been discussed, but I can only find 
> old threads about removing a node from the cluster.
> 
> I was hoping that in 2012 it would be possible to dynamically add/remove 
> nodes from a shared filesystem but this evening I had this problem:
> 
> I wanted to add a node to our ocfs2 cluster, node named xen-blade11 with 
> ip 10.111.10.111
> 
> So on every other node I ran this command:
> 
> o2cb_ctl -C -i -n xen-blade11 -t node -a number=5 -a ip_address=10.111.10.111 -a ip_port=7777 -a cluster=ocfs2
> 
> Which successfully added the node on every cluster node, except on 
> xen-blade16
> 
> On every node the original cluster.conf was:
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.116
>         number = 0
>         name = xen-blade16
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.115
>         number = 1
>         name = xen-blade15
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.114
>         number = 2
>         name = xen-blade14
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.113
>         number = 3
>         name = xen-blade13
>         cluster = ocfs2
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.112
>         number = 4
>         name = xen-blade12
>         cluster = ocfs2
> 
> cluster:
>         node_count = 5
>         name = ocfs2
> 
> 
> After adding the node, on every cluster.conf I can see that this was added:
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.111
>         number = 5
>         name = xen-blade11
>         cluster = ocfs2
> 
> cluster:
>         node_count = 6
>         name = ocfs2
> 
> EXCEPT on xen-blade16
> 
> There it was added like this:
> 
> node:
>         ip_port = 7777
>         ip_address = 10.111.10.111
>         number = 6
>         name = xen-blade11
>         cluster = ocfs2
> 
> cluster:
>         node_count = 6
>         name = ocfs2
> 
> (Notice the number = 6 instead of number = 5)
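
The mismatch described above (the same node name carrying a different number on one host) is the kind of thing that can be checked mechanically before trying to join. Below is a minimal sketch, assuming the contents of each host's cluster.conf have already been collected into a dict keyed by host name. The function names (parse_cluster_conf, node_numbers, find_mismatches) are hypothetical helpers written for this illustration; they are not part of ocfs2-tools.

```python
from collections import defaultdict


def parse_cluster_conf(text):
    """Parse cluster.conf text into a list of (stanza_type, attrs) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header such as "node:" or "cluster:".
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def node_numbers(text):
    """Map node name -> node number for every node stanza."""
    return {
        attrs["name"]: int(attrs["number"])
        for kind, attrs in parse_cluster_conf(text)
        if kind == "node"
    }


def find_mismatches(confs):
    """confs maps host -> cluster.conf text.

    Returns {node_name: {number: [hosts]}} for every node whose number
    differs between hosts; an empty dict means all hosts agree.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for host, text in confs.items():
        for name, number in node_numbers(text).items():
            seen[name][number].append(host)
    return {
        name: {num: sorted(hosts) for num, hosts in by_num.items()}
        for name, by_num in seen.items()
        if len(by_num) > 1
    }
```

Fed the two variants shown above, find_mismatches would report xen-blade11 with number 5 on most hosts and number 6 on xen-blade16. It only reports the disagreement; it does not attempt to repair any file.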
> 
> So now when I try to connect xen-blade11, every host accepts the 
> connection except xen-blade16, and the attempt to join the cluster is 
> rejected,
> 
> as we can see in the kernel messages on xen-blade11:
> 
> [ 1852.729539] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1852.729892] o2net: Connected to node xen-blade12 (num 4) at 10.111.10.112:7777
> [ 1852.737122] o2net: Connected to node xen-blade14 (num 2) at 10.111.10.114:7777
> [ 1852.741408] o2net: Connected to node xen-blade15 (num 1) at 10.111.10.115:7777
> [ 1854.733759] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1856.737129] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1856.764520] OCFS2 1.5.0
> [ 1858.740877] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1860.744847] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1862.748919] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1864.752929] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1866.756825] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1868.760809] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1870.764937] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1872.768905] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1874.772947] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1876.776928] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1878.780828] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1880.784974] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
> [ 1882.784529] o2net: No connection established with node 0 after 30.0 seconds, giving up.
> [ 1912.864531] o2net: No connection established with node 0 after 30.0 seconds, giving up.
> [ 1917.028531] o2cb: This node could not connect to nodes: 0.
> [ 1917.028684] o2cb: Cluster check failed. Fix errors before retrying.
> [ 1917.028758] (mount.ocfs2,4238,4):ocfs2_dlm_init:3001 ERROR: status = -107
> [ 1917.028880] (mount.ocfs2,4238,4):ocfs2_mount_volume:1879 ERROR: status = -107
> [ 1917.029005] ocfs2: Unmounting device (254,5) on (node 0)
> [ 1917.029022] (mount.ocfs2,4238,4):ocfs2_fill_super:1234 ERROR: status = -107
> [ 1918.860551] o2net: No longer connected to node xen-blade15 (num 1) at 10.111.10.115:7777
> [ 1918.860599] o2net: No longer connected to node xen-blade14 (num 2) at 10.111.10.114:7777
> [ 1918.860636] o2net: No longer connected to node xen-blade12 (num 4) at 10.111.10.112:7777
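
A log like the one above can be condensed into "which peers accepted, which kept dropping the connection". The following is a rough sketch, assuming the text is raw dmesg output with one message per line (not wrapped by a mail client); the regular expressions only match the two o2net message shapes that appear in this thread, and summarize_o2net is a hypothetical helper, not an existing tool. For reference, status = -107 corresponds to ENOTCONN on Linux.

```python
import re

# Patterns for the two o2net message shapes quoted in this thread.
CONNECTED = re.compile(r"o2net: Connected to node (\S+) \(num (\d+)\)")
SHUTDOWN = re.compile(
    r"o2net: Connection to node (\S+) \(num (\d+)\) at \S+ shutdown"
)


def summarize_o2net(log_text):
    """Return (connected, refused) summaries for an o2net kernel log.

    connected maps node name -> node number for peers that accepted.
    refused maps node name -> count of "shutdown" messages for peers
    whose connection was torn down.
    """
    connected = {}
    refused = {}
    for line in log_text.splitlines():
        match = CONNECTED.search(line)
        if match:
            connected[match.group(1)] = int(match.group(2))
            continue
        match = SHUTDOWN.search(line)
        if match:
            name = match.group(1)
            refused[name] = refused.get(name, 0) + 1
    return connected, refused
```

On the messages above, this would list xen-blade12, xen-blade14 and xen-blade15 as connected and xen-blade16 as the only peer with repeated shutdowns, which matches the single host whose cluster.conf disagrees.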
> 
> Okay so far, I thought I would try to remove that node from xen-blade16 
> and re-add it again, but...
> 
> [root@xen-blade16 ~]# o2cb_ctl -D -n xen-blade11
> o2cb_ctl: Not yet supported
> 
> (Not yet supported, how long is "yet"?)
> 
> Please tell me that there is a way to clean this up so I can attach 
> xen-blade11 to the cluster.
> I mean, isn't OCFS2 supposed to be a production-ready filesystem, 
> meaning that you can add/remove
> nodes without having to shut down the cluster?
> 
> I can't do that, it's in production, and I can't even consider shutting 
> down the single node xen-blade16.
> That would require me to migrate its virtual machines (taking almost 64GB 
> of RAM on that server) to another server in the cluster, but we have no 
> free server (that's why I'm adding xen-blade11 to the cluster...).
> 
> I mean, even adding a new server with another name will lead to the same 
> problem: on every node it will be added as node number 6, but it will be 
> node number 7 on xen-blade16... Same problem again...
> 
> Please help :)
> 
> Thanks for reading.
> 
> Cheers,
> Sébastien