[Ocfs2-users] Removing a node from cluster.conf (on a specific node)

Sébastien Riccio sr at swisscenter.com
Sun Apr 29 13:21:48 PDT 2012


Hi dear list,

I think this subject may already have been discussed, but I could only find 
old threads about removing a node from the cluster.

I was hoping that in 2012 it would be possible to dynamically add/remove 
nodes from a shared filesystem but this evening I had this problem:

I wanted to add a node named xen-blade11, with IP 10.111.10.111, to our 
OCFS2 cluster.

So on every other node I ran this command:

o2cb_ctl -C -i -n xen-blade11 -t node -a number=5 -a ip_address=10.111.10.111 -a ip_port=7777 -a cluster=ocfs2

This successfully added the node on every cluster node except 
xen-blade16.

On every node the original cluster.conf was:

node:
         ip_port = 7777
         ip_address = 10.111.10.116
         number = 0
         name = xen-blade16
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.111.10.115
         number = 1
         name = xen-blade15
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.111.10.114
         number = 2
         name = xen-blade14
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.111.10.113
         number = 3
         name = xen-blade13
         cluster = ocfs2

node:
         ip_port = 7777
         ip_address = 10.111.10.112
         number = 4
         name = xen-blade12
         cluster = ocfs2

cluster:
         node_count = 5
         name = ocfs2


After adding the node, I can see that this was added to cluster.conf on every node:

node:
         ip_port = 7777
         ip_address = 10.111.10.111
         number = 5
         name = xen-blade11
         cluster = ocfs2

cluster:
         node_count = 6
         name = ocfs2

EXCEPT on xen-blade16, where it was added like this:

node:
         ip_port = 7777
         ip_address = 10.111.10.111
         number = 6
         name = xen-blade11
         cluster = ocfs2

cluster:
         node_count = 6
         name = ocfs2

(Notice the number = 6 instead of number = 5)
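
A mismatch like this can be spotted mechanically by comparing the cluster.conf copies from two hosts. The following is only a minimal sketch: it assumes the stanza layout shown in this post (a `node:` or `cluster:` header followed by indented `key = value` lines), and the helper names `parse_cluster_conf`, `node_numbers` and `find_mismatches` are hypothetical, not part of ocfs2-tools.

```python
def parse_cluster_conf(text):
    """Parse cluster.conf text into a list of (stanza_type, {key: value})."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header such as "node:" or "cluster:"
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[1][key.strip()] = value.strip()
    return stanzas


def node_numbers(text):
    """Map node name -> node number for every node stanza."""
    return {
        attrs["name"]: int(attrs["number"])
        for kind, attrs in parse_cluster_conf(text)
        if kind == "node"
    }


def find_mismatches(reference, other):
    """Return {name: (ref_number, other_number)} where two copies disagree."""
    ref, oth = node_numbers(reference), node_numbers(other)
    return {
        name: (ref.get(name), oth.get(name))
        for name in sorted(set(ref) | set(oth))
        if ref.get(name) != oth.get(name)
    }


if __name__ == "__main__":
    # Sample stanzas copied from this post; a real check would read the files.
    blade15 = "node:\n\tnumber = 5\n\tname = xen-blade11\n"
    blade16 = "node:\n\tnumber = 6\n\tname = xen-blade11\n"
    print(find_mismatches(blade15, blade16))
```

A node missing from one copy shows up with `None` on that side, so the same comparison also catches an entry that was never added.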

So now when I try to connect xen-blade11, every host accepts the 
connection except xen-blade16, and the cluster join is rejected.

This can be seen in the kernel messages on xen-blade11:

[ 1852.729539] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1852.729892] o2net: Connected to node xen-blade12 (num 4) at 10.111.10.112:7777
[ 1852.737122] o2net: Connected to node xen-blade14 (num 2) at 10.111.10.114:7777
[ 1852.741408] o2net: Connected to node xen-blade15 (num 1) at 10.111.10.115:7777
[ 1854.733759] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1856.737129] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1856.764520] OCFS2 1.5.0
[ 1858.740877] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1860.744847] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1862.748919] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1864.752929] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1866.756825] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1868.760809] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1870.764937] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1872.768905] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1874.772947] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1876.776928] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1878.780828] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1880.784974] o2net: Connection to node xen-blade16 (num 0) at 10.111.10.116:7777 shutdown, state 7
[ 1882.784529] o2net: No connection established with node 0 after 30.0 seconds, giving up.
[ 1912.864531] o2net: No connection established with node 0 after 30.0 seconds, giving up.
[ 1917.028531] o2cb: This node could not connect to nodes: 0.
[ 1917.028684] o2cb: Cluster check failed. Fix errors before retrying.
[ 1917.028758] (mount.ocfs2,4238,4):ocfs2_dlm_init:3001 ERROR: status = -107
[ 1917.028880] (mount.ocfs2,4238,4):ocfs2_mount_volume:1879 ERROR: status = -107
[ 1917.029005] ocfs2: Unmounting device (254,5) on (node 0)
[ 1917.029022] (mount.ocfs2,4238,4):ocfs2_fill_super:1234 ERROR: status = -107
[ 1918.860551] o2net: No longer connected to node xen-blade15 (num 1) at 10.111.10.115:7777
[ 1918.860599] o2net: No longer connected to node xen-blade14 (num 2) at 10.111.10.114:7777
[ 1918.860636] o2net: No longer connected to node xen-blade12 (num 4) at 10.111.10.112:7777
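
With this many repeated messages, it can help to tally which peers connected and which kept dropping. Below is a rough sketch that does so for o2net messages in the format quoted above. It assumes each kernel message sits on a single line, as dmesg prints it, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the o2net messages quoted in this post.
CONNECTED = re.compile(r"o2net: Connected to node (\S+) \(num (\d+)\)")
SHUTDOWN = re.compile(
    r"o2net: Connection to node (\S+) \(num (\d+)\) at \S+ shutdown"
)


def summarize(log_text):
    """Return (connected, failed) for the given kernel log text.

    connected: set of (name, number) pairs that reported "Connected to node"
    failed:    Counter of (name, number) -> count of "shutdown" messages
    """
    connected = set()
    failed = Counter()
    for line in log_text.splitlines():
        match = CONNECTED.search(line)
        if match:
            connected.add((match.group(1), int(match.group(2))))
            continue
        match = SHUTDOWN.search(line)
        if match:
            failed[(match.group(1), int(match.group(2)))] += 1
    return connected, failed


if __name__ == "__main__":
    import sys

    # Example: dmesg | python3 this_script.py
    ok, bad = summarize(sys.stdin.read())
    for name, num in sorted(ok):
        print(f"connected: {name} (num {num})")
    for (name, num), count in bad.most_common():
        print(f"failed:    {name} (num {num}), {count} attempts")
```

Fed the log above, this would single out xen-blade16 (num 0) as the only peer that keeps dropping the connection.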

Okay, so I thought I would try to remove that node on xen-blade16 
and re-add it, but...

[root@xen-blade16 ~]# o2cb_ctl -D -n xen-blade11
o2cb_ctl: Not yet supported

(Not yet supported, how long is "yet"?)

Please tell me that there is a way to clean this up so I can attach 
xen-blade11 to the cluster.
Isn't OCFS2 supposed to be a production-ready filesystem, meaning that 
you can add/remove nodes without having to shut down the cluster?

I can't do that: the cluster is in production, and I can't even consider 
shutting down the single node xen-blade16.
That would require migrating its virtual machines (which use almost 64GB 
of that server's RAM) to another server in the cluster, but we have no 
free server (that's why I'm adding xen-blade11 to the cluster...).

Even adding a new server with another name would lead to the same 
problem: every other node would add it as node number 6, but it would be 
node number 7 on xen-blade16... Same problem again...

Please help :)

Thanks for reading.

Cheers,
Sébastien




