[Ocfs2-users] one node rejects connection from new node

Carl J. Benson cbenson at fhcrc.org
Mon Feb 2 09:57:36 PST 2009


Sunil,

I read the user's guide, and added node "gladstone" by entering
the following command, as root on each of the four nodes:

o2cb_ctl -C -i -n gladstone -t node -a number=3 \
    -a ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2

I copied/pasted the command, so it was identical on all nodes.

On gladstone, /etc/init.d/o2cb status shows:

Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

So I attempted to mount the filesystem with "mount /mnt/cpb_clust".

Merlot1 likes it:
Feb  2 09:40:33 merlot1 kernel: o2net: accepted connection from node
gladstone (num 3) at 140.107.170.108:7778

Merlot2 likes it:
Feb  2 09:40:33 merlot2 kernel: o2net: accepted connection from node
gladstone (num 3) at 140.107.170.108:7777

But wilson1 does not:
Feb  2 09:40:33 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
to connect from unknown node at 140.107.170.108:35267
<...>
Feb  2 09:41:00 wilson1 kernel: (4447,3):o2net_connect_expired:1659
ERROR: no connection established with node 3 after 30.0 seconds, giving
up and returning errors.

On the new node, gladstone, I see:
Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot2 (num
1) at 140.107.158.54:7777
Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot1 (num
0) at 140.107.170.116:7777
Feb  2 09:41:03 gladstone kernel: (7347,2):o2net_connect_expired:1659
ERROR: no connection established with node 2 after 30.0 seconds, giving
up and returning errors.
Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_request_join:1033 ERROR:
status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_try_to_join_domain:1207
ERROR: status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_join_domain:1485 ERROR:
status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_register_domain:1732
ERROR: status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):o2cb_cluster_connect:302
ERROR: status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_dlm_init:2786 ERROR:
status = -107
Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_mount_volume:1560
ERROR: status = -107
Feb  2 09:41:03 gladstone kernel: ocfs2: Unmounting device (8,17) on
(node 0)
Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
merlot1 (num 0) at 140.107.170.116:7777
Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
merlot2 (num 1) at 140.107.158.54:7777
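
For reference, the "status = -107" in the DLM errors above is a negated
Linux errno value. A minimal Python sketch to decode it, assuming it is run
on Linux (errno numbering differs on other platforms); the variable names
are only illustrative:

```python
import errno
import os

status = -107  # value reported by dlm_request_join and the callers above it

# Kernel code returns negated errno values, so flip the sign to look it up.
name = errno.errorcode.get(-status, "unknown")
print(name, "-", os.strerror(-status))
```

On Linux, errno 107 is ENOTCONN, which fits the preceding
o2net_connect_expired message: the domain join failed because no connection
to node 2 was ever established.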

Can you help me figure out where the problem is?
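
One thing that can be checked offline is whether the cluster.conf on each
node (usually /etc/ocfs2/cluster.conf) actually lists gladstone's address.
A minimal Python sketch, assuming the stanza layout quoted further down in
this thread. The helper names parse_cluster_conf and find_node_by_ip are
hypothetical, SAMPLE is only an excerpt of the file, and the script reads
file text only, not the configuration held by the running cluster:

```python
def parse_cluster_conf(text):
    """Parse 'node:' / 'cluster:' stanzas into a list of (kind, fields) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate stanzas
        if line.endswith(":") and "=" not in line:
            current = (line[:-1], {})  # start of a new stanza
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def find_node_by_ip(stanzas, ip):
    """Return the name of the node stanza with this ip_address, or None."""
    for kind, fields in stanzas:
        if kind == "node" and fields.get("ip_address") == ip:
            return fields.get("name")
    return None


# Excerpt of the cluster.conf from this thread (two of the four nodes).
SAMPLE = """\
node:
        ip_port = 7777
        ip_address = 140.107.158.82
        number = 2
        name = wilson1
        cluster = ocfs2

node:
        ip_port = 7778
        ip_address = 140.107.170.108
        number = 3
        name = gladstone
        cluster = ocfs2

cluster:
        node_count = 4
        name = ocfs2
"""

stanzas = parse_cluster_conf(SAMPLE)
print(find_node_by_ip(stanzas, "140.107.170.108"))  # address wilson1 rejected
print(find_node_by_ip(stanzas, "10.0.0.1"))         # an address not in the file
```

Running the same lookup against the file from each node would show whether
any copy lacks the gladstone entry. A correct file does not by itself prove
that the already-online cluster on wilson1 knows about the node, since the
"unknown node" message comes from the running o2net.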

-- 
Carl Benson, PHS Linux SysAdmin  (206-667-4862, cbenson at fhcrc.org)

Sunil Mushran wrote:
> Nodes can be added to an online cluster. The instructions are listed in
> the user's guide.
> 
> On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:
> 
>> Sunil,
>>
>> Thank you for responding. I will try o2cb_ctl on Monday, when I have
>> physical access to hit Reset in case one or more nodes lock up.
>>
>> If there really is a requirement to restart the cluster on wilson1 every
>> time I add a new node (and I have five or six more nodes to add), that is
>> too bad. Wilson1 is a 24x7 production system.
>>
>> --Carl Benson
>>
>> Sunil Mushran wrote:
>>> Could be that the cluster was already online on wilson1 when you
>>> propagated the cluster.conf to all nodes. If so, restart the cluster
>>> on that node.
>>>
>>> To add a node to an online cluster, you need to use the o2cb_ctl
>>> command. Details are in the 1.4 user's guide.
>>>
>>>
>>> Carl J. Benson wrote:
>>>
>>>> Hello.
>>>>
>>>> I have three systems that share an ocfs2 filesystem, and I'm
>>>> trying to add a fourth system.
>>>>
>>>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
>>>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
>>>>
>>>> cluster.conf looks like this:
>>>> node:
>>>>        ip_port = 7777
>>>>        ip_address = 140.107.170.116
>>>>        number = 0
>>>>        name = merlot1
>>>>        cluster = ocfs2
>>>>
>>>> node:
>>>>        ip_port = 7777
>>>>        ip_address = 140.107.158.54
>>>>        number = 1
>>>>        name = merlot2
>>>>        cluster = ocfs2
>>>>
>>>> node:
>>>>        ip_port = 7777
>>>>        ip_address = 140.107.158.82
>>>>        number = 2
>>>>        name = wilson1
>>>>        cluster = ocfs2
>>>>
>>>> node:
>>>>        ip_port = 7778
>>>>        ip_address = 140.107.170.108
>>>>        number = 3
>>>>        name = gladstone
>>>>        cluster = ocfs2
>>>>
>>>> cluster:
>>>>        node_count = 4
>>>>        name = ocfs2
>>>>
>>>> gladstone is the new node.
>>>>
>>>> I edited the cluster.conf on wilson1 using ocfs2console, and
>>>> propagated it to the other systems from there.
>>>>
>>>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online ocfs2,
>>>> merlot1 accepts the connection from gladstone, as does merlot2.
>>>> However, wilson1 rejects it as an unknown node! For example:
>>>>
>>>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
>>>> to connect from unknown node at 140.107.170.108:37795
>>>>
>>>> Why would this happen?


