[Ocfs2-users] one node rejects connection from new node

Sunil Mushran sunil.mushran at oracle.com
Mon Feb 2 10:31:53 PST 2009


While you can add nodes to an online cluster, editing an existing
node's configuration is not allowed. Changing it will require a
cluster restart.
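A rough sketch of that restart route, assuming the stock o2cb init script and cluster.conf layout (the node entry below is illustrative, not Carl's real file, and the IP address is made up): stop o2cb on the affected node, align ip_port in /etc/ocfs2/cluster.conf with the rest of the cluster, and bring the cluster back online so configfs is repopulated from the file.

```shell
# Illustrative /etc/ocfs2/cluster.conf node entry (made up, not Carl's
# real file) -- written to /tmp so the edit can be demonstrated safely.
cat > /tmp/cluster.conf.demo <<'EOF'
node:
        ip_port = 7777
        ip_address = 140.107.170.1
        number = 2
        name = wilson1
        cluster = ocfs2
EOF

# Align the port with whichever value the rest of the cluster uses.
sed -i 's/ip_port = 7777/ip_port = 7778/' /tmp/cluster.conf.demo
grep 'ip_port' /tmp/cluster.conf.demo

# On the real node the change only takes effect after a restart, e.g.:
#   umount <ocfs2 mounts> ; /etc/init.d/o2cb offline ocfs2
#   /etc/init.d/o2cb online ocfs2
```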

Carl J. Benson wrote:
> Sunil,
>
> I compared across the four systems. Merlot2 has directories for
> several non-existent nodes due to an earlier attempt. File
> ownership and permissions are the same across all nodes. Here's
> the listing for wilson1:
>
> /root # ls -lR /sys/kernel/config/cluster
>
> /sys/kernel/config/cluster:
>
> total 0
>
> drwxr-xr-x 4 root root 0 2009-01-30 14:05 ocfs2
>
>
> /sys/kernel/config/cluster/ocfs2:
> total 0
> drwxr-xr-x 3 root root    0 2009-01-30 14:05 heartbeat
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 idle_timeout_ms
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 keepalive_delay_ms
> drwxr-xr-x 6 root root    0 2009-01-30 11:25 node
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 reconnect_delay_ms
>
> /sys/kernel/config/cluster/ocfs2/heartbeat:
> total 0
> drwxr-xr-x 2 root root    0 2009-01-30 11:25
> 234829284D7144E6B41D8875C96946D3
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 dead_threshold
>
>
> /sys/kernel/config/cluster/ocfs2/heartbeat/234829284D7144E6B41D8875C96946D3:
> total 0
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 block_bytes
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 blocks
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 dev
> -r--r--r-- 1 root root 4096 2009-02-02 10:08 pid
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 start_block
>
> /sys/kernel/config/cluster/ocfs2/node:
> total 0
> drwxr-xr-x 2 root root 0 2009-01-30 14:05 gladstone
> drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot1
> drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot2
> drwxr-xr-x 2 root root 0 2009-01-30 14:05 wilson1
>
> /sys/kernel/config/cluster/ocfs2/node/gladstone:
> total 0
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>
> /sys/kernel/config/cluster/ocfs2/node/merlot1:
> total 0
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>
> /sys/kernel/config/cluster/ocfs2/node/merlot2:
> total 0
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>
> /sys/kernel/config/cluster/ocfs2/node/wilson1:
> total 0
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>
> OH -- I think I see the problem. The ipv4_port value is 7777 on
> wilson1, but 7778 on all the others.
>
> The reason I switched to 7778 was that 7777 seemed to be
> blocked on gladstone after my previous attempts. Rebooting
> wouldn't clear it. I managed to get o2cb going again by
> switching to 7778.
>
> It seems I can't directly edit ipv4_port -- is it held open while
> the cluster is running? How do I fix this?
>
> Or alternatively, how do I clear port 7777 so I can use it
> again?
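On the "clear port 7777" question, a hedged first step is to check whether anything is actually still bound to it. The pipeline below runs against a made-up sample line in `ss -ltn` format; on the real machine you would pipe live `ss -ltn` (or `netstat -ltn`) output instead, and `fuser -n tcp 7777` can name the owning process.

```shell
# Made-up sample of `ss -ltn` output; on gladstone, run the live command:
#   ss -ltn | awk '$4 ~ /:7777$/'
sample='LISTEN  0  128  0.0.0.0:7777  0.0.0.0:*'
printf '%s\n' "$sample" |
awk '$4 ~ /:7777$/ {print "port 7777 is still bound: " $0}'
```

If a stale o2net socket turns out to be the holder, stopping (or killing) its owner should release the port; a reboot should also clear it unless some service re-binds 7777 at boot.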
>
> Thanks for your help!
>
> --Carl
>
> Sunil Mushran wrote:
>   
>> The o2cb_ctl command should have added the new node to
>> the cluster.conf and configfs (/sys/kernel/config). If wilson1 is not
>> recognizing the new node, something went wrong in adding it to
>> configfs.
>>
>> Do: ls -lR /sys/kernel/config/cluster. The contents should be the
>> same on all nodes. What does it say on wilson1?
>>
>> Carl J. Benson wrote:
>>     
>>> Sunil,
>>>
>>> I read the user's guide, and added node "gladstone" by entering
>>> the following command, as root on each of the four nodes:
>>>
>>> o2cb_ctl -C -i -n gladstone -t node -a number=3 -a
>>> ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2
>>>
>>> I copied/pasted the command, so it was identical on all nodes.
>>>
>>> On gladstone, /etc/init.d/o2cb status shows:
>>>
>>> Driver for "configfs": Loaded
>>> Filesystem "configfs": Mounted
>>> Stack glue driver: Loaded
>>> Stack plugin "o2cb": Loaded
>>> Driver for "ocfs2_dlmfs": Loaded
>>> Filesystem "ocfs2_dlmfs": Mounted
>>> Checking O2CB cluster ocfs2: Online
>>> Heartbeat dead threshold = 31
>>>   Network idle timeout: 30000
>>>   Network keepalive delay: 2000
>>>   Network reconnect delay: 2000
>>> Checking O2CB heartbeat: Not active
>>>
>>> So I attempt to mount the filesystem with mount /mnt/cpb_clust.
>>>
>>> Merlot1 likes it:
>>> Feb  2 09:40:33 merlot1 kernel: o2net: accepted connection from node
>>> gladstone (num 3) at 140.107.170.108:7778
>>>
>>> Merlot2 likes it:
>>> Feb  2 09:40:33 merlot2 kernel: o2net: accepted connection from node
>>> gladstone (num 3) at 140.107.170.108:7777
>>>
>>> But wilson1 does not:
>>> Feb  2 09:40:33 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
>>> to connect from unknown node at 140.107.170.108:35267
>>> <...>
>>> Feb  2 09:41:00 wilson1 kernel: (4447,3):o2net_connect_expired:1659
>>> ERROR: no connection established with node 3 after 30.0 seconds, giving
>>> up and returning errors.
>>>
>>> On the new node, gladstone, I see:
>>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot2 (num
>>> 1) at 140.107.158.54:7777
>>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot1 (num
>>> 0) at 140.107.170.116:7777
>>> Feb  2 09:41:03 gladstone kernel: (7347,2):o2net_connect_expired:1659
>>> ERROR: no connection established with node 2 after 30.0 seconds, giving
>>> up and returning errors.
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_request_join:1033 ERROR:
>>> status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_try_to_join_domain:1207
>>> ERROR: status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_join_domain:1485 ERROR:
>>> status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_register_domain:1732
>>> ERROR: status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):o2cb_cluster_connect:302
>>> ERROR: status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_dlm_init:2786 ERROR:
>>> status = -107
>>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_mount_volume:1560
>>> ERROR: status = -107
>>> Feb  2 09:41:03 gladstone kernel: ocfs2: Unmounting device (8,17) on
>>> (node 0)
>>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>>> merlot1 (num 0) at 140.107.170.116:7777
>>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>>> merlot2 (num 1) at 140.107.158.54:7777
>>>
>>> Can you help me figure out where the problem is?
>>>



