[Ocfs2-users] one node rejects connection from new node

Carl J. Benson cbenson at fhcrc.org
Mon Feb 2 10:51:54 PST 2009


Sunil,

Thanks. I took a close look at the production jobs on wilson1
and realized I had a window of opportunity, so I unmounted
/mnt/cpb_clust and stopped o2cb.

I actually did NOT have to edit gladstone/ipv4_port on wilson1.

I forgot that I have a cron job that runs every 5 minutes and
re-mounts /mnt/cpb_clust if it isn't mounted. It ran before
I had a chance to stop it. By the time I went to edit ipv4_port,
it was already set to 7778!
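For reference, a remount cron job with the behavior described above might look like the following (hypothetical entry; the path is from this thread but the exact crontab line is my guess):

```
*/5 * * * * mountpoint -q /mnt/cpb_clust || mount /mnt/cpb_clust
```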

So I tried "mount /mnt/cpb_clust" on gladstone again, and it
worked!

Thanks again for your help. I think I'm better prepared now
to add the rest of the nodes.
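As a sketch of the check that caught the mismatch, one can walk the per-node ipv4_port attributes under configfs and compare them. The snippet below builds a throwaway mock of the configfs layout (on a live node you would point NODEDIR at /sys/kernel/config/cluster/ocfs2/node instead) so it can run anywhere:

```shell
# Build a mock of the configfs node tree so the loop can run without a
# real cluster; the port values are the ones from this thread.
NODEDIR=$(mktemp -d)/node
mkdir -p "$NODEDIR/wilson1" "$NODEDIR/gladstone"
echo 7777 > "$NODEDIR/wilson1/ipv4_port"    # the stale value on wilson1
echo 7778 > "$NODEDIR/gladstone/ipv4_port"

# Print each node's configured o2net port; a mismatch stands out immediately.
for n in "$NODEDIR"/*; do
    printf '%s: %s\n' "$(basename "$n")" "$(cat "$n/ipv4_port")"
done
```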

--Carl

Sunil Mushran wrote:
> While you can add nodes to an online cluster, editing existing
> configuration is not allowed. It will require a cluster restart.
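A hedged sketch of the restart sequence this implies, using the init script and mount point from this thread (defined as a function only, not executed here, since it needs root and a live cluster):

```shell
restart_o2cb_with_new_config() {
    umount /mnt/cpb_clust            # release the ocfs2 volume first
    /etc/init.d/o2cb stop            # tears down the configfs entries
    # edit /etc/ocfs2/cluster.conf here (e.g. fix the ip_port value)
    /etc/init.d/o2cb start           # repopulates configfs from cluster.conf
    mount /mnt/cpb_clust
}
```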
> 
> Carl J. Benson wrote:
>> Sunil,
>>
>> I compared across the four systems. Merlot2 has directories for
>> several non-existent nodes due to an earlier attempt. File
>> ownership and permissions are the same across all nodes. Here's
>> the listing for wilson1:
>>
>> /root # ls -lR /sys/kernel/config/cluster
>>
>> /sys/kernel/config/cluster:
>> total 0
>> drwxr-xr-x 4 root root 0 2009-01-30 14:05 ocfs2
>>
>> /sys/kernel/config/cluster/ocfs2:
>> total 0
>> drwxr-xr-x 3 root root    0 2009-01-30 14:05 heartbeat
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 idle_timeout_ms
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 keepalive_delay_ms
>> drwxr-xr-x 6 root root    0 2009-01-30 11:25 node
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 reconnect_delay_ms
>>
>> /sys/kernel/config/cluster/ocfs2/heartbeat:
>> total 0
>> drwxr-xr-x 2 root root    0 2009-01-30 11:25 234829284D7144E6B41D8875C96946D3
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 dead_threshold
>>
>> /sys/kernel/config/cluster/ocfs2/heartbeat/234829284D7144E6B41D8875C96946D3:
>> total 0
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 block_bytes
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 blocks
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 dev
>> -r--r--r-- 1 root root 4096 2009-02-02 10:08 pid
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 start_block
>>
>> /sys/kernel/config/cluster/ocfs2/node:
>> total 0
>> drwxr-xr-x 2 root root 0 2009-01-30 14:05 gladstone
>> drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot1
>> drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot2
>> drwxr-xr-x 2 root root 0 2009-01-30 14:05 wilson1
>>
>> /sys/kernel/config/cluster/ocfs2/node/gladstone:
>> total 0
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>>
>> /sys/kernel/config/cluster/ocfs2/node/merlot1:
>> total 0
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>>
>> /sys/kernel/config/cluster/ocfs2/node/merlot2:
>> total 0
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>>
>> /sys/kernel/config/cluster/ocfs2/node/wilson1:
>> total 0
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
>> -rw-r--r-- 1 root root 4096 2009-02-02 10:08 num
>>
>> OH -- I think I see the problem. The contents of ipv4_port
>> are 7777 on wilson1 and 7778 on all the others.
>>
>> The reason I switched to 7778 was that 7777 seemed to be
>> blocked on gladstone after my previous attempts. Rebooting
>> wouldn't clear it. I managed to get o2cb going again by
>> switching to 7778.
>>
>> It seems I can't edit ipv4_port directly -- is the file held
>> open? How do I fix this?
>>
>> Or alternatively, how do I clear port 7777 so I can use it
>> again?
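On the "how do I clear port 7777" question, a first step is to see whether anything is actually bound to it. The check below scans /proc/net/tcp directly, so it needs no extra tools; the assumption is a Linux /proc layout, where local ports appear in hex (7777 is 0x1E61). Note it matches either endpoint of a connection, which is good enough as a first check:

```shell
# /proc/net/tcp records ports in hex, so 7777 becomes 1E61.
# Scan both the IPv4 and IPv6 tables.
port_hex=$(printf '%04X' 7777)
echo "checking for port 7777 (hex $port_hex)"
if grep -qi ":$port_hex " /proc/net/tcp /proc/net/tcp6 2>/dev/null; then
    echo "port 7777 is in use"
else
    echo "port 7777 appears free"
fi
```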
>>
>> Thanks for your help!
>>
>> --Carl
>>
>> Sunil Mushran wrote:
>>  
>>> The o2cb_ctl command should have added the new node to
>>> the cluster.conf and configfs (/sys/kernel/config). If wilson1 is not
>>> recognizing the new node, something went wrong in adding it to
>>> configfs.
>>>
>>> Do: ls -lR /sys/kernel/config/cluster. The contents should be the
>>> same on all nodes. What does it say on wilson1?
>>>
>>> Carl J. Benson wrote:
>>>    
>>>> Sunil,
>>>>
>>>> I read the user's guide, and added node "gladstone" by entering
>>>> the following command, as root on each of the four nodes:
>>>>
>>>> o2cb_ctl -C -i -n gladstone -t node -a number=3 -a
>>>> ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2
>>>>
>>>> I copied/pasted the command, so it was identical on all nodes.
>>>>
>>>> On gladstone, /etc/init.d/o2cb status shows:
>>>>
>>>> Driver for "configfs": Loaded
>>>> Filesystem "configfs": Mounted
>>>> Stack glue driver: Loaded
>>>> Stack plugin "o2cb": Loaded
>>>> Driver for "ocfs2_dlmfs": Loaded
>>>> Filesystem "ocfs2_dlmfs": Mounted
>>>> Checking O2CB cluster ocfs2: Online
>>>> Heartbeat dead threshold = 31
>>>>   Network idle timeout: 30000
>>>>   Network keepalive delay: 2000
>>>>   Network reconnect delay: 2000
>>>> Checking O2CB heartbeat: Not active
>>>>
>>>> So I attempted to mount the filesystem with "mount /mnt/cpb_clust".
>>>>
>>>> Merlot1 likes it:
>>>> Feb  2 09:40:33 merlot1 kernel: o2net: accepted connection from node
>>>> gladstone (num 3) at 140.107.170.108:7778
>>>>
>>>> Merlot2 likes it:
>>>> Feb  2 09:40:33 merlot2 kernel: o2net: accepted connection from node
>>>> gladstone (num 3) at 140.107.170.108:7777
>>>>
>>>> But wilson1 does not:
>>>> Feb  2 09:40:33 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
>>>> to connect from unknown node at 140.107.170.108:35267
>>>> <...>
>>>> Feb  2 09:41:00 wilson1 kernel: (4447,3):o2net_connect_expired:1659
>>>> ERROR: no connection established with node 3 after 30.0 seconds, giving
>>>> up and returning errors.
>>>>
>>>> On the new node, gladstone, I see:
>>>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot2 (num
>>>> 1) at 140.107.158.54:7777
>>>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot1 (num
>>>> 0) at 140.107.170.116:7777
>>>> Feb  2 09:41:03 gladstone kernel: (7347,2):o2net_connect_expired:1659
>>>> ERROR: no connection established with node 2 after 30.0 seconds, giving
>>>> up and returning errors.
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_request_join:1033
>>>> ERROR: status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_try_to_join_domain:1207
>>>> ERROR: status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_join_domain:1485 ERROR:
>>>> status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_register_domain:1732
>>>> ERROR: status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):o2cb_cluster_connect:302
>>>> ERROR: status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_dlm_init:2786 ERROR:
>>>> status = -107
>>>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_mount_volume:1560
>>>> ERROR: status = -107
>>>> Feb  2 09:41:03 gladstone kernel: ocfs2: Unmounting device (8,17) on
>>>> (node 0)
>>>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>>>> merlot1 (num 0) at 140.107.170.116:7777
>>>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>>>> merlot2 (num 1) at 140.107.158.54:7777
>>>>
>>>> Can you help me figure out where the problem is?


