[Ocfs2-users] one node rejects connection from new node

Carl J. Benson cbenson at fhcrc.org
Mon Feb 2 10:24:19 PST 2009


Sunil,

I compared across the four systems. Merlot2 has directories for
several non-existent nodes due to an earlier attempt. File
ownership and permissions are the same across all nodes. Here's
the listing for wilson1:

/root # ls -lR /sys/kernel/config/cluster
/sys/kernel/config/cluster:
total 0
drwxr-xr-x 4 root root 0 2009-01-30 14:05 ocfs2

/sys/kernel/config/cluster/ocfs2:
total 0
drwxr-xr-x 3 root root    0 2009-01-30 14:05 heartbeat
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 idle_timeout_ms
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 keepalive_delay_ms
drwxr-xr-x 6 root root    0 2009-01-30 11:25 node
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 reconnect_delay_ms

/sys/kernel/config/cluster/ocfs2/heartbeat:
total 0
drwxr-xr-x 2 root root    0 2009-01-30 11:25 234829284D7144E6B41D8875C96946D3
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 dead_threshold


/sys/kernel/config/cluster/ocfs2/heartbeat/234829284D7144E6B41D8875C96946D3:
total 0
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 block_bytes
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 blocks
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 dev
-r--r--r-- 1 root root 4096 2009-02-02 10:08 pid
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 start_block

/sys/kernel/config/cluster/ocfs2/node:
total 0
drwxr-xr-x 2 root root 0 2009-01-30 14:05 gladstone
drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot1
drwxr-xr-x 2 root root 0 2009-01-30 14:05 merlot2
drwxr-xr-x 2 root root 0 2009-01-30 14:05 wilson1

/sys/kernel/config/cluster/ocfs2/node/gladstone:
total 0
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 num

/sys/kernel/config/cluster/ocfs2/node/merlot1:
total 0
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 num

/sys/kernel/config/cluster/ocfs2/node/merlot2:
total 0
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 num

/sys/kernel/config/cluster/ocfs2/node/wilson1:
total 0
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_address
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 ipv4_port
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 local
-rw-r--r-- 1 root root 4096 2009-02-02 10:08 num

OH -- I think I see the problem. The content of ipv4_port
is 7777 on wilson1 and 7778 on all the others.
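
A minimal sketch for making this comparison less error-prone: it prints
one line per node directory, using only the attribute files shown in the
listing above (num, ipv4_address, ipv4_port, local). The function name
dump_o2cb_nodes is made up for illustration, and the default path assumes
the cluster is named ocfs2 as it is here.

```shell
# Print one summary line per node directory found under configfs.
# Usage: dump_o2cb_nodes [NODE_DIR]
# NODE_DIR defaults to the path from the listing above (assumes cluster "ocfs2").
dump_o2cb_nodes() {
    node_dir=${1:-/sys/kernel/config/cluster/ocfs2/node}
    for d in "$node_dir"/*/; do
        # Skip the unexpanded glob if the directory has no node entries.
        [ -d "$d" ] || continue
        name=$(basename "$d")
        printf '%s num=%s addr=%s port=%s local=%s\n' \
            "$name" \
            "$(cat "$d/num")" \
            "$(cat "$d/ipv4_address")" \
            "$(cat "$d/ipv4_port")" \
            "$(cat "$d/local")"
    done
}
```

Running dump_o2cb_nodes as root on each node and comparing the output
side by side should make a mismatched port stand out. The local field is
expected to differ from node to node, so ignore it when comparing.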

The reason I switched to 7778 was that 7777 seemed to be
blocked on gladstone after my previous attempts. Rebooting
wouldn't clear it. I managed to get o2cb going again by
switching to 7778.

It seems I can't directly edit ipv4_port -- is it open?
How do I fix this?

Or alternatively, how do I clear port 7777 so I can use it
again?
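
For reference, a rough way to check whether anything is still listening
on TCP port 7777 is to scan /proc/net/tcp for a socket in LISTEN state
(st column 0A) on that port. This is only a sketch: listening_on is a
made-up helper name, it covers IPv4 only, and whether the o2net listener
shows up in this table is an assumption I have not verified.

```shell
# Print the local_address field of every LISTEN-state socket on a TCP port.
# Usage: listening_on PORT [TABLE]
# TABLE defaults to /proc/net/tcp (IPv4 only); ports there are 4-digit hex.
listening_on() {
    port_hex=$(printf '%04X' "$1")
    awk -v p="$port_hex" '
        NR > 1 {
            # $2 is local_address as HEXIP:HEXPORT, $4 is the socket state.
            n = split($2, a, ":")
            if (toupper(a[n]) == p && $4 == "0A") print $2
        }
    ' "${2:-/proc/net/tcp}"
}
```

For example, listening_on 7777 prints nothing if no IPv4 socket is
listening on 7777, and a line such as 00000000:1E61 if one is.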

Thanks for your help!

--Carl

Sunil Mushran wrote:
> The o2cb_ctl command should have added the new node to
> the cluster.conf and configfs (/sys/kernel/config). If wilson1 is not
> recognizing the new node, something went wrong in adding it to
> configfs.
> 
> Do: ls -lR /sys/kernel/config/cluster. The contents should be the
> same on all nodes. What does it say on wilson1?
> 
> Carl J. Benson wrote:
>> Sunil,
>>
>> I read the user's guide, and added node "gladstone" by entering
>> the following command, as root on each of the four nodes:
>>
>> o2cb_ctl -C -i -n gladstone -t node -a number=3 -a
>> ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2
>>
>> I copied/pasted the command, so it was identical on all nodes.
>>
>> On gladstone, /etc/init.d/o2cb status shows:
>>
>> Driver for "configfs": Loaded
>> Filesystem "configfs": Mounted
>> Stack glue driver: Loaded
>> Stack plugin "o2cb": Loaded
>> Driver for "ocfs2_dlmfs": Loaded
>> Filesystem "ocfs2_dlmfs": Mounted
>> Checking O2CB cluster ocfs2: Online
>> Heartbeat dead threshold = 31
>>   Network idle timeout: 30000
>>   Network keepalive delay: 2000
>>   Network reconnect delay: 2000
>> Checking O2CB heartbeat: Not active
>>
>> So I attempt to mount the filesystem with mount /mnt/cpb_clust.
>>
>> Merlot1 likes it:
>> Feb  2 09:40:33 merlot1 kernel: o2net: accepted connection from node
>> gladstone (num 3) at 140.107.170.108:7778
>>
>> Merlot2 likes it:
>> Feb  2 09:40:33 merlot2 kernel: o2net: accepted connection from node
>> gladstone (num 3) at 140.107.170.108:7777
>>
>> But wilson1 does not:
>> Feb  2 09:40:33 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
>> to connect from unknown node at 140.107.170.108:35267
>> <...>
>> Feb  2 09:41:00 wilson1 kernel: (4447,3):o2net_connect_expired:1659
>> ERROR: no connection established with node 3 after 30.0 seconds, giving
>> up and returning errors.
>>
>> On the new node, gladstone, I see:
>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot2 (num
>> 1) at 140.107.158.54:7777
>> Feb  2 09:40:33 gladstone kernel: o2net: connected to node merlot1 (num
>> 0) at 140.107.170.116:7777
>> Feb  2 09:41:03 gladstone kernel: (7347,2):o2net_connect_expired:1659
>> ERROR: no connection established with node 2 after 30.0 seconds, giving
>> up and returning errors.
>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_request_join:1033 ERROR:
>> status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_try_to_join_domain:1207
>> ERROR: status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_join_domain:1485 ERROR:
>> status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):dlm_register_domain:1732
>> ERROR: status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):o2cb_cluster_connect:302
>> ERROR: status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_dlm_init:2786 ERROR:
>> status = -107
>> Feb  2 09:41:03 gladstone kernel: (24118,2):ocfs2_mount_volume:1560
>> ERROR: status = -107
>> Feb  2 09:41:03 gladstone kernel: ocfs2: Unmounting device (8,17) on
>> (node 0)
>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>> merlot1 (num 0) at 140.107.170.116:7777
>> Feb  2 09:41:03 gladstone kernel: o2net: no longer connected to node
>> merlot2 (num 1) at 140.107.158.54:7777
>>
>> Can you help me figure out where the problem is?
>>
>>   
> 

-- 
Carl Benson, PHS Linux SysAdmin  (206-667-4862, cbenson at fhcrc.org)


