[Ocfs2-users] Odd error on FC12 with ocfs2

Sunil Mushran sunil.mushran at oracle.com
Mon Mar 29 20:23:04 PDT 2010


No

On Mar 29, 2010, at 8:10 PM, Angelo McComis <angelo at mccomis.com> wrote:

> Does it matter that the nodes are numbered 1-6 instead of 0-5?
>
>
>
> On Mon, Mar 29, 2010 at 4:25 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
>> Enable some debugging (TCP message logging) and retry the mount:
>>
>> # debugfs.ocfs2 -l TCP allow
>> ...do mount...
>> # debugfs.ocfs2 -l TCP off
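The enable, mount, disable sequence above can be wrapped in a small function so that TCP logging is switched off again even when the mount fails. This is a minimal sketch: the function name `with_tcp_logging` is hypothetical, and the device and mount point in the usage line are placeholders. Only the two `debugfs.ocfs2 -l TCP` commands come from the thread.

```bash
#!/bin/bash
# Hypothetical helper: run a command with OCFS2 TCP debug logging enabled,
# then switch TCP logging off again regardless of the command's result.
with_tcp_logging() {
    # Turn on TCP message logging (command taken from the thread).
    debugfs.ocfs2 -l TCP allow || return
    # Run the wrapped command (e.g. a mount) and remember its exit status.
    "$@"
    local status=$?
    # Always turn TCP logging back off before returning.
    debugfs.ocfs2 -l TCP off
    return "$status"
}
```

Usage, with a placeholder device and mount point: `with_tcp_logging mount -t ocfs2 /dev/sdX1 /mnt/appshare`, then read the logged TCP messages with `dmesg`.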
>>
>>
>> David Murphy wrote:
>>> [root@web2 ~]# nc -z 192.168.102.140 7777
>>> Connection to 192.168.102.140 7777 port [tcp/cbt] succeeded!
>>>
>>> [root@web1 /etc/sysconfig/network-scripts]# nc -z 192.168.102.141 7777
>>> Connection to 192.168.102.141 7777 port [tcp/cbt] succeeded!
>>>
>>> -----Original Message-----
>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
>>> Sent: Monday, March 29, 2010 5:08 PM
>>> To: David Murphy
>>> Cc: ocfs2-users at oss.oracle.com
>>> Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2
>>>
>>> What happens when you use netcat to ping the node?
>>> nc -z host.example.com 7777
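The same netcat probe can be run against every node in one pass from the node that fails to mount. A minimal sketch, assuming an `nc` that accepts `-z` and `-w` (OpenBSD-style netcat does); the function name `probe_nodes` is hypothetical, and the addresses passed to it would be the `ip_address` values from cluster.conf.

```bash
#!/bin/bash
# Hypothetical helper: probe the o2net port on each listed address with
# netcat, the way "nc -z host 7777" is used by hand in the thread.
# Usage: probe_nodes PORT ADDRESS...
probe_nodes() {
    local port=$1
    shift
    local addr failed=0
    for addr in "$@"; do
        # -z: only test that the TCP port accepts a connection;
        # -w 2: give up after 2 seconds instead of hanging.
        if nc -z -w 2 "$addr" "$port" >/dev/null 2>&1; then
            echo "$addr:$port open"
        else
            echo "$addr:$port UNREACHABLE"
            failed=1
        fi
    done
    # Non-zero if any node could not be reached.
    return "$failed"
}
```

For example, `probe_nodes 7777 192.168.102.141 192.168.102.142 192.168.102.111` on web1. Note that this only proves the port accepts a TCP connection; it does not exercise the o2net handshake itself.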
>>>
>>> David Murphy wrote:
>>>
>>>> Some additional data:
>>>> From web1 (new Fedora machine) to web2:
>>>>      [root@web1 /etc/sysconfig/network-scripts]# nmap 192.168.102.141
>>>>
>>>>      Starting Nmap 5.21 ( http://nmap.org ) at 2010-03-29 16:56 CDT
>>>>      Nmap scan report for 192.168.102.141
>>>>      Host is up (0.000076s latency).
>>>>      Not shown: 993 closed ports
>>>>      PORT     STATE SERVICE
>>>>      22/tcp   open  ssh
>>>>      80/tcp   open  http
>>>>      81/tcp   open  hosts2-ns
>>>>      111/tcp  open  rpcbind
>>>>      5666/tcp open  nrpe
>>>>      7777/tcp open  unknown
>>>>      9102/tcp open  jetdirect
>>>>      MAC Address: 00:50:56:A3:58:5D (VMware)
>>>>
>>>>      Nmap done: 1 IP address (1 host up) scanned in 1.18 seconds
>>>>
>>>>
>>>> From web2 to web1 (new Fedora machine):
>>>>      [root@web2 ~]# nmap 192.168.102.140
>>>>
>>>>      Starting Nmap 5.00 ( http://nmap.org ) at 2010-03-29 16:40 CDT
>>>>      Interesting ports on 192.168.102.140:
>>>>      Not shown: 994 closed ports
>>>>      PORT     STATE SERVICE
>>>>      22/tcp   open  ssh
>>>>      80/tcp   open  http
>>>>      81/tcp   open  hosts2-ns
>>>>      111/tcp  open  rpcbind
>>>>      443/tcp  open  https
>>>>      7777/tcp open  unknown
>>>>      MAC Address: 00:50:56:A3:14:62 (VMWare)
>>>>
>>>>      Nmap done: 1 IP address (1 host up) scanned in 1.31 seconds
>>>>
>>>>
>>>> cluster.conf:
>>>>      cluster:
>>>>              node_count = 6
>>>>              name = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.140
>>>>              number = 1
>>>>              name = web1
>>>>              cluster = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.141
>>>>              number = 2
>>>>              name = web2
>>>>              cluster = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.142
>>>>              number = 3
>>>>              name = web3
>>>>              cluster = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.111
>>>>              number = 4
>>>>              name = rgapp1
>>>>              cluster = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.122
>>>>              number = 5
>>>>              name = deploy
>>>>              cluster = appshare
>>>>
>>>>      node:
>>>>              ip_port = 7777
>>>>              ip_address = 192.168.102.112
>>>>              number = 6
>>>>              name = app1
>>>>              cluster = appshare
>>>>
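To cross-check what each machine believes the cluster looks like (node numbers, names, addresses and ports), the node stanzas can be pulled out of cluster.conf with awk. A minimal sketch assuming the stanza layout shown above: a `node:` or `cluster:` header at the start of a line, followed by indented `key = value` lines. `list_nodes` is a hypothetical name; `/etc/ocfs2/cluster.conf` is the usual location of the file.

```bash
#!/bin/bash
# Hypothetical helper: print "number name address:port" for every node
# stanza in an OCFS2 cluster.conf (default /etc/ocfs2/cluster.conf).
list_nodes() {
    awk '
        # Emit the stanza collected so far if it was a node stanza.
        function flush() {
            if (in_node) print num, name, addr ":" port
            in_node = 0; num = name = addr = port = ""
        }
        # A line starting in column 0 opens a new stanza ("node:" or "cluster:").
        /^[^ \t#]/ { flush(); in_node = ($1 == "node:"); next }
        # Indented "key = value" lines inside a node stanza.
        in_node && $2 == "=" {
            if ($1 == "number") num = $3
            else if ($1 == "name") name = $3
            else if ($1 == "ip_address") addr = $3
            else if ($1 == "ip_port") port = $3
        }
        END { flush() }
    ' "${1:-/etc/ocfs2/cluster.conf}"
}
```

Running `list_nodes` on every node and comparing the outputs (for example with `diff`) is a quick way to confirm that all six machines agree, since O2CB expects cluster.conf to be identical on every node.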
>>>> dmesg on web1:
>>>>      OCFS2 1.5.0
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 4 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
>>>>      (1262,0):dlm_request_join:1035 ERROR: status = -107
>>>>      (1262,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>      (1262,0):dlm_join_domain:1487 ERROR: status = -107
>>>>      (1262,0):dlm_register_domain:1753 ERROR: status = -107
>>>>      (1262,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>      (1262,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>      (1262,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>      ocfs2: Unmounting device (253,1) on (node 0)
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
>>>>      (1199,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
>>>>      (1323,0):dlm_request_join:1035 ERROR: status = -107
>>>>      (1323,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>      (1323,0):dlm_join_domain:1487 ERROR: status = -107
>>>>      (1323,0):dlm_register_domain:1753 ERROR: status = -107
>>>>      (1323,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>      (1323,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>      (1323,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>      ocfs2: Unmounting device (253,1) on (node 0)
>>>>      VMCI: Major device number is: 249
>>>>      VMware memory control driver initialized
>>>>      vmmemctl: started kernel thread pid=1522
>>>>      ocfs2: Unregistered cluster interface o2cb
>>>>      OCFS2 Node Manager 1.5.0
>>>>      OCFS2 DLM 1.5.0
>>>>      ocfs2: Registered cluster interface o2cb
>>>>      OCFS2 DLMFS 1.5.0
>>>>      OCFS2 User DLM kernel interface loaded
>>>>      OCFS2 1.5.0
>>>>      (1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 4 after 30.0 seconds, giving up and returning errors.
>>>>      (1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 5 after 30.0 seconds, giving up and returning errors.
>>>>      (1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 6 after 30.0 seconds, giving up and returning errors.
>>>>      (1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 2 after 30.0 seconds, giving up and returning errors.
>>>>      (1810,0):o2net_connect_expired:1656 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
>>>>      (1839,0):dlm_request_join:1035 ERROR: status = -107
>>>>      (1839,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>      (1839,0):dlm_join_domain:1487 ERROR: status = -107
>>>>      (1839,0):dlm_register_domain:1753 ERROR: status = -107
>>>>      (1839,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>      (1839,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>      (1839,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>      ocfs2: Unmounting device (253,1) on (node 0)
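When the kernel log is long, it helps to pull out just the peer node numbers that o2net gave up on. The `status = -107` lines that follow each batch are errno 107, which on Linux is ENOTCONN ("Transport endpoint is not connected"). A minimal sketch; `failed_peers` is a hypothetical name, and it expects one log message per line on standard input, as `dmesg` prints them.

```bash
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print, once each
# and in numeric order, the node numbers named in o2net
# "no connection established with node N" errors.
failed_peers() {
    grep -o 'no connection established with node [0-9]*' \
        | awk '{ print $NF }' \
        | sort -un
}
```

For example, `dmesg | failed_peers` on web1 would list every peer that timed out, which can then be matched against the node numbers in cluster.conf.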
>>>>
>>>>
>>>>
>>>> So clearly the ocfs2 service thinks it cannot connect to the other
>>>> nodes, while nmap sees port 7777 open just fine. And web2 can see the
>>>> port on web1 just fine, so there is no firewall blocking the connections.
>>>>
>>>> I think it might be because Fedora 12 uses OCFS2 1.5.0 for the kernel
>>>> module, while CentOS 5.3/5.4 use 1.4.4-1. Am I correct in thinking this?
>>>>
>>>> David
>>>> -----Original Message-----
>>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
>>>> Sent: Thursday, March 25, 2010 6:46 PM
>>>> To: David Murphy
>>>> Cc: ocfs2-users at oss.oracle.com
>>>> Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2
>>>>
>>>> Hmm... o2cb_ctl makes no connections. It just reads cluster.conf and
>>>> populates configfs, AFAIK.
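Since o2cb_ctl only populates configfs, the result can be inspected directly to confirm that the kernel holds the expected node numbers, addresses and ports. A minimal sketch, assuming configfs is mounted at `/sys/kernel/config` and that each node directory exposes the `num`, `ipv4_address`, `ipv4_port` and `local` attributes of the OCFS2 node manager (verify this on your kernel). `show_configfs_nodes` is a hypothetical name; the optional argument points it at a different configfs root.

```bash
#!/bin/bash
# Hypothetical helper: list the nodes that o2cb_ctl registered in configfs.
# Assumes the layout <root>/cluster/<cluster>/node/<name>/{num,ipv4_address,ipv4_port,local}.
show_configfs_nodes() {
    local root=${1:-/sys/kernel/config}
    local dir
    # The trailing slash makes the glob match directories only.
    for dir in "$root"/cluster/*/node/*/; do
        # Skip the literal pattern when nothing matched.
        [ -d "$dir" ] || continue
        printf '%s num=%s addr=%s:%s local=%s\n' \
            "$(basename "$dir")" \
            "$(cat "$dir/num")" "$(cat "$dir/ipv4_address")" \
            "$(cat "$dir/ipv4_port")" "$(cat "$dir/local")"
    done
}
```

On web1, one would expect every node from cluster.conf to appear, with `local=1` only on web1's own entry.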
>>>>
>>>> David Murphy wrote:
>>>>
>>>>
>>>>> We had 6 nodes running CentOS 5.4 with ocfs2-tools 1.4.3.
>>>>>
>>>>> I decided to rebuild one node with FC12.
>>>>>
>>>>> That is working fine; however, nmap 192.168.200.112 shows port 7777 as
>>>>> open, and o2cb_ctl is timing out when trying to connect to that node,
>>>>> which then causes error 107. This happens with all nodes, and all nodes
>>>>> show 7777 as open via nmap from the FC12 machine.
>>>>>
>>>>> Is there a way to further debug this to see what exactly o2cb_ctl is
>>>>> seeing when trying to connect?
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>


