[Ocfs2-users] Odd error on FC12 with ocfs2
Sunil Mushran
sunil.mushran at oracle.com
Tue Mar 30 09:45:21 PDT 2010
# debugfs.ocfs2 -l TCP allow /dev/mapper/OCFS2_200Gp1
Tracing is enabled with "allow"; the command you ran used "off", which
disables it. Also, do this on two nodes: the node you are mounting on and
any one other node, say node 2.
David Murphy wrote:
> [root at web1 /dev]# debugfs.ocfs2 -l TCP off /dev/mapper/OCFS2_200Gp1
> [root at web1 /dev]# mount /dev/mapper/OCFS2_200Gp1 -v
> device=/dev/mapper/OCFS2_200Gp1
> mount.ocfs2: Transport endpoint is not connected while mounting
> /dev/mapper/OCFS2_200Gp1 on /mnt/appshare. Check 'dmesg' for more
> information on this error.
> [root at web1 /dev]#dmesg
>
> DMESG:
> Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
> ERROR: no connection established with node 2 after 30.0 seconds, giving up
> and returning errors.
> Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
> ERROR: no connection established with node 3 after 30.0 seconds, giving up
> and returning errors.
> Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
> ERROR: no connection established with node 4 after 30.0 seconds, giving up
> and returning errors.
> Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
> ERROR: no connection established with node 5 after 30.0 seconds, giving up
> and returning errors.
> Mar 30 10:23:38 web1 kernel: (1236,0):o2net_connect_expired:1656
> ERROR: no connection established with node 6 after 30.0 seconds, giving up
> and returning errors.
> Mar 30 10:23:38 web1 kernel: (1740,0):dlm_request_join:1035 ERROR:
> status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):dlm_try_to_join_domain:1209
> ERROR: status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):dlm_join_domain:1487 ERROR:
> status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):dlm_register_domain:1753
> ERROR: status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):o2cb_cluster_connect:313
> ERROR: status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_dlm_init:2963 ERROR:
> status = -107
> Mar 30 10:23:38 web1 kernel: (1740,0):ocfs2_mount_volume:1788 ERROR:
> status = -107
> Mar 30 10:23:38 web1 kernel: ocfs2: Unmounting device (253,1) on
> (node 0)
>
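For anyone reading this in the archive: the "status = -107" lines and the mount.ocfs2 message describe the same failure, since 107 is ENOTCONN ("Transport endpoint is not connected") on Linux. Below is a minimal sketch in Python, not part of ocfs2-tools, that lists the nodes that timed out and decodes the status value. The function names and the regular expressions are my own assumptions, inferred only from the log lines quoted in this thread.

```python
import errno
import os
import re

# Patterns inferred from the dmesg excerpt in this thread only (an assumption,
# not a documented log format).
TIMEOUT_RE = re.compile(r"no connection\s+established with node (\d+)")
STATUS_RE = re.compile(r"status\s*=\s*(-?\d+)")


def summarize(log_text):
    """Return (nodes that timed out, distinct positive status codes)."""
    # Drop mail quote markers and re-join wrapped lines before matching.
    flat = " ".join(line.lstrip("> \t").strip() for line in log_text.splitlines())
    nodes = sorted({int(n) for n in TIMEOUT_RE.findall(flat)})
    codes = sorted({abs(int(s)) for s in STATUS_RE.findall(flat)})
    return nodes, codes


def describe(code):
    """Name and message for an errno value, as known to the local platform."""
    # Kernel status values are Linux errnos; run this on Linux for a faithful answer.
    return f"{code} {errno.errorcode.get(code, 'unknown')}: {os.strerror(code)}"


if __name__ == "__main__":
    sample = """\
(1236,0):o2net_connect_expired:1656
ERROR: no connection established with node 2 after 30.0 seconds, giving up
and returning errors.
(1740,0):dlm_request_join:1035 ERROR:
status = -107
"""
    nodes, codes = summarize(sample)
    print("timed out:", nodes)
    for code in codes:
        print(describe(code))
```

Feeding it the saved output of dmesg from the node that fails to mount should show at a glance whether every peer timed out or only some of them.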
> DEBUGFS:
> debugfs: curdev
> /dev/mapper/OCFS2_200Gp1
> debugfs: controld dump
> controld: Unable to access cluster service while obtaining the debug
> buffer
> debugfs: slotmap
> Slot# Node#
> 0 3
> 1 5
> 2 2
> 4 4
> 5 6
> debugfs: stats
> Revision: 0.90
> Mount Count: 0 Max Mount Count: 20
> State: 0 Errors: 0
> Check Interval: 0 Last Check: Mon Mar 29 10:53:52 2010
> Creator OS: 0
> Feature Compat: 1 backup-super
> Feature Incompat: 16 sparse
> Tunefs Incomplete: 0
> Feature RO compat: 1 unwritten
> Root Blknum: 5 System Dir Blknum: 6
> First Cluster Group Blknum: 3
> Block Size Bits: 12 Cluster Size Bits: 12
> Max Node Slots: 6
> Extended Attributes Inline Size: 0
> Label: OCFS2_APPSHARE_200G
> UUID: D6E0DD0AAC8844ED94A4A459FBB6F7FF
> UUID_hash: 0 (0x0)
> Cluster stack: classic o2cb
> Inode: 2 Mode: 00 Generation: 2428834932 (0x90c51474)
> FS Generation: 2428834932 (0x90c51474)
> CRC32: 00000000 ECC: 0000
> Type: Unknown Attr: 0x0 Flags: Valid System Superblock
> Dynamic Features: (0x0)
> User: 0 (root) Group: 0 (root) Size: 0
> Links: 0 Clusters: 52428119
> ctime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
> atime: 0x0 -- Wed Dec 31 18:00:00 1969
> mtime: 0x4a0b2372 -- Wed May 13 14:45:54 2009
> dtime: 0x0 -- Wed Dec 31 18:00:00 1969
> ctime_nsec: 0x00000000 -- 0
> atime_nsec: 0x00000000 -- 0
> mtime_nsec: 0x00000000 -- 0
> Last Extblk: 0
> Sub Alloc Slot: Global Sub Alloc Bit: 65535
>
>
>
> It doesn't appear that any extra debug logging was actually produced.
>
> David
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Monday, March 29, 2010 10:23 PM
> To: Angelo McComis
> Cc: David Murphy; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2
>
> No
>
> On Mar 29, 2010, at 8:10 PM, Angelo McComis <angelo at mccomis.com> wrote:
>
>
>> Does it matter that the nodes are numbered 1-6 instead of 0-5?
>>
>>
>>
>> On Mon, Mar 29, 2010 at 4:25 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
>>> Enable some debugging.
>>>
>>> #debugfs.ocfs2 -l TCP allow
>>> ...do mount...
>>> #debugfs.ocfs2 -l TCP off
>>>
>>>
>>> David Murphy wrote:
>>>
>>>> [root at web2 ~]# nc -z 192.168.102.140 7777
>>>> Connection to 192.168.102.140 7777 port [tcp/cbt] succeeded!
>>>>
>>>> [root at web1 /etc/sysconfig/network-scripts]# nc -z 192.168.102.141 7777
>>>> Connection to 192.168.102.141 7777 port [tcp/cbt] succeeded!
>>>>
>>>> -----Original Message-----
>>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
>>>> Sent: Monday, March 29, 2010 5:08 PM
>>>> To: David Murphy
>>>> Cc: ocfs2-users at oss.oracle.com
>>>> Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2
>>>>
>>>> What happens when you use netcat to ping the node?
>>>> nc -z host.example.com 7777
>>>>
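Where nc is not installed, the same kind of check can be sketched in Python. This is only a rough equivalent of "nc -z": the helper name port_open and the 3 second default timeout are my own choices. Like nc, it tests reachability from user space and says nothing about whether o2net itself can connect.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # host.example.com is the placeholder used above; substitute a real node.
    print(port_open("host.example.com", 7777))
```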
>>>> David Murphy wrote:
>>>>
>>>>
>>>>> Some additional data:
>>>>> From Web1 ( New Fedora Machine) to Web2:
>>>>> [root at web1 /etc/sysconfig/network-scripts]# nmap 192.168.102.141
>>>>>
>>>>> Starting Nmap 5.21 ( http://nmap.org ) at 2010-03-29 16:56 CDT
>>>>> Nmap scan report for 192.168.102.141
>>>>> Host is up (0.000076s latency).
>>>>> Not shown: 993 closed ports
>>>>> PORT STATE SERVICE
>>>>> 22/tcp open ssh
>>>>> 80/tcp open http
>>>>> 81/tcp open hosts2-ns
>>>>> 111/tcp open rpcbind
>>>>> 5666/tcp open nrpe
>>>>> 7777/tcp open unknown
>>>>> 9102/tcp open jetdirect
>>>>> MAC Address: 00:50:56:A3:58:5D (VMware)
>>>>>
>>>>> Nmap done: 1 IP address (1 host up) scanned in 1.18 seconds
>>>>>
>>>>>
>>>>> From web2 -> web1 (new fedora machine)
>>>>> [root at web2 ~]# nmap 192.168.102.140
>>>>>
>>>>> Starting Nmap 5.00 ( http://nmap.org ) at 2010-03-29 16:40 CDT
>>>>> Interesting ports on 192.168.102.140:
>>>>> Not shown: 994 closed ports
>>>>> PORT STATE SERVICE
>>>>> 22/tcp open ssh
>>>>> 80/tcp open http
>>>>> 81/tcp open hosts2-ns
>>>>> 111/tcp open rpcbind
>>>>> 443/tcp open https
>>>>> 7777/tcp open unknown
>>>>> MAC Address: 00:50:56:A3:14:62 (VMWare)
>>>>>
>>>>> Nmap done: 1 IP address (1 host up) scanned in 1.31 seconds
>>>>>
>>>>>
>>>>> Cluster.conf:
>>>>> cluster:
>>>>> node_count = 6
>>>>> name = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.140
>>>>> number = 1
>>>>> name = web1
>>>>> cluster = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.141
>>>>> number = 2
>>>>> name = web2
>>>>> cluster = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.142
>>>>> number = 3
>>>>> name = web3
>>>>> cluster = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.111
>>>>> number = 4
>>>>> name = rgapp1
>>>>> cluster = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.122
>>>>> number = 5
>>>>> name = deploy
>>>>> cluster = appshare
>>>>>
>>>>> node:
>>>>> ip_port = 7777
>>>>> ip_address = 192.168.102.112
>>>>> number = 6
>>>>> name = app1
>>>>> cluster = appshare
>>>>>
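A cluster.conf of this shape is easy to check mechanically before looking at the network. The sketch below is a minimal Python parser for the stanza layout shown above ("cluster:" or "node:" followed by "key = value" lines). The three checks it runs (node_count matches the number of node stanzas, node numbers are unique, every node names a defined cluster) are sanity checks I am assuming are worth making, not rules taken from the o2cb documentation, and the node_count check assumes a single cluster stanza.

```python
def parse_cluster_conf(text):
    """Parse 'name:' stanzas followed by 'key = value' lines into dicts."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if "=" in line:
            if current is None:
                raise ValueError(f"setting outside a stanza: {line!r}")
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
        elif line.endswith(":"):
            # "stanza" is a bookkeeping key of this sketch, not a conf setting.
            current = {"stanza": line[:-1]}
            stanzas.append(current)
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return stanzas


def check(stanzas):
    """Return a list of human-readable inconsistencies (empty if none found)."""
    clusters = [s for s in stanzas if s["stanza"] == "cluster"]
    nodes = [s for s in stanzas if s["stanza"] == "node"]
    problems = []
    for c in clusters:
        if c.get("node_count") != str(len(nodes)):
            problems.append(
                f"node_count = {c.get('node_count')} but {len(nodes)} node stanzas"
            )
    numbers = [n.get("number") for n in nodes]
    if len(set(numbers)) != len(numbers):
        problems.append("duplicate node numbers")
    names = {c.get("name") for c in clusters}
    for n in nodes:
        if n.get("cluster") not in names:
            problems.append(
                f"node {n.get('name')} refers to unknown cluster {n.get('cluster')}"
            )
    return problems
```

Running check(parse_cluster_conf(open("/etc/ocfs2/cluster.conf").read())) on each node and comparing the results is one way to confirm that all six machines agree; the path is the usual location but is an assumption here, since the thread does not state it.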
>>>>> DMESG on WEB1:
>>>>> OCFS2 1.5.0
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 2 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 3 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 4 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 5 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 6 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1262,0):dlm_request_join:1035 ERROR: status = -107
>>>>> (1262,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>> (1262,0):dlm_join_domain:1487 ERROR: status = -107
>>>>> (1262,0):dlm_register_domain:1753 ERROR: status = -107
>>>>> (1262,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>> (1262,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>> (1262,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>> ocfs2: Unmounting device (253,1) on (node 0)
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 2 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 3 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 5 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1199,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 6 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1323,0):dlm_request_join:1035 ERROR: status = -107
>>>>> (1323,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>> (1323,0):dlm_join_domain:1487 ERROR: status = -107
>>>>> (1323,0):dlm_register_domain:1753 ERROR: status = -107
>>>>> (1323,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>> (1323,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>> (1323,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>> ocfs2: Unmounting device (253,1) on (node 0)
>>>>> VMCI: Major device number is: 249
>>>>> VMware memory control driver initialized
>>>>> vmmemctl: started kernel thread pid=1522
>>>>> ocfs2: Unregistered cluster interface o2cb
>>>>> OCFS2 Node Manager 1.5.0
>>>>> OCFS2 DLM 1.5.0
>>>>> ocfs2: Registered cluster interface o2cb
>>>>> OCFS2 DLMFS 1.5.0
>>>>> OCFS2 User DLM kernel interface loaded
>>>>> OCFS2 1.5.0
>>>>> (1810,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 4 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1810,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 5 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1810,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 6 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1810,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 2 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1810,0):o2net_connect_expired:1656 ERROR: no connection
>>>>> established with node 3 after 30.0 seconds, giving up and returning
>>>>> errors.
>>>>> (1839,0):dlm_request_join:1035 ERROR: status = -107
>>>>> (1839,0):dlm_try_to_join_domain:1209 ERROR: status = -107
>>>>> (1839,0):dlm_join_domain:1487 ERROR: status = -107
>>>>> (1839,0):dlm_register_domain:1753 ERROR: status = -107
>>>>> (1839,0):o2cb_cluster_connect:313 ERROR: status = -107
>>>>> (1839,0):ocfs2_dlm_init:2963 ERROR: status = -107
>>>>> (1839,0):ocfs2_mount_volume:1788 ERROR: status = -107
>>>>> ocfs2: Unmounting device (253,1) on (node 0)
>>>>>
>>>>>
>>>>>
>>>>> So clearly the ocfs2 service thinks it can't connect to the node,
>>>>> but nmap sees the connection just fine. And web2 can see the port
>>>>> on web1 just fine, so there is no firewall blocking the connections.
>>>>>
>>>>> I think it might be that Fedora 12 uses 1.5.0 for the OCFS2 kernel
>>>>> module and CentOS 5.3/5.4 use 1.4.4-1. Am I correct in thinking this?
>>>>>
>>>>> David
>>>>> -----Original Message-----
>>>>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
>>>>> Sent: Thursday, March 25, 2010 6:46 PM
>>>>> To: David Murphy
>>>>> Cc: ocfs2-users at oss.oracle.com
>>>>> Subject: Re: [Ocfs2-users] Odd error on FC12 with ocfs2
>>>>>
>>>>> hmm.. o2cb_ctl makes no connections. It just reads the cluster.conf
>>>>> and populates configfs. AFAIK.
>>>>>
>>>>> David Murphy wrote:
>>>>>
>>>>>> We had 6 nodes running CentOS 5.4 using 1.4.3 ocfs2-tools.
>>>>>>
>>>>>> I decided to rebuild one node with FC12, which is working fine. However,
>>>>>> nmap 192.168.200.112 shows 7777 as open, and o2cb_ctl is timing out
>>>>>> when trying to connect to that node, which then causes a 107 error.
>>>>>> This happens with all nodes, and all nodes have 7777 open via nmap
>>>>>> from the FC machine.
>>>>>>
>>>>>> Is there a way to further debug this to see what exactly o2cb_ctl is
>>>>>> seeing when trying to connect?
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-users mailing list
>>>>>> Ocfs2-users at oss.oracle.com
>>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>
>
>