[Ocfs2-users] OCFS2 tuning, fragmentation and localalloc option. Cluster hanging during mix read+write workloads

Goldwyn Rodrigues rgoldwyn at suse.de
Thu Jul 18 15:54:23 PDT 2013


On 07/18/2013 11:42 AM, Gavin Jones wrote:
> Hello,
>
> Sure, I'd be happy to provide such information next time this occurs.
>
> Can you elaborate, or point me at documentation / procedure regarding
> the DLM debug logs and what would be helpful to see?  I have read
> "Troubleshooting OCFS2" [1] and the section "Debugging File System
> Locks" --is this what you're referring to?

No. I was thinking of a more proactive approach that we use when debugging.
# debugfs.ocfs2 -l
lists the debug message categories you can turn on or off.

To turn on messages specific to DLM_GLUE (the layer between ocfs2 and
the DLM), issue:

# debugfs.ocfs2 -l DLM_GLUE allow

Please note, this generates a lot of messages.
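
Because of the volume, it can help to keep tracing on only while the
problem is reproducing. A minimal sketch, assuming the "off" mode that
debugfs.ocfs2 -l accepts alongside "allow" and "deny"; the function name
trace_dlm_glue and the DEBUGFS_OCFS2 override are illustrative and not
part of ocfs2-tools:

```shell
#!/bin/sh
# Sketch: enable DLM_GLUE tracing for a bounded window, then turn it off.
# DEBUGFS_OCFS2 is a hypothetical override so the command can be swapped
# out (for example with a stub when testing); it defaults to the real tool.
DEBUGFS_OCFS2=${DEBUGFS_OCFS2:-debugfs.ocfs2}

# trace_dlm_glue [seconds]
# Turns the DLM_GLUE trace bit on, waits, then turns it back off.
trace_dlm_glue() {
    seconds=${1:-60}
    "$DEBUGFS_OCFS2" -l DLM_GLUE allow || return 1
    sleep "$seconds"
    "$DEBUGFS_OCFS2" -l DLM_GLUE off
}
```

Run as root, for example "trace_dlm_glue 120" while the hang is in
progress, and then collect the kernel log from that window.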

>
> Not sure if this will provide additional context or just muddy the
> waters, but I thought to provide some syslog messages from an affected
> server the last time this occurred.
>
> Jul 14 15:36:55 slipapp07 kernel: [2173588.704093] o2net: Connection
> to node slipapp03 (num 2) at 172.16.40.122:7777 has been idle for
> 30.97 secs, shutting it down.
> Jul 14 15:36:55 slipapp07 kernel: [2173588.704146] o2net: No longer
> connected to node slipapp03 (num 2) at 172.16.40.122:7777
> Jul 14 15:36:55 slipapp07 kernel: [2173588.704279]
> (kworker/u:1,12787,4):dlm_do_assert_master:1665 ERROR: Error -112 when
> sending message 502 (key 0xdc8be796) to node 2
> Jul 14 15:36:55 slipapp07 kernel: [2173588.704295]
> (kworker/u:5,26056,5):dlm_do_master_request:1332 ERROR: link to 2 went
> down!
> Jul 14 15:36:55 slipapp07 kernel: [2173588.704301]
> (kworker/u:5,26056,5):dlm_get_lock_resource:917 ERROR: status = -112
> Jul 14 15:37:25 slipapp07 kernel: [2173618.784153] o2net: No
> connection established with node 2 after 30.0 seconds, giving up.
> <snip>
> Jul 14 15:39:14 slipapp07 kernel: [2173727.920793]
> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -112 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.920833]
> (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR:
> A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -112
> send CREATE LOCK to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.930562]
> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.944998]
> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.951511]
> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.973848]
> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173727.990216]
> (kworker/u:2,13894,7):dlm_do_assert_master:1665 ERROR: Error -107 when
> sending message 502 (key 0xdc8be796) to node 4
> Jul 14 15:39:14 slipapp07 kernel: [2173728.024139]
> (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR:
> A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -107
> send CREATE LOCK to node 4
> <snip, many, many more like the above>
>
> Which I suppose would indicate DLM issues; I have previously tried to
> investigate this (via abovementioned guide) but was unable to make
> real headway.

No, this means your network is the problem, which in turn is affecting
your DLM operations. The o2net lines show the connection to node 2
sitting idle for about 30 seconds and then being shut down; the DLM
errors follow from that. The cause could be anywhere from hardware to
the network device to possibly a bug in the networking code. You may
want to check for other indications that the network interface is down.
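
To see how the errors are distributed across nodes, something like the
following could be used. It is only a sketch: the function names are
made up for illustration, it expects each kernel message on a single
line (as in syslog, not wrapped as in the mail above), and the errno
names are the Linux values (107 is ENOTCONN, 112 is EHOSTDOWN), both of
which point at the connection rather than at the DLM itself.

```shell
#!/bin/sh
# Sketch: summarize DLM send errors from kernel log lines read on stdin.
# errno_name and summarize_dlm_errors are hypothetical helper names.

# Map the two errno values seen in the log to their Linux names.
errno_name() {
    case "$1" in
        107) echo "ENOTCONN" ;;   # Transport endpoint is not connected
        112) echo "EHOSTDOWN" ;;  # Host is down
        *)   echo "errno-$1" ;;
    esac
}

# Prints one line per (errno, node) pair: "<count> <errno name> node <n>".
# Lines without an "Error -N ... to node N" pattern are ignored.
summarize_dlm_errors() {
    sed -n -E 's/.*Error -([0-9]+) .*to node ([0-9]+).*/\1 \2/p' |
    sort | uniq -c |
    while read -r count code node; do
        echo "$count $(errno_name "$code") node $node"
    done
}
```

For example, "dmesg | summarize_dlm_errors" on an affected node. If all
errors are against one or two nodes, start with the links to those nodes.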

>
> I apologize for the rather basic questions...
>
No problem. I hope this helps resolve your issues.


> Thanks,
>
> Gavin W. Jones
> Where 2 Get It, Inc.
>
> [1]:  http://docs.oracle.com/cd/E37670_01/E37355/html/ol_tshoot_ocfs2.html
>

<snipped>

-- 
Goldwyn


