<div dir="ltr">If the storage connectivity is not stable, then dlm issues are to be expected.<br>In this case, the processes are all trying to take the readlock. One possible<br>scenario is that the node holding the writelock is not able to relinquish the lock<br>
because it cannot flush the updated inodes to disk. I would suggest you look<br>into load balancing and how it affects the iscsi connectivity from the hosts.<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">
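
As a rough starting point (assuming the hosts run open-iscsi; adjust for
your initiator), watching the session and connection state on each node
would show whether the iSCSI sessions are flapping:

# iscsiadm -m session                  # one line per session
# iscsiadm -m session -P 3 | grep -iE "target|state"

If the connection state cycles whenever the array rebalances, that would
line up with the conn error (1011)/(1020) messages in your logs.
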
On Tue, Aug 6, 2013 at 2:51 PM, Gavin Jones <gjones@where2getit.com> wrote:
Hello Goldwyn,

Thanks for taking a look at this. So, then, it does seem to be DLM
related. We were running fine for a few weeks and then it came up
again this morning and has been going on throughout the day.

Regarding the DLM debugging, I allowed debugging for DLM_GLUE,
DLM_THREAD, DLM_MASTER and DLM_RECOVERY. However, I don't see any DLM
logging output in dmesg or syslog --is there perhaps another way to
get at the actual DLM log? I've searched around a bit but didn't find
anything that made it clear.

As for OCFS2 and iSCSI communications, they use the same physical
network interface but different VLANs on that interface. The
"connectionX:0" errors, then, seem to indicate an issue with the iSCSI
connection. The system logs and monitoring software don't show any
warnings or errors about the interface going down, so the only thing I
can think of is the connection load balancing on the SAN, though
that's merely a hunch. Maybe I should mail the list and see if anyone
has a similar setup.

If you could please point me in the right direction to make use of the
DLM debugging via debugfs.ocfs2, I would appreciate it.

Thanks again,

Gavin W. Jones
Where 2 Get It, Inc.

On Tue, Aug 6, 2013 at 4:16 PM, Goldwyn Rodrigues <rgoldwyn@suse.de> wrote:
> Hi Gavin,
>
>
> On 08/06/2013 01:59 PM, Gavin Jones wrote:
>>
>> Hi Goldwyn,
>>
>> Apologies for the delayed reply.
>>
>> The hung Apache process / OCFS issue cropped up again, so I thought
>> I'd pass along the contents of /proc/<pid>/stack of a few affected
>> processes:
>>
>> gjones@slipapp02:~> sudo cat /proc/27521/stack
>> gjones's password:
>> [<ffffffff811663b4>] poll_schedule_timeout+0x44/0x60
>> [<ffffffff81166d56>] do_select+0x5a6/0x670
>> [<ffffffff81166fbe>] core_sys_select+0x19e/0x2d0
>> [<ffffffff811671a5>] sys_select+0xb5/0x110
>> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [<00007f394bdd5f23>] 0x7f394bdd5f23
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> gjones@slipapp02:~> sudo cat /proc/27530/stack
>> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
>> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [<00007f394bdddb77>] 0x7f394bdddb77
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> gjones@slipapp02:~> sudo cat /proc/27462/stack
>> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
>> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [<00007f394bdddb77>] 0x7f394bdddb77
>> [<ffffffffffffffff>] 0xffffffffffffffff
>> gjones@slipapp02:~> sudo cat /proc/27526/stack
>> [<ffffffff81249721>] sys_semtimedop+0x5a1/0x8b0
>> [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [<00007f394bdddb77>] 0x7f394bdddb77
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>>
>> Additionally, in dmesg I see, for example,
>>
>> [774981.361149] (/usr/sbin/httpd,8266,3):ocfs2_unlink:951 ERROR: status =
>> -2
>> [775896.135467]
>> (/usr/sbin/httpd,8435,3):ocfs2_check_dir_for_entry:2119 ERROR: status
>> = -17
>> [775896.135474] (/usr/sbin/httpd,8435,3):ocfs2_mknod:459 ERROR: status =
>> -17
>> [775896.135477] (/usr/sbin/httpd,8435,3):ocfs2_create:629 ERROR: status =
>> -17
>> [788406.624126] connection1:0: ping timeout of 5 secs expired, recv
>> timeout 5, last rx 4491991450, last ping 4491992701, now 4491993952
>> [788406.624138] connection1:0: detected conn error (1011)
>> [788406.640132] connection2:0: ping timeout of 5 secs expired, recv
>> timeout 5, last rx 4491991451, last ping 4491992702, now 4491993956
>> [788406.640142] connection2:0: detected conn error (1011)
>> [788406.928134] connection4:0: ping timeout of 5 secs expired, recv
>> timeout 5, last rx 4491991524, last ping 4491992775, now 4491994028
>> [788406.928150] connection4:0: detected conn error (1011)
>> [788406.944147] connection5:0: ping timeout of 5 secs expired, recv
>> timeout 5, last rx 4491991528, last ping 4491992779, now 4491994032
>> [788406.944165] connection5:0: detected conn error (1011)
>> [788408.640123] connection3:0: ping timeout of 5 secs expired, recv
>> timeout 5, last rx 4491991954, last ping 4491993205, now 4491994456
>> [788408.640134] connection3:0: detected conn error (1011)
>> [788409.907968] connection1:0: detected conn error (1020)
>> [788409.908280] connection2:0: detected conn error (1020)
>> [788409.912683] connection4:0: detected conn error (1020)
>> [788409.913152] connection5:0: detected conn error (1020)
>> [788411.491818] connection3:0: detected conn error (1020)
>>
>>
>> that repeats for a bit and then I see
>>
>> [1952161.012214] INFO: task /usr/sbin/httpd:27491 blocked for more
>> than 480 seconds.
>> [1952161.012219] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [1952161.012221] /usr/sbin/httpd D ffff88081fc52b40 0 27491 27449
>> 0x00000000
>> [1952161.012226] ffff88031a85dc50 0000000000000082 ffff880532a92640
>> ffff88031a85dfd8
>> [1952161.012231] ffff88031a85dfd8 ffff88031a85dfd8 ffff8807f791c300
>> ffff880532a92640
>> [1952161.012235] ffffffff8115f3ae ffff8802804bdd98 ffff880532a92640
>> 00000000ffffffff
>> [1952161.012239] Call Trace:
>> [1952161.012251] [<ffffffff81538fea>] __mutex_lock_slowpath+0xca/0x140
>> [1952161.012257] [<ffffffff81538b0a>] mutex_lock+0x1a/0x40
>> [1952161.012262] [<ffffffff81160e80>] do_lookup+0x290/0x340
>> [1952161.012269] [<ffffffff81161c7f>] path_lookupat+0x10f/0x700
>> [1952161.012274] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
>> [1952161.012279] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
>> [1952161.012283] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
>> [1952161.012288] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
>> [1952161.012293] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [1952161.012308] [<00007f394bdcfb05>] 0x7f394bdcfb04
>> [1952161.012382] INFO: task /usr/sbin/httpd:27560 blocked for more
>> than 480 seconds.
>> [1952161.012384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [1952161.012385] /usr/sbin/httpd D ffff88081fd52b40 0 27560 27449
>> 0x00000000
>> [1952161.012389] ffff880224023c50 0000000000000086 ffff88024326e580
>> ffff880224023fd8
>> [1952161.012393] ffff880224023fd8 ffff880224023fd8 ffff8807f79cc800
>> ffff88024326e580
>> [1952161.012397] ffffffff8115f3ae ffff8802804bdd98 ffff88024326e580
>> 00000000ffffffff
>> [1952161.012401] Call Trace:
>> [1952161.012406] [<ffffffff81538fea>] __mutex_lock_slowpath+0xca/0x140
>> [1952161.012410] [<ffffffff81538b0a>] mutex_lock+0x1a/0x40
>> [1952161.012415] [<ffffffff81160e80>] do_lookup+0x290/0x340
>> [1952161.012420] [<ffffffff81161c7f>] path_lookupat+0x10f/0x700
>> [1952161.012425] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
>> [1952161.012430] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
>> [1952161.012434] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
>> [1952161.012438] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
>> [1952161.012442] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [1952161.012448] [<00007f394bdcfb05>] 0x7f394bdcfb04
>> [1953601.012280] INFO: task /usr/sbin/httpd:27506 blocked for more than 480 seconds.
>> [1953601.012282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [1953601.012284] /usr/sbin/httpd D ffff88081fd52b40 0 27506 27449
>> 0x00000000
>> [1953601.012287] ffff8804044cf988 0000000000000082 ffff8802e1bfc340
>> ffff8804044cffd8
>> [1953601.012291] ffff8804044cffd8 ffff8804044cffd8 ffff8807f79cc800
>> ffff8802e1bfc340
>> [1953601.012296] 0000000000012b40 ffff8804044cfb58 7fffffffffffffff
>> ffff8802e1bfc340
>> [1953601.012300] Call Trace:
>> [1953601.012305] [<ffffffff815387a2>] schedule_timeout+0x272/0x2f0
>> [1953601.012311] [<ffffffff815396e2>] wait_for_common+0xd2/0x180
>> [1953601.012345] [<ffffffffa03780a2>]
>> __ocfs2_cluster_lock.isra.34+0x1f2/0x6d0 [ocfs2]
>> [1953601.012395] [<ffffffffa0379578>]
>> ocfs2_inode_lock_full_nested+0x168/0x9d0 [ocfs2]
>> [1953601.012436] [<ffffffffa037edd2>] ocfs2_permission+0x32/0xf0 [ocfs2]
>> [1953601.012466] [<ffffffff8115f503>] inode_permission+0x73/0x110
>> [1953601.012472] [<ffffffff811612cd>] link_path_walk+0x22d/0x850
>> [1953601.012477] [<ffffffff81161bc1>] path_lookupat+0x51/0x700
>> [1953601.012482] [<ffffffff8116229c>] do_path_lookup+0x2c/0xc0
>> [1953601.012486] [<ffffffff8116372d>] user_path_at_empty+0x5d/0xb0
>> [1953601.012490] [<ffffffff81158d9d>] vfs_fstatat+0x2d/0x70
>> [1953601.012494] [<ffffffff81158fe2>] sys_newstat+0x12/0x30
>> [1953601.012498] [<ffffffff815429bd>] system_call_fastpath+0x1a/0x1f
>> [1953601.012504] [<00007f394bdcfb05>] 0x7f394bdcfb04
>>
>
>
> This is the most interesting backtrace. The process is definitely waiting
> for a DLM lock while holding a mutex from the fstat() call. All the others
> are waiting for this process to finish.
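>
> If it happens again, a lock dump would tell us which lock resource it is
> stuck on. Roughly -- from memory, so please check the syntax against the
> debugfs.ocfs2 man page, and substitute your ocfs2 device for /dev/sdX:
>
> # debugfs.ocfs2 -R "fs_locks -B" /dev/sdX      # only the busy lockres entries
> # debugfs.ocfs2 -R "dlm_locks <LOCKID>" /dev/sdX
>
> where <LOCKID> is the lock name reported as busy. Running this on every
> node shows which node holds the lock and which ones are queued behind it.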
>
>
>>
>> Your comment about the network got me thinking; nothing else on the hosts
>> (monitoring included) shows the network interface going down. However, I
>> was logged into the (EqualLogic) SAN console (where the OCFS2 volumes
>> reside) yesterday and noticed some iSCSI session messages like:
>>
>> iSCSI session to target ........ was closed. Load balancing request
>> was received on the array.
>>
>> Is it possible / probable that the iSCSI load balancing on the SAN is
>> causing OCFS2 problems? I'm wondering if exceeding the iSCSI ping
>> timeout leads to a connection error, which then causes OCFS2 to be
>> unsure of the cluster state or hit some DLM issue, which then leads to
>> hung processes. I know it seems like I'm shooting in the dark, but I
>> don't have much else to go on.
>
>
> Do you have iSCSI and ocfs2 communication on the same network interface? If
> yes, the "connection4:0" messages are from your iSCSI subsystem, and your
> network does seem to have a problem there (I don't know what, though).
>
>
>>
>> Still working on permission for the DLM debugging, hope to have it
>> turned on and get some messages for you either today or tomorrow.
>>
>> Thanks,
>>
>> Gavin W. Jones
>> Where 2 Get It, Inc.
>>
>> On Thu, Jul 18, 2013 at 5:54 PM, Goldwyn Rodrigues <rgoldwyn@suse.de>
>> wrote:
>>>
>>> On 07/18/2013 11:42 AM, Gavin Jones wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>> Sure, I'd be happy to provide such information next time this occurs.
>>>>
>>>> Can you elaborate, or point me at documentation / procedure regarding
>>>> the DLM debug logs and what would be helpful to see? I have read
>>>> "Troubleshooting OCFS2" [1] and the section "Debugging File System
>>>> Locks" --is this what you're referring to?
>>>
>>>
>>>
>>> No. I was looking for a more proactive approach that we use when debugging.
>>> # debugfs.ocfs2 -l
>>> will provide you with a list of debug messages you can turn on or off.
>>>
>>> In order to turn on DLM_GLUE-specific messages (the layer between ocfs2
>>> and the DLM), issue
>>>
>>> # debugfs.ocfs2 -l DLM_GLUE allow
>>>
>>> Please note, this generates a lot of messages.
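>>>
>>> You can list several bits in one invocation and, if I remember the syntax
>>> correctly (please verify against the debugfs.ocfs2 man page), switch them
>>> back off the same way once you have captured the problem:
>>>
>>> # debugfs.ocfs2 -l DLM_GLUE DLM_THREAD DLM_MASTER DLM_RECOVERY allow
>>> # debugfs.ocfs2 -l DLM_GLUE DLM_THREAD DLM_MASTER DLM_RECOVERY off
>>>
>>> The messages are printed to the kernel log, so look for them in dmesg or
>>> wherever your syslog daemon routes kernel messages.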
>>>
>>>
>>>>
>>>> Not sure if this will provide additional context or just muddy the
>>>> waters, but I thought to provide some syslog messages from an affected
>>>> server the last time this occurred.
>>>>
>>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704093] o2net: Connection
>>>> to node slipapp03 (num 2) at 172.16.40.122:7777 has been idle for
>>>> 30.97 secs, shutting it down.
>>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704146] o2net: No longer
>>>> connected to node slipapp03 (num 2) at 172.16.40.122:7777
>>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704279]
>>>> (kworker/u:1,12787,4):dlm_do_assert_master:1665 ERROR: Error -112 when
>>>> sending message 502 (key 0xdc8be796) to node 2
>>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704295]
>>>> (kworker/u:5,26056,5):dlm_do_master_request:1332 ERROR: link to 2 went
>>>> down!
>>>> Jul 14 15:36:55 slipapp07 kernel: [2173588.704301]
>>>> (kworker/u:5,26056,5):dlm_get_lock_resource:917 ERROR: status = -112
>>>> Jul 14 15:37:25 slipapp07 kernel: [2173618.784153] o2net: No
>>>> connection established with node 2 after 30.0 seconds, giving up.
>>>> <snip>
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.920793]
>>>> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -112 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.920833]
>>>> (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR:
>>>> A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -112
>>>> send CREATE LOCK to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.930562]
>>>> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.944998]
>>>> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.951511]
>>>> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.973848]
>>>> (kworker/u:2,13894,1):dlm_do_assert_master:1665 ERROR: Error -107 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173727.990216]
>>>> (kworker/u:2,13894,7):dlm_do_assert_master:1665 ERROR: Error -107 when
>>>> sending message 502 (key 0xdc8be796) to node 4
>>>> Jul 14 15:39:14 slipapp07 kernel: [2173728.024139]
>>>> (/usr/sbin/httpd,5023,5):dlm_send_remote_lock_request:336 ERROR:
>>>> A08674A831ED4048B5136BD8613B21E0: res N000000000152a8da, Error -107
>>>> send CREATE LOCK to node 4
>>>> <snip, many, many more like the above>
>>>>
>>>> Which I suppose would indicate DLM issues; I have previously tried to
>>>> investigate this (via abovementioned guide) but was unable to make
>>>> real headway.
>>>
>>>
>>>
>>> No, this means your network is the problem, which in turn is affecting
>>> your DLM operations. The problem could be anywhere from hardware to the
>>> network device to possibly a bug in the networking code. You may want to
>>> check if there are other indications that the network interface is down.
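>>>
>>> For instance (a sketch only -- substitute your real interface name for
>>> eth0), the link state and the error/drop counters are worth watching on
>>> each node while the problem is occurring:
>>>
>>> # ethtool eth0 | grep "Link detected"
>>> # ip -s link show eth0        # RX/TX errors, dropped, carrier counters
>>>
>>> The logs on the switch ports serving these hosts are also worth a look.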
>>>
>>>
>>>>
>>>> I apologize for the rather basic questions...
>>>>
>>> No problem. I hope this helps resolve your issues.
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> Gavin W. Jones
>>>> Where 2 Get It, Inc.
>>>>
>>>> [1]:
>>>> http://docs.oracle.com/cd/E37670_01/E37355/html/ol_tshoot_ocfs2.html
>>>>
>>>
>>> <snipped>
>>>
>>> --
>>> Goldwyn
>>
>>
>>
>>
>
>
> --
> Goldwyn

--
"There has grown up in the minds of certain groups in this country the<br>
notion that because a man or corporation has made a profit out of the<br>
public for a number of years, the government and the courts are<br>
charged with the duty of guaranteeing such profit in the future, even<br>
in the face of changing circumstances and contrary to public interest.<br>
This strange doctrine is not supported by statute nor common law."<br>
<br>
~Robert Heinlein<br>
<br>
</div><div class="HOEnZb"><div class="h5">_______________________________________________<br>
Ocfs2-users mailing list<br>
<a href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a><br>
<a href="https://oss.oracle.com/mailman/listinfo/ocfs2-users" target="_blank">https://oss.oracle.com/mailman/listinfo/ocfs2-users</a><br>