[Ocfs2-users] dlm locking bug?

Sérgio Surkamp sergio at gruposinternet.com.br
Fri Sep 2 07:28:03 PDT 2011


Hello,

We hit a problem with our cluster this morning.

Cluster setup:

Servers:
* Two Dell R800 servers running CentOS 5.5 and the Oracle UEK kernel
  2.6.32-100.0.19.el5, with 8 GB of RAM each;
* OCFS2 1.6.4;
* iSCSI connectivity over two bonded Gbit NICs.

Storage:
* Dell EqualLogic 4000VX -- iSCSI

Network:
* Two Dell 1 Gbit trunked switches.

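For context, a two-node o2cb cluster.conf for a setup like this would
look roughly as follows. This is only a sketch: the node names, cluster
name and addresses below are placeholders; just the port (7777) and the
node numbers (0 and 1) match what shows up in the logs further down.

-------------------------------------------------------------------------------
cluster:
        node_count = 2
        name = ocfs2

node:
        ip_port = 7777
        ip_address = ip.ip.ip.1
        number = 0
        name = node0
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = ip.ip.ip.2
        number = 1
        name = node1
        cluster = ocfs2
-------------------------------------------------------------------------------
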
Problem description:

Node #1 hung while accessing the filesystem, and the hung tasks all have
stack traces very similar to one of the following:

-------------------------------------------------------------------------------
INFO: task maildirsize:17252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
maildirsize   D 0000000000000004     0 17252  17249 0x00000080
 ffff8800b181d7f8 0000000000000086 0000000000000000 ffff880098de3c40
 ffff88001293a1c0 ffff88022bb7c4c0 ffff88001293a598 ffffffffa036a654
 ffff8800b181d7e8 ffffffff81043d10 ffff88001293a1f8 7fffffffffffffff
Call Trace:
 [<ffffffffa036a654>] ? dlmlock+0x12e2/0x13bb [ocfs2_dlm]
 [<ffffffff81043d10>] ? update_curr+0xc9/0xd2
 [<ffffffff8143798a>] schedule_timeout+0x36/0xe7
 [<ffffffff810425b3>] ? need_resched+0x23/0x2d
 [<ffffffff814377fc>] wait_for_common+0xb7/0x12c
 [<ffffffff8104bc15>] ? default_wake_function+0x0/0x19
 [<ffffffffa03b07fe>] ? lockres_clear_flags+0x15/0x17 [ocfs2]
 [<ffffffff81437914>] wait_for_completion+0x1d/0x1f
 [<ffffffffa03b15ad>] ocfs2_wait_for_mask+0x1a/0x29 [ocfs2]
 [<ffffffffa03b1f59>] __ocfs2_cluster_lock+0x83c/0x861 [ocfs2]
 [<ffffffffa03c6319>] ? ocfs2_inode_cache_io_unlock+0x12/0x14 [ocfs2]
 [<ffffffffa03f48f5>] ? ocfs2_metadata_cache_io_unlock+0x1e/0x20 [ocfs2]
 [<ffffffffa03c344c>] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [<ffffffffa03c3242>] ? ocfs2_read_inode_block_full+0x3e/0x5a [ocfs2]
 [<ffffffffa03b38dc>] ocfs2_inode_lock_full_nested+0x194/0xb8d [ocfs2]
 [<ffffffffa03d3f8c>] ? ocfs2_rename+0x49e/0x183d [ocfs2]
 [<ffffffffa03c344c>] ? ocfs2_validate_inode_block+0x0/0x1cd [ocfs2]
 [<ffffffffa03d3f8c>] ocfs2_rename+0x49e/0x183d [ocfs2]
 [<ffffffffa03a6792>] ? brelse+0x13/0x15 [ocfs2]
 [<ffffffffa03b0692>] ? init_completion+0x1f/0x21 [ocfs2]
 [<ffffffffa03b0741>] ? ocfs2_init_mask_waiter+0x26/0x3f [ocfs2]
 [<ffffffffa03b0692>] ? init_completion+0x1f/0x21 [ocfs2]
 [<ffffffffa03b202c>] ? ocfs2_should_refresh_lock_res+0x8f/0x1ad [ocfs2]
 [<ffffffff810425b3>] ? need_resched+0x23/0x2d
 [<ffffffff810e8512>] ? kstrdup+0x2b/0xc0
 [<ffffffff811229eb>] vfs_rename+0x221/0x3c0
 [<ffffffff81124968>] sys_renameat+0x18b/0x201
 [<ffffffff81075a7c>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff811178fc>] ? fsnotify_modify+0x6c/0x74
 [<ffffffff8112186b>] ? path_put+0x22/0x27
 [<ffffffff811249f9>] sys_rename+0x1b/0x1d
 [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b

INFO: task imapd:17386 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
imapd         D 000000000000000b     0 17386   4367 0x00000080
 ffff880208709c08 0000000000000086 0000000000000000 0000000000000286
 ffff8801501ac800 ffff88012b008180 ffff8801501acbd8 00000001bbc4c49f
 0000000000000000 0000000000000000 0000000000000000 ffff880127bf8d9c
Call Trace:
 [<ffffffff81437e31>] __mutex_lock_common+0x12f/0x1a1
 [<ffffffff81437ef2>] __mutex_lock_slowpath+0x19/0x1b
 [<ffffffff81437f5b>] mutex_lock+0x23/0x3a
 [<ffffffff81121aaa>] do_lookup+0x85/0x162
 [<ffffffff8112445c>] __link_path_walk+0x49e/0x5fb
 [<ffffffff81238204>] ? __strncpy_from_user+0x31/0x4a
 [<ffffffff8112460c>] path_walk+0x53/0x9c
 [<ffffffff81124727>] do_path_lookup+0x2f/0x7a
 [<ffffffff8112514f>] user_path_at+0x57/0x91
 [<ffffffff810f0001>] ? handle_mm_fault+0x14b/0x7d9
 [<ffffffff8111c366>] vfs_fstatat+0x3a/0x67
 [<ffffffffa03b0567>] ? ocfs2_inode_unlock+0x140/0x1a5 [ocfs2]
 [<ffffffff8111c479>] vfs_stat+0x1b/0x1d
 [<ffffffff8111c49a>] sys_newstat+0x1f/0x39
 [<ffffffff8143b1e3>] ? do_page_fault+0x25d/0x26c
 [<ffffffff810a7d03>] ? audit_syscall_entry+0x103/0x12f
 [<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
-------------------------------------------------------------------------------

When we rebooted node #1, the following recovery messages were logged
by node #0:

-------------------------------------------------------------------------------
o2net: connection to node XXXXXXXXXX (num 1) at ip.ip.ip.2:7777 has
been idle for 60.0 seconds, shutting it down.
(swapper,0,0):o2net_idle_timer:1498 here are some times that might help
debug the situation: (tmr 1314962116.650772 now 1314962176.650058 dr
1314962116.650749 adv 1314962116.650781:1314962116.650782 func
(3f8ab666:504) 1314962114.651682:1314962114.651687)
o2net: no longer connected to node XXXXXXXXXX (num 1) at
ip.ip.ip.2:7777
(dlm_thread,4143,2):dlm_send_proxy_ast_msg:456 ERROR: Error -112 when
sending message 505 (key 0x3f8ab666) to node 1
(dlm_thread,4143,2):dlm_flush_asts:599 ERROR: status = -112
(o2net,4052,0):o2net_connect_expired:1659 ERROR: no connection
established with node 1 after 60.0 seconds, giving up and returning
errors.
(o2hb-E4FCE3DA14,4121,1):o2dlm_eviction_cb:267 o2dlm has evicted node 1
from group E4FCE3DA14E94515B327E8558F4641DA
(ocfs2rec,13693,10):ocfs2_replay_journal:1605 Recovering node 1 from
slot 1 on device (8,16)
(dlm_reco_thread,4144,0):dlm_get_lock_resource:836
E4FCE3DA14E94515B327E8558F4641DA:$RECOVERY: at least one node (1) to
recover before lock mastery can begin
(dlm_reco_thread,4144,0):dlm_get_lock_resource:870
E4FCE3DA14E94515B327E8558F4641DA: recovery map is not empty, but must
master $RECOVERY lock now
(dlm_reco_thread,4144,0):dlm_do_recovery:523 (4144) Node 0 is the
Recovery Master for the Dead Node 1 for Domain
E4FCE3DA14E94515B327E8558F4641DA
(ocfs2rec,13693,7):ocfs2_begin_quota_recovery:407 Beginning quota
recovery in slot 1
(ocfs2_wq,4124,1):ocfs2_finish_quota_recovery:598 Finishing quota
recovery in slot 1
-------------------------------------------------------------------------------

The processes hung on node #1 had been waiting on I/O for more than an
hour, so it seems to me that node #0 never replied to the dlmlock call
shown in the first stack trace from node #1, yet the reply was still
queued for sending, as shown in node #0's dmesg when node #1 rebooted
(the dlm_send_proxy_ast_msg failure with error -112, i.e. -EHOSTDOWN).
The question is: what could prevent the dlm on node #0 from replying to
the locking request from node #1?

We made copies of all the debug files under /sys/kernel/debug on node #1
before rebooting it and can provide them if needed (9.3 MiB of
compressed data).

Regards,
-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br


