[Ocfs2-devel] Deadlock and cluster blocked, any advice will be appreciated.
Eric Ren
zren at suse.com
Sun May 8 22:39:52 PDT 2016
Hello Zhonghua,
Thanks for reporting this.
On 05/07/2016 07:30 PM, Guozhonghua wrote:
> Hi, we have found a deadlock scenario.
>
> Node 2 was suddenly rebooted (fenced) because of an I/O error while accessing storage, so its slot 2 remained marked valid on the disk.
> Node 1, which is in the same cluster as node 2, then mounted the same disk. At the same time, node 2 restarted and mounted the same disk again.
It would be great if we had some specific steps to reproduce the deadlock.
Do we?
Eric
>
> So the workflow is as below (the two nodes' steps are shown in the
> order they happen):
>
> Node 1:
>     ocfs2_dlm_init
>     ocfs2_super_lock            (granted; Node 2 has also reached
>                                  ocfs2_dlm_init, and its own
>                                  ocfs2_super_lock is now waiting)
>     ocfs2_find_slot
>     ocfs2_check_volume
>       ocfs2_mark_dead_nodes
>         ocfs2_slot_to_node_num_locked
>             finds that slot 2 is still valid and puts it into the
>             recovery map
>         ocfs2_trylock_journal
>             this time the trylock on journal:0002 succeeds, because
>             Node 2 is still waiting on the super lock and has not yet
>             taken its journal lock
>     ocfs2_recovery_thread
>         starts recovery for node 2
>     ocfs2_super_unlock
>
> Node 2:
>     ocfs2_super_lock            (now granted)
>     ocfs2_find_slot
>         takes the journal:0002 lock for slot 2
>     ocfs2_super_unlock
>
> Node 1:
>     __ocfs2_recovery_thread
>       ocfs2_super_lock          (granted)
>       ocfs2_recover_node
>           needs journal:0002 to recover node 2, but Node 2 now holds
>           it and will never release it, so Node 1 waits forever
>
> Node 2:
>     ocfs2_super_lock            (taken again later in the mount path)
>         waits for Node 1 to release the super lock
>
> So a deadlock occurs: Node 1 holds the super lock while waiting for
> journal:0002, and Node 2 holds journal:0002 while waiting for the
> super lock.
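>
> To make the inversion easier to see, this reduces to a classic AB-BA
> deadlock between the super lock and the slot-2 journal lock. Below is
> a minimal user-space sketch using pthread mutexes; super_lock and
> journal_lock are stand-ins for the two DLM resources (they are not
> ocfs2 APIs), and the sleep() calls just force the interleaving
> described above:
>
> /* Compile: gcc -pthread deadlock_sketch.c -o deadlock_sketch */
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
>
> static pthread_mutex_t super_lock   = PTHREAD_MUTEX_INITIALIZER;
> static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;
>
> /* Node 1's recovery thread: holds the super lock, then needs the
>  * slot-2 journal lock (journal:0002) to recover node 2. */
> static void *node1_recovery(void *arg)
> {
>     pthread_mutex_lock(&super_lock);    /* __ocfs2_recovery_thread */
>     sleep(1);                           /* node 2 grabs its journal now */
>     printf("node1: holds super lock, waiting for journal:0002\n");
>     pthread_mutex_lock(&journal_lock);  /* ocfs2_recover_node: blocks */
>     pthread_mutex_unlock(&journal_lock);
>     pthread_mutex_unlock(&super_lock);
>     return NULL;
> }
>
> /* Node 2's mount: holds its own journal lock after ocfs2_find_slot,
>  * then needs the super lock again later in the mount path. */
> static void *node2_mount(void *arg)
> {
>     pthread_mutex_lock(&journal_lock);  /* ocfs2_find_slot */
>     sleep(1);
>     printf("node2: holds journal:0002, waiting for super lock\n");
>     pthread_mutex_lock(&super_lock);    /* blocks: AB-BA cycle closed */
>     pthread_mutex_unlock(&super_lock);
>     pthread_mutex_unlock(&journal_lock);
>     return NULL;
> }
>
> int main(void)
> {
>     pthread_t t1, t2;
>     pthread_create(&t1, NULL, node1_recovery, NULL);
>     pthread_create(&t2, NULL, node2_mount, NULL);
>     pthread_join(t1, NULL);             /* never returns: deadlock */
>     pthread_join(t2, NULL);
>     return 0;
> }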
>
> Stacks and lock resource info:
> /dev/dm-1: LABEL="o20160426150630" UUID="83269946-3428-4a04-8d78-1d76053b3f28" TYPE="ocfs2"
>
> find deadlock on /dev/dm-1
> Lockres: M000000000000000000026a863e451d Mode: No Lock
> Flags: Initialized Attached Busy
> RO Holders: 0 EX Holders: 0
> Pending Action: Convert Pending Unlock Action: None
> Requested Mode: Exclusive Blocking Mode: No Lock
> PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
> EX > Gets: 1 Fails: 0 Waits Total: 772us Max: 772us Avg: 772470ns
> Disk Refreshes: 1
>
> The inode of lock M000000000000000000026a863e451d is 000000000000026a (decimal 618); the file is:
> 618 //journal:0002
> Lock M000000000000000000026a863e451d on the local node is:
> Lockres: M000000000000000000026a863e451d Owner: 1 State: 0x0
> Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
> Refs: 4 Locks: 2 On Lists: None
> Reference Map: 2
> Lock-Queue   Node  Level  Conv  Cookie   Refs  AST  BAST  Pending-Action
> Granted      2     EX     -1    2:18553  2     No   No    None
> Converting   1     NL     EX    1:15786  2     No   No    None
>
> The local host is the Owner of M000000000000000000026a863e451d
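>
> Putting the two dumps together: node 2 holds journal:0002 EX while
> node 1 sits on the Converting queue (NL -> EX), and from the workflow
> above node 2 is itself blocked on the super lock that node 1 holds.
> A toy wait-for-graph check over exactly that state makes the cycle
> explicit (purely illustrative; this is not any dlm/ocfs2 tooling):
>
> #include <stdio.h>
>
> enum { NNODES = 3 };                    /* index 1 = node 1, 2 = node 2 */
> static int waits_for[NNODES][NNODES];   /* [a][b]: a blocked on b */
>
> /* Depth-first search: can we reach 'to' by following wait edges? */
> static int reaches(int from, int to, int *seen)
> {
>     if (from == to)
>         return 1;
>     seen[from] = 1;
>     for (int n = 1; n < NNODES; n++)
>         if (waits_for[from][n] && !seen[n] && reaches(n, to, seen))
>             return 1;
>     return 0;
> }
>
> int main(void)
> {
>     waits_for[1][2] = 1;  /* Converting 1 NL->EX behind Granted 2 EX */
>     waits_for[2][1] = 1;  /* node 2 blocked in ocfs2_super_lock */
>
>     for (int a = 1; a < NNODES; a++)
>         for (int b = 1; b < NNODES; b++) {
>             int seen[NNODES] = {0};
>             /* an edge whose target can reach back to 'a' closes a cycle */
>             if (waits_for[a][b] && reaches(b, a, seen))
>                 printf("deadlock cycle through node %d\n", a);
>         }
>     return 0;
> }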
>
> Node 1
> ========== find hung-up processes ==========
> 16398 D kworker/u128:0 ocfs2_wait_for_recovery
> 35883 D ocfs2rec-832699 ocfs2_cluster_lock.isra.37
> 36601 D df ocfs2_wait_for_recovery
> 54451 D kworker/u128:2 chbk_store_chk_proc
> 62872 D kworker/u128:3 ocfs2_wait_for_recovery
>
> ========== get stack of 16398 ==========
>
> [<ffffffffc06367a5>] ocfs2_wait_for_recovery+0x75/0xc0 [ocfs2]
> [<ffffffffc0621d68>] ocfs2_inode_lock_full_nested+0x318/0xc50 [ocfs2]
> [<ffffffffc063b210>] ocfs2_complete_local_alloc_recovery+0x70/0x3f0 [ocfs2]
> [<ffffffffc063698e>] ocfs2_complete_recovery+0x19e/0xfa0 [ocfs2]
> [<ffffffff81096e64>] process_one_work+0x144/0x4c0
> [<ffffffff810978fd>] worker_thread+0x11d/0x540
> [<ffffffff8109def9>] kthread+0xc9/0xe0
> [<ffffffff817f6a22>] ret_from_fork+0x42/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> ========== get stack of 35883 ==========
>
> [<ffffffffc0620260>] __ocfs2_cluster_lock.isra.37+0x2b0/0x9f0 [ocfs2]
> [<ffffffffc0621c4d>] ocfs2_inode_lock_full_nested+0x1fd/0xc50 [ocfs2]
> [<ffffffffc0638b72>] __ocfs2_recovery_thread+0x6f2/0x14d0 [ocfs2]
> [<ffffffff8109def9>] kthread+0xc9/0xe0
> [<ffffffff817f6a22>] ret_from_fork+0x42/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> ========== get stack of 36601 ==========
> df -BM -TP
> [<ffffffffc06367a5>] ocfs2_wait_for_recovery+0x75/0xc0 [ocfs2]
> [<ffffffffc0621d68>] ocfs2_inode_lock_full_nested+0x318/0xc50 [ocfs2]
> [<ffffffffc066a1e1>] ocfs2_statfs+0x81/0x400 [ocfs2]
> [<ffffffff81235969>] statfs_by_dentry+0x99/0x140
> [<ffffffff81235a2b>] vfs_statfs+0x1b/0xa0
> [<ffffffff81235af5>] user_statfs+0x45/0x80
> [<ffffffff81235bab>] SYSC_statfs+0x1b/0x40
> [<ffffffff81235cee>] SyS_statfs+0xe/0x10
> [<ffffffff817f65f2>] system_call_fastpath+0x16/0x75
> [<ffffffffffffffff>] 0xffffffffffffffff
>
>
> Thanks
>
> Guozhonghua