[Ocfs2-devel] May be deadlock for wrong locking order, patch request reviewed, thanks

Xue jiufei xuejiufei at huawei.com
Thu Sep 11 23:05:52 PDT 2014


Hi, Zhonghua
On 2014/9/11 19:28, Guozhonghua wrote:
> While testing the ocfs2 cluster, we found that it sometimes hangs.
> 
> I collected some information about the deadlock that causes the hang: the sys dir / lock is held by a node that never releases it, so the whole cluster hangs.
> 
>     root@cvknode-21:~# ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | grep D
>       PID STAT COMMAND WIDE-WCHAN-COLUMN
>      7489 D jbd2/sdh-621 jbd2_journal_commit_transaction
>     16218 D ls iterate_dir
>     16533 D mkdir dlm_wait_for_lock_mastery
>     31195 D+ ls iterate_dir
> 
>  
> 
> So I reviewed the code and found that the locking order may be wrong.
> 
> In the function dlm_master_request_handler, the resource lock (res->spinlock) is taken first, and &dlm->master_lock is taken while it is held.
> 
> But in the function dlm_get_lock_resource, &dlm->master_lock is taken first, and the resource lock is taken while it is held.

The resource lock is not required in dlm_get_lock_resource() because the
lock resource is new at that point.
Commit 8d400b81cc83 added this spinlock as part of a code cleanup; I think
we can remove it.

Thanks
Xue Jiufei

> 
> The two functions take the locks in different orders.
> 
> Suppose there are two tasks. One, in dlm_master_request_handler, holds res->spinlock and waits for dlm->master_lock.
> 
> The other, in dlm_get_lock_resource, holds dlm->master_lock and waits for res->spinlock.
> 
> Neither can make progress, so a deadlock can occur.
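The interleaving described above is the classic ABBA pattern. A minimal user-space sketch follows, assuming pthread mutexes as hypothetical stand-ins for dlm->master_lock and res->spinlock. It is not the o2dlm code, and abba_would_deadlock is an illustrative name. It replays the two tasks step by step in a single thread and uses trylock so that it reports the deadlock instead of hanging:

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical user-space stand-ins for dlm->master_lock and res->spinlock. */
static pthread_mutex_t master_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t res_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Replays the problematic interleaving in one thread.
 * pthread_mutex_trylock() returns EBUSY instead of blocking, so the
 * function can report the cycle rather than hang in it.
 * Returns 1 if both tasks would wait forever, 0 otherwise.
 */
int abba_would_deadlock(void)
{
    int task_a_blocked;
    int task_b_blocked;

    /* Task A (handler-like path) takes the resource lock first. */
    pthread_mutex_lock(&res_lock);
    /* Task B (lookup-like path) takes the master lock first. */
    pthread_mutex_lock(&master_lock);

    /* Task A now wants master_lock, which task B holds. */
    task_a_blocked = (pthread_mutex_trylock(&master_lock) == EBUSY);
    /* Task B now wants res_lock, which task A holds. */
    task_b_blocked = (pthread_mutex_trylock(&res_lock) == EBUSY);

    /* Drop the two locks taken above so the demo can be run again. */
    pthread_mutex_unlock(&master_lock);
    pthread_mutex_unlock(&res_lock);

    return task_a_blocked && task_b_blocked;
}
```

With real blocking locks, each task would sleep at its second acquisition and never reach the unlock, which matches the tasks stuck in D state in the ps output.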
> 
>  
> 
> I changed some code; please review the patch below.
> 
> *** ocfs2-ko-3.16/dlm/dlmmaster.c      2014-09-11 12:45:45.821657634 +0800
> --- ocfs2-ko-3.16_compared/dlm/dlmmaster.c      2014-09-11 18:54:34.970243238 +0800
> *************** way_up_top:
> *** 1506,1512 ****
> --- 1506,1515 ----
>               }
>  
>               // mlog(0, "lockres is in progress...\n");
> +             spin_unlock(&res->spinlock);
> +
>               spin_lock(&dlm->master_lock);
> +             spin_lock(&res->spinlock);
>               found = dlm_find_mle(dlm, &tmpmle, name, namelen);
>               if (!found) {
>                       mlog(ML_ERROR, "no mle found for this lock!\n");
> *************** way_up_top:
> *** 1551,1558 ****
>                       set_bit(request->node_idx, tmpmle->maybe_map);
>               spin_unlock(&tmpmle->spinlock);
>  
> -              spin_unlock(&dlm->master_lock);
>               spin_unlock(&res->spinlock);
> +             spin_unlock(&dlm->master_lock);
>  
>               /* keep the mle attached to heartbeat events */
>               dlm_put_mle(tmpmle);
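The patch above makes the handler take dlm->master_lock before res->spinlock. One way to keep such a rule from regressing is to give each lock a rank and refuse out-of-order acquisitions. The following is a rough user-space sketch of that idea only: the ranks, struct names, and helper functions are all hypothetical and do not exist in o2dlm. In the kernel itself, the lockdep validator reports this kind of inversion at runtime.

```c
#include <pthread.h>

/* Hypothetical ranks: a lock may be taken only if no lock of equal or
 * higher rank is already held by the same task. */
enum lock_rank { RANK_MASTER = 1, RANK_RES = 2 };

struct ranked_lock {
    pthread_mutex_t mutex;
    enum lock_rank rank;
};

/* Per-task bookkeeping: bit r is set while a lock of rank r is held. */
struct task_ctx {
    unsigned held;
};

static struct ranked_lock master_lock = { PTHREAD_MUTEX_INITIALIZER, RANK_MASTER };
static struct ranked_lock res_lock = { PTHREAD_MUTEX_INITIALIZER, RANK_RES };

/* Returns 0 on success, -1 (without locking) on an ordering violation. */
int ranked_acquire(struct task_ctx *t, struct ranked_lock *l)
{
    if ((t->held >> l->rank) != 0)
        return -1;
    pthread_mutex_lock(&l->mutex);
    t->held |= 1u << l->rank;
    return 0;
}

void ranked_release(struct task_ctx *t, struct ranked_lock *l)
{
    t->held &= ~(1u << l->rank);
    pthread_mutex_unlock(&l->mutex);
}

/* Order after the patch: master lock first, then resource lock. */
int take_in_patched_order(void)
{
    struct task_ctx t = { 0 };

    if (ranked_acquire(&t, &master_lock) != 0)
        return -1;
    if (ranked_acquire(&t, &res_lock) != 0) {
        ranked_release(&t, &master_lock);
        return -1;
    }
    ranked_release(&t, &res_lock);
    ranked_release(&t, &master_lock);
    return 0;
}

/* Order before the patch: resource lock first, then master lock. */
int take_in_original_order(void)
{
    struct task_ctx t = { 0 };
    int rc;

    if (ranked_acquire(&t, &res_lock) != 0)
        return -2;
    rc = ranked_acquire(&t, &master_lock); /* rejected: lower rank after higher */
    if (rc == 0)
        ranked_release(&t, &master_lock);
    ranked_release(&t, &res_lock);
    return rc;
}
```

In this sketch take_in_patched_order() returns 0 and take_in_original_order() returns -1, because the second acquisition in the original order asks for a lower-ranked lock while a higher-ranked one is held.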
> 




