[Ocfs2-devel] [PATCH] ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled

Changwei Ge ge.changwei at h3c.com
Tue Dec 4 17:49:57 PST 2018


Hi Jian,

I suppose you can't just remove the BUG_ON() check.
If you remove it, below code violates the original logic.

'''
2141                                 dlm_commit_pending_cancel(res, lock);  
'''

What's more important is *locking cancel* must be against a *locking conversion* progress.
So it makes sense to check if this lock is on converting list.

So I have to NACK to this patch.

Thanks,
Changwei


On 2018/12/3 20:23, wangjian wrote:
> In the dlm_move_lockres_to_recovery_list function, if the lock
> is in the granted queue and cancel_pending is set, it will
> encounter a BUG. I think this is a meaningless BUG,
> so be prepared to remove it. A scenario that causes
> this BUG will be given below.
> 
> At the beginning, Node 1 is the master and has NL lock,
> Node 2 has PR lock, Node 3 has PR lock too.
> 
> Node 1          Node 2          Node 3
>              want to get EX lock.
> 
>                              want to get EX lock.
> 
> Node 3 hinder
> Node 2 to get
> EX lock, send
> Node 3 a BAST.
> 
>                              receive BAST from
>                              Node 1. downconvert
>                              thread begin to
>                              cancel PR to EX conversion.
>                              In dlmunlock_common function,
>                              downconvert thread has set
>                              lock->cancel_pending,
>                              but did not enter
>                              dlm_send_remote_unlock_request
>                              function.
> 
>              Node2 dies because
>              the host is powered down.
> 
> In recovery process,
> clean the lock that
> related to Node2.
> then finish Node 3
> PR to EX request.
> give Node 3 a AST.
> 
>                              receive AST from Node 1.
>                              change lock level to EX,
>                              move lock to granted list.
> 
> Node1 dies because
> the host is powered down.
> 
>                              In dlm_move_lockres_to_recovery_list
>                              function. the lock is in the
>                              granted queue and cancel_pending
>                              is set. BUG_ON.
> 
> But after clearing this BUG, process will encounter
> the second BUG in the ocfs2_unlock_ast function.
> Here is a scenario that will cause the second BUG
> in ocfs2_unlock_ast as follows:
> 
> At the beginning, Node 1 is the master and has NL lock,
> Node 2 has PR lock, Node 3 has PR lock too.
> 
> Node 1          Node 2          Node 3
>              want to get EX lock.
> 
>                              want to get EX lock.
> 
> Node 3 hinder
> Node 2 to get
> EX lock, send
> Node 3 a BAST.
> 
>                              receive BAST from
>                              Node 1. downconvert
>                              thread begin to
>                              cancel PR to EX conversion.
>                              In dlmunlock_common function,
>                              downconvert thread has released
>                              lock->spinlock and res->spinlock,
>                              but did not enter
>                              dlm_send_remote_unlock_request
>                              function.
> 
>              Node2 dies because
>              the host is powered down.
> 
> In recovery process,
> clean the lock that
> related to Node2.
> then finish Node 3
> PR to EX request.
> give Node 3 a AST.
> 
>                              receive AST from Node 1.
>                              change lock level to EX,
>                              move lock to granted list,
>                              set lockres->l_unlock_action
>                              as OCFS2_UNLOCK_INVALID
>                              in ocfs2_locking_ast function.
> 
> Node2 dies because
> the host is powered down.
> 
>                              Node 3 realize that Node 1
>                              is dead, remove Node 1 from
>                              domain_map. downconvert thread
>                              get DLM_NORMAL from
>                              dlm_send_remote_unlock_request
>                              function and set *call_ast as 1.
>                              Then downconvert thread meet
>                              BUG in ocfs2_unlock_ast function.
> 
> To avoid meet the second BUG, function dlmunlock_common shuold
> return DLM_CANCELGRANT if the lock is on granted list and
> the operation is canceled.
> 
> Signed-off-by: Jian Wang<wangjian161 at huawei.com>
> Reviewed-by: Yiwen Jiang<jiangyiwen at huawei.com>
> ---
>   fs/ocfs2/dlm/dlmrecovery.c | 1 -
>   fs/ocfs2/dlm/dlmunlock.c   | 5 +++++
>   2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 802636d..7489652 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2134,7 +2134,6 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt *dlm,
>   				 * if this had completed successfully
>   				 * before sending this lock state to the
>   				 * new master */
> -				BUG_ON(i != DLM_CONVERTING_LIST);
>   				mlog(0, "node died with cancel pending "
>   				     "on %.*s. move back to granted list.\n",
>   				     res->lockname.len, res->lockname.name);
> diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
> index 63d701c..505bb6c 100644
> --- a/fs/ocfs2/dlm/dlmunlock.c
> +++ b/fs/ocfs2/dlm/dlmunlock.c
> @@ -183,6 +183,11 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>   							flags, owner);
>   		spin_lock(&res->spinlock);
>   		spin_lock(&lock->spinlock);
> +
> +		if ((flags & LKM_CANCEL) &&
> +				dlm_lock_on_list(&res->granted, lock))
> +			status = DLM_CANCELGRANT;
> +
>   		/* if the master told us the lock was already granted,
>   		 * let the ast handle all of these actions */
>   		if (status == DLM_CANCELGRANT) {
> -- 
> 1.8.3.1
> 



More information about the Ocfs2-devel mailing list