[Ocfs2-devel] [PATCH] ocfs2: wait for recovering done after direct unlock request
piaojun
piaojun at huawei.com
Fri Feb 15 00:36:03 PST 2019
Hi Changwei,
Thanks for your explaination, and I still have a question below.
On 2019/2/15 14:29, Changwei Ge wrote:
> Hi Jun,
>
> On 2019/2/15 14:20, piaojun wrote:
>> Hi Changwei,
>>
>> The DLM process is a little bit complex, so I suggest pasting the code
>> path. And I wonder if my code is right?
>>
>> Thanks,
>> Jun
>>
>> On 2019/2/14 14:14, Changwei Ge wrote:
>>> There is scenario causing ocfs2 umount hang when multiple hosts are
>>> rebooting at the same time.
>>>
>>> NODE1 NODE2 NODE3
>>> send unlock requset to NODE2
>>> dies
>>> become recovery master
>>> recover NODE2
>>> find NODE2 dead
>>> mark resource RECOVERING
>>> directly remove lock from grant list
>> dlm_do_local_recovery_cleanup
>> dlm_move_lockres_to_recovery_list
>> res->state |= DLM_LOCK_RES_RECOVERING;
>> list_add_tail(&res->recovering, &dlm->reco.resources);
>>
>>> calculate usage but RECOVERING marked
>>> **miss the window of purging
>> dlmunlock
>> dlmunlock_remote
>> dlmunlock_common // unlock successfully directly
>>
>> dlm_lockres_calc_usage
>> __dlm_lockres_calc_usage
>> __dlm_lockres_unused
>> if (res->state & (DLM_LOCK_RES_RECOVERING| // won't purge lock as DLM_LOCK_RES_RECOVERING is set
>
> True.
>
>>
>>> clear RECOVERING
>> dlm_finish_local_lockres_recovery
>> res->state &= ~DLM_LOCK_RES_RECOVERING;
>>
>> Could you help explaining where getting stuck?
>
> Sure,
> As dlm missed the window to purge lock resource, it can't be unhashed.
>
> During umount:
> dlm_unregister_domain
> dlm_migrate_all_locks -> there is always a lock resource hashed, so can't return from dlm_migrate_all_locks() thus hang during umount.
In dlm_migrate_all_locks, lockres will be move to purge_list and purged again:
dlm_migrate_all_locks
__dlm_lockres_calc_usage
list_add_tail(&res->purge, &dlm->purge_list);
Do you mean this process does not work?
>
> Thanks,
> Changwei
>
>
>>
>>>
>>> To reproduce this iusse, crash a host and then umount ocfs2
>>> from another node.
>>>
>>> To sovle this, just let unlock progress wait for recovery done.
>>>
>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>> ---
>>> fs/ocfs2/dlm/dlmunlock.c | 23 +++++++++++++++++++----
>>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
>>> index 63d701c..c8e9b70 100644
>>> --- a/fs/ocfs2/dlm/dlmunlock.c
>>> +++ b/fs/ocfs2/dlm/dlmunlock.c
>>> @@ -105,7 +105,8 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>> enum dlm_status status;
>>> int actions = 0;
>>> int in_use;
>>> - u8 owner;
>>> + u8 owner;
>>> + int recovery_wait = 0;
>>>
>>> mlog(0, "master_node = %d, valblk = %d\n", master_node,
>>> flags & LKM_VALBLK);
>>> @@ -208,9 +209,12 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>> }
>>> if (flags & LKM_CANCEL)
>>> lock->cancel_pending = 0;
>>> - else
>>> - lock->unlock_pending = 0;
>>> -
>>> + else {
>>> + if (!lock->unlock_pending)
>>> + recovery_wait = 1;
>>> + else
>>> + lock->unlock_pending = 0;
>>> + }
>>> }
>>>
>>> /* get an extra ref on lock. if we are just switching
>>> @@ -244,6 +248,17 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>> spin_unlock(&res->spinlock);
>>> wake_up(&res->wq);
>>>
>>> + if (recovery_wait) {
>>> + spin_lock(&res->spinlock);
>>> + /* Unlock request will directly succeed after owner dies,
>>> + * and the lock is already removed from grant list. We have to
>>> + * wait for RECOVERING done or we miss the chance to purge it
>>> + * since the removement is much faster than RECOVERING proc.
>>> + */
>>> + __dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_RECOVERING);
>>> + spin_unlock(&res->spinlock);
>>> + }
>>> +
>>> /* let the caller's final dlm_lock_put handle the actual kfree */
>>> if (actions & DLM_UNLOCK_FREE_LOCK) {
>>> /* this should always be coupled with list removal */
>>>
>>
> .
>
More information about the Ocfs2-devel
mailing list