[Ocfs2-devel] [PATCH] ocfs2: wait for recovering done after direct unlock request
Changwei Ge
ge.changwei at h3c.com
Mon Feb 18 01:46:57 PST 2019
Hi Jun,
Ping...
Do you have any further question?
If any question uncleared, please let me know.
Otherwise, could you please give a feedback?
Thanks,
Changwei
On 2019/2/15 16:48, Changwei Ge wrote:
> Hi Jun,
>
> On 2019/2/15 16:36, piaojun wrote:
>> Hi Changwei,
>>
>> Thanks for your explaination, and I still have a question below.
>
> Thank you for looking into this.
>
>>
>> On 2019/2/15 14:29, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> On 2019/2/15 14:20, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> The DLM process is a little bit complex, so I suggest pasting the code
>>>> path. And I wonder if my code is right?
>>>>
>>>> Thanks,
>>>> Jun
>>>>
>>>> On 2019/2/14 14:14, Changwei Ge wrote:
>>>>> There is scenario causing ocfs2 umount hang when multiple hosts are
>>>>> rebooting at the same time.
>>>>>
>>>>> NODE1 NODE2 NODE3
>>>>> send unlock requset to NODE2
>>>>> dies
>>>>> become recovery master
>>>>> recover NODE2
>>>>> find NODE2 dead
>>>>> mark resource RECOVERING
>>>>> directly remove lock from grant list
>>>> dlm_do_local_recovery_cleanup
>>>> dlm_move_lockres_to_recovery_list
>>>> res->state |= DLM_LOCK_RES_RECOVERING;
>>>> list_add_tail(&res->recovering, &dlm->reco.resources);
>>>>
>>>>> calculate usage but RECOVERING marked
>>>>> **miss the window of purging
>>>> dlmunlock
>>>> dlmunlock_remote
>>>> dlmunlock_common // unlock successfully directly
>>>>
>>>> dlm_lockres_calc_usage
>>>> __dlm_lockres_calc_usage
>>>> __dlm_lockres_unused
>>>> if (res->state & (DLM_LOCK_RES_RECOVERING| // won't purge lock as DLM_LOCK_RES_RECOVERING is set
>>>
>>> True.
>>>
>>>>
>>>>> clear RECOVERING
>>>> dlm_finish_local_lockres_recovery
>>>> res->state &= ~DLM_LOCK_RES_RECOVERING;
>>>>
>>>> Could you help explaining where getting stuck?
>>>
>>> Sure,
>>> As dlm missed the window to purge lock resource, it can't be unhashed.
>>>
>>> During umount:
>>> dlm_unregister_domain
>>> dlm_migrate_all_locks -> there is always a lock resource hashed, so can't return from dlm_migrate_all_locks() thus hang during umount.
>>
>> In dlm_migrate_all_locks, lockres will be move to purge_list and purged again:
>> dlm_migrate_all_locks
>> __dlm_lockres_calc_usage
>> list_add_tail(&res->purge, &dlm->purge_list);
>>
>> Do you mean this process does not work?
>
> Yes, only for Migrating lock resources, it has a chance to call __dlm_lockres_calc_usage()
> But in our situation, the problematic lock resource is obviously no master. So no chance for it to set stack variable *dropped* in dlm_migrate_all_locks()
>
> Thanks,
> Changwei
>
>>
>>>
>>> Thanks,
>>> Changwei
>>>
>>>
>>>>
>>>>>
>>>>> To reproduce this iusse, crash a host and then umount ocfs2
>>>>> from another node.
>>>>>
>>>>> To sovle this, just let unlock progress wait for recovery done.
>>>>>
>>>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>>>> ---
>>>>> fs/ocfs2/dlm/dlmunlock.c | 23 +++++++++++++++++++----
>>>>> 1 file changed, 19 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
>>>>> index 63d701c..c8e9b70 100644
>>>>> --- a/fs/ocfs2/dlm/dlmunlock.c
>>>>> +++ b/fs/ocfs2/dlm/dlmunlock.c
>>>>> @@ -105,7 +105,8 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>>>> enum dlm_status status;
>>>>> int actions = 0;
>>>>> int in_use;
>>>>> - u8 owner;
>>>>> + u8 owner;
>>>>> + int recovery_wait = 0;
>>>>>
>>>>> mlog(0, "master_node = %d, valblk = %d\n", master_node,
>>>>> flags & LKM_VALBLK);
>>>>> @@ -208,9 +209,12 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>>>> }
>>>>> if (flags & LKM_CANCEL)
>>>>> lock->cancel_pending = 0;
>>>>> - else
>>>>> - lock->unlock_pending = 0;
>>>>> -
>>>>> + else {
>>>>> + if (!lock->unlock_pending)
>>>>> + recovery_wait = 1;
>>>>> + else
>>>>> + lock->unlock_pending = 0;
>>>>> + }
>>>>> }
>>>>>
>>>>> /* get an extra ref on lock. if we are just switching
>>>>> @@ -244,6 +248,17 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>>>> spin_unlock(&res->spinlock);
>>>>> wake_up(&res->wq);
>>>>>
>>>>> + if (recovery_wait) {
>>>>> + spin_lock(&res->spinlock);
>>>>> + /* Unlock request will directly succeed after owner dies,
>>>>> + * and the lock is already removed from grant list. We have to
>>>>> + * wait for RECOVERING done or we miss the chance to purge it
>>>>> + * since the removement is much faster than RECOVERING proc.
>>>>> + */
>>>>> + __dlm_wait_on_lockres_flags(res, DLM_LOCK_RES_RECOVERING);
>>>>> + spin_unlock(&res->spinlock);
>>>>> + }
>>>>> +
>>>>> /* let the caller's final dlm_lock_put handle the actual kfree */
>>>>> if (actions & DLM_UNLOCK_FREE_LOCK) {
>>>>> /* this should always be coupled with list removal */
>>>>>
>>>>
>>> .
>>>
>>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
More information about the Ocfs2-devel
mailing list