[Ocfs2-devel] [PATCH v2] ocfs2/dlm: fix race between convert and recovery
Junxiao Bi
junxiao.bi at oracle.com
Tue Sep 22 18:25:58 PDT 2015
On 09/22/2015 08:23 PM, Joseph Qi wrote:
> Hi Junxiao,
>
> On 2015/9/22 8:55, Junxiao Bi wrote:
>> On 09/21/2015 05:23 PM, Joseph Qi wrote:
>>> Hi Junxiao,
>>> This solution may have problem in the following scenario:
>>> Consider dlm_send_remote_convert_request has taken too much time, and
>>> dlm_move_lockres_to_recovery_list runs first and new master will see
>>> this node is currently in convert list after recovery.
>>> Then dlm_send_remote_convert_request returns other than DLM_NORMAL and
>>> it will revert it to grant list, then retry convert. This will makes
>>> this node and master inconsistent.
>>> I will try to find another solution to resolve the race issue.
>>
>> If master is down, no need retry convert. May check the return value of
>> dlm_send_remote_convert_request(), if DLM_RECOVERING, don't retry,
>> otherwise retry?
>
> I want to keep the original logic. And for fixing the race case I
> described, how about the following idea?
>
> Check the status DLM_NORMAL. If res->state is currently
> DLM_LOCK_RES_RECOVERING (set in dlm_move_lockres_to_recovery_list, means
> still in recovery) or res master changed (means new master has finished
> the recovery), reset the status to DLM_RECOVERING, just like the check
> at the beginning of dlmconvert_remote. Then it is now in grant list and
> outer will retry.
How this idea fix the race windows you described in patch log? Lock is
reverted to granted list but dlm_send_remote_convert_request() return
DLM_NORMAL.
Thanks,
Junxiao.
>
>>
>> Thanks,
>> Junxiao.
>>
>>>
>>> On 2015/9/20 15:20, Junxiao Bi wrote:
>>>> Reviewed-by: Junxiao Bi <junxiao.bi at oracle.com>
>>>>
>>>>>> 在 2015年9月18日,下午7:25,Joseph Qi <joseph.qi at huawei.com> 写道:
>>>>>>
>>>>>> There is a race window between dlmconvert_remote and
>>>>>> dlm_move_lockres_to_recovery_list, which will cause a lock with
>>>>>> OCFS2_LOCK_BUSY in grant list, thus system hangs.
>>>>>>
>>>>>> dlmconvert_remote
>>>>>> {
>>>>>> spin_lock(&res->spinlock);
>>>>>> list_move_tail(&lock->list, &res->converting);
>>>>>> lock->convert_pending = 1;
>>>>>> spin_unlock(&res->spinlock);
>>>>>>
>>>>>> status = dlm_send_remote_convert_request();
>>>>>>>>>>>>>>>>>> race window, master has queued ast and return DLM_NORMAL,
>>>>>> and then down before sending ast.
>>>>>> this node detects master down and calls
>>>>>> dlm_move_lockres_to_recovery_list, which will revert the
>>>>>> lock to grant list.
>>>>>> Then OCFS2_LOCK_BUSY won't be cleared as new master won't
>>>>>> send ast any more because it thinks already be authorized.
>>>>>>
>>>>>> spin_lock(&res->spinlock);
>>>>>> lock->convert_pending = 0;
>>>>>> if (status != DLM_NORMAL)
>>>>>> dlm_revert_pending_convert(res, lock);
>>>>>> spin_unlock(&res->spinlock);
>>>>>> }
>>>>>>
>>>>>> In this case, just leave it in convert list and new master will take care
>>>>>> of it after recovery. And if convert request returns other than
>>>>>> DLM_NORMAL, convert thread will do the revert itself. So remove the
>>>>>> revert logic in dlm_move_lockres_to_recovery_list.
>>>>>>
>>>>>> changelog since v1:
>>>>>> Clean up convert_pending since it is now useless.
>>>
>>>
>>
>>
>> .
>>
>
>
More information about the Ocfs2-devel
mailing list