[Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung

Wengang wen.gang.wang at oracle.com
Wed Feb 19 21:51:28 PST 2014


Yes.

thanks
wengang

On 2014/02/20 13:51, Junxiao Bi wrote:
> Hi Wengang,
>
> On 02/20/2014 10:32 AM, Wengang wrote:
>> Yes. Checking DLM_RECO_STATE_ACTIVE would introduce the deadlock you
>> described. I don't see why we have to set this flag before picking
>> the recovery master (at that point the lock remastering has not started
>> yet). We could consider setting this flag only when the node really
>> becomes the recovery master, though that would be a separate change
>> (rough sketch below).
>> As for your change, it seems you changed the use of xxx_FINALIZE (it can
>> now also be set on the recovery master); you'd better add a comment for that.
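>>
>> For the "set ACTIVE only on the real master" idea above, something like
>> this (only an untested sketch; the clearing side in dlm_end_recovery()
>> would need a matching change):
>>
>> /* in dlm_do_recovery(): set DLM_RECO_STATE_ACTIVE only once this node
>>  * has actually become the recovery master, instead of in
>>  * dlm_begin_recovery() before dlm_pick_recovery_master() runs. */
>> master_here:
>> 	spin_lock(&dlm->spinlock);
>> 	dlm->reco.state |= DLM_RECO_STATE_ACTIVE;
>> 	spin_unlock(&dlm->spinlock);
>>
>> 	status = dlm_remaster_locks(dlm, dlm->reco.dead_node);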
> Thank you for reviewing it. I will add some comments for it.
>> Signed-off-by: Wengang Wang<wen.gang.wang at oracle.com>
> Do you mean Reviewed-by: Wengang Wang<wen.gang.wang at oracle.com> ?
>
> Thanks,
> Junxiao.
>>
>> On 2014/02/19 21:44, Junxiao Bi wrote:
>>> Hi Wengang,
>>>
>>> I thought about this change before; it seemed it may cause a deadlock.
>>> While the new recovery master is sending the BEGIN_RECO message,
>>> another node x may be trying to become the recovery master in
>>> dlm_pick_recovery_master(). At that point DLM_RECO_STATE_ACTIVE has
>>> already been set in dlm_do_recovery(), so the BEGIN_RECO message
>>> handler dlm_begin_reco_handler() cannot proceed, dlm->reco.new_master
>>> will not be set, and node x will loop in dlm_pick_recovery_master()
>>> forever.
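>>>
>>> Roughly (a sketch of the failing path, not the exact code), the handler
>>> on node x would then always bail out before the new master is recorded:
>>>
>>> /* dlm_begin_reco_handler() on node x, with the ACTIVE check added */
>>> spin_lock(&dlm->spinlock);
>>> if (dlm->reco.state & (DLM_RECO_STATE_FINALIZE | DLM_RECO_STATE_ACTIVE)) {
>>> 	/* node x set ACTIVE itself in dlm_do_recovery(), so it always
>>> 	 * takes this branch; dlm_set_reco_master() is never reached,
>>> 	 * dlm->reco.new_master stays O2NM_INVALID_NODE_NUM, and node x
>>> 	 * keeps waiting for it in dlm_pick_recovery_master(). */
>>> 	spin_unlock(&dlm->spinlock);
>>> 	return -EAGAIN;
>>> }
>>> dlm_set_reco_master(dlm, br->node_idx);
>>> dlm_set_reco_dead_node(dlm, br->dead_node);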
>>>
>>> Thanks,
>>> Junxiao.
>>>
>>> ----- Original Message -----
>>> From: wen.gang.wang at oracle.com
>>> To: junxiao.bi at oracle.com, ocfs2-devel at oss.oracle.com
>>> Cc: mfasheh at suse.de, joe.jin at oracle.com
>>> Sent: Wednesday, February 19, 2014 6:10:43 PM GMT +08:00 Beijing /
>>> Chongqing / Hong Kong / Urumqi
>>> Subject: Re: [Ocfs2-devel] [PATCH] ocfs2: dlm: fix recovery hung
>>>
>>> It seems DLM_RECO_STATE_FINALIZE is used to track the situation where the
>>> recovery master (node 0) dies just after it has finished the dlm
>>> recovery, so that the original dead node (node 1) need not be
>>> recovered again. That flag is set only on non-recovery-master nodes.
>>> In this case, it looks like the recovery master resets the
>>> "recovery master" and "dead node" after another node (node 2) has
>>> started to recover another dead node (node 3). That would happen only
>>> on the "old" recovery master (node 0). Note the flag
>>> DLM_RECO_STATE_ACTIVE: it is set only on the master when it begins to
>>> recover and cleared when recovery finishes. For an "old" master,
>>> checking DLM_RECO_STATE_ACTIVE makes sense. I'd like to suggest:
>>>
>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>> @@ -2704,7 +2704,7 @@ int dlm_begin_reco_handler(struct o2net_msg *msg, u32 len, void *data,
>>>              return 0;
>>>
>>>      spin_lock(&dlm->spinlock);
>>> -    if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
>>> +    if (dlm->reco.state & (DLM_RECO_STATE_FINALIZE | DLM_RECO_STATE_ACTIVE)) {
>>>          mlog(0, "%s: node %u wants to recover node %u (%u:%u) "
>>>               "but this node is in finalize state, waiting on finalize2\n",
>>>               dlm->name, br->node_idx, br->dead_node,
>>>
>>> thanks
>>> wengang
>>>
>>> On 2014/02/19 16:30, Junxiao Bi wrote:
>>>> There is a race window in dlm_do_recovery() between dlm_remaster_locks()
>>>> and dlm_reset_recovery() when the recovery master is about to finish the
>>>> recovery process for a dead node. After the master sends the FINALIZE_RECO
>>>> message in dlm_remaster_locks(), another node may become the recovery
>>>> master for another dead node and send the BEGIN_RECO message to all the
>>>> nodes, including the old master. In the old master's handler for this
>>>> message, dlm_begin_reco_handler(), dlm->reco.dead_node and
>>>> dlm->reco.new_master will be set to the second dead node and the new
>>>> master; then in dlm_reset_recovery() these two variables will be reset
>>>> to their default values. As a result the new recovery master cannot
>>>> finish the recovery process and hangs, and in the end the whole cluster
>>>> hangs waiting for recovery.
>>>>
>>>> old recovery master:                         new recovery master:
>>>> dlm_remaster_locks()
>>>>                                              become recovery master for
>>>>                                              another dead node.
>>>>                                              dlm_send_begin_reco_message()
>>>> dlm_begin_reco_handler()
>>>> {
>>>>  if (dlm->reco.state & DLM_RECO_STATE_FINALIZE) {
>>>>   return -EAGAIN;
>>>>  }
>>>>  dlm_set_reco_master(dlm, br->node_idx);
>>>>  dlm_set_reco_dead_node(dlm, br->dead_node);
>>>> }
>>>> dlm_reset_recovery()
>>>> {
>>>>  dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
>>>>  dlm_set_reco_master(dlm, O2NM_INVALID_NODE_NUM);
>>>> }
>>>>                                              will hang in dlm_remaster_locks(),
>>>>                                              waiting for dlm locks info
>>>>
>>>> Before sending the FINALIZE_RECO message, the recovery master should set
>>>> DLM_RECO_STATE_FINALIZE for itself and clear it after the recovery is
>>>> done. This closes the race window, since BEGIN_RECO messages will not be
>>>> handled before the DLM_RECO_STATE_FINALIZE flag is cleared.
>>>>
>>>> A similar race may happen between the new recovery master and a normal
>>>> node which is in dlm_finalize_reco_handler(); fix that as well.
>>>>
>>>> Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
>>>> ---
>>>>     fs/ocfs2/dlm/dlmrecovery.c |   11 +++++++++--
>>>>     1 file changed, 9 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>> index 7035af0..fe58e8b 100644
>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>> @@ -537,7 +537,10 @@ master_here:
>>>>             /* success!  see if any other nodes need recovery */
>>>>             mlog(0, "DONE mastering recovery of %s:%u here(this=%u)!\n",
>>>>                  dlm->name, dlm->reco.dead_node, dlm->node_num);
>>>> -        dlm_reset_recovery(dlm);
>>>> +        spin_lock(&dlm->spinlock);
>>>> +        __dlm_reset_recovery(dlm);
>>>> +        dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE;
>>>> +        spin_unlock(&dlm->spinlock);
>>>>         }
>>>>         dlm_end_recovery(dlm);
>>>>
>>>> @@ -695,6 +698,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>>>>             if (all_nodes_done) {
>>>>                 int ret;
>>>>
>>>> +            spin_lock(&dlm->spinlock);
>>>> +            dlm->reco.state |= DLM_RECO_STATE_FINALIZE;
>>>> +            spin_unlock(&dlm->spinlock);
>>>> +
>>>>                 /* all nodes are now in DLM_RECO_NODE_DATA_DONE state
>>>>                   * just send a finalize message to everyone and
>>>>                   * clean up */
>>>> @@ -2882,8 +2889,8 @@ int dlm_finalize_reco_handler(struct o2net_msg *msg, u32 len, void *data,
>>>>                     BUG();
>>>>                 }
>>>>                 dlm->reco.state &= ~DLM_RECO_STATE_FINALIZE;
>>>> +            __dlm_reset_recovery(dlm);
>>>>                 spin_unlock(&dlm->spinlock);
>>>> -            dlm_reset_recovery(dlm);
>>>>                 dlm_kick_recovery_thread(dlm);
>>>>                 break;
>>>>             default:



