[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2
Jeff Liu
jeff.liu at oracle.com
Sun Jun 23 04:12:22 PDT 2013
On 06/18/2013 11:13 AM, Xue jiufei wrote:
>> From: "Xiaowei.Hu" <xiaowei.hu at oracle.com>
>>
>> when the master requested locks ,but one/some of the live nodes died,
>> after it received the request msg and before send out the locks packages,
>> the recovery will fall into endless loop,waiting for the status changed to finalize
>>
>> NodeA NodeB
>> selected as recovery master
>> dlm_remaster_locks
>> -> dlm_requeset_all_locks
>> this send request locks msg to B
>> received the msg from A,
>> queue worker dlm_request_all_locks_worker
>> return 0
>> go on set state to requested
>> wait for the state become done
>> NodeB lost connection due to network
>> before the worker begin, or it die.
>>
>> NodeA still waiting for the change of reco state.
>> It won't end if it not get data done msg.
>> And at this time nodeB do not realize this (or it just died),
>> it won't send the msg for ever, nodeA left in the recovery process forever.
>>
>> This patch let the recovery master check if the node still in live node
>> map when it stay in REQUESTED status.
>>
>
> Hi, xiaowei,
> We have reviewed this patch and have some questions:
> 1) in dlm_is_node_in_livemap(), I think it should use
> !!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is
> live;
Hmm? test_bit(node,...) is ok.
> 2) why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap()
> in dlm_remaster_locks?
> I think dlm_is_node_dead() is better because dlm->live_nodes_map
> may be remain when another node umount.
Please refer to:
http://marc.info/?l=ocfs2-devel&m=133799814717270&w=2
Thanks,
-Jeff
>
> Thanks.
>
>> Signed-off-by: Xiaowei.Hu <xiaowei.hu at oracle.com>
>> ---
>> fs/ocfs2/dlm/dlmrecovery.c | 16 +++++++++++++++-
>> 1 files changed, 15 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>> index 01ebfd0..546c5b5 100644
>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>> @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm)
>> return ready;
>> }
>>
>> +/* returns true if node is still in the live node map
>> + * this map is cleared before domain map,could be checked in recovery*/
>> +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node)
>> +{
>> + int live;
>> + spin_lock(&dlm->spinlock);
>> + live = !test_bit(node, dlm->live_nodes_map);
>> + spin_unlock(&dlm->spinlock);
>> + return live;
>> +}
>> +
>> /* returns true if node is no longer in the domain
>> * could be dead or just not joined */
>> int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node)
>> @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>> dlm->name, ndata->node_num,
>> ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
>> "receiving" : "requested");
>> - all_nodes_done = 0;
>> + if (!dlm_is_node_in_livemap(dlm, ndata->node_num))
>> + ndata->state = DLM_RECO_NODE_DATA_DEAD;
>> + else
>> + all_nodes_done = 0;
>> break;
>> case DLM_RECO_NODE_DATA_DONE:
>> mlog(0, "%s: node %u state is done\n",
>>
>
>
>
>
> .
>
>
>
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
More information about the Ocfs2-devel
mailing list