[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2

Xue jiufei xuejiufei at huawei.com
Mon Jun 17 20:13:14 PDT 2013


> From: "Xiaowei.Hu" <xiaowei.hu at oracle.com>
> 
> When the recovery master requests locks and one or more of the live nodes die
> after receiving the request msg but before sending out their lock packages,
> recovery falls into an endless loop, waiting for the status to change to finalize.
> 
> NodeA                                     NodeB
> selected as recovery master
> dlm_remaster_locks
>   -> dlm_request_all_locks
>   this sends the lock request msg to B
>                                           receives the msg from A,
>                                           queues worker dlm_request_all_locks_worker,
>                                           returns 0
> goes on to set the state to requested
> waits for the state to become done
>                                           NodeB loses its connection due to the network
>                                           before the worker begins, or it dies.
> 
> NodeA is still waiting for the reco state to change.
> It will not finish unless it gets the data done msg.
> Meanwhile NodeB does not realize this (or it has just died), so it
> will never send the msg, and NodeA is left in the recovery process forever.
> 
> This patch lets the recovery master check whether the node is still in the
> live node map while it stays in REQUESTED status.
> 

Hi Xiaowei,
We have reviewed this patch and have some questions:
1) In dlm_is_node_in_livemap(), I think it should use
!!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is
live. As posted, !test_bit() returns true when the node is NOT in the
live node map, which is the opposite of what the comment says.
2) Why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap()
in dlm_remaster_locks()?
I think dlm_is_node_dead() is better because dlm->live_nodes_map
may remain unchanged when another node unmounts.
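
To make point 1 concrete, here is a minimal userspace sketch of the two
return expressions. It is not kernel code: sketch_test_bit() and
sketch_set_bit() are hypothetical stand-ins for the kernel bitops,
SKETCH_MAX_NODES is an arbitrary size chosen for illustration, and the
dlm->spinlock locking is left out.

```c
#include <limits.h>

/* Hypothetical map size for this sketch, not the kernel's value. */
#define SKETCH_MAX_NODES 256
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS ((SKETCH_MAX_NODES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Userspace stand-in for test_bit(): returns 1 if bit nr is set, else 0. */
static int sketch_test_bit(unsigned int nr, const unsigned long *map)
{
	return (int)((map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}

/* Userspace stand-in for set_bit() (non-atomic here). */
static void sketch_set_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Expression as posted in the patch: true when the node is NOT in the map. */
static int in_livemap_as_posted(const unsigned long *map, unsigned int node)
{
	return !sketch_test_bit(node, map);
}

/* Expression suggested in this review: true when the node IS in the map. */
static int in_livemap_suggested(const unsigned long *map, unsigned int node)
{
	return !!sketch_test_bit(node, map);
}
```

With only node 3 set in the map, in_livemap_suggested() reports node 3 as
live and node 4 as not live, while in_livemap_as_posted() reports the
reverse.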

Thanks.

> Signed-off-by: Xiaowei.Hu <xiaowei.hu at oracle.com>
> ---
>  fs/ocfs2/dlm/dlmrecovery.c |   16 +++++++++++++++-
>  1 files changed, 15 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..546c5b5 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm)
>  	return ready;
>  }
>  
> +/* returns true if node is still in the live node map
> + * this map is cleared before domain map,could be checked in recovery*/
> +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node)
> +{
> +	int live;
> +	spin_lock(&dlm->spinlock);
> +	live = !test_bit(node, dlm->live_nodes_map);
> +	spin_unlock(&dlm->spinlock);
> +	return live;
> +}
> +
>  /* returns true if node is no longer in the domain
>   * could be dead or just not joined */
>  int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node)
> @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
>  					     dlm->name, ndata->node_num,
>  					     ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
>  					     "receiving" : "requested");
> -					all_nodes_done = 0;
> +					if (!dlm_is_node_in_livemap(dlm, ndata->node_num))
> +						ndata->state = DLM_RECO_NODE_DATA_DEAD;
> +					else
> +						all_nodes_done = 0;
>  					break;
>  				case DLM_RECO_NODE_DATA_DONE:
>  					mlog(0, "%s: node %u state is done\n",
> 
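
The intent of the second hunk can be sketched in userspace roughly as
follows. This is only an illustration under assumptions: the sketch_*
names and the `alive` field are hypothetical, and `alive` stands in for
whichever liveness check ends up being used (dlm_is_node_in_livemap() or
dlm_is_node_dead()). It models one pass of the master's wait loop, not
the real dlm_remaster_locks().

```c
/* Hypothetical per-node recovery states, loosely mirroring DLM_RECO_NODE_DATA_*. */
enum sketch_state {
	SKETCH_REQUESTED,
	SKETCH_RECEIVING,
	SKETCH_DONE,
	SKETCH_DEAD
};

struct sketch_node {
	unsigned int node_num;
	enum sketch_state state;
	int alive; /* stand-in for the chosen liveness check */
};

/*
 * One pass over the per-node recovery data.
 * Returns 1 when no node is still pending, 0 otherwise.
 */
static int sketch_all_nodes_done(struct sketch_node *nodes, int count)
{
	int all_nodes_done = 1;
	int i;

	for (i = 0; i < count; i++) {
		switch (nodes[i].state) {
		case SKETCH_REQUESTED:
		case SKETCH_RECEIVING:
			/* A pending node that is gone is marked dead instead
			 * of being waited on forever. */
			if (!nodes[i].alive)
				nodes[i].state = SKETCH_DEAD;
			else
				all_nodes_done = 0;
			break;
		case SKETCH_DONE:
		case SKETCH_DEAD:
			break;
		}
	}
	return all_nodes_done;
}
```

Without the liveness check, a node that stays in the requested state after
dying keeps all_nodes_done at 0 on every pass, which is the endless wait
described in the commit message.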



