[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2
Xue jiufei
xuejiufei at huawei.com
Mon Jun 17 20:13:14 PDT 2013
> From: "Xiaowei.Hu" <xiaowei.hu at oracle.com>
>
> when the master requested locks ,but one/some of the live nodes died,
> after it received the request msg and before send out the locks packages,
> the recovery will fall into endless loop,waiting for the status changed to finalize
>
> NodeA NodeB
> selected as recovery master
> dlm_remaster_locks
> -> dlm_requeset_all_locks
> this send request locks msg to B
> received the msg from A,
> queue worker dlm_request_all_locks_worker
> return 0
> go on set state to requested
> wait for the state become done
> NodeB lost connection due to network
> before the worker begin, or it die.
>
> NodeA still waiting for the change of reco state.
> It won't end if it not get data done msg.
> And at this time nodeB do not realize this (or it just died),
> it won't send the msg for ever, nodeA left in the recovery process forever.
>
> This patch let the recovery master check if the node still in live node
> map when it stay in REQUESTED status.
>
Hi, xiaowei,
We have reviewed this patch and have some questions:
1) in dlm_is_node_in_livemap(), I think it should use
!!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is
live;
2) why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap()
in dlm_remaster_locks?
I think dlm_is_node_dead() is better because dlm->live_nodes_map
may be remain when another node umount.
Thanks.
> Signed-off-by: Xiaowei.Hu <xiaowei.hu at oracle.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 16 +++++++++++++++-
> 1 files changed, 15 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..546c5b5 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm)
> return ready;
> }
>
> +/* returns true if node is still in the live node map
> + * this map is cleared before domain map,could be checked in recovery*/
> +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node)
> +{
> + int live;
> + spin_lock(&dlm->spinlock);
> + live = !test_bit(node, dlm->live_nodes_map);
> + spin_unlock(&dlm->spinlock);
> + return live;
> +}
> +
> /* returns true if node is no longer in the domain
> * could be dead or just not joined */
> int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node)
> @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
> dlm->name, ndata->node_num,
> ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
> "receiving" : "requested");
> - all_nodes_done = 0;
> + if (!dlm_is_node_in_livemap(dlm, ndata->node_num))
> + ndata->state = DLM_RECO_NODE_DATA_DEAD;
> + else
> + all_nodes_done = 0;
> break;
> case DLM_RECO_NODE_DATA_DONE:
> mlog(0, "%s: node %u state is done\n",
>
.
More information about the Ocfs2-devel
mailing list