[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery
srinivas eeda
srinivas.eeda at oracle.com
Fri May 25 15:17:57 PDT 2012
comments inline
On 5/24/2012 10:53 PM, xiaowei.hu at oracle.com wrote:
> From: "Xiaowei.Hu"<xiaowei.hu at oracle.com>
>
> when the master requested locks ,but one/some of the live nodes died,
> after it received the request msg and before send out the locks packages,
> the recovery will fall into endless loop,waiting for the status changed to finalize
>
> NodeA NodeB
> selected as recovery master
> dlm_remaster_locks
> -> dlm_requeset_all_locks
> this send request locks msg to B
> received the msg from A,
> queue worker dlm_request_all_locks_worker
> return 0
> go on set state to requested
> wait for the state become done
> NodeB lost connection due to network
> before the worker begin, or it die.
> NodeA still waiting for the
> change of reco state.
> It won't end if it not get data done msg
> And at this time nodeB do not realize this (or it just died),
> it won't send the msg for ever, nodeA left in the recovery process forever.
>
> This patch let the recovery master check if the node still in live node
> map when it stay in REQUESTED status.
>
> Signed-off-by: Xiaowei.Hu<xiaowei.hu at oracle.com>
> ---
> fs/ocfs2/dlm/dlmrecovery.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 01ebfd0..62659e8 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -555,6 +555,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
> int all_nodes_done;
> int destroy = 0;
> int pass = 0;
> + int dying = 0;
>
> do {
> /* we have become recovery master. there is no escaping
> @@ -659,6 +660,7 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
> list_for_each_entry(ndata,&dlm->reco.node_data, list) {
> mlog(0, "checking recovery state of node %u\n",
> ndata->node_num);
> + dying = 0;
> switch (ndata->state) {
> case DLM_RECO_NODE_DATA_INIT:
> case DLM_RECO_NODE_DATA_REQUESTING:
> @@ -679,6 +681,13 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node)
> dlm->name, ndata->node_num,
> ndata->state==DLM_RECO_NODE_DATA_RECEIVING ?
> "receiving" : "requested");
> + spin_lock(&dlm->spinlock);
> + dying = !test_bit(ndata->node_num, dlm->live_nodes_map);
> + spin_unlock(&dlm->spinlock);
> + if (dying) {
> + ndata->state = DLM_RECO_NODE_DATA_DEAD;
> + break;
> + }
> all_nodes_done = 0;
> break;
> case DLM_RECO_NODE_DATA_DONE:
fix seems to address the issue, but can you please add a function
dlm_is_node_in_livemap similar to dlm_is_node_dead so that it' improves
readability. You can then add the following to check if the node is
still alive
+ if (!dlm_is_node_in_livemap(dlm, ndata->node_num))
+ ndate->state = DLM_RECO_NODE_DATA_DEAD;
+ else
+ all_nodes_done = 0;
More information about the Ocfs2-devel
mailing list