[Ocfs2-devel] [PATCH] ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
Wengang Wang
wen.gang.wang at oracle.com
Sun Jul 25 18:42:26 PDT 2010
Hi Srini,
Thanks for your review!
On 10-07-23 15:27, Srinivas Eeda wrote:
> thanks for making this patch, it looks good just few minor changes
> about comments
>
> On 7/23/2010 5:15 AM, Wengang Wang wrote:
> >In the following situation, an incorrect bit remains set in the refmap on the
> >recovery master. The recovery master will eventually fail to purge the lockres
> >because of that incorrect bit.
> >
> >1) node A no longer has any interest in lockres A, so it is purging it.
> >2) the owner of lockres A is node B, so node A sends a deref message
> >to node B.
> >3) at this moment, node B crashes. node C becomes the recovery master and
> >recovers lockres A (because its master was the dead node B).
> >4) node A migrates lockres A to node C with a refbit set there.
> >5) node A's deref message to node B fails because node B crashed. The failure
> >is ignored and no further action is taken for lockres A.
> >
> >Normally, re-sending the deref message, this time to the recovery master,
> >would fix this. However, ignoring the failed deref to the original master and
> >not recovering the lockres to the recovery master has the same effect, and the
> >latter is simpler.
> >
> >Signed-off-by: Wengang Wang <wen.gang.wang at oracle.com>
> >---
> > fs/ocfs2/dlm/dlmrecovery.c | 17 +++++++++++++++--
> > fs/ocfs2/dlm/dlmthread.c | 28 +++++++++++++++++-----------
> > 2 files changed, 32 insertions(+), 13 deletions(-)
> >
> >diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> >index 9dfaac7..06640f6 100644
> >--- a/fs/ocfs2/dlm/dlmrecovery.c
> >+++ b/fs/ocfs2/dlm/dlmrecovery.c
> >@@ -1997,6 +1997,8 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt *dlm,
> > struct list_head *queue;
> > struct dlm_lock *lock, *next;
> >+ assert_spin_locked(&dlm->spinlock);
> >+ assert_spin_locked(&res->spinlock);
> > res->state |= DLM_LOCK_RES_RECOVERING;
> > if (!list_empty(&res->recovering)) {
> > mlog(0,
> >@@ -2336,9 +2338,20 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
> > /* the wake_up for this will happen when the
> > * RECOVERING flag is dropped later */
> remove above comment as it doesn't seem to be relevant anymore.
OK.
> >- res->state &= ~DLM_LOCK_RES_DROPPING_REF;
> >+ if (res->state & DLM_LOCK_RES_DROPPING_REF) {
> >+ /*
> >+ * don't migrate a lockres which is in
> >+ * progress of dropping ref
> >+ */
> move this comment to before the if condition
OK.
> >+ mlog(ML_NOTICE, "%.*s ignored for "
> >+ "migration\n", res->lockname.len,
> >+ res->lockname.name);
> This information only helps us in diagnosing any related issue and
> is not helpful in normal cases. So it should be 0 instead of
> ML_NOTICE.
I think this happens only in rare cases, so it won't write many such logs, but
it could help in diagnosing in case something wrong takes place.
mlog(0, ..) helps almost nothing.
> >+ res->state &=
> >+ ~DLM_LOCK_RES_DROPPING_REF;
> we don't need to clear this state as dlm_purge_lockres removes it.
OK.
regards,
wengang.
> >+ } else
> >+ dlm_move_lockres_to_recovery_list(dlm,
> >+ res);
> >- dlm_move_lockres_to_recovery_list(dlm, res);
> > } else if (res->owner == dlm->node_num) {
> > dlm_free_dead_locks(dlm, res, dead_node);
> > __dlm_lockres_calc_usage(dlm, res);