[Ocfs2-devel] [PATCH] ocfs2/dlm: retry migrating if nomem or lockres is in recovery on target

Thu Sep 9 21:08:40 PDT 2010

OK.
This is not a customer-reported problem. I found it while testing other
patches, and it is easy to reproduce (around 50% of the time).

I will think more on the reco+mig race.

regards,
wengang.
On 10-09-09 18:41, Sunil Mushran wrote:
> I don't think this fixes the issue. As in, the fix for the reco+mig race is
> a lot more involved. Is someone hitting this frequently? As in, this should
> be a hard race to reproduce.
> 
> On 08/31/2010 08:41 AM, Wengang Wang wrote:
> >This patch tries to fix two problems:
> >
> >problem 1):
> >This is a case of recovery + migration: a recovery happens while node I is in
> >the middle of umount. Node I is the recovery master.
> >Say lockres A was mastered by the dead node and needs to be recovered. Node I
> >(the reco master) and node II both hold a reference on lockres A.
> >So lockres A is being recovered from node II to node I, with the RECOVERING
> >flag set. Meanwhile the umount process goes on and happens to be migrating
> >lockres A to node II. Since recovery has not finished yet (RECOVERING is still
> >set), node II responds with -EFAULT to kill node I. Node I then kills itself
> >(BUGON).
> >
> >There is a check for recovery (on RECOVERING), but res->spinlock and
> >dlm->spinlock are dropped afterwards, so the check does not help much.
> >
> >Since we have to drop all spinlocks when sending the migrate lockres
> >(DLM_MIG_LOCKRES_MSG) message, we have to deal with the above case.
> >
> >problem 2):
> >In the same context as problem 1), -ENOMEM from the target node can trigger an
> >incorrect BUG() on the requester of "migrate lockres".
> >
> >The fix: when the target node returns -EFAULT or -ENOMEM, we retry the
> >migration (for umount).
> >Though these are two separate problems, they are fixed in the same way, so I
> >combined them.
> >
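For readers following the thread, the retry idea described above can be
sketched in userspace C roughly as follows. This is only an illustration under
stated assumptions, not the code from the patch: struct fake_target,
send_migrate_lockres() and migrate_with_retry() are hypothetical names, the
target node is simulated, and the bound on the number of attempts is an
assumption of the sketch rather than something the patch description states.

```c
#include <errno.h>

/* Hypothetical stand-in for the target node: it fails the next `busy`
 * requests with `busy_errno` (e.g. EFAULT while the lockres is still
 * RECOVERING, or ENOMEM), then succeeds. */
struct fake_target {
    int busy;        /* number of upcoming requests that will fail */
    int busy_errno;  /* positive errno returned while busy */
};

/* Hypothetical stand-in for sending the migrate lockres message and
 * returning the status reported by the target. */
static int send_migrate_lockres(struct fake_target *t)
{
    if (t->busy > 0) {
        t->busy--;
        return -t->busy_errno;
    }
    return 0;
}

/* Retry while the target reports -EFAULT or -ENOMEM, instead of treating
 * those replies as fatal. Any other status, or running out of attempts,
 * ends the loop. The number of attempts made is stored in *attempts. */
static int migrate_with_retry(struct fake_target *t, int max_tries,
                              int *attempts)
{
    int ret = -EAGAIN;

    *attempts = 0;
    while (*attempts < max_tries) {
        (*attempts)++;
        ret = send_migrate_lockres(t);
        if (ret != -EFAULT && ret != -ENOMEM)
            break;
        /* Real code would wait or reschedule here before retrying. */
    }
    return ret;
}
```

In this sketch a target that answers -EFAULT twice and then succeeds makes
migrate_with_retry() return 0 on the third attempt, while any status other
than -EFAULT or -ENOMEM is returned to the caller immediately.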