[Ocfs2-devel] [PATCH] ocfs2/dlm: correct the refmap on recovery master

Sunil Mushran sunil.mushran at oracle.com
Tue Jul 20 15:33:17 PDT 2010


On 07/19/2010 07:59 PM, Wengang Wang wrote:
>> Do you have the message sequencing that would lead to this situation?
>> If we migrate the lockres to the reco master, the reco master will send
>> an assert that will make us change the master.
>>      
> So first, the problem is not about the owner changing. It is that the
> bit (in the refmap) for the node in question is not cleared on the new
> master (the recovery master), so the new master fails to purge the
> lockres because of the stale bit in the refmap.
>
> Second, I have no messages at hand for the situation. But I think it is simple
> enough.
>
> 1) Node A no longer has any interest in lockres A, so it is purging it.
> 2) The owner of lockres A is node B, so node A is sending a de-ref message
> to node B.
> 3) At this time, node B crashes. Node C becomes the recovery master and
> recovers lockres A (because its master is the dead node B).
> 4) Node A migrates lockres A to node C, with a refbit set there.
> 5) Node A fails to send the de-ref message to node B because node B has
> crashed. The failure is ignored, and no further action is taken for lockres A.
>    
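
The quoted steps 1) to 5) can be replayed in a small toy model. This is only
a sketch for following the argument, not the ocfs2 code: the LockRes class,
the simulate() function, the skip_dropping_ref_lockres switch, and the rule
that a master may purge only when its refmap is empty are all assumptions
made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LockRes:
    """One node's copy of a lock resource (toy model, not the ocfs2 structure)."""
    name: str
    owner: str
    # Nodes the master believes still hold a reference (stand-in for refmap bits).
    refmap: set = field(default_factory=set)

    def can_purge(self) -> bool:
        # Modelling assumption: a master may purge only when no ref bits remain.
        return not self.refmap


def simulate(skip_dropping_ref_lockres: bool) -> LockRes:
    """Replay steps 1-5 and return the recovery master's (node C's) copy."""
    # Steps 1-2: node A is purging lockres A; its de-ref to master B is in flight.
    a_copy = LockRes("lockres A", owner="B")
    a_is_dropping_ref = True

    # Step 3: node B dies and node C becomes the recovery master.
    c_copy = LockRes(a_copy.name, owner="C")

    # Step 4: node A sends its copy to C, which records a ref bit for A,
    # unless the hypothetical switch keeps a lockres that is being
    # de-referenced out of recovery.
    if not (skip_dropping_ref_lockres and a_is_dropping_ref):
        c_copy.refmap.add("A")
    a_copy.owner = "C"

    # Step 5: the de-ref to B fails because B is dead. The error is ignored
    # and no de-ref is ever sent to C, so a bit set in step 4 stays set.
    return c_copy


if __name__ == "__main__":
    print(simulate(False).can_purge())  # False: stale bit for A blocks the purge
    print(simulate(True).can_purge())   # True: no bit was ever set on C
```

With the switch off, node C is left holding a bit for node A that nothing
will ever clear, which is the hang described in the quoted report. Whether
the real recovery path can actually reach that state is the point under
discussion in this thread.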

In dlm_do_local_recovery_cleanup(), we explicitly clear the flag...
when the owner is the dead_node. So this should not happen.

Your patch changes the logic to exclude such lockres' from the
recovery list. That change, while possibly workable, needs to be
looked into more thoroughly.

In short, there is a disconnect between your description and your patch.
Or in my understanding of them.

> So node A means to drop the ref on the master. But in this situation, node C
> unexpectedly keeps the ref on behalf of node A. Node C eventually fails to
> purge lockres A and hangs on umount.
>
>    
>> I think your problem is the one race we have concerning reco/migration.
>> If so, this fix is not enough.
>>      
> It's a problem of purging + recovery. No pure migration for umount is
> involved here. So what's your concern?
>    

See above.


