[Ocfs2-devel] avoid being purged when queued for assert_master

Wed Oct 12 18:47:30 PDT 2011

I meant master_request (not query). We set refmap _before_
asserting. So that should not happen.

On 10/12/2011 06:02 PM, Wengang Wang wrote:
> Hi Sunil,
>
> On 11-10-12 17:32, Sunil Mushran wrote:
>> So you are saying a lockres can get purged before the node is asserting
>> master to other nodes?
>>
>> The main place where we dispatch assert is during master_query.
>> There we set refmap before dispatching. Meaning refmap will protect
>> us from purging.
>>
>> But I think it could happen in master_requery, which only comes into
>> play if a node dies during migration.
>>
>> Is that the case here?
> I think this mainly involves the response to a master_request.
> In dlm_master_request_handler(), the master node queues an assert_master.
> The node that sent the master_request learns who the master is from the
> response value. It doesn't need to wait until the assert_master arrives.
> As you know, the assert_master work is done in a workqueue, and a
> work item in it can be heavily delayed. So in the interval between the
> (old) master responding with "Yes, I am master" and it sending assert_master,
> anything can happen. The worst case is that the lockres on the (old) master
> gets purged and is remastered by another node. In that case the old
> master clearly shouldn't send the assert_master any longer.
> To prevent that from happening, we should keep the lockres un-purged as
> long as it's queued for master_request.
>
> # This is the problem my flush_workqueue patch tries to fix.
>
> thanks,
> wengang.
>
>> On 10/12/2011 12:04 AM, Wengang Wang wrote:
>>> Hi Sunil/Joel/Mark and anyone who has interest,
>>>
>>> This is not a patch but a discussion.
>>>
>>> Currently we have a problem:
>>> While a lockres is still queued (in dlm->work_list) for sending an
>>> assert_master (or the send is in progress), the lockres must not be
>>> purged (removed from the hash). But there is no flag/state on the lockres
>>> itself that denotes this situation.
>>>
>>> The bad part is that if the lockres is purged (so this node is surely not
>>> the owner at that moment) and the assert_master is sent after the purge,
>>> it can confuse other nodes. On another node, the owner can by now be any
>>> other node, so receiving the assert_master can trigger a BUG() because
>>> 'owner' doesn't match.
>>>
>>> So we had better prevent the lockres from being purged while it's queued
>>> for something (assert_master).
>>>
>>> Srini and I discussed some possible fixes:
>>> 1) Adding a flag to lockres->state.
>>>     This does not work. A lockres can have multiple instances in the queue
>>>     list, so a simple flag is not safe. The instances are not nested, so
>>>     even saving the previous flags doesn't work. Nor can we merge the
>>>     instances, because they can be for different purposes.
>>>
>>> 2) Checking whether the lockres is queued before purging it.
>>>    This works, but doesn't sound good. It needs changes to the current
>>>    behaviour of the queue list. Also, we have no idea about the
>>>    performance of the check (searching the list).
>>>
>>> 3) Making use of lockres->inflight_locks.
>>>    This works, but seems to be a misuse of inflight_locks.
>>>
>>> 4) Adding a new member to lockres that counts how many times it is queued.
>>>     This works and is simple, but needs extra memory.
>>>
>>> I prefer 4).
>>>
>>> What do you think?
>>>
>>> thanks,
>>> wengang.
>>>
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
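
Option 4) from the original message (a per-lockres count of queued work
items) could look roughly like the sketch below. This is only an
illustration of the idea, not code from the ocfs2 tree: the struct is a
simplified stand-in for the real lockres, and the names lockres_sketch,
queued_work, has_refs, lockres_queue_work(), lockres_work_done() and
lockres_is_purgeable() are all hypothetical. It also assumes the caller
already holds whatever lock protects the lockres, so no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a lock resource. */
struct lockres_sketch {
    unsigned int inflight_locks; /* existing purge guard mentioned in the thread */
    unsigned int queued_work;    /* hypothetical: work items still queued or running */
    bool has_refs;               /* stand-in for "refmap is not empty" */
};

/* Call when a work item (e.g. an assert_master) is queued for this lockres. */
static void lockres_queue_work(struct lockres_sketch *res)
{
    res->queued_work++;
}

/* Call when the work item has finished, whether or not the send succeeded. */
static void lockres_work_done(struct lockres_sketch *res)
{
    assert(res->queued_work > 0); /* unbalanced "done" would be a bug */
    res->queued_work--;
}

/* The purge path would refuse to purge while any work item is outstanding. */
static bool lockres_is_purgeable(const struct lockres_sketch *res)
{
    return res->inflight_locks == 0 &&
           !res->has_refs &&
           res->queued_work == 0;
}
```

Unlike a single flag (option 1), a counter stays correct when the same
lockres has several independent, non-nested instances in the queue: each
instance increments on queue and decrements on completion, and the lockres
becomes purgeable again only after the last one finishes.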