[Ocfs2-devel] Race condition between OCFS2 downconvert thread and ocfs2 cluster lock.

Sunil Mushran sunil.mushran at oracle.com
Tue Feb 21 16:45:58 PST 2012


Both ASTs and BASTs can only be sent by the master, and we ensure the
master sends the ASTs before the BASTs.

Do you have the full lockres dump?

On 02/21/2012 04:36 PM, Xiaowei.hu wrote:
> Hi Sunil,
>
> I mean it executes in this way:
>
> Node A calls ocfs2_dlm_lock() and releases the lockres spin lock, so at
> this point A holds no spin locks. It then starts to execute the proxy
> AST handler and processes the BAST request from node B. dlmthread then
> flushes the BAST, and only after this does node A start to queue its
> AST in the ocfs2_dlm_lock() function.
>
> Thanks,
> Xiaowei
> On 02/22/2012 01:48 AM, Sunil Mushran wrote:
>> > bast queued and flushed, before the ast was queued
>>
>> Unlikely with o2dlm. dlmthread always sends ASTs before BASTs.
>>
>> Can you recreate the entire lockres? A full dump may yield more
>> information.
>>
>> Sunil
>>
>> On 02/20/2012 10:12 PM, xiaowei.hu at oracle.com wrote:
>>> I am trying to fix bug 13611997. CT's machine ran into a BUG in the
>>> ocfs2dc thread: BUG_ON(lockres->l_action != OCFS2_AST_CONVERT &&
>>> lockres->l_action != OCFS2_AST_DOWNCONVERT). I analyzed the vmcore:
>>> lockres->l_action = OCFS2_AST_ATTACH and l_flags = 326 (which means
>>> OCFS2_LOCK_BUSY|OCFS2_LOCK_BLOCKED|OCFS2_LOCK_INITIALIZED|OCFS2_LOCK_QUEUED).
>>> Comparing this with the code, this state is only possible during
>>> ocfs2_cluster_lock. Here is the race:
>>>
>>> NodeA                                               NodeB
>>> ocfs2_cluster_lock on a new lockres M
>>>   spin_lock_irqsave(&lockres->l_lock, flags);
>>>   gen = lockres_set_pending(lockres);
>>>   lockres->l_action = OCFS2_AST_ATTACH;
>>>   lockres_or_flags(lockres, OCFS2_LOCK_BUSY);
>>>   spin_unlock_irqrestore(&lockres->l_lock, flags);
>>>
>>>   ocfs2_dlm_lock() finished and returned.
>>>   **and lockres_clear_pending(lockres, gen, osb);
>>>                                                     request a lock on the same lockres M
>>>                                                     It's blocked by nodeA, and an ast proxy was sent to A
>>>
>>> bast queued and flushed, before the ast was queued
>>> then the ocfs2dc was scheduled
>>> there is a chance to execute this code path:
>>>   ocfs2_downconvert_thread()
>>>     ocfs2_downconvert_thread_do_work()
>>>       ocfs2_blocking_ast()
>>>       ocfs2_process_blocked_lock()
>>>         ocfs2_unblock_lock()
>>>           spin_lock_irqsave(&lockres->l_lock, flags);
>>>           if (lockres->l_flags & OCFS2_LOCK_BUSY)
>>>             ret = ocfs2_prepare_cancel_convert(osb, lockres);
>>>               BUG_ON(lockres->l_action != OCFS2_AST_CONVERT &&
>>>                      lockres->l_action != OCFS2_AST_DOWNCONVERT);
>>>               this triggers the BUG()
>>>
>>> Solution:
>>> One possible solution is to remove the lockres_clear_pending call
>>> marked with two stars and leave the clearing to the ast function. This
>>> makes sure the bast function waits for the ast, which clears
>>> OCFS2_LOCK_BUSY and sets OCFS2_LOCK_ATTACHED first, before the
>>> downconvert process is entered.
>>>
>>>
>>
>


