[Ocfs2-devel] dlm stress test hangs OCFS2

Sunil Mushran sunil.mushran at oracle.com
Tue Sep 15 17:49:15 PDT 2009


My original thinking was that the dc (downconvert) thread was not getting kicked.
That is not the case: the lock is getting downconverted, but it is getting
upconverted again shortly thereafter. This could simply be a case where dlmglue
is slow to increment the holders count, which is what blocks the dc thread from
downconverting the lock. The snippet below shows that the BAST is received 16
usecs after the upconvert.
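To make the suspected race concrete, here is a toy, single-threaded model of it. Everything in it is hypothetical: `ModelLockres`, `dc_can_downconvert` and `upconvert_round` are made-up names, and the rule "the dc thread downconverts whenever a BAST is pending and holders == 0" is a simplification of dlmglue, not its actual code. The mode numbers follow the o2dlm numbering seen in the trace (NL=0, PR=3, EX=5).

```python
from dataclasses import dataclass

# Mode numbers as printed in the trace (o2dlm numbering: NL=0, PR=3, EX=5).
NL, PR, EX = 0, 3, 5


@dataclass
class ModelLockres:
    """Hypothetical, heavily simplified stand-in for an OCFS2 lock resource."""
    level: int = NL     # level currently granted to this node
    holders: int = 0    # local users pinning the lock at that level
    blocking: int = 0   # level a remote node is asking for (0 = none)


def dc_can_downconvert(res: ModelLockres) -> bool:
    # Assumed rule: downconvert as soon as a BAST is pending and no holder pins the lock.
    return res.blocking != 0 and res.holders == 0


def upconvert_round(res: ModelLockres, bast_first: bool) -> bool:
    """Run one upconvert cycle; return True if the local waiter ends up holding PR.

    bast_first=True models the suspected race: the BAST and the dc thread
    run before the waiter gets around to incrementing holders.
    """
    res.level = PR              # upconvert AST ("action 2")
    if not bast_first:
        res.holders += 1        # waiter pins the lock in time

    res.blocking = EX           # BAST: a remote node wants EX
    if dc_can_downconvert(res):
        res.level = NL          # downconvert ("level 3 => 0")
        res.blocking = 0

    # If the BAST won, the late waiter now finds NL and must request PR again.
    return res.holders > 0 and res.level >= PR


if __name__ == "__main__":
    res = ModelLockres()
    for i in range(3):
        got = upconvert_round(res, bast_first=True)
        print(f"late holders++, round {i + 1}: got lock={got}, level={res.level}")

    res = ModelLockres()
    got = upconvert_round(res, bast_first=False)
    print(f"timely holders++: got lock={got}, level={res.level}")
```

When the BAST lands before holders is bumped, every round ends back at NL with nothing gained, mirroring the repeating upconvert/downconvert pattern in the trace; when holders is bumped first, the dc thread is held off.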

Coly, I have another patch. Pop out the older patch before applying this 
one.
http://oss.oracle.com/~smushran/0001-ocfs2-Patch-to-debug-hang-in-dlmglue-when-running-d.patch

BAST:
[368.807757] (2572,dlm_astd,0):ocfs2_blocking_ast:1025 BAST fired for lockres M0000000000000000085e0200000000, blocking 5, level 3 type Meta
[368.807767] (2571,ocfs2dc,0):ocfs2_process_blocked_lock:3839 lockres M0000000000000000085e0200000000 blocked.
[368.807774] (2571,ocfs2dc,0):ocfs2_prepare_downconvert:3232 lock M0000000000000000085e0200000000, new_level = 0, l_blocking = 5
[368.807779] (2571,ocfs2dc,0):ocfs2_downconvert_lock:3252 lock M0000000000000000085e0200000000, level 3 => 0
[368.807799] (2571,ocfs2dc,0):ocfs2_process_blocked_lock:3863 lockres M0000000000000000085e0200000000, requeue = no.

Downconvert AST:
[368.807806] (2572,dlm_astd,0):ocfs2_locking_ast:1069 lock M0000000000000000085e0200000000, action 3, unlock 0

Upconvert AST:
[369.007930] (2572,dlm_astd,0):ocfs2_locking_ast:1069 lock M0000000000000000085e0200000000, action 2, unlock 0

BAST:
[369.007946] (2572,dlm_astd,0):ocfs2_blocking_ast:1025 BAST fired for lockres M0000000000000000085e0200000000, blocking 5, level 3 type Meta
[369.007956] (2571,ocfs2dc,0):ocfs2_process_blocked_lock:3839 lockres M0000000000000000085e0200000000 blocked.
[369.007962] (2571,ocfs2dc,0):ocfs2_prepare_downconvert:3232 lock M0000000000000000085e0200000000, new_level = 0, l_blocking = 5
[369.007967] (2571,ocfs2dc,0):ocfs2_downconvert_lock:3252 lock M0000000000000000085e0200000000, level 3 => 0
[369.007987] (2571,ocfs2dc,0):ocfs2_process_blocked_lock:3863 lockres M0000000000000000085e0200000000, requeue = no.

Downconvert AST:
[369.007994] (2572,dlm_astd,0):ocfs2_locking_ast:1069 lock M0000000000000000085e0200000000, action 3, unlock 0

Upconvert AST:
[369.208048] (2572,dlm_astd,0):ocfs2_locking_ast:1069 lock M0000000000000000085e0200000000, action 2, unlock 0
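The 16 usec figure can be rechecked directly from the dlm_astd timestamps above. A small sketch; the event labels are shorthand taken from the snippet's own headings (action 2 = upconvert AST, action 3 = downconvert AST), and `to_usec` is a hypothetical helper that assumes exactly six fractional digits, as in these printk stamps.

```python
# (timestamp, event) pairs copied from the dlm_astd lines of the trace above.
EVENTS = [
    ("368.807757", "BAST"),
    ("368.807806", "downconvert AST"),
    ("369.007930", "upconvert AST"),
    ("369.007946", "BAST"),
    ("369.007994", "downconvert AST"),
    ("369.208048", "upconvert AST"),
]


def to_usec(ts: str) -> int:
    """Convert a 'seconds.micros' stamp (six fractional digits) to integer usecs."""
    sec, usec = ts.split(".")
    return int(sec) * 1_000_000 + int(usec)


def gaps(events):
    """Return (from_event, to_event, gap_usec) for each consecutive pair."""
    return [
        (a, b, to_usec(t1) - to_usec(t0))
        for (t0, a), (t1, b) in zip(events, events[1:])
    ]


for a, b, us in gaps(EVENTS):
    print(f"{a:>16} -> {b:<16} {us:>7} usec")
```

Integer arithmetic avoids float rounding. Besides the 16 usec gap between the upconvert AST and the next BAST, it shows the downconvert AST followed by the next upconvert AST roughly 200 ms later in both cycles (200124 and 200054 usec).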


Coly Li wrote:
>
> Sunil Mushran Wrote:
>> The full trace is available here.
>> http://oss.oracle.com/~smushran/calltrace_x1
>>
>> So one sees the following block repeated. It shows that the lock is
>> being downconverted from EX to NL but also upconverted presumably to EX.
>>
>>
>> Coly, can you map the pids to the process names?
>>
>
> Hi Sunil,
>
> In the attached trace info, I added current->comm after the pid.
>
> Here are the steps I used to reproduce the blocking:
> - This time I ran only 1 make_panic process on each node.
> - I ran make_panic on x1 first, then on x2.
> - On node x1, after 3-4 files were created, I started the make_panic script on node x2.
> - On node x2, make_panic blocked immediately; no files were created. On node x1,
> make_panic blocked too, after creating 13 files.
> - After waiting several minutes it was still blocked, so I stopped gathering trace info.
>
> Please check the attachment. Thanks.
>