[Ocfs2-devel] dlm stress test hangs OCFS2

Sunil Mushran sunil.mushran at oracle.com
Wed Sep 2 15:01:10 PDT 2009


So the thread is removing the lockres from the list (thus making
the count 0), and then calling ocfs2_process_blocked_lock() which
is adding that lockres back in the list (thus 1).

Trace ocfs2_process_blocked_lock() to see as to why it is putting it
back on the list.

Coly Li wrote:
>
> Sunil Mushran Wrote:
>> Read this thread for some background. There are others like this.
>> http://oss.oracle.com/pipermail/ocfs2-devel/2009-April/004313.html
>>
>> David had run into a similar issue with two nodes. The symptoms were the
>> same. In that case, we were failing to kick the downconvert thread under
>> one situation.
>>
>> Bottomline, the reason for the hang is that a node is not downconverting
>> its lock. It could be a race in dlmglue or something else.
>>
>> The node has a PR and an another nodes wants an EX. Unless the node
>> downconverts
>> to a NL, the master cannot upconvert the other node to EX. Hang. Also,
>> cancel
>> converts are in the mix.
>> [snip]
>> The downcnvt shows 1 lockres is queued. We have to assume it is this one.
>> If not, then we have a bigger problem. Maybe add a quick/dirty hack to dump
>> the lockres in this queue.
>>
>> Maybe we are forgetting to kick it like last time. I did scan the code
>> for that but came up empty handed.
>>
>> To solve this mystery, you have to find out as to why the dc thread is
>> not acting on the lockres. Forget stats. Just add printks in that thread.
>> Starting from say ocfs2_downconvert_thread_do_work().
>>     
>
> I simplified the original perl script to a simple bash script,
> ---------------------------------
> #!/bin/sh
>
> prefix=`hostname`
> i=1
> while [ 1 ];do
> 	f="$prefix"_"$i"
> 	echo $f
> 	touch $f
> 	i=`expr $i + 1`
> 	if [ $i -ge 1000 ];then
> 		i=1
> 		rm -f "$prefix"_*
> 	fi
> done
> ---------------------------------
>
> Run the above script on both nodes can also reproduce the blocking issue.
>
> When the blocking happens, ocfs2_downconvert_thread_do_work() still gets called
> again and again.
>
> I add a printk to display osb->blocked_lock_count before the while(1) loop
> inside ocfs2_downconvert_thread_do_work().
>
> Here is what I observed,
> 1) Before the blocking happens, the
> number sequence is,
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 2
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> (the count could be 1, 0, 2 and in an irregular sequence)
>
> 2) when the blocking happens, the number sequence of osb->blocked_lock_count is
> always like this,
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
> ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
> (all are 0-1-0-1-0-1-... in a regular sequence)
>
> Continue to track...
>
>   




More information about the Ocfs2-devel mailing list