[Ocfs2-devel] dlm stress test hangs OCFS2

Coly Li coly.li at suse.de
Wed Sep 2 10:11:53 PDT 2009



Sunil Mushran Wrote:
> Read this thread for some background. There are others like this.
> http://oss.oracle.com/pipermail/ocfs2-devel/2009-April/004313.html
> 
> David had run into a similar issue with two nodes. The symptoms were the
> same. In that case, we were failing to kick the downconvert thread under
> one situation.
> 
> Bottom line, the reason for the hang is that a node is not downconverting
> its lock. It could be a race in dlmglue or something else.
> 
> The node has a PR and another node wants an EX. Unless the node downconverts
> to NL, the master cannot upconvert the other node to EX. Hang. Also, cancel
> converts are in the mix.
> [snip]
> The downcnvt shows 1 lockres is queued. We have to assume it is this one.
> If not, then we have a bigger problem. Maybe add a quick/dirty hack to dump
> the lockres in this queue.
> 
> Maybe we are forgetting to kick it like last time. I did scan the code
> for that but came up empty handed.
> 
> To solve this mystery, you have to find out why the dc thread is
> not acting on the lockres. Forget stats. Just add printks in that thread.
> Starting from say ocfs2_downconvert_thread_do_work().

I simplified the original Perl script into a simple shell script:
---------------------------------
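#!/bin/sh

# Create files named <hostname>_<i> in the current directory.
# After 999 files, remove them all and start over from 1.
prefix=$(hostname)
i=1
while true; do
	f="${prefix}_${i}"
	echo "$f"
	touch "$f"
	i=$(expr "$i" + 1)
	if [ "$i" -ge 1000 ]; then
		i=1
		rm -f "${prefix}"_*
	fi
done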
---------------------------------

Running the above script on both nodes also reproduces the blocking issue.
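
One way to notice the hang without watching the terminal is to redirect the
script's output to a log file and check whether that log keeps growing. The
following is only a sketch under assumptions: the function name is_stalled is
hypothetical, and the log is assumed to live on local storage rather than on
the OCFS2 volume, so that reading it cannot block on the same lock.

```shell
# is_stalled LOG SECONDS
# Succeeds (exit status 0) if LOG gained or lost no lines during SECONDS.
# Hypothetical helper, not part of any OCFS2 tooling.
is_stalled() {
	before=$(wc -l < "$1")   # line count at the start of the window
	sleep "$2"
	after=$(wc -l < "$1")    # line count at the end of the window
	[ "$before" -eq "$after" ]
}
```

For example, with the script's output redirected to /tmp/repro.log (a
hypothetical path), running  is_stalled /tmp/repro.log 30 && echo "possible hang"
would flag a 30-second period without progress. The window should be chosen
longer than the time the periodic rm -f normally takes.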

When the blocking happens, ocfs2_downconvert_thread_do_work() still gets called
again and again.

I added a printk to display osb->blocked_lock_count before the while(1) loop
inside ocfs2_downconvert_thread_do_work().

Here is what I observed:
1) Before the blocking happens, the number sequence is:
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 2
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
(the count can be 0, 1, or 2, in an irregular sequence)

2) When the blocking happens, the sequence of osb->blocked_lock_count values
is always like this:
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
(all are 0-1-0-1-0-1-... in a regular sequence)
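
To tell the two patterns above apart without reading the log by eye, the
printk output could be piped through a small filter. This is a rough sketch,
assuming the exact line format shown above (count as the last field); the
function name classify_blc is hypothetical.

```shell
# classify_blc: read printk lines on stdin and report whether the
# osb->blocked_lock_count values strictly alternate between 0 and 1.
# Hypothetical helper; assumes the count is the last field on each line.
classify_blc() {
	awk '
	/osb->blocked_lock_count:/ {
		v = $NF + 0                       # last field is the count
		n++
		if (v > 1) bad = 1                # a value other than 0 or 1
		if (n > 1 && v == prev) bad = 1   # same value twice in a row
		prev = v
	}
	END {
		if (n < 2)    print "too-few-samples"
		else if (bad) print "irregular"
		else          print "alternating"
	}'
}
```

Something like  dmesg | tail -n 50 | classify_blc  would then check only the
most recent samples. A short window can look alternating even before the hang
(the first sample above starts 0-1-0-1-0-1-0), so the window needs to be long
enough to be meaningful.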

I will continue tracking this...

-- 
Coly Li
SuSE Labs
