[Ocfs2-devel] dlm stress test hangs OCFS2
Coly Li
coly.li at suse.de
Wed Sep 2 10:11:53 PDT 2009
Sunil Mushran Wrote:
> Read this thread for some background. There are others like this.
> http://oss.oracle.com/pipermail/ocfs2-devel/2009-April/004313.html
>
> David had run into a similar issue with two nodes. The symptoms were the
> same. In that case, we were failing to kick the downconvert thread under
> one situation.
>
> Bottomline, the reason for the hang is that a node is not downconverting
> its lock. It could be a race in dlmglue or something else.
>
> The node has a PR and an another nodes wants an EX. Unless the node
> downconverts
> to a NL, the master cannot upconvert the other node to EX. Hang. Also,
> cancel
> converts are in the mix.
> [snip]
> The downcnvt shows 1 lockres is queued. We have to assume it is this one.
> If not, then we have a bigger problem. Maybe add a quick/dirty hack to dump
> the lockres in this queue.
>
> Maybe we are forgetting to kick it like last time. I did scan the code
> for that but came up empty handed.
>
> To solve this mystery, you have to find out as to why the dc thread is
> not acting on the lockres. Forget stats. Just add printks in that thread.
> Starting from say ocfs2_downconvert_thread_do_work().
I simplified the original perl script to a simple bash script,
---------------------------------
#!/bin/sh
prefix=`hostname`
i=1
while [ 1 ];do
f="$prefix"_"$i"
echo $f
touch $f
i=`expr $i + 1`
if [ $i -ge 1000 ];then
i=1
rm -f "$prefix"_*
fi
done
---------------------------------
Run the above script on both nodes can also reproduce the blocking issue.
When the blocking happens, ocfs2_downconvert_thread_do_work() still gets called
again and again.
I add a printk to display osb->blocked_lock_count before the while(1) loop
inside ocfs2_downconvert_thread_do_work().
Here is what I observed,
1) Before the blocking happens, the
number sequence is,
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 2
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
(the count could be 1, 0, 2 and in an irregular sequence)
2) when the blocking happens, the number sequence of osb->blocked_lock_count is
always like this,
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 0
ocfs2_downconvert_thread_do_work:3725: osb->blocked_lock_count: 1
(all are 0-1-0-1-0-1-... in a regular sequence)
Continue to track...
--
Coly Li
SuSE Labs
More information about the Ocfs2-devel
mailing list