[Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node

Mark Fasheh mark.fasheh at oracle.com
Tue Aug 24 13:41:44 CDT 2004


On Mon, Aug 23, 2004 at 10:51:30AM +0800, Chen, Yukun wrote:
> Hi Mark
> 
> I checked the version and found 1364 on both nodes. 
Ok. What messages do you see on "node B" when this happens on A? Is node B
doing anything in particular?

> Also, I attached the test cases for duplicating such bug.
I might've bitten off more than I can chew by asking for that test code :) I
wrote a simple program that writes to one place in a file, ran it twice, and
haven't been able to reproduce the hang yet. Looking through your test
scripts, that seems to be basically what's going on, but please fill me in on
any steps I've missed. I guess I'm looking for an easily reproducible test
case. Does this happen every time you run your test suite, or is it
intermittent?
	--Mark
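
For anyone following along, here is a minimal sketch of the kind of program
described above: two processes on one node overwriting the same offset of the
same file, 100 times each. It is not the test code from this thread. The
function names, the 64-byte payload, and the fsync after each write are
assumptions for illustration; only the offset (1000) and the write count (100)
come from the original report.

```python
import os

OFFSET = 1000      # byte offset mentioned in the original report
ITERATIONS = 100   # writes per process, per the original report


def write_in_place(path, payload, offset=OFFSET, count=ITERATIONS):
    """Overwrite the same region of `path` `count` times."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        for _ in range(count):
            os.pwrite(fd, payload, offset)
            # Assumption: flush every write so it is not left in the page cache.
            os.fsync(fd)
    finally:
        os.close(fd)


def run_writers(path, nprocs=2):
    """Fork `nprocs` children that all write at the same offset.

    Returns the list of child exit codes (0 means the child finished cleanly).
    """
    pids = []
    for i in range(nprocs):
        pid = os.fork()
        if pid == 0:
            # Child: use a distinct fill byte per process (hypothetical choice)
            # so the surviving data shows which writer landed last.
            try:
                write_in_place(path, bytes([ord("A") + i]) * 64)
                os._exit(0)
            except BaseException:
                os._exit(1)
        pids.append(pid)
    # Parent: wait for every child and collect its exit code.
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in pids]
```

Calling run_writers() on a file under the ocfs2 mount point (for example a
file in /ocfs) would approximate step 2 of the report. Whether this alone
triggers the hang is unknown.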

> 
> The steps to run the test case:
> 
> 1. Make sure you have set up the tvs environment.
> 
> 2. Make sure the two test machines can ssh to each other as root without a password.
> 
> 3. Update the variable OCFSDEV in test.config to the device name of your ocfs2 partition.
> 
> 4. Update the variable REMOTE in setup.sh to the remote machine name.
> 
> 5. Make sure you have created the directory /ocfs (I will change this in a later version to an arbitrary directory that the user can set).
> 
> 6. Run "test_filelock.sh".
> 
> Feel free to let me know if you have any problems.
> Thanx.
> 
> Aaron
> 
> -----Original Message-----
> From: Mark Fasheh [mailto:mark.fasheh at oracle.com] 
> Sent: August 21, 2004 1:43
> To: Chen, Yukun
> Cc: ocfs2-devel at oss.oracle.com
> Subject: Re: [Ocfs2-devel] [2.6.6 svn 1364]System hang randomly when writing to the same file from different processes of the same node
> 
> 
> Are all your nodes updated to r1364 btw? That'd make a big difference as the
> voting flags got juggled around a bit (sorry!) Otherwise it looks like it's
> hung doing a TRUNCATE_PAGES message which would be very troubling indeed. If
> both nodes *are* in fact, running 1364, you mind posting your test code up
> so I can give it a try? Thanks,
> 	--Mark
> 
> On Fri, Aug 20, 2004 at 04:24:37PM +0800, Chen, Yukun wrote:
> > Hi all
> >
> > 
> > Steps to duplicate:
> > 
> > 1. Do some operations, such as mkdir and touch, on node A and node B.
> > 
> > 2. On node A, process 1 writes to a file at a specific position (such as
> > offset 1000), 100 times.
> > 
> > 2. Also on node A, at the same time, process 2 writes to the same file at
> > the same position, 100 times.
> > 
> > Repeat steps 1-2 several times; the system will hang with the following
> > messages on node A:
> > 
> >  
> > 
> > state=1, lockid=22765568, flags = 0x1000, asked type = 5 master = 1, state =
> > 0x0, type = 5
> > 
> > (18397) ERROR at /tmp/trunk/src/dlm.c, 461: status = -110
> > 
> > (18397) ERROR at /tmp/trunk/src/vote.c, 910: inode 5558, vote_status=0,
> > vote_state=1, lockid=22765568, flags = 0x1000, asked type = 5 master = 1, state
> > = 0x0, type = 5
> > 
> > ...
> > 
> >  
> > 
> > On node B, dmesg shows this error output:
> > 
> > Call Trace:
> > 
> > recalc_task_prio
> > 
> > schedule
> > 
> > ocfs_comm_process_msg
> > 
> > ocfs_dlm_recv_msg
> > 
> > worker_thread
> > 
> > ocfs_dlm_recv_msg
> > 
> > default_wake_function
> > 
> > ....
> > 
> >  
> > 
> > Any ideas on it? thanx.
> > 
> >  
> > 
> > Aaron
> > 
> > Intel China Software Lab
> > 
> > Tel:  8621-52574545 Ext.1587
> > 
> > E_mail:yukun.chen at intel.com
> > 
> >  
> > 
> 
> > _______________________________________________
> > Ocfs2-devel mailing list
> > Ocfs2-devel at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> --
> Mark Fasheh
> Software Developer, Oracle Corp
> mark.fasheh at oracle.com


--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh at oracle.com

