[Ocfs2-users] Tracking down hangs

Andrew Robert Nicols andrew.nicols at luns.net.uk
Thu Jun 3 07:38:30 PDT 2010


On Thu, Jun 03, 2010 at 11:12:49AM +0100, Andrew Robert Nicols wrote:
> What's the best place to start looking for the cause of these hangs? I've
> attached the dmesg output which includes some call traces for hung threads. 
> I have stat_sysdir output though I suspect it's not so relevant. A
> scanlocks output doesn't reveal any busy locks that I can see (unless I'm
> not hitting it at the right time or misreading the output).

I've been attempting to trace the issue and have enabled tracing of DLM and
DLM_THREAD and have hit some interesting results.

On our readonly servers, we ordinarily see the following in the kernel log:

Jun  3 14:59:18 thumper4 kernel: [188257.478242] (6083,3):dlm_mle_release:413 calling mle_release for M00000000000000294ce11700000000, type 0
Jun  3 14:59:18 thumper4 kernel: [188257.478785] (6083,3):dlm_mle_release:413 calling mle_release for O00000000000000294ce11700000000, type 0
Jun  3 14:59:18 thumper4 kernel: [188257.479333] (6083,3):dlm_mle_release:413 calling mle_release for N000000001c8f8525, type 0
Jun  3 14:59:18 thumper4 kernel: [188258.115528] (6083,3):dlm_mle_release:413 calling mle_release for M000000000000002983bf3300000000, type 0
Jun  3 14:59:18 thumper4 kernel: [188258.116071] (6083,3):dlm_mle_release:413 calling mle_release for O000000000000002983bf3300000000, type 0
Jun  3 14:59:18 thumper4 kernel: [188258.118559] (6083,3):dlm_mle_release:413 calling mle_release for N000000001c8f8525, type 0
Jun  3 14:59:20 thumper4 kernel: [188259.606686] (6083,3):dlm_mle_release:413 calling mle_release for M0000000000000029b71e2400000000, type 0
Jun  3 14:59:20 thumper4 kernel: [188259.607000] (6083,3):dlm_mle_release:413 calling mle_release for O0000000000000029b71e2400000000, type 0
Jun  3 14:59:20 thumper4 kernel: [188259.607548] (6083,3):dlm_mle_release:413 calling mle_release for N000000001c8f8525, type 0
Jun  3 14:59:21 thumper4 kernel: [188261.055244] (6083,3):dlm_mle_release:413 calling mle_release for O000000000000001b5d229500000000, type 0
Jun  3 14:59:21 thumper4 kernel: [188261.055838] (6083,3):dlm_mle_release:413 calling mle_release for M000000000000001b5d229500000000, type 0
Jun  3 14:59:21 thumper4 kernel: [188261.055943] (6083,3):dlm_mle_release:413 calling mle_release for N000000001b7ddf5c, type 0

At times when we see a hang, we see the following:

Jun  3 14:59:22 thumper4 kernel: [188261.314168] (6083,3):dlm_proxy_ast_handler:306 lvb: none
Jun  3 14:59:22 thumper4 kernel: [188261.314173] (6083,3):dlm_proxy_ast_handler:308 type=1, blocked_type=5
Jun  3 14:59:22 thumper4 kernel: [188261.314177] (6083,3):dlm_proxy_ast_handler:336 lockres M0000000000000024cb281700000000
Jun  3 14:59:22 thumper4 kernel: [188261.314168] (6904,1):dlmconvert_remote:272 type=3, convert_type=-1, busy=0
Jun  3 14:59:22 thumper4 kernel: [188261.314590] (6083,3):dlm_proxy_ast_handler:306 lvb: none
Jun  3 14:59:22 thumper4 kernel: [188261.314594] (6083,3):dlm_proxy_ast_handler:308 type=0, blocked_type=0
Jun  3 14:59:22 thumper4 kernel: [188261.314597] (6083,3):dlm_proxy_ast_handler:336 lockres M0000000000000024cb281700000000
Jun  3 14:59:22 thumper4 kernel: [188261.314600] (6083,3):dlm_proxy_ast_handler:387 ast: adding to granted list... type=3, convert_type=0
Jun  3 14:59:22 thumper4 kernel: [188261.322638] (27918,0):dlmconvert_remote:272 type=0, convert_type=-1, busy=0
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:306 lvb: get lvb 
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:308 type=0, blocked_type=0
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:336 lockres M0000000000000024cb281700000000
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:387 ast: adding to granted list... type=0, convert_type=3
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_mle_release:413 calling mle_release for M00000000000000299c5c9e00000000, type 0
Jun  3 14:59:23 thumper4 kernel: [188263.044424] (6083,3):dlm_mle_release:413 calling mle_release for O00000000000000299c5c9e00000000, type 0
Jun  3 14:59:23 thumper4 kernel: [188263.046232] (6083,3):dlm_mle_release:413 calling mle_release for N000000000ae7c9ad, type 0
Jun  3 14:59:23 thumper4 kernel: [188263.046884] (6083,3):dlm_mle_release:413 calling mle_release for M0000000000000028df78fd00000000, type 0
Jun  3 14:59:23 thumper4 kernel: [188263.047426] (6083,3):dlm_mle_release:413 calling mle_release for O0000000000000028df78fd00000000, type 0
Jun  3 14:59:23 thumper4 kernel: [188263.048029] (6083,3):dlm_mle_release:413 calling mle_release for N000000001bb0685f, type 0

For this particular hang, catting our file took:
Thu Jun 03 14:59:23 BST 2010 real       0m1.648s

This corresponds pretty well to the entries at:
Jun  3 14:59:22 thumper4 kernel: [188261.314600] (6083,3):dlm_proxy_ast_handler:387 ast: adding to granted list... type=3, convert_type=0
Jun  3 14:59:22 thumper4 kernel: [188261.322638] (27918,0):dlmconvert_remote:272 type=0, convert_type=-1, busy=0
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:306 lvb: get lvb 
Jun  3 14:59:23 thumper4 kernel: [188263.044145] (6083,3):dlm_proxy_ast_handler:308 type=0, blocked_type=0

We see this every time there's a hang and the times correspond in the same kind
of way each time.

Is it worth enabling any other trace bits to see if there's something else
causing this?

Any help with this problem would be much appreciated,

Andrew

-- 
Systems Developer

e: andrew.nicols at luns.net.uk
im: a.nicols at jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 04311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: Digital signature
Url : http://oss.oracle.com/pipermail/ocfs2-users/attachments/20100603/e5990c1e/attachment.bin 


More information about the Ocfs2-users mailing list