[Ocfs2-devel] hang with fsdlm

Sunil Mushran sunil.mushran at oracle.com
Wed Apr 15 19:07:19 PDT 2009


David,

So it's the same issue. bull-02 knows it is blocking an EX, has no
holders, but is not doing anything to downconvert.

Coly Li is working on this same issue on novellbz#492055.
https://bugzilla.novell.com/show_bug.cgi?id=492055

He is unable to reproduce with the sles11 kernel. The dlm bits should
be the same as mainline. I think. I've asked him to run with mainline.

Do you have a novell bz account?

Sunil

David Teigland wrote:
> Using fsdlm/ocfs2_controld.cman, I've rerun the test I've been having problems
> with on 2.6.30-rc1.  After running for several minutes in the same directory
> on three nodes, the test hangs, and I collect the following information:
>
>
> bull-01
> -------
>
>  3053 S<   [ocfs2dc]                   ocfs2_downconvert_thread
>  3054 S<   [dlm_astd]                  dlm_astd
>  3055 S<   [dlm_scand]                 dlm_scand
>  3056 S<   [dlm_recv/0]                worker_thread
>  3057 S<   [dlm_recv/1]                worker_thread
>  3058 S<   [dlm_recv/2]                worker_thread
>  3059 S<   [dlm_recv/3]                worker_thread
>  3060 S<   [dlm_send]                  worker_thread
>  3061 S<   [dlm_recoverd]              dlm_recoverd
>  3067 S<   [kjournald2]                kjournald2
>  3068 S<   [ocfs2cmt]                  ocfs2_commit_thread
>  3082 D+   /usr/bin/perl /root/make_pa ocfs2_cluster_lock
>
> make_panic    D 0000000000000002     0  3082   3021
>  ffff8801360f1a38 0000000000000046 ffffffffa0240b08 ffff88013703ea40
>  ffff88013eb877f8 0000000000004000 00000000001d0a80 000000000000cb10
>  0000000000000000 ffffffffa0240af0 ffff88013703ea40 ffff88007ffee340
> Call Trace:
>  [<ffffffffa022bdbf>] ? dlm_put_lockspace+0x18/0x2b [dlm]
>  [<ffffffff804f13e4>] schedule+0x9/0x1d
>  [<ffffffff804f171d>] schedule_timeout+0x24/0x15e
>  [<ffffffff804f1559>] ? wait_for_common+0x3c/0x123
>  [<ffffffff804f15ec>] wait_for_common+0xcf/0x123
>  [<ffffffff8023616f>] ? default_wake_function+0x0/0xf
>  [<ffffffff804f16ca>] wait_for_completion+0x18/0x1a
>  [<ffffffffa02d3755>] ocfs2_cluster_lock+0x91f/0x93e [ocfs2]
>  [<ffffffff804f387f>] ? _spin_unlock+0x26/0x2a
>  [<ffffffffa02e59fb>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
>  [<ffffffffa02d750c>] ocfs2_inode_lock_full+0x180/0xccb [ocfs2]
>  [<ffffffff8025d97f>] ? print_lock_contention_bug+0x1e/0x110
>  [<ffffffffa02f0210>] ocfs2_unlink+0x1a3/0xa6e [ocfs2]
>  [<ffffffff802c3686>] ? vfs_unlink+0x95/0x110
>  [<ffffffff802c36af>] vfs_unlink+0xbe/0x110
>  [<ffffffff802c5d46>] do_unlinkat+0xcd/0x15d
>  [<ffffffff802be708>] ? sys_newlstat+0x31/0x3c
>  [<ffffffff804f34fc>] ? lockdep_sys_exit_thunk+0x35/0x67
>  [<ffffffff802c5de7>] sys_unlink+0x11/0x13
>  [<ffffffff8020baeb>] system_call_fastpath+0x16/0x1b
>
>     Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50
>   Label: 
>     Volume => State: 2  Flags: 0x0
>      Sizes => Block: 4096  Cluster: 4096
>   Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
>      Mount => Opts: 0x0  AtimeQuanta: 60
>    Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
>   DownCnvt => Pid: 3053  Count: 0  WakeSeq: 55892  WorkSeq: 55892
>   Recovery => Pid: -1  Nodes: None
>     Commit => Pid: 3068  Interval: 0  Needs: 1
>    Journal => State: 1  NumTxns: 2  TxnId: 1136
>      Stats => GlobalAllocs: 3  LocalAllocs: 915  SubAllocs: 2938  LAWinMoves: 1 
>  SAExtends: 2
> LocalAlloc => State: 1  Descriptor: 2032128  Size: 2048 bits  Default: 2048 bits
>      Steal => Slot: -1  NumStolen: 0
>      Slots => Num     RecoGen
>             *   0           3
>                 1           3
>                 2           3
>                 3           3
>
> Lockres: M0000000000000000046e2a00000000  Mode: Protected Read
> Flags: Initialized Attached Busy
> RO Holders: 0  EX Holders: 0
> Pending Action: Convert  Pending Unlock Action: None
> Requested Mode: Exclusive  Blocking Mode: No Lock
> PR > Gets: 982  Fails: 0    Waits (usec) Total: 637775  Max: 79972
> EX > Gets: 69  Fails: 0    Waits (usec) Total: 723745  Max: 40985
> Disk Refreshes: 0
>
> Resource len 31  "M0000000000000000046e2a00000000"
> Master           
> LVB len 64 seq 123
> 05 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 
> 12 75 a1 17 e0 68 d0 10 12 79 39 dd b8 79 ee b4 
> 12 79 39 dd b8 79 ee b4 00 00 00 00 00 00 20 00 
> 41 ed 00 02 00 00 00 00 82 92 d2 7a 00 00 00 00
> Granted
> 01ce0003 PR      Remote:   4 03d80001          
> 014a0002 PR      Remote:   2 00d70001          
> Convert
> 038f0001 PR (EX)                               
>
>
> bull-02
> -------
>
>  2759 S<   [ocfs2dc]                   ocfs2_downconvert_thread
>  2760 S<   [dlm_astd]                  dlm_astd
>  2761 S<   [dlm_scand]                 dlm_scand
>  2762 S<   [dlm_recv/0]                worker_thread
>  2763 S<   [dlm_recv/1]                worker_thread
>  2764 S<   [dlm_recv/2]                worker_thread
>  2765 S<   [dlm_recv/3]                worker_thread
>  2766 S<   [dlm_send]                  worker_thread
>  2767 S<   [dlm_recoverd]              dlm_recoverd
>  2779 S<   [kjournald2]                kjournald2
>  2780 S<   [ocfs2cmt]                  ocfs2_commit_thread
>  2783 D+   /usr/bin/perl /root/make_pa ocfs2_cluster_lock
>
> make_panic    D 0000000000000002     0  2783   2727
>  ffff88013f497958 0000000000000046 ffffffffa024ab08 ffff88013ed6e540
>  ffff88007e9557f8 0000000000004000 00000000001d0a80 000000000000cb10
>  0000000000000000 ffffffffa024aaf0 ffff88013ed6e540 ffff88007fffe440
> Call Trace:
>  [<ffffffffa0235dbf>] ? dlm_put_lockspace+0x18/0x2b [dlm]
>  [<ffffffff804f13e4>] schedule+0x9/0x1d
>  [<ffffffff804f171d>] schedule_timeout+0x24/0x15e
>  [<ffffffff804f1559>] ? wait_for_common+0x3c/0x123
>  [<ffffffff804f15ec>] wait_for_common+0xcf/0x123
>  [<ffffffff8023616f>] ? default_wake_function+0x0/0xf
>  [<ffffffff804f16ca>] wait_for_completion+0x18/0x1a
>  [<ffffffffa02dd755>] ocfs2_cluster_lock+0x91f/0x93e [ocfs2]
>  [<ffffffff8025d97f>] ? print_lock_contention_bug+0x1e/0x110
>  [<ffffffff804f387f>] ? _spin_unlock+0x26/0x2a
>  [<ffffffffa02ef9fb>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
>  [<ffffffffa02e150c>] ocfs2_inode_lock_full+0x180/0xccb [ocfs2]
>  [<ffffffffa02d0198>] ? ocfs2_dentry_revalidate+0x194/0x268 [ocfs2]
>  [<ffffffffa02e7b97>] ocfs2_permission+0x70/0x167 [ocfs2]
>  [<ffffffff802c285f>] inode_permission+0x69/0x97
>  [<ffffffff802c4752>] __link_path_walk+0x134/0xdea
>  [<ffffffff802c560e>] path_walk+0x69/0xd4
>  [<ffffffff802c58a7>] do_path_lookup+0x187/0x1df
>  [<ffffffff802bc094>] ? get_empty_filp+0xfc/0x188
>  [<ffffffff802c59bd>] path_lookup_open+0x5a/0x9b
>  [<ffffffff802c6588>] do_filp_open+0xb6/0x899
>  [<ffffffff802cf62e>] ? alloc_fd+0x3a/0x12e
>  [<ffffffff802cf711>] ? alloc_fd+0x11d/0x12e
>  [<ffffffff802b92aa>] do_sys_open+0x53/0xd3
>  [<ffffffff802b9353>] sys_open+0x1b/0x1d
>  [<ffffffff8020baeb>] system_call_fastpath+0x16/0x1b
>
>     Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50  Label: 
>     Volume => State: 2  Flags: 0x0
>      Sizes => Block: 4096  Cluster: 4096
>   Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
>      Mount => Opts: 0x0  AtimeQuanta: 60
>    Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
>   DownCnvt => Pid: 2759  Count: 1  WakeSeq: 14838  WorkSeq: 14838
>   Recovery => Pid: -1  Nodes: None
>     Commit => Pid: 2780  Interval: 0  Needs: 1
>    Journal => State: 1  NumTxns: 8  TxnId: 928
>      Stats => GlobalAllocs: 3  LocalAllocs: 513  SubAllocs: 1556  LAWinMoves: 1  SAExtends: 2
> LocalAlloc => State: 1  Descriptor: 2064384  Size: 2048 bits  Default: 2048 bits
>      Steal => Slot: -1  NumStolen: 0
>      Slots => Num     RecoGen
>                 0           3
>             *   1           3
>                 2           3
>                 3           3
>
> Lockres: M0000000000000000046e2a00000000  Mode: Protected Read
> Flags: Initialized Attached Blocked Needs Refresh Queued
> RO Holders: 0  EX Holders: 0
> Pending Action: None  Pending Unlock Action: None
> Requested Mode: Protected Read  Blocking Mode: Exclusive
> PR > Gets: 463  Fails: 0    Waits (usec) Total: 1052625  Max: 41985
> EX > Gets: 37  Fails: 0    Waits (usec) Total: 990652  Max: 79971
> Disk Refreshes: 0
>
> Resource len 31  "M0000000000000000046e2a00000000"
> Local 1          
> Granted
> 00d70001 PR      Master:   1 014a0002          
>
> dlm: 8A77044F9D98416FAB240B44B1843EDA: addwait d70001 cur 2 overlap 4 count 2 f 100000
> dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 convert_reply zap overlap_cancel
> dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 cancel_reply wait_type 0
>
> messages mean a cancel operation occurs while a remote convert is in progress,
> then the convert reply comes back indicating that the convert was successful,
> so the cancel is a moot point and being ignored.
>
>
> bull-04
> -------
>
>  4114 S<   [ocfs2dc]                   ocfs2_downconvert_thread
>  4115 S<   [dlm_astd]                  dlm_astd
>  4116 S<   [dlm_scand]                 dlm_scand
>  4117 S<   [dlm_recv/0]                worker_thread
>  4118 S<   [dlm_recv/1]                worker_thread
>  4119 S<   [dlm_recv/2]                worker_thread
>  4120 S<   [dlm_recv/3]                worker_thread
>  4121 S<   [dlm_send]                  worker_thread
>  4122 S<   [dlm_recoverd]              dlm_recoverd
>  4136 S<   [kjournald2]                kjournald2
>  4137 S<   [ocfs2cmt]                  ocfs2_commit_thread
>  4142 D+   /usr/bin/perl /root/make_pa ocfs2_wait_for_mask
>
> make_panic    D 0000000000000002     0  4142   4086
>  ffff88013147b998 0000000000000046 ffff88013147b938 ffffffff8025aea3
>  0000000000000000 ffff88013142acc0 ffff88007ffee340 ffff88013142b078
>  0000000100078e31 0000000000000246 0000000000000292 ffffffffa02b102b
> Call Trace:
>  [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
>  [<ffffffffa02b102b>] ? dlm_put_lockspace+0x18/0x2b [dlm]
>  [<ffffffff804e6cfd>] schedule+0x9/0x1d
>  [<ffffffff804e702f>] schedule_timeout+0x28/0x176
>  [<ffffffff804e6dce>] ? wait_for_common+0x3d/0x11e
>  [<ffffffff804e6e57>] wait_for_common+0xc6/0x11e
>  [<ffffffff80232c89>] ? default_wake_function+0x0/0xf
>  [<ffffffffa03457be>] ? ocfs2_cluster_lock+0xf2/0x8e9 [ocfs2]
>  [<ffffffff804e6f39>] wait_for_completion+0x18/0x1a
>  [<ffffffffa0345563>] ocfs2_wait_for_mask+0x15/0x23 [ocfs2]
>  [<ffffffffa0345f90>] ocfs2_cluster_lock+0x8c4/0x8e9 [ocfs2]
>  [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
>  [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
>  [<ffffffffa03592b8>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
>  [<ffffffffa03479d0>] ocfs2_inode_lock_full+0x181/0xcbc [ocfs2]
>  [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
>  [<ffffffffa033a3c7>] ? ocfs2_dentry_revalidate+0x194/0x26c [ocfs2]
>  [<ffffffffa03522a0>] ocfs2_permission+0x70/0x166 [ocfs2]
>  [<ffffffff802bdec4>] inode_permission+0x69/0x97
>  [<ffffffff802bfe9e>] __link_path_walk+0x13f/0xced
>  [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
>  [<ffffffff802c0a9a>] path_walk+0x4e/0x97
>  [<ffffffff802c0c84>] do_path_lookup+0x118/0x175
>  [<ffffffff802c1aff>] do_filp_open+0xde/0x819
>  [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
>  [<ffffffff802bfae7>] ? getname+0x26/0x1ac
>  [<ffffffff802ca4f2>] ? alloc_fd+0x33/0x11e
>  [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
>  [<ffffffff802ca5ce>] ? alloc_fd+0x10f/0x11e
>  [<ffffffff802b4a30>] do_sys_open+0x53/0xd3
>  [<ffffffff802b4ad9>] sys_open+0x1b/0x1d
>  [<ffffffff8020baab>] system_call_fastpath+0x16/0x1b
>
>     Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50  Label: 
>     Volume => State: 2  Flags: 0x0
>      Sizes => Block: 4096  Cluster: 4096
>   Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
>      Mount => Opts: 0x0  AtimeQuanta: 60
>    Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
>   DownCnvt => Pid: 4114  Count: 1  WakeSeq: 14213  WorkSeq: 14213
>   Recovery => Pid: -1  Nodes: None
>     Commit => Pid: 4137  Interval: 0  Needs: 1
>    Journal => State: 1  NumTxns: 4  TxnId: 851
>      Stats => GlobalAllocs: 2  LocalAllocs: 414  SubAllocs: 1396  LAWinMoves: 1  SAExtends: 1
> LocalAlloc => State: 1  Descriptor: 2096640  Size: 2048 bits  Default: 2048 bits
>      Steal => Slot: -1  NumStolen: 0
>      Slots => Num     RecoGen
>                 0           3
>                 1           3
>                 2           3
>             *   3           3
>
> no M0000000000000000046e2a00000000 exists in locking_state
>
> Resource len 31  "M0000000000000000046e2a00000000"
> Local 1          
> Granted
> 03d80001 PR      Master:   1 01ce0003          
>
> dlm: 8A77044F9D98416FAB240B44B1843EDA: validate_unlock_args -16 3d80001 0 10c 2 0 M0000000000000000046e2a00000000
>
> message means the lock is granted, i.e. the convert being canceled already
> completed, so the cancel operation is being ignored.
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-devel
>   




More information about the Ocfs2-devel mailing list