[Ocfs2-devel] hang with fsdlm

David Teigland teigland at redhat.com
Wed Apr 15 09:57:17 PDT 2009


Using fsdlm/ocfs2_controld.cman, I've rerun the test I've been having problems
with on 2.6.30-rc1.  After running for several minutes in the same directory
on three nodes, the test hangs, and I collect the following information:


bull-01
-------

 3053 S<   [ocfs2dc]                   ocfs2_downconvert_thread
 3054 S<   [dlm_astd]                  dlm_astd
 3055 S<   [dlm_scand]                 dlm_scand
 3056 S<   [dlm_recv/0]                worker_thread
 3057 S<   [dlm_recv/1]                worker_thread
 3058 S<   [dlm_recv/2]                worker_thread
 3059 S<   [dlm_recv/3]                worker_thread
 3060 S<   [dlm_send]                  worker_thread
 3061 S<   [dlm_recoverd]              dlm_recoverd
 3067 S<   [kjournald2]                kjournald2
 3068 S<   [ocfs2cmt]                  ocfs2_commit_thread
 3082 D+   /usr/bin/perl /root/make_pa ocfs2_cluster_lock

make_panic    D 0000000000000002     0  3082   3021
 ffff8801360f1a38 0000000000000046 ffffffffa0240b08 ffff88013703ea40
 ffff88013eb877f8 0000000000004000 00000000001d0a80 000000000000cb10
 0000000000000000 ffffffffa0240af0 ffff88013703ea40 ffff88007ffee340
Call Trace:
 [<ffffffffa022bdbf>] ? dlm_put_lockspace+0x18/0x2b [dlm]
 [<ffffffff804f13e4>] schedule+0x9/0x1d
 [<ffffffff804f171d>] schedule_timeout+0x24/0x15e
 [<ffffffff804f1559>] ? wait_for_common+0x3c/0x123
 [<ffffffff804f15ec>] wait_for_common+0xcf/0x123
 [<ffffffff8023616f>] ? default_wake_function+0x0/0xf
 [<ffffffff804f16ca>] wait_for_completion+0x18/0x1a
 [<ffffffffa02d3755>] ocfs2_cluster_lock+0x91f/0x93e [ocfs2]
 [<ffffffff804f387f>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa02e59fb>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
 [<ffffffffa02d750c>] ocfs2_inode_lock_full+0x180/0xccb [ocfs2]
 [<ffffffff8025d97f>] ? print_lock_contention_bug+0x1e/0x110
 [<ffffffffa02f0210>] ocfs2_unlink+0x1a3/0xa6e [ocfs2]
 [<ffffffff802c3686>] ? vfs_unlink+0x95/0x110
 [<ffffffff802c36af>] vfs_unlink+0xbe/0x110
 [<ffffffff802c5d46>] do_unlinkat+0xcd/0x15d
 [<ffffffff802be708>] ? sys_newlstat+0x31/0x3c
 [<ffffffff804f34fc>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff802c5de7>] sys_unlink+0x11/0x13
 [<ffffffff8020baeb>] system_call_fastpath+0x16/0x1b

    Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50
  Label: 
    Volume => State: 2  Flags: 0x0
     Sizes => Block: 4096  Cluster: 4096
  Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
     Mount => Opts: 0x0  AtimeQuanta: 60
   Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
  DownCnvt => Pid: 3053  Count: 0  WakeSeq: 55892  WorkSeq: 55892
  Recovery => Pid: -1  Nodes: None
    Commit => Pid: 3068  Interval: 0  Needs: 1
   Journal => State: 1  NumTxns: 2  TxnId: 1136
     Stats => GlobalAllocs: 3  LocalAllocs: 915  SubAllocs: 2938  LAWinMoves: 1 
 SAExtends: 2
LocalAlloc => State: 1  Descriptor: 2032128  Size: 2048 bits  Default: 2048 bits
     Steal => Slot: -1  NumStolen: 0
     Slots => Num     RecoGen
            *   0           3
                1           3
                2           3
                3           3

Lockres: M0000000000000000046e2a00000000  Mode: Protected Read
Flags: Initialized Attached Busy
RO Holders: 0  EX Holders: 0
Pending Action: Convert  Pending Unlock Action: None
Requested Mode: Exclusive  Blocking Mode: No Lock
PR > Gets: 982  Fails: 0    Waits (usec) Total: 637775  Max: 79972
EX > Gets: 69  Fails: 0    Waits (usec) Total: 723745  Max: 40985
Disk Refreshes: 0

Resource len 31  "M0000000000000000046e2a00000000"
Master           
LVB len 64 seq 123
05 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 
12 75 a1 17 e0 68 d0 10 12 79 39 dd b8 79 ee b4 
12 79 39 dd b8 79 ee b4 00 00 00 00 00 00 20 00 
41 ed 00 02 00 00 00 00 82 92 d2 7a 00 00 00 00
Granted
01ce0003 PR      Remote:   4 03d80001          
014a0002 PR      Remote:   2 00d70001          
Convert
038f0001 PR (EX)                               


bull-02
-------

 2759 S<   [ocfs2dc]                   ocfs2_downconvert_thread
 2760 S<   [dlm_astd]                  dlm_astd
 2761 S<   [dlm_scand]                 dlm_scand
 2762 S<   [dlm_recv/0]                worker_thread
 2763 S<   [dlm_recv/1]                worker_thread
 2764 S<   [dlm_recv/2]                worker_thread
 2765 S<   [dlm_recv/3]                worker_thread
 2766 S<   [dlm_send]                  worker_thread
 2767 S<   [dlm_recoverd]              dlm_recoverd
 2779 S<   [kjournald2]                kjournald2
 2780 S<   [ocfs2cmt]                  ocfs2_commit_thread
 2783 D+   /usr/bin/perl /root/make_pa ocfs2_cluster_lock

make_panic    D 0000000000000002     0  2783   2727
 ffff88013f497958 0000000000000046 ffffffffa024ab08 ffff88013ed6e540
 ffff88007e9557f8 0000000000004000 00000000001d0a80 000000000000cb10
 0000000000000000 ffffffffa024aaf0 ffff88013ed6e540 ffff88007fffe440
Call Trace:
 [<ffffffffa0235dbf>] ? dlm_put_lockspace+0x18/0x2b [dlm]
 [<ffffffff804f13e4>] schedule+0x9/0x1d
 [<ffffffff804f171d>] schedule_timeout+0x24/0x15e
 [<ffffffff804f1559>] ? wait_for_common+0x3c/0x123
 [<ffffffff804f15ec>] wait_for_common+0xcf/0x123
 [<ffffffff8023616f>] ? default_wake_function+0x0/0xf
 [<ffffffff804f16ca>] wait_for_completion+0x18/0x1a
 [<ffffffffa02dd755>] ocfs2_cluster_lock+0x91f/0x93e [ocfs2]
 [<ffffffff8025d97f>] ? print_lock_contention_bug+0x1e/0x110
 [<ffffffff804f387f>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa02ef9fb>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
 [<ffffffffa02e150c>] ocfs2_inode_lock_full+0x180/0xccb [ocfs2]
 [<ffffffffa02d0198>] ? ocfs2_dentry_revalidate+0x194/0x268 [ocfs2]
 [<ffffffffa02e7b97>] ocfs2_permission+0x70/0x167 [ocfs2]
 [<ffffffff802c285f>] inode_permission+0x69/0x97
 [<ffffffff802c4752>] __link_path_walk+0x134/0xdea
 [<ffffffff802c560e>] path_walk+0x69/0xd4
 [<ffffffff802c58a7>] do_path_lookup+0x187/0x1df
 [<ffffffff802bc094>] ? get_empty_filp+0xfc/0x188
 [<ffffffff802c59bd>] path_lookup_open+0x5a/0x9b
 [<ffffffff802c6588>] do_filp_open+0xb6/0x899
 [<ffffffff802cf62e>] ? alloc_fd+0x3a/0x12e
 [<ffffffff802cf711>] ? alloc_fd+0x11d/0x12e
 [<ffffffff802b92aa>] do_sys_open+0x53/0xd3
 [<ffffffff802b9353>] sys_open+0x1b/0x1d
 [<ffffffff8020baeb>] system_call_fastpath+0x16/0x1b

    Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50  Label: 
    Volume => State: 2  Flags: 0x0
     Sizes => Block: 4096  Cluster: 4096
  Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
     Mount => Opts: 0x0  AtimeQuanta: 60
   Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
  DownCnvt => Pid: 2759  Count: 1  WakeSeq: 14838  WorkSeq: 14838
  Recovery => Pid: -1  Nodes: None
    Commit => Pid: 2780  Interval: 0  Needs: 1
   Journal => State: 1  NumTxns: 8  TxnId: 928
     Stats => GlobalAllocs: 3  LocalAllocs: 513  SubAllocs: 1556  LAWinMoves: 1  SAExtends: 2
LocalAlloc => State: 1  Descriptor: 2064384  Size: 2048 bits  Default: 2048 bits
     Steal => Slot: -1  NumStolen: 0
     Slots => Num     RecoGen
                0           3
            *   1           3
                2           3
                3           3

Lockres: M0000000000000000046e2a00000000  Mode: Protected Read
Flags: Initialized Attached Blocked Needs Refresh Queued
RO Holders: 0  EX Holders: 0
Pending Action: None  Pending Unlock Action: None
Requested Mode: Protected Read  Blocking Mode: Exclusive
PR > Gets: 463  Fails: 0    Waits (usec) Total: 1052625  Max: 41985
EX > Gets: 37  Fails: 0    Waits (usec) Total: 990652  Max: 79971
Disk Refreshes: 0

Resource len 31  "M0000000000000000046e2a00000000"
Local 1          
Granted
00d70001 PR      Master:   1 014a0002          

dlm: 8A77044F9D98416FAB240B44B1843EDA: addwait d70001 cur 2 overlap 4 count 2 f 100000
dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 convert_reply zap overlap_cancel
dlm: 8A77044F9D98416FAB240B44B1843EDA: remwait d70001 cancel_reply wait_type 0

messages mean a cancel operation occurs while a remote convert is in progress,
then the convert reply comes back indicating that the convert was successful,
so the cancel is a moot point and being ignored.


bull-04
-------

 4114 S<   [ocfs2dc]                   ocfs2_downconvert_thread
 4115 S<   [dlm_astd]                  dlm_astd
 4116 S<   [dlm_scand]                 dlm_scand
 4117 S<   [dlm_recv/0]                worker_thread
 4118 S<   [dlm_recv/1]                worker_thread
 4119 S<   [dlm_recv/2]                worker_thread
 4120 S<   [dlm_recv/3]                worker_thread
 4121 S<   [dlm_send]                  worker_thread
 4122 S<   [dlm_recoverd]              dlm_recoverd
 4136 S<   [kjournald2]                kjournald2
 4137 S<   [ocfs2cmt]                  ocfs2_commit_thread
 4142 D+   /usr/bin/perl /root/make_pa ocfs2_wait_for_mask

make_panic    D 0000000000000002     0  4142   4086
 ffff88013147b998 0000000000000046 ffff88013147b938 ffffffff8025aea3
 0000000000000000 ffff88013142acc0 ffff88007ffee340 ffff88013142b078
 0000000100078e31 0000000000000246 0000000000000292 ffffffffa02b102b
Call Trace:
 [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
 [<ffffffffa02b102b>] ? dlm_put_lockspace+0x18/0x2b [dlm]
 [<ffffffff804e6cfd>] schedule+0x9/0x1d
 [<ffffffff804e702f>] schedule_timeout+0x28/0x176
 [<ffffffff804e6dce>] ? wait_for_common+0x3d/0x11e
 [<ffffffff804e6e57>] wait_for_common+0xc6/0x11e
 [<ffffffff80232c89>] ? default_wake_function+0x0/0xf
 [<ffffffffa03457be>] ? ocfs2_cluster_lock+0xf2/0x8e9 [ocfs2]
 [<ffffffff804e6f39>] wait_for_completion+0x18/0x1a
 [<ffffffffa0345563>] ocfs2_wait_for_mask+0x15/0x23 [ocfs2]
 [<ffffffffa0345f90>] ocfs2_cluster_lock+0x8c4/0x8e9 [ocfs2]
 [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
 [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa03592b8>] ? ocfs2_recovery_completed+0x34/0x3c [ocfs2]
 [<ffffffffa03479d0>] ocfs2_inode_lock_full+0x181/0xcbc [ocfs2]
 [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
 [<ffffffffa033a3c7>] ? ocfs2_dentry_revalidate+0x194/0x26c [ocfs2]
 [<ffffffffa03522a0>] ocfs2_permission+0x70/0x166 [ocfs2]
 [<ffffffff802bdec4>] inode_permission+0x69/0x97
 [<ffffffff802bfe9e>] __link_path_walk+0x13f/0xced
 [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
 [<ffffffff802c0a9a>] path_walk+0x4e/0x97
 [<ffffffff802c0c84>] do_path_lookup+0x118/0x175
 [<ffffffff802c1aff>] do_filp_open+0xde/0x819
 [<ffffffff8025aea3>] ? print_lock_contention_bug+0x1b/0xe1
 [<ffffffff802bfae7>] ? getname+0x26/0x1ac
 [<ffffffff802ca4f2>] ? alloc_fd+0x33/0x11e
 [<ffffffff804e9001>] ? _spin_unlock+0x26/0x2a
 [<ffffffff802ca5ce>] ? alloc_fd+0x10f/0x11e
 [<ffffffff802b4a30>] do_sys_open+0x53/0xd3
 [<ffffffff802b4ad9>] sys_open+0x1b/0x1d
 [<ffffffff8020baab>] system_call_fastpath+0x16/0x1b

    Device => Id: 253,2  Uuid: 8A77044F9D98416FAB240B44B1843EDA  Gen: 0x7622CA50  Label: 
    Volume => State: 2  Flags: 0x0
     Sizes => Block: 4096  Cluster: 4096
  Features => Compat: 0x3  Incompat: 0x1D0  ROcompat: 0x1
     Mount => Opts: 0x0  AtimeQuanta: 60
   Cluster => Stack: cman  Name: 8A77044F9D98416FAB240B44B1843EDA  Version: 1.0
  DownCnvt => Pid: 4114  Count: 1  WakeSeq: 14213  WorkSeq: 14213
  Recovery => Pid: -1  Nodes: None
    Commit => Pid: 4137  Interval: 0  Needs: 1
   Journal => State: 1  NumTxns: 4  TxnId: 851
     Stats => GlobalAllocs: 2  LocalAllocs: 414  SubAllocs: 1396  LAWinMoves: 1  SAExtends: 1
LocalAlloc => State: 1  Descriptor: 2096640  Size: 2048 bits  Default: 2048 bits
     Steal => Slot: -1  NumStolen: 0
     Slots => Num     RecoGen
                0           3
                1           3
                2           3
            *   3           3

no M0000000000000000046e2a00000000 exists in locking_state

Resource len 31  "M0000000000000000046e2a00000000"
Local 1          
Granted
03d80001 PR      Master:   1 01ce0003          

dlm: 8A77044F9D98416FAB240B44B1843EDA: validate_unlock_args -16 3d80001 0 10c 2 0 M0000000000000000046e2a00000000

message means the lock is granted, i.e. the convert being canceled already
completed, so the cancel operation is being ignored.




More information about the Ocfs2-devel mailing list