[Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock

Jan Kara <jack at suse.cz>
Thu Jan 16 06:02:44 PST 2014


On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote:
> On 01/15/2014 08:47 PM, Jan Kara wrote:
> >On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
> >>On 01/15/2014 09:53 AM, Jan Kara wrote:
> >>>On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
> >>>>On 01/15/2014 07:33 AM, Jan Kara wrote:
> >>>>>On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
> >>>>>>The following patches are reverted in this patch because they caused a
> >>>>>>regression in unlink() calls.
> >>>>>>
> >>>>>>ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
> >>>>>>of dentry lock to ocfs2_wq
> >>>>>>f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on umount
> >>>>>>5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
> >>>>>>ocfs2_kill_sb on a failed mount
> >>>>>>
> >>>>>>The regression is caused because these patches delay the iput() in case
> >>>>>>of dentry unlocks. This also delays the unlocking of the open lockres.
> >>>>>>The open lock resource is required to test whether the inode can be wiped
> >>>>>>from disk or not. When the deleting node does not get the open lock, it
> >>>>>>marks the inode as orphaned (even though it is not in use by another
> >>>>>>node/process) and causes a journal checkpoint. This delays operations
> >>>>>>following the inode eviction. This also moves the inode to the orphan
> >>>>>>directory, which further causes more I/O and a lot of unnecessary orphans.
> >>>>>>
> >>>>>>As for the fix, I tried reproducing the problem by using quotas and enabling
> >>>>>>lockdep, but the issue could not be reproduced.
> >>>>>   So I was thinking about this for a while trying to remember... What it
> >>>>>really boils down to is whether it is OK to call ocfs2_delete_inode() from
> >>>>>the DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes
> >>>>>all kinds of interesting cluster locks, meaning that the downconvert thread
> >>>>>can block waiting to acquire some cluster lock. That seems like asking for
> >>>>>trouble to me (i.e., what if the node holding the lock we need from the
> >>>>>downconvert thread needs a lock from us first?), but maybe it is OK these
> >>>>>days.
> >>>>>
> >>>>
> >>>>The only lock it tries to take is the "inode lock" resource, which
> >>>>seems to be fine to take.
> >>>   Why is it obviously fine? Some other node might be holding the inode lock
> >>>and still be writing to the inode. If that write needs a cluster lock for
> >>>block allocation and that allocation cluster lock is currently held by our
> >>>node, we are screwed. Aren't we?
> >>>
> >>
> >>No. If another thread is using the allocation cluster lock, that implies
> >>we already have the inode lock. Cluster locks are also taken in a
> >>strict order to avoid ABBA locking.
> >   I agree with what you wrote above but it's not exactly what I'm talking
> >about. What seems to be possible is:
> >NODE 1					NODE2
> >holds dentry lock for 'foo'
> >holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
> >					dquot_initialize(bar)
> >					  ocfs2_dquot_acquire()
> >					    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
> >					    ...
> >downconvert thread (triggered from another
> >node or a different process from NODE2)
> >   ocfs2_dentry_post_unlock()
> >     ...
> >     iput(foo)
> >       ocfs2_evict_inode(foo)
> >         ocfs2_clear_inode(foo)
> >           dquot_drop(inode)
> >	    ...
> >             ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
> >              - blocks
> >					    finds we need more space in
> >					    quota file
> >					    ...
> >					    ocfs2_extend_no_holes()
> >					      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
> >						- deadlocks waiting for
> >						  downconvert thread
> >
> 
> Thanks. This explains the situation much better. What's stopping
> NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE lock? (to break the
> hold-and-wait condition)
  The fact that to release the lock, it has to be downconverted. And there
is only one downconvert thread (ocfs2dc), which is busy waiting for the
USER_QUOTA_SYSTEM_INODE lock...
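
To make the cycle concrete, here is a minimal user-space model of the same
hold-and-wait pattern. It is only a sketch, not ocfs2 code: the two pthread
mutexes stand in for the two cluster locks, and the thread names are purely
illustrative (holding bitmap_lock inside the "downconvert" thread stands in
for "NODE1 holds the GLOBAL_BITMAP_SYSTEM_INODE lock and only its single
ocfs2dc thread can drop it").

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t bitmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t quota_lock  = PTHREAD_MUTEX_INITIALIZER;

/* NODE1's ocfs2dc: iput() -> clear_inode() -> dquot_drop() needs quota_lock */
static void *downconvert_thread(void *arg)
{
	pthread_mutex_lock(&bitmap_lock);	/* NODE1 already holds this lock  */
	sleep(1);				/* let NODE2 take quota_lock first */
	pthread_mutex_lock(&quota_lock);	/* blocks: NODE2 holds it         */
	pthread_mutex_unlock(&quota_lock);
	pthread_mutex_unlock(&bitmap_lock);	/* the downconvert that never runs */
	return NULL;
}

/* NODE2's quota path: holds quota_lock, then must extend the quota file */
static void *quota_extend_thread(void *arg)
{
	pthread_mutex_lock(&quota_lock);	/* ocfs2_dquot_acquire()          */
	sleep(2);				/* make sure ocfs2dc is blocked   */
	pthread_mutex_lock(&bitmap_lock);	/* ocfs2_extend_no_holes(): blocks */
	pthread_mutex_unlock(&bitmap_lock);
	pthread_mutex_unlock(&quota_lock);
	return NULL;
}

int main(void)
{
	pthread_t dc, quota;

	pthread_create(&dc, NULL, downconvert_thread, NULL);
	pthread_create(&quota, NULL, quota_extend_thread, NULL);
	pthread_join(dc, NULL);			/* never returns: deadlock        */
	pthread_join(quota, NULL);
	puts("unreachable");
	return 0;
}

Built with -pthread, this hangs forever in pthread_join(), mirroring the
stuck ocfs2dc thread.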

> >However, the positive part of this is that the race I was originally afraid
> >of - ocfs2_delete_inode() being triggered from
> >ocfs2_dentry_post_unlock() - isn't possible, because ocfs2 seems to keep the
> >invariant that a node can cache a dentry for 'foo' without holding the inode
> >lock for 'foo' only if foo->i_nlink > 0. Thus iput() from the downconvert
> >thread should never go into ocfs2_delete_inode() (however, it would be nice
> >to assert this in ocfs2_dentry_post_unlock() for peace of mind).
> 
> No, ocfs2_dentry_convert_worker sets OCFS2_INODE_MAYBE_ORPHANED. So,
> ocfs2_delete_inode() is called even if i_nlink > 0.
  Ah, too bad :(. Do you know why we do that? I have to admit inode
deletion locking was always one of the trickier parts of ocfs2 locking, and
I'm not sure I've fully appreciated all its details.

> >Since the deadlock seems to be quota specific, it should be possible to
> >postpone just the quota processing to the workqueue. It isn't completely
> >trivial because we still have to clean up the inode quota pointers, but it
> >should be doable. I'll try to have a look at this tomorrow.
> 
> Thanks! We need to work out a different way to fix this so that
> open locks are not delayed and unlink performance does not
> suffer.
  Sure, I'm all for fixing this deadlock in a better way.
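
For illustration, the general shape of "postpone just the quota processing
to a workqueue" could look roughly like the sketch below. This is only a
sketch of the workqueue-deferral pattern, not an actual patch: the names
ocfs2_quota_drop_work, ocfs2_drop_quota_worker and ocfs2_defer_quota_drop
are hypothetical, and the handling of the inode's dquot pointers is
deliberately simplified.

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/quota.h>
#include <linux/quotaops.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical container carrying detached dquot references to a worker */
struct ocfs2_quota_drop_work {
	struct work_struct	work;
	struct dquot		*dquots[MAXQUOTAS];
};

/* Runs in process context, so blocking on quota cluster locks is safe here */
static void ocfs2_drop_quota_worker(struct work_struct *work)
{
	struct ocfs2_quota_drop_work *qw =
		container_of(work, struct ocfs2_quota_drop_work, work);
	int type;

	for (type = 0; type < MAXQUOTAS; type++)
		dqput(qw->dquots[type]);	/* dqput() tolerates NULL */
	kfree(qw);
}

/* Would be called from ocfs2_clear_inode() instead of dropping quota inline */
static void ocfs2_defer_quota_drop(struct inode *inode)
{
	struct ocfs2_quota_drop_work *qw;
	int type;

	qw = kmalloc(sizeof(*qw), GFP_NOFS);
	if (!qw) {
		dquot_drop(inode);	/* fall back to the inline (deadlocky) path */
		return;
	}
	INIT_WORK(&qw->work, ocfs2_drop_quota_worker);
	for (type = 0; type < MAXQUOTAS; type++) {
		qw->dquots[type] = inode->i_dquot[type];
		inode->i_dquot[type] = NULL;	/* detach pointers from the inode now */
	}
	schedule_work(&qw->work);
}

The point is merely that the blocking quota cluster locks would then be
taken from workqueue context rather than from the single ocfs2dc thread.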

								Honza
-- 
Jan Kara <jack at suse.cz>
SUSE Labs, CR
