[Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock

Jan Kara jack at suse.cz
Thu Jan 16 11:34:44 PST 2014


On Thu 16-01-14 11:19:47, Srinivas Eeda wrote:
> Hi Jan,
> thanks a lot for explaining the problem. Please see my comment below.
> 
> On 01/16/2014 06:02 AM, Jan Kara wrote:
> >On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote:
> >>On 01/15/2014 08:47 PM, Jan Kara wrote:
> >>>On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
> >>>>On 01/15/2014 09:53 AM, Jan Kara wrote:
> >>>>>On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
> >>>>>>On 01/15/2014 07:33 AM, Jan Kara wrote:
> >>>>>>>On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
> >>>>>>>>The following patches are reverted in this patch because they
> >>>>>>>>caused a regression in unlink() calls.
> >>>>>>>>
> >>>>>>>>ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
> >>>>>>>>of dentry lock to ocfs2_wq
> >>>>>>>>f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on umount
> >>>>>>>>5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
> >>>>>>>>ocfs2_kill_sb on a failed mount
> >>>>>>>>
> >>>>>>>>The regression is caused because these patches delay the iput() in the
> >>>>>>>>case of dentry unlocks, which also delays the unlocking of the open
> >>>>>>>>lockres. The open lock resource is required to test whether the inode
> >>>>>>>>can be wiped from disk or not. When the deleting node does not get the
> >>>>>>>>open lock, it marks the inode as an orphan (even though it is not in
> >>>>>>>>use by another node/process) and causes a journal checkpoint. This
> >>>>>>>>delays operations following the inode eviction. It also moves the
> >>>>>>>>inode to the orphan directory, which causes more I/O and a lot of
> >>>>>>>>unnecessary orphans.
> >>>>>>>>
> >>>>>>>>As for the issue these patches originally fixed, I tried to reproduce
> >>>>>>>>it by using quotas and enabling lockdep, but could not reproduce it.
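
For illustration, the deferral pattern introduced by the reverted patches
looks roughly like the following. This is a simplified sketch with made-up
names (defer_iput, deferred_iput_worker); the actual ocfs2 code queued the
dentry lock itself onto ocfs2_wq rather than the bare inode:

#include <linux/fs.h>
#include <linux/llist.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_iput {
	struct llist_node node;
	struct inode *inode;
};

static LLIST_HEAD(deferred_inodes);
static void deferred_iput_worker(struct work_struct *work);
static DECLARE_WORK(deferred_iput_work, deferred_iput_worker);

/* Called from the downconvert path instead of a direct iput(). */
static void defer_iput(struct inode *inode)
{
	struct deferred_iput *di = kmalloc(sizeof(*di), GFP_ATOMIC);

	if (!di) {
		iput(inode);	/* fall back to the direct put */
		return;
	}
	di->inode = inode;
	llist_add(&di->node, &deferred_inodes);
	queue_work(system_wq, &deferred_iput_work);
}

static void deferred_iput_worker(struct work_struct *work)
{
	struct llist_node *list = llist_del_all(&deferred_inodes);
	struct deferred_iput *di, *next;

	/*
	 * The final iput() - and with it the drop of the open lockres -
	 * happens only here, well after the dentry unlock.  That delay
	 * is exactly what the revert complains about.
	 */
	llist_for_each_entry_safe(di, next, list, node) {
		iput(di->inode);
		kfree(di);
	}
}
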
> >>>>>>>   So I was thinking about this for a while trying to remember... What it
> >>>>>>>really boils down to is whether it is OK to call ocfs2_delete_inode() from
> >>>>>>>DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes all
> >>>>>>>kinds of interesting cluster locks meaning that the downconvert thread can
> >>>>>>>block waiting to acquire some cluster lock. That seems like asking for
> >>>>>>>trouble to me (i.e., what if the node holding the lock we need from
> >>>>>>>downconvert thread needs a lock from us first?) but maybe it is OK these
> >>>>>>>days.
> >>>>>>>
> >>>>>>The only lock it tries to take is the "inode lock" resource, which
> >>>>>>seems to be fine to take.
> >>>>>   Why is it obviously fine? Some other node might be holding the inode lock
> >>>>>and still write to the inode. If that needs a cluster lock for block
> >>>>>allocation and that allocation cluster lock is currently held by our node,
> >>>>>we are screwed. Aren't we?
> >>>>>
> >>>>No. If another thread is using the allocation cluster lock, that
> >>>>implies we already have the inode lock. Cluster locks are also taken
> >>>>in a strict order to avoid ABBA locking.
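
As an aside, the ordering rule referred to here can be sketched as
follows. This is illustrative only (alloc_blocks_ordered and its
two-inode signature are invented for the example); the point is simply
that every path takes the per-inode lock before the global bitmap lock:

#include <linux/buffer_head.h>
#include "dlmglue.h"	/* ocfs2_inode_lock() / ocfs2_inode_unlock() */

static int alloc_blocks_ordered(struct inode *inode, struct inode *gb_inode)
{
	struct buffer_head *bh = NULL, *gb_bh = NULL;
	int ret;

	ret = ocfs2_inode_lock(inode, &bh, 1);		/* lock A first */
	if (ret)
		return ret;
	ret = ocfs2_inode_lock(gb_inode, &gb_bh, 1);	/* then lock B */
	if (ret)
		goto out;

	/* ... allocate blocks while holding both cluster locks ... */

	ocfs2_inode_unlock(gb_inode, 1);
	brelse(gb_bh);
out:
	ocfs2_inode_unlock(inode, 1);
	brelse(bh);
	return ret;
}

With a single global order, two lock users on one node can never wait on
each other's locks in opposite order; the deadlock below arises because
the downconvert thread is a third party that does not follow this path.
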
> >>>   I agree with what you wrote above but it's not exactly what I'm talking
> >>>about. What seems to be possible is:
> >>>NODE 1					NODE2
> >>>holds dentry lock for 'foo'
> >>>holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
> >>>					dquot_initialize(bar)
> >>>					  ocfs2_dquot_acquire()
> >>>					    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
> >>>					    ...
> >>>downconvert thread (triggered from another
> >>>node or a different process from NODE2)
> >>>   ocfs2_dentry_post_unlock()
> >>>     ...
> >>>     iput(foo)
> >>>       ocfs2_evict_inode(foo)
> >>>         ocfs2_clear_inode(foo)
> >>>           dquot_drop(inode)
> >>>	      ...
> >>>             ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
> >>>              - blocks
> >>>					    finds we need more space in
> >>>					    quota file
> >>>					    ...
> >>>					    ocfs2_extend_no_holes()
> >>>					      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
> >>>						- deadlocks waiting for
> >>>						  downconvert thread
> >>>
> >>Thanks. This explains the situation much better. What's stopping
> >>NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE lock? (to break
> >>the hold-and-wait condition)
> >   The fact that, to release the lock, it has to be downconverted. And
> >there is only one downconvert thread (ocfs2dc), and that is busy waiting
> >for the USER_QUOTA_SYSTEM_INODE lock...
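
To make the single-thread constraint concrete, the downconvert thread is
essentially a serial loop. The following is a simplified model, with
downconvert_work_pending() and process_blocked_lockres_list() standing in
for the real helpers in fs/ocfs2/dlmglue.c:

#include <linux/kthread.h>
#include <linux/wait.h>
#include "ocfs2.h"	/* struct ocfs2_super */

/*
 * Simplified model of ocfs2dc: blocked lockres entries are processed
 * serially off one queue.  If handling one entry blocks on a cluster
 * lock (the quota path above), every other pending downconvert -
 * including the GLOBAL_BITMAP_SYSTEM_INODE one NODE2 is waiting for -
 * stalls behind it.
 */
static int downconvert_thread(void *arg)
{
	struct ocfs2_super *osb = arg;

	while (!kthread_should_stop()) {
		wait_event_interruptible(osb->dc_event,
					 downconvert_work_pending(osb) ||
					 kthread_should_stop());
		process_blocked_lockres_list(osb);	/* one at a time */
	}
	return 0;
}
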
> Yes, that indeed seems to be a problem, thanks a lot for explaining
> :). I do not know much about the quota code, but I am wondering if we
> can fix this problem by enforcing the lock ordering. Can NODE2 first
> check if it needs space before taking the lock on
> USER_QUOTA_SYSTEM_INODE? If it does, it should acquire the
> GLOBAL_BITMAP_SYSTEM_INODE lock before acquiring the lock on
> USER_QUOTA_SYSTEM_INODE. (Basically, we enforce that
> GLOBAL_BITMAP_SYSTEM_INODE cannot be taken while a node holds the
> USER_QUOTA_SYSTEM_INODE lock.)
  Well, that is impractical to say the least - we would have to rewrite the
allocation path to special-case the quota code, which would take the
GLOBAL_BITMAP_SYSTEM_INODE lock at a different place than any other code
does. I think it is better to postpone dropping the quota references - that
keeps the pain localized in the code handling inode eviction.
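
A minimal sketch of that direction, reusing the same deferral pattern as
the iput sketch earlier but applied only to the last dquot reference
(names are illustrative; a real version would embed the llist_node in
ocfs2's dquot wrapper so this path cannot fail on allocation):

#include <linux/llist.h>
#include <linux/quotaops.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct deferred_dqput {
	struct llist_node node;
	struct dquot *dquot;
};

static LLIST_HEAD(dropped_dquots);
static void dqput_worker(struct work_struct *work);
static DECLARE_WORK(dqput_work, dqput_worker);

/* Called during inode eviction instead of dropping the reference inline. */
static void defer_dqput(struct dquot *dquot)
{
	struct deferred_dqput *dd = kmalloc(sizeof(*dd), GFP_NOFS);

	if (!dd) {
		dqput(dquot);	/* would reintroduce the deadlock risk */
		return;
	}
	dd->dquot = dquot;
	llist_add(&dd->node, &dropped_dquots);
	queue_work(system_wq, &dqput_work);
}

static void dqput_worker(struct work_struct *work)
{
	struct llist_node *list = llist_del_all(&dropped_dquots);
	struct deferred_dqput *dd, *next;

	/*
	 * Runs outside the downconvert thread, so taking the
	 * USER_QUOTA_SYSTEM_INODE cluster lock in dqput() cannot stall
	 * other pending downconverts.
	 */
	llist_for_each_entry_safe(dd, next, list, node) {
		dqput(dd->dquot);
		kfree(dd);
	}
}
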

								Honza

-- 
Jan Kara <jack at suse.cz>
SUSE Labs, CR


