[Ocfs2-devel] [PATCH] Revert iput deferring code in ocfs2_drop_dentry_lock

Goldwyn Rodrigues rgoldwyn at suse.de
Thu Jan 16 07:49:46 PST 2014


On 01/16/2014 08:02 AM, Jan Kara wrote:
> On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote:
>> On 01/15/2014 08:47 PM, Jan Kara wrote:
>>> On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
>>>> On 01/15/2014 09:53 AM, Jan Kara wrote:
>>>>> On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
>>>>>> On 01/15/2014 07:33 AM, Jan Kara wrote:
>>>>>>> On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
>>>>>>>> The following patches are reverted in this patch because they
>>>>>>>> caused a regression in unlink() calls.
>>>>>>>>
>>>>>>>> ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
>>>>>>>> of dentry lock to ocfs2_wq
>>>>>>>> f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on umount
>>>>>>>> 5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
>>>>>>>> ocfs2_kill_sb on a failed mount
>>>>>>>>
>>>>>>>> The regression is caused because these patches delay the iput() in case
>>>>>>>> of dentry unlocks. This also delays the unlocking of the open lockres.
>>>>>>>> The open lock resource is required to test whether the inode can be wiped
>>>>>>>> from disk or not. When the deleting node does not get the open lock, it
>>>>>>>> marks the inode as orphaned (even though it is not in use by another
>>>>>>>> node/process) and causes a journal checkpoint. This delays operations
>>>>>>>> following the inode eviction. It also moves the inode to the orphan
>>>>>>>> directory, which causes more I/O and a lot of unnecessary orphans.
>>>>>>>>
>>>>>>>> As for the fix, I tried reproducing the problem by using quotas and enabling
>>>>>>>> lockdep, but the issue could not be reproduced.
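
(For reference, the net effect of the revert is that ocfs2_drop_dentry_lock()
goes back to doing the final iput() inline instead of queueing the dentry
lock to ocfs2_wq. A rough sketch of the restored behaviour; the helper and
field names here are from memory, so treat them as approximate:

static void ocfs2_drop_dentry_lock(struct ocfs2_super *osb,
				   struct ocfs2_dentry_lock *dl)
{
	/* sketch only; helper/field names approximate */

	/* Drop the cluster lock and free the lock resource ... */
	ocfs2_simple_drop_lockres(osb, &dl->dl_lockres);
	ocfs2_lock_res_free(&dl->dl_lockres);

	/*
	 * ... and do the iput() right here, in the calling context.  The
	 * final iput() is what releases the open lock, so deferring it to
	 * ocfs2_wq is what delays the unlink path on the deleting node.
	 */
	iput(dl->dl_inode);
	kfree(dl);
}

The deferred variant instead puts dl on a list and kicks a work item, so the
open lock is only dropped whenever ocfs2_wq gets around to it.)
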
>>>>>>>    So I was thinking about this for a while trying to remember... What it
>>>>>>> really boils down to is whether it is OK to call ocfs2_delete_inode() from
>>>>>>> DLM downconvert thread. Bear in mind that ocfs2_delete_inode() takes all
>>>>>>> kinds of interesting cluster locks meaning that the downconvert thread can
>>>>>>> block waiting to acquire some cluster lock. That seems like asking for
>>>>>>> trouble to me (i.e., what if the node holding the lock we need from
>>>>>>> downconvert thread needs a lock from us first?) but maybe it is OK these
>>>>>>> days.
>>>>>>>
>>>>>>
>>>>>> The only lock it tries to take is the "inode lock" resource, which
>>>>>> seems to be fine to take.
>>>>>    Why is it obviously fine? Some other node might be holding the inode lock
>>>>> and still write to the inode. If that needs a cluster lock for block
>>>>> allocation and that allocation cluster lock is currently held by our node,
>>>>> we are screwed. Aren't we?
>>>>>
>>>>
>>>> No. If another thread is using the allocation cluster lock, that
>>>> implies we already have the inode lock. Cluster locks are also taken
>>>> in a strict order to avoid ABBA locking.
>>>    I agree with what you wrote above but it's not exactly what I'm talking
>>> about. What seems to be possible is:
>>> NODE 1					NODE2
>>> holds dentry lock for 'foo'
>>> holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
>>> 					dquot_initialize(bar)
>>> 					  ocfs2_dquot_acquire()
>>> 					    ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
>>> 					    ...
>>> downconvert thread (triggered from another
>>> node or a different process from NODE2)
>>>    ocfs2_dentry_post_unlock()
>>>      ...
>>>      iput(foo)
>>>        ocfs2_evict_inode(foo)
>>>          ocfs2_clear_inode(foo)
>>>            dquot_drop(inode)
>>> 	    ...
>>>              ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
>>>               - blocks
>>> 					    finds we need more space in
>>> 					    quota file
>>> 					    ...
>>> 					    ocfs2_extend_no_holes()
>>> 					      ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
>>> 						- deadlocks waiting for
>>> 						  downconvert thread
>>>
>>
>> Thanks. This explains the situation much better. What's stopping
>> NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE lock? (to break
>> the hold-and-wait condition)
>    The fact that to release the lock, it has to be downconverted. And there
> is only one downconvert thread (ocfs2dc), and that is busy waiting for the
> USER_QUOTA_SYSTEM_INODE lock...
>
>>> However the positive part of this is that the race I was originally afraid
>>> of - when ocfs2_delete_inode() would be triggered from
>>> ocfs2_dentry_post_unlock() - isn't possible because ocfs2 seems to keep the
>>> invariant that a node can cache a dentry for 'foo' without holding the
>>> inode lock for 'foo' only if foo->i_nlink > 0. Thus iput() from the
>>> downconvert thread should never go into ocfs2_delete_inode() (however it
>>> would be nice to assert this in ocfs2_dentry_post_unlock() for peace of mind).
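
(The assertion Jan has in mind would presumably be something as simple as the
following hypothetical check, although, as noted just below, the invariant it
relies on does not actually hold:

	/* hypothetical check, in ocfs2_dentry_post_unlock() before the iput() */
	BUG_ON(inode->i_nlink == 0);
)
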
>>
>> No, ocfs2_dentry_convert_worker sets OCFS2_INODE_MAYBE_ORPHANED. So,
>> ocfs2_delete_inode() is called even if i_nlink > 0.
>    Ah, too bad :(. Do you know why we do that? I have to admit inode
> deletion locking was always one of the trickier parts of ocfs2 locking and
> I'm not sure I've fully appreciated all its details.
>

Well, the file could be unlinked on another node while it is still open on
this node. inode->i_nlink is a node-local value (and is refreshed only
later).
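
To make that concrete, the convert worker does roughly the following (quoted
from memory, so the exact placement and locking may differ):

	/* in ocfs2_dentry_convert_worker(), roughly; from memory */
	struct ocfs2_inode_info *oi = OCFS2_I(inode);

	/*
	 * Another node may have unlinked this inode while we had it
	 * cached; our i_nlink is only refreshed the next time we take
	 * the inode lock, so be pessimistic and make the eviction path
	 * check the on-disk orphan state.
	 */
	spin_lock(&oi->ip_lock);
	oi->ip_flags |= OCFS2_INODE_MAYBE_ORPHANED;
	spin_unlock(&oi->ip_lock);

That flag is why the iput() done for a dentry lock downconvert can still end
up in ocfs2_delete_inode() even when the local i_nlink is positive.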


>>> Since the deadlock seems to be quota specific, it should be possible to
>>> postpone just the quota processing to the workqueue. It isn't completely
>>> trivial because we still have to clean up the inode quota pointers, but it
>>> should be doable. I'll try to have a look at this tomorrow.
>>
>> Thanks! We need to work out a different way to fix this, so that open
>> locks are not delayed and unlink performance does not suffer.
>    Sure, I'm all for fixing this deadlock in a better way.
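
Something along these lines is what I would imagine, purely as a sketch of
the idea (ocfs2_defer_dquot_drop() is a made-up placeholder, and the dc_task
field name is from memory, not a proposal for the actual patch):

static void ocfs2_clear_inode_quota(struct inode *inode)
{
	/*
	 * Dropping the last dquot reference may need a cluster lock on
	 * the quota system file, which must never be taken from the
	 * downconvert thread (ocfs2dc) itself.
	 */
	if (current == OCFS2_SB(inode->i_sb)->dc_task)	/* field name from memory */
		ocfs2_defer_dquot_drop(inode);	/* placeholder: hand off to ocfs2_wq */
	else
		dquot_drop(inode);		/* normal, synchronous path */
}

That would keep the final iput() (and hence the open lock release)
synchronous, while only the quota teardown is pushed out of the downconvert
thread.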


-- 
Goldwyn


