[Ocfs2-devel] [PATCH] unlink performance: wait for open lock in case of dirs

Jeff Liu jeff.liu at oracle.com
Tue Jul 2 00:11:08 PDT 2013


On 07/02/2013 12:44 AM, Goldwyn Rodrigues wrote:

> On Sun, Jun 30, 2013 at 1:44 AM, Jeff Liu <jeff.liu at oracle.com> wrote:
>> On 06/29/2013 07:12 AM, Andrew Morton wrote:
>>
>>> On Fri, 14 Jun 2013 21:05:10 -0500 Goldwyn Rodrigues <rgoldwyn at gmail.com> wrote:
>>>
>>>> This patch is to improve the unlink performance. Here is the scenario:
>>>>
>>>> On node A, create multiple directories say d1-d8, and each have 3
>>>> files under it f1, f2 and f3.
>>>> On node B, delete all directories using rm -Rf d*
>>>>
>>>> The FS first unlinks f1, f2 and f3. However, when it performs
>>>> ocfs2_evict_inode() -> ocfs2_delete_inode() ->
>>>> ocfs2_query_inode_wipe() -> ocfs2_try_open_lock() on d1, it fails
>>>> with -EAGAIN. The open lock fails because on the remote node
>>>> a PR->EX convert takes longer than a simple EX grant.
>>>>
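>>>> The failing call is the trylock in ocfs2_query_inode_wipe(); roughly
>>>> (abridged from the code):
>>>>
>>>> 	status = ocfs2_try_open_lock(inode, 1);
>>>> 	if (status == -EAGAIN) {
>>>> 		/* another node still holds the open lock */
>>>> 		status = 0;
>>>> 		goto bail;
>>>> 	}
>>>>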
>>>> This starts a checkpoint because the OCFS2_INODE_DELETED flag is not
>>>> set on the directory inode. The checkpoint then interferes with the
>>>> journaling of the inodes deleted in the following unlinks, in our
>>>> case directories d2-d8 and the files contained in them.
>>>>
>>>> With this patch, we wait on a directory EX lock only if we already
>>>> hold the open_lock in PR mode. This way we avoid the ABBA locking.
>>>>
>>>> By waiting for the open_lock on the directory, I am seeing an unlink
>>>> performance improvement of 50-60% for rm -Rf in the usual case.
>>>>
>>>> Also, folded ocfs2_open_lock and ocfs2_try_open_lock into one.
>>>>
>>>> Let me know if you would like to see the test case.
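>>>>
>>>> For reference, a rough sketch of what the folded function could look
>>>> like (illustrative only; the argument names and details here are
>>>> mine, not necessarily what the final patch does):
>>>>
>>>> int ocfs2_open_lock(struct inode *inode, int write, int wait)
>>>> {
>>>> 	int status = 0;
>>>> 	/* EX when the caller intends to wipe the inode, PR otherwise */
>>>> 	int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
>>>> 	/* trylock semantics unless the caller asks to wait */
>>>> 	u32 flags = wait ? 0 : DLM_LKF_NOQUEUE;
>>>> 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>>> 	struct ocfs2_lock_res *lockres;
>>>>
>>>> 	if (ocfs2_is_hard_readonly(osb) || ocfs2_mount_local(osb))
>>>> 		goto out;
>>>>
>>>> 	lockres = &OCFS2_I(inode)->ip_open_lockres;
>>>> 	status = ocfs2_cluster_lock(osb, lockres, level, flags, 0);
>>>> 	if (status < 0 && status != -EAGAIN)
>>>> 		mlog_errno(status);
>>>> out:
>>>> 	return status;
>>>> }
>>>>
>>>> ocfs2_query_inode_wipe() would then pass wait=1 for a directory on
>>>> which we already hold the PR open lock, and wait=0 everywhere else.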
>>>
>>> We need some more review and test of this patch, please.
>>>
>>> The patch doesn't apply to current kernels - please redo, retest and
>>> resend it.
> 
> Yes, I sent it against an older kernel. I will update it once I
> incorporate the other review comments. Thanks for this comment.
> 
>>>
>>> In particular, the kernel you're patching doesn't have the
>>> ocfs2_is_hard_readonly() tests in ocfs2_open_lock() and
>>> ocfs2_try_open_lock().
>>>
>>> And looking at those tests, I wonder about ocfs2_open_lock().
>>>
>>> : int ocfs2_open_lock(struct inode *inode)
>>> : {
>>> :     int status = 0;
>>> :     struct ocfs2_lock_res *lockres;
>>> :     struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>> :
>>> :     BUG_ON(!inode);
>>> :
>>> :     mlog(0, "inode %llu take PRMODE open lock\n",
>>> :          (unsigned long long)OCFS2_I(inode)->ip_blkno);
>>> :
>>> :     if (ocfs2_is_hard_readonly(osb) || ocfs2_mount_local(osb))
>>> :             goto out;
>>> :
>>> :     lockres = &OCFS2_I(inode)->ip_open_lockres;
>>> :
>>> :     status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
>>> :                                 DLM_LOCK_PR, 0, 0);
>>> :     if (status < 0)
>>> :             mlog_errno(status);
>>> :
>>> : out:
>>> :     return status;
>>> : }
>>>
>>> It will return zero if ocfs2_is_hard_readonly().  Is that correct?
>>> Should it return -EROFS for writes?
> 
> Yes, it should.
> 
> 
>>
>> Hi Andrew,
>>
>> That is correct for the current design, because ocfs2_open_lock() is
>> implemented to grant the lock in protected read (PR) mode only.
>>
>> The ocfs2_is_hard_readonly() check was introduced to ocfs2_open_lock()
>> by commit 03efed8a2a1b8e00164eb4720a82a7dd5e368a8e:
>>
>>     ocfs2: Bugfix for hard readonly mount
>>
>>     ocfs2 cannot currently mount a device that is readonly at the
>>     media ("hard readonly").  Fix the broken places.
>>     see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322
>>
>>> After your patch, this code does
>>> know whether the attempt was for a write.
>>
>> Yes, with this change it does indeed need a fix.
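>>
>> Something along these lines, perhaps (a sketch; it assumes the folded
>> function gains the 'write' argument):
>>
>> 	if (ocfs2_is_hard_readonly(osb)) {
>> 		if (write)
>> 			status = -EROFS;
>> 		goto out;
>> 	}
>> 	if (ocfs2_mount_local(osb))
>> 		goto out;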
>>
>>>
>>> Please check all that.
>>
>> Hi Goldwyn,
>>
>> Sorry for the late response. As I mentioned, I intended to take a look
>> at it, but some internal tasks kept me fully occupied over the past
>> couple of weeks.
> 
> No problem.
> 
>>
>> As per Andrew's comments, this patch cannot be applied cleanly against
>> the upstream development tree; could you please rebase it?
> 
> Yes, will do. I have other comments from Mark as well.
> 
>>
>> If we merge ocfs2_open_lock()/ocfs2_try_open_lock() into one, it looks
>> like the semantics of ocfs2_open_lock() are broken once the 'write'
>> option is introduced. That is not a big issue, as we can fix up the
>> comments as well as the affected functions, but we need to run
>> ocfs2-test (at least covering all the code paths involved in this
>> change) to avoid any regression. It can be downloaded from:
>> https://oss.oracle.com/projects/ocfs2-test/source.html
>> https://oss.oracle.com/osswiki/OCFS2/Ocfs2TestList.html
>>
>> Another thing comes to mind: this patch would affect
>> ocfs2_evict_inode() through ocfs2_try_open_lock(), i.e., the following
>> call chain:
>>
>> ocfs2_evict_inode()
>> |
>> |-------ocfs2_delete_inode()
>>         |
>>         |-------ocfs2_query_inode_wipe()
>>         |       |
>>         |       |-------ocfs2_try_open_lock()
>>         |       |
>>         |-------ocfs2_clear_inode()
>>                 |
>>                 |---------ocfs2_open_unlock()
>>
>> OCFS2 has a deadlock issue when creating/removing tons of files in
>> parallel if disk quotas are enabled; please refer to:
>> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1339
>>
>> As per the old discussions, Jan Kara pointed out that there is a race
>> in ocfs2_evict_inode() if we consider the following lock sequence:
>>
>> ocfs2_evict_inode()
>> |
>> |-------ocfs2_inode_lock()
>> |
>> |-------ocfs2_try_open_lock()  [ tries to take the open lock in EX mode ]
>> |
>> |-------ocfs2_inode_unlock()
>> |
>> |-------ocfs2_open_unlock()    [ drops the shared open lock that is held ]
>>
>> Quoting Jan:
>> """
>> Now if two nodes happen to execute ocfs2_evict_inode() in parallel and
>> ocfs2_try_open_lock() happens on both nodes before ocfs2_open_unlock() is
>> called on any of them, ocfs2_try_open_lock() fails for both nodes...
>>
>> I would think that the code should be reorganized so that shared open
>> lock is dropped before we drop inode lock. Then the race could not happen.
>> But I'm not sure if something else would not break.
>> """
> 
> I could not find a reference to this in the bug. Was it discussed on
> the ML? Do you have a link so I can look at the history?
> 
>>
>> This deserves a try IMHO, especially since there are some dependencies
>> with your fix, and it would be wonderful if we could get your
>> improvement in together with the lock-sequence adjustments that might
>> fix the quota problems.
>>
> 
> Yes, I understand. Please provide the references if possible so I can
> get the bigger picture.

https://oss.oracle.com/pipermail/ocfs2-devel/2012-August/008709.html

Thanks,
-Jeff


