[Ocfs2-devel] [PATCH] unlink performance: wait for open lock in case of dirs

Goldwyn Rodrigues rgoldwyn at gmail.com
Mon Jul 1 09:44:26 PDT 2013


On Sun, Jun 30, 2013 at 1:44 AM, Jeff Liu <jeff.liu at oracle.com> wrote:
> On 06/29/2013 07:12 AM, Andrew Morton wrote:
>
>> On Fri, 14 Jun 2013 21:05:10 -0500 Goldwyn Rodrigues <rgoldwyn at gmail.com> wrote:
>>
>>> This patch is to improve the unlink performance. Here is the scenario:
>>>
>>> On node A, create multiple directories, say d1-d8, each with three
>>> files under it: f1, f2 and f3.
>>> On node B, delete all directories using rm -Rf d*
>>>
>>> The FS first unlinks f1, f2 and f3. However, when it performs
>>> ocfs2_evict_inode() -> ocfs2_delete_inode() ->
>>> ocfs2_query_inode_wipe() -> ocfs2_try_open_lock() on d1, it fails
>>> with -EAGAIN. The open lock fails because on the remote node
>>> a PR->EX convert takes longer than a simple EX grant.
>>>
>>> This starts a checkpoint, because the OCFS2_INODE_DELETED flag is
>>> not set on the directory inode. The checkpoint then interferes with
>>> the journaling of the inodes deleted by the subsequent unlinks, in
>>> our case directories d2-d8 and the files contained in them.
>>>
>>> With this patch, we wait for the directory's EX open lock, but only
>>> if we already hold the open_lock in PR mode. This way we avoid ABBA
>>> locking.
>>>
>>> By waiting for the open_lock on the directory, I see an unlink
>>> performance improvement of 50-60% for rm -Rf in the usual case.
>>>
>>> Also, folded ocfs2_open_lock and ocfs2_try_open_lock into one.
>>>
>>> Let me know if you would like to see the test case.
>>
>> We need some more review and test of this patch, please.
>>
>> The patch doesn't apply to current kernels - please redo, retest and
>> resend it.

Yes, I sent it against an older kernel. I will update it once I
incorporate the other review comments. Thanks for this comment.
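
For context, the folded helper roughly looks like the sketch below.
This is against the older tree, so the final signature may change once
I rebase; 'write' selects DLM_LOCK_EX vs DLM_LOCK_PR, and 'wait'
selects a blocking request vs a DLM_LKF_NOQUEUE trylock:

/*
 * Sketch only: ocfs2_open_lock() and ocfs2_try_open_lock() folded
 * into one.  Directory inodes in the wipe path pass wait=1, since we
 * already hold the open lock in PR mode and only need to sit out the
 * remote PR->EX convert; everything else keeps the old non-blocking
 * behaviour.
 */
int ocfs2_open_lock(struct inode *inode, int write, int wait)
{
        int status = 0;
        int level = write ? DLM_LOCK_EX : DLM_LOCK_PR;
        u32 lkm_flags = wait ? 0 : DLM_LKF_NOQUEUE;
        struct ocfs2_lock_res *lockres;
        struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);

        BUG_ON(!inode);

        if (ocfs2_is_hard_readonly(osb) || ocfs2_mount_local(osb))
                goto out;

        lockres = &OCFS2_I(inode)->ip_open_lockres;
        status = ocfs2_cluster_lock(osb, lockres, level, lkm_flags, 0);
        if (status < 0 && status != -EAGAIN)
                mlog_errno(status);
out:
        return status;
}

The caller in ocfs2_query_inode_wipe() then becomes, roughly:

        /* directories wait for the EX convert, files keep the trylock */
        status = ocfs2_open_lock(inode, 1, S_ISDIR(inode->i_mode));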

>>
>> In particular, the kernel you're patching doesn't have the
>> ocfs2_is_hard_readonly() tests in ocfs2_open_lock() and
>> ocfs2_try_open_lock().
>>
>> And looking at those tests, I wonder about ocfs2_open_lock().
>>
>> : int ocfs2_open_lock(struct inode *inode)
>> : {
>> :     int status = 0;
>> :     struct ocfs2_lock_res *lockres;
>> :     struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>> :
>> :     BUG_ON(!inode);
>> :
>> :     mlog(0, "inode %llu take PRMODE open lock\n",
>> :          (unsigned long long)OCFS2_I(inode)->ip_blkno);
>> :
>> :     if (ocfs2_is_hard_readonly(osb) || ocfs2_mount_local(osb))
>> :             goto out;
>> :
>> :     lockres = &OCFS2_I(inode)->ip_open_lockres;
>> :
>> :     status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
>> :                                 DLM_LOCK_PR, 0, 0);
>> :     if (status < 0)
>> :             mlog_errno(status);
>> :
>> : out:
>> :     return status;
>> : }
>>
>> It will return zero if ocfs2_is_hard_readonly().  Is that correct?
>> Should it return -EROFS for writes?

Yes, it should.
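
In the folded helper, the check would become something like this
(sketch only; 'write' is the new parameter, the rest is as in the
current code):

        if (ocfs2_is_hard_readonly(osb)) {
                if (write)
                        /* an EX request cannot succeed on hard-ro media */
                        status = -EROFS;
                goto out;
        }

        if (ocfs2_mount_local(osb))
                goto out;

I will include that when I rebase.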


>
> Hi Andrew,
>
> That is correct for the current design, because ocfs2_open_lock() is
> implemented to grant a lock in protected read (PR) mode.
>
> The ocfs2_is_hard_readonly() check was introduced to ocfs2_open_lock()
> by commit 03efed8a2a1b8e00164eb4720a82a7dd5e368a8e:
>
>     ocfs2: Bugfix for hard readonly mount
>         ocfs2 cannot currently mount a device that is readonly at the media
>     ("hard readonly").  Fix the broken places.
>     see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322
>
>> After your patch, this code does
>> know whether the attempt was for a write.
>
> Yes, with this change, it needs a fix indeed.
>
>>
>> Please check all that.
>
> Hi Goldwyn,
>
> Sorry for the late response; I had mentioned that I would take a look
> at it, but some internal tasks kept me fully occupied over the past
> couple of weeks.

No problem.

>
> As Andrew commented, this patch cannot be applied cleanly against the
> upstream development tree; could you please rebase it?

Yes, will do. I have other comments from Mark as well.

>
> If we merge ocfs2_open_lock()/ocfs2_try_open_lock() into one, it looks
> like the semantics of ocfs2_open_lock() are broken once the 'write'
> option is introduced. That is not a big issue, as we can fix up the
> comments as well as the affected functions, but we need to run
> ocfs2-test (at least covering all the code paths involved in this
> change) to avoid any regression. It can be downloaded from:
> https://oss.oracle.com/projects/ocfs2-test/source.html
> https://oss.oracle.com/osswiki/OCFS2/Ocfs2TestList.html
>
> Another thing comes to mind, since this patch would affect
> ocfs2_evict_inode() via ocfs2_try_open_lock(), i.e., the following
> call chain:
>
> ocfs2_evict_inode()
> |
> |-------ocfs2_delete_inode()
>         |
>         |-------ocfs2_query_inode_wipe()
>         |       |
>         |       |-------ocfs2_try_open_lock()
>         |       |
>         |-------ocfs2_clear_inode()
>                 |
>                 |---------ocfs2_open_unlock()
>
> OCFS2 has a deadlock issue when creating/removing tons of files in
> parallel with disk quotas enabled; please refer to:
> https://oss.oracle.com/bugzilla/show_bug.cgi?id=1339
>
> As per the old discussions, Jan Kara pointed out that there is a race
> in ocfs2_evict_inode() if we consider the following lock sequence:
>
> ocfs2_evict_inode()
> |
> |-------ocfs2_inode_lock()
>         |
>         |-------ocfs2_try_open_lock() [ try to get open lock in EX mode ]
>                 |
>                 |-------ocfs2_inode_unlock()
>                         |
>                         |-------ocfs2_open_unlock() [ drops the shared open lock that is held ]
>
> Quoting Jan:
> """
> Now if two nodes happen to execute ocfs2_evict_inode() in parallel and
> ocfs2_try_open_lock() happens on both nodes before ocfs2_open_unlock() is
> called on any of them, ocfs2_try_open_lock() fails for both nodes...
>
> I would think that the code should be reorganized so that shared open
> lock is dropped before we drop inode lock. Then the race could not happen.
> But I'm not sure if something else would not break.
> """

I could not find a reference to this in the bug. Was it discussed on
the ML? Do you have a link so I can look at the history?

>
> This deserves a try IMHO, especially since it would have some
> dependencies on your fix, and it would be wonderful if we could get
> your improvement as well as the lock-sequence adjustments that might
> fix the quota problems.
>

Yes, I understand. Please provide the references if possible so I get
the bigger picture.


--
Goldwyn


