[Ocfs2-devel] ocfs2: fix sparse file & data ordering issue in direct io. review
Eric Ren
zren at suse.com
Fri Jun 2 02:14:40 PDT 2017
Hi Guanghui,
Please sort out your mail more orderly. It looks really messy! So, rework your mail by asking
question yourself like:
- What is the problem you are facing?
Looks like a BUG_ON() is triggered. but which BUG_ON()? the backtrace? How this can be
reroduced?
- What help or answer do you hope for?
You didn't ask any question below!
On 05/26/2017 11:46 AM, Zhangguanghui wrote:
> This patch replace that function ocfs2_direct_IO_get_blocks with
Which patch? Don't analyze code before telling the problem.
>
> this function ocfs2_get_blocks in ocfs2_direct_IO, and remove the ip_alloc_sem.
>
> but i think ip_alloc_sem is still needed because protect allocation changes is very correct.
"still needed" - so, which commit dropped it?
>
> Now, BUG_ON have been tiggered in the process of testing direct-io.
>
> Comments and questions are, as always, welcome. Thanks
Comments on what?
>
>
> As wangww631 described
A mail thread link is useful for people to know the discussion and background.
>
> In ocfs2, ip_alloc_sem is used to protect allocation changes on the node.
> In direct IO, we add ip_alloc_sem to protect date consistent between
> direct-io and ocfs2_truncate_file race (buffer io use ip_alloc_sem
> already). Although inode->i_mutex lock is used to avoid concurrency of
> above situation, i think ip_alloc_sem is still needed because protect
> allocation changes is significant.
>
> Other filesystem like ext4 also uses rw_semaphore to protect data
> consistent between get_block-vs-truncate race by other means, So
> ip_alloc_sem in ocfs2 direct io is needed.
>
>
> Date: Fri, 11 Sep 2015 16:19:18 +0800
> From: Ryan Ding <ryan.ding at oracle.com>
> Subject: [Ocfs2-devel] [PATCH 7/8] ocfs2: fix sparse file & data
> ordering issue in direct io.
You email subject is almost the same as this patch, which brings confusion...
> To: ocfs2-devel at oss.oracle.com
> Cc: mfasheh at suse.de
> Message-ID: <1441959559-29947-8-git-send-email-ryan.ding at oracle.com>
Don't copy & paste patch content, only making you mail too long to scare reader away.
Eric
>
> There are mainly 3 issue in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"):
> * Do not support sparse file.
> * Do not support data ordering. eg: when write to a file hole, it will alloc
> extent first. If system crashed before io finished, data will corrupt.
> * Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely
> to be ignored by ocfs2_direct_IO_write().
>
> To resolve above problems, re-design direct io code with following ideas:
> * Use buffer io to fill in holes. And this will make better performance also.
> * Clear unwritten after direct write finished. So we can make sure meta data
> changes after data write to disk. (Unwritten extent is invisible to user,
> from user's view, meta data is not changed when allocate an unwritten
> extent.)
> * Clear ocfs2_direct_IO_write(). Do all ending work in end_io.
>
> This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
> test cases of ltp.
>
> Signed-off-by: Ryan Ding <ryan.ding at oracle.com>
> Reviewed-by: Junxiao Bi <junxiao.bi at oracle.com>
> cc: Joseph Qi <joseph.qi at huawei.com>
> ---
> fs/ocfs2/aops.c | 851 ++++++++++++++++++++++---------------------------------
> 1 files changed, 342 insertions(+), 509 deletions(-)
>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index b4ec600..4bb9921 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -499,152 +499,6 @@ bail:
> return status;
> }
>
> -/*
> - * TODO: Make this into a generic get_blocks function.
> - *
> - * From do_direct_io in direct-io.c:
> - * "So what we do is to permit the ->get_blocks function to populate
> - * bh.b_size with the size of IO which is permitted at this offset and
> - * this i_blkbits."
> - *
> - * This function is called directly from get_more_blocks in direct-io.c.
> - *
> - * called like this: dio->get_blocks(dio->inode, fs_startblk,
> - * fs_count, map_bh, dio->rw == WRITE);
> - */
> -static int ocfs2_direct_IO_get_blocks(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create)
> -{
> - int ret;
> - u32 cpos = 0;
> - int alloc_locked = 0;
> - u64 p_blkno, inode_blocks, contig_blocks;
> - unsigned int ext_flags;
> - unsigned char blocksize_bits = inode->i_sb->s_blocksize_bits;
> - unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;
> - unsigned long len = bh_result->b_size;
> - unsigned int clusters_to_alloc = 0, contig_clusters = 0;
> -
> - cpos = ocfs2_blocks_to_clusters(inode->i_sb, iblock);
> -
> - /* This function won't even be called if the request isn't all
> - * nicely aligned and of the right size, so there's no need
> - * for us to check any of that. */
> -
> - inode_blocks = ocfs2_blocks_for_bytes(inode->i_sb, i_size_read(inode));
> -
> - down_read(&OCFS2_I(inode)->ip_alloc_sem);
> -
> - /* This figures out the size of the next contiguous block, and
> - * our logical offset */
> - ret = ocfs2_extent_map_get_blocks(inode, iblock, &p_blkno,
> - &contig_blocks, &ext_flags);
> - up_read(&OCFS2_I(inode)->ip_alloc_sem);
> -
> - if (ret) {
> - mlog(ML_ERROR, "get_blocks() failed iblock=%llu\n",
> - (unsigned long long)iblock);
> - ret = -EIO;
> - goto bail;
> - }
> -
> - /* We should already CoW the refcounted extent in case of create. */
> - BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
> -
> - /* allocate blocks if no p_blkno is found, and create == 1 */
> - if (!p_blkno && create) {
> - ret = ocfs2_inode_lock(inode, NULL, 1);
> - if (ret < 0) {
> - mlog_errno(ret);
> - goto bail;
> - }
> -
> - alloc_locked = 1;
> -
> - down_write(&OCFS2_I(inode)->ip_alloc_sem);
> -
> - /* fill hole, allocate blocks can't be larger than the size
> - * of the hole */
> - clusters_to_alloc = ocfs2_clusters_for_bytes(inode->i_sb, len);
> - contig_clusters = ocfs2_clusters_for_blocks(inode->i_sb,
> - contig_blocks);
> - if (clusters_to_alloc > contig_clusters)
> - clusters_to_alloc = contig_clusters;
> -
> - /* allocate extent and insert them into the extent tree */
> - ret = ocfs2_extend_allocation(inode, cpos,
> - clusters_to_alloc, 0);
> - if (ret < 0) {
> - up_write(&OCFS2_I(inode)->ip_alloc_sem);
> - mlog_errno(ret);
> - goto bail;
> - }
> -
> - ret = ocfs2_extent_map_get_blocks(inode, iblock, &p_blkno,
> - &contig_blocks, &ext_flags);
> - if (ret < 0) {
> - up_write(&OCFS2_I(inode)->ip_alloc_sem);
> - mlog(ML_ERROR, "get_blocks() failed iblock=%llu\n",
> - (unsigned long long)iblock);
> - ret = -EIO;
> - goto bail;
> - }
> - up_write(&OCFS2_I(inode)->ip_alloc_sem);
> - }
> -
> - /*
> - * get_more_blocks() expects us to describe a hole by clearing
> - * the mapped bit on bh_result().
> - *
> - * Consider an unwritten extent as a hole.
> - */
> - if (p_blkno && !(ext_flags & OCFS2_EXT_UNWRITTEN))
> - map_bh(bh_result, inode->i_sb, p_blkno);
> - else
> - clear_buffer_mapped(bh_result);
> -
> - /* make sure we don't map more than max_blocks blocks here as
> - that's all the kernel will handle at this point. */
> - if (max_blocks < contig_blocks)
> - contig_blocks = max_blocks;
> - bh_result->b_size = contig_blocks << blocksize_bits;
> -bail:
> - if (alloc_locked)
> - ocfs2_inode_unlock(inode, 1);
> - return ret;
> -}
>
> ......
>
> +static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
> + loff_t offset)
> +{
> + struct file *file = iocb->ki_filp;
> + struct inode *inode = file_inode(file)->i_mapping->host;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + loff_t end = offset + iter->count;
> + get_block_t *get_block;
> +
> + /*
> + * Fallback to buffered I/O if we see an inode without
> + * extents.
> + */
> + if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL)
> + return 0;
> +
> + /* Fallback to buffered I/O if we do not support append dio. */
> + if (end > i_size_read(inode) && !ocfs2_supports_append_dio(osb))
> + return 0;
> +
> + if (iov_iter_rw(iter) == READ)
> + get_block = ocfs2_get_block;
> + else
> + get_block = ocfs2_dio_get_block;
> +
> + return __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
> + iter, offset, get_block,
> + ocfs2_dio_end_io, NULL, 0);
> +}
> +
> const struct address_space_operations ocfs2_aops = {
> .readpage = ocfs2_readpage,
> .readpages = ocfs2_readpages,
> ________________________________
> All the best wishes for you.
> zhangguanghui
> -------------------------------------------------------------------------------------------------------------------------------------
> 本邮件及其附件含有新华三技术有限公司的保密信息,仅限于发送给上面地址中列出
> 的个人或群组。禁止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、
> 或散发)本邮件中的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本
> 邮件!
> This e-mail and its attachments contain confidential information from New H3C, which is
> intended only for the person or entity whose address is listed above. Any use of the
> information contained herein in any way (including, but not limited to, total or partial
> disclosure, reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender
> by phone or email immediately and delete it!
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
More information about the Ocfs2-devel
mailing list