[Ocfs2-devel] [PATCH 19/41] ocfs2: Integrate CoW in file write.

Tao Ma tao.ma at oracle.com
Fri Aug 21 16:17:55 PDT 2009



Joel Becker wrote:
> On Tue, Aug 18, 2009 at 02:19:20PM +0800, Tao Ma wrote:
>> +	if (ret == -ETXTBSY) {
>> +		BUG_ON(refcounted_cpos == UINT_MAX);
>> +		cow_len = wc->w_clen - (refcounted_cpos - wc->w_cpos);
>> +
>> +		ret = ocfs2_refcount_cow(inode, di_bh,
>> +					 refcounted_cpos, cow_len);
>> +		if (ret) {
>> +			mlog_errno(ret);
>> +			goto out;
>> +		}
> 
> 	I've just realized two more problems.  Well, one is a bug;
> the other is merely inefficient.
> 	First, the inefficiency.  We've cooked up an
> ocfs2_refcount_cow() that can handle any cpos+write_len.  But we call it
> from ocfs2_write_begin_nolock(), which only goes a page at a time.  So
> even for a 1GB write, we're going to CoW 1MB at a time.  For the first
> page of the I/O, we'll call ocfs2_refcount_cow().  This will try to CoW
> just the page.  We'll pad that out to 1MB in cal_cow_clusters().  For
> the next few pages up to 1MB of I/O it will see the now-CoWed clusters.
> But then it gets to the first page of the second MB.  It will CoW the
> second MB, and so on.  We've just split the 1GB range into 1MB hunks on
> disk.
yes, that is anticipated. We CoW 1MB at most at a time.
> 	Now, we have to check REFCOUNTED in write_begin() (well,
> populate_write_desc()) because that's how we trap mmap().  So we leave
> it here.  But for a regular write, we know the entire length up in
> ocfs2_file_aio_write().  So in ocfs2_prepare_inode_for_write(), right
> before the direct_io checks, why don't we just CoW the entire write
> there?  Create a check_for_refcount just like check_for_holes, except
> instead of filling holes you CoW.  The function can easily skip out if
> there's no refcount tree on the inode.  This gives us large CoW regions.
> We're going to have to do the CoW anyway.  When a regular write gets
> into populate_write_desc(), it won't find any refcounted records, so
> there's no more work at that level.
yes, we can put a check there, but we can't resolve the 1MB issue you 
mentioned above either. Maybe we can make ocfs2_refcount_cow more 
intelligent? But I would say let us leave it as-is and this can be a 
future improvement.
> 	Even better, this fixes the bug.  What's the bug?  The current
> code doesn't CoW O_DIRECT writes!  We only check in prepare_write_desc,
> which we don't use for O_DIRECT!  And ocfs2_direct_IO_get_blocks()
> doesn't trigger buffered fallback either!  Well, we don't want buffered
> fallback.  We want CoW followed by real O_DIRECT.  ANd if we do the CoW
> up in prepare_inode_for_write(), we get it.  Plus, we can put a
> BUG_ON(ext_flags & REFCOUNTED) in direct_IO_get_blocks().
oh, yes, this is really a bug. I don't think of O_DIRECT when I created 
this patch set. So I may really need to add a check in 
ocfs2_prepare_inode_for_write(I guess just need to call 
ocfs2_refcount_cow and make write_len<=1MB).

This also make me think that we can cal ocfs2_refcount_cow right before 
we populate_write_desc, so that we don't need to call it twice and we 
can directly BUG_ON(ext_flags & REFCOUNTED) in it.

Regards,
Tao



More information about the Ocfs2-devel mailing list