[Ocfs2-devel] [PATCH 17/41] ocfs2: Add CoW support.

Tao Ma tao.ma at oracle.com
Thu Aug 20 19:04:18 PDT 2009



Joel Becker wrote:
> On Tue, Aug 18, 2009 at 02:19:18PM +0800, Tao Ma wrote:
>> +		if (*cow_len + leaf_clusters >= max_clusters) {
>> +			if (*cow_len == 0) {
>> +				/*
>> +				 * cpos is in a very large extent record.
>> +				 * So just split max_clusters from the
>> +				 * extent record.
>> +				 */
>> +				if ((rec_end - cpos) <= max_clusters) {
>> +					/*
>> +					 * We can take max_clusters off
>> +					 * the end and cover all of our
>> +					 * write.
>> +					 */
>> +					*cow_start = rec_end - max_clusters;
> 
> 	I just had a thought.  What if leaf_clusters is more than
> max_clusters, but our cpos is near the end of the extent?  This only
> works for the first pass (*cow_len == 0).  Imagine we have an extent of
> 1.5MB, and our cpos is at 1MB, and we want to write 1MB.  Clustersize of
> 256K.
> 
>   |<-- rec->e_cpo ==0                   leaf_clusters == 6 -->|
>   |--<256>--|--<256>--|--<256>--|--<256>--|--<256>--|--<256>--|
>                                           `cpos
> 
> When we get here, *cow_len is 0, *cow_start is 0 (rec->e_cpos), rec_end
> is 6 (leaf_clusters), cpos is 4 because the user wants to write at 1MB
> into the file, max_clusters is 4 (1MB) and write_len is 4 (1MB write).
> 
> 1. *cow_len + leaf_clusters >= max_clusters
>    0        + 6             >= 4  -> TRUE
> 
> 2. (rec_end - cpos) <= max_clusters
>    6        - 4     <= 4  -> TRUE
> 
> 3. *cow_start = rec_end - max_clusters
>               = 6       - 4  -> 2
> 
> 4. *cow_len = max_clusters
>             = 4
> 
> 	This leaves us with a *cow_start of 2 and a *cow_len of 4.  This
> *correctly gets us the last MB of the extent.  That's good.
> However, our write was 1MB from cpos 4.  We're returning to the caller
> 1MB from cpos 2.  We're going to do a short write.
oh, yes, this is really a bug. I guess the reason why tristan's test 
case can't find it is that we are now called from ocfs2_write_begin. 
Normally the write_len will at most be a PAGE_SIZE and the start is 
aligned to PAGE_SIZE also. With our x86_64 boxes, PAGE_SIZE=4096, we 
always CoW the right pos. But with PAGE_SIZE=64k, it will be exposed.
So how about change the first check to:
			if (((rec_end - cpos) <= max_clusters) &&
			    (cpos + write_len <= rec_end)) {

> 
>> +				} else if ((*cow_start + max_clusters) >
>> +					   (cpos + write_len)) {
> 
> 	Should this be >=?  I think it should be, and I think it's my
> fault.  But check to make sure.
yeah, ">=" will be more accurate. Actually, with ">", we can survive in 
a less efficient way since the "else" will cover this case and CoW a 
different part(start from another pos).

Regards,
Tao
> 
>> +					/*
>> +					 * We can take max_clusters off
>> +					 * the front and cover all of
>> +					 * our write.
>> +					 */
>> +					/* NOOP, *cow_start is already set */
>> +				} else {
>> +					/*
>> +					 * We're CoWing more data than
>> +					 * write_len for contiguousness,
>> +					 * but it doesn't fit at the
>> +					 * front or end of this extent.
>> +					 * Let's try to slice the extent
>> +					 * up nicely.  Optimally, our
>> +					 * CoW region starts at a
>> +					 * multiple of max_clusters.  If
>> +					 * that doesn't fit, we give up
>> +					 * and just CoW at cpos.
>> +					 */
>> +					*cow_start +=
>> +						(cpos - *cow_start) &
>> +							~(max_clusters - 1);
>> +					if ((*cow_start + max_clusters) <
>> +					    (cpos + write_len))
>> +						*cow_start = cpos;
>> +				}
>> +			}
>> +			*cow_len = max_clusters;
>> +			break;
>> +		} else
>> +			*cow_len += leaf_clusters;
>> +
>> +		/*
>> +		 * If we reach the end of the extent block and don't get enough
>> +		 * clusters, continue with the next extent block if possible.
>> +		 */
>> +		if (i + 1 == le16_to_cpu(el->l_next_free_rec) &&
>> +		    eb && eb->h_next_leaf_blk) {
>> +			brelse(eb_bh);
>> +			eb_bh = NULL;
>> +
>> +			ret = ocfs2_read_extent_block(INODE_CACHE(inode),
>> +					       le64_to_cpu(eb->h_next_leaf_blk),
>> +					       &eb_bh);
>> +			if (ret) {
>> +				mlog_errno(ret);
>> +				goto out;
>> +			}
>> +
>> +			eb = (struct ocfs2_extent_block *) eb_bh->b_data;
>> +			el = &eb->h_list;
>> +			i = -1;
>> +		}
>> +	}
>> +
>> +out:
>> +	brelse(eb_bh);
>> +	return ret;
>> +}
> 
> Joel
> 



More information about the Ocfs2-devel mailing list