[Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries

Wed Feb 23 06:43:44 PST 2011

Hi Joel,

On Wed, Feb 23, 2011 at 3:39 AM, Joel Becker <jlbec at evilplan.org> wrote:
> On Tue, Feb 22, 2011 at 06:05:20PM -0600, Goldwyn Rodrigues wrote:
>> On Tue, Feb 22, 2011 at 5:34 PM, Joel Becker <jlbec at evilplan.org> wrote:
>> >        I may have been confused because ocfs2_zero_extend() only used
>> > to matter for nonsparse, where this behavior was different.  Which of
>> > the above did you mean?
>> >
>
> Key:
>  EOF -> End of File, or i_size.
>  EOB -> End of Block, the end of the block containing i_size.  So if
>        i_size is 8000 and blocksize is 4096, EOB is 8192.
>  EOC -> End of Cluster, the end of the cluster containing i_size.  So if
>        i_size is 8000, blocksize is 4096, and clustersize is 16K, EOC
>        is 16384, two blocks past i_size.
>
> Hey Goldwyn,
>        I want to apologize for not quite getting the point of your
> original patch.  The legacy ocfs2_zero_extend() name put me in the mind
> of our old nonsparse allocation code, and I saw EOF->EOC and assumed it
> was at allocation time, not i_size change time.  I knew I was still
> reloading the state in my brain and could be missing something, but that
> still must have been frustrating to you.
>

I understand. However, it opened more avenues to look for trouble-making code.

>> Not now, but it may in the future and that is the problem.
>>
>> So consider your case of usual parameters, but i_size=4096.
>> Now you append 32 bytes of data to it, so pos = i_size = 4096
>> and len=32
>> write(fd, 32 bytes of data);
>>
>> In this case, zero_start and zero_to_size in zero_extend_file are both
>> 4096 and it will not zero this page in ocfs2_zero_extend.
>
>        Correct.
>
>> After the write, i_size=4128.
>
>        And the block is zeroed out to 8191 (or at least it should be)
> by.
>

by...?

>> Now lseek 512 bytes -
>> lseek(fd, 512, SEEKCUR);
>>
>>
>> ..and write 4096 bytes again.
>> write(fd, 4096bytes of data);
>>
>> pos=4640 len=4096 bytes, and after the transaction i_size=8736.
>>
>> i_size is beyond 8192 now
>>
>> Hole between 4128-4640 was never zeroed and carries junk. Does that
>> make sense or am I doing the math wrong here?
>
>        That may be the problem, but that is not what should have
> happened.  write_begin() should have zeroed 4128-8192 while processing
> the 32-byte write.  Linux always assumes that you zero a block when you
> bring it into service.  Now, this is not the responsibility of
> ocfs2_zero_tail(), because i_size == pos.  ocfs2_zero_tail() is supposed
> to work when i_size < pos.  It's when you prepare the pages that this
> matters.

So, who should zero the page? I assume map_pages

>        Think about your first patch.  It zeros from i_size to pos.  But

Actually, it zeroed from i_size to EOC. But again I think that might
be incorrect because of the commits you mentioned.

> we're supposed to be guaranteed that zeros exist from i_size to EOB.
> Thus I read it as zeroing during allocation, not when i_size moves to
> cover already allocated clusters.  The "correct" approach (as Linux sees
> it) would zero from EOB to pos (or, actually, EOB to the start of the
> block containing pos).  We're already supposed to be doing that; that's
> part of what I recently fixed.  Later in in our discussion I thought you
> might have found a bug in my fix.  That's why I looked at the
> ocfs2_should_read_block() code.  Now I see that is not the case (though
> my patch is probably correct for ocfs2_should_read_block(), it is
> unrelated to the problem you are seeing).

Yes, I agree it is correct, and does partially resolve the issue.

>        Given your example above, at write(fd, 32 bytes of data),
> ocfs2_prepare_pages_for_write() should be zeroing 4128-8192 before the
> write code is allowed to copy the 32 bytes into 4096-4128.  Everyone
> expects it to work this way (xfs, extN, etc).  That's exactly what
> write_begin() is for.  It gives back a page array where all the pages
> are properly allocated and zeroed.  Ours is complex because we have to
> handle cluster allocation, where clusters, blocks, and pages can be all
> be various multiples of each other.

I agree, but bouncing off "len" over functions might be an overkill,
so I thought zero the page (4096-8192) before writing to it. (refer:
"set new" patch)

>        Your last patch, where you force 'new', isn't quite right.  The
> cluster boundaries passed to map_pages aren't right, and the later
> zeroing could be wrong (zeroing earlier parts of the file that contained
> data).  At least, that's how I read it.  But I think you're in the right
> place to start.

I still don't understand why? I am not considering clusters at all.
The flag is set to new if -
i_size <= page_offset <= pos

There can be no data beyond i_size, so I don't think it will zero any
previously written data. Can you explain in terms of values of i_size,
cluster size, block size, page size=4k, pos, len etc? I think I have
asked this before as well.

-- 
Goldwyn