[Ocfs2-devel] [PATCH] Treat writes as new when holes span across page boundaries

Goldwyn Rodrigues rgoldwyn at gmail.com
Wed Feb 23 12:54:19 PST 2011


Hi Joel,

On Wed, Feb 23, 2011 at 1:13 PM, Joel Becker <jlbec at evilplan.org> wrote:
>> >        Your last patch, where you force 'new', isn't quite right.  The
>> > cluster boundaries passed to map_pages aren't right, and the later
>> > zeroing could be wrong (zeroing earlier parts of the file that contained
>> > data).  At least, that's how I read it.  But I think you're in the right
>> > place to start.
>>
>> I still don't understand why? I am not considering clusters at all.
>> The flag is set to new if -
>> i_size <= page_offset <= pos
>
>        Thinking about it, I think you're right about this part of its
> safety.  My big concern is that 'new' means "I have a whole empty
> cluster" and that might not be true.
>        It's starting to look like your code is even closer to correct,
> but might not be located exactly where it needs to be.  Let's look at
> what should_zero means.
>        In ocfs2_write_cluster(), it means the calling functions have an
> entire new or unwritten cluster.  That's why we take v_blkno as cpos
> when should_zero is true.  We know that the caller has set us up with an
> entire cluster and we have to zero the front or all of it.  That's great
> when we call down to ocfs2_prepare_page_for_write(), as we are able to
> trust we can zero anything we see with impunity.  That's why we zero
> whole ranges (cluster_start..cluster_end).  Remember here that a write
> context can have multiple pages (to cover the front of a large cluster
> that's just been allocated) or multiple clusters per page (16K/64K page
> size).  should_zero (from c_needs_zero) tells us that the entire cluster
> we're looking at is newly active.
>        Down in ocfs2_map_page_blocks(), we must not venture outside
> from/to, as other code (not the write path) assumes it will not pull in
> blocks that it doesn't need.  So that's not a place to change.
>        What about the loop at the bottom of ocfs2_write_cluster()?  The
> one that calls ocfs2_prepare_page_for_write()?  It can walk multiple
> pages, some of which may be the front of a newly allocated cluster
> (Imagine this write is allocating a 1MB cluster at the end of the file,
> but pos is 512K inside the new cluster.  This code handles zeroing all
> the pages between the start of the cluster and pos).  I suppose we could
> check here, but ocfs2_prepare_page_for_write() has logic based on target
> vs non-target page, and we don't want to override it.
>        So we do want to make a change inside
> ocfs2_prepare_page_for_write().  Let's look at your change.  You trigger
> 'new' even if we're not the target page.  This means the BUG_ON() in the
> else{} can never fail for the extend I described in the last paragraph.
> The calling code should have made sure we are correctly passing
> should_zero to all non-target pages.  Having new overridden for them is
> wrong.  So maybe we move your check inside the target_page if.  It only
> really matters for the target page.
>        But wait!  Check out block_write_full_page_endio() in
> fs/buffer.c!  Especially this part:
>
>        zero_user_segment(page, offset, PAGE_CACHE_SIZE);
>
> It is zeroing the tail of a page when it straddles i_size to make sure
> it goes out correctly.  But, of course, blocks past i_size never go to
> disk.  And we don't care; readpage should handle them correctly, right?
> So if writepage is going to zero for us, and readpage is going to read
> correctly, how are you getting bad data?
>        I was about to think we were done, but now I'm not so sure.  I
> have a question: with my change to ocfs2_should_read_block(), but
> without your patch, what do you see for corruption?  Your pattern
> written in the first pass, or complete random garbage?
>

Joel, you are getting confused in the list of parameters to reproduce
the bug. I have listed it in the past, but not in a single location so
it is difficult to pick up the bits. I will put it together for you.
Conditions are -

1. blocksize < clustersize
2.  the previous write ended in a page boundary, but not a cluster
boundary. eg, clustersize=16K but pos is at the page boundary = 4k
3. you wrote something at the page boundary
4. there is a hole created in the page from 4k-8k in *between* say 4128-4640
5. i_size has moved past the second page, or past 8192

What I am seeing is pure junk(not 0xbaadfeed), and not what I had
written earlier.
The first 4k of the cluster is always zeroed, it is the consecutive
ones which are not.

In retrospect, since we have covered the rest of the bases, we can set
new on the condition:

i_size == page_offset == pos

I interpreted new as new page or new data vs new cluster, and perhaps
that is the reason I placed the condition there.

-- 
Goldwyn



More information about the Ocfs2-devel mailing list