[Ocfs2-devel] [PATCH 4/6] ocfs2: budget for extent tree splits when adding refcount flag

Darrick J. Wong darrick.wong at oracle.com
Thu Nov 10 09:11:59 PST 2016


On Thu, Nov 10, 2016 at 05:20:27PM +0800, Darwin wrote:
> Hello,
> 
> On Thu, Nov 10, 2016 at 6:51 AM, Darrick J. Wong
> <darrick.wong at oracle.com> wrote:
> > When we're adding the refcount flag to an extent, we have to budget
> > enough space to handle a full extent btree split in addition to
> 
> May I ask some questions (possibly stupid):
> 1) why and when would extend btree split? From my understanding, if I do
> 
> $cp --reflink a b
> 
> a is not inline-data. refcount tree will be allocated, every extent
> record of "a" will refer to
> refcount record respectively and be marked with refcounted, operations
> like this also for "b".
> So, I think splitting only happens when writing on them, CMIIW;-)

The VFS reflink interface (FICLONERANGE) and dedupe interface (FIDEDUPERANGE)
allows callers to specify the range of bytes on which to operate, which means
that we can share arbitrary parts of two files.  For example, let's say you
have one regular extent in a file's block map:

RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR (regular extent)

Now ask reflink to share the middle of that extent with some other file.
ocfs2's extent mapping record can store a flag indicating that the
extent could be shared, which means that the one record now splits into
three:

RRRRRRRRRRRssssssRRRRRRRRRRRRR (regular, shared, regular)

This scenario happens if you run duperemove against an ocfs2 filesystem
to deduplicate file data, since dedupe wants the filesystem to be able
to share arbitrary blocks of files.

You're correct that cp --reflink only ever deals with entire extents
because it calls FICLONERANGE with both file offsets zero and the length
set to the length of the source file.  The important thing to remember
is that we are not limited to sharing entire files, even if the
userspace utilities don't take advantage of it.

(FWIW the duperemove program does take advantage of it.)

> 2) what do you mean by "*full* extent btree"?

(Slight correction to that -- I should have said "full extent tree split".)

Record insertion operations on a tree structure all follow the same
basic strategy -- search from the root towards the records in the leaf
blocks until we find the place where the record would be, memmove() all
the records following that spot up by one index, copy the record data
into the newly vacated slot, and (if necessary) walk back up the tree to
update the interior node pointers.

If the desired leaf is already full, however, there is a problem -- we
have to split the leaf into two half-full leaf blocks before we can
insert the record.  We must also add a pointer to the new leaf into the
next level up in the tree.  If that interior node is also full, we split
the interior node prior to adding the new leaf block pointer.  We must
then add a pointer to the new interior node into the next level up in
the tree, and so on until we reach the root.

In short, if we want to add a record to a tree whose blocks are
completely full, we end up splitting blocks all the way up the tree.

> > whatever modifications have to be made to the refcount btree.  We
> > don't currently do this, with the result that generic/186 crashes
> > when we need an extent split but not a refcount split because meta_ac
> > never gets allocated.
> 
> 3) in what situation, will this happen? - "we need an extent split but
> not a refcount split".
> Could you please explain more by example?

An extreme example would be a program like this:

- write to block zero
- for i in 1 to 524288,
  - reflink block zero to block $i

When this program terminates, the refcount tree will contain a single
refcount record ($phys_blk, len=1, refcount=524288).  The extent map for
this file, however, will have 524,288 extent records:

($phys_blk, len=1, offset=0, flags=shared)
($phys_blk, len=1, offset=1, flags=shared)
...
($phys_blk, len=1, offset=524288, flags=shared)

There are 524288 records.  A 4k leaf can fit 252 records, so there will
be 2081 leaf blocks.  A 4k node also can fit 252 records, so there will
be 9 node blocks pointing to leaves, and one root block to point to the
first level nodes.  Clearly, the extent tree has split many times across
all the reflink operations.  However, the refcount tree never splits
because there's only one record.

--D

> 
> Thanks,
> Darwin
> 
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong at oracle.com>
> > ---
> >  fs/ocfs2/refcounttree.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> >
> > diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
> > index 59be8f4..d92b6c6 100644
> > --- a/fs/ocfs2/refcounttree.c
> > +++ b/fs/ocfs2/refcounttree.c
> > @@ -3698,6 +3698,9 @@ int ocfs2_add_refcount_flag(struct inode *inode,
> >         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> >         struct ocfs2_alloc_context *meta_ac = NULL;
> >
> > +       /* We need to be able to handle at least an extent tree split. */
> > +       ref_blocks = ocfs2_extend_meta_needed(data_et->et_root_el);
> > +
> >         ret = ocfs2_calc_refcount_meta_credits(inode->i_sb,
> >                                                ref_ci, ref_root_bh,
> >                                                p_cluster, num_clusters,
> >
> >
> > _______________________________________________
> > Ocfs2-devel mailing list
> > Ocfs2-devel at oss.oracle.com
> > https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> 
> 
> -- 
> Thanks,
> Darwin



More information about the Ocfs2-devel mailing list