OCFS2/DesignDocs/OnlineDefrag

Online Defragmentation

November 25, 2010

Introduction

It will always be a myth that filesystem on Linux don't need to be defragmented, though strategies like 'allocation reservation' to some extent helps a lot, preventing the fragmentation quite a bit, a long-term-lived contemporary filesystem still has the risk of being severely fragmented over time, making the online-defragmentation a reality is always on the roadmap of OCFS2.

Defragmentation Types

For all kinds of filesystem on Linux, users/developers always need to think about following three kinds of defragmentation, and we decide to start with the first one at the very beginning.

1. Single File Defragmentation.

It's kind of a best-effort strategy to coalesce tiny extents in a file.

2. Relevant Files Defragmentation.

) Guarantee a better locality for all inodes under the same directory. Tao's-Inode-Allocation-Strategy should relieve the issues a lot.
) Guarantee a better locality for all inodes' data under the same directory.

3. Free space defragmentation.

Get a global knowledge of all free-space chunks distributed in global_bitmap, and move them around to make bigger free-space chunks by killing/merging smaller ones.wengang's-defragmenation-strategy also talks about it.

Above three kinds of defragmentatinos will be performed via three ioctls.

The OCFS2_IOC_DEFRAG_* Ioctls

Like other contemporary filesystems, we're also planning to end online-defrag up with introducing new ioctls.

#define OCFS2_IOC_DEFRAG_FILE             _IOW('o', 6, struct ocfs2_defrag_extent)
#define OCFS2_IOC_DEFRAG_DIR              _IOW('o', 7, struct ocfs2_defrag_dir)       /* TO-DO */
#define OCFS2_IOC_DEFRAG_FREE_SPACE       _IOW('o', 9, struct ocfs2_defrag_freespace) /* TO-DO */

/* Flags */
#define OCFS2_DEFRAG_FL_XATTR               0x00000001   /* TO_DO */
#define OCFS2_DEFFAG_FL_SKIP_REFCOUNT       0x00000002   /* TO_DO */
#define OCFS2_DEFFAG_FL_SKIP_UNWRITTEN      0x00000004   /* TO_DO */

struct ocfs2_defrag_extent {
        /* start of the defrag operation */
        __u64 start;

        /* number of bytes to defrag, use (u64)-1 to say all */
        __u64 len;

        /*
         * in/out flags for the operation, using this to make xattr be defraged.
         * and return the status of defragmentation.
         */             
        __u32 flags;
                
        /*      
         * any extent bigger than this will be considered
         * already defragged.
         * Using 0 to let kernel figure out a practical 
         * threshold by scanning current global_bitmap layout.
         *
         */     
        __u32 extent_threshold;
                
        /* reserved for later */ 
        __u32 reserved[2];

}

I'm always wary of schemes that do the work in the kernel. It is a lot of complex code. Most other schemes do the work in userspace. As an example, ext4 allocates a shadow file in userspace, then has the kernel move the data. -- Joel

Now we're running into 'OCFS2_IOC_DEFRAG vs OCFS2_IOC_MOVE_EXT', using OCFS2_IOC_MOVE_EXT could end up becoming more attractive when we want a generic interface some day for userspace defragmenter, against all kinds of filesystem, however, userspace would need to gain a better knowledge on the layout/bitmap of fs's allocated/un-allocated blocks, from where the best allocation could fit. -- Tristan

Following is a original thought to be workable for both simple-strategy-of-defragmentation-in-kernel and extents-movements-in-kernel. -- Tristan

#define OCFS2_MOVE_EXT_FL_AUTO_DEFRAG   (0x00000001)    /* Kernel manages to
                                                           claim new clusters
                                                           as the goal place
                                                           for extents moving */
#define OCFS2_MOVE_EXT_FL_COMPLETE      (0x00000002)    /* Move or defragmenation
                                                           completely gets done.
                                                         */
struct ocfs2_move_extents {
/* All values are in bytes */
        /* in */
        __u64 me_start;         /* Virtual start in the file to move */
        __u64 me_len;           /* Length of the extents to be moved */
        __u64 me_goal;          /* Physical offset of the goal */
        __u64 me_thresh;        /* Maximum distance from goal or threshold
                                   for auto defragmentation */
        __u64 me_flags;         /* flags for the operation:
                                 * - auto defragmentation.
                                 * - refcount,xattr cases.
                                 */

        /* out */
        __u64 me_moved_len;     /* moved length */
        __u64 me_new_offset;    /* Resulting physical location */
        __u32 me_reserved[3];   /* reserved for futhure */
};

#define OCFS2_IOC_MOVE_EXT      _IOW('o', 6, struct ocfs2_move_extents)

Copy-On-Write

Online defragmentation for a single file to coalesce extents is much like that Copy-On-Write operations, as in, we claim new contiguous clusters, then read the pages from extents one by one, update/merge the extent records, then make_dirty_and_remap the pages to let writeback write the data into new allocated clusters.

Given that CoW codes on refcounting is well tested, all we need to do is to generalize these codes a bit to make it proper for defragmentation usage.

Basic Logic

1. Scanning global_bitmap to get a general idea of how freespace fragmented, by help of OCFS2_IOC_INFO_FREEFRAG? to get a practical threshold size of extent to be defragmented.

2. Preallocate/Claim contiguous clusters of threshold-size for extents to be coalesced.

Where are these stored? If the global bitmap sees them as allocated, we need to have a way to re-free them if the node crashes. fsck is not an acceptable answer. -- Joel

Well, how about we finish all defragmentation operations for a range of extent in one jounrnal transaction? which seems to prevent us from bothering adding additional bits/flags on-disk to accomplish the recovery. -- Tristan

3. Scan the extents from defrag-range's start+len to start(to guarantee the minimum rotation of extent rbtree), to figure out which extent needs to be defraged, read the pages one by one from this extents until the total size reach threshold, one round of defragmentation maybe immediately ends up without collecting enough clusters when encountering a hole or unwritten extent.

3. mark these pages dirty and remap them to new allocated clusters, let writeback flush the data.

4. Merge defraged extents and update corresponding extent record, and commit the journal. for refcounted extent, we also need to update refcount tree, things get a little bit complicated here. (Currently, for the sake of simplicity we're not going to take merging of extents in different extent block into account at the very beginning.)

5. Goto step 3 for a loop.

Things to think about:

1. Failure/error tolerance.

2. Not enough freespace left for new allocation of threshold size, we need a more flexible strategy here.

Locking

No extra locks need to be introduced other than inode lock?

I don't think you need new locks, but remember you'll be locking and unlocking allocators as you do the allocation and freeing of extents you move. -- Joel

Feature Bits

No need to add feature bit since we get nothing changed physically on-disk?

I wonder where you are going to store the record of the bits you've allocated while the copy is happening. Like I stated above, you need some way to recover them if the system crashes or has an error in the middle of defragmentation. -- Joel

Like what I said above, let one journal transaction guarantee this, adding bits/flags on-disk is not that reasonable since we've not changed/added new structures on-disk. -- Tristan

Debugging.

Do wee need to have a EXT_FLAG on coalesced extent to give a better chance for debugging?

Userspace

libocfs2 and defrag.ocfs2 will need changes to handle online defragmentation.

How To Test

1. Prepare a tool to frag fs/files in a relatively short time, e.g, multi-nodes growing-writes on the same file, or persistent updates on refcount-shared-extent in different offset to cause CoW frequently, which ends up generating a plenty of trivial separated extents.

2. Data integrity should be guaranteed after an online-defrag, it'll be better to take holes/unwritten area/reflink/xattr into account, data show up before/after the operation should be exactly identical.

3. Data consistency should be guaranteed when crash happens during online-defrag operation.

4. Performance comparison/benchmark test, by comparing extent numbers, or disk seeking time on write/read and the layout of o2info of course;)

5. Boundary test covering all corners will be extremely important to us, e.g. cross the extent block cases, within a hole etc.

6. Multi-nodes test.

7. Combination test with reflink/xattr/unwritten-extents/holes will be a PLUS.