

Discontiguous Block Groups

Joel Becker

March 9, 2010


A major problem facing ocfs2 is growing our metadata suballocators in the face of filesystem fragmentation. The fragmentation comes from a variety of places, all of which are worth attacking. This design attacks the worst case, when every other improvement has done its best.



The solution is discontiguous block groups. A discontiguous block group is a block group that is not carved from one contiguous range of clusters; rather, it is made up of a few extents of clusters. There is still one group descriptor. The group descriptor not only contains the block group's bitmap, it also contains the list of extents making up the block group.

Structure Review

/*
 * On disk extent record for OCFS2
 * It describes a range of clusters on disk.
 *
 * Length fields are divided into interior and leaf node versions.
 * This leaves room for a flags field (OCFS2_EXT_*) in the leaf nodes.
 */
struct ocfs2_extent_rec {
/*00*/  __le32 e_cpos;          /* Offset into the file, in clusters */
        union {
                __le32 e_int_clusters; /* Clusters covered by all children */
                struct {
                        __le16 e_leaf_clusters; /* Clusters covered by this
                                                   extent */
                        __u8 e_reserved1;
                        __u8 e_flags; /* Extent flags */
                };
        };
/*08*/  __le64 e_blkno;         /* Physical disk offset, in blocks */
/*10*/
};

/*
 * On disk allocator group structure for OCFS2
 */
struct ocfs2_group_desc
{
/*00*/  __u8    bg_signature[8];        /* Signature for validation */
        __le16  bg_size;                /* Size of included bitmap in
                                           bytes. */
        __le16  bg_bits;                /* Bits represented by this
                                           group. */
        __le16  bg_free_bits_count;     /* Free bits count */
        __le16  bg_chain;               /* What chain I am in. */
/*10*/  __le32  bg_generation;
        __le32  bg_reserved1;
        __le64  bg_next_group;          /* Next group in my list, in
                                           blocks */
/*20*/  __le64  bg_parent_dinode;       /* dinode which owns me, in
                                           blocks */
        __le64  bg_blkno;               /* Offset on disk, in blocks */
/*30*/  struct ocfs2_block_check bg_check;      /* Error checking */
        __le64  bg_reserved2;
/*40*/  __u8    bg_bitmap[0];
};

struct ocfs2_dinode {
/*00*/  __u8 i_signature[8];            /* Signature for validation */
        __le32 i_generation;            /* Generation number */
        __le16 i_suballoc_slot;         /* Slot suballocator this inode
                                           belongs to */
        __le16 i_suballoc_bit;          /* Bit offset in suballocator
                                           block group */
        /* ... remaining fields elided ... */
};

/* Sizing of block groups, from mkfs.c */
static int
ocfs2_clusters_per_group(int block_size, int cluster_size_bits)
{
        int megabytes;

        switch (block_size) {
        case 4096:
        case 2048:
                megabytes = 4;
                break;
        case 1024:
                megabytes = 2;
                break;
        case 512:
        default:
                megabytes = 1;
                break;
        }

        return (megabytes << ONE_MB_SHIFT) >> cluster_size_bits;
}

Note that a block group consists of 2048 bits for the 512B, 1K, and 2K blocksizes. It is only 1024 bits for the 4K blocksize. We'll consider the 2048 bit case, because it uses up 256 bytes of bitmap in the group descriptor.

The ocfs2_group_desc header only uses 64 bytes. Add the 256 bytes of the bitmap, and you have 320 bytes used. A 512B block still has 192 bytes free. That's 12 extent records. A 1K block can store 44 extent records, a 2K block can store 108 records, and a 4K group descriptor can store 244 extent records. Clearly we have plenty of space to spare.

How It Works

When trying to allocate a block group, if a cpg-sized contiguous region can't be found, the code tries to get N contiguous regions that add up to cpg. As long as N is less than or equal to the number of extent records that fit inside the group descriptor, the N regions are stored in the extent list and the discontiguous block group is linked into the chain. Obviously, the first extent in the list is the region containing the group descriptor.
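An illustrative sketch of that fallback (not the actual allocator; the "avail" array stands in for whatever free runs the cluster allocator can hand back, largest first):

```c
/* Illustrative sketch: gather up to max_recs free runs that together
 * cover cpg clusters.  Returns the number of extents used, or -1. */
struct run {
        unsigned long long start;       /* first cluster of the run */
        unsigned int len;               /* length of the run, in clusters */
};

static int gather_group(unsigned int cpg, int max_recs, struct run *recs,
                        const struct run *avail, int nr_avail)
{
        unsigned int found = 0;
        int n = 0, i;

        for (i = 0; i < nr_avail && found < cpg; i++) {
                if (n == max_recs)
                        return -1;      /* too fragmented: fail the alloc */
                recs[n] = avail[i];
                if (recs[n].len > cpg - found)
                        recs[n].len = cpg - found;      /* trim the last run */
                found += recs[n].len;
                n++;
        }
        /* recs[0] is the region that will hold the group descriptor */
        return found == cpg ? n : -1;
}
```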

What about allocating metadata from the group? In a contiguous block group, the suballoc bit is converted to a block number by adding it to the group descriptor's block number. A discontiguous block group must look up the bit in its extent list to convert it to a block number.
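Conceptually, the lookup walks the extent list until the bit falls inside a region. A sketch (the "gd_rec" type and its fields are invented here, and regions are given in blocks; the real extent list stores clusters and would convert first):

```c
/* Illustrative sketch of the bit-to-block lookup in a discontiguous
 * group: walk the regions, skipping each region's bits in turn. */
struct gd_rec {
        unsigned long long blkno;       /* first block of this region */
        unsigned int blocks;            /* region length, in blocks */
};

static unsigned long long bit_to_blkno(const struct gd_rec *recs,
                                       int nr_recs, unsigned int bit)
{
        int i;

        for (i = 0; i < nr_recs; i++) {
                if (bit < recs[i].blocks)
                        return recs[i].blkno + bit;     /* bit lands here */
                bit -= recs[i].blocks;  /* skip this region's bits */
        }
        return 0;       /* invalid bit: past the end of the group */
}
```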

Freeing a block of metadata takes another change. Currently, a metadata block can subtract its suballoc_bit from its own block number to arrive at the group descriptor's location. This doesn't work when the group descriptor lives in another region. The suballoc_bit math comes from the cluster allocator, where data clusters have no headers; our metadata blocks do have headers. So, if a metadata block was allocated from a discontiguous group, it uses an __le64 of reserved space to store its group descriptor's location. This is easy for all metadata types except ocfs2_extent_block, where we are using up the last reserved field. With this field set, the metadata block can get back to its group descriptor just as easily as before.
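The free path then becomes a two-way choice. A sketch (the struct and the "suballoc_loc" name are illustrative, not on-disk format):

```c
/* Sketch of locating the group descriptor when freeing a metadata
 * block: use the stored location if set, else the old subtraction. */
struct md_block {
        unsigned long long blkno;        /* this block's own location */
        unsigned int suballoc_bit;       /* bit offset within the group */
        unsigned long long suballoc_loc; /* gd location; 0 if contiguous */
};

static unsigned long long group_desc_blkno(const struct md_block *mb)
{
        if (mb->suballoc_loc)   /* allocated from a discontiguous group */
                return mb->suballoc_loc;
        /* contiguous group: the old subtraction still works */
        return mb->blkno - mb->suballoc_bit;
}
```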

Note that our operations, while on discontiguous regions, actually involve the same amount of math. I don't expect this to impact our metadata allocator performance at all.

Tada, discontiguous block groups.


Filesystems supporting discontiguous block groups will have to set the OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG bit. Older filesystems and tools won't know how to free these blocks, so they are unable to run recovery.

Group descriptors that describe discontiguous block groups will have to have a flag set to show they are discontiguous. This will allow contiguous block groups to be brought in from older filesystems with no change.

Metadata blocks allocated from a discontiguous block group will have a flag set indicating that their group descriptor location is valid. This way, we don't have to walk existing filesystems and set the group descriptor location when enabling this feature. Contiguous block groups and the blocks they contain look identical to block groups from older versions.

This gives us a couple of nice properties. Turning on the feature is just setting the INCOMPAT bit. If no discontiguous block groups exist, disabling the feature just requires clearing the INCOMPAT bit. Sunil suggests we reject disabling the feature if any discontiguous groups exist rather than working hard to relocate them. I think that's a good first pass.

Most of our metadata types already have a flags field. We can either use that for a HAS_GROUP_DESCRIPTOR_LOC flag, or we can simply define (group_desc_loc == 0) to mean that suballoc_bit is absolute. I think we zero our unused fields religiously. I'm interested in opinions on that one.

Open questions

What if we run out of extents?

The current proposal assumes that we still keep each group, contiguous or not, at cpg in size. What happens, though, if your 512B filesystem can't allocate a 1MB block group in 12 extents? The current proposal says that the operation fails and no block group is created. This is the simplest answer. The worst case is 4K clusters, which requires 256 clusters for 1MB. A 512B group descriptor has 12 slots, so you'd need 12 extents of 22 clusters each. If we're that badly fragmented, it's time to give up. Larger block sizes have even more extents and can handle higher degrees of fragmentation.

But what if we allowed it? Let's say you're only able to get 12 extents of 20 clusters each. This makes 240 clusters, or 960K - not quite the 1MB specified for a 512B blocksize filesystem. How does our code work out? Pretty well, actually. The suballocator code honors bg_bits, not cpg. As long as bg_bits is set correctly, the allocation and freeing of metadata blocks would work correctly. The only changes you'd have to make are a few places we validate a bit against cpg when it could just as easily be validated against bg_bits.
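The validation change is correspondingly small. A sketch (illustrative; the real checks live in the kernel suballocator and fsck):

```c
/* Sketch: validate a suballocator bit against the group's own bg_bits
 * rather than the nominal clusters-per-group, so a short 240-bit group
 * passes the same checks as a full one. */
static int suballoc_bit_valid(unsigned int bit, unsigned int bg_bits)
{
        return bit < bg_bits;
}
```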


Here's the thing: any long-lived mostly-full filesystem will get here, no matter how well we pack it as it grows. We're going to have to handle this kind of fragmentation. Let's do it now.
