OCFS2/DesignDocs/OfflineResize

OCFS2 FILESYSTEM EXTEND SUPPORT

Sunil Mushran, Aug 21 2006

GOALS

The immediate goal is to allow users to extend an unmounted OCFS2 volume (offline growth). The next step would be to allow users to extend a mounted OCFS2 volume (online growth). While this document only discusses offline growth, the scheme is chosen to be compatible with our future online growth plans. (Online growth will be addressed in the next major filesystem release.)

USER INTERACTION

tunefs.ocfs2 will be the front-end tool through which the user performs the filesystem extend operation. The "-S" argument indicates a resize. If [blocks-count] is omitted, the tool grows the volume to the current size of the partition.

# tunefs.ocfs2
tunefs.ocfs2 1.2.2
usage: tunefs.ocfs2 [-N number-of-node-slots] [-L volume-label]
        [-J journal-options] [-S] [-qvV] device [blocks-count]

The use of blocks-count is compatible with its current use in mkfs.ocfs2.

STRUCTURES

ocfs2_chain_list is part of ocfs2_dinode with OCFS2_BITMAP_FL set.

struct ocfs2_chain_list {
/*00*/  __le16 cl_cpg;                  /* Clusters per Block Group */
        __le16 cl_bpc;                  /* Bits per cluster */
        __le16 cl_count;                /* Total chains in this list */
        __le16 cl_next_free_rec;        /* Next unused chain slot */
        __le64 cl_reserved1;
/*10*/  struct ocfs2_chain_rec cl_recs[0];      /* Chain records */
};

struct ocfs2_chain_rec {
        __le32 c_free;  /* Number of free bits in this chain. */
        __le32 c_total; /* Number of total bits in this chain */
        __le64 c_blkno; /* Physical disk offset (blocks) of 1st group */
};

struct ocfs2_group_desc
{
/*00*/  __u8    bg_signature[8];        /* Signature for validation */
        __le16  bg_size;                /* Size of included bitmap in bytes. */
        __le16  bg_bits;                /* Bits represented by this group. */
        __le16  bg_free_bits_count;     /* Free bits count */
        __le16  bg_chain;               /* What chain I am in. */
/*10*/  __le32  bg_generation;
        __le32  bg_reserved1;
        __le64  bg_next_group;          /* Next group in my list, in blocks */
/*20*/  __le64  bg_parent_dinode;       /* dinode which owns me, in blocks */
        __le64  bg_blkno;               /* Offset on disk, in blocks */
/*30*/  __le64  bg_reserved2[2];
/*40*/  __u8    bg_bitmap[0];
};

BACKGROUND

The global bitmap in OCFS2 splits the entire volume into groups of clusters, known as Block Groups or BGs for short. The global bitmap, via ocfs2_chain_list, can point directly to cl_count BGs. Once the total number of BGs exceeds that, new BGs are linked to the existing BGs, creating chains. Hence the term chained block groups. Using this scheme, OCFS2 is able to handle an unlimited number of block groups.

The first block of each BG, other than the first BG, contains the descriptor (ocfs2_group_desc). This descriptor not only points to the next BG (bg_next_group) in the chain but also contains the bitmap (bg_bitmap) for that group. The block# of the descriptor for the first BG is contained in the super block (s_first_cluster_group).

All BGs, other than the last BG, contain an equal number of clusters (cl_cpg). The number of clusters per group depends on the block size: the larger the block size, the larger the bitmap, and the more clusters it can track. For example, a 4K block can handle 32256 clusters, a 2K block 15872 clusters, and a 1K block only 7680 clusters. The last BG contains the remaining clusters.
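
The per-block-size figures above follow from the descriptor layout. A minimal sketch, assuming the descriptor header occupies the first 0x40 bytes of the block (the offset of bg_bitmap in ocfs2_group_desc) and that each bitmap bit tracks one cluster; the helper name max_clusters_per_group is hypothetical and not part of any OCFS2 tool:

```c
#include <stdint.h>

/* Assumption: bg_bitmap starts at offset 0x40, as annotated in the
 * ocfs2_group_desc layout, and fills the rest of the block. */
#define GROUP_DESC_HEADER_BYTES 0x40u

/* Hypothetical helper: largest number of clusters one block group can
 * track when its descriptor lives in a single block of block_size bytes.
 * Assumes one bitmap bit per cluster. */
static uint32_t max_clusters_per_group(uint32_t block_size)
{
        if (block_size <= GROUP_DESC_HEADER_BYTES)
                return 0;       /* no room for a bitmap */
        return 8u * (block_size - GROUP_DESC_HEADER_BYTES);
}
```

For a 4K block this gives 8 * (4096 - 64) = 32256, matching the figure quoted above.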

SCHEME

The scheme uses write ordering instead of journalling. Ordered writes are preferred because that scheme could later be extended to support online resize.

The scheme, in short, is to first extend the last BG before creating new ones and adding them to ocfs2_chain_list in cyclical order starting from one after the last BG.

The scheme in detail is as follows:

Validate the global bitmap to ensure the global bitmap inode and the chained block groups are all consistent. If this fails, abort.
Ensure the user is growing the volume by at least 1 cluster. If not, abort.
For simplicity, disallow combining a volume resize with adding slots or resizing the journal.
Write incompat flag OCFS2_FEATURE_INCOMPAT_RESIZE_INPROG in superblock.
Compute the number of new clusters, num_new_clusters.
If the volume has only one BG, ocfs2_chain_list->cl_cpg may not be the maximum possible. This is due to a quirk in mkfs. Set cl_cpg to the maximum before continuing. This is required because fsck assumes the maximum cl_cpg when computing the locations of block group descriptors.

        if (cl->cl_next_free_rec == 1) {
                if (cl->cl_cpg < 8 * gd->bg_size)
                        cl->cl_cpg = 8 * gd->bg_size;
        }

Read the descriptor of the last BG and top off gd->bg_bits.

        gd->bg_bits += cl->cl_bpc *
                        MIN(num_new_clusters,
                                (cl->cl_cpg - (gd->bg_bits/cl->cl_bpc)));

Initialize the first cluster of each new block group to zeros before populating the descriptor. This is required as we might use the "unused" space one day. Link the new BG to the next record in cl->cl_recs in cyclical order. If that record is already in use, make it point to the new BG and make the new BG point (via bg_next_group) to the BG the record previously referenced. While this scheme puts the new BGs at the head of each chain, that is not an issue as OCFS2 continuously reorders the chain lists so that the ones with more free space are at the top of the list.
Write the blocks in the following order:
1. Descriptor(s) of all the new BGs (multiple block writes)
2. Descriptor of the last BG (one block write)
3. Global Bitmap Inode (one block write)
4. Superblock (one block write)
Incompat flag OCFS2_FEATURE_INCOMPAT_RESIZE_INPROG is cleared during the Superblock write.
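
The arithmetic behind the steps above can be sketched as follows. This is an illustration only, not the tunefs.ocfs2 implementation: struct resize_plan, plan_resize and chain_for_new_group are hypothetical names, all quantities are in clusters, and "cyclical order starting from one after the last BG" is read here as starting at the chain following the one that holds the current last BG.

```c
#include <stdint.h>

/* Hypothetical summary of how num_new_clusters is distributed. */
struct resize_plan {
        uint32_t topoff_clusters;      /* clusters added to the current last BG */
        uint32_t new_groups;           /* number of new BGs to create */
        uint32_t last_group_clusters;  /* clusters in the final new BG, 0 if none */
};

/* cpg: cl_cpg after the mkfs quirk has been corrected.
 * last_bg_clusters: clusters currently covered by the last BG. */
static struct resize_plan plan_resize(uint32_t cpg, uint32_t last_bg_clusters,
                                      uint32_t num_new_clusters)
{
        struct resize_plan p = { 0, 0, 0 };
        uint32_t room, rest;

        if (cpg == 0)
                return p;       /* invalid geometry, nothing to plan */

        /* First extend the last BG up to cl_cpg clusters. */
        room = (last_bg_clusters < cpg) ? cpg - last_bg_clusters : 0;
        p.topoff_clusters = (num_new_clusters < room) ? num_new_clusters : room;

        /* Whatever is left goes into new BGs of cpg clusters each;
         * only the final one may be partial. */
        rest = num_new_clusters - p.topoff_clusters;
        if (rest > 0) {
                p.new_groups = rest / cpg + (rest % cpg != 0);
                p.last_group_clusters = rest - (p.new_groups - 1) * cpg;
        }
        return p;
}

/* Chain (index into cl_recs) for the i-th new BG, counting from 0. */
static uint16_t chain_for_new_group(uint16_t last_bg_chain, uint16_t cl_count,
                                    uint32_t i)
{
        if (cl_count == 0)
                return 0;       /* invalid chain list */
        return (uint16_t)((last_bg_chain + 1u + i) % cl_count);
}
```

For instance, with cpg = 32256, a last BG holding 1000 clusters and 63517 new clusters, the plan tops off the last BG with 31256 clusters and creates two new BGs, the second holding only 5 clusters.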

FAILURE SCENARIOS

The incompat flag will ensure that the fs will not be mounted and will force the user to fsck the volume.

When fsck detects this incompat flag, it will first clear the resize-related inconsistency before doing its regular checking.

The resize can fail at three points during the write phase.

Segfault during or after writing the new descriptors but before writing to the last BG.

This is not an issue as the new descriptors do not come into play until the global bitmap inode is flushed to disk. fsck will just clear the incompat flag.

Segfault after writing the last BG but before global bitmap inode.

In this case, the global bitmap will be out-of-sync with the last BG. However, the new BGs are still not visible. fsck will fix the global bitmap inode to be consistent with the last BG and then update num_clusters in the superblock.

Segfault after writing global bitmap but before the superblock.

fsck will remove all the new BGs that lie beyond the end-of-volume as determined by superblock->num_clusters.
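
The three cases can be told apart by comparing cluster counts. How fsck actually detects each state is not specified here; the sketch below is one possible reading, and the enum, the function name and its three inputs are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical recovery actions, one per failure scenario above. */
enum resize_recovery {
        RECOVER_CLEAR_FLAG_ONLY,   /* failed before the last BG was written */
        RECOVER_FIX_BITMAP_INODE,  /* last BG written, bitmap inode not */
        RECOVER_DROP_NEW_GROUPS    /* bitmap inode written, superblock not */
};

/* sb_clusters:     num_clusters recorded in the superblock.
 * bitmap_clusters: total clusters recorded in the global bitmap inode.
 * last_bg_end:     one past the last cluster covered by the last BG
 *                  reachable from the global bitmap inode, taken from
 *                  that BG's descriptor. */
static enum resize_recovery classify_interrupted_resize(uint32_t sb_clusters,
                                                        uint32_t bitmap_clusters,
                                                        uint32_t last_bg_end)
{
        /* Bitmap inode already describes more than the superblock does. */
        if (bitmap_clusters > sb_clusters)
                return RECOVER_DROP_NEW_GROUPS;

        /* Last BG was topped off but the bitmap inode still has old totals. */
        if (last_bg_end > bitmap_clusters)
                return RECOVER_FIX_BITMAP_INODE;

        /* Everything visible is consistent; only the flag remains. */
        return RECOVER_CLEAR_FLAG_ONLY;
}
```

In every case the incompat flag is cleared once the inconsistency has been repaired.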