OCFS2/DesignDocs/OfflineResize

OCFS2 FILESYSTEM EXTEND SUPPORT

Sunil Mushran, Aug 21 2006

GOALS

The immediate goal is to allow users to extend an unmounted OCFS2 volume (offline growth). The next step would be to allow users to extend a mounted OCFS2 volume (online growth). While this document discusses only offline growth, the scheme is chosen to be compatible with our future online growth plans. (Online growth will be addressed in the next major filesystem release.)

USER INTERACTION

tunefs.ocfs2 will be the front-end tool with which the user performs the filesystem extend operation. The "-S" argument will indicate resize. In the absence of [blocks-count], the tool will grow the volume to the current size of the partition.

# tunefs.ocfs2
tunefs.ocfs2 1.2.2
usage: tunefs.ocfs2 [-N number-of-node-slots] [-L volume-label]
        [-J journal-options] [-S] [-qvV] device [blocks-count]

The use of blocks-count is compatible with its current use in mkfs.ocfs2.

STRUCTURES

ocfs2_chain_list is embedded in any ocfs2_dinode that has OCFS2_BITMAP_FL set.

struct ocfs2_chain_list {
/*00*/  __le16 cl_cpg;                  /* Clusters per Block Group */
        __le16 cl_bpc;                  /* Bits per cluster */
        __le16 cl_count;                /* Total chains in this list */
        __le16 cl_next_free_rec;        /* Next unused chain slot */
        __le64 cl_reserved1;
/*10*/  struct ocfs2_chain_rec cl_recs[0];      /* Chain records */
};

struct ocfs2_chain_rec {
        __le32 c_free;  /* Number of free bits in this chain. */
        __le32 c_total; /* Number of total bits in this chain */
        __le64 c_blkno; /* Physical disk offset (blocks) of 1st group */
};

struct ocfs2_group_desc
{
/*00*/  __u8    bg_signature[8];        /* Signature for validation */
        __le16  bg_size;                /* Size of included bitmap in bytes. */
        __le16  bg_bits;                /* Bits represented by this group. */
        __le16  bg_free_bits_count;     /* Free bits count */
        __le16  bg_chain;               /* What chain I am in. */
/*10*/  __le32  bg_generation;
        __le32  bg_reserved1;
        __le64  bg_next_group;          /* Next group in my list, in blocks */
/*20*/  __le64  bg_parent_dinode;       /* dinode which owns me, in blocks */
        __le64  bg_blkno;               /* Offset on disk, in blocks */
/*30*/  __le64  bg_reserved2[2];
/*40*/  __u8    bg_bitmap[0];
};

BACKGROUND

The global bitmap in OCFS2 splits the entire volume into groups of clusters, also known as Block Groups, or BGs for short. The global bitmap, via ocfs2_chain_list, can point directly to cl_count BGs. Once the total number of BGs exceeds that, new BGs are linked to the existing BGs, creating chains. Hence the term chained block groups. Using this scheme, OCFS2 is able to handle an unlimited number of block groups.

The first block of each BG, other than the first BG, contains the descriptor (ocfs2_group_desc). This descriptor not only points to the next BG (bg_next_group) in the chain but also contains the bitmap (bg_bitmap) for that group. The block# of the descriptor for the first BG is contained in the super block (s_first_cluster_group).
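As an illustration of how a chain is followed through bg_next_group, here is a minimal in-memory sketch. It is not the tunefs.ocfs2 code: the struct is a simplified host-endian stand-in, find_group() is a hypothetical replacement for a disk read, and treating bg_next_group == 0 as the end of a chain is an assumption of this sketch. The totals it computes correspond to what c_total and c_free record for a chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, host-endian stand-in for the ocfs2_group_desc fields used here. */
struct group_desc {
        uint64_t bg_blkno;            /* Offset on disk, in blocks */
        uint64_t bg_next_group;       /* Next group in the chain; 0 = end (assumption) */
        uint16_t bg_bits;             /* Bits represented by this group */
        uint16_t bg_free_bits_count;  /* Free bits in this group */
};

/* Hypothetical lookup standing in for reading a descriptor block from disk. */
static const struct group_desc *find_group(const struct group_desc *groups,
                                           size_t n, uint64_t blkno)
{
        for (size_t i = 0; i < n; i++)
                if (groups[i].bg_blkno == blkno)
                        return &groups[i];
        return NULL;
}

/*
 * Walk one chain starting at c_blkno and total its bits.
 * Returns 0 on success, -1 on a dangling link or a loop.
 */
static int sum_chain(const struct group_desc *groups, size_t n,
                     uint64_t c_blkno, uint32_t *total, uint32_t *free_bits)
{
        uint64_t blkno = c_blkno;
        size_t steps = 0;

        assert(total != NULL && free_bits != NULL);
        *total = 0;
        *free_bits = 0;

        while (blkno != 0) {
                const struct group_desc *gd = find_group(groups, n, blkno);

                /* More hops than groups means the chain loops back on itself. */
                if (gd == NULL || ++steps > n)
                        return -1;
                *total += gd->bg_bits;
                *free_bits += gd->bg_free_bits_count;
                blkno = gd->bg_next_group;
        }
        return 0;
}
```

Called with a chain record's c_blkno, the function visits every group linked from it.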

All BGs, other than the last BG, contain an equal number of clusters (cl_cpg). The number of clusters per group depends on the block size: the larger the block size, the larger the bitmap, and the more clusters it can track. For example, a 4K block can handle 32256 clusters, a 2K block 15872 clusters, and a 1K block only 7680 clusters. The last BG contains the remaining clusters.
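The figures above follow from the descriptor layout: bg_bitmap begins at offset 0x40, so the rest of the block holds the bitmap, one bit per cluster. A small sketch of that arithmetic, assuming one bit per cluster and a bitmap that fills the remainder of the descriptor block (max_bits_per_group is a hypothetical helper name, not an OCFS2 function):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed part of ocfs2_group_desc: bg_bitmap starts at offset 0x40. */
#define GROUP_DESC_HEADER_BYTES 0x40u

/* Hypothetical helper: bits available in the bitmap of one descriptor block. */
static uint32_t max_bits_per_group(uint32_t blocksize)
{
        assert(blocksize > GROUP_DESC_HEADER_BYTES);
        return (blocksize - GROUP_DESC_HEADER_BYTES) * 8u;
}
```

For a 4K block this gives (4096 - 64) * 8 = 32256, matching the number quoted above.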

SCHEME

The scheme uses write ordering instead of journalling. Ordered writes are preferred because that scheme could later be extended to support online resize.

The scheme, in short, is to first extend the last BG before creating new ones and adding them to ocfs2_chain_list in cyclical order starting from one after the last BG.
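One way to read "cyclical order starting from one after the last BG" is a round-robin over the chain slots, beginning with the chain that follows the one holding the last BG. The sketch below assumes exactly that reading, and also assumes that the rotation wraps at cl_count; chain_for_new_group is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: chain index for the i-th newly created group
 * (i = 0 is the first new group), given the chain of the current last BG.
 */
static uint16_t chain_for_new_group(uint16_t last_bg_chain, uint16_t cl_count,
                                    uint32_t i)
{
        assert(cl_count > 0);
        return (uint16_t)(((uint32_t)last_bg_chain + 1u + i) % cl_count);
}
```

With four chains and the last BG in chain 2, successive new groups would land in chains 3, 0, 1, 2, and so on.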

The scheme in detail is as follows:

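        /* With a single chain record in use, raise cl_cpg to the full
         * capacity of the descriptor bitmap (8 bits per byte of bg_size). */
        if (cl->cl_next_free_rec == 1) {
                if (cl->cl_cpg < 8 * gd->bg_size)
                        cl->cl_cpg = 8 * gd->bg_size;
        }

        /* Grow the last BG by the new clusters, capped at the room left in it. */
        gd->bg_bits += cl->cl_bpc *
                        MIN(num_new_clusters,
                                (cl->cl_cpg - (gd->bg_bits/cl->cl_bpc)));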
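The fragment above can be wrapped into a self-contained function for experimentation. This is only a sketch: the structs are reduced, host-endian stand-ins for ocfs2_chain_list and ocfs2_group_desc, extend_last_group is a hypothetical name, and the return value (clusters absorbed by the last BG) is added for convenience. Updating bg_free_bits_count and the chain record totals is not covered here.

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Reduced, host-endian stand-ins holding only the fields used below. */
struct chain_list_sketch {
        uint16_t cl_cpg;            /* Clusters per Block Group */
        uint16_t cl_bpc;            /* Bits per cluster */
        uint16_t cl_next_free_rec;  /* Next unused chain slot */
};

struct group_desc_sketch {
        uint16_t bg_size;           /* Size of included bitmap in bytes */
        uint16_t bg_bits;           /* Bits represented by this group */
};

/* Grow the last group in place; returns the number of clusters it absorbed. */
static uint32_t extend_last_group(struct chain_list_sketch *cl,
                                  struct group_desc_sketch *gd,
                                  uint32_t num_new_clusters)
{
        uint32_t used, room, added;

        assert(cl->cl_bpc > 0);

        /* Single chain in use: let cl_cpg grow to the bitmap's capacity. */
        if (cl->cl_next_free_rec == 1) {
                if (cl->cl_cpg < 8 * gd->bg_size)
                        cl->cl_cpg = (uint16_t)(8 * gd->bg_size);
        }

        used = gd->bg_bits / cl->cl_bpc;   /* clusters already in the group */
        assert(cl->cl_cpg >= used);        /* sketch assumes a sane group */
        room = cl->cl_cpg - used;          /* clusters that still fit */
        added = MIN(num_new_clusters, room);

        gd->bg_bits = (uint16_t)(gd->bg_bits + cl->cl_bpc * added);
        return added;
}
```

Whatever remains of num_new_clusters after this call is what would go into newly created groups.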

FAILURE SCENARIOS

The incompat flag will ensure that the fs will not be mounted and will force the user to fsck the volume.

When fsck detects this incompat flag, it will first clear the resize-related inconsistency before doing its regular checking.

The resize can fail at three points during the write phase.

This is not an issue, as the new descriptors do not come into play until the global bitmap inode is flushed to disk. fsck will just clear the incompat flag.

In this case, the global bitmap will be out-of-sync with the last BG. However, the new BGs are still not visible. fsck will fix the global bitmap inode to be consistent with the last BG and then update num_clusters in the superblock.

fsck will remove all the new BGs that are beyond the end-of-volume as determined by the superblock->num_clusters.


2011-12-23 01:01