OCFS2 FILESYSTEM EXTEND SUPPORT
Tao Ma, September 17 2007
GOALS
The immediate goal is to allow users to extend an OCFS2 volume while some nodes have it mounted (online growth). We have already described how to do an offline resize in this article: http://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)OfflineResize.html. I am very glad that the scheme used during offline resize is also compatible with online growth, so some of the content in this article is copied from the offline resize design document.
USER INTERACTION
It is the same as for offline resize; the differences are handled internally. tunefs.ocfs2 will be the front-end tool via which the user performs the filesystem extend operation. The "-S" argument indicates resize. In the absence of [blocks-count], the tool will grow the volume to the current size of the partition.
# tunefs.ocfs2
tunefs.ocfs2 1.4
usage: tunefs.ocfs2 [-N number-of-node-slots] [-L volume-label]
                    [-J journal-options] [-S] [-qvV] device [blocks-count]
The use of blocks-count is compatible with its current use in mkfs.ocfs2.
STRUCTURES
ocfs2_chain_list is part of ocfs2_dinode with OCFS2_BITMAP_FL set.
struct ocfs2_chain_list {
/*00*/	__le16 cl_cpg;			/* Clusters per Block Group */
	__le16 cl_bpc;			/* Bits per cluster */
	__le16 cl_count;		/* Total chains in this list */
	__le16 cl_next_free_rec;	/* Next unused chain slot */
	__le64 cl_reserved1;
/*10*/	struct ocfs2_chain_rec cl_recs[0];	/* Chain records */
};

struct ocfs2_chain_rec {
	__le32 c_free;	/* Number of free bits in this chain. */
	__le32 c_total;	/* Number of total bits in this chain */
	__le64 c_blkno;	/* Physical disk offset (blocks) of 1st group */
};

struct ocfs2_group_desc {
/*00*/	__u8  bg_signature[8];		/* Signature for validation */
	__le16 bg_size;			/* Size of included bitmap in bytes. */
	__le16 bg_bits;			/* Bits represented by this group. */
	__le16 bg_free_bits_count;	/* Free bits count */
	__le16 bg_chain;		/* What chain I am in. */
/*10*/	__le32 bg_generation;
	__le32 bg_reserved1;
	__le64 bg_next_group;		/* Next group in my list, in blocks */
/*20*/	__le64 bg_parent_dinode;	/* dinode which owns me, in blocks */
	__le64 bg_blkno;		/* Offset on disk, in blocks */
/*30*/	__le64 bg_reserved2[2];
/*40*/	__u8  bg_bitmap[0];
};
BACKGROUND
The global bitmap in OCFS2 splits the entire volume into groups of clusters, aka Block Groups or BGs for short. The global bitmap, via ocfs2_chain_list, is equipped to directly point to cl_count BGs. Once the total number of BGs exceeds that, the new BGs are linked to the existing BGs, creating chains; hence the term chained block groups. Using this scheme, OCFS2 is able to handle an unlimited number of block groups.
The first block of each BG, other than the first BG, contains the descriptor (ocfs2_group_desc). This descriptor not only points to the next BG (bg_next_group) in the chain but also contains the bitmap (bg_bitmap) for that group. The block# of the descriptor for the first BG is contained in the super block (s_first_cluster_group).
All BGs, other than the last BG, contain an equal number of clusters (cl_cpg). The number of clusters per group depends on the block size: the larger the block size, the larger the bitmap, and the more clusters it can hold. For example, a 4K block can handle 32256 clusters, a 2K block 15872 clusters, while a 1K block handles only 7680 clusters. The last BG contains the remaining clusters.
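These figures follow directly from the descriptor layout: the bitmap fills whatever is left of the descriptor block after the fixed ocfs2_group_desc header, which ends at offset 0x40 (see STRUCTURES above). A minimal sketch of the arithmetic:

#include <stdio.h>

/* The fixed part of ocfs2_group_desc ends at offset 0x40 (64 bytes). */
#define OCFS2_GROUP_DESC_HEADER_SIZE 64

/* Clusters (bits) that one group descriptor block can track. */
static unsigned int bits_per_group(unsigned int blocksize)
{
	return (blocksize - OCFS2_GROUP_DESC_HEADER_SIZE) * 8;
}

int main(void)
{
	unsigned int sizes[] = { 1024, 2048, 4096 };
	int i;

	for (i = 0; i < 3; i++)
		printf("%u byte block -> %u clusters per group\n",
		       sizes[i], bits_per_group(sizes[i]));
	return 0;	/* prints 7680, 15872 and 32256 */
}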
SCHEME
In order to make the code more generic, the operations that are useful for both offline and online resize have been abstracted, so this scheme covers both the offline and the online operations.
The scheme in details is as follows:
- For simplicity, disallow combining a volume resize with adding slots or resizing the journal.
- Ensure the user is growing the volume by at least 1 cluster. If not, abort. We can get the current cluster count by reading the global bitmap inode.
- Check whether the OCFS2 volume is mounted or not.
- Validate the global bitmap.
- For offline resize, validate the global bitmap to ensure the global bitmap inode and the chained block groups are all consistent. If this fails, abort.
For online resize, mount the volume at /tmp/<timestamp>; if the mount succeeds, we take it for granted that the global bitmap is OK. The mount point is named after the current timestamp. If creating it fails with EEXIST, try again after a second; we only try 3 times and return an error if all attempts fail.
- Calculate whether the added blocks will require new group descriptors.
- For online resize, there is a possibility that another node may start an online resize after we begin our operation and damage what we have already done. Take the tunefs-specific lock to block out other nodes from doing concurrent resizes.
- Write the new group descriptors, if needed; this is done by tunefs.ocfs2. If this fails, abort.
- For offline resize, write the incompat flag OCFS2_FEATURE_INCOMPAT_RESIZE_INPROG in the superblock, then calculate and write the global bitmap and the super block. This is the normal offline resize work; please refer to the original offline resize design doc for more details.
- For online resize, it is a little more complicated: we will add an ioctl on the mount point and let the kernel do the work (a user-space sketch of this call follows the list). The kernel does the following when it gets the command:
- Lock i_mutex and the metadata of the global bitmap inode with an EX lock.
- Start the transaction.
- Modify the last group descriptor and make it dirty (the main mechanism is the same as offline resize).
- Modify the global bitmap inode and make it dirty (the main mechanism is the same as offline resize).
- Commit the transaction.
- Unlock the data and metadata locks of the global bitmap.
- Write out the super block.
- Write out the backup superblock. This is done by tunefs.ocfs2.
- Unlock the tunefs-specific lock.
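Here is the rough user-space sketch of the ioctl step mentioned above: tunefs.ocfs2 opens the temporary mount point and asks the kernel to grow the bitmap. The ioctl name OCFS2_IOC_GROUP_EXTEND, its encoding and its argument (the number of clusters to add) are illustrative assumptions; the real interface will be fixed when the kernel side is implemented.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Hypothetical ioctl: ask the kernel to add 'clusters' bits to the last group. */
#define OCFS2_IOC_GROUP_EXTEND	_IOW('o', 1, int)

static int online_extend(const char *mount_point, int new_clusters)
{
	int fd, ret;

	/* The command is issued against the temporary mount point. */
	fd = open(mount_point, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}

	ret = ioctl(fd, OCFS2_IOC_GROUP_EXTEND, &new_clusters);
	if (ret < 0)
		perror("ioctl");

	close(fd);
	return ret;
}

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <mount-point> <clusters>\n", argv[0]);
		return 1;
	}
	return online_extend(argv[1], atoi(argv[2])) ? 1 : 0;
}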
SPECIAL CONSIDERATION
Since we may add many new group descriptors and we want to spread them across all the chain records of the global bitmap, how they are written and linked into the file needs some consideration. In general, the whole process is as follows:
In user space
- All the new group descriptors are initialized; the extra blocks in the same cluster are emptied.
- They are linked in as the new first group descriptor of the specified chain record; that is, the current first group descriptor of that chain in the global bitmap is linked beneath the new group descriptor. If there is more than one new group descriptor for a chain record, they are linked together and the group descriptor with the largest block offset becomes the head.
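A sketch of that initialization step, using the ocfs2_group_desc layout from the STRUCTURES section (the "GROUP01" signature string, the helper name and its arguments are illustrative; tunefs.ocfs2 has its own infrastructure for this):

#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <endian.h>	/* htole16()/htole64(), glibc */

/* Same layout as the on-disk ocfs2_group_desc shown in STRUCTURES. */
struct ocfs2_group_desc {
	uint8_t  bg_signature[8];
	uint16_t bg_size;
	uint16_t bg_bits;
	uint16_t bg_free_bits_count;
	uint16_t bg_chain;
	uint32_t bg_generation;
	uint32_t bg_reserved1;
	uint64_t bg_next_group;
	uint64_t bg_parent_dinode;
	uint64_t bg_blkno;
	uint64_t bg_reserved2[2];
	uint8_t  bg_bitmap[0];
};

/*
 * Illustrative helper: fill one descriptor block for a new group and
 * chain the current head of the chain record beneath it.
 */
static void init_new_group_desc(struct ocfs2_group_desc *gd,
				unsigned int blocksize,
				uint64_t gd_blkno,	/* block this gd will live at */
				uint64_t old_head,	/* current first gd of the chain */
				uint64_t bitmap_blkno,	/* global bitmap inode block */
				uint16_t chain,		/* chain record index */
				uint16_t bits)		/* clusters in this group */
{
	memset(gd, 0, blocksize);
	memcpy(gd->bg_signature, "GROUP01", 8);		/* illustrative signature */
	gd->bg_size = htole16(blocksize - offsetof(struct ocfs2_group_desc, bg_bitmap));
	gd->bg_bits = htole16(bits);
	gd->bg_free_bits_count = htole16(bits);	/* the cluster holding this gd must
						   still be marked used in bg_bitmap */
	gd->bg_chain = htole16(chain);
	gd->bg_next_group = htole64(old_head);	/* old head hangs below the new gd */
	gd->bg_parent_dinode = htole64(bitmap_blkno);
	gd->bg_blkno = htole64(gd_blkno);
}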
In ocfs2 kernel
- The last group descriptor is updated.
- The chain record is updated: the new group descriptor becomes the head and the record information (c_blkno, c_free, c_total) is updated.
- The global information in the inode is updated.
- The last group descriptor and the inode are pushed into the journal for the update.
NEW ONLINE LOCK RESOURCE
A new lock resource, named the tunefs operation lock, will be added in libo2dlm. Only tunefs.ocfs2 will use this lock, to prevent concurrent tunefs.ocfs2 operations on other nodes. This lock may also be useful for future features such as adding slots online, so it is tunefs-specific and is only used to synchronize tunefs operations.
Its whole life cycle is as follows:
- When a node wants to do an online resize, it takes an EX lock. If we fail to get the lock, it means another node is trying to do an online resize, so abort the process.
- If we get the lock successfully, do the real resize work in the kernel.
- Free the lock after the online resize.
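A minimal sketch of this life cycle through dlmfs (the /dlm mount point, the domain directory and the lock file name "tunefs-online-resize" are assumptions for illustration; the real code will go through libo2dlm):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative only: in dlmfs, a lock is a file under the domain
 * directory; opening it O_RDWR requests an EX lock, O_NONBLOCK makes
 * it a trylock, and close(2) drops the lock.
 */
static int tunefs_trylock_ex(const char *domain)
{
	char path[256];

	/* Hypothetical lock file name under the dlmfs mount point. */
	snprintf(path, sizeof(path), "/dlm/%s/tunefs-online-resize", domain);
	return open(path, O_CREAT | O_RDWR | O_NONBLOCK, 0600);
}

int main(void)
{
	int fd = tunefs_trylock_ex("my-ocfs2-domain");	/* illustrative domain */

	if (fd < 0) {
		/* Another node is already doing a tunefs operation: abort. */
		perror("tunefs lock");
		return 1;
	}

	/* ... hand the real resize work to the kernel here ... */

	close(fd);	/* dlmfs frees the EX lock on close(2) */
	return 0;
}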
FAILURE SCENARIOS
The resize can fail at the following points during the write phase.
- For online resize, the program panics after we take the tunefs-specific lock. No problem: if the program aborts, dlmfs will free the lock when close(2) is invoked.
- The work in user space fails while we initialize the new group descriptors. No problem: since we do not touch the existing ocfs2 volume, nothing needs to be done.
- The work in the ocfs2 kernel is interrupted; JBD will handle any resulting corruption.
- tunefs.ocfs2 crashes after we have successfully updated the global bitmap, which means the kernel has done its work successfully. No problem: fsck.ocfs2 will find the "PR_SUPERBLOCK_CLUSTERS" error and fix the problem.
- We update the super block but fail to update the backup super block. No problem: the only difference is in i_clusters of the super block inode, so even if we recover the super block from the backup, fsck.ocfs2 will find the "PR_SUPERBLOCK_CLUSTERS" error and fix the problem.