
OCFS2/DesignDocs/OnlineResize

OCFS2 FILESYSTEM EXTEND SUPPORT

Tao Ma, September 17 2007

GOALS

The immediate goal is to allow users to extend an OCFS2 volume while some nodes already have it mounted (online growth). Offline resize is described in a separate article: http://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)OfflineResize.html The scheme used for offline resize is also compatible with online growth, so some of the content in this article is copied from the offline resize design document.

USER INTERACTION

The user interaction is the same as for offline resize; the differences will be handled internally. tunefs.ocfs2 will be the front-end tool with which the user performs the filesystem extend operation. The "-S" argument will indicate resize. In the absence of [blocks-count], the tool will grow the volume to the current size of the partition.

# tunefs.ocfs2
tunefs.ocfs2 1.4
usage: tunefs.ocfs2 [-N number-of-node-slots] [-L volume-label]
        [-J journal-options] [-S] [-qvV] device [blocks-count]

The use of blocks-count is compatible with its current use in mkfs.ocfs2.

STRUCTURES

ocfs2_chain_list is part of ocfs2_dinode with OCFS2_BITMAP_FL set.

struct ocfs2_chain_list {
/*00*/  __le16 cl_cpg;                  /* Clusters per Block Group */
        __le16 cl_bpc;                  /* Bits per cluster */
        __le16 cl_count;                /* Total chains in this list */
        __le16 cl_next_free_rec;        /* Next unused chain slot */
        __le64 cl_reserved1;
/*10*/  struct ocfs2_chain_rec cl_recs[0];      /* Chain records */
};

struct ocfs2_chain_rec {
        __le32 c_free;  /* Number of free bits in this chain. */
        __le32 c_total; /* Number of total bits in this chain */
        __le64 c_blkno; /* Physical disk offset (blocks) of 1st group */
};

struct ocfs2_group_desc
{
/*00*/  __u8    bg_signature[8];        /* Signature for validation */
        __le16  bg_size;                /* Size of included bitmap in bytes. */
        __le16  bg_bits;                /* Bits represented by this group. */
        __le16  bg_free_bits_count;     /* Free bits count */
        __le16  bg_chain;               /* What chain I am in. */
/*10*/  __le32  bg_generation;
        __le32  bg_reserved1;
        __le64  bg_next_group;          /* Next group in my list, in blocks */
/*20*/  __le64  bg_parent_dinode;       /* dinode which owns me, in blocks */
        __le64  bg_blkno;               /* Offset on disk, in blocks */
/*30*/  __le64  bg_reserved2[2];
/*40*/  __u8    bg_bitmap[0];
};

BACKGROUND

The global bitmap in OCFS2 splits the entire volume into groups of clusters, also known as Block Groups, or BGs for short. The global bitmap, via ocfs2_chain_list, is equipped to point directly to cl_count BGs. Once the total number of BGs exceeds that, new BGs are linked to the existing BGs, creating chains. Hence the term chained block groups. Using this scheme, OCFS2 is able to handle an unlimited number of block groups.

The first block of each BG, other than the first BG, contains the descriptor (ocfs2_group_desc). This descriptor not only points to the next BG (bg_next_group) in the chain but also contains the bitmap (bg_bitmap) for that group. The block# of the descriptor for the first BG is contained in the super block (s_first_cluster_group).
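The chain layout described above can be sketched as a small consistency check. This is only an illustration under stated assumptions: the demo_ structures are host-endian, in-memory subsets of the on-disk structures, the groups live in a plain array rather than on disk, and a bg_next_group of 0 is assumed to terminate a chain. The helper names demo_find_group and demo_check_chain are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-endian subset of ocfs2_group_desc (illustration only). */
struct demo_group {
    uint64_t bg_blkno;            /* Offset on disk, in blocks */
    uint64_t bg_next_group;       /* Next group in chain; 0 assumed to end it */
    uint16_t bg_bits;             /* Bits represented by this group */
    uint16_t bg_free_bits_count;  /* Free bits count */
    uint16_t bg_chain;            /* What chain this group is in */
};

/* Host-endian copy of ocfs2_chain_rec. */
struct demo_chain_rec {
    uint32_t c_free;   /* Number of free bits in this chain */
    uint32_t c_total;  /* Number of total bits in this chain */
    uint64_t c_blkno;  /* Block number of the first group */
};

/* Hypothetical stand-in for reading a descriptor block from disk. */
static const struct demo_group *
demo_find_group(const struct demo_group *groups, size_t n, uint64_t blkno)
{
    for (size_t i = 0; i < n; i++)
        if (groups[i].bg_blkno == blkno)
            return &groups[i];
    return NULL;
}

/*
 * Follow bg_next_group from the chain head and compare the summed bit
 * counts with the chain record. Returns 0 if consistent, -1 otherwise.
 */
static int demo_check_chain(const struct demo_chain_rec *rec,
                            uint16_t chain_index,
                            const struct demo_group *groups, size_t n)
{
    uint32_t total = 0, free_bits = 0;
    uint64_t blkno = rec->c_blkno;
    size_t steps = 0;

    while (blkno != 0) {
        const struct demo_group *g = demo_find_group(groups, n, blkno);

        /* Reject a dangling link, a group on the wrong chain, or a loop. */
        if (!g || g->bg_chain != chain_index || ++steps > n)
            return -1;
        total += g->bg_bits;
        free_bits += g->bg_free_bits_count;
        blkno = g->bg_next_group;
    }
    return (total == rec->c_total && free_bits == rec->c_free) ? 0 : -1;
}
```

For a chain whose record points at block 100, with group 100 linking to group 200 and group 200 ending the chain, demo_check_chain returns 0 only if c_total and c_free equal the sums over both groups.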

All BGs, other than the last BG, contain an equal number of clusters (cl_cpg). The number of clusters per group depends on the block size: the bitmap fills whatever remains of the descriptor block after the 64-byte (0x40) descriptor header, so the larger the block size, the larger the bitmap and the more clusters it can hold. For example, a 4K block can handle 32256 clusters, a 2K block 15872 clusters, and a 1K block only 7680 clusters. The last BG contains the remaining clusters.
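The figures above follow from the descriptor layout shown in STRUCTURES. The sketch below reproduces the arithmetic; demo_group_desc is a host-endian stand-in for the on-disk header, and the demo_ function names are hypothetical helpers, not part of any OCFS2 library.

```c
#include <stdint.h>

/* Host-endian stand-in for the ocfs2_group_desc header; only sizes matter. */
struct demo_group_desc {
    uint8_t  bg_signature[8];
    uint16_t bg_size;
    uint16_t bg_bits;
    uint16_t bg_free_bits_count;
    uint16_t bg_chain;
    uint32_t bg_generation;
    uint32_t bg_reserved1;
    uint64_t bg_next_group;
    uint64_t bg_parent_dinode;
    uint64_t bg_blkno;
    uint64_t bg_reserved2[2];
    /* bg_bitmap[] would start here, at offset 0x40 */
};

_Static_assert(sizeof(struct demo_group_desc) == 0x40,
               "descriptor header must be 64 bytes");

/* Bits (one per cluster) that fit in the bitmap of one descriptor block. */
static uint32_t demo_clusters_per_group(uint32_t blocksize)
{
    return (blocksize - (uint32_t)sizeof(struct demo_group_desc)) * 8;
}

/* Number of block groups needed to cover total_clusters. */
static uint32_t demo_group_count(uint32_t total_clusters, uint32_t cpg)
{
    return (total_clusters + cpg - 1) / cpg;
}

/* Clusters in the last group: the remainder, or a full group if none. */
static uint32_t demo_last_group_clusters(uint32_t total_clusters, uint32_t cpg)
{
    uint32_t rem = total_clusters % cpg;

    return rem ? rem : cpg;
}
```

With a 4K block size, a volume of 100000 clusters would need four groups under this arithmetic, the last one holding 3232 clusters.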

SCHEME

In order to make the code more generic, operations that are useful for both offline and online resize are abstracted out, so this scheme covers both the offline and online operations.

The scheme in detail is as follows:

SPECIAL CONSIDERATION

Since we may add many new group descriptors, and we want to spread them across all chain records in the global bitmap, how they are written and linked to the file needs careful consideration. In general, the whole process is as follows:

In user space

In ocfs2 kernel
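As an illustration of spreading new group descriptors across the chain records, the sketch below picks a chain for each new group and links the group at the head of that chain. The selection policy (use an unused chain slot while one exists, otherwise the chain with the fewest total bits) is an assumption made for this example and is not taken from the text above. The demo_ structures are simplified host-endian copies with a fixed-size record array, and all function names are hypothetical.

```c
#include <stdint.h>

#define DEMO_MAX_CHAINS 8  /* illustration only; cl_count must not exceed it */

struct demo_chain_rec {
    uint32_t c_free;   /* Number of free bits in this chain */
    uint32_t c_total;  /* Number of total bits in this chain */
    uint64_t c_blkno;  /* Block number of the first group */
};

struct demo_chain_list {
    uint16_t cl_count;          /* Total chains in this list */
    uint16_t cl_next_free_rec;  /* Next unused chain slot */
    struct demo_chain_rec cl_recs[DEMO_MAX_CHAINS];
};

struct demo_new_group {
    uint64_t bg_blkno;
    uint64_t bg_next_group;
    uint16_t bg_bits;
    uint16_t bg_free_bits_count;
    uint16_t bg_chain;
};

/* Assumed policy: an unused slot first, else the chain with fewest bits. */
static uint16_t demo_pick_chain(const struct demo_chain_list *cl)
{
    uint16_t best = 0;

    if (cl->cl_next_free_rec < cl->cl_count)
        return cl->cl_next_free_rec;
    for (uint16_t i = 1; i < cl->cl_count; i++)
        if (cl->cl_recs[i].c_total < cl->cl_recs[best].c_total)
            best = i;
    return best;
}

/* Link the group at the head of the chosen chain and update the totals. */
static void demo_link_group(struct demo_chain_list *cl,
                            struct demo_new_group *g)
{
    uint16_t chain = demo_pick_chain(cl);
    struct demo_chain_rec *rec = &cl->cl_recs[chain];

    if (chain == cl->cl_next_free_rec) {
        /* First group in a previously unused chain slot. */
        rec->c_free = 0;
        rec->c_total = 0;
        rec->c_blkno = 0;
        cl->cl_next_free_rec++;
    }
    g->bg_chain = chain;
    g->bg_next_group = rec->c_blkno;  /* old head, or 0 for a new chain */
    rec->c_blkno = g->bg_blkno;
    rec->c_total += g->bg_bits;
    rec->c_free += g->bg_free_bits_count;
}
```

Linking at the head means only the new descriptor and the chain record need to change; existing descriptors in the chain are left untouched.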

NEW ONLINE LOCK RESOURCE

A new lock resource, the tunefs operation lock, will be added to libo2dlm. Only tunefs.ocfs2 will use this lock, and its purpose is to prevent concurrent tunefs.ocfs2 operations on other nodes. The lock may also be useful for new features such as adding slots online; it is tunefs-specific and used only to synchronize tunefs operations.

Its whole life cycle is:

FAILURE SCENARIOS

The resize can fail in 4 locations during the write phase.


2011-12-23 01:01