OCFS2 UNWRITTEN EXTENTS SUPPORT

Mark Fasheh <mark dot fasheh at oracle dot com>

June 12, 2007

New Terms

Unwritten Extent: An extent allocated to an inode but whose disk space has not yet been initialized. Reads from an unwritten extent will return zeros. Because they have already been allocated, writes to an unwritten extent will not fail because of -ENOSPC. A write also initializes the region with data, so it will no longer be considered unwritten.
Punching Holes: The removal of allocation from an arbitrary inode region. This is different from truncate in that the region removed can be anywhere within the inode space (not just at the tail). Inode size is not necessarily changed.

Why Is This Important

With unwritten extent support an application has the ability to pre-allocate a region of disk without zeroing the actual file data. This can benefit applications which want a large amount of contiguousness within their file data. With unwritten extent support an application can literally ask for many gigabytes of allocation all in one system call. The file system allocator doesn't have to make any guesses as to what the application will do - it just allocates the region as contiguously as possible.

Performance of writes to unwritten regions will cost about the same as filling a hole. Since they do not require allocation during write, contention on bitmaps will be lower. In the worst case writing to an unwritten region will require two tree operations, instead of one for standard hole-filling. The second operation iterates over the same tree region however, so block caching will minimize the time spent reading disk.

The ability to punch holes provides applications the ability to remove unused regions of a file without having to resort to zeroing the space for initialization. Video editing and HSM are two commonly used examples of applications which want this ability.

Structure Review

The ocfs2_extent_record is the sole structure used to describe extents on disk in Ocfs2 today. As part of the sparse file support, the ocfs2_extent_record e_clusters field was turned into a union of two fields.

The first, e_int_clusters is used for interior tree nodes. It describes the entire possible allocation range of the subtree below it.

Leaf nodes use the e_leaf_clusters field in order to make room for an additional field, e_flags. The e_flags field is used to store extent record flags. Today the only existing flag is OCFS2_EXT_UNWRITTEN which marks the extent as unwritten.

   1 /*
   2  * On disk extent record for OCFS2
   3  * It describes a range of clusters on disk.
   4  *
   5  * Length fields are divided into interior and leaf node versions.
   6  * This leaves room for a flags field (OCFS2_EXT_*) in the leaf nodes.
   7  */
   8 struct ocfs2_extent_rec {
   9 /*00*/  __le32 e_cpos;          /* Offset into the file, in clusters */
  10         union {
  11                 __le32 e_int_clusters; /* Clusters covered by all children */
  12                 struct {
  13                         __le16 e_leaf_clusters; /* Clusters covered by this
  14                                                    extent */
  15                         __u8 e_reserved1;
  16                         __u8 e_flags; /* Extent flags */
  17                 };
  18         };
  19         __le64 e_blkno;         /* Physical disk offset, in blocks */
  20 /*10*/
  21 };
  22 
  23 /*
  24  * Extent record flags (e_node.leaf.flags)
  25  */
  26 #define OCFS2_EXT_UNWRITTEN     (0x01)  /* Extent is allocated but
  27                                          * unwritten */
  28

The file system gets a new RO_COMPAT bit which governs the ability to create and modify unwritten extents. The flag depends on OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC. If we don't want to introduce that dependency then the flag must be turned into an INCOMPAT.

/*
 * Unwritten extents support.
 */
#define OCFS2_FEATURE_RO_COMPAT_UNWRITTEN       0x0001

An ioctl(2) based interface is provided to userspace. It has been designed to be compatible with the existing XFS interface.

   1 /*
   2  * Space reservation / allocation / free ioctls and argument structure
   3  * are designed to be compatible with XFS.
   4  */
   5 struct ocfs2_space_resv {
   6         __s16           l_type;
   7         __s16           l_whence;
   8         __s64           l_start;
   9         __s64           l_len;          /* len == 0 means until end of file */
  10         __s32           l_sysid;
  11         __u32           l_pid;
  12         __s32           l_pad[4];       /* reserve area                     */
  13 };
  14 
  15 #define OCFS2_IOC_ALLOCSP               _IOW ('X', 10, struct ocfs2_space_resv)
  16 #define OCFS2_IOC_FREESP                _IOW ('X', 11, struct ocfs2_space_resv)
  17 #define OCFS2_IOC_RESVSP                _IOW ('X', 40, struct ocfs2_space_resv)
  18 #define OCFS2_IOC_UNRESVSP      _IOW ('X', 41, struct ocfs2_space_resv)
  19 #define OCFS2_IOC_ALLOCSP64     _IOW ('X', 36, struct ocfs2_space_resv)
  20 #define OCFS2_IOC_FREESP64      _IOW ('X', 37, struct ocfs2_space_resv)
  21 #define OCFS2_IOC_RESVSP64      _IOW ('X', 42, struct ocfs2_space_resv)
  22 #define OCFS2_IOC_UNRESVSP64    _IOW ('X', 43, struct ocfs2_space_resv)

posix_fallocate(2) is the preferred method of interface to user. The patches implementing it have not yet made their way to the mainline kernel. If they make it before ocfs2 support is complete, the ioctl(2) interface will be removed. This code was taken from 2.6.22-rc4-mm2

   1 /*
   2  * sys_fallocate - preallocate blocks or free preallocated blocks
   3  * @fd: the file descriptor
   4  * @mode: mode specifies if fallocate should preallocate blocks OR free
   5  *        (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
   6  *        FA_DEALLOCATE modes are supported.
   7  * @offset: The offset within file, from where (un)allocation is being
   8  *          requested. It should not have a negative value.
   9  * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
  10  *
  11  * This system call, depending on the mode, preallocates or unallocates blocks
  12  * for a file. The range of blocks depends on the value of offset and len
  13  * arguments provided by the user/application. For FA_ALLOCATE mode, if this
  14  * system call succeeds, subsequent writes to the file in the given range
  15  * (specified by offset & len) should not fail - even if the file system
  16  * later becomes full. Hence the preallocation done is persistent (valid
  17  * even after reopen of the file and remount/reboot).
  18  *
  19  * It is expected that the ->fallocate() inode operation implemented by the
  20  * individual file systems will update the file size and/or ctime/mtime
  21  * depending on the mode and also on the success of the operation.
  22  *
  23  * Note: Incase the file system does not support preallocation,
  24  * posix_fallocate() should fall back to the library implementation (i.e.
  25  * allocating zero-filled new blocks to the file).
  26  *
  27  * Return Values
  28  *      0       : On SUCCESS a value of zero is returned.
  29  *      error   : On Failure, an error code will be returned.
  30  * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate()
  31  * fall back on library implementation of fallocate.
  32  *
  33  * <TBD> Generic fallocate to be added for file systems that do not
  34  *       support fallocate it.
  35  */
  36 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)

What Works Today

Today an ocfs2 file system encountering an unwritten extent will return zeros upon reading it. That functionality was built into the sparse allocation changes. This gives us a degree of backwards compatibility with the flag - it can safely be an RO_COMPAT as read-only mounts will return valid data to the user. Since read support for unwritten extents was introduced with sparse file support, the unwritten extents flag should never appear without sparse allocation turned on.

Syscall Interface Details

mode: Either FA_ALLOCATE for allocation of unwritten extents, or FA_DEALLOCATE to punch a hole.
offset: Start of region, from beginning of file.
len: length of region

Neither offset nor len are allowed to be less than zero. Mtime and ctime are always updated. If the requested change is FA_ALLOCATE and the new region extends past the existing i_size, then a new i_size will be set.

Ioctl Interface Details

The ioctl fields are treated very similarly to the arguments to lseek(2):

l_whence: SEEK_SET, SEEK_CUR, or SEEK_END
l_start: Where to start, relative to l_whence.
l_len: Length in bytes to allocate / deallocate. A len of zero 0 means until end of file.

All other ioctl fields are currently ignored.

Commands can be one of two types:

OCFS2_IOC_RESVSP, OCFS2_IOC_UNRESVSP: Change the file allocation (reserve, unreserve) without changing i_size.
OCFS2_IOC_ALLOCSP, OCFS2_IOC_FREESP: Change the file allocation, updating i_size if the original i_size is contained within the region passed. OCFS2_IOC_ALLOCSP doesn't use unwritten extents, but instead writes actual zeros to disk (much like ocfs2 ftruncate() on non-sparse file systems). Support for this pair of ioctls can (and should) safely be dropped as they are not really used by applications any more.

Ctime and mtime are always updated.

Internal Interface

ocfs2_allocate_unwritten_extents is the lowest level interface to allocation of unwritten extents, called from the higher level ioctl/syscall handlers. It takes a byte range in the start, and len arguments and iterates over the inode tree, allocating unwritten extents to fill all holes found. The current extent allocation routines are used with little to no modification. Any existing extents are skipped, regardless of whether they are marked unwritten or not. Inode meta data not directly related to extent allocation, such as i_size, i_mtime, etc should be modified by the caller.

   1 static int ocfs2_allocate_unwritten_extents(struct inode *inode,
   2                                             u64 start, u64 len);

ocfs2_mark_extent_written is called by the Ocfs2 write path (via buffered file writes, writeable mmap, etc) to remove the OCFS2_EXT_UNWRITTEN flag from an existing portion of an unwritten extent. The write code is responsible for zeroing those parts of the extent clusters which the user data will not be written to (in a manner similar to hole filling).

The extent passed in must be entirely contained within an existing ocfs2_extent_rec - it is not allowed to straddle two records. The extent passed in is allowed to cover less than all of the existing ocfs2_extent_rec.

If the extent passed in covers the entire existing extent and the new ocfs2_extent_rec is contiguous with the ocfs2_extent_rec to it's left or right (or both), it will be merged with them. This will result in the deletion of one or two ocfs2_extent_rec, depending on how many sides of the new extent are contiguous.

If the extent passed in lies on the left or right edge of the existing extent and is contiguous with the ocfs2_extent_rec to that side of the existing extent, then that portion of the existing extent will be merged with the ocfs2_extent_rec to it's left or right.

merge operations are handled by ocfs2_try_to_merge_extent as described below.

If no parts of the region passed in are contiguous with an adjacent ocfs2_extent_rec, then the code will split the existing ocfs2_extent_rec. A single split occurs if the region passed in lies on one edge of the existing region. A double split happens if the region passed in does not lie on either edge, i.e., it is in the middle of the existing ocfs2_extent_rec. Internally, a double split is treated as two single split operations. Splitting extents requires the allocation of at least one more ocfs2_extent_rec, two in the case of a double split.

split operations are handled by ocfs2_split_and_insert as described below.

   1 int ocfs2_mark_extent_written(struct inode *inode, struct buffer_head *di_bh,
   2                               handle_t *handle, u32 cpos, u32 len, u32 phys,
   3                               struct ocfs2_alloc_context *meta_ac);
   4

TODO: High level description of what is done to merge extents

   1 static int ocfs2_try_to_merge_extent(struct inode *inode,
   2                                      handle_t *handle,
   3                                      struct ocfs2_path *left_path,
   4                                      int split_index,
   5                                      struct ocfs2_extent_rec *split_rec,
   6                                      struct ocfs2_alloc_context *meta_ac,
   7                                      struct ocfs2_merge_ctxt *ctxt);

ocfs2_split_and_insert() is responsible for splitting a region from an existing ocfs2_extent_rec such that the region can have the OCFS2_EXT_UNWRITTEN flag removed.

ocfs2_split_and_insert passes an ocfs2_insert_type to ocfs2_do_insert_extent for the majority of its work. An enum is added to the ocfs2_insert_type structure which takes one of three values SPLIT_NONE, SPLIT_LEFT or SPLIT_RIGHT. This is used to complete right rotations at the correct time and during the final insert where a different path is taken than a traditional new record insert.

SPLIT_NONE: No split to be done. This is used unconditionally by the traditional insert code.
SPLIT_LEFT: The record passed exists on the left edge of an existing record.
SPLIT_RIGHT: The record passed exists on the right edge of an existing record.

ocfs2_split_and_insert will grow the allocation tree if there is no more room for extent records. Once that is done, the type of split to be done is determined. If the record to be extracted is not on an edge, then the code takes two passes. The first extracts the portion of the existing record between the right edge of the split record and the right edge of the existing record. This is then passed through to ocfs2_split_and_insert as a right split. After the operation the split record will be on the right edge, causing another right split.

   1 static int ocfs2_split_and_insert(struct inode *inode,
   2                                   handle_t *handle,
   3                                   struct ocfs2_path *path,
   4                                   struct buffer_head *di_bh,
   5                                   struct buffer_head **last_eb_bh,
   6                                   int split_index,
   7                                   struct ocfs2_extent_rec *orig_split_rec,
   8                                   struct ocfs2_alloc_context *meta_ac,
   9                                   struct ocfs2_merge_ctxt *ctxt);

Left Rotations

TODO: Explain ocfs2_rotate_tree_left(), ocfs2_compress_tree()

Punching Holes

TODO: Describe this

Future Optimizations

Merge extent records between tree leaves
Turn a tree with only one branch and fewer records than fit into the dinode back into a level zero extents-in-inode block

OCFS2/DesignDocs/UnwrittenExtents