NEW OCFS2 FEATURES

What is this?

The following is a list of interesting features which have been added to Ocfs2 since it was accepted into the mainline Linux kernel. The list isn't meant to be comprehensive - commit logs have all the gory details. This is more intended to provide a higher level view of the more important or interesting changes.

What is this NOT?

To be clear, this is not a release announcement and should not be construed as one. The final list of features which will be available to customers via backported modules or vendor distributions can vary, though obviously we hope to include as much of this as possible. Please keep in mind that backporting file system features six or more kernel versions is a very large task and that some core kernel code required for some file system features may not exist in some enterprise kernel releases.

How do I get these features?

The best way to get access to all these features is to run the latest mainline Linux kernel. Additionally, the Ocfs2 team at Oracle makes regular releases of the Ocfs2 1.4 module, which includes many of these features. The 1.4 modules work with many existing enterprise distributions. SLES users can look to SLES10 SP2, or the next major version, SLES11, for most of the features listed below.

If you're running a recent mainline kernel, you can use the 1.4.x series of ocfs2-tools to create and convert file systems to the new feature set. Most features, however, don't require any disk changes. If a feature requires disk changes, the appropriate mkfs.ocfs2 flag is noted below. Additionally, tunefs.ocfs2 can convert a file system to most features, and back.

In 1.4.x releases, mkfs.ocfs2 defaults to a set of file system feature flags which are intended to work on most recent mainline kernel releases. Typically, we define recent as within the last two versions. That is, if the current kernel version is 2.6.23, the mkfs.ocfs2 defaults would turn on any disk features which were included in 2.6.23 and 2.6.22. File system feature sets can always be fine-tuned, however, so if you're running an older kernel it's very easy to build file systems it will understand.

To build an Ocfs2 file system with disk features understood by most recent kernels:

$ mkfs.ocfs2 --fs-feature-level=default device

To build an Ocfs2 file system with disk features understood on all kernels and all versions of Ocfs2 ever released:

$ mkfs.ocfs2 --fs-feature-level=max-compat device

To build an Ocfs2 file system with disk features which will work on only the latest kernel releases:

$ mkfs.ocfs2 --fs-feature-level=max-features device

mkfs.ocfs2 can also fine-tune feature flags via the --fs-features= option. Please consult the mkfs.ocfs2 man page for more details.

Can I add new disk features to an existing file system?

Disk features can be turned on and off via tunefs.ocfs2. Typically, most features can be turned on and off via the --fs-features= switch. Consult the tunefs.ocfs2 documentation for details. For example, to turn on sparse files and unwritten extents support:

$ tunefs.ocfs2 --fs-features=sparse,unwritten device

Turning off a feature only requires that you prepend 'no' to the feature name. For example, to turn off unwritten extents support:

$ tunefs.ocfs2 --fs-features=nounwritten device

Great, how about some specifics?

The following list is ordered by kernel version. As new features get added, the Ocfs2 team will update this document.

Ordered Mode Journaling

Required Kernel: Linux 2.6.16
Required Tools: Any

When mounted with data=ordered (the new default journaling mode), Ocfs2 will flush file data to disk before committing its metadata. This flushing ensures that data written to newly allocated regions will be there after a file system crash, at the expense of some performance. Users can go back to data=writeback mode if they'd rather have the performance at the expense of some small amount of data integrity. Meta data integrity is always preserved in all journaling modes.

File Attribute Support

Required Kernel: Linux 2.6.19
Required Tools: Any

This allows a user to use the chattr command to set and clear Ext2 style file attributes such as the immutable bit. lsattr can be used to view the current set of attributes.

Directory Readahead

Required Kernel: Linux 2.6.19
Required Tools: Any

Enhances the performance of many directory operations by asynchronously reading blocks which may get accessed in the future.

Performance Enhancement - stat(2)

Required Kernel: Various, starting with Linux 2.6.19
Required Tools: Any

We managed to improve our cold cache stat(2) times by cutting the required amount of disk I/O by 50%.

Performance Enhancement - unlink(2)

Required Kernel: Linux 2.6.19
Required Tools: Any

Ocfs2 1.2 requires a broadcast message in order to unlink(2) or rename(2) a file. We replaced the broadcast messaging with a DLM lock which covers directory entries. This allows for faster unlink(2) times due to lower messaging overhead.

Splice Support

Required Kernel: Linux 2.6.20
Required Tools: Any

Provide support for the splice(2) system call. Splice allows for efficient copies between file descriptors by moving the data in kernel. For example, typically a program copying data between two files would have to read blocks from the source file into a local buffer, then write that buffer to the target file. Splice speeds up that process by moving the data between two file descriptors in one system call.

Local Mounts

Required Kernel: Linux 2.6.20
Required Tools: Ocfs2 Tools 1.2.5
mkfs.ocfs2 Command: -M local

This allows the user to mark an Ocfs2 file system as local. Local mounts skip all cluster code and act like a single node file system (for example, ext3). Tunefs.ocfs2 can be used to switch a file system from local to clustered mode. This allows customers to use Ocfs2 for non-clustered use, but with the option of clustering the file system at a later time.

Atime/Mtime Updates

Required Kernel: Various, Linux 2.6.20 has the most complete set of time-related features and fixes.
Required Tools: Any

This feature was requested often.

Mtime is always updated on buffered writes now. The only exception is O_DIRECT, where avoiding a meta data update lets us support multiple streaming O_DIRECT writers.

Atime updates now happen consistently and are propagated throughout the cluster. Since atime can have a negative performance impact, Ocfs2 is flexible in how it handles atime updates. How atime is updated can be tuned via mount options:

atime_quantum=NRSECS: Defaults to 60 seconds. Ocfs2 will not update atime unless this number of seconds has passed since the last update. Set to zero to always update atime.
noatime: This standard mount option turns off atime updates completely.
relatime: Another standard mount option (added in Linux v2.6.20) which Ocfs2 supports. Relative atime only updates the atime if the previous atime is older than the mtime or ctime. This is useful for applications that only need to know that a file has been read since it was last modified.

Additionally, all time updates in Ocfs2 have nanosecond resolution.
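
The atime mount options described above can be summarized as a small decision function. This is only a sketch of the rules as stated in the text, not the kernel's code. The function name and parameters are made up for illustration, and checking relatime before atime_quantum is an assumption, since the text doesn't say how the two combine.

```python
def should_update_atime(atime, mtime, ctime, now, *, noatime=False,
                        relatime=False, atime_quantum=60):
    """Return True if a read at time `now` should update atime.

    All times are seconds since the epoch. Hypothetical helper that
    mirrors the prose description of the mount options only.
    """
    if noatime:
        return False                       # noatime: never update
    if relatime:
        # relatime: update only if atime is older than mtime or ctime
        # (assumption: relatime takes precedence over the quantum)
        return atime < mtime or atime < ctime
    if atime_quantum == 0:
        return True                        # a quantum of zero always updates
    return now - atime >= atime_quantum    # otherwise wait out the quantum


if __name__ == "__main__":
    # A file last read 20 seconds ago, with the default 60 second quantum.
    print(should_update_atime(atime=100, mtime=50, ctime=50, now=120))
```

With the defaults, a file read twice within a minute only has its atime updated once, which is where the saving in cluster traffic comes from.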

Sparse File Support

Required Kernel: Linux 2.6.22
Required Tools: Ocfs2 Tools 1.3.9
mkfs.ocfs2 Command: --fs-features=sparse

Ocfs2 1.2 didn't support the ability to have holes in files. This meant that simple ftruncate(2) operations had to allocate data and fill that space by writing zeros to disk. If the sparse files bit is set on the Ocfs2 super block, none of this becomes necessary. Since sparse file support required us to code more flexible btree operations, it paved the way for other, more advanced file system features such as Unwritten Extents which are described in the next section.
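
To see what a hole looks like from userspace, the following sketch extends an empty file with ftruncate(2) and compares its apparent size with the space actually allocated. The helper name and file path are made up for illustration. The allocated figure depends on the file system the file lives on, so no particular value is promised here.

```python
import os
import tempfile


def make_sparse_file(path, size):
    """Extend an empty file to `size` bytes without writing any data."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, size)          # sets the length; writes no data
        st = os.fstat(fd)
    finally:
        os.close(fd)
    # st_blocks counts 512-byte units actually allocated on disk
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, allocated = make_sparse_file(os.path.join(d, "sparse.img"),
                                           64 * 1024 * 1024)
        print(f"apparent size: {size} bytes, allocated: {allocated} bytes")
```

On a file system with sparse file support the allocated figure should stay far below the apparent size. Without it, the file system has to allocate and zero the whole range.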

Flexible Allocation API

Required Kernel: Linux 2.6.23
Required Tools: Ocfs2 Tools 1.3.9
mkfs.ocfs2 Command: --fs-features=unwritten

Aside from sparse files, Ocfs2 now supports some more advanced features which are intended to allow users more control over inode btree allocation. Software can access these features via an ioctl(2), or fallocate(2) on later kernels.

Unwritten Extents: If the unwritten extents bit is set on the Ocfs2 super block, an application can request that a range of clusters be pre-allocated within a file. Ocfs2 will mark those extents with a special flag so that expensive data zeroing doesn't have to be performed. Reads and writes to a pre-allocated region act as reads and writes to a hole, except a write will not fail due to lack of data allocation. Pre-allocation via unwritten extents also has the advantage that the file system is given the entire useful btree range up front, instead of piecemeal as would happen when filling holes with multiple write(2) calls. This means that Ocfs2 can do a much better job at optimizing data layout. Also, later writes are much less likely to incur cluster locking overhead against data allocators.
Punching Holes: If the sparse files bit is set on the Ocfs2 super block, an application can request that allocation be removed from arbitrary regions within an inode btree. Essentially, this creates holes. If part of a file is no longer useful, this could be a more efficient method of removing it than having the application manually zero the data.
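
Pre-allocation as described under Unwritten Extents can be tried from userspace roughly as follows. This sketch uses Python's os.posix_fallocate, which wraps posix_fallocate(3). With glibc on Linux that call normally uses fallocate(2) and may fall back to writing zeros on file systems without support, so it does not prove that unwritten extents are in use. The helper name is made up for illustration.

```python
import os
import tempfile


def preallocate(path, length):
    """Reserve `length` bytes for a new file and read back its first bytes."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        os.posix_fallocate(fd, 0, length)   # reserve space from offset 0
        st = os.fstat(fd)
        head = os.pread(fd, 16, 0)          # pre-allocated space reads as zeros
    finally:
        os.close(fd)
    return st.st_size, head


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, head = preallocate(os.path.join(d, "prealloc.dat"), 1024 * 1024)
        print(size, head)
```

Reserving the whole range in one call gives the file system the full extent up front, which is the layout advantage the text describes, rather than growing the file one write(2) at a time.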

Shared Writeable MMAP

Required Kernel: Linux 2.6.23
Required Tools: Any

Another feature that was very often requested. Shared writeable memory mappings are fully supported now on Ocfs2.
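
A minimal sketch of a shared writeable mapping, using Python's mmap module. The helper name is made up for illustration, and the example runs on a single node only, so it says nothing about how Ocfs2 keeps the mapping coherent across the cluster.

```python
import mmap
import os
import tempfile


def write_through_mapping(path, payload):
    """Store `payload` via a MAP_SHARED mapping, then read it back with pread."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, mmap.PAGESIZE)      # the file must cover the mapping
        with mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            m[:len(payload)] = payload       # a plain memory store
            m.flush()                        # msync(2) the dirty page
        return os.pread(fd, len(payload), 0)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_through_mapping(os.path.join(d, "shared.dat"), b"hello"))
```

The payload has to fit in one page here, since the sketch maps exactly mmap.PAGESIZE bytes.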

Data in Inode

Required Kernel: Linux 2.6.24
Required Tools: Ocfs2 Tools 1.4.2
mkfs.ocfs2 Command: --fs-features=inline-data

This saves on space by storing file and directory data directly inside an inode block. Data is transparently moved out to an extent when it no longer fits inside the inode block. In some cases, this can also make a positive impact on cold-cache directory and file operations.

Online Resize

Required Kernel: Linux 2.6.25
Required Tools: Ocfs2 Tools 1.3.9-0.1
tunefs.ocfs2 Command: -S [blocks-count]

Tunefs.ocfs2 can now instruct a running file system to resize itself. Only online volume expansion is supported at this time. The new volume size is reflected in the file system meta data where other nodes will pick it up.

Cluster aware flock(2)

Required Kernel: Linux 2.6.26
Required Tools: Any

The flock(2) system call is now cluster aware. File locks taken on one node from userspace will interact with those taken on other nodes. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received by the user. POSIX file locks, also known as lockf(3) or fcntl(2) locks, are not covered by this change; cluster aware support for them arrived later, in Linux 2.6.28 (described below).
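
The flock(2) semantics involved can be sketched on a single machine. Here two separate opens of the same file stand in for two nodes: the second non-blocking exclusive request fails until the first lock is dropped. The function names are made up for illustration. On a clustered Ocfs2 mount the same calls would conflict across nodes, which this local sketch cannot show.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(fd):
    """Attempt a non-blocking exclusive flock; report whether it was granted."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def flock_demo(path):
    first = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    second = os.open(path, os.O_RDWR)          # independent open of the file
    try:
        got_first = try_exclusive_lock(first)
        got_second = try_exclusive_lock(second)    # conflicts with `first`
        fcntl.flock(first, fcntl.LOCK_UN)          # release the first lock
        got_second_later = try_exclusive_lock(second)
        return got_first, got_second, got_second_later
    finally:
        os.close(second)
        os.close(first)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(flock_demo(os.path.join(d, "lockfile")))
```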

Userspace Clustering

Required Kernel: Linux 2.6.26
Required Tools: Ocfs2 Tools 1.4.2 + cluster stack tools

These changes allow Ocfs2 to fully integrate with many of the available userspace cluster stacks, making Ocfs2 the only open source cluster file system with such a wide choice of underlying stacks. Today the choices include Pacemaker (bundled with recent versions of SUSE Linux) and CMAN (bundled with EL).

More Flexible Inode Allocation

Required Kernel: Linux 2.6.26
Required Tools: Any

Previous versions of Ocfs2 would return an error if the node local inode allocator file required extension and the disk was out of space. These changes allow the node to look to other nodes' inode allocators for free space before giving up.

Cluster aware POSIX file locks (fcntl(), lockf())

Required Kernel: Linux 2.6.28
Required Tools: Any - userspace cluster stack required

POSIX locks are now cluster aware. Locks taken on one node will interact with those taken on another node. Due to the group communication required to make these locks coherent, a userspace cluster is required.
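
Unlike flock(2), POSIX locks cover byte ranges and belong to a process, so a local sketch needs a second process to show a conflict. Below, a forked child stands in for another node and tries to lock ranges that do and do not overlap the parent's. The function names are made up for illustration, and nothing here exercises the cluster stack.

```python
import fcntl
import os
import tempfile


def child_can_lock(path, start, length):
    """Fork a child that tries a non-blocking POSIX write lock on a byte range."""
    pid = os.fork()
    if pid == 0:                                   # child process
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB,
                        length, start, os.SEEK_SET)
            status = 0                             # lock was granted
        except OSError:
            status = 1                             # range is held elsewhere
        finally:
            os._exit(status)                       # skip parent cleanup code
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0


def posix_lock_demo(path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, 200)
        # Parent holds a write lock on bytes 0-99.
        fcntl.lockf(fd, fcntl.LOCK_EX, 100, 0, os.SEEK_SET)
        overlapping = child_can_lock(path, 0, 100)     # same range
        disjoint = child_can_lock(path, 100, 100)      # bytes 100-199
        return overlapping, disjoint
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(posix_lock_demo(os.path.join(d, "ranges.dat")))
```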

Extended Attributes

Required Kernel: Linux 2.6.28
Required Tools: Ocfs2 Tools 1.4.4

Ocfs2 now has some of the most flexible support for extended attributes in Linux file systems today. Small attributes can be stored directly in the inode block, which provides a large performance increase. If no more attributes can fit in the inode block, new ones are stored externally in a name-indexed btree. Small attribute values are stored inline, near their meta data, while large attribute data grows out to a btree. In theory the btrees have similar limits to inode data. In practice though, the VFS limits EA sizes to 64k.

Very Large Block Devices Support

Required Kernel: Linux 2.6.28
Required Tools: Ocfs2 Tools 1.4.2

Ocfs2 can now use JBD2. Amongst other benefits, this allows us to support large block devices with more than 32 bits worth of block numbers. As a part of these patches, an 'inode64' mount option is added which toggles creation of inodes whose inode number requires more than 32 bits to be adequately described.

User/Group Quotas

Required Kernel: Linux 2.6.29
Required Tools: Ocfs2 Tools 1.4.4

Ocfs2 now has full support for user and group quotas. The changes work with the existing set of quota tools, though you'll need support in ocfs2-tools to turn the file system feature on.

POSIX ACLs, Security (SELinux) Attributes

Required Kernel: Linux 2.6.29
Required Tools: Ocfs2 Tools 1.4.4

These are built on top of extended attributes. If EAs are turned on, the file system will now automatically support POSIX ACLs and SELinux.

Meta Data Checksums and Transparent Correction

Required Kernel: Linux 2.6.29
Required Tools: Ocfs2 Tools 1.6.0 (not yet released)

Not only can block corruptions now automatically be detected and reported via checksum, but by storing an ECC value, Ocfs2 can transparently correct small corruptions. We feel this feature is especially critical to a cluster file system, where our customers have repeatedly stated to us a desire to minimize the amount of total cluster downtime, even if it's the disk which corrupts.
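
The idea of detecting corruption with a checksum and repairing a small error with an ECC value can be illustrated with a toy scheme. This is not the on-disk format or the algorithm Ocfs2 uses. It is a simplified sketch that pairs a CRC32 with a Hamming-style parity (the XOR of the positions of all set bits), which is enough to locate and flip back a single damaged bit. All names are made up for illustration.

```python
import zlib


def position_parity(block):
    """XOR together the 1-based positions of every set bit in the block."""
    parity = 0
    for byte_index, byte in enumerate(block):
        for bit in range(8):
            if byte & (1 << bit):
                parity ^= byte_index * 8 + bit + 1
    return parity


def protect(block):
    """Return the (checksum, ecc) pair to store alongside the block."""
    return zlib.crc32(block), position_parity(block)


def verify_and_repair(block, crc, ecc):
    """Return a good copy of the block, fixing a single flipped bit if needed."""
    if zlib.crc32(block) == crc:
        return block                           # checksum matches, nothing to do
    # A single flipped bit changes the parity by exactly its own position.
    syndrome = position_parity(block) ^ ecc
    if 1 <= syndrome <= len(block) * 8:
        pos = syndrome - 1
        repaired = bytearray(block)
        repaired[pos // 8] ^= 1 << (pos % 8)
        if zlib.crc32(bytes(repaired)) == crc:
            return bytes(repaired)             # repair confirmed by checksum
    raise ValueError("uncorrectable corruption")


if __name__ == "__main__":
    original = bytes(range(256)) * 2
    crc, ecc = protect(original)
    damaged = bytearray(original)
    damaged[100] ^= 0x10                       # flip one bit
    print(verify_and_repair(bytes(damaged), crc, ecc) == original)
```

The checksum is re-checked after the repair, so larger corruptions are reported as errors rather than silently "fixed" into something wrong.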

Indexed Directories

Required Kernel: Linux 2.6.30
Required Tools: TBD - Ocfs2-tools support is currently under development

Indexed directories allow for fast lookups in directories that hold hundreds of thousands of files. Millions, even. Without this feature, the lookups are sequential, which on modern hardware does not perform well over a few thousand files. With this feature, the file names are stored in a btree, allowing the fs to quickly traverse to the block that could hold that file name.

Reflink (unlimited inode-based writeable snapshots)

Required Kernel: Linux 2.6.32
Required Tools: Ocfs2 Tools 1.6.0 (not yet released)

Reflink provides the ability to snapshot a regular file. These snapshots look like a hard link. In fact, the two share similar restrictions. Just like a hard link, one can only reflink regular files and the link has to be on the same file system as the source. But unlike a hard link, a reflinked file has the same data as the source only at the time of creation, as all writes to either file result in a Copy-on-Write on that file. Like a hard link (and unlike symlinks), a reflinked file is indistinguishable from its source. There is no parent-child relationship. Users can reflink a (reflinked) file any number of times. Well, okay. Up to 4 billion times.
