


What is this?

The following is a list of interesting features which have been added to Ocfs2 since it was accepted into the mainline Linux kernel. The list isn't meant to be comprehensive - commit logs have all the gory details. This is more intended to provide a higher level view of the more important or interesting changes.

What is this NOT?

To be clear, this is not a release announcement and should not be construed as one. The final list of features available to customers via backported modules or vendor distributions can vary, though obviously we hope to include as much of this as possible. Please keep in mind that backporting file system features across six or more kernel versions is a very large task, and that some core kernel code required by some file system features may not exist in some enterprise kernel releases.

How do I get these features?

The best way to get access to all these features is to run the latest mainline Linux kernel. Additionally, the Ocfs2 team at Oracle makes regular releases of the Ocfs2 1.4 module, which includes many of these features. The 1.4 module works with many existing enterprise distributions. SLES users can look to SLES10 SP2, or the next major version, SLES11, for most of the features listed below.

If you're running a recent mainline kernel, you can use the 1.4.x series of ocfs2-tools to create and convert file systems to the new feature set. Most features, however, don't require any disk changes. If a feature requires disk changes, the appropriate mkfs.ocfs2 flag is noted below. Additionally, tunefs.ocfs2 can convert a file system to most features, and back.

In 1.4.x releases, mkfs.ocfs2 defaults to a set of file system feature flags which are intended to work on most recent mainline kernel releases. Typically, we define recent as within the last two versions. That is, if the current kernel version is 2.6.23, the mkfs.ocfs2 defaults would turn on any disk features which were included in 2.6.23 and 2.6.22. File system feature sets can always be fine-tuned, however, so if you're running an older kernel it's very easy to build file systems it will understand.

mkfs.ocfs2 can build an Ocfs2 file system with any of three sets of disk features: those understood by most recent kernels; those understood on all kernels and all versions of Ocfs2 ever released; or those which will work on only the latest kernel releases.
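As a rough sketch of what such invocations might look like, the snippet below assembles one command line per case. It assumes the `--fs-feature-level=` option and its `default`, `max-compat`, and `max-features` values as described in the ocfs2-tools 1.4 mkfs.ocfs2 man page. The mapping of cases to levels is an assumption, and `/dev/sdX` is a placeholder device. Nothing is executed, so check the man page for your version before running any of these.

```python
# Assumed mapping from the three cases above to mkfs.ocfs2 feature levels.
FEATURE_LEVELS = {
    "most recent kernels": "default",
    "all kernels and all Ocfs2 versions": "max-compat",
    "latest kernels only": "max-features",
}


def mkfs_command(case, device):
    """Return the argv list for one of the three cases; nothing is executed."""
    return ["mkfs.ocfs2", "--fs-feature-level=" + FEATURE_LEVELS[case], device]


for case in FEATURE_LEVELS:
    # /dev/sdX is a placeholder, not a real device.
    print(case + ":", " ".join(mkfs_command(case, "/dev/sdX")))
```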

mkfs.ocfs2 can also fine-tune feature flags via the --fs-features= option. Please consult the mkfs.ocfs2 man page for more details.

Can I add new disk features to an existing file system?

Disk features can be turned on and off via tunefs.ocfs2, in most cases with the --fs-features= switch. Consult the tunefs.ocfs2 documentation for details. For example, to turn on sparse files and unwritten extents support:

Turning off a feature only requires that you prepend 'no' to the feature name. For example, to turn off unwritten extents support:
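The naming rule can be sketched as a small helper that builds the `--fs-features=` argument. The feature names `sparse` and `unwritten` are assumed from the ocfs2-tools documentation, `/dev/sdX` is a placeholder device, and the helper names are hypothetical. The snippet only prints candidate command lines and does not run tunefs.ocfs2.

```python
def fs_features_option(enable=(), disable=()):
    """Build a --fs-features= argument; disabled features get a 'no' prefix."""
    names = list(enable) + ["no" + name for name in disable]
    if not names:
        raise ValueError("no features given")
    return "--fs-features=" + ",".join(names)


def tunefs_command(device, enable=(), disable=()):
    """Return the argv list for a tunefs.ocfs2 call; nothing is executed."""
    return ["tunefs.ocfs2", fs_features_option(enable, disable), device]


# Feature names below are assumptions taken from the ocfs2-tools man pages.
print(" ".join(tunefs_command("/dev/sdX", enable=["sparse", "unwritten"])))
print(" ".join(tunefs_command("/dev/sdX", disable=["unwritten"])))
```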

Great, how about some specifics?

The following list is ordered by kernel version. As new features get added, the Ocfs2 team will update this document.

Ordered Mode Journaling

When mounted with data=ordered (the new default journaling mode), Ocfs2 will flush file data to disk before committing its metadata. This flushing ensures that data written to newly allocated regions will be there after a file system crash, at the expense of some performance. Users can go back to data=writeback mode if they'd rather have the performance at the expense of some small amount of data integrity. Meta data integrity is always preserved in all journaling modes.

File Attribute Support

This allows a user to use the chattr command to set and clear Ext2 style file attributes such as the immutable bit. lsattr can be used to view the current set of attributes.

Directory Readahead

Enhances the performance of many directory operations by asynchronously reading blocks which may get accessed in the future.

Performance Enhancement - stat(2)

We managed to improve our cold cache stat(2) times by cutting the required amount of disk I/O by 50%.

Performance Enhancement - unlink(2)

Ocfs2 1.2 requires a broadcast message in order to unlink(2) or rename(2) a file. We replaced the broadcast messaging with a DLM lock which covers directory entries. This allows for faster unlink(2) times due to lower messaging overhead.

Splice Support

Provides support for the splice(2) system call. Splice allows for efficient copies between file descriptors by moving the data within the kernel. For example, typically a program copying data between two files would have to read blocks from the source file into a local buffer, then write that buffer to the target file. Splice speeds up that process by moving the data between file descriptors without copying it through a userspace buffer.
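A minimal sketch of a splice-based file copy follows, assuming Linux and Python 3.10 or later (where `os.splice` is available). Because splice(2) requires one side of each call to be a pipe, the sketch uses a pipe as the in-kernel buffer between the two files. The function name `splice_copy` is hypothetical.

```python
import os


def splice_copy(src_path, dst_path, chunk=64 * 1024):
    """Copy src_path to dst_path with splice(2); return the bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        read_end, write_end = os.pipe()
        try:
            while True:
                # File -> pipe: the data is queued inside the kernel.
                moved = os.splice(src.fileno(), write_end, chunk)
                if moved == 0:  # end of the source file
                    break
                # Pipe -> file: drain exactly what was queued.
                remaining = moved
                while remaining:
                    remaining -= os.splice(read_end, dst.fileno(), remaining)
                copied += moved
        finally:
            os.close(read_end)
            os.close(write_end)
    return copied
```

Calling `splice_copy("source.bin", "target.bin")` would copy the file without the data ever passing through a Python buffer.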

Local Mounts

This allows the user to mark an Ocfs2 file system as local. Local mounts skip all cluster code and act like a single node file system (for example, ext3). Tunefs.ocfs2 can be used to switch a file system from local to clustered mode. This allows customers to use Ocfs2 for non-clustered use, but with the option of clustering the file system at a later time.

Atime/Mtime Updates

This feature was requested often.

Mtime is now always updated on buffered writes. The only exception is O_DIRECT, where avoiding a meta data update lets us support multiple streaming O_DIRECT writers.

Atime updates now happen consistently and are propagated throughout the cluster. Since atime can have a negative performance impact, Ocfs2 is flexible in how it handles atime updates. How atime is updated can be tuned via mount option:

Additionally, all time updates in Ocfs2 have nanosecond resolution.

Sparse File Support

Ocfs2 1.2 didn't support the ability to have holes in files. This meant that simple ftruncate(2) operations had to allocate data and fill that space by writing zeros to disk. If the sparse files bit is set on the Ocfs2 super block, none of this is necessary. Since sparse file support required us to code more flexible btree operations, it paved the way for other, more advanced file system features such as Unwritten Extents which are described in the next section.
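The effect of a hole can be observed from userspace on any file system with sparse file support. The sketch below is not Ocfs2-specific and uses a temporary file. It extends an empty file with ftruncate(2) and compares the apparent size with the space actually allocated.

```python
import os
import tempfile


def sparse_file_usage(size=64 * 1024 * 1024):
    """Extend an empty file to `size` bytes; return (st_size, bytes allocated)."""
    with tempfile.NamedTemporaryFile() as tmp:
        os.ftruncate(tmp.fileno(), size)  # creates a hole; no data is written
        info = os.fstat(tmp.fileno())
        # st_blocks is counted in 512-byte units.
        return info.st_size, info.st_blocks * 512


apparent, allocated = sparse_file_usage()
print("apparent size:", apparent, "allocated:", allocated)
```

On a file system without sparse file support, the allocated figure would instead be close to the apparent size.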

Flexible Allocation API

Aside from sparse files, Ocfs2 now supports some more advanced features which are intended to allow users more control over inode btree allocation. Software can access these features via an ioctl(2), or fallocate(2) on later kernels.
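As a hedged illustration of the fallocate(2) path, the sketch below uses Python's `os.posix_fallocate`, which wraps posix_fallocate(3). On Linux with glibc, that function uses fallocate(2) when the file system supports it. The helper name `preallocate` is hypothetical, and the Ocfs2-specific ioctl(2) interface is not shown.

```python
import os


def preallocate(path, length):
    """Ask the file system to reserve `length` bytes for `path`; return st_size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, length)  # offset 0, `length` bytes
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
```

After `preallocate("data.bin", 1 << 20)` the file reports a size of 1 MiB and reads back as zeros, even though the application wrote no data.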

Shared Writeable MMAP

Another feature that was very often requested. Shared writeable memory mappings are fully supported now on Ocfs2.
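A minimal sketch of what a shared writeable mapping looks like from an application follows, using Python's `mmap` module on a Unix system. The helper name is hypothetical and the file must already be non-empty.

```python
import mmap


def write_through_mapping(path, offset, payload):
    """Modify a file through a shared writeable mapping and flush it back."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; MAP_SHARED makes stores visible in the file.
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as mapping:
            mapping[offset:offset + len(payload)] = payload
            mapping.flush()  # write the dirty pages back to the file
```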

Data in Inode

This saves on space by storing file and directory data directly inside an inode block. Data is transparently moved out to an extent when it no longer fits inside the inode block. In some cases, this can also make a positive impact on cold-cache directory and file operations.

Online Resize

Tunefs.ocfs2 can now instruct a running file system to resize itself. Only online volume expansion is supported at this time. The new volume size is reflected in the file system meta data where other nodes will pick it up.

Cluster aware flock(2)

The flock(2) system call is now cluster aware. File locks taken on one node from userspace will interact with those taken on other nodes. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate kill signal is received by the user. Unfortunately, POSIX file locks, also known as lockf(3) or fcntl(2) locks, are not yet supported in a cluster-aware manner. We hope to have that ready in an upcoming version of Ocfs2.
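For readers unfamiliar with flock(2), here is a small sketch of a non-blocking exclusive lock attempt using Python's `fcntl` module. It runs on a single machine and shows only the application-side call; the cluster-wide behaviour described above comes from the file system, not from this code. The helper name `try_flock` is hypothetical.

```python
import fcntl


def try_flock(fd):
    """Try to take an exclusive flock(2) lock without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:  # another open file description holds the lock
        return False
```

flock(2) locks belong to the open file description, so two separate opens of the same file conflict with each other even within one process.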

Userspace Clustering

These changes allow Ocfs2 to fully integrate with many of the available userspace cluster stacks, making Ocfs2 the only open source cluster file system with such a wide choice of underlying stacks. Today the choices include Pacemaker (bundled with recent versions of SUSE Linux) and CMAN (bundled with EL).

More Flexible Inode Allocation

Previous versions of Ocfs2 would return an error if the node-local inode allocator file required extension and the disk was out of space. These changes allow the node to look to other nodes' inode allocators for free space before giving up.

Cluster aware POSIX file locks (fcntl(), lockf())

POSIX locks are now cluster aware. Locks taken on one node will interact with those taken on another node. Due to the group communication required to make these locks coherent, a userspace cluster is required.
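POSIX locks are owned by a process rather than by an open file, so observing a conflict takes a second process. The sketch below runs on a single Unix machine with `os.fork`; the helper name is hypothetical. It forks a child that attempts a non-blocking exclusive lock and reports whether the attempt was refused. On a cluster the contending process could be on another node, which this sketch does not exercise.

```python
import fcntl
import os


def other_process_is_blocked(path):
    """Fork a child; return True if its non-blocking POSIX lock attempt fails."""
    pid = os.fork()
    if pid == 0:
        # Child: exit status 1 means the open or the lock attempt failed.
        status = 0
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            status = 1
        os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 1
```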

Extended Attributes

Ocfs2 now has some of the most flexible support for extended attributes in Linux file systems today. Small attributes can be stored directly in the inode block, which provides a large performance increase. If no more attributes can fit in the inode block, new ones are stored externally in a name-indexed btree. Small attribute values are stored inline, near their meta data, while large attribute data grows out to a btree. In theory the btrees have similar limits to inode data. In practice though, the VFS limits EA sizes to 64k.

Very Large Block Devices Support

Ocfs2 can now use JBD2. Amongst other benefits, this allows us to support large block devices with more than 32 bits worth of block numbers. As a part of these patches, an 'inode64' mount option is added which toggles creation of inodes whose inode number requires more than 32 bits to be adequately described.
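To put the 32-bit limit in perspective, the arithmetic below computes the largest device addressable with 32-bit block numbers. The block sizes listed (512 bytes to 4 KiB) are an assumption about typical Ocfs2 configurations and are not taken from the text above.

```python
TIB = 2 ** 40


def max_device_bytes(block_size, block_number_bits=32):
    """Largest device size addressable with the given block number width."""
    return block_size * 2 ** block_number_bits


for block_size in (512, 1024, 2048, 4096):  # assumed block sizes
    print(block_size, "byte blocks:", max_device_bytes(block_size) // TIB, "TiB")
```

With 4 KiB blocks, 32-bit block numbers top out at 16 TiB.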

User/Group Quotas

Ocfs2 now has full support for user and group quotas. The changes work with the existing set of quota tools, though you'll need support in ocfs2-tools to turn the file system feature on.

POSIX ACLS, Security (Selinux) Attributes

These are built on top of extended attributes. If EAs are turned on, the file system will now automatically support POSIX ACLs and SELinux.

Meta Data Checksums and Transparent Correction

Not only can block corruptions now automatically be detected and reported via checksum, but by storing an ECC value, Ocfs2 can transparently correct small corruptions. We feel this feature is especially critical to a cluster file system, where our customers have repeatedly stated to us a desire to minimize the amount of total cluster downtime, even if it's the disk which corrupts.
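The idea of pairing a checksum (detection) with an ECC value (correction) can be illustrated with a toy scheme. This is not Ocfs2's on-disk format or code. It assumes CRC32 from `zlib` as the checksum and uses, as a Hamming-style ECC, the XOR of the 1-based positions of all set bits, which pinpoints any single flipped bit in the data.

```python
import zlib


def position_xor(block):
    """XOR of the 1-based positions of every set bit in the block."""
    acc = 0
    for index, byte in enumerate(block):
        for bit in range(8):
            if (byte >> bit) & 1:
                acc ^= index * 8 + bit + 1
    return acc


def protect(block):
    """Return the (checksum, ecc) pair that would be stored with the block."""
    return zlib.crc32(block), position_xor(block)


def verify_and_fix(block, crc, ecc):
    """Return the block, repairing a single flipped bit if the checksum fails."""
    if zlib.crc32(block) == crc:
        return block
    # A single bit flip changes the position XOR by exactly that bit's position.
    syndrome = position_xor(block) ^ ecc
    if syndrome == 0 or syndrome > len(block) * 8:
        raise ValueError("uncorrectable corruption")
    position = syndrome - 1
    fixed = bytearray(block)
    fixed[position // 8] ^= 1 << (position % 8)
    if zlib.crc32(bytes(fixed)) != crc:
        raise ValueError("uncorrectable corruption")
    return bytes(fixed)
```

The final checksum comparison matters: when more than one bit is damaged the syndrome points at the wrong bit, and the checksum catches the failed repair instead of returning bad data.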

Indexed Directories

Indexed directories allow for fast lookups in directories that hold hundreds of thousands of files. Millions, even. Without this feature, the lookups are sequential which on modern hardware does not perform well over a few thousand files. With this feature, the file names are stored in a btree allowing the fs to quickly traverse to the block that could hold that file name.
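The difference between the two lookup strategies can be sketched with a toy model: directory blocks as lists of names, and a sorted list of (hash, block number) pairs standing in for the btree. The CRC32 name hash and all function names are illustrative assumptions, not Ocfs2's actual hash or structures.

```python
import bisect
import zlib


def name_hash(name):
    """Stand-in name hash (an assumption, not Ocfs2's hash function)."""
    return zlib.crc32(name.encode())


def linear_lookup(blocks, name):
    """Scan directory blocks in order; return (block number, blocks examined)."""
    for number, entries in enumerate(blocks):
        if name in entries:
            return number, number + 1
    return None, len(blocks)


def build_index(blocks):
    """Sorted (hash, block number) pairs, standing in for the on-disk btree."""
    return sorted((name_hash(name), number)
                  for number, entries in enumerate(blocks)
                  for name in entries)


def indexed_lookup(index, blocks, name):
    """Binary-search the index, then examine only the candidate blocks."""
    wanted = name_hash(name)
    i = bisect.bisect_left(index, (wanted, -1))
    examined = 0
    while i < len(index) and index[i][0] == wanted:
        examined += 1
        if name in blocks[index[i][1]]:
            return index[i][1], examined
        i += 1
    return None, examined
```

In this model a name in the last of 50 blocks costs 50 block reads with the linear scan, while the indexed lookup reads only the blocks whose entries share the name's hash.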

Reflink (unlimited inode-based writeable snapshots)

Reflink provides the ability to snapshot a regular file. These snapshots look like a hard link. In fact, the two share similar restrictions. Just like a hard link, one can only reflink regular files and the link has to be on the same file system as the source. But unlike a hard link, a reflinked file has the same data as the source only at the time of creation, as all writes to either file result in a Copy-on-Write on that file. Like a hard link (and unlike symlinks), a reflinked file is indistinguishable from its source. There is no parent-child relationship. Users can reflink a (reflinked) file any number of times. Well, okay. Up to 4 billion times.
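The Copy-on-Write semantics described above can be modelled with a toy in-memory block store that keeps a reference count per block. The class and method names are hypothetical and this is not how Ocfs2 implements reflink on disk. It only shows that a snapshot shares blocks until either file is written.

```python
class Volume:
    """Toy block store with per-block reference counts."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes
        self.refcount = {}  # block id -> number of files pointing at it
        self.next_id = 0

    def alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 1
        return block_id


class File:
    def __init__(self, volume, chunks=()):
        self.volume = volume
        self.extents = [volume.alloc(chunk) for chunk in chunks]

    def reflink(self):
        """Snapshot: share every block and bump its reference count."""
        clone = File(self.volume)
        clone.extents = list(self.extents)
        for block_id in clone.extents:
            self.volume.refcount[block_id] += 1
        return clone

    def write(self, index, data):
        block_id = self.extents[index]
        if self.volume.refcount[block_id] > 1:
            # Shared block: copy on write so the other file is unaffected.
            self.volume.refcount[block_id] -= 1
            self.extents[index] = self.volume.alloc(data)
        else:
            self.volume.blocks[block_id] = data

    def read(self):
        return b"".join(self.volume.blocks[b] for b in self.extents)
```

Neither file is the parent in this model: a write to the source after the reflink triggers the same copy as a write to the snapshot.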

2012-11-08 13:01