This shouldn't be very hard, though it requires updating ocfs2-tools.
Could be easily implemented on top of the local alloc bitmap, thus helping reduce fragmentation if we have multiple threads doing extending writes. See the Ext3 code for this.
This is important - we need the ability to fix a certain subset of file system problems without taking the entire cluster down.
Generally, this would involve userspace software walking the file system, copying files and 'moving' extents from file to file via an ioctl. File data changes would be detected via timestamp or other metadata updates. One could even imagine software which automatically activates on a nightly basis to optimize file system layout.
Support for more than 32 bits worth of clusters
We need to complete JBD2 support before even bothering with this. This feature would allow us to support very large devices without growing clustersize, which is our method of supporting them today. In particular, we'd grow i_clusters on the global bitmap by an additional 32 bits. The best way to do this would be to add an i_clusters_hi field to the inode and to create a new cluster bitmap type whose chain record bitcount fields are larger. Local alloc needs a very minor update to use an additional 32 bits of reserved fields for recording the start offset. The truncate log would also require a minor update to grow the log record size.
F_GETLEASE / F_SETLEASE
Needs to be cluster aware. NFSv4 uses these for Directory Delegations, so support for this will help with NFSv4 directory caching.
We have no chance of ever getting this in kernel, so it behooves us to get it out of the way, as opposed to maintaining an external patch (which means we have to push it to vendors on every release, test separate from mainline, etc). So the software needs to be done entirely in userspace, which means using bind mounts. The good news is that we can trivially implement this as a separate project, and teach mount.ocfs2 how to run things on a new mount. Some level of backwards compatibility can surely be maintained. This will likely require a patch to the umount binary.
SharedRoot project has a pure user-space solution for the same.
This makes a lot of sense for a cluster file system, especially Ocfs2, since all our data is represented in extents. Doing this right looks much more possible these days - we can do the allocation safely in ->writepages(), so ->writepage() is no longer a problem. The total number of "reserved for delalloc" clusters needs to be tracked cluster-wide so that we don't return -ENOSPC after the applications' write calls have already returned. We also have to be able to do writeout for an inode on demand when transferring locks.
External journal support
Support for external journal devices.
Much like ext3 handles it, but this needs to be cluster aware which complicates things.
Owner: Jan Kara
We should eventually support this, though I don't know that it's really a particularly useful feature these days.
Inode Allocation Improvements
Improve our inode allocation strategy by allowing the allocator to make better decisions with regard to inode placement.
The ext3 htree design's major feature is that it can be 'online' backwards compatible with old ext3 directories. Managing that across a cluster of nodes is much more difficult. We must also look at the way htree is implemented: in order to get to your hash, you have to look up the allocation of at least two blocks, and *then* read them. This results in much slower average performance than if directory inodes were directly hashed. We should take the opportunity to develop some in-tree indexing scheme which can avoid the problems htree has. The feature can be added such that both directory schemes can exist on a given device (via a directory dinode "indexed" flag), making an upgrade to the new scheme possible.
Allows node membership, including external heartbeat and node fencing, to be controlled from user space. Cluster membership is a solved problem in the high availability arena, and inputting node membership from a user space high availability system enables OCFS2 to take advantage of those algorithms. It also allows true external device fencing, such as terminating a node's access to a SAN device by removing permission in the SAN fabric, giving OCFS2 better options than panicking when it loses connectivity. See OCFS2/CManUnderO2CB.
Online Resize (Completed as of 2.6.25-rc1)
Target Release: 1.4
Userspace will be responsible for taking steps to ensure that other nodes don't do a simultaneous resize. It will also pre-format all new cluster groups and simply pass their block numbers into the kernel so that the fs module can do the bitmap locking and group linking steps. We implement the kernel part because this is a high-priority feature and there is precedent for doing online resize at least partly in kernel.
Data in Inode Blocks (Completed as of 2.6.24-rc1)
Owner: Mark Fasheh
Target Release: 1.4
Should speed up local node data operations with small files significantly, by reducing the number of blocks requiring modification to one.
Unwritten extents (Completed as of 2.6.23-rc1)
Would allow us to support posix_fallocate(), which asks the file system to allocate a region of a file without actually writing any file data. Reads from those regions return zeros until data is written to them.
Shared Writeable MMAP (Completed as of 2.6.23-rc1)
Target Release: 1.4
Needs one more bugfix and then this is ready to go. See the MMAP patch and tests.
Remove ocfs2_delete_inode() Vote (Completed as of 2.6.22-rc1)
Target Release: 1.4
Speeds up ocfs2_delete_inode(), another critical path for our unlink code. The best scheme will follow directly the VFS reference counting, thus only destroying the inode on last reference. What we'll want to do is take a shared lock on an "inode open" resource in ocfs2_read_locked_inode(). ->ocfs2_delete_inode() can do a trylock operation to get an exclusive lock on it. If the trylock fails, we know the inode still exists on another node. Serialize the trylock in ->ocfs2_delete_inode() by holding an exclusive inode meta data lock.
Sparse file support (Completed as of 2.6.22-rc1)
Target Release: 1.4
This gains us performance, more efficient usage of space, and has no impact on direct io. It also has the effect of simplifying many critical paths (file write, ftruncate, etc), which will make it easier to implement other allocation changes in the future. This should be done before "Data in inode blocks" and "inline extended attributes" as those two become much easier afterwards.
Remove Dentry Votes (Completed as of 2.6.19-rc1)
Removes rename and unlink votes by putting a lock on the dentry structure. This will speed up most cases of those operations by only messaging those nodes with an actual interest in the particular files.
Directory Readahead (Completed as of 2.6.19-rc1)
Mostly done anyway, just needs some multi-node testing. See the readahead patch.
Extended Attributes (Completed as of 2.6.29)
Owner: Tao and Tiger
Gets us a number of features which require extended attributes, such as POSIX ACLs and SELinux support. This includes "inline" extended attributes, which place a small number of them in the inode block (until the space is needed by the extent maps). Inline xattrs have a major performance benefit in benchmarks. The current proposal is a hybrid of the ext3 scheme, potentially falling out to a single-level, hash-indexed tree if the number of attributes grows larger than one block.
Userspace file locking support (Completed as of 2.6.28)
This is the last "posix" feature we need. We don't have to support mandatory file locks, nor does any solution have to be particularly performant - applications which want locking performance tend to implement their own dlm - the POSIX api doesn't lend itself very easily to high performance locking. flock(2) (non POSIX) type locks could be done very easily by just creating a new lock type in dlmglue. fcntl/lockf however is much trickier and would require some new cluster messages.
JBD2 Support (Completed as of 2.6.27)
Plug Ocfs2 into the jbd2 api. This should be a compile-time option for the kernel module. We can use JBD2 in backwards compatibility mode for existing file systems, and mkfs/tunefs could be modified to create jbd2-formatted journals. This would remove the 32-bit block limit on Ocfs2. Though simply plugging in can be tested anywhere, we'd need to test on very large file systems once we have tools code to turn on the new JBD2 features.