OCFS2 PROJECTS

December 2009

This page lists the potential new projects for OCFS2. This list will help us prioritize projects for the next 12 months.

Transparent Compression of Small Files

The idea here is to expand the scope of inline-data. Currently, we can store inline files of around 3K. Joel ran some tests with compression which showed that this limit could be increased to 12K. That would be a huge win for mail-store-like workloads.
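As a rough illustration of the idea, the sketch below decides whether a file's contents would fit in the inline area, either as-is or after compression. The 3K capacity is taken from the figure above; zlib, the function name, and the sample data are assumptions made for the example and say nothing about which algorithm or on-disk layout OCFS2 would actually use.

```python
import zlib

# Assumed inline capacity in bytes ("around 3K" above); the real limit
# depends on the block size and the inode layout.
INLINE_CAPACITY = 3 * 1024


def fits_inline(data, capacity=INLINE_CAPACITY):
    """Return (fits, compressed) for a hypothetical inline-data store."""
    if len(data) <= capacity:
        return True, False           # small enough to store uncompressed
    packed = zlib.compress(data, 6)
    if len(packed) <= capacity:
        return True, True            # fits inline only after compression
    return False, False              # falls back to regular extents


# A 12K text-like message; repetitive text compresses very well.
message = b"Subject: status report\r\nAll nodes are healthy.\r\n" * 256
print(len(message), fits_inline(message))
```

Real mail data will compress less well than this repetitive sample, so the achievable inline size depends on the workload.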

Support for more than 32 bits worth of cluster

We need to complete JBD2 support before even bothering with this. This feature would allow us to support very large devices without growing clustersize, which is our method of supporting them today. In particular what we'd do is grow i_clusters on the global bitmap by an additional 32 bits. The best way to do this would be to add an i_clusters_hi field to the inode and to create a new cluster bitmap type, whose chain record bitcount fields are larger. Local alloc needs a very minor update to use an additional 32 bits of reserved fields for recording start offset. The truncate log would also require a minor update to grow the log record size too. Adding support for the same in the Refcount trees will be more than trivial.

With this change, we will be able to support volume sizes ranging from 64 Zettabytes (4K clustersize, 64 bits + 12 bits) to 16 Yottabytes (1M clustersize, 64 bits + 20 bits). JBD2 will still limit us to 64 bits of 4K blocks.
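The limits quoted above follow from adding the bits in the cluster count to the bits in the clustersize. A small sketch of that arithmetic, using binary units and a helper name chosen only for this example:

```python
def max_volume_bytes(cluster_count_bits, clustersize_bits):
    """Addressable bytes = number of clusters * bytes per cluster."""
    return 1 << (cluster_count_bits + clustersize_bits)


TIB, PIB, ZIB, YIB = 1 << 40, 1 << 50, 1 << 70, 1 << 80

# Today: 32-bit cluster count.
print(max_volume_bytes(32, 12) // TIB)   # 4K clusters -> 16 (TB)
print(max_volume_bytes(32, 20) // PIB)   # 1M clusters -> 4 (PB)

# Proposed: 64-bit cluster count.
print(max_volume_bytes(64, 12) // ZIB)   # 4K clusters -> 64 (Zettabytes)
print(max_volume_bytes(64, 20) // YIB)   # 1M clusters -> 16 (Yottabytes)
```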

This change will not affect the max file size, which will still depend on the clustersize. To lift that limit, or in other words to make the fs 64-bit clean, we would need to break the format, a very unappealing proposition at this stage. We will look into that in a few years' time.

Testing Volumes > 16TB

Before we start to worry about supporting more than 32 bits worth of clusters, we should test large volumes with clustersizes above 4KB. A quick test revealed some issues: we are overflowing some number(s) somewhere. This task requires a thorough checkup of the fs, in both kernel and userspace. One should be able to test this by mounting a loopback device backed by a sparse file. As things stand today, we should be able to support up to 4PB (32 bits + 20 bits for clustersize).
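A minimal sketch of the sparse backing file mentioned above, assuming a Unix host whose filesystem supports sparse files and permits files of the requested size. The function names and the file name are made up for this example; attaching the file to a loop device and formatting it are not shown.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Create a file with an apparent size of size_bytes without writing data.

    On filesystems that support sparse files, extending a file with
    truncate() allocates no data blocks for the new range.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def apparent_and_allocated(path):
    """Return (apparent size, bytes actually allocated) for path."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512   # st_blocks: 512-byte units on Linux


if __name__ == "__main__":
    # Small demo size so the sketch runs anywhere; a real test of volumes
    # above 16TB would request a correspondingly larger size.
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "ocfs2-test.img")   # hypothetical file name
        make_sparse_file(image, 64 * 1024 * 1024)
        print(apparent_and_allocated(image))
```

Comparing the two returned values shows whether the backing filesystem really left the file sparse.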

Online Add Slots

Tao wrote up a design doc for this in 2007, but he had to shift focus to xattr and later reflink. It's time we added this feature. One known drawback is that journals created online will not be as contiguous as those created at mkfs time, because of free space fragmentation. One suggestion is to refuse to create a slot if its journal would have more than some number of extents.

Deferred Unlink

Delay the second step of an unlink (cleanup of the orphaned inode) to a worker thread, thus improving the latency of unlink by about 25%. Lustre actually implements a similar scheme, so there's some precedent for doing this. Patches for this exist (contact Mark Fasheh) but were never pushed upstream because they needed more testing and tuning, especially to be sure that we don't overwhelm the node's available memory by pinning too many inodes. The memory overhead became an issue during NFS testing.
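The general shape of the scheme can be modelled in userspace as below. This is only an illustration and not the existing patches: the class name, the cleanup callback, and the max_pending bound are all hypothetical. The bounded queue stands in for one possible way of limiting how many inodes stay pinned while they wait for cleanup.

```python
import queue
import threading


class DeferredUnlinker:
    """Hypothetical model: step one runs inline, step two runs in a worker."""

    def __init__(self, cleanup, max_pending=128):
        self._cleanup = cleanup                            # step 2: orphan cleanup
        self._pending = queue.Queue(maxsize=max_pending)   # bounds pinned inodes
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def unlink(self, inode):
        # Step 1 (removing the directory entry and orphaning the inode)
        # would happen here. put() blocks when the queue is full, which
        # throttles callers instead of letting the backlog grow unbounded.
        self._pending.put(inode)

    def _run(self):
        while True:
            inode = self._pending.get()
            try:
                self._cleanup(inode)
            finally:
                self._pending.task_done()

    def flush(self):
        """Wait until every queued cleanup has finished."""
        self._pending.join()


cleaned = []
unlinker = DeferredUnlinker(cleaned.append, max_pending=4)
for ino in range(20):
    unlinker.unlink(ino)
unlinker.flush()
```

Blocking the caller when the queue is full trades some of the latency gain for a hard cap on memory use; tuning that cap is the kind of work the patches still needed.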

Async locking of new inodes

This could help improve create performance. The idea is that when a new inode is created, we associate it in memory with its parent directory. The lock request can then become asynchronous, so long as we never drop the parent directory lock before all the child locks have been created at the proper level. The tricky part of this feature is that locking errors need to be handled well.

Freeze and Thaw

Tiger worked on the same some time ago. The patches have been saved in bz722.

Online Defragmentation

We could leverage reflink to do online defragmentation to coalesce file extents into larger contiguous chunks. Reducing free space fragmentation should also be a goal in this project.

Tracing/Logging

The mlog infrastructure is old. The kernel now has ftrace/utrace/younameit mechanisms that allow us to trace and filter the logs using pid, etc.

Wengang has started this process with a patch to trace alloc.c. The problem is that the code change is significant enough that it will be hard for us to backport patches from mainline to the 1.4 and 1.6 code trees. We could always keep the code the same by shoe-horning mlog into the tracepoint macros, but that would be a lot of work.

Our feeling is that we should wait until later in 2010, when the next releases of EL and SLES are available. If they are based on 2.6.30 or later, they should support ftrace. At that time, it will be worthwhile for us to make the change.

FS Instrumentation

The goal here is to start adding some counters (creations, deletes, renames, lookups, etc.) that are exposed via debugfs for an iostat-style tool to read and show the change per second. This will help us understand how an app interacts with the fs. The second step would be to provide information about latencies in the fs.
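The userspace half of this could be as simple as diffing two snapshots of the counters. The sketch below assumes a hypothetical "name value" text format, one counter per line; no such debugfs file exists yet, and the function names are invented for the example.

```python
def parse_counters(text):
    """Parse lines of the assumed form '<name> <value>' into a dict."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        name, value = line.split()
        counters[name] = int(value)
    return counters


def rates_per_second(prev, curr, interval_s):
    """Per-second change of each counter between two snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        name: (curr[name] - prev.get(name, 0)) / interval_s
        for name in curr
    }


# Two snapshots taken 2 seconds apart (made-up numbers).
before = parse_counters("creates 100\nunlinks 40\nlookups 9000\n")
after = parse_counters("creates 150\nunlinks 40\nlookups 9500\n")
print(rates_per_second(before, after, 2.0))
```

A real tool would read the file in a loop and print one line per interval, much like iostat.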

On a parallel track, Sunil has patches that instrument the o2net/o2dlm layer. Hopefully, those patches will see the light of day sometime soon.
