A soft reflink is to a reflink what a symlink is to a hard link. A soft reflink creates a new inode with no extents; an extended attribute stores the path to the source. All reads are redirected to the source inode. Writes are supplemented with data from the source inode and saved locally. For example, if the user were to issue a write of 10 bytes at offset 100, the fs would need to read the cache page for the first cluster of the source inode, overwrite it with the new data, and write the result to the new inode.
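The read-modify-write above can be sketched as follows. This is a minimal user-space illustration of the idea, assuming a 4K cluster, a write contained in one cluster, and hypothetical helper/field names (not the kernel implementation):

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SIZE 4096

/* Simulated soft-reflink write: if the affected cluster has never been
 * written locally, seed it from the source inode's data, then overlay
 * the new bytes and keep the result in the new inode. */
static void soft_reflink_write(const char *src, char *dst, int *dst_mapped,
                               const char *buf, size_t count, size_t offset)
{
    size_t cluster = offset / CLUSTER_SIZE;

    if (!dst_mapped[cluster]) {
        /* Cluster not yet written locally: copy it from the source. */
        memcpy(dst + cluster * CLUSTER_SIZE,
               src + cluster * CLUSTER_SIZE, CLUSTER_SIZE);
        dst_mapped[cluster] = 1;
    }
    /* Overlay the new data (assumed to fit within one cluster). */
    memcpy(dst + offset, buf, count);
}
```

Subsequent reads of that cluster would then come from the new inode, while untouched clusters still redirect to the source.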
Pros: Cross-file-system support. More efficient, as the write operation would work in terms of clusters and not 1MB chunks.
Cons: Any changes to the source file could potentially logically corrupt the data in the soft reflinks.
Note: One should be able to create reflinks of soft reflinks. Soft reflinks of soft reflinks should also be possible with some max nesting count. But do we care?
P.S. This idea came to me when I (Sunil) was discussing storage issues with Kurt. I am not yet convinced that it is worthwhile implementing it.
Owner: Tiger Yang
A static file is a regular file with frozen metadata. It does not allow any updates to its metadata. As in, no new links, unlinks, touches, extends, truncates, reflinks, etc. All these operations will fail with EACCES (or EPERM). For simplicity, we require the file to be fully allocated and initialized.
The only operations the file will allow are direct reads and writes.
What this means is that the file operations require neither journaling nor cluster locks. This means we should be able to "safely" do I/O to such a file during recovery (DLM and fs).
For now, static file enabling/disabling will be an offline activity using tunefs.ocfs2. While we could come up with a safe scheme to enable it online, we see no way to safely disable it. Remember that we don't want to involve any cluster locking. Direct or indirect. And if we cannot disable it online, we should not allow it to be enabled online, as that could easily be used for DoS.
Our initial understanding was that this feature would not be of any interest to local file systems. But there appears to be a use case that is appropriate for local file systems too. Specifically, an attribute that denotes a fixed file mapping, which can be used by boot loaders to store the block pointers of files that must not be relocated by EXT4_IOC_MOVE_EXT. Currently the boot loaders expect the immutable attribute to prevent the relocation, and it does not. One current suggestion is to add a new attribute, Fixed Metadata, that denotes a file whose metadata cannot be changed. The file itself can be written to, but writes that require a metadata change will be denied. This email thread can be followed here.
Enlarge the scope of the intr mount option
Currently ocfs2 allows the nointr/intr mount options. nointr means disable signal checking before the fs enters the dlm to obtain a lock. intr, the default, means do that signal checking and exit, if needed, with ERESTARTSYS.
While nointr is very useful, intr is almost useless as it has a very narrow scope.
One problem we have with cluster locking is that if another node is not downconverting its lock, processes on other nodes will hang. That is to be expected. What is not expected is when multiple processes hang on the same lock and are all non-killable. Say a user issues a df that hangs. It is very likely the user will issue multiple dfs only to see them all hang. All non-killable.
In ocfs2_cluster_lock(), when a thread comes to make an upconvert, it first sees whether an upconversion is in progress (BUSY). If it is, it skips the dlm call and waits directly on the completion event. This wait is non-interruptible when it does not need to be. The thread that has made the dlm call needs to be non-interruptible, as it is waiting for the ast callback. But the second and subsequent threads are just waiting on the completion event. There is no reason why they should not be killable.
One suggestion is to allow such processes to be interrupted if the user has mounted intr.
Track PIDs of Lock Holders
Currently the fs keeps the counts of ro and ex lock holders. While this works well, it would be more useful in debugging if we track the PIDs too. Right now we can track hangs due to cluster locking from the node that has the problem to the resource master and onto the node that is not relinquishing the lock. We can see the fs cannot relinquish the lock because it has one or more holders. But we cannot go any further. In mainline, we can get stacks per pid/tid. "cat /proc/PID/task/TID/stack". If we knew the PID, we'd be able to get the problematic stack. In almost all cases, if we encounter this problem, it is not an ocfs2 issue, but a problem in some other component in the kernel. Knowing the stack will allow us to narrow down the scope of the problem quickly.
Parallel Journal Replay
Currently during fs recovery, we replay the journals one slot at a time. ocfs2_recover_node() is called per node/slot to replay the journal and recover the slot. Here most of the time is spent during journal replay as it has to read the journal and then write the blocks all over the volume. This becomes slow when it has to recover multiple slots which is likely to become more common as cluster sizes grow. We should look into doing fs recovery in parallel.
Handle Incorrectly Linked Chains
We need to handle incorrect links in our chain allocator. We had a case recently in which the global bitmap chain became circular leading to a loop in the fs allocator and a similar loop in fsck. In short, neither the fs nor the tools can handle this corruption.
What should be happening is that the fs should detect the issue and go read-only. Meanwhile, fsck should be able to detect and fix this. (debugfs also goes into a loop.)
In ocfs2_search_chain(), we scan the group descriptors and rely on a NULL bg->bg_next_group to terminate. Instead we should keep a running count of the descriptors read and flag an error if the count exceeds max_chain_len. max_chain_len could be calculated as follows.
num_gds = (i_clusters + cl_cpg - 1) / cl_cpg;
max_chain_len = (num_gds + l_count - 1) / l_count;
This should work for both global and the sub allocators.
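A sketch of the bound and the proposed loop guard; this is a user-space model with illustrative values, where the chain is represented as a simple next-pointer array rather than on-disk group descriptor blocks:

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on group descriptors per chain: ceil(num_gds / l_count),
 * where num_gds = ceil(i_clusters / cl_cpg) is the total number of
 * group descriptors the allocator can hold. */
static uint32_t max_chain_len(uint32_t i_clusters, uint32_t cl_cpg,
                              uint32_t l_count)
{
    uint32_t num_gds = (i_clusters + cl_cpg - 1) / cl_cpg;
    return (num_gds + l_count - 1) / l_count;
}

/* Walk a chain, counting descriptors. Returns the chain length, or -1
 * if the count exceeds the limit (i.e. the chain is circular/corrupt),
 * instead of looping forever on a bad bg_next_group link. */
static int walk_chain(const uint64_t *bg_next_group, uint64_t head,
                      uint32_t limit)
{
    uint32_t count = 0;
    uint64_t blkno;

    for (blkno = head; blkno != 0; blkno = bg_next_group[blkno]) {
        if (++count > limit)
            return -1;  /* corrupt: longer than max_chain_len allows */
    }
    return count;
}
```

The same guard would apply to fsck's bitmap scan and to debugfs when printing the allocators.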
In fsck, we appear to be handling this partially. However, while reading the bitmap, a circular chain throws fsck into a loop. We should add a corruption code to fswreck to duplicate this problem and ensure fsck can handle it.
debugfs, when printing the allocators, should be able to detect and handle this gracefully.
tunefs scans the global bitmap before making some changes. It too should be tested to ensure it can handle this corruption. It should detect the problem and ask the user to run fsck to fix the volume.
Extent Block Pre-allocation and Stealing (Completed as of 2.6.34)
In OCFS2, the space required for inodes and extent blocks is dynamically allocated. This space is tracked by slot specific inode_alloc and extent_alloc files. As these are bitmaps, they are designed to grow in contiguous chunks of a fixed size. For a typical fs, this size is 4MB.
The problem is that it is hard to get a contiguous 4MB chunk in an aging file system with fragmented free space. We encountered this issue earlier with inodes, when users were able to create files on one node and not on another. The problem then was that while the file system as a whole had enough space for inodes, there was no free space in one node's inode allocator. The solution was to introduce inode stealing, which allowed nodes to use the inode space allocated to another slot/node.
We are now encountering the same issue with extent blocks. The difference here is that most slots do not have any free space in the extent allocator. The reason is that our inodes have a high fan-out and thus rarely require extent blocks to track large files. Extent blocks come into serious play only when the free space gets fragmented, which is just the time when we are unable to grow the extent allocator. Wonderful!
I just had a concall with a user who appears to be running into this issue. He has 10 slots. The extent allocator sizes for the slots are 0M, 4M, 4M, 4M, 0M, 37M, 16M, 8M, 8M, 12M. Total 93MB. Volume size is 700GB. -sm
bz1197 logged for the same. 20G volume with 4 slots. The sizes of extent allocators are 4M, 4M, 4M, 4M. The sizes of the inode allocators are 176M, 8M, 4M, 4M. -sm
Another customer hit the same issue. Two volumes. First is a 50G volume with 16 slots. Only 3 in use. Inode allocators are 3G, 2.5G and 3G. Extent allocators are 20M, 8M, 12M. The other volume is a 100G volume with 16 slots. Again only 3 in use. Inode allocators are 2.8G, 500M and 800M. Extent allocators are 8M, 8M and 8M. -sm
So what this means is that mere stealing of space for the extent allocator will not help. We will have to pre-allocate space to this allocator during mkfs. We have resisted the same for inodes because we don't want to waste space. But with extents, the space required is fairly small. These bugzillas have the system directory output. I see something on the order of 20M consumed by extent_alloc for volumes ranging from 100GB to 2TB. Owner needs to validate my claim. ;)
This task requires adding extent allocator stealing similar to that for inodes, and modifying mkfs to preallocate some space. Say 60MB for the volume, divided equally among the slots. Give or take.
It would be useful to cache ACLs in the inode like other filesystems do. It is, of course, harder in the case of cluster locking, but it would be nice in the case of read-mostly, single-node, or local filesystems.
Maybe a bit somewhere in the LVB could trigger the revalidation, but I'm not sure I'd waste it.
More file limits checks
Check s_maxbytes in more places in the OCFS2 code (read/write/truncate in particular). The worry is that the 64-bit machines in a mixed 32/64-bit cluster might set things like i_size past what the 32-bit machines can handle. They should be able to cleanly error out when encountering such situations.
I believe this task has been completed. Need to verify.
Delay the 2nd step of an unlink (cleanup of the orphaned inode) to a worker thread, thus improving the latency of unlink by about 25%. Lustre actually implements a similar scheme, so there's some precedent for doing this. Patches for this exist (contact Mark Fasheh) but were never pushed upstream because they needed more testing and tuning, especially to be sure that we don't overwhelm the node's available memory by pinning too many inodes.
Async locking of new inodes
This could help improve create performance. The idea is that when a new inode is created, we associate it in memory with its parent directory. The lock request can then become asynchronous, so long as we never drop the parent directory lock before all the child locks have been created at the proper level. The tricky thing with this feature is that locking errors need to be handled well.
Modify truncate to use hole punching code
Truncate is just a special case of punching holes - it only works against the edge of the tree. Modify ocfs2_commit_truncate() to just make calls to ocfs2_remove_extent(). This will reduce the complexity of our truncate code and kill quite a few redundant lines in alloc.c
Track negative dentries
Currently, we're doing a pretty good job of avoiding a heavy revalidate for valid names via our dentry locking strategy. This doesn't track negative dentries however, which forces a revalidate when the file system sees one. An easy way to fix this is to attach a locking generation number to directory inodes, which gets incremented each time its cluster lock gets dropped. We could record the generation on a negative dentry the first time we see it and compare it to that on the directory. If the generations differ, a revalidate is forced and we record the current locking generation on the dentry. This will help performance on workloads which have a high miss rate in name lookups. Examples of such workloads are shell $PATH lookups, and program compilation (which often looks for header files in multiple places).
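The generation scheme above could be modeled like this. Structure and function names are hypothetical (not the actual dcache hooks), and locking is omitted for clarity:

```c
#include <assert.h>
#include <stdint.h>

struct dir_inode {
    uint64_t lock_gen;   /* bumped each time the cluster lock is dropped */
};

struct neg_dentry {
    uint64_t lock_gen;   /* directory generation when the miss was cached */
};

/* Called whenever the directory's cluster lock is dropped. */
static void dir_lock_dropped(struct dir_inode *dir)
{
    dir->lock_gen++;
}

/* Returns 1 if a full revalidate is needed for this negative dentry,
 * 0 if the cached miss is still trustworthy. */
static int neg_dentry_needs_revalidate(struct neg_dentry *nd,
                                       const struct dir_inode *dir)
{
    if (nd->lock_gen == dir->lock_gen)
        return 0;                   /* lock never dropped: miss still valid */
    nd->lock_gen = dir->lock_gen;   /* record the current generation */
    return 1;                       /* force one revalidate */
}
```

The point of the design is that a negative dentry stays cheap as long as no other node has had a chance to create the name, i.e. as long as the directory's cluster lock was never dropped.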
Freeze and Thaw
Partners have shown interest in OCFS2 supporting freeze and thaw capabilities, in order to freeze I/O to the device across the cluster. Vendors can use this during cloning.
Status: Awaiting feedback http://oss.oracle.com/bugzilla/show_bug.cgi?id=722
Reduce the size of memory structures
From the file system side, the two structures we'd most like to see go on a diet are struct ocfs2_inode_info and struct ocfs2_lock_res.
On the DLM side, the lvb member of struct dlm_lockstatus could be made into a pointer which is only optionally allocated. This would save us about 64 bytes per struct ocfs2_lock_res, except for the metadata lock, which is the only user of the LVB.
Enhance mlog (maybe) to throttle message spew.
Dynamic local alloc (Completed as of 2.6.28)
Local alloc sizes are fixed for a given combination of blocksize and clustersize. We can improve this by tracking allocations and growing the local alloc file as the number and size of allocations grows. The code should also be able to shrink the local alloc back on its next window move when allocation patterns calm down. This would help on workloads which are doing large allocations (such as installing software, creating large data files, etc).
Owner: Srini (Completed)
We need an e2image-like tool that allows one to dump the OCFS2 metadata into a sparse file, allowing us to look at it. It will be useful to analyze how the inodes get spread out over time, affecting stat performance.
Support for ->readpages
Patch is in ocfs2.git, will be pushed upstream for 2.6.25
Verify and improve unwritten extent merging (Completed as of 2.6.26-rc1)
The extent merging code down underneath "ocfs2_mark_extent_written()" could get some more optimizations. For example, it won't merge extents between leaf blocks.
This can easily be done on a single node. Create a very large file with unwritten extents, start writing to regions within the file and inspect the layout with debugfs.ocfs2. Verify that extents are being merged as they should be, and add code to make our merging even smarter. The goal is that you would be able to reserve a large unwritten extent and split it up with a bunch of writes, but still get the same exact extent layout back once all unwritten regions have been written to. ocfs2-test has some programs (fill_verify_holes, and reserve_space) which can help with this.
Merge meta and data locks (Completed 2.6.25)
The rationale for this is that when modifying data, we have to always take the metadata lock anyway. Due to ordered write mode, a journal flush writes inode data anyway, so there's little gain from separating data writeout under a different lock. There are also general lock resource reasons to do this. The extra lock results in:
- One more round of lock mastery, per inode.
- Additional memory usage.
Create locks at initially requested level (completed as of 2.6.24-rc3)
Today, if we have not yet created a cluster lock, ocfs2_cluster_lock() will first create it at NLMODE, and then convert the lock to either PRMODE or EXMODE (whichever is requested). A nice performance enhancement would be to change this so that dlmglue creates the lock at the requested level. Mostly one just needs to read through the flow of ocfs2_cluster_lock() and the generic ast functions. Once that is understood, changing things shouldn't be too hard.
Merge meta and generic lock unblock methods (Completed as of 2.6.19-rc1)
ocfs2_do_unblock_meta() has almost the exact same flow and corner cases of ocfs2_generic_downconvert_lock(), except that it also involves using an LVB. We should merge the two by adding lvb callbacks to ocfs2_generic_downconvert_lock() and having the meta data downconvert code use it.
Add sys_splice() support (Completed as of 2.6.20-rc1)
The 2.6.17 kernel added a new system call "splice" which can quickly move data between file descriptors in kernel. This requires a small amount of file system support, so the proper callbacks could be implemented in OCFS2. One can look at the code in fs/ocfs2/aio.c and fs/ocfs2/file.c related to file read/write to get an idea of how the locking can be handled.
Remove ocfs2_handle_add_inode() (Completed as of 2.6.20-rc1)
A small number of paths use this to simplify i_mutex locking of system files by handing it off to the journaling code to take and then drop when ocfs2_commit_trans() is called. It's actually mostly a relic of a very old design and really isn't required any more. Basically those sites can be replaced with the actual locking calls. This could easily be broken up into multiple patches, each one removing a specific call site and replacing it with the direct locking calls.
Remove ocfs2_handle_add_lock() (Completed as of 2.6.20-rc1)
Very similar to the ocfs2_handle_add_inode() case, but for cluster locks. This runs a bit deeper, actually, in that the handle gets passed to ocfs2_meta_lock() so that the inode can automatically be added to the handle on successful lock acquisition. Again, this only touches a few paths, and patches could be broken up if need be.
Remove ocfs2_journal_handle abstraction (Completed as of 2.6.20-rc1)
Once "Remove ocfs2_handle_add_inode()" and "Remove ocfs2_handle_add_lock()" are done, we can get rid of the ocfs2_journal_handle structure and pass around the native JBD type instead. This will aid readability and also give us a small speedup due to not having to alloc/free our own journal handle structure.
Update i_atime (Completed as of 2.6.20-rc1)
There are a couple of potential designs for this: always update atime, updates based on some sort of timer (gfs2), update only when we're going to write out the inode anyway (xfs). The file system should have full support for the "noatime" mount option.
So an approach needs to be decided upon, and the proper set of patches developed.
Local FS Mount Option (Completed as of 2.6.20-rc1; Tools r1271)
Owner: Sunil Mushran
Allow the user to supply a mount option which tells the file system not to try accessing the cluster stack. The idea is that someone who wants to use OCFS2 locally wouldn't have to prepare an /etc/ocfs2/cluster.conf file. This will require some modifications to mount.ocfs2, so it would be a good way to learn a little bit about the userspace side of OCFS2 mounting. For the kernel paths, we could just extend the existing "hard readonly" code to allow loading a journal and making file system changes.
Clean up endianness annotations (Completed as of 2.6.22-rc1)
Owner: Mark Fasheh
OCFS2 is fully endian neutral, but there are a few debug prints which are missing some le*_to_cpu() magic. This is harmless, but considered bad style. Also, a user on big endian architectures would get strange output from some of those prints. So we have multiple good reasons to clean this up. Download and run the sparse tool against the OCFS2 source.
For help running sparse: http://lwn.net/Articles/205624/
I/O priorities for the heartbeat thread
Owner: Zhen Wei
This will allow the ocfs2 heartbeat thread to prioritize I/O which may help cut down on spurious fencing. Most of this will be in the tools - we can have a pid configfs attribute and let userspace (ocfs2_hb_ctl) exec the "ionice" program after starting heartbeat.
Configurable Network Timeouts/Delays
Currently, a number of network timeouts and delays are hard coded to default values. Adding configfs fields for idle timeout, keepalive delay, and reconnect delay allows the administrator some flexibility in the network behavior of o2net. (Patch)