OCFS2 Inode Allocation Strategy
Mark Fasheh <mfasheh at suse dot com>
Initial revision based on a conversation with Jan Kara <jack at suse dot cz>
April 16, 2008
This document describes the result of conversations various Ocfs2 developers and testers have had regarding inode allocation.
Inode Alloc Problem Areas
During a recent conversation, Jan Kara and I identified two interesting topics related to inode allocation.
New inodes being spread out
The core problem here is that the inode block search looks for the "emptiest" inode group to allocate from. So if an inode alloc file has many equally (or almost equally) empty groups, new inodes will tend to get spread out amongst them, which in turn can put them all over the disk. This is undesirable because directory operations on conceptually "nearby" inodes force a large number of seeks.
The easiset way to see this is to start with a fresh file system and create a large number of inodes in it - enough to force the allocation of a few inode groups. At this point, things are actually reasonably distributed because the allocator started in an empty state. If the inodes are removed and then recreated however, they'll get spread out amongst the existing inode groups as the allocator cycles through them since each group keeps geting slightly less "empty" than the next.
A fix is to change inode group selection criteria from "emptiest" to something that makes more sense for inode allocation.
Most file systems try to group inodes with their parent directory. As described by Jan, ext3 takes the following steps:
<jack> ext3 code does: <jack> start looking at parent's group, take the first group which satisfies these conditions: <jack> it has at least 'min_blocks' free, it has at least 'min_inodes' free, it doesn't have too many directories. <jack> where 'too many directories' mean - number_of_directories_on_fs / number_of_groups + inodes_per_group / 16 <jack> And then there's fallback if no group satisfies above requirements. <jack> In the group, we just take the first free inode.
Unfortunately, things are a bit trickier for Ocfs2 as it's access to allocators is transient in nature. So building up global inode / directory counts would require cluster messaging.
Also, Ocfs2 allocators grow on demand so there are times when the allocator will have very little free space for optimization. We could, at a future date though change the group add heuristics to pre-emptively allocate extra groups to allow for better inode placement.
That said, Ocfs2 can try a simplified version of the standard scheme which should produce some favorable results:
In core directory inodes are given some extra fields, which records the last used allocation group and allocation slot.
When allocating an inode:
- If last_used_group is set and we use the same allocation slot, search that group first. If no free block is found, fall back to less optimal search, and set last_used_group and last_used_slot to the result. 
- If last_used_group is not set and the parent directory is in our allocator, start looking for free inodes in the chunk of the parent directory. Again, fall back to less optimal search if no free inodes are found. Always set last_used_group to the result.
One thing to watch out for is that we'll lose information when the fs is unmounted, or when the inode is pruned from memory. If this is proven to be a real problem, we could always adding a disk field to record this information.
Another consideration is that we might want to try locking the parent directories allocator even if it's in another slot. This has the downside of incurring cluster locking overhead, but the upside is that placement won't be thwarted simply by the directory having been created on another node.I don't think this is good for us. With the above "last_used_group" solution, only the 1st inode may be discontiguous with the parent, and from then on, we can use the saved group and all the following inodes will be contiguous and it is OK enough for us.
A last thought - if the parent group doesn't work for any reason and we don't have a last_used_group set, we could also try allocating from the same group as an inode whose directory data is adjacent to the one being inserted.This idea may be good. But I think we could defer it when index dir is committed to the kernel.
Inode Group Allocation
This topic requires much more exploration before we do anything concrete.
The idea here is that we might want to consider more explicit placement of inode groups. Right now they're allocated from the local alloc file, which may or may not be a bad thing.
For instance, keeping inode groups close to each other might help readdir+stat, which we're currently weak on. The downside of course is that data for those inodes might be on a different region of the disk [question - how well do we do in this regard today?]. An offset to that downside could be inline-data which wouldn't require seeks anyway for small files.
To allocate inode groups so that they are closer to each other:
- Inode group allocs use the global bitmap directly
- We try to allocate in same cluster group as last time, by remembering which one was last allocated from
- To pick initial cluster group, try for the emptiest one that the local alloc is not using
Of course, we might decide that it's better to go the other way - explicitly spread out cluster groups. Other algorithms could be designed around that concept.
