OCFS2/DesignDocs/InodeStealing

INODE STEALING

Introduction

In OCFS2, we allocate inodes from a slot-specific inode_alloc to avoid congestion during inode creation. This alloc file grows in large contiguous chunks. With a 4K block size, it grows by 4M each time, so 1024 inodes are allocated from the global_bitmap at a time.
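
The 1024 figure follows directly from the two sizes quoted above. A quick check, assuming (as the numbers imply) that each inode occupies one filesystem block; the constant names are illustrative only:

```python
# Assumption: one inode occupies exactly one block, as 4M / 4K = 1024 implies.
BLOCK_SIZE = 4 * 1024            # 4K block size
GROWTH_BYTES = 4 * 1024 * 1024   # inode_alloc grows by 4M at a time

inodes_per_growth = GROWTH_BYTES // BLOCK_SIZE
print(inodes_per_growth)  # 1024
```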

Over time, if the fs gets fragmented enough (e.g., the user has created many small files and deleted some of them), we can end up in a situation where we cannot extend the inode_alloc because there is no large free chunk in the global_bitmap, even though df shows a few gigs free. More annoying, this usually means that one cannot create inodes on one node but can on another. Still more annoying, an unused slot may have space for plenty of inodes, but that space is unusable because the user no longer mounts the volume on that many nodes.

One solution is an offline defrag (http://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)defragmentation.html). While that is workable, it is not reasonable to expect users to be happy to umount the volume on all nodes, run defrag, and then mount it again. Our fix needs to be online and, preferably, transparent.

Solution

One solution is to steal inodes from another slot. If we are in this condition, we are close to ENOSPC, so a slower alloc is better than no alloc. In this case, we begin from the last slot, which is normally the least frequently used, and try to allocate from it. If that fails, we go to the previous slot and try again. Eventually we reach slot 0. If there is still no space available, we return ENOSPC. So the inode alloc process looks like this:

Allocate from our own inode_alloc:000X (the mechanism we currently use).
- If we can reserve, OK.
- If that fails, try to allocate a large chunk from global_bitmap and reserve once again.
If that fails, try to allocate from the last slot's inode_alloc.
- Just try to reserve; we don't go for global_bitmap if this inode_alloc can't allocate the inode either.
If that fails, try the slot before it, and so on until we reach inode_alloc:0000. In the process, we skip our own inode_alloc.
If that fails, try to allocate from our own inode_alloc:000X once again. (There is a chance that the global_bitmap has gained a large enough chunk during the iteration.)
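
The order of attempts above can be written down as a small Python sketch. This is only a model of the description, not kernel code: attempt_sequence and the (slot, may_use_global_bitmap) tuples are hypothetical names introduced for illustration.

```python
def attempt_sequence(own_slot, max_slots):
    """List the (slot, may_use_global_bitmap) attempts in the order described above."""
    # First, our own inode_alloc; it may be extended from global_bitmap.
    attempts = [(own_slot, True)]
    # Then every other slot, from the last one down to slot 0.
    for slot in range(max_slots - 1, -1, -1):
        if slot != own_slot:
            # Reserve only; another slot's inode_alloc is never extended.
            attempts.append((slot, False))
    # Finally, retry our own inode_alloc in case global_bitmap has space now.
    attempts.append((own_slot, True))
    return attempts


print(attempt_sequence(1, 4))
# [(1, True), (3, False), (2, False), (0, False), (1, True)]
```

For a node in slot 1 of a four-slot volume, the sequence is own slot, then slots 3, 2 and 0, then own slot again.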

Implementation

The source code change is very limited: just some modifications in ocfs2_reserve_new_inode and ocfs2_reserve_suballoc_bits, plus an update in ocfs2_mknod_locked.

ocfs2_reserve_new_inode:
- When ocfs2_reserve_suballoc_bits fails with -ENOSPC, iterate over all the other slots' inode_allocs to reserve an inode. The iteration algorithm has been discussed above.
- Try ocfs2_reserve_suballoc_bits once again in the end.
ocfs2_reserve_suballoc_bits:
- Add a parameter to indicate whether we need to go for global_bitmap. For other slots' inode_allocs, we don't do that.
ocfs2_mknod_locked:
- Since we can now allocate from other slots, ocfs2_dinode.i_suballoc_slot has to be set to the slot the inode was actually allocated from.
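
To make the three changes concrete, here is a toy Python model of how they fit together. It is a sketch under stated assumptions, not the kernel implementation: the function names only echo the kernel ones, and the InodeAlloc and GlobalBitmap classes, the use_global_bitmap parameter name, and the dictionary standing in for the dinode are all hypothetical.

```python
import errno

INODES_PER_CHUNK = 1024  # 4M growth / 4K block size, as in the introduction


class InodeAlloc:
    """Toy stand-in for one slot's inode_alloc file."""
    def __init__(self, slot, free_inodes=0):
        self.slot = slot
        self.free_inodes = free_inodes


class GlobalBitmap:
    """Toy stand-in for global_bitmap: counts free chunks large enough
    to extend an inode_alloc."""
    def __init__(self, large_chunks=0):
        self.large_chunks = large_chunks


def reserve_suballoc_bits(alloc, bitmap, use_global_bitmap):
    """Reserve one inode. Returns 0 on success, -ENOSPC otherwise.
    The alloc is extended only when use_global_bitmap is set."""
    if alloc.free_inodes == 0 and use_global_bitmap and bitmap.large_chunks > 0:
        bitmap.large_chunks -= 1
        alloc.free_inodes += INODES_PER_CHUNK
    if alloc.free_inodes == 0:
        return -errno.ENOSPC
    alloc.free_inodes -= 1
    return 0


def reserve_new_inode(allocs, bitmap, own_slot):
    """Returns (status, alloc), where alloc is the inode_alloc that was used."""
    own = allocs[own_slot]
    if reserve_suballoc_bits(own, bitmap, True) == 0:
        return 0, own
    # Steal: walk the other slots from last to first, reserve only.
    for slot in range(len(allocs) - 1, -1, -1):
        if slot == own_slot:
            continue
        if reserve_suballoc_bits(allocs[slot], bitmap, False) == 0:
            return 0, allocs[slot]
    # Final retry. In this single-threaded model nothing changes in between,
    # so the retry only matters on a real cluster where space may be freed.
    if reserve_suballoc_bits(own, bitmap, True) == 0:
        return 0, own
    return -errno.ENOSPC, None


def mknod_locked(alloc):
    """Build a toy dinode that records which slot's inode_alloc it came from."""
    return {"i_suballoc_slot": alloc.slot}


# Slot 0 is out of inodes and global_bitmap has no large chunk; slot 1 has one inode.
allocs = [InodeAlloc(0), InodeAlloc(1, free_inodes=1), InodeAlloc(2)]
bitmap = GlobalBitmap(large_chunks=0)
status, alloc = reserve_new_inode(allocs, bitmap, own_slot=0)
print(status, mknod_locked(alloc))  # 0 {'i_suballoc_slot': 1}
```

Recording the source slot in the dinode matters because the inode must later be freed back to the inode_alloc it was taken from, which is no longer guaranteed to be the creating node's own.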