

INODE STEALING

Owner: TaoMa

Introduction

In OCFS2, we allocate inodes from the slot-specific inode_alloc to avoid contention when creating inodes. The inode_alloc file grows in large contiguous chunks: with a 4K block size, it grows by 4M each time, so 1024 inodes are allocated from the global_bitmap at a time.
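The 1024 figure follows from dividing the growth chunk by the block size. A minimal sketch of that arithmetic, assuming (as in OCFS2) that each inode occupies one full filesystem block; the helper name `inodes_per_chunk` is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of inodes obtained each time a slot's
 * inode_alloc grows by chunk_bytes, assuming one inode per block. */
static uint64_t inodes_per_chunk(uint64_t chunk_bytes, uint32_t block_size)
{
    assert(block_size > 0);
    return chunk_bytes / block_size;
}
```

Calling `inodes_per_chunk(4ULL * 1024 * 1024, 4096)` reproduces the 1024 inodes per extension mentioned above.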

Over time, if the fs becomes fragmented enough (e.g., the user has created many small files and then deleted some of them), we can end up in a situation where we cannot extend the inode_alloc because the global_bitmap has no large free chunk, even though df shows a few gigs free. More annoying, this situation will invariably mean that inode creation fails on one node but succeeds on another. Still more annoying, an unused slot may have space for plenty of inodes, but that space is unusable because the user may no longer mount as many nodes.
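The gap between "df shows free space" and "cannot extend inode_alloc" comes from contiguity: df counts all free clusters, while an extension needs one contiguous run. The toy model below illustrates this; the bool-array bitmap and the helper names are hypothetical simplifications, not OCFS2's on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy global bitmap: bits[i] == true means cluster i is in use. */

/* Total free clusters, roughly what df reports as free space. */
static size_t count_free(const bool *bits, size_t n)
{
    size_t free_count = 0;
    for (size_t i = 0; i < n; i++)
        if (!bits[i])
            free_count++;
    return free_count;
}

/* Longest run of contiguous free clusters: the largest chunk an
 * inode_alloc extension could obtain from this bitmap. */
static size_t longest_free_run(const bool *bits, size_t n)
{
    size_t best = 0, cur = 0;
    for (size_t i = 0; i < n; i++) {
        if (!bits[i]) {
            cur++;
            if (cur > best)
                best = cur;
        } else {
            cur = 0;
        }
    }
    return best;
}

/* Can inode_alloc grow by a contiguous chunk of `need` clusters? */
static bool can_extend(const bool *bits, size_t n, size_t need)
{
    return longest_free_run(bits, n) >= need;
}
```

With every other cluster in use, half the volume is free, yet no run longer than one cluster exists, so any multi-cluster extension fails.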

One solution is an offline defrag (http://oss.oracle.com/osswiki/OCFS2(2f)DesignDocs(2f)defragmentation.html). While that is workable, it is not reasonable to expect users to be happy to umount the volume on all nodes, run defrag, and then mount it again. Our fix needs to be online and, preferably, transparent.

Solution

Our solution is to steal inodes from another slot. If we are in this condition, we are close to ENOSPC, so a slower alloc is better than no alloc. In this case, we begin with the last slot, which is normally the least frequently used, and try to allocate from it. If that fails, we move to the previous slot and try again, eventually reaching slot 0. If there is still no space available, we return ENOSPC. So the normal inode alloc process first tries the node's own slot (extending its inode_alloc from the global_bitmap if possible), then steals from the other slots, from the last one down to slot 0, and only then returns ENOSPC.
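The allocation order described above can be sketched as a small simulation. This is not the OCFS2 code: `free_inodes[]`, `can_grow`, `CHUNK_INODES` and `reserve_new_inode` are hypothetical stand-ins, and the sketch assumes that stealing only takes inodes already free in another slot rather than extending that slot.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: inodes gained per extension (4M chunk / 4K blocks). */
#define CHUNK_INODES 1024

/* free_inodes[s]: free inodes already in slot s's inode_alloc (toy model).
 * can_grow: whether the global_bitmap still has a large enough contiguous
 * chunk to extend our own inode_alloc.
 * Returns the slot an inode was reserved from, or -ENOSPC. */
static int reserve_new_inode(int *free_inodes, int nslots, int own_slot,
                             bool can_grow)
{
    assert(own_slot >= 0 && own_slot < nslots);

    /* 1. Normal path: our own slot, extending it if it is empty. */
    if (free_inodes[own_slot] == 0 && can_grow)
        free_inodes[own_slot] += CHUNK_INODES;
    if (free_inodes[own_slot] > 0) {
        free_inodes[own_slot]--;
        return own_slot;
    }

    /* 2. Steal: start at the last slot (usually least used) and walk
     * down to slot 0, skipping our own slot, which already failed. */
    for (int s = nslots - 1; s >= 0; s--) {
        if (s == own_slot)
            continue;
        if (free_inodes[s] > 0) {
            free_inodes[s]--;
            return s;
        }
    }

    /* 3. Nothing left anywhere. */
    return -ENOSPC;
}
```

Walking from the highest slot downward favours slots that are least likely to be mounted, which keeps stolen inodes away from busy nodes.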

Implementation

The source code change is small: it only requires modifying ocfs2_reserve_new_inode and ocfs2_reserve_suballoc_bits.


2011-12-23 01:01