OCFS2 Support for Data in Inode Blocks
Mark Fasheh
September 6, 2007
Goals / Requirements
We want to take advantage of Ocfs2's very large inodes by storing file and directory data inside of the same block as other inode meta data. This should improve performance of small files and directories with a small number of names in them. It will also reduce the disk usage of small files.
The implementation must also be flexible enough to allow for future sharing of unused inode space, most likely in the form of inline extended attributes.
Disk Structures
The superblock gets a new incompat bit:
/* Support for data packed into inode blocks */
#define OCFS2_FEATURE_INCOMPAT_INLINE_DATA	0x0040
The ocfs2_dinode gets a dynamic features field and a flag for data-in-inode:
/*
 * Flags on ocfs2_dinode.i_dyn_features
 *
 * These can change much more often than i_flags. When adding flags,
 * keep in mind that i_dyn_features is only 16 bits wide.
 */
#define OCFS2_INLINE_DATA_FL	(0x0001)	/* Data stored in inode block */
#define OCFS2_HAS_XATTR_FL	(0x0002)
#define OCFS2_INLINE_XATTR_FL	(0x0004)
#define OCFS2_INDEXED_DIR_FL	(0x0008)

struct ocfs2_dinode {
	...
	__le16 i_dyn_features;
	...
};
And the meta data lvb gets a mirror of that field:
struct ocfs2_meta_lvb {
	__u8	lvb_version;
	__u8	lvb_reserved0;
	__be16	lvb_idynfeatures;
	...
};
A new structure, ocfs2_inline_data, is added for storing inline data:
/*
 * Data-in-inode header. This is only used if i_dyn_features has
 * OCFS2_INLINE_DATA_FL set.
 */
struct ocfs2_inline_data {
/*00*/	__le16	id_count;	/* Number of bytes that can be used
				 * for data, starting at id_data */
	__le16	id_reserved0;
	__le32	id_reserved1;
	__u8	id_data[0];	/* Start of user data */
};
And the ocfs2_dinode gets struct ocfs2_inline_data embedded in the id2 union:
struct ocfs2_dinode {
	...
/*C0*/	union {
		struct ocfs2_super_block	i_super;
		struct ocfs2_local_alloc	i_lab;
		struct ocfs2_chain_list		i_chain;
		struct ocfs2_extent_list	i_list;
		struct ocfs2_truncate_log	i_dealloc;
		struct ocfs2_inline_data	i_data;
		__u8				i_symlink[0];
	} id2;
/* Actual on-disk size is one block */
};
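As a quick illustration of how the flag selects the interpretation of the id2 union, here is a minimal user-space sketch; the helper name and the little-endian stub are inventions for the example, not Ocfs2 code.

#include <stdint.h>

#define OCFS2_INLINE_DATA_FL	(0x0001)

/* Stand-in for the kernel's le16_to_cpu(); assumes a little-endian host
 * so the sketch stays self-contained. */
static inline uint16_t le16_to_cpu_stub(uint16_t v)
{
	return v;
}

/* If the flag is set, id2 holds struct ocfs2_inline_data; otherwise it
 * holds the usual struct ocfs2_extent_list. */
int inode_has_inline_data(uint16_t i_dyn_features)
{
	return (le16_to_cpu_stub(i_dyn_features) & OCFS2_INLINE_DATA_FL) != 0;
}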
Managing Dynamic Inode Features
A flag on ocfs2_dinode is needed so that the code can determine when to use the inline data structure, as opposed to the extent list. In the future, it is anticipated that similar flags will have to be added for things like extended attributes, inline extended attributes, directory indexing and so on. The LVB structure will have to be capable of passing these flags around and they will need to be set and cleared automatically as meta data locks are taken.
ocfs2_dinode has only one flags field today, i_flags. It is typically used for flags that rarely change. Most, in fact, are only ever set by programs like mkfs.ocfs2 or tunefs.ocfs2. Examples of such flags are OCFS2_SYSTEM_FL, OCFS2_BITMAP_FL, and so on. The only ones manipulated by the file system, OCFS2_VALID_FL and OCFS2_ORPHANED_FL, are changed only at very specific, well-defined moments in an inode's lifetime.
i_flags is never set or cleared from any LVB code or any generic update code (ocfs2_mark_inode_dirty()). In order to support a data-in-inode flag there, we'd have to carefully mask out the existing i_flags bits. Additionally, the LVB would be required to carry an additional 32 bits of information.
Instead of using i_flags, we create a new field, i_dyn_features. This way the code for manipulating the flags will be cleaner, and less likely to unintentionally corrupt a critical inode field. Since it would only be used for dynamic features, we can just use a 16 bit field. In the future, i_dyn_features can hold information relating to extended attributes.
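To make the round trip concrete, here is a minimal user-space sketch of mirroring i_dyn_features through the big-endian lvb_idynfeatures field; the toy structures, the ip_dyn_features in-memory mirror, and the helper names are assumptions for illustration, not the actual Ocfs2 lock-value-block code.

#include <stdint.h>
#include <arpa/inet.h>	/* htons()/ntohs() for the big-endian LVB field */

/* User-space stand-in for struct ocfs2_meta_lvb above. */
struct toy_meta_lvb {
	uint8_t  lvb_version;
	uint8_t  lvb_reserved0;
	uint16_t lvb_idynfeatures;	/* stored big-endian, like __be16 above */
	/* remaining LVB fields elided */
};

/* Hypothetical in-memory mirror of the on-disk i_dyn_features field. */
struct toy_inode_info {
	uint16_t ip_dyn_features;
};

/* Publish our dynamic features into the LVB before the meta data lock is
 * dropped, so other nodes see flag changes without re-reading the inode. */
void stuff_dyn_features(struct toy_meta_lvb *lvb,
			const struct toy_inode_info *oi)
{
	lvb->lvb_idynfeatures = htons(oi->ip_dyn_features);
}

/* Refresh our copy from the LVB when the meta data lock is re-acquired. */
void refresh_dyn_features(struct toy_inode_info *oi,
			  const struct toy_meta_lvb *lvb)
{
	oi->ip_dyn_features = ntohs(lvb->lvb_idynfeatures);
}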
The ocfs2_inline_data Structure
struct ocfs2_inline_data is embedded in the disk inode and is only used if OCFS2_INLINE_DATA_FL is set. Conversely, putting data inside an inode block requires that OCFS2_INLINE_DATA_FL be set and that struct ocfs2_inline_data be initialized.
Today, there are two fields in struct ocfs2_inline_data.
Field    | Description
id_count | Describes the number of bytes which can be held inside of the inode.
id_data  | Marks the beginning of the inode data. It is exactly id_count bytes in size.
In the future, id_count may be manipulated as extended attributes are stored/removed from the inode.
i_size is used to determine the end of the user's data, starting at the beginning of id_data. All bytes in id_data beyond i_size but before id_count must remain zero'd.
i_blocks (memory only) and i_clusters for an inode with inline data are always zero.
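For a feel of the numbers, the sketch below computes the inline capacity of a freshly formatted inode, under the assumption that id_count is initialized to everything in the block past the inline-data header (and that no inline extended attributes are stealing space). The 0xC0 offset comes from the /*C0*/ marker on the id2 union above; under that assumption a 4k-block file system leaves 3896 bytes for inline data.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Mirrors the inline-data header defined earlier in this document. */
struct ocfs2_inline_data {
	uint16_t id_count;	/* usable bytes, starting at id_data */
	uint16_t id_reserved0;
	uint32_t id_reserved1;
	uint8_t  id_data[];	/* start of user data */
};

/* Byte offset of the id2 union within the inode block (the C0 marker). */
#define OCFS2_DINODE_ID2_OFFSET	0xC0

/* Assumed initialization rule: everything in the block past id_data is
 * available for user data. */
static uint16_t max_inline_data(uint32_t blocksize)
{
	return blocksize - (OCFS2_DINODE_ID2_OFFSET +
			    offsetof(struct ocfs2_inline_data, id_data));
}

int main(void)
{
	printf("512b blocks: %u inline bytes\n", (unsigned)max_inline_data(512));
	printf("4k blocks:   %u inline bytes\n", (unsigned)max_inline_data(4096));
	return 0;
}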
High Level Strategy
Essentially, what we have to do in order to make this work seamlessly is fool applications (and in some cases, the kernel) into thinking that inline data is just like any other data. Data is "pushed" back out into an extent when the file system gets a request which is difficult or impossible to service with inline data. Directory and file data have slightly different requirements, and so are described in separate subsections.
File Inodes
For file inodes, we mirror data to/from the disk into unmapped pages via special inline-data handlers which can be called from Ocfs2 VFS callbacks such as ->readpage() or ->aio_write().
When we finally need to turn the inode into an extent-list-based one, the page is mapped to disk, and adjacent pages within the cluster need to be zero'd and similarly mapped. data=ordered mode is respected during this process.
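As a rough illustration of the read-side mirroring (not the actual Ocfs2 handler), the sketch below fills a page-sized buffer from id_data and zeroes the tail past i_size; the function name and the PAGE_SIZE_ASSUMED constant are inventions for the example.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_ASSUMED	4096	/* illustrative stand-in for the kernel page size */

/* page:    destination page-cache page (already mapped, in kernel terms)
 * id_data: the inline data region of the inode block
 * i_size:  current file size; inline data never exceeds one block */
void fill_page_from_inline(uint8_t *page, const uint8_t *id_data,
			   uint64_t i_size)
{
	size_t count = i_size < PAGE_SIZE_ASSUMED ?
			(size_t)i_size : PAGE_SIZE_ASSUMED;

	/* Mirror the user-visible bytes out of the inode block... */
	memcpy(page, id_data, count);
	/* ...and zero the rest, so the page looks just like one backed by
	 * a short extent-based file. */
	memset(page + count, 0, PAGE_SIZE_ASSUMED - count);
}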
Strategy for specific file system operations is outlined in the table below.
File System Operation | Strategy for Existing Inline Data | New Strategy for Extent-based Inodes (if applicable)
Buffered Read (->readpage(), etc.) | Copy from disk into the page. | N/A
Buffered Write | If the write will still fit within id_count, mirror from the page onto disk. Otherwise, push out to extents. | If i_clusters and i_size are zero, and the resulting write would fit in the inode, turn it into an inline-data inode.
O_DIRECT Read/Write | Fall back to buffered I/O. | N/A
MMap Read | Same as buffered read - we get this via ->readpage(). | N/A
MMap Write | Push out to an extent list on all mmap writes. | N/A
Extend (ftruncate(), etc.) | If the new size is inside of id_count, just change i_size. | N/A
Truncate/Punching Holes (ftruncate(), OCFS2_IOC_UNRESVSP*, etc.) | Zero the area requested to be removed. Update i_size if the call is from ftruncate(). | N/A
Space Reservation (OCFS2_IOC_RESVSP*) | Push out to extents if the request is past id_count. Otherwise, nothing to do. | N/A
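The Buffered Write row boils down to two checks. Here is a hedged sketch of those decisions; the helper names are illustrative and do not correspond to actual Ocfs2 functions.

#include <stdint.h>

/* A write can be serviced inline if it ends within the id_count bytes
 * available at id_data. */
int write_fits_inline(uint64_t pos, uint64_t count, uint16_t id_count)
{
	return pos + count <= id_count;
}

/* Per the table: an extent-list inode with no allocated clusters and a
 * zero i_size may be (re)formatted as an inline-data inode when the
 * incoming write fits. */
int should_convert_to_inline(uint32_t i_clusters, uint64_t i_size,
			     uint64_t pos, uint64_t count, uint16_t id_count)
{
	return i_clusters == 0 && i_size == 0 &&
	       write_fits_inline(pos, count, id_count);
}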
Directory Inodes
Directories are one area where I expect to see a significant speedup with inline data - a file system with 4k blocks can hold many directory entries in a single inode block.
On a high level, directory inodes are simpler than file inodes - there's no mirroring that needs to happen inside of a page cache page so pushing out to extents is trivial. Locking is straightforward - just about all operations are protected via i_mutex.
Looking closer at the code, however, it seems that many low-level directory operations are open coded, with some high-level functions (ocfs2_rename(), for example) understanding too much about the directory format. And all of those places assume that the dirents start at block offset zero and continue for blocksize bytes. The answer is to abstract things further. Any work done now in that area will help us in the future when we begin looking at indexed directories.
Since directories are always created with data (the "." and ".." entries), they will always start out with OCFS2_INLINE_DATA_FL set. This winds up saving us a bitmap lock, since no data allocation is required for directory creation.
The code for expanding a directory from inline-data is structured so that the initial expansion also guarantees room for the dirent to be inserted. This is because the dirent to be inserted might be larger than the space freed up by expanding to just one block; in that case, we'll want to expand to two blocks. Doing both blocks in one pass means that we can rely on the allocator's ability to find adjacent clusters.
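A minimal sketch of that sizing rule, assuming we can cheaply measure how many bytes of dirents currently sit in id_data; the function and parameter names are illustrative.

/* bytes_used:  bytes of dirents currently packed into id_data
 * new_rec_len: on-disk record length of the dirent about to be inserted
 * blocksize:   file system block size */
unsigned int dir_expand_blocks_needed(unsigned int bytes_used,
				      unsigned int new_rec_len,
				      unsigned int blocksize)
{
	/* After expansion the existing dirents occupy the front of the
	 * first block. If the pending entry does not fit in what remains
	 * of that block, grow by a second block in the same pass so the
	 * allocator has a chance to hand back adjacent clusters. */
	if (bytes_used + new_rec_len <= blocksize)
		return 1;
	return 2;
}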
A table of top level functions which access directories will help to keep things in perspective.
Function Name | Access Type
ocfs2_readdir | Readdir
ocfs2_queue_orphans | Readdir
_ocfs2_get_system_file_inode | Lookup
ocfs2_lookup | Lookup
ocfs2_get_parent | Lookup
ocfs2_unlink | Lookup, Remove, Orphan Insert
ocfs2_rename | Lookup, Remove, Insert, Orphan Insert
ocfs2_mknod | Lookup, Insert, Create
ocfs2_symlink | Lookup, Insert
ocfs2_link | Lookup, Insert
ocfs2_delete_inode | Orphan Remove
ocfs2_orphan_add | Insert
ocfs2_orphan_del | Remove
Open Questions
Do we create new files with OCFS2_INLINE_DATA_FL?
Note that I literally mean "files" here - see above for why directories always start with inline data.
There are a couple of ways we can handle this. One thing I realized early on is that we don't actually have 100% control over state transitions - that is, file write will always want to be able to turn an empty extent list into inline data. We can get into that situation from many paths. For example, tunefs.ocfs2 could set OCFS2_FEATURE_INCOMPAT_INLINE_DATA on a file system which already has empty files. Also, on any file system, the user could truncate an inode to zero (the most common form of truncate), and we'd certainly want to turn that into inline data on an appropriate write.
Right now, new inodes are still created with an empty extent list. Write will do the right thing when it sees them, and the performance cost of re-formatting that part of the inode is small. This has the advantage that the code which turns an inode into inline-data from write gets tested more often. Also, it's not uncommon for the first write to a file to be larger than what can fit inline.
Can we "defragment" an inline directory on the fly?
This would ensure that we always have optimal packing of dirents, thus preventing premature extent list expansion. The actual defrag code could be written trivially. The problem is that dirents would be moving around, which might mean that a concurrent readdir() could get duplicate entries. Maybe a scheme where we only defrag when there are no concurrent directory readers would work.
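For illustration only, here is a sketch of such a compaction pass over the inline area. The dirent layout is a simplified ext2-style stand-in (not the exact on-disk struct ocfs2_dir_entry), free slots are assumed to carry inode == 0, and the alignment rules real dirents require are ignored.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct fake_dirent {
	uint64_t inode;		/* 0 means the slot is unused */
	uint16_t rec_len;	/* bytes from this entry to the next one */
	uint8_t  name_len;
	uint8_t  file_type;
	char     name[];	/* name_len bytes, not NUL terminated */
};

/* Slide live dirents toward the front of id_data and let the last live
 * entry's rec_len absorb the reclaimed tail. Returns bytes now in use. */
uint16_t compact_inline_dir(uint8_t *id_data, uint16_t id_count)
{
	uint16_t in = 0, out = 0;
	struct fake_dirent *last = NULL;

	while (in < id_count) {
		struct fake_dirent *de = (struct fake_dirent *)(id_data + in);
		uint16_t rec_len = de->rec_len;

		if (rec_len == 0)	/* corrupt directory; real code would error out */
			break;
		if (de->inode != 0) {
			uint16_t used = offsetof(struct fake_dirent, name) +
					de->name_len;

			memmove(id_data + out, de, used);
			last = (struct fake_dirent *)(id_data + out);
			last->rec_len = used;	/* tight pack; real dirents keep alignment */
			out += used;
		}
		in += rec_len;
	}
	if (last)	/* the final live entry owns all of the reclaimed tail */
		last->rec_len = id_count - (uint16_t)((uint8_t *)last - id_data);
	return out;
}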
Locking
As a small refresher on how locks nest in Ocfs2:
i_mutex -> ocfs2_meta_lock() -> ip_alloc_sem -> ocfs2_data_lock()
Generally, locking stays the same. Initially, I thought we could avoid the data lock, but we still want to use it for forcing page invalidation on other nodes. The locking is only really interesting when we're worried about pushing out to extents (or turning an empty extent list into inline data).
As usual, mmap makes things tricky. It can't take i_mutex, so most real work has to be done holding ip_alloc_sem.
Here are some rules which mostly apply to files, as directory locking is much less complicated.
- i_mutex is sufficient for spot-checking inline-data state. The inode will never have inline data added while you hold i_mutex; however, it might be pushed out to extents if ocfs2_page_mkwrite() beats you to ip_alloc_sem.
- To create an inline data section, or to go from inline data to extents, the process must hold a write lock on ip_alloc_sem. This is the only node-local protection that mmap can count on.
- Reading inline data requires only the cluster locks and a read lock on ip_alloc_sem.
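To make the ordering concrete, here is a toy user-space model of the nesting for the interesting case, pushing inline data out to extents. The pthread locks merely stand in for i_mutex, the cluster locks, and ip_alloc_sem; only the acquisition order is meant to mirror the rules above.

#include <pthread.h>

/* Stand-ins for the real locks; only the acquisition order matters here:
 * i_mutex -> ocfs2_meta_lock() -> ip_alloc_sem -> ocfs2_data_lock(). */
struct toy_inode {
	pthread_mutex_t  i_mutex;	/* VFS inode mutex */
	pthread_mutex_t  meta_lock;	/* cluster meta data lock */
	pthread_rwlock_t ip_alloc_sem;	/* node-local allocation semaphore */
	pthread_mutex_t  data_lock;	/* cluster data lock */
};

/* Toy version of pushing an inline-data inode out to extents. The write
 * lock on ip_alloc_sem is what mmap relies on, since ocfs2_page_mkwrite()
 * cannot take i_mutex. Locks are assumed to be initialized elsewhere. */
void push_out_to_extents(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->i_mutex);		/* outermost */
	pthread_mutex_lock(&inode->meta_lock);		/* then the cluster meta data lock */
	pthread_rwlock_wrlock(&inode->ip_alloc_sem);	/* write lock: what mmap counts on */
	pthread_mutex_lock(&inode->data_lock);		/* innermost: cluster data lock */

	/* ... map the page to disk, zero adjacent pages within the
	 * cluster, clear OCFS2_INLINE_DATA_FL, write back the inode ... */

	pthread_mutex_unlock(&inode->data_lock);
	pthread_rwlock_unlock(&inode->ip_alloc_sem);
	pthread_mutex_unlock(&inode->meta_lock);
	pthread_mutex_unlock(&inode->i_mutex);
}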
ChangeLog
- Sept. 20, 2007: Completed description, based on completed patches - Mark Fasheh
- Sept 6, 2007: Begin complete re-write based on my prototype - Mark Fasheh
- Apr 12, 2007: Sprinkled some comments - Sunil Mushran
- Dec 4, 2006: Andrew Beekhof was also doing the same work, so I quit
- Dec 3, 2006: Add mount option, dlm LVB structure
- Nov 30, 2006: Initial edition