OCFS2/DesignDocs/IndexedDirectories

Ocfs2 Indexed Directories Design Doc

DRAFT

Mark Fasheh <mfasheh at suse dot com>

August 7, 2008

Introduction

This document describes the type of changes required so that Ocfs2 can support directory indexing. We want this so that the file system can support fast lookup of directory entries. At the same time, we want to:

Maintain POSIX compliance of directory operations. Typically this means that a readdir(2) operation can't see the same entry twice (so inserts/unlinks aren't allowed to move entries around). telldir(2) is also a problem in that it's cookie is very small.
Avoid some of the performance problems that htree has. The problem with htree is that dirents are read (via readdir(2)) in hash order. For readdir+stat` workloads, this winds up doing a large amount of random I/O.

Secondary goals are that we increase performance of create and unlink. We don't want it to take O(N) time to insert a record, when lookup of that record is already O(1). Rename is nasty, but to the disk format, it's essentially a compound operation made up of unlinks and creates so it needs not be heavily considered here.

Another primary goal is that of backwards compatibility. We want a running file system to understand both directory formats. This goal is very easy however, since we already have per-inode features fields (i_dyn_features) which we can use to indicate whether a directory is indexed or not. It is not necessary that the file system be able to convert a directory from one format to another.

Structure Review

/*
 * On-disk directory entry structure for OCFS2
 *
 * Packed as this structure could be accessed unaligned on 64-bit platforms
 */
struct ocfs2_dir_entry {
/*00*/  __le64   inode;                  /* Inode number */
        __le16   rec_len;                /* Directory entry length */
        __u8    name_len;               /* Name length */
        __u8    file_type;
/*0C*/  char    name[OCFS2_MAX_FILENAME_LEN];   /* File name */
/* Actual on-disk length specified by rec_len */
} __attribute__ ((packed));

Some notes about the existing directory format:

It is fully POSIX compliant - dirents never move, even to create free space. Also, a directory is never truncated, unless deleted.
readdir+stat will generally be reading inodes in insert order. For Ocfs2, this is typically close to block number order.
'.' and '..' entries have a fixed offset. I think that some software might expect that. It should be trivial for us to fake this if we need to.

Design

Two Btrees

I propose that we keep two btrees per directory, an unindexed tree and a name-indexed tree.

Unindexed Tree

The unindexed tree would be nearly identical to the existing Ocfs2 directory allocation btree, where the leaves are blocks of sequential ocfs2_dir_entry structures. We would use this tree for readdir. That way, we preserve our insert-order stats and POSIX compliance. The tree would be stored in exactly the same place in the inode as it is today. This would also be the only place where we store full dirents, so rebuilding a corrupted directory (from fsck) would involve a walk of this tree and re-insert of found records into the indexed tree (discussed later).

The only addition to our standard directory tree would be a small record at the beginning of each block which would facilitate our ability to find free space in the tree. Additionally, we take advantage of the opportunity to provide a per-block checksum. We can actually hide this in a fake directory entry record if need be. Changes to the existing code would be minimal.

/*
 * Per-block record for the unindexed directory btree. This is carefully
 * crafted so that the rec_len and name_len records of an ocfs2_dir_entry are
 * mirrored. That way, the directory manipulation code needs a minimal amount
 * of update.
 *
 * NOTE: Keep this structure aligned to a multiple of 4 bytes.
 */
struct ocfs2_dir_block_hdr {
/*00*/  __u8    db_signature[8];
        __le16  db_compat_rec_len;      /* Backwards compatible with
                                           ocfs2_dir_entry. */
        __u8    db_compat_name_len;     /* Always zero. Was name_len */
        __u8    db_free_list;           /* What free space list is this dir
                                         * block on. Set to
                                         * OCFS2_IDIR_FREE_SPACE_HEADS when the
                                         * block is completely full */
        __le32  db_reserved1;
        __le64  db_free_prev;           /* Physical block pointers comprising a
        __le64  db_free_next;            * doubly linked free space list */
        __le64  db_csum;
};

Name Indexed Tree

We can keep a name-indexed btree rooted in a separate block from the directory inode. Lookup would exclusively use this tree to find dirents.

Each record in the indexed tree provides just enough information so that we can find the full dirent in the unindexed tree. This allows us to keep a much more compact tree, with fixed-length records as opposed to the unindexed tree whose records can be large and of multiple sizes. The distinction helps - for most block sizes, collisions should not be a worry. Fixed-length records also make manipulation of indexed dir blocks more straight forward. The expense of course is a couple of additinoal reads, however even with that read, we stay well within our limit of O(1). In order to reduce the number of leaves to search when multiple records with the same hash are found, we also store the entries name_len field.

Since the size of indexed dirents is so small, I don't think we need to use the same bucket scheme as DesignDocs/IndexedEATrees. It should be sufficient to allow that the hashed name determines the logical offset of which indexed tree block to search. We'd have room for over 200 collisions per block on a 4k blocksize file system, over 100 collisions on 2k blocksize, and so on.

Aside from lookup, unlink performance is also improved - since we know the name to unlink, we can hash it, look it up in the index and know exactly where the corresponding dirent is.

The only operation which does not get a performance increase from this is insert of directory entries. While the indexed tree insert is O(1), an insert into the unindexed tree still requires an O(N) search of it's leaf blocks for free space. To solve this problem, I propose that we link unindexed leaf blocks together into lists according to their largest contiguous amount of free space. The next section describes this in more detail.

#define OCFS2_IDIR_FREE_SPACE_HEADS     16

/*
 * A directory indexing block. Each indexed directory has one of these,
 * pointed to by ocfs2_dinode.
 *
 * This block stores an indexed btree root, and a set of free space
 * start-of-list pointers.
 *
 * NOTE: Directory indexing is not supported on 512 byte block size file
 * systems 
 */
struct ocfs2_idir_block {
       __u8     idx_signature[8];
       __le64   idx_csum;
       __le64   idx_fs_generation;
       __le64   idx_blkno;
       __le64   idx_reserved1[2];
       __le32   idx_reserved2;
       __le16   idx_suballoc_slot;
       __le16   idx_suballoc_bit;
       __le64   idx_free_space[OCFS2_IDIR_FREE_SPACE_HEADS];
       struct ocfs2_extent_list idx_list;
};

/*
 * A directory entry in the indexed tree. We don't store the full name here,
 * but instead provide a pointer to the full dirent in the unindexed tree.
 *
 * We also store name_len here so as to reduce the number of leaf blocks we
 * need to search in case of collisions.
 */
struct ocfs2_idir_dirent {
       __le32   hash;           /* Result of name hash */
       __u8     name_len;       /* Length of entry name */
       __u8     reserved1;
       __le16   reserved2;
       __le64   dirent_off;     /* Logical offset of dirent in unindexed tree */
};

/*
 * The header of a leaf block in the indexed tree.
 */
struct ocfs2_idir_block_hdr {
       __u8     ih_signature[8];
       __le64   ih_csum;
       __le64   ih_reserved1;
       __le32   ih_reserved2;
       __le16   ih_count;       /* Maximum number of entries possible in ih_entries */
       __le16   ih_num_used;    /* Current number of ih_entries entries */
       struct ocfs2_idir_dirent ih_entries[0];  /* Indexed dir entries in a packed array of length ih_num_used */
};

Free Space List

The free space lists help us find which unindexed dir block to insert a given dirent into. Each dir block can be part of a single list, or none at all if it is full. Which list a dir block is stored in depends on the size of the contiguous free space found in that block.

Unlink, insert and rename may have to move a leaf block between lists. To make this easier, the entries are doubly linked.

So that we can find the 1st entries in each list, an array of free space list headers are stored in the block holding the indexed tree root. Each header is associated with a certain size. It's list entries are comprised of blocks which have at least that large a hole for dirents, but still less than the size of the next header.

TODO: The list sizes breakdown I have right now is only to illustrate the point. We can do better if we take average dirent sizes into account and distribute our list sizes accordingly.

To look for a leaf block with a certain amount of free space, F, we select the highest ordered list whose entries are equal to or less than F. If we search the entire list and don't find anything, we can move up to the next sized list and be guaranteed to find enough space in it's first entry. If there are no larger list entries, then we know a new leaf block has to be added.

While this won't give us better than O(N) in the worst case, it should greatly improve our chances of finding free space in the average case. In particular, directories whose inner leaves are almost or completely full (this is common) will not require a full scan.

TODO: The list sizes breakdown I have right now is only to illustrate the point. We can do better if we take average dirent size into account.

#define OCFS2_DIR_MIN_REC_LEN           16  /*  OCFS2_DIR_REC_LEN(1) */
/* Maximum rec_len can be almost block-size as dirents cover the entire block. OCFS2_DIR_REC_LEN(OCFS2_MAX_FILENAME_LEN) however, is 268 */

static inline int ocfs2_dir_block_which_list(struct ocfs2_dir_block_hdr *hdr)
{
        if (hdr->db_free_max == 0)
           return -1;

        return ocfs2_idir_free_size_to_idx(le16_to_cpu(hdr->db_free_max));
}

/*
 * Given an index to our free space lists, return the minimum allocation size
 * which all blocks in this list are guaranteed to support.
 */
static inline unsigned int ocfs2_idir_free_idx_to_min_size(unsigned int idx)
{
        BUG_ON(idx >= OCFS2_IDIR_FREE_SPACE_HEADS);
        return (idx + 1) * 16;
}

/*
 * Given a rec len, return the index of which free space list to search.
 */
static inline unsigned int ocfs2_idir_free_size_to_idx(unsigned int rec_len)
{
        unsigned int idx = 0;
        unsigned int bytes;

        for (idx = (OCFS2_IDIR_FREE_SPACE_HEADS - 1); idx >= 0; idx--) {
            bytes = ocfs2_idir_free_idx_to_min_size(idx);
            if (rec_len >= bytes)
               break;
        }

        return idx;
}

Open Questions

How is POSIX readdir affected if a directory goes from inline to indexed?

The insertion of a block header for that 1st block technically changes the logical location of our dirents. In theory then, a readdir might see a dirent twice, if it races at exactly the right moment. What do we do about this?

Nothing, and just call it a rare race.
Waste space in the inline directory (undesirable) and keep a placeholder block header there.
Move the block header to the back of the block. Can a full inline dir make this impossible though? Also, we can't easily change the size of the header if we do this.
Fake the offsets used in readdir / telldir for inline dirs by adding sizeof(struct ocfs2_dir_block_hdr) if it's inline.

I like the last solution best, but it requires us guaranteeing that an inline directory will never grow to anything other than an indexed one. I'm not sure why a user wouldn't want their directory to be indexed, but stranger things have happened. Of course, if we separate out the block header into it's own feature then we can always insert the block header into non-indexed directories too.

Do we use logical or physical block pointers?

logical is cleaner
physical provides better performance

I think actually, that the free space list and list heads should be physical block pointers since we could be traversing them a lot. The dirent pointers in the indexed items can be logical though.

Absolutely physical structure should be physical block pointers. That's the free space list and list heads. For the dirent pointers... I don't like having extra reads, but somehow logical feels better. I like having an absolute linkage between index entry and dirent (byte offset is directly to the dirent). That said, a physical block offset and scanning the dirblock for the entry works just fine and is fewer reads. Is that indecisive enough? Basically, I'm trying to come up with a reason why either method would be better/worse when doing some future change. -- Joel

Well, we can directly address the dirent, regardless of whether we use logical or physical addresses. Since existing directory allocation doesn't change in the same way that file allocation could, it might make the most sense to use physical addresses in the indexed dirent too, based only on the performance gain. --Mark

Can we optimize lookup of "medium" size directories

I think the lookup code can use i_size to determine the maximum number of blocks it would search, had the directory not been indexed. If that number is less than or equal to the minimum number of blocks required to check the index tree, we can ignore the index altogether.

For example, say we have an index btree with only one level. In that case, the number of reads (after the inode has been read) is 2 - one for the root block, and one for the leaf. If the inodes size is less than or equal to 2 blocks, it might make sense to just search the unindexed tree. If we take block contiguousness and readahead into account, we might be able to optimize further. If clustersize is 16K, blocksize is 4K, readahead size is 16K, and i_size is one cluster or less, we might as well just pull in the entire leaf cluster in one I/O.

Another thing we can consider, is storing the index inline in the root index block until it gets large enough that a btree is necessary. This is similar to how we store EA blocks.

What size hash to use

We're using 32 bits as described above. There's a couple of considerations. Firstly, is 32 bits a large enough hash? It probably is, considering that we can have a fairly large number of collisions within a block. However, it still might be a good idea to increase the size of our hash so that the high 32 bits addresses the exact cluster in our btree, while the lower bits help us located the exact block within that cluster. That way, we can use the entire address space available to us for cluster sizes that are greater than block size. Since largest clustersize is 1meg (20 bits) and smallest blocksize is 512 bytes (9 bits), we'd need to use another 11 bits (probably just rounded to 16) for a 48 bit hash. Obviously, which parts of that are used depends on the clustersize / blocksize combination.

How do we fill up indexed tree leaves before rebalancing?

Obviously, we don't want to allocate a new leaf block for every insert. To get around this, we'll store entries in the empty leaf whose cpos is closest to but still less than the entries hash value. When a leaf fills up, we allocate a new one and rebalance by moving half the entries to the new leaf. Entries whose hash equals a the cpos of the leaf never move. When we've filled up a leaf with identical hash values, we -ENOSPC.

File systems where cluster size is different from block size pose a problem though. Which block do we chose (within the cluster) to store an entry? We'd like to spread them out within the cluster, so that we don't overload a given block, forcing early rebalance. This is actually related to the previous question about hash size. If we can assume a hash size large enough to also include block offset within the cluster, then the block offset from beginning of cluster to store in is: largehash & ~(clustersize_bits - blocksize_bits). So, expanding this:

logical block of hash = ((upper 32 bits of hash) << (clustersize_bits - blocksize_bits) + (hash & ~(clustersize_bits - blocksize_bits)

Of course, what we actually do is use the (upper 32 bits of hash) value to find the closest matching logical cluster (cpos), and then use the (hash & ~(clustersize_bits - blocksize_bits) result to find a block within that cluster.

Do hard links require any special considerations

I don't believe so. The unindexed tree can hold multiple hard links just fine. The indexed tree is name-indexed, so the likelyhood of collisions is low.

Should tunefs.ocfs2 convert old directories when turning on this feature?

Probably not as it's expensive. Obviously turning off the feature requires conversion of indexed directories, but that should be as easy as deleting the indexed tree and clearing the 1st dirent in each block. We could always provide a separate tunefs command (or standalone tool) to indexed an existing directory.

I think this is a good scheme. -- Joel

You know, we could even do most of a dir conversion in userspace by just hard linking files from the old dir into a new one. We could us mtime/ctime to detect changes that might have happened during the linking. --Mark

TODO

pseudo code for insert
pseudo code for unlink
pseudo code for rename?
tools changes needed
- tunefs.ocfs2
- mkfs.ocfs2
- libocfs2
- fsck.ocfs2