OCFS2 Indexed Btrees for Extended Attributes
Mark Fasheh <mark dot fasheh at oracle dot com>
December 7, 2007
Required Reading
This document references the extended attributes design document heavily. In particular, familiarity with struct ocfs2_xattr_header and struct ocfs2_xattr_entry is required.
Introduction
The extended attributes design document describes an excellent method for storing small (in-inode) to medium (EA Block) numbers of extended attributes. Support for large numbers of attributes is touched upon by referencing a name indexed tree, but no specific description for how such a tree would work is given. With this document, I attempt to flesh out a more detailed design of a Btree indexed by EA name hashes.
To be clear, this isn't about support for large individual EA's - that's already handled in the original EA document by allowing for a value btree. This document is specifically about handling large numbers of EA's. In many ways, this is more important. The number of applications using extended attributes is growing steadily, and it's not out of the question to think that we could need to store more than one block's worth of attributes today.
Primary Goals / Requirements
- Support for a large number of extended attributes
- Performant lookup of attribute names.
One comment on the number of attributes we should be targeting. On one hand, I believe we shouldn't limit the total number of attributes. On the other hand, I'm much more concerned with the performance of hundreds (maybe thousands?) of EA's, than I am with "tens of thousands".
Overall with regards to performance, we want the same as any Ocfs2 feature we've designed before - good performance from a solid core design that our customers will appreciate. It doesn't have to be the absolute fastest on the block, especially if that's at the expense of readability or maintainability. It should be "up there" though, and I don't think that's particularly beyond us.
Secondary Goals
Avoid creating a completely new disk structure (High Priority).
At first glance, this seems to mean that we don't want to put a full blown new set of structures on disk, and while that's certainly part of it, there's much more. Specifically, I feel that we should avoid a completely new set of algorithms and design approaches. For example, even if in the worst case an indexed btree requires some new disk structures and perhaps even a separate btree implementation, it's still a btree, with tradeoffs and space/performance considerations that we're used to.
Re-use existing btree code (Medium-High Priority)
If this just seems impossible, then we can fall back to re-implementing the parts of the code which differ too greatly. At any rate, if we have to re-implement some of the insert code, perhaps we could make it generic enough to enable the next goal. Hopefully things like hole-punching (for deleting a tree) could be mostly re-used though.
Design with a close eye towards indexed directories (Medium Priority)
This really shouldn't be too hard as I believe indexed directories are actually simpler (particularly in leaf format) than an EA tree, but if it starts to get tough, we can just worry about indexed directories when we get to them.
Existing Ocfs2 Btree Structures
The ocfs2 btree manipulation code is very specific about what a tree should look like:
- One extent's logical offset plus length never overlaps with the logical offset of the extent next to it.
- Interior nodes are never sparse - each extent covers the theoretical range of the nodes below it (an interior extent's e_cpos is always the lowest cpos found beneath it).
struct ocfs2_extent_rec {
/*00*/	__le32 e_cpos;			/* Offset into the file, in clusters */
	union {
		__le32 e_int_clusters;	/* Clusters covered by all children */
		struct {
			__le16 e_leaf_clusters;	/* Clusters covered by this extent */
			__u8 e_reserved1;
			__u8 e_flags;	/* Extent flags */
		};
	};
	__le64 e_blkno;			/* Physical disk offset, in blocks */
/*10*/};
Extents are allocated as clusters, which can range from 4k to 1M. Ocfs2 block sizes range from 512 bytes to 4k.
Design
We use e_cpos to store the smallest hash value within an extent. The range of hashes in an extent can then be calculated by comparing with the next extent record. e_leaf_clusters then still refers only to the actual on-disk allocation.
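As a sketch of that range calculation (hypothetical helper and type names; the extent list is simplified to a plain array of records in host byte order):

#include <stdint.h>

/* Simplified in-memory view of an extent record in the EA tree
 * (mirrors struct ocfs2_extent_rec, host byte order for clarity). */
struct xrec {
	uint32_t e_cpos;	/* smallest name hash in this extent */
	uint64_t e_blkno;	/* first block of the extent */
};

/* Hash range covered by record i: from its own e_cpos up to just
 * below the next record's e_cpos.  The rightmost record is
 * unbounded above. */
static void xattr_extent_hash_range(const struct xrec *recs, int nr,
				    int i, uint32_t *lo, uint32_t *hi)
{
	*lo = recs[i].e_cpos;
	*hi = (i + 1 < nr) ? recs[i + 1].e_cpos - 1 : UINT32_MAX;
}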
Extents within a btree are broken up into equal sized 'buckets'. Hash values are distributed in sort order throughout the non-empty buckets in the extent. This should make a binary search (starting at the 1st bucket) simple. While the hash range of any given bucket can change as the extent is inserted into, we do not allow for overlapping ranges between buckets. The maximum number of collisions we will allow for then is one bucket's worth. If a set of EA's collide so much that they fill up the bucket, any subsequent attempts to insert another EA with the same hash will fail with ENOSPC.
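To illustrate the collision limit (a sketch only - the per-entry hash array stands in for walking a bucket's sorted ocfs2_xattr_entry list):

/* Returns nonzero if inserting 'new_hash' must fail with ENOSPC:
 * the bucket is full and every entry in it already collides on the
 * incoming hash, so no rebalance or split can make room without
 * creating overlapping bucket ranges. */
static int bucket_collision_overflow(const uint32_t *entry_hashes,
				     int count, int bucket_full,
				     uint32_t new_hash)
{
	return bucket_full && count > 0 &&
	       entry_hashes[0] == new_hash &&
	       entry_hashes[count - 1] == new_hash;
}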
Maximum size of an extent will be limited as time-to-search will degrade as the extent grows.
Open Question: As it stands, bucket range isn't actually written to a field anywhere and it's implicitly determined by the first and last values within the bucket. Should we bother writing range to disk? If so, how and where?
If bucket size is chosen so that it's never larger than cluster size AND we don't allow more than one bucket's worth of collisions, then we can never break the "no overlapping extents" rule of the btree code.
My hope is that breaking the extents up into buckets will reduce the scope of most operations (inserts, removals) down to a manageable size. Searching should be only minimally worse than if we had a whole-extent size ordered list.
A superblock value can record the bucket size. I propose however, that for now we choose 4k as the bucket size for all file systems, regardless of other fs parameters. 4k is the maximum size which will automatically work with every cluster size, and we're favoring 4k clusters very strongly, so if something doesn't work with buckets of that size, we'll be in major trouble.
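A sketch of how the fixed choice might look (the constant name is hypothetical, not an existing define):

#define OCFS2_XATTR_BUCKET_SIZE	4096

/* Number of fs blocks making up one bucket.  4096 is a multiple of
 * every supported block size (512 bytes to 4k), so this always
 * divides evenly. */
static inline int xattr_blocks_per_bucket(unsigned int blocksize)
{
	return OCFS2_XATTR_BUCKET_SIZE / blocksize;
}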
Bucket Format
Internally, each bucket is an ocfs2_xattr_header. The list, insert, and remove operations within a bucket are identical to those in a single external EA block, except that the data might span multiple fs blocks.
There's no need to worry about overflowing the le16 xh_count field in ocfs2_xattr_header. Doing the math with a minimum xattr_entry size of 16 bytes, a bucket would have to grow past 65536 * 16 = 1048576 bytes before xh_count could overflow, which means we're safe until we want 1M buckets.
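That bound is easy to capture in a compile-time check (a sketch, reusing the bucket size constant from above; the 16-byte minimum entry size comes from the EA design document):

#include <assert.h>

#define OCFS2_XATTR_ENTRY_MIN_SIZE	16

/* The most entries a bucket could ever hold must be representable
 * in the 16-bit xh_count.  For 4k buckets this is 4096 / 16 = 256,
 * far below the 65535 limit. */
static_assert(OCFS2_XATTR_BUCKET_SIZE / OCFS2_XATTR_ENTRY_MIN_SIZE <= 65535,
	      "xh_count would overflow");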
Finding an EA
An EA is found by hashing its name and searching an ocfs2 btree using e_cpos as the index to an extent which may contain the EA. If the exact cpos isn't found, then the next smallest value is used. For the purposes of the search, length fields are ignored. They only refer to the on-disk size of the extent, not the hash range. The leftmost extent record's e_cpos value is always zero.
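In sketch form (using the simplified struct xrec from earlier), the "next smallest value" rule is just a floor search on e_cpos:

/* Find the extent record whose hash range may contain 'hash': the
 * rightmost record with e_cpos <= hash.  Since record 0 always has
 * e_cpos == 0, this only fails on an empty list. */
static int xattr_find_extent(const struct xrec *recs, int nr,
			     uint32_t hash)
{
	int i;

	for (i = nr - 1; i >= 0; i--)
		if (recs[i].e_cpos <= hash)
			return i;
	return -1;
}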
Once an extent is found, a binary search across the set of non-empty buckets is performed. The search can skip around the buckets until it either finds the value it's looking for, a set of records with equal hash values whose names don't match, or a pair of adjacent records with hashes less than and greater than the search key.
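A sketch of the bucket-level binary search (first_hash[] is a stand-in for reading the first entry's hash out of each bucket's ocfs2_xattr_header):

/* Buckets don't overlap and are sorted, so the candidate bucket is
 * the last one whose first hash is <= the search key.  'used' is
 * the count of contiguous non-empty buckets. */
static int xattr_find_bucket(const uint32_t *first_hash, int used,
			     uint32_t hash)
{
	int lo = 0, hi = used - 1, found = 0;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (first_hash[mid] <= hash) {
			found = mid;
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return found;	/* caller then searches within this bucket */
}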
Inserts / Splits
A new extent is formatted as a series of empty buckets. Buckets are filled from left to right, starting with the first bucket in the extent. Each time a bucket is filled, the set of non-empty buckets is extended by one and the range of hash values is redistributed evenly across that set (the buckets are rebalanced). We continue this until all buckets within the extent are full or almost full.
We can store the current number of contiguous, non-empty buckets in the 16 bit xh_reserved1 field.
Once all buckets in an extent are full, we need to add another extent. Things are a bit different depending on whether the new extent is contiguous with the old one or not. If it's contiguous, we just continue the process of rebalancing the extent's buckets. If it isn't contiguous, we pick a new cpos for it and move over all EA's which now fit into the new extent. The new cpos is chosen so as to be in the middle of the range that the existing extent covers: new_cpos = cpos + (highest_hash - lowest_hash) / 2
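That split point, as a sketch (a pure function of the existing extent's fields; entries hashing at or above the new cpos would migrate to the new extent):

/* Midpoint of the hash range the existing extent covers, per the
 * formula above. */
static uint32_t xattr_split_cpos(uint32_t cpos, uint32_t lowest_hash,
				 uint32_t highest_hash)
{
	return cpos + (highest_hash - lowest_hash) / 2;
}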
Open Question: We need some (hopefully simple) heuristics on when we consider an extent "full" and in need of being split.
Conflicts
One of the biggest questions I have about this is "how many collisions are reasonable?".
We can calculate several things here. Let's assume a bucket size of 4096 bytes. Removing 8 bytes of overhead for the top of ocfs2_xattr_header gives us 4088 bytes for storing EA's.
Let's assume that EA values are all ocfs2_xattr_value_root structures and we allow for two extents in each one. That gives us 16 bytes of overhead, 16 bytes for the ocfs2_extent_list, and 16 bytes per ocfs2_extent_rec, so a total of 64 bytes.
Structure                          | Bytes Required
ocfs2_xattr_header overhead        | 8 bytes
ocfs2_xattr_entry                  | 16 bytes
ocfs2_xattr_value_root overhead    | 16 bytes
ocfs2_extent_list with two extents | 48 bytes (16 + 2x16)
Of course, the name is the most variable part of this.
Open Question: What is the average length of an EA name?
#define XATTR_NAME_MAX 255
Name Length | Per EA Overhead | # of EA's that fit in 4088 bytes
        255 |             335 | 12
        128 |             208 | 19
         96 |             176 | 23
         64 |             144 | 28
         32 |             112 | 36
         16 |              96 | 42
          8 |              88 | 46
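As a quick cross-check of the table (pure arithmetic; per-EA overhead is the name length plus the 16-byte entry and the 64-byte two-extent value root from above):

/* How many EA's of a given name length fit in one 4096-byte bucket
 * after removing the 8-byte header. */
static int eas_per_bucket(int name_len)
{
	return (4096 - 8) / (name_len + 16 + 64);
}
/* eas_per_bucket(255) == 12, eas_per_bucket(64) == 28, etc. */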
Item Removal
When the last item in a bucket is removed, we re-balance all the buckets in that extent.
Space Optimization Questions
The maximum inline value size should be limited to 80 bytes, the size of a two-extent struct ocfs2_xattr_tree_root.
Should we drop the whole ocfs2_xattr_value_root thing in the indexed tree and just use a single block pointer to a tree root? The block pointer would really suck for "medium" sized EA's.
Hashing Algorithm
- We need a random seed in the superblock which is never printed on the console.
- TEA by default (see the sketch below).
- We should allow for different hash types via a superblock parameter.
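For concreteness, here is a sketch of what a seeded, TEA-based name hash might look like (the block packing, IV, and final fold are illustrative assumptions, not a finalized on-disk algorithm; the round function itself is standard TEA):

#include <stdint.h>
#include <string.h>

/* One 32-round TEA encryption of v[2] under the 128-bit key k[4]. */
static void tea_rounds(uint32_t v[2], const uint32_t k[4])
{
	uint32_t v0 = v[0], v1 = v[1], sum = 0;
	int i;

	for (i = 0; i < 32; i++) {
		sum += 0x9e3779b9;
		v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
		v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
	}
	v[0] = v0;
	v[1] = v1;
}

/* Hash an EA name down to 32 bits.  'seed' would be the random
 * per-fs value from the superblock.  Byte order handling is omitted
 * for brevity. */
static uint32_t xattr_name_hash(const char *name, size_t len,
				const uint32_t seed[4])
{
	uint32_t v[2] = { 0x67452301, 0xefcdab89 };	/* arbitrary IV */
	uint32_t blk[2];

	while (len) {
		size_t n = len < sizeof(blk) ? len : sizeof(blk);

		memset(blk, 0, sizeof(blk));
		memcpy(blk, name, n);
		v[0] ^= blk[0];
		v[1] ^= blk[1];
		tea_rounds(v, seed);
		name += n;
		len -= n;
	}
	return v[0] ^ v[1];
}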