Extended Attributes Preliminary Design Document
Jeff Mahoney, SUSE Labs, Novell
Mark Fasheh, Oracle
Original Revision: July 26, 2006 (jeffm)
Many Updates: November/December 2007 (mfasheh)
Introduction
Extended attributes are used for storing POSIX ACLs, SELinux labels, and user-accessible metadata. They are essential for deploying file systems exported for workgroup use via Samba. The following document outlines a design for implementing extended attributes on the OCFS2 file system.
The design should be flexible enough to support many large extended attributes, but also quick enough to provide good performance when only a few small extended attributes are associated with the inode. It should also consider that extended attributes are generally accessed less frequently than the data they protect/describe, and in-inode data should take performance precedence over the extended attributes.
To meet these goals, the design uses layers of indirection to accommodate larger attributes while preserving the performance of smaller ones. Large numbers of attributes should also be handled gracefully.
When space in the inode allows, the xattr header will be kept at the end of the inode block, with xattr entries preceding it on disk. When the inode is being used for in-inode data, or otherwise does not have enough space to contain the xattr header, the header is placed in its own block with as many entries as will fit before allocating additional blocks to store entries.
In order to maximize performance, xattr values will be kept with their description entries whenever possible. This applies to both the inode block and external xattr blocks.
Locking
The conventions on local file systems are such that write operations take the inode mutex as well as a per-inode xattr rwsem. Read operations only take the xattr rwsem.
For the initial implementation, the cluster inode meta lock will be used to protect the attribute space. Eventually, it may be desirable to implement a cluster xattr lock that can handle locking/caching/refreshing of high profile metadata like ACLs.
Data Structures
Existing Data Structures
struct ocfs2_dinode
    . . .
    __le16  i_dyn_features
    . . .
    __le64  i_xattr_loc
    __le64  i_reserved1[8]
    . . .
    struct ocfs2_xattr_header  header
New Flags: OCFS2_HAS_XATTR_FL = 0x0002, OCFS2_INLINE_XATTR_FL = 0x0004
OCFS2_HAS_XATTR_FL is set when the inode has extended attributes. The i_xattr_loc member contains the location of the extended attribute header. If i_dyn_features contains the OCFS2_INLINE_XATTR_FL flag, the i_xattr_loc member contains the offset from the beginning of the inode block where the ocfs2_xattr_header record is located. If the flag is unset, then the value contains a block number where the first ocfs2_xattr_block can be found.
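The rule above can be sketched as a small C helper that decides how i_xattr_loc should be read. The flag values are the ones given in this document; the enum and the function name ocfs2_xattr_locate are hypothetical and exist only for illustration, and the flags are assumed to be already converted to CPU byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as given in this document. */
#define OCFS2_HAS_XATTR_FL    0x0002
#define OCFS2_INLINE_XATTR_FL 0x0004

/* Hypothetical result type, for illustration only. */
enum xattr_loc_kind {
    XATTR_LOC_NONE,   /* inode has no extended attributes */
    XATTR_LOC_INLINE, /* i_xattr_loc is a byte offset into the inode block */
    XATTR_LOC_BLOCK   /* i_xattr_loc is the block number of an ocfs2_xattr_block */
};

/* Decide how i_xattr_loc must be interpreted.
 * Assumption: OCFS2_INLINE_XATTR_FL is ignored unless OCFS2_HAS_XATTR_FL is set. */
static enum xattr_loc_kind ocfs2_xattr_locate(uint16_t i_dyn_features)
{
    if (!(i_dyn_features & OCFS2_HAS_XATTR_FL))
        return XATTR_LOC_NONE;
    if (i_dyn_features & OCFS2_INLINE_XATTR_FL)
        return XATTR_LOC_INLINE;
    return XATTR_LOC_BLOCK;
}
```

A caller would switch on the result and either add i_xattr_loc to the inode block's base address or read the block it names.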
struct ocfs2_extent_rec (16 bytes)
    __le32  e_cpos
    __le32  e_clusters
    __le64  e_blkno
The structure itself is unmodified, but we use e_cpos to store the hash of the name of the first attribute entry in the block. This modification is used for hashed directories as well and may be able to share the code used for manipulating them. This requires that Sparse Tree Updates be integrated before Extended Attributes can be used.
New Data Structures
struct ocfs2_xattr_entry (16 bytes)
    __le32  xe_name_hash
    __le16  xe_name_offset
    __u8    xe_name_len
    __u8    xe_local (0), xe_type (1:7)
    __le64  xe_value_size
Every extended attribute has an ocfs2_xattr_entry associated with it. It indicates what type of extended attribute it describes as well as where to find it, how large it is, etc.
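As a quick check of the stated 16-byte size, the record can be mirrored in plain C. This is only a sketch: fixed-width integers stand in for the kernel's __le32/__le16/__u8/__le64 types, byte order is ignored, and the names xattr_entry_layout, xe_local_type and xattr_entry_layout_ok are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ocfs2_xattr_entry (illustrative, not the on-disk definition). */
struct xattr_entry_layout {
    uint32_t xe_name_hash;   /* hash of the attribute name */
    uint16_t xe_name_offset; /* where the name starts within the block */
    uint8_t  xe_name_len;    /* length of the stored name */
    uint8_t  xe_local_type;  /* xe_local (bit 0) and xe_type (bits 1:7) share one byte */
    uint64_t xe_value_size;  /* size of the attribute value */
};

/* Returns nonzero when the fields pack into 16 bytes with no padding. */
static int xattr_entry_layout_ok(void)
{
    return sizeof(struct xattr_entry_layout) == 16 &&
           offsetof(struct xattr_entry_layout, xe_name_offset) == 4 &&
           offsetof(struct xattr_entry_layout, xe_name_len) == 6 &&
           offsetof(struct xattr_entry_layout, xe_local_type) == 7 &&
           offsetof(struct xattr_entry_layout, xe_value_size) == 8;
}
```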
The names and values will always be block local and will be placed in reverse order from the end of the block with the value immediately following the name, subject to alignment rules.
When the xe_local bit is set, the attribute value is stored in the local block. When the xe_local bit is unset, the value stored in the local block will be an ocfs2_xattr_value_root record, rooting an extent tree where the attribute data is actually stored. xe_value_size contains the size of the attribute value. The full 64 bits of the size field are unlikely to be needed any time soon, but they cost little and future-proof the format. Most sizes in ocfs2 are 64 bits anyway.
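The shared byte holding xe_local and xe_type could be packed and unpacked roughly as follows. This sketch assumes that "(0)" means the least significant bit and "(1:7)" the remaining seven bits; the document does not state the bit order, and the helper names xe_pack, xe_is_local and xe_get_type are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XE_LOCAL_MASK 0x01u /* bit 0: value is stored in the local block */
#define XE_TYPE_SHIFT 1     /* bits 1..7: name prefix index */
#define XE_TYPE_MASK  0x7Fu

/* Combine the local flag and the 7-bit type into one byte. */
static uint8_t xe_pack(int local, unsigned type)
{
    return (uint8_t)((local ? XE_LOCAL_MASK : 0u) |
                     ((type & XE_TYPE_MASK) << XE_TYPE_SHIFT));
}

/* Nonzero when the value lives in the same block as the entry. */
static int xe_is_local(uint8_t b)
{
    return (b & XE_LOCAL_MASK) != 0;
}

/* Extract the 7-bit name prefix index. */
static unsigned xe_get_type(uint8_t b)
{
    return (b >> XE_TYPE_SHIFT) & XE_TYPE_MASK;
}
```

Using explicit masks rather than C bitfields keeps the on-disk layout independent of the compiler's bitfield ordering.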
struct ocfs2_xattr_value_root (Greater than 32 bytes)
    __le32  xr_clusters
    __le32  xr_reserved0
    __le64  xr_last_eb_blk
    struct ocfs2_extent_list  xr_list
Most attributes will likely be placed in the block with the entry. When they grow too large, the value will be replaced with an ocfs2_xattr_value_root record and the xe_local bit will be cleared. In this case, the ocfs2_extent_list within the ocfs2_xattr_value_root will have a depth of 0, and all block pointers will be local (clusters will be allocated for extent data). A default number of extent list records has yet to be determined. In the event that the attribute is of such a size that it won't fit in the default number of extents, the ocfs2_extent_list will root a standard ocfs2 btree.
In order to avoid wasting storage on names, the name prefix will be mapped to a 7 bit value and removed from the name itself. The name's suffix will be stored in the block.
enum ocfs2_xattr_type {
    OCFS2_XATTR_INDEX_USER = 0,
    OCFS2_XATTR_INDEX_POSIX_ACL_ACCESS,
    OCFS2_XATTR_INDEX_POSIX_ACL_DEFAULT,
    OCFS2_XATTR_INDEX_TRUSTED,
    OCFS2_XATTR_INDEX_LUSTRE,
    OCFS2_XATTR_INDEX_SECURITY,
    OCFS2_XATTR_MAX
};
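A minimal sketch of the prefix mapping follows. The prefix strings are the usual Linux xattr namespace names and are an assumption here, since this document does not list them; no prefix is given for the Lustre index, so it is left out. The table and the function xattr_split_name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum ocfs2_xattr_type {
    OCFS2_XATTR_INDEX_USER = 0,
    OCFS2_XATTR_INDEX_POSIX_ACL_ACCESS,
    OCFS2_XATTR_INDEX_POSIX_ACL_DEFAULT,
    OCFS2_XATTR_INDEX_TRUSTED,
    OCFS2_XATTR_INDEX_LUSTRE,
    OCFS2_XATTR_INDEX_SECURITY,
    OCFS2_XATTR_MAX
};

/* Assumed prefix strings (standard Linux namespaces); not taken from this document. */
static const struct {
    const char *prefix;
    enum ocfs2_xattr_type type;
    int whole_name; /* nonzero: the prefix is the entire attribute name */
} prefix_table[] = {
    { "user.",                    OCFS2_XATTR_INDEX_USER,              0 },
    { "system.posix_acl_access",  OCFS2_XATTR_INDEX_POSIX_ACL_ACCESS,  1 },
    { "system.posix_acl_default", OCFS2_XATTR_INDEX_POSIX_ACL_DEFAULT, 1 },
    { "trusted.",                 OCFS2_XATTR_INDEX_TRUSTED,           0 },
    { "security.",                OCFS2_XATTR_INDEX_SECURITY,          0 },
};

/* Returns the suffix that would be stored in the block and sets *type,
 * or returns NULL when the name matches no known prefix. */
static const char *xattr_split_name(const char *full, enum ocfs2_xattr_type *type)
{
    size_t i;

    for (i = 0; i < sizeof(prefix_table) / sizeof(prefix_table[0]); i++) {
        size_t len = strlen(prefix_table[i].prefix);

        if (strncmp(full, prefix_table[i].prefix, len) != 0)
            continue;
        if (prefix_table[i].whole_name && full[len] != '\0')
            continue;
        *type = prefix_table[i].type;
        return full + len;
    }
    return NULL;
}
```

With this table, "user.comment" is stored as the suffix "comment" with type OCFS2_XATTR_INDEX_USER, saving the five bytes of "user." in the block.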
Each entry has a 32-bit hash value associated with it. The hash value is calculated using the full (prefix.suffix) name of the xattr to avoid hash collisions when the same suffix is used in multiple attribute namespaces. It is used to rule out names that cannot match before a string comparison verifies that the name is a match. The entries themselves are stored on disk sorted by xe_name_hash.
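The in-block lookup described here might look like the sketch below. The hash function itself is not specified in this document, so the caller supplies the hash; struct xe_view is a simplified, hypothetical in-memory view in which a pointer replaces the on-disk name offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified in-memory view of one entry (hypothetical). */
struct xe_view {
    uint32_t    name_hash; /* hash of the full prefix.suffix name */
    uint8_t     type;      /* name prefix index */
    const char *suffix;    /* name with the prefix removed */
};

/* Linear scan over entries sorted by name_hash.
 * Returns the index of the matching entry, or -1 if there is none. */
static int xattr_find(const struct xe_view *e, size_t count,
                      uint32_t hash, uint8_t type, const char *suffix)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (e[i].name_hash < hash)
            continue;           /* cannot match: skip the string compare */
        if (e[i].name_hash > hash)
            break;              /* sorted order: no later entry can match */
        if (e[i].type == type && strcmp(e[i].suffix, suffix) == 0)
            return (int)i;      /* hash, type and name all agree */
    }
    return -1;
}
```

Because entries are sorted by hash, the scan can stop as soon as it passes the target hash, and the string comparison only runs on entries whose hash already matches.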
Although lookups within a block are a linear operation, the xattr blocks are stored in a b-tree of depth 1. The search space is automatically limited to blocks where there is a likely match by using the e_cpos value in struct ocfs2_extent_rec.
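One way to narrow the search with e_cpos is to pick the last extent record whose e_cpos does not exceed the name hash. The sketch below assumes the records are sorted by e_cpos and already in CPU byte order; struct extent_rec_view and xattr_pick_block are hypothetical names, and a real implementation may also need to look at neighbouring blocks when equal hashes straddle a block boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPU-order view of struct ocfs2_extent_rec (hypothetical). */
struct extent_rec_view {
    uint32_t e_cpos;     /* here: hash of the first entry in the block */
    uint32_t e_clusters;
    uint64_t e_blkno;    /* block holding the entries */
};

/* Binary search for the last record with e_cpos <= hash.
 * Returns its index, or -1 when hash is below the first record. */
static int xattr_pick_block(const struct extent_rec_view *recs,
                            size_t count, uint32_t hash)
{
    size_t lo = 0, hi = count; /* invariant: answer + 1 lies in [lo, hi] */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (recs[mid].e_cpos <= hash)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (int)lo - 1;
}
```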
Entries may optionally contain a 32 bit hash value to perform data integrity checks against. When the hash value is 0, it is considered unused.
Names and values will be padded to align on 64-bit boundaries.
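The padding rule can be illustrated with a round-up helper. This sketch assumes the name and the value are each padded separately to an 8-byte boundary, which the document does not spell out; XATTR_ALIGN, xattr_align and xattr_entry_space are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define XATTR_ALIGN 8u /* 64-bit boundary, in bytes */

/* Round n up to the next multiple of XATTR_ALIGN. */
static size_t xattr_align(size_t n)
{
    return (n + XATTR_ALIGN - 1) & ~(size_t)(XATTR_ALIGN - 1);
}

/* Bytes consumed in the block by a locally stored name and value,
 * assuming each is padded on its own. */
static size_t xattr_entry_space(size_t name_len, size_t value_len)
{
    return xattr_align(name_len) + xattr_align(value_len);
}
```

For example, a 7-byte name with a 20-byte value would occupy 8 + 24 = 32 bytes under this assumption.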
struct ocfs2_xattr_header (24 bytes)
    __le16  xh_count
    __le16  xh_reserved1
    __le32  xh_csum
    struct ocfs2_xattr_entry  xh_entries[0]
The ocfs2_xattr_header describes how many ocfs2_xattr_entry records are in the block.
The xh_count member contains the count of how many records are in the local block. The entries themselves start immediately after the ocfs2_extent_list, which is variable size.
struct ocfs2_xattr_block (48 bytes)
    __u8    xb_signature[8]
    __le16  xb_suballoc_slot
    __le16  xb_suballoc_bit
    __le32  xb_fs_generation
    __le32  xb_csum
    __le16  xb_flags
    __le16  xb_reserved0
    __le64  xb_blkno
    __le64  xb_reserved1[2]
    union xb_attrs
        struct ocfs2_xattr_header     xb_header
        struct ocfs2_xattr_tree_root  xb_root
#define OCFS2_XATTR_INDEXED 0x1
The ocfs2_xattr_block is where extended attribute entries are located when they are outside of the local inode block. It has the signature "XATTR01".
xb_flags determines how attributes are to be found. By default, the xb_header field in the xb_attrs union is used to find in-block extended attributes. Once the number of extended attributes gets larger than will fit, we set OCFS2_XATTR_INDEXED and move them into a btree.
If OCFS2_XATTR_INDEXED is set then the xb_root field in the xb_attrs union roots a name-indexed btree. The extent records will be sorted as usual by e_cpos, but will contain the hash value of the first entry in the block.
struct ocfs2_xattr_tree_root (Greater than 32 bytes)
    __le32  xt_clusters
    __le32  xt_reserved0
    __le64  xt_last_eb_blk
    struct ocfs2_extent_list  xt_list
TODO: this is identical in size and layout to ocfs2_xattr_value_root. We should probably combine them somehow.
The Indexed BTrees for extended attributes design doc discusses changes related to EA name indexing extensively.
VISUAL LAYOUT (WARNING: THIS IS OUT OF DATE)
A crude visual overview of how the blocks are laid out:
When the xattr header is local:
+-------------------------------------------+
|             OCFS2 INODE BLOCK             |
+-------------------------------------------+
| . . .                                     |
| OCFS2 CORE INODE                          |
| . . .                                     |------+
| __le32 i_dyn_features OCFS2_INLINE_XATTR_FL --+  |
| . . .                                     |   |  |
| struct ocfs2_xattr_header  <------------------+  |
| struct ocfs2_extent_list xh_extents       | --------------+
| struct ocfs2_xattr_entry entry0 -----+    |               |
| struct ocfs2_xattr_entry entry1 ---+ |    |               |
| struct ocfs2_xattr_entry entry2    | |    |               |
| struct ocfs2_xattr_entry entry3    | |    |               |
| struct ocfs2_xattr_entry entry4    | |    |               |
| entry4 name                        | |    |               |
| entry4 value/extent list           | |    |               |
| entry3 name                        | |    |               |
| entry3 value/extent list           | |    |               |
| entry2 name                        | |    |               |
| entry2 value/extent list           | |    |               |
| entry1 name    <-------------------+ |    |               |
| entry1 value/extent list             |    |               |
| entry0 name    <---------------------+    |               |
| entry0 value/extent list                  | --------------+-----------+
+-------------------------------------------+               |           |
                                                            |           |
When the xattr header is in its own block:
                                                            |           |
+-------------------------------------------+               |           |
|      OCFS2 EXTENDED ATTRIBUTE BLOCK       |               |           |
+-------------------------------------------+               |           |
| struct ocfs2_xattr_block header "XATTR01" |               |           |
| struct ocfs2_extent_list xh_extents       | -----+        |           |
| struct ocfs2_xattr_entry entry0 -----+    |      |        |           |
| struct ocfs2_xattr_entry entry1 ---+ |    |      |        |           |
| struct ocfs2_xattr_entry entry2    | |    |      |        |           |
| struct ocfs2_xattr_entry entry3    | |    |      |        |           |
| struct ocfs2_xattr_entry entry4    | |    |      |        |           |
| . . .                              | |    |      |        |           |
| entry4 name                        | |    |      |        |           |
| entry4 value/ocfs2_extent_list     | |    | -----+--------+---------+ |
| entry3 name                        | |    |      |        |         | |
| entry3 value/ocfs2_extent_list     | |    |      |        |         | |
| entry2 name                        | |    |      |        |         | |
| entry2 value/ocfs2_extent_list     | |    |      |        |         | |
| entry1 name    <-------------------+ |    |      |        |         | |
| entry1 value/ocfs2_extent_list       |    |      |        |         | |
| entry0 name    <---------------------+    |      |        |         | |
| entry0 value/ocfs2_extent_list            |      |        |         | |
+-------------------------------------------+      |        |         | |
                                                   |        |         | |
+-------------------------------------------+      |        |         | |
|      OCFS2 EXTENDED ATTRIBUTE BLOCK       | <----+--(or)--+         | |
+-------------------------------------------+                         | |
| struct ocfs2_xattr_block header "XATTR02" |                         | |
| struct ocfs2_extent_list xh_extents       |                         | |
| struct ocfs2_xattr_entry entry0 -----+    |                         | |
| struct ocfs2_xattr_entry entry1 ---+ |    |                         | |
| struct ocfs2_xattr_entry entry2    | |    |                         | |
| struct ocfs2_xattr_entry entry3    | |    |                         | |
| struct ocfs2_xattr_entry entry4    | |    |                         | |
| . . .                              | |    |                         | |
| entry4 name                        | |    |                         | |
| entry4 value/ocfs2_extent_list     | |    |                         | |
| entry3 name                        | |    |                         | |
| entry3 value/ocfs2_extent_list     | |    |                         | |
| entry2 name                        | |    |                         | |
| entry2 value/ocfs2_extent_list     | |    | --------+               | |
| entry1 name    <-------------------+ |    |         |               | |
| entry1 value/ocfs2_extent_list       |    | --+-+   |               | |
| entry0 name    <---------------------+    |   | |   |               | |
| entry0 value/ocfs2_extent_list            |   | |   |               | |
+-------------------------------------------+   | |   |               | |
                                                | |   |               | |
+-------------------------------------------+   | |   |               | |
|   OCFS2 EXTENDED ATTRIBUTE VALUE BLOCK    | <-+ +   |               | |
+-------------------------------------------+     |   |               | |
|                                           |     |   |               | |
| (contents)                                |     |   |               | |
|                                           |     |   |               | |
+-------------------------------------------+     |   |               | |
                                                  |   |               | |
+-------------------------------------------+     |   |               | |
|   OCFS2 EXTENDED ATTRIBUTE VALUE BLOCK    | <---+   |               | |
+-------------------------------------------+         |               | |
|                                           |         |               | |
| (contents)                                |         |               | |
|                                           |         |               | |
+-------------------------------------------+         |               | |
                                                      |               | |
+-------------------------------------------+         |               | |
|            OCFS2 EXTENT BLOCK             | <-------+               | |
+-------------------------------------------+                         | |
| (points to attribute blocks or another    |                         | |
| extent block)                             |                         | |
+-------------------------------------------+                         | |
                                                                      | |
+-------------------------------------------+                         | |
|   OCFS2 EXTENDED ATTRIBUTE VALUE BLOCK    | <-----------------------+ |
+-------------------------------------------+                           |
|                                           |                           |
| (contents)                                |                           |
|                                           |                           |
+-------------------------------------------+                           |
                                                                        |
+-------------------------------------------+                           |
|   OCFS2 EXTENDED ATTRIBUTE VALUE BLOCK    | <-------------------------+
+-------------------------------------------+
|                                           |
| (contents)                                |
|                                           |
+-------------------------------------------+
CHANGES
Mon Jul 24 19:20:05 EDT 2006 jeffm
- Initial version
Tue Jul 25 17:37:43 EDT 2006 jeffm
- Renamed ocfs2_xattr_block to ocfs2_xattr_block_rec
- Added allocation, fs generation, and OCFS2-style block signature information to xattr block header.
- Split ocfs2_xattr_header into ocfs2_xattr_block and ocfs2_xattr_inode_header to avoid duplication of data between inode and header when occupying the same block.
- Replaced ocfs2_xattr_entry_header with new ocfs2_xattr_block.
- Cluster inode xattr lock changed to optional performance enhancement.
- Changed indirect entry values from block pointers to ocfs2_extent_list to allow large xattrs with minimal effort.
- Added 7 bit type information to entry to allow removal of name prefix, saves 5+ bytes per name.
- Added 1 bit local field to allow removal of 16 bit xe_block_count from entry since the count is contained in the ocfs2_extent_list
- Changed xattr size to 32 bit value to accommodate values > 64k
- Added additional visual cases to describe new extent cases
Tue Jul 25 18:22:46 EDT 2006 jeffm
- Moved name before value, since xe_value_size describes total value of xattr. This is fine when the value is local, but when the value is an ocfs2_extent_list, the value's size will be larger than the actual size of the ocfs2_extent_list. The ocfs2_extent_list is dynamically sized and self describing, so placing it immediately after the name will allow us to locate it easily using name offset + name size.
- Aligned structures to 64 bit boundaries
Wed Jul 26 12:02:25 EDT 2006 jeffm
- Dropped ocfs2_xattr_block_rec in favor of using ocfs2_extent_rec since padding ocfs2_xattr_block_rec ended up making them the same size and hashed directories will use the e_cpos as a hash value too.
Wed Jul 26 23:30:27 EDT 2006 jeffm
- Restructured ocfs2_xattr_header so that it could function both as the in-inode header as well as the header used in ocfs2_xattr_block.
- ocfs2_xattr_block now contains a prologue containing identification information but otherwise uses the same header as the inode.
- Added i_flag usage to mark inline xattrs; with the flag set, i_xattr_loc now contains an offset into the local inode where the header is located. It is no longer located at the end of the block, and can be placed anywhere.
- Defined some limitations of the extent b-tree.
- Reformatted document for Wiki inclusion.