OCFS2 SPARSE TREE UPDATES
Terms
- Node
For our purposes this refers to an ocfs2_extent_list structure, either within the ocfs2_dinode or within an ocfs2_extent_block.
- Empty extent record
Also known as a record whose e_clusters value is 0. Typically this means that the entire record is zero'd. ocfs2_add_branch() however will leave an empty extent with e_cpos filled in. Therefore we only key on e_clusters. Empty records do not signify the existence of a hole. They are simply a result of temporary tree changes, and should be ignored by everything except the insert code.
- Splitting record
- The extent record # within a leaf node which should be shifted right or left, leaving an empty extent record in it's place.
- Root node
- For a rotation which involves two leaf nodes, the "root node" is the lowest level tree node which contains a path to both leafs. This basically forms a subtree with that node at the root.
Structure Review
/* * On disk extent record for OCFS2 * It describes a range of clusters on disk. */ struct ocfs2_extent_rec { /*00*/ __le32 e_cpos; /* Offset into the file, in clusters */ __le32 e_clusters; /* Clusters covered by this extent */ __le64 e_blkno; /* Physical disk offset, in blocks */ /*10*/ }; /* * On disk extent list for OCFS2 (node in the tree). Note that this * is contained inside ocfs2_dinode or ocfs2_extent_block, so the * offsets are relative to ocfs2_dinode.id2.i_list or * ocfs2_extent_block.h_list, respectively. */ struct ocfs2_extent_list { /*00*/ __le16 l_tree_depth; /* Extent tree depth from this point. 0 means data extents hang directly off this header (a leaf) */ __le16 l_count; /* Number of extent records */ __le16 l_next_free_rec; /* Next unused extent slot */ __le16 l_reserved1; __le64 l_reserved2; /* Pad to sizeof(ocfs2_extent_rec) */ /*10*/ struct ocfs2_extent_rec l_recs[0]; /* Extent records */ }; /* * On disk extent block (indirect block) for OCFS2 */ struct ocfs2_extent_block { /*00*/ __u8 h_signature[8]; /* Signature for verification */ __le64 h_reserved1; /*10*/ __le16 h_suballoc_slot; /* Slot suballocator this extent_header belongs to */ __le16 h_suballoc_bit; /* Bit offset in suballocator block group */ __le32 h_fs_generation; /* Must match super block */ __le64 h_blkno; /* Offset on disk, in blocks */ /*20*/ __le64 h_reserved3; __le64 h_next_leaf_blk; /* Offset on disk, in blocks, of next leaf header pointing to data */ /*30*/ struct ocfs2_extent_list h_list; /* Extent record list */ /* Actual on-disk size is one block */ };
Goals
- Use existing structures as much as possible. If fields have to be added, they don't alter the extent_block and extent_list structures in a way which isn't forward compatible.
- Very well understood tree updates - put a premium on clean and understandable code. This is especially important here as the complexity of tree updates increases.
- Failure cases are very simple and well understood. Tree updates should be broken up into "atomic" (cannot fail) updates, with state in between updates well understood and documented.
- Make full re-use of the existing tree code. This should not be hard as the operations done already follow the other rules, and are still quite important for sparse files. Essentially we're only "adding" new types of tree updates.
A short overview of existing atomic tree updates
Tree depth shift
This is done when the allocation tree has reached its maximum size.
The act of adding a whole level to the tree. A new ocfs2_extent_block is allocated and fully prepared by filling in values. It is then inserted between the ocfs2_dinode and the rest of the tree.
See function: ocfs2_shift_tree_depth()
Adding a branch
This is done when all existing bottom most nodes are full but one or more of the upper tree nodes have space.
The new branch is 'empty' in the sense that every block will contain a single record with e_clusters == 0
In the case of failure after this update, truncate and extent_map code completely understand the concept of empty extent blocks.
See functions: ocfs2_add_branch(), ocfs2_find_branch_target()
Insert an extent record
This is also done with new extents which are contiguous to an existing record as all but the last step of the operation is the same.
Expects the tree to already have room in the rightmost leaf for the extent. Travels down the tree to the new record location, updating one record in each extent block on the way.
The algorithm needs minimal changes to do the same thing for non tail extents - the hard work should already be done beforehand.
See function: ocfs2_do_insert_extent()
New Atomic Tree Updates
None of these operations are performed for extents which can be simply appended to a leaf node. So the appending allocation case only makes use of the existing tree code.
Both the rotation functions don't actually insert the new extent - they simply create an empty extent record in the place where the new one will be inserted.
Rotate branches right
We only change up to two adjacent leaf nodes at a time. To do this, the function is given the path from a root node to the two leaf nodes. The root node is the lowest level tree node which contains a path to both leafs. You can consider this to be the root of the subtree which we will rotate nodes in. Nothing above the root node needs to change, and thus is not passed into this function.
The mechanism to determine which part of the tree needs update is actually very similar to how ocfs2_find_branch_target() and ocfs2_add_branch() function today.
Also passed to the function is the "splitting record" - the extent record # within the left leaf which will be shifted right, leaving an empty extent in it's place.
In the best-case scenario that the empty extent is to be created in a rightmost leaf block, the root node and left leaf values can be empty and we simply rotate within the right leaf using the "splitting record" value.
The right leaf node must have room for one additional record, or have an empty first record.
At right leaf node: If the leftmost record is not empty, shift records to the right and create an empty leftmost record.
At left leaf node: Call R the rightmost left leaf record. Copy R into the leftmost right leaf record position (aka, position 0). Shift the remaing left leaf records to the right, leaving an empty record at the splitting point.
Now we travel backwards, up the tree updating each pair of nodes between the leaves and the root node.
For the left node: Call A the rightmost record, then: A->e_clusters -= R->e_clusters;
For the right node: Call B the leftmost record, then: B->e_cpos -= R->e_clusters; B->e_clusters += R->e_clusters;
At the root node, we find the two adjacent records which begin our path to the leaves. If the record on the left is A, and the record on the right is B, then: A->e_clusters -= R->e_clusters; B->e_cpos -= R->e_clusters; B->e_clusters += R->e_clusters;
See function: ocfs2_rotate_subtree_right()
Rotate branches left
This operates on a maximum of two branches simultaneously.
This is done when inserting a extent which is contiguous with two ajacent leaf node extent records (one on the left, one on the right). Ultimately, that particular record insert will result in the two records being "joined" into one, though it doesn't happen in the rotate code.
I think that for now this might be optional. It's interesting to note which case this helps though, which is why the operation is noted here.
High level `ocfs2_rotate_tree_right()` walkthrough
The purpose of this function is to prepare a series of subtrees, from right to left, to pass to ocfs2_rotate_subtree_right().
This section will also describe a series of support functions which ocfs2_rotate_tree_right() might call.
ocfs2_is_empty_extent(): Determine whether a given ocfs2_extent_rec can be considered empty.
ocfs2_find_split_rec(): Given a record to insert, and an extent list, return the array index where the next empty extent record should be created.
This function will simply skip past existing empty records, so any insert code that wishes to optimize by using those should check the records immediately before the returned index.
ocfs2_create_empty_extent(): Given an extent list and a position, create an empty extent at position in the list. Will also increment l_next_free_rec.
ocfs2_find_lowest_cpos(): Find the e_cpos value of the 1st non empty extent within the given extent list.
ocfs2_find_path(): Given an empty buffer_head array and a cpos value, this function will traverse the inode tree in search of the first record greater than or equal to cpos. The path traveled is recorded in the array.
ocfs2_find_subtree_root(): This function is passed the dinode buffer and two full paths from the dinode down to a pair of adjacent leaves. It's task is to figure out which buffer is the subtree root - this can be the dinode itself in a worst-case rotation.
To help the caller optimize which buffers to pass to the rotation function, we pass back the path array index which is the beginning of the unique path portions.
ocfs2_rotate_tree_right(): We know for sure when we get here that there's no way we're doing a simple record append.
If we have tree depth (a simple extent list), we handle this case up front by simply calling ocfs2_find_split_rec() to determine an index to directly pass to ocfs2_create_empty_extent().
Allocate two arrays of buffer head pointers, sized tree_depth - 1. Call these "right_path" and "left_path"
Set a root_bh variable to NULL.
Fill in the right_path array by calling ocfs2_find_path() with a search of i_clusters. The intent here is to get the rightmost tree path.
Now loop, until we have no more subtree roots:
- Call "cpos" the return of find_lowest_cpos() on the right path's leaf block.
If "cpos" is greater than our insert record->e_cpos:
- Fill in the left_path array via ocfs2_find_path of cpos - 1 Use ocfs2_find_subtree root to fill in the root_bh variable.
High level ocfs2_insert_extent() walkthrough
Modify the function prototype to take a cpos value:
int ocfs2_insert_extent(struct ocfs2_super *osb, struct ocfs2_journal_handle *handle, struct inode *inode, struct buffer_head *fe_bh, u32 cpos, u64 start_blk, u32 new_clusters, struct ocfs2_alloc_context *meta_ac);
if we have tree depth, el = last_eb_blk extent list. else el = dinode extent list. i = el->l_next_free_rec - 1
Determine whether the extent list has room for another extent record:
free_recs = 0; if ((el->l_next_free_rec == 0) || (el->l_next_free_rec < el->l_count) || (el->l_recs[i].e_clusters == 0)) free_recs = 1;
Find insert point. Determine the kind of insertion we will be doing:
appending_insert = true if we are only adding to the end of the tree. insertion_type = one of the following: NEW_RECORD LEFT_CONTIG RIGHT_CONTIG LEFT_RIGHT_CONTIG //not supported yet
/* * If any of the following is true, we don't need any additional tree updates: * 1) We're contiguous with an existing extent. * 2) There is room in the rightmost leaf block. */ if (insertion_type == LEFT_CONTIG || insertion_type == RIGHT_CONTIG) goto out_add; if (free_last_eb_recs) goto out_add; add_to_tree:
This code doesn't change. In a nutshell, we figure out if we need a tree shift and do it. In all cases, we'll have to add a branch so we do that too. We need to skip the tree rotate in some cases though, so add a goto for that.
///if (appending_insert) /// goto out_add;
///rotate_tree:
///ocfs2_rotate_tree_right()
out_add:
ocfs2_do_insert_extent();
done!
TODO: Lets change things so that do_insert_extent will handle the rotation if need be.