[Ocfs2-tools-devel] questions for dx_dirs

Wed Sep 30 23:02:38 PDT 2009

On Thu, Oct 01, 2009 at 08:59:15AM +0800, Coly Li wrote:
> > Almost - the only thing you got wrong is that we have a struct
> > ocfs2_dir_block_trailer at the end of every *block* in the unindexed tree
> > (you said "extent"). Otherwise, as you noted it looks exactly like the
> > standard dirent tree that we share with ext*.
> > 
> > So instead of per-cluster or per-extent, this is a per-block structure and
> > instead of the beginning of the block, we placed it at the end (which made
> > more sense because then dirents like '.' and '..' can keep their well-known
> > locations).
> > 
> 
> So ocfs2_dir_block_hdr is also a per-block structure (not per-cluster, not
> per-extent) ?
> 
> And in every block of the unindexed tree, the first record is
> ocfs2_dir_block_hdr, and the last record is ocfs2_dir_block_trailer, the rested
> records between these 2 records are ocfs2_dir_entry ?

ocfs2_dir_block_hdr does not exist - it's only an 'ocfs2_dir_block_trailer'
at the very end of each unindexed block which we store. Everything before
the trailer is a dirent.

> > Just to recap regarding the actual disk structures:
> > 
> > struct ocfs2_dx_root_block -> dr_free_blk
> > Pointer to the start of the list of free unindexed blocks
> > 
> > struct ocfs2_dir_block_trailer -> db_free_next
> > Pointer to next record on list. 0 is end of list.
> > 
> > 
> 
> My understanding was,
> 1) ocfs2_dx_root_block->dr_free_blk maintains free blocks (all ocfs2_dir_entry
> records are free)
> 2) ocfs2_dir_block_trailer->db_free_next maintains the free space inside an
> unindexed tree block (the free space comes after an previous dentry gets removed).
> 
> After checking the kernel code, I don't find anything support 2). Do I
> mis-understand something ?

Think of it as a singly linked list where each list item is a directory
block. The next item in the list is pointed to by
ocfs2_dir_block_trailer->db_free_next. The head of the list can be found via
ocfs2_dx_root_block->dr_free_blk.

If we want to know what size dirent will fit in a given block, we check
ocfs2_dir_block_trailer->db_free_rec_len. This value is kept up to date
during insertion/removal, sometimes by scanning the entire block.

In poor ASCII art:

[ ocfs2_dinode.i_dx_root ] ->
[ ocfs2_dx_root_block.dr_free_blk] ->
[ ocfs2_dirent,...,ocfs2_dir_block_trailer.db_free_next ] ->
[ ocfs2_dirent,...,ocfs2_dir_block_trailer.db_free_next ] ->
and so on, until db_free_next == 0.

> >> 3) in unindexed tree, use {db_free_prev, db_free_next} to maintain free slot
> >> within an extent block; in indexed tree, use ocfs2_dir_block_trailer to maintain
> >> the free slots within an ocfs2_idir_dirent ?
> > 
> > In the unindexed tree, we use the db_free_next field in the per-block
> > structure of ocfs2_dir_block_trailer to maintain a list of free blocks.
> > 
> 
> What I am confused is here :-)
> 
> When a dentry gets removed from the directory file, a record is freed within an
> unindex tree block. There should be something to maintain this kind of free
> space, so next time when allocating space to new dentry, this free space might
> be re-used.

FYI, A good function to look at for this is ocfs2_delete_entry_dx() in
fs/ocfs2/dir.c.

So, for a dirent of size 'reclen', here's what we'll do:

A)
Remove the entry

B) We must first always update db_free_rec_len:

	max_rec_len = ocfs2_find_max_rec_len(dir->i_sb, leaf_bh);
	trailer->db_free_rec_len = cpu_to_le16(max_rec_len);

C) If the previous value of ocfs2_dir_block_trailer->db_free_rec_len == 0
then we must add this unindexed block to the list:

	trailer->db_free_next = dx_root->dr_free_blk;
	dx_root->dr_free_blk = cpu_to_le64(leaf_bh->b_blocknr);

> > The indexed tree does not require or have any list of free blocks - each
> > traversal of the tree is a fast search (since it's indexed) and if the
> > target block has no free records, it is simply split into two blocks.
> 
> Right now, I am checking the code to find out
> 1) how the indexed tree is stored on disk.

The indexed tree is stored on disk as a standard ocfs2 btree rooted at a
struct ocfs2_dx_root_block->dr_list.

Each leaf cluster is broken up on block boundaries so we're only ever
searching one target block at a time. The structure contained within these
blocks is 'struct ocfs2_dx_leaf'. As you can see from ocfs2_fs.h,
ocfs2_dx_leaf contains a ocfs2_dx_entry_list which holds an array of
ocfs2_dx_entry. Most of the block is actually comprised of many of
struct ocfs2_dx_entry:

struct ocfs2_dx_entry {
	__le32		dx_major_hash;	/* Used to find logical
					 * cluster in index */
	__le32		dx_minor_hash;	/* Lower bits used to find
					 * block in cluster */
	__le64		dx_dirent_blk;	/* Physical block in unindexed
					 * tree holding this dirent. */
};

You'll note that in addition to storing hash values, this structure also
stores a pointer to a leaf block in the unindexed tree. This way we don't
have to duplicate dirent information. All dirents are stored in the
unindexed tree only.  

> 2) how to do the traversal in the indexed tree.

To look up an item in the indexed tree, we first generate a 64-bit hash from
the name+len (see ocfs2_dx_dir_name_hash()).

This hash is broken up into two 32 bit components - the major and the minor.

Finding the item in the tree takes two steps. First, we use the major hash
to give us a logical cluster offset which we look for in the tree. If that
cluster is found, we then determine which block within the cluster to search by the
value of the minor hash. Look at __ocfs2_dx_dir_hash_idx() to see how we
turn the minor hash into a block offset (from start of cluster).

Once the block has been read in, we do a linear search of the indexed block
and only return records where the major AND minor hash values match.

To get the full dirent, we then read the unindexed block at dx_dirent_blk
and search that for the match.

Probably the best functions to read for this would be ocfs2_dx_dir_lookup()
and ocfs2_dx_dir_search().

> 3) how to insert a new record into the indexed tree.

For this, you'll want to look at ocfs2_prepare_dx_dir_for_insert(). The gory
details are in ocfs2_find_dir_space_dx() which is called from there.

Essentially though, we repeat the search to find which block should be the
target for the new entry. If there is a block at that location and it has
space, we're done. If there isn't space, but the location exists, we split
it into two semi-equal sized blocks. There's more than I'm explaining here
of course, but my hands need a short typing break :)
	--Mark

--
Mark Fasheh