OCFS2 Dentry Removal Strategy
Mark Fasheh <mark dot fasheh at oracle dot com>
August 30, 2006
Goals / Requirements
To remove the current dentry vote mechanism. A replacement should:
- Perform better in the common case
- Keep a very fast ->d_revalidate() callback
- Maintain a high level of consistency with respect to dentry updates in the file system
- Not require any direct file system messaging
The last requirement in particular will put OCFS2 one step closer to not needing any node information, which will in turn allow us to switch cluster stacks more easily.
Problem Description
The problem is one of keeping the dentry cache between nodes consistent. In a world without rename or unlink, this is pretty easy because we can just let nodes build up names via ->lookup() and just take cluster locks to protect the directory during read. Once a node wants to remove or rename a dentry however, things get much more difficult, and some sort of node synchronization is required.
Votes
Currently the OCFS2 file system employs a vote mechanism to maintain dcache consistency during rename or unlink. In short, a vote is a network message which the file system transmits to all nodes that are concurrently mounted. The vote contains three values of importance:
- Parent directory inode #
- Inode #
- Name
Dentry votes against a given name within a directory are serialized by the exclusive lock on the parent directory. This way, multiple concurrent unlink requests within a directory won't stomp on each other. Similarly, the map of currently mounted nodes is kept static by holding a read level lock on the super block lock during the entire vote operation.
Upon receiving the vote, each node does an ilookup() on the inode and walks the i_dentry list, looking for a dentry whose d_parent and d_name match those in the vote. If one is found, it is passed to d_delete(), which unhashes it from the dcache, making it impossible for a process to find later.
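The matching walk can be sketched in userspace C. The structures below are simplified stand-ins for the kernel's dentry and inode, and process_vote() is a hypothetical helper; it illustrates only the matching logic, not the locking or refcounting that surrounds the real walk:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Toy models of the dcache structures, just to illustrate the walk --
 * the real kernel types are of course much richer. */
struct toy_dentry {
	uint64_t d_parent_ino;		/* stands in for d_parent */
	const char *d_name;
	int hashed;			/* 1 while visible to lookups */
	struct toy_dentry *next;	/* i_dentry list linkage */
};

struct toy_inode {
	uint64_t i_ino;
	struct toy_dentry *i_dentry;	/* alias list head */
};

/* What each node does on receiving a vote: walk i_dentry for a dentry
 * matching the (parent dir, name) pair and unhash it, as d_delete()
 * would.  Returns the dentry that was unhashed, or NULL. */
static struct toy_dentry *process_vote(struct toy_inode *inode,
				       uint64_t parent_ino,
				       const char *name)
{
	struct toy_dentry *d;

	for (d = inode->i_dentry; d; d = d->next) {
		if (d->d_parent_ino == parent_ino &&
		    strcmp(d->d_name, name) == 0) {
			d->hashed = 0;	/* d_delete(): lookups no longer find it */
			return d;
		}
	}
	return NULL;
}
```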
ocfs2_unlink() sends a vote for the name being removed. ocfs2_rename() will send up to two votes - one for the name being moved, and an optional one for the name being replaced, if the rename also results in an unlink. It might be possible to get fancy with rename, but it's really not worth complicating an already very complicated process.
There are a number of issues with this scheme:
- It always has worst-case performance
- It requires a messaging layer within the file system, which has a number of its own problems
- Messaging requires that userspace push node information (node #, ip address, etc) up into the file system
- The existence of any votes requires detailed mount maps to be stored, which in turn actually requires an additional set of mount/unmount votes.
- Inode lifetiming becomes more complicated, because we now have an additional thread which might take and drop inode references without having gone through the normal steps (i.e., ->lookup(), which holds parent directory locks)
Dentry Locks
In short, my proposed solution is that we take advantage of the notification and node tracking features in the DLM to ensure that communication about dentry events only happens to those nodes which actually have a reference on a given name.
To do this, we replace the broadcast mechanism with a cluster lock which covers a set of dentries.
A node which has done a lookup on a name retains a protected read lock on the dentry until final dput(). If the user requests an unlink or rename, the protected read is upgraded to an exclusive lock. Other nodes which have seen the dentry are then informed that they need to downgrade their lock, which involves a d_delete() on the dentry.
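As a rough illustration of that downgrade sequence, the toy model below simulates two nodes. The types and helpers (toy_node, handle_blocked_ast(), request_ex()) are hypothetical and elide all of the real DLM machinery; they only make the PR-to-EX upgrade and the resulting d_delete() concrete:

```c
/* Hypothetical lock levels, standing in for the DLM's PR and EX modes. */
enum lock_level { LVL_NONE, LVL_PR, LVL_EX };

struct toy_node {
	enum lock_level level;	/* this node's dentry lock level */
	int dentry_hashed;	/* 1 while the name is visible locally */
};

/* A remote node asked for EX: we must downgrade, which first
 * d_delete()s our cached dentry, then drops the lock entirely. */
static void handle_blocked_ast(struct toy_node *node)
{
	if (node->level == LVL_PR) {
		node->dentry_hashed = 0;	/* d_delete() the alias */
		node->level = LVL_NONE;		/* downconvert / drop the lock */
	}
}

/* The unlinking node upgrades its PR to EX; in the real DLM this only
 * completes once every other holder has downgraded. */
static int request_ex(struct toy_node *self, struct toy_node *other)
{
	handle_blocked_ast(other);	/* DLM notifies the other holder */
	if (other->level == LVL_NONE) {
		self->level = LVL_EX;
		return 1;
	}
	return 0;
}
```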
Lock Naming
The first problem we run into is naming the new lock resource. A lock name needs to be universally agreed upon by all nodes, without network communication. Also, we typically have one lock cover one resource (in this case, one lock per dentry).
The most obvious solution of putting the dentry name in the resource name won't work, because DLM lock names can only be 32 bytes long. Increasing this value is not feasible.
Storing the directory offset of a name will work until OCFS2 supports indexed directories (offsets can change due to inserts).
Instead, we choose to use a triple of lock type id, parent directory inode # and inode #. This has the disadvantage that the lock will cover all hard links to an inode within a given directory. We are not particularly concerned with this constraint, as hard links are a rare case, and handling multiple dentries per lock callback only adds a minor amount of overhead.
At the time of this writing, all OCFS2 lock resource names are stored in plain text - integer values are printed as hexadecimal strings. This has been a huge win in terms of debuggability. Debug and error prints can simply pass the resource name directly to the console, which means we can pinpoint problems very quickly.
Unfortunately, the lock resource name size doesn't have enough room for two pretty printed 64 bit inode numbers. We then make the compromise that the inode block # will be stored as a binary value. We can precede it with a NULL character, so that pretty prints are still possible (just less useful). The debugfs lock state interface can manually extract the binary value and pretty print the lower 32 bits, which will aid debugging in all but the largest file system sizes.
The dentry lock name looks like this:
Field Description                                   Size
-------------------------------------------------   --------
A lock type identifier, 'N'                         1 byte
Parent directory inode #, as a hexadecimal string   16 bytes
A single NULL character                             1 byte
Binary representation of the inode #                8 bytes
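The layout above can be sketched with a small userspace helper. build_dentry_lock_name() is a hypothetical stand-in for the in-kernel name construction; it assumes the 'N' type identifier and the 18-byte offset (1 + 16 + 1) described above:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DENTRY_LOCK_INO_START	18	/* 'N' + 16 hex chars + NULL */
#define LOCK_NAME_MAX		32	/* DLM lock name size limit */

/* Hypothetical helper mirroring the naming scheme described above. */
static void build_dentry_lock_name(char name[LOCK_NAME_MAX],
				   uint64_t parent_blkno,
				   uint64_t inode_blkno)
{
	/* snprintf() writes the 'N' type id, the hex parent inode #
	 * and the terminating NULL -- 18 bytes in total. */
	snprintf(name, DENTRY_LOCK_INO_START, "%c%016llx",
		 'N', (unsigned long long)parent_blkno);
	/* The inode block # follows as a raw binary value. */
	memcpy(&name[DENTRY_LOCK_INO_START], &inode_blkno,
	       sizeof(inode_blkno));
}
```

The embedded NULL is what keeps pretty prints of the name working: a print of the string stops right before the binary portion.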
As a side note, the OCFS2 developers tend to feel that the time might be right for us to move away from string representations of lock names. This would require some userspace updates however, and it was decided that this should not be part of the dentry vote removal plan.
Tracking Structure
struct ocfs2_dentry_lock {
	atomic_t		dl_count;
	atomic_t		dl_aliases;
	u64			dl_parent_blkno;

	/*
	 * This struct inode reference is piggy-backed on those of all
	 * the dentries involved.
	 */
	struct inode		*dl_inode;

	struct ocfs2_lock_res	dl_lockres;
};
There are two counts on struct ocfs2_dentry_lock.
- dl_count
Governs the lifetime of the object. When this drops to zero, the dl_lockres structure is torn down and the whole thing is freed.
- dl_aliases
Tracks the total number of dentries which refer to this object. This way the downconvert code can know when to stop looking for aliases (the struct inode reference may not be valid after the last alias is torn down).
An attached dentry has a reference on both counts.
The remaining fields are fairly straightforward:
- dl_parent_blkno
Used by the downconvert thread to find appropriate aliases to d_delete()
- dl_inode
Every inode has a list of aliases on i_dentry. This is the list we walk to find hard links.
- dl_lockres
The actual lock resource which covers the attached aliases.
Object Creation
When a new dentry is linked into the system, we first pass it to ocfs2_dentry_attach_lock(), which will search the inode alias list for an existing (and matching) struct ocfs2_dentry_lock. If one exists already, references are taken, and a pointer to it is stored in the d_fsdata dentry field. Otherwise a new one is created and similarly stored on the dentry. This function also takes the PRMODE lock to hook into any other nodes which might care about the dentry.
Callers of the function must hold parent directory semaphores and cluster locks:
- The parent directory semaphore will protect us from having to worry about concurrent processes on our node trying to attach a new lock at the same time.
- The parent directory cluster lock (held at either PR or EX mode) protects us from unlink and rename on other nodes.
A dput() can happen asynchronously due to pruning, so we cover attaching and detaching the dentry lock with a dentry_attach_lock spinlock.
The following functions call ocfs2_dentry_attach_lock(): ocfs2_lookup(), ocfs2_mknod(), ocfs2_symlink()
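The find-or-create step in ocfs2_dentry_attach_lock() can be sketched as follows. The structures and the attach_lock() helper are hypothetical simplifications; in particular, the real code walks the inode's i_dentry alias list under the dentry_attach_lock spinlock and relies on the parent directory semaphore, none of which is modeled here:

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures, to illustrate only the
 * find-or-create logic; the names are hypothetical. */
struct toy_dentry_lock {
	int dl_count;			/* object lifetime */
	int dl_aliases;			/* attached dentries */
	uint64_t dl_parent_blkno;
	struct toy_dentry_lock *next;	/* per-inode list linkage */
};

struct toy_inode {
	struct toy_dentry_lock *locks;	/* stands in for walking i_dentry */
};

/* Find a lock covering (inode, parent dir) or create one -- the same
 * choice ocfs2_dentry_attach_lock() makes.  The caller is assumed to
 * hold the parent directory semaphore, so no locking is shown. */
static struct toy_dentry_lock *attach_lock(struct toy_inode *inode,
					   uint64_t parent_blkno)
{
	struct toy_dentry_lock *dl;

	for (dl = inode->locks; dl; dl = dl->next)
		if (dl->dl_parent_blkno == parent_blkno)
			break;

	if (!dl) {
		dl = calloc(1, sizeof(*dl));
		dl->dl_parent_blkno = parent_blkno;
		dl->next = inode->locks;
		inode->locks = dl;
	}
	/* Each attached dentry holds a reference on both counts. */
	dl->dl_count++;
	dl->dl_aliases++;
	return dl;
}
```

Note how two hard links in the same directory end up sharing one lock object, as described in the naming section.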
Object Removal
We use the ->d_iput() callback to manage object removal. ocfs2_dentry_iput() will be called by the dcache when a dentry is being destroyed. This can happen due to pruning (reclaiming memory), during unmount, or after a dentry has been passed to d_delete(). The dentry is unhashed and removed from the inode alias list before the ->d_iput() callback.
ocfs2_dentry_iput() drops references, and will drop the cluster lock from the system (and kfree()) on last ref.
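The two-stage teardown can be sketched like this. The struct and the dentry_lock_put() helper are hypothetical simplifications of what ocfs2_dentry_iput() does for each detaching dentry:

```c
#include <stdlib.h>

/* Toy version of the tracking structure, with just the two counts
 * that matter for teardown; the names are hypothetical. */
struct toy_dl {
	int dl_count;	/* object lifetime */
	int dl_aliases;	/* attached dentries */
};

/* One detaching dentry drops its alias reference and its lifetime
 * reference; the last lifetime reference tears down the lock resource
 * and frees the structure.  Returns 1 if the object was freed. */
static int dentry_lock_put(struct toy_dl *dl)
{
	dl->dl_aliases--;	/* no more reason to walk i_dentry for us */
	if (--dl->dl_count == 0) {
		/* Here the real code drops the cluster lock from the
		 * DLM before freeing. */
		free(dl);
		return 1;
	}
	return 0;
}
```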
A race may occur when one process is doing lock shutdown and another process wants to create a new dentry lock. Right now we let them race, which means that for a very short while, a node might have two locks on a lock resource. This should not be a problem though because one of them is in the process of being thrown out.
Removal of a dentry might be forced via d_delete() in the downconvert thread, via ocfs2_dentry_convert_worker(). The thread takes an elevated reference on the struct ocfs2_dentry_lock while walking the dcache, so that it won't try to destroy the lock resource it is downconverting. The reference is dropped later, after the downconvert is complete and a drop of the lock won't deadlock the thread or cause it to reference freed memory.
Unlink
Once it has cluster locks on the parent directory and the inode being unlinked, ocfs2_unlink() will take an exclusive lock on the dentry lock. This ensures that all other nodes run ocfs2_dentry_convert_worker(), d_delete()ing their cached aliases before the unlink proceeds.
Rename
As far as the dentry locking is concerned, rename is still treated as two unlinks.
After the dentries have been unhashed on the other nodes, rename potentially needs to build up a new lock for the name that was moved (it might have been moved to a different directory). ocfs2_rename() calls ocfs2_dentry_move(), which actually handles this case by manually de-referencing the existing dentry lock structure and creating a new one via ocfs2_dentry_attach_lock().
Normally, the VFS calls d_move() after a successful rename in order to perform the "rename" of the dentry. There is a very small race with rename and d_move(), however. The d_move() for a rename is done after the ->rename() callback, outside the parent directory cluster locks which normally protect the name creation/removal mechanism. The race is that after ocfs2_rename(), but before the d_move(), a different node can discover the new name (created during the rename) and unlink it, forcing a d_delete(). d_move(), it seems, unconditionally rehashes the (renamed) dentry, so if it's done after a d_delete() we could get some inconsistent state amongst the nodes.
OCFS2 fixes this race by performing the d_move() internally (via ocfs2_dentry_move()), while protected by cluster locks. There is a small patch to the VFS required which makes this possible. At the time of this writing, the patch is in Andrew Morton's -mm tree and is expected to make its way upstream.
DCACHE_DISCONNECTED
Due to its stateless nature, NFS can occasionally leave dentries in the DCACHE_DISCONNECTED state. This flag means that the dentry in question is the root of a tree which has not yet built up a connection to the rest of the fs tree. OCFS2 has the opportunity to do any setup on these in the ocfs2_get_dentry() and ocfs2_get_parent() NFS callbacks. Today, we do not build up any lock structures, but we set the operations callbacks so that ocfs2_dentry_revalidate() can see the dentry.
ocfs2_lookup() calls d_splice_alias() which will d_move() a DCACHE_DISCONNECTED dentry if it finds one. The code in ocfs2_lookup() has to detect this (via return value from d_splice_alias()), and only attach a dentry lock to the correct dentry.
Disadvantages
Hopefully there aren't too many actual disadvantages to this scheme. The two most obvious ones that spring to mind are:
- This uses more memory than the votes do
- Adding lock mastery (especially via our existing dlm) will add overhead to our find() / stat() type operations
  - We can regain this overhead (and more) by eliminating the double disk read in stat, and by moving to a new dlm with faster lock mastery