Joel Becker, Oracle
November 19, 2008
The refcount tree design document describes a way for more than one ocfs2 inode to share a physical data extent. It provides Copy-on-Write (CoW) functionality for modifications to this shared extent, preserving the original data extent for any other inode referencing it. It does not describe any way to create these shared trees. This is intentional, as the refcount tree is intended to be a generic data structure that operates with no special cases.
This document describes the REFLINK operation, the only operation needed to create and manage these shared structures.
Reference Counted Links
The REFLINK operation creates "reference counted links", or reflinks. The operation is analogous to the link(2) operation, except that writes do not modify the shared data, they use CoW.
That is, a reflink starts as a new inode whose extent tree points to the data clusters of the source inode. A refcount tree is used to refcount the linkage. Over time, writes to any of the reflinked inodes cause CoW to create unshared data clusters.
The REFLINK Operation
The REFLINK operation will be an ioctl(2) against an ocfs2 filesystem, but it is easiest to imagine as a new system call reflink(2):
int reflink(const char *oldpath, const char *newpath);
If REFLINK returns success, newpath is created. The inodes for oldpath and newpath share a refcount tree, and their extent trees point to the indentical set of data clusters.
If newpath exists, the REFLINK operation MUST fail.
REFLINK, like link(2), is disallowed on directories. If oldpath is a directory, -EPERM is returned as in link(2). In fact, all of the error returns from link(2) map to REFLINK.
Technically, refcount trees and the REFLINK operation would work fine on directories. We disallow directories because of the expectation that REFLINK on a directory would include reflinking all of the directory's children, and that isn't what happens. Without that, there is no point in reflinking a directory.
While the description pretends the existence of a system call, in reality this will be OCFS2_IOC_REFLINK, and the argument will be a structure with pointers to the two paths.
Pretending it is a Syscall
Why pretend this way? Because REFLINK creates a new path entry, we want to emulate the VFS operations building up to the call. So, the underlying ocfs2_reflink() function will look just like ocfs2_link():
static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry);
The REFLINK ioctl(2) will look like sys_link(7). It will do the path_walk(7) and lookup_create(7) calls to get the inode and dentry bits needed, then call ocfs2_reflink().
This keeps ocfs2_reflink() clean. And if anyone else can implement REFLINK, we can make it a system call without much trouble.
Reflinks of Reflinks
The reference count fields of a refcount tree are 32bit values. Thus, four billion inodes can share the same refcount tree and any combination of physical extents.
A simple example (blocksize and clustersize of 4K):
# dd if=/dev/random of=orig bs=16K count=1 # reflink orig copy1 # reflink copy1 copy2 # dd if=/dev/random of=orig bs=4K count=1 seek=1 # reflink orig copy3
Assume, for the sake of argument, that the dd(1) created four 4K extents on orig. After the first REFLINK, orig and copy1 share a refcount tree. Each extent in both files is flagged with REFCOUNTED, and the reference count of each extent is two.
Next, we REFLINK copy2 to copy1. This creates a new inode. Its extent tree points to the very same clusters, and the extents are flagged with REFCOUNTED, and the reference count of each physical extent is incremented to three. The refcount tree is now shared by all three inodes.
We now write 4K at an offset of 4K. This is the second logical extent of file orig. The refcount tree code does a CoW operation, allocating a new 4K cluster for orig. This replaces the second extent, clearing the REFCOUNTED flag on that extent. The refcount for the replaced cluster goes down to two, as copy1 and copy2 still refer to it.
Finally, we REFLINK copy3 against orig. It will share the first, third, and fourth extents with all of the other files. Those references go up to four. Meanwhile, the second extent of orig is added to the refcount tree; orig and copy3 point to it, and it has a refcount of two.
This can go on and on. The simple set of operations against the refcount tree ensures a consistent behavior.
The Guts of REFLINK
The actual REFLINK operation is simple to define for ocfs2 as well. Other filesystems will implement it different. There are two stages. First, the source inode is prepared, and then the target inode is constructed.
The preparation of the source inode could not be simpler.
- If the source inode does not have a refcount tree, one is added. This is a single, isomorphic transaction.
- For each extent
- If the extent is already REFCOUNTED, nothing is done
- If the extent is not REFCOUNTED, it is added to the refcount tree and marked as REFCOUNTED. This is a single, isomorphic transaction.
We commit at each step, because after each step, the inode is in a consistent state. The transactions are small, simple and isomorphic. The filesystem doesn't need to clean up on error or on a crash (other than journal recovery). When the preparation is done, every extent on the inode is in the refcount tree. If this is the first time the inode is reflinked, they'll have a refcount of one. An inode that is already partially or fully reflinked will have some extents already in the refcount tree with higher refcounts.
Now we can build the target inode.
- We know, from navigating the source inode, exactly how much metadata is needed for this transaction.
- The new inode is allocated.
- The new inode is added to the orphan directory so that it can be erased if we crash before completing the operation. This ensures we don't create half-reflinked inodes.
- For each extent in the source inode, an entry is added to the new inode's extent tree. The refount is incremented in the refcount tree.
- The new inode is linked into place at newpath.
Well, almost. The process must be repeated for all data extents attached to an inode, including extended attribute buckets and any external extended attribute values. Basically, all data extents attached to an inode must be linked out. You can't have the new inode modifying extended attributes on the original inode. But the operation follows the same steps, with the same guarantees.
The goal was to create a generic operation, REFLINK, built atop a generic shared extent structure, the refcount tree. With the REFLINK operation (and the reflink(1) command wrapping it), we can now build useful tools.