Reference Counted Links in ocfs2
Joel Becker, Oracle
November 20, 2008
Introduction
This series of design documents describes the generic REFLINK operation and the ocfs2-specific implementation thereof. The REFLINK operation creates a new inode that shares the data extents of a source inode in a Copy-on-Write (CoW) fashion.
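The sharing behaviour described above can be illustrated with a small toy model. This is only a sketch of the general idea of reference-counted, copy-on-write extents; the names Extent, Inode, reflink, and write_extent are hypothetical and do not correspond to ocfs2's on-disk structures or kernel code, which are covered in the linked design documents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Extent:
    """Hypothetical stand-in for a data extent with a reference count."""
    data: bytes
    refcount: int = 1


@dataclass
class Inode:
    """Hypothetical stand-in for an inode: just an ordered list of extents."""
    extents: List[Extent] = field(default_factory=list)


def reflink(src: Inode) -> Inode:
    """Create a new inode that shares every extent of src."""
    for ext in src.extents:
        ext.refcount += 1          # one more inode now points at this extent
    return Inode(extents=list(src.extents))  # new list, same Extent objects


def write_extent(inode: Inode, index: int, data: bytes) -> None:
    """Overwrite one extent, copying first if it is shared (CoW)."""
    ext = inode.extents[index]
    if ext.refcount > 1:
        # Shared extent: drop our reference and point at a private copy.
        ext.refcount -= 1
        inode.extents[index] = Extent(data=data)
    else:
        # Sole owner: safe to modify in place.
        ext.data = data


def read(inode: Inode) -> bytes:
    """Return the inode's contents by concatenating its extents."""
    return b"".join(ext.data for ext in inode.extents)


# Example: reflink a two-extent file, then modify the source.
src = Inode([Extent(b"aaaa"), Extent(b"bbbb")])
snap = reflink(src)
write_extent(src, 0, b"AAAA")
```

In this model, the write to src leaves snap unchanged: the first extent is copied for src, while the second extent stays shared between both inodes with a reference count of two.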
The design is the outcome of an exploration into inode snapshots started in early October 2008. In the end, the REFLINK operation is not limited to snapshots and enables a number of use cases.
Snapshots on ocfs2
ocfs2 is a general purpose extent-based shared-disk cluster filesystem. Some filesystems, like ZFS, btrfs, and WAFL, have a single tree that describes the entire filesystem. This makes snapshotting the volume, or even a subtree, pretty easy. Because ocfs2 uses block-based addressing, it does not have a single starting point to describe the entire filesystem. Implementing a snapshot system for the entire volume or a directory subtree is impractical in the ocfs2 code.
This isn't so bad, though, because ocfs2 does just fine with storage-assisted snapshots, where the underlying storage snapshots the LUN underneath the filesystem. High-end storage can already do this, and it works just fine with ocfs2. On the low end, LVM2 provides a snapshot capability. Once ocfs2 support for clvmd reaches production, this will be usable as well.
Single File Snapshots
LUN-based snapshots require snapping the entire LUN. This is impractical when one wants to save only a single file or a small group of files. For a filesystem, saving individual files means snapping inodes.
The REFLINK Operation
The design of inode snapshots turned into the generic REFLINK operation. These design documents describe the generic operation, the ocfs2 structures needed to support it, and some use cases.
Design Documents
There are three parts to this design.
The specifics of the refcount tree data structure, which allows ocfs2 inodes to share data extents, are documented in OCFS2/DesignDocs/RefcountTrees.
The REFLINK operation, which creates a reference counted link, is described in OCFS2/DesignDocs/ReflinkOperation.
The use of the REFLINK operation to create snapshots and clones is described in OCFS2/DesignDocs/ReflinkUses.
Conclusion
We hope this is a generic way to support these features in ocfs2. Any comments and criticisms are welcome. Our goal is a good design.