Using REFLINK to Create Tools
Joel Becker, Oracle
November 19, 2008
The design document for the REFLINK operation describes a generic method for creating inodes that share their data extents. This is nice and all, but what are real-world use cases?
The original impetus for this work was inode snapshots. While snapshots of whole volumes are nice, sometimes you just want a single file. Virtual machine hosts use single files as the backing store for a guest's disks. Snapshotting that single file is the same as snapshotting the guest's whole disk.
A simple tool can create read-only snapshots.
    #!/bin/bash
    #
    # snap.ocfs2 - create immutable snapshots
    #

    umask 333 && reflink "$1" "$2"
    exit $?
A smarter version of the tool would have an option to set the 'immutable' attribute of chattr(1). It might also make the destination optional; it can provide a default snapshot location such as .filename.snap or perhaps a .snapshot directory at the filesystem root.
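As a rough illustration of the optional destination, here is a minimal sketch that defaults to .filename.snap beside the source file. The function names default_snap_path and snap, and the REFLINK_CMD override, are hypothetical and not part of any existing tool. The override exists only so the logic can be tried with a stand-in command on a filesystem without reflink support.

```shell
#!/bin/bash
# Sketch of a "smarter" snap.ocfs2. All helper names here are hypothetical.

# Print the default snapshot path for a file: .<name>.snap in the same directory.
default_snap_path() {
    local src="$1"
    printf '%s/.%s.snap\n' "$(dirname -- "$src")" "$(basename -- "$src")"
}

# snap SRC [DEST]: create a read-only snapshot of SRC.
# REFLINK_CMD is an assumed override hook (e.g. REFLINK_CMD=cp for a dry run);
# by default the reflink command from the original script is used.
snap() {
    local src="$1"
    local dest="${2:-$(default_snap_path "$src")}"
    # The subshell keeps the restrictive umask from leaking into the caller.
    ( umask 333 && ${REFLINK_CMD:-reflink} "$src" "$dest" )
}
```

Calling "snap guest.img" would then write ./.guest.img.snap, while "snap guest.img /backup/guest.snap" keeps the explicit form of the original script.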
The reverse case is to create shallow clones of existing files.
Going back to the virtual machine example, a common problem is to bring up many virtual machines with the same operating system. While an administrator can certainly run N installs for N virtual machines, there are easier ways. An administrator might install a single virtual machine with a particular OS. Once the OS is configured, the administrator shuts down the VM. The administrator then copies that disk image N times, one for each VM required. The original disk image, the base image, is left alone for future use.
This is slow (copying multi-gigabyte files N times) and wasteful of space. The QEmu package has a solution for this in its disk images. A new QEmu disk image can be based on another base image. When the new image is created, it is actually empty. Only writes are stored in the new image. Unchanged blocks are read from the base image. Unsurprisingly, this sounds just like Copy-on-Write (CoW).
QEmu disk images are only usable by systems based on QEmu, including KVM. REFLINK, however, can be used by any hypervisor. The administrator takes the base image and creates a reflink copy for each VM required. The VMs open their image in read-write mode, and any blocks they write are subject to the CoW semantics of the refcount tree. Blocks that are never modified are shared with the original image. This makes creating a new image file very fast, and saves a ton of space for data that is unchanged.
    # reflink base-fedora-9.img host1.img
    # create-vm --name host1 --image host1.img
    # reflink base-fedora-9.img host2.img
    # create-vm --name host2 --image host2.img
    ...
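For more than a couple of guests, the reflink step is easy to loop. The sketch below is one possible shape. The function name clone_images and the REFLINK_CMD override are hypothetical, and the VM creation step is left out because create-vm stands in for whatever the hypervisor provides.

```shell
#!/bin/bash
# clone_images BASE NAME...: reflink BASE to NAME.img for each NAME given.
# REFLINK_CMD is an assumed override hook for dry runs, not a reflink feature.
clone_images() {
    local base="$1" name
    shift
    for name in "$@"; do
        # Stop at the first failure so a half-built set of images is noticed.
        ${REFLINK_CMD:-reflink} "$base" "$name.img" || return 1
    done
}
```

"clone_images base-fedora-9.img host1 host2 host3" would create three images that share every unmodified block with the base image. Setting REFLINK_CMD=echo prints the commands instead of running them.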
Restoring a Snapshot
Let's say there is a file foo. The file is snapshotted nightly with the command "snap.ocfs2 foo foo.$(date +%Y.%m.%d)". Today is 2008.11.20. Yesterday the file was corrupted, so you want to go back to the version from two days ago. It's simple.
    # unlink foo
    # reflink foo.2008.11.18 foo
Now foo is an exact copy of the snapshot, and it takes no extra space to boot. The CoW properties of the refcount tree will take hold when you start modifying foo.
Note that this can be wrapped with a command like "snap.ocfs2 --restore" if you like. The underlying operation is reflink(1).
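A sketch of what such a restore wrapper might do is below. It is not the two-command sequence verbatim: it reflinks the snapshot to a temporary name and then renames it over the target, so the target is never missing and is left untouched if the reflink fails. The function name restore_snapshot, the temporary name suffix, and the REFLINK_CMD override are all hypothetical.

```shell
#!/bin/bash
# restore_snapshot SNAPSHOT TARGET: replace TARGET with a reflink of SNAPSHOT.
# REFLINK_CMD is an assumed override hook (e.g. REFLINK_CMD=cp for testing).
restore_snapshot() {
    local snap="$1" target="$2"
    # Temporary name in the same directory as TARGET, so the final
    # rename stays within one filesystem.
    local tmp="$target.restore.$$"
    # Clone under the temporary name first; TARGET is untouched if this fails.
    ${REFLINK_CMD:-reflink} "$snap" "$tmp" || return 1
    # Rename over TARGET; remove the temporary file if the rename fails.
    mv -f "$tmp" "$target" || { rm -f "$tmp"; return 1; }
}
```

With this in place, "restore_snapshot foo.2008.11.18 foo" has the same end result as the unlink and reflink pair: foo shares all of its extents with the snapshot until it is modified.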
Breaking a Reflink
You have a file foo and its various snapshots. For whatever reason, you want foo to no longer share any extents with another inode. Ready?
    # cp foo bar
    # mv bar foo
Throughout the creation of these design documents, I've thought of and forgotten other use cases. They'll come back to me, I swear. Perhaps you'll think of them. By defining the refcount tree in a generic fashion, and the REFLINK operation in the same fashion, we have the building blocks for a variety of tools and uses.