

Refcount Trees

Joel Becker, Oracle

November 19, 2008


A basic method for sharing data elements in filesystems is Copy-on-Write (CoW). In the CoW scheme, two entities share the same physical data element. When one of the entities wants to change the data element, it creates a new copy before updating it. Thus, the original data element is not changed, and the other entity is not affected by the change.

In the ocfs2 filesystem, the entities are inodes (the physical backing for files) and the data elements are clusters (the core unit of allocation in ocfs2). The current ocfs2 format does not allow more than one inode to refer to a single cluster. That is, inodes:clusters is a 1:N relationship. For CoW to work, more than one inode must be able to share a data cluster - a M:N relationship. The cluster should only be considered unused when no inodes refer to it.

We will keep track of this via a reference count on that cluster. The reference count is connected to the inodes that share it. The data structure that keeps track of the reference count is a refcount tree.

The Allocation Tree

ocfs2 inodes store their data allocation via a B-Tree. The tree is indexed by the virtual offset in the file, in terms of clusters (the "cluster position", or cpos). Only leaf nodes point to actual data storage. The data storage is in terms of extents, defined by the extent record structure.

struct ocfs2_extent_rec {
/*00*/  __le32 e_cpos;          /* Offset into the file, in clusters */
        union {
                __le32 e_int_clusters; /* Clusters covered by all children */
                struct {
                        __le16 e_leaf_clusters; /* Clusters covered by this
                                                   extent */
                        __u8 e_reserved1;
                        __u8 e_flags; /* Extent flags */
                };
        };
/*08*/  __le64 e_blkno;         /* Physical disk offset, in blocks */
};

e_cpos is the virtual offset into the file (cpos). e_leaf_clusters is the length of this extent in clusters. e_blkno is the physical location of the actual data storage.
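To make the field roles concrete, here is a small illustrative sketch of mapping a file cluster offset to its physical location by walking a leaf's extent records. It uses simplified in-memory types, and the helper `find_extent()` is hypothetical, not an ocfs2 function.

```c
#include <stdint.h>
#include <stddef.h>

struct extent_rec {
	uint32_t e_cpos;          /* offset into the file, in clusters */
	uint16_t e_leaf_clusters; /* length of this extent, in clusters */
	uint8_t  e_flags;         /* extent flags */
	uint64_t e_blkno;         /* physical location, in blocks */
};

/* Return the record covering cpos, or NULL if cpos falls in a hole. */
static const struct extent_rec *find_extent(const struct extent_rec *recs,
					    size_t nr, uint32_t cpos)
{
	for (size_t i = 0; i < nr; i++)
		if (cpos >= recs[i].e_cpos &&
		    cpos < recs[i].e_cpos + recs[i].e_leaf_clusters)
			return &recs[i];
	return NULL;
}
```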

The physical location is stored as a direct address. There is no place to store a reference count, and no obvious way to share this record with another inode. Imagine that another inode has an identical extent record. When one of the two inodes removes this extent (ftruncate(2), unlink(2), etc), it sends the physical cluster back to the global allocator (freeing it). The second inode still thinks it is referring to valid storage, but it is not. Clearly we must prevent the storage from being freed until all inodes are done with it.

One possible solution would be to have e_blkno point to another data structure that contains the real location and the reference count. But this raises questions of how much space such structures would take, how many inodes could share them, and how they would fragment over time.

Given that inodes will have fewer and fewer shared records as CoW progresses, it makes no sense to add a layer of indirection for the reference count. By leaving the current structure in place, read operations take no penalty and behave as currently defined in ocfs2.

We also want write operations (all modifying operations, including write(2), ftruncate(2), final unlink(2), and storing to an mmap(2) area) to have no penalty if the storage involved is not shared. CoW will create situations where an inode has some shared clusters and some clusters it owns exclusively. While writing to the shared clusters will take a hit for CoW, the exclusive clusters should proceed as fast as if no clusters were shared.


When we get to the extent record for a given piece of data, we want to proceed with no penalty when that storage is not shared. But the only information we have is the ocfs2_extent_rec. Thus, we introduce the OCFS2_EXT_REFCOUNTED extent flag. This is stored on the e_flags field. The flag is ignored for read operations. For write operations, if the flag is not set, the extent is not shared with any other inode. The write operation can proceed with no penalty. If the flag is set, the operation must access the refcount tree and behave appropriately.
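The decision described above can be sketched as a single test. This is illustrative only: the flag value matches ocfs2's OCFS2_EXT_REFCOUNTED, but `needs_refcount_path()` is a hypothetical helper, not part of ocfs2.

```c
#include <stdint.h>

#define OCFS2_EXT_REFCOUNTED 0x02	/* extent is shared; see e_flags */

struct extent_rec {
	uint8_t e_flags;
};

/* Reads never consult the flag; writes must take the refcount-tree
 * path only when the extent may be shared with another inode. */
static int needs_refcount_path(const struct extent_rec *rec, int is_write)
{
	return is_write && (rec->e_flags & OCFS2_EXT_REFCOUNTED);
}
```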

The Refcount Record

Reference counts for shared extents are stored in refcount records. These are stored on disk in the refcount tree, another ocfs2 B-Tree. The ocfs2 B-Tree functions are generic enough that refcount trees are easy to create, store, modify, and search. The leaves of a refcount tree are refcount blocks, which store the refcount records. These records describe the reference count for a physical extent. A refcount record has the following format:

struct ocfs2_refcount_rec {
/*00*/  __le64 r_cpos;          /* Physical offset, in clusters */
        __le32 r_clusters;      /* Clusters covered by this extent */
        __le32 r_refcount;      /* Reference count of this extent */
};

In a refcount record, r_cpos is a physical cpos. It's the absolute offset, in clusters, from the start of the disk. It is 64 bits for two reasons: it pads the structure nicely, and it is ready for a filesystem with more than 2^32 total clusters (see the sidebar about 64-bit cpos later in this document).

r_refcount is the number of extent records in the filesystem that point to this physical extent. Each of those extent records MUST have the OCFS2_EXT_REFCOUNTED flag set.

Refcount records do not map 1:1 with extent records. A large extent may be split by a CoW operation. Unchanged inodes still have one extent record covering the entire extent. The changed inode will have an extent record for the unchanged portion and a new extent record for the changed portion. The refcount tree will have similarly split the single refcount record in two. The record covering the changed portion will have its reference count decremented by one, as the changed inode no longer uses that physical extent.
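The split just described can be sketched over simplified in-memory types. An inode CoWs the range [split_cpos, split_cpos + split_len) out of a shared physical extent; the original record yields up to three pieces, and the CoWed piece drops one reference. The function name is hypothetical, and this is not ocfs2 code.

```c
#include <stdint.h>
#include <stddef.h>

struct refcount_rec {
	uint64_t r_cpos;     /* physical offset, in clusters */
	uint32_t r_clusters; /* length, in clusters */
	uint32_t r_refcount;
};

/* Writes the resulting records into out[]; returns how many. */
static size_t split_refcount_rec(const struct refcount_rec *orig,
				 uint64_t split_cpos, uint32_t split_len,
				 struct refcount_rec out[3])
{
	size_t n = 0;
	uint64_t end = orig->r_cpos + orig->r_clusters;
	uint64_t split_end = split_cpos + split_len;

	if (split_cpos > orig->r_cpos)	/* unchanged head keeps its count */
		out[n++] = (struct refcount_rec){ orig->r_cpos,
			(uint32_t)(split_cpos - orig->r_cpos),
			orig->r_refcount };
	/* the CoWed middle loses the writing inode's reference */
	out[n++] = (struct refcount_rec){ split_cpos, split_len,
					  orig->r_refcount - 1 };
	if (split_end < end)		/* unchanged tail keeps its count */
		out[n++] = (struct refcount_rec){ split_end,
			(uint32_t)(end - split_end), orig->r_refcount };
	return n;
}
```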

More than one refcount tree may exist in a filesystem. Inodes share a refcount tree only when they share physical clusters, and inodes with no shared clusters need not have a refcount tree at all. Two inodes that share no clusters with each other need not share a refcount tree. This way, there is no contention between inodes that share nothing.

Why not a single system file that has refcount entries for every cluster in the filesystem? Because ocfs2 is a cluster filesystem, and this would be a painful point of contention; every write to a shared cluster would require the global refcount tree lock. With a refcount tree shared only among inodes that share clusters, the contention is localized.

Linking the Refcount Tree to Inodes

An inode with shared clusters will have the OCFS2_HAS_REFCOUNT_FL flag set in i_dyn_features. When this flag is set, the i_refcount_loc field will be the address of the refcount tree's root refcount block. Each inode sharing the tree will point to the same refcount tree root.

Simple Transitions

An inode can add a refcount tree root as a single transaction. The inode merely allocates and initializes the root refcount block, sets its OCFS2_HAS_REFCOUNT_FL flag, and points i_refcount_loc to the root block. An empty refcount tree is isomorphic to having no refcount tree.

It is expressly legal for a refcount record to have an r_refcount of one when a single inode is using this physical extent. Thus, the following two states are equivalent:

rec->e_cpos = 0;
rec->e_leaf_clusters = 2;
rec->e_flags = 0;
rec->e_blkno = XXXX;

phys_cpos = ocfs2_blocks_to_clusters(XXXX);
refcount_lookup(phys_cpos) -> -ENOENT;

rec->e_cpos = 0;
rec->e_leaf_clusters = 2;
rec->e_flags = OCFS2_EXT_REFCOUNTED;
rec->e_blkno = XXXX;

phys_cpos = ocfs2_blocks_to_clusters(XXXX);
refcount_rec = refcount_lookup(phys_cpos);
refcount_rec->r_cpos = phys_cpos;
refcount_rec->r_clusters = 2;
refcount_rec->r_refcount = 1;

In other words, an extent can be transitioned from unrefcounted to refcounted in a single, small transaction. The states are isomorphic. By logical extension, an inode with no refcount tree can add a refcount tree and then, piece by piece, add each of its extents to that refcount tree. The result is isomorphic with the original, refcount-tree-less inode.

When an inode wants to remove a REFCOUNTED extent from its tree, it decrements the reference count rather than adding the extent to the truncate log. Obviously, the refcount code adds the extent to the truncate log when the refcount reaches zero, but that is transparent to the caller.


When an inode wants to modify a REFCOUNTED extent, it must remove its reference from the refcount tree. If it is modifying only a portion of a REFCOUNTED extent, it may split the physical extent in the refcount tree, removing its reference from the portion it modified.

If r_refcount for that extent is one, there is an optimization. The physical extent can merely be removed from the refcount tree (its reference is going to zero), and the REFCOUNTED flag is cleared from the inode's extent record. No actual copying needs to take place, as no other inode is sharing this physical extent. This is isomorphic.

If r_refcount is more than one, CoW must happen. The copy is also an isomorphic transaction:

  1. Allocate a new extent.
  2. Copy the data from the original extent to the new extent.
  3. Store the new physical extent location in the extent record (e_blkno).
  4. Clear the REFCOUNTED flag on the extent record.
  5. Decrement r_refcount for the old, shared physical location.

With the REFCOUNTED flag cleared and the operation committed, the inode now has its own copy of the data, and the filesystem is in a consistent state. The write operation can now modify the new, unshared data extent.

An optimization would be to combine the new data with the copy of the original data so as to not write twice. The procedure above is documented to show the isomorphism.
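The CoW steps, plus the r_refcount == 1 optimization, can be modeled with a toy in-memory sketch: two files share one data extent with refcount 2, and writing through one of them allocates a private copy, redirects the extent, clears the shared flag, and drops the old reference. This is purely illustrative; none of these names are ocfs2 functions.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define EXT_REFCOUNTED 0x02
#define EXT_CLUSTERS 8

struct toy_extent {
	unsigned char *data;	/* stands in for e_blkno */
	uint8_t flags;		/* stands in for e_flags */
	uint32_t *refcount;	/* stands in for the shared refcount record */
};

/* The five steps from the text, in order, plus the refcount==1 case. */
static void toy_cow(struct toy_extent *ext)
{
	if (!(ext->flags & EXT_REFCOUNTED))
		return;				/* unshared: no penalty */
	if (*ext->refcount == 1) {
		/* optimization: last user, just unshare in place */
		ext->flags &= ~EXT_REFCOUNTED;
		*ext->refcount = 0;		/* record would be removed */
		return;
	}
	unsigned char *copy = malloc(EXT_CLUSTERS);	/* 1. allocate */
	memcpy(copy, ext->data, EXT_CLUSTERS);		/* 2. copy */
	ext->data = copy;				/* 3. repoint */
	ext->flags &= ~EXT_REFCOUNTED;			/* 4. clear flag */
	(*ext->refcount)--;				/* 5. decrement */
}
```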

CoW Fragmentation

One concern with CoW is fragmentation. If an extent is large (say, 1GB), we can't copy the entire thing in one CoW operation. The copy would take too long. However, if we only copy the portion that is changing (say, 4K), we must break the single extent into two: one new 4K extent and one shared 1GB-4K extent. If this repeats over and over again, we end up with 262144 extents, each 4K in size. This is the worst case, of course, but it is horrible fragmentation.

The refcount tree code will CoW in hunks of 1MB. That is, writing to an extent smaller than 1MB will CoW the entire extent, while writing to an extent larger than 1MB will CoW only 1MB. This preserves the contiguity of 1MB hunks, allowing for good sequential reads, without overburdening the copy. We know from experience that the kernel work required to allocate and use 1MB of pages is doable.
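The hunk selection can be sketched as follows, assuming 4K clusters (so 256 clusters per 1MB hunk). The helper name and the exact alignment policy are illustrative assumptions, not the ocfs2 implementation.

```c
#include <stdint.h>

#define COW_HUNK_CLUSTERS 256	/* 1MB of 4K clusters (assumption) */

struct cow_range {
	uint32_t cpos;		/* start of the range to CoW */
	uint32_t clusters;	/* length of the range to CoW */
};

/* Pick the cluster range to CoW for a write at wr_cpos inside the
 * shared extent [ext_cpos, ext_cpos + ext_clusters). */
static struct cow_range choose_cow_range(uint32_t ext_cpos,
					 uint32_t ext_clusters,
					 uint32_t wr_cpos)
{
	struct cow_range r;

	if (ext_clusters <= COW_HUNK_CLUSTERS) {
		/* small extent: CoW the whole thing */
		r.cpos = ext_cpos;
		r.clusters = ext_clusters;
		return r;
	}
	/* large extent: CoW only the 1MB hunk containing the write */
	uint32_t hunk = (wr_cpos - ext_cpos) / COW_HUNK_CLUSTERS;
	r.cpos = ext_cpos + hunk * COW_HUNK_CLUSTERS;
	uint32_t remain = ext_clusters - (r.cpos - ext_cpos);
	r.clusters = remain < COW_HUNK_CLUSTERS ? remain : COW_HUNK_CLUSTERS;
	return r;
}
```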

For a large sequential write, the 1MB hunks of CoW are likely to be contiguously allocated out of the localalloc file.

The Refcount Block

A refcount block is a block describing refcount records.

#define OCFS2_REFCOUNT_LEAF_FL          (0x00000001)
#define OCFS2_REFCOUNT_TREE_FL          (0x00000002)

struct ocfs2_refcount_list {
/*00*/  __le16  rl_count;       /* Maximum number of entries possible
                                   in rl_records */
        __le16  rl_used;        /* Current number of used records */
        __le32  rl_reserved2;
        __le64  rl_reserved1;   /* Pad to sizeof(ocfs2_refcount_record) */
/*10*/  struct ocfs2_refcount_rec rl_recs[0];   /* Refcount records */
};

struct ocfs2_refcount_block {
/*00*/  __u8    rf_signature[8];        /* Signature for verification */
        __le16  rf_suballoc_slot;       /* Slot suballocator this block
                                           belongs to */
        __le16  rf_suballoc_bit;        /* Bit offset in suballocator
                                           block group */
        __le32  rf_fs_generation;       /* Must match superblock */
/*10*/  __le64  rf_blkno;               /* Offset on disk, in blocks */
        __le64  rf_parent;              /* Parent block, only valid if
                                           OCFS2_REFCOUNT_LEAF_FL is set in
                                           rf_flags */
/*20*/  struct ocfs2_block_check rf_check;  /* Error checking */
        __le32  rf_count;               /* Number of inodes sharing this
                                           refcount tree */
        __le32  rf_flags;               /* See the flags above */
/*30*/  __le64  rf_reserved[10];
/*80*/  union {
            struct ocfs2_refcount_list rf_records;  /* List of refcount
                                                       records */
            struct ocfs2_extent_list rf_list;       /* Extent record list,
                                                       only valid if
                                                       OCFS2_REFCOUNT_TREE_FL
                                                       is set in rf_flags */
        };
/* Actual on-disk size is one block */
};

The refcount tree root is a refcount block pointed to by i_refcount_loc. This block MUST NOT have the OCFS2_REFCOUNT_LEAF_FL flag set, as it is not a leaf. When it starts, it will NOT have the OCFS2_REFCOUNT_TREE_FL set. The refcount records are stored inline in rf_records. When rf_count goes to zero, the tree MUST have no entries.

If the refcount tree root fills up, a new refcount leaf is allocated. A refcount leaf is a refcount block with OCFS2_REFCOUNT_LEAF_FL set. The refcount tree root is converted to a tree using rf_list, and it is pointed to the new leaf block. The leaf block sets its rf_parent to the root block.

The refcount tree is keyed by the 32-bit physical cpos of the extent in question; each record covers a range of keys, just like dx dirs or xattrs (MarkFasheh was instrumental in mapping the dx dirs lookup scheme to refcount trees. Thanks MarkFasheh!). This allows refcount records for different physical extents to be packed into one leaf. If the leaf fills up, it is split. With the current 32-bit limit on filesystem cpos, collisions on the physical cpos are not possible.
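A range-keyed lookup of this sort can be sketched as a binary search: refcount records in a leaf are sorted by r_cpos, and the search finds the record whose [r_cpos, r_cpos + r_clusters) range covers the query. This is an illustrative stand-in, not the ocfs2 lookup code.

```c
#include <stdint.h>
#include <stddef.h>

struct refcount_rec {
	uint64_t r_cpos;     /* physical offset, in clusters */
	uint32_t r_clusters; /* length, in clusters */
	uint32_t r_refcount;
};

/* Binary search for the record covering the physical cpos. */
static const struct refcount_rec *
refcount_lookup(const struct refcount_rec *recs, size_t nr, uint64_t cpos)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (cpos < recs[mid].r_cpos)
			hi = mid;
		else if (cpos >= recs[mid].r_cpos + recs[mid].r_clusters)
			lo = mid + 1;
		else
			return &recs[mid];
	}
	return NULL;	/* -ENOENT: this extent is not refcounted */
}
```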

A refcount block MUST NOT have both OCFS2_REFCOUNT_LEAF_FL and OCFS2_REFCOUNT_TREE_FL set. A refcount block is either a root or a leaf.

Sidebar: 64-bit cpos

High-level Operations

The following high-level operations (from an inode's point of view) are defined.

Add a refcount tree to an inode

Allocate a new refcount tree root and attach it to this inode. This will start rf_count at one. Only valid if the inode has no refcount tree already.

Set this inode's refcount tree

Given an existing refcount tree root, point this inode to it. Increments rf_count. Only valid if the refcount tree exists and the inode has no existing refcount tree.

Remove this inode's refcount tree

Decrements rf_count and frees the refcount tree root if rf_count goes to zero. The inode MUST NOT have any extent records with the REFCOUNTED flag.

Increment refcount

Given a physical cpos and length, increment the refcount in the tree. If the extent has not been seen before, a refcount record is created for it. Refcount records may be merged or split by this operation.

Decrement refcount for delete

Given a physical cpos and length, decrement the refcount in the tree. If the refcount for any portion of the extent goes to zero, that portion is queued for freeing (it eventually goes to the truncate log, probably via the cached delete context).

Decrement refcount without delete

This is the CoW optimization seen above. The refcount MUST have started at one. It will go to zero, and the extent will be removed from the refcount tree.
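The "increment refcount" operation, including the create-if-unseen and merge behaviors, can be sketched over a flat sorted array. This toy version only handles exact matches and whole-record inserts; a real tree also splits records on partial overlap, which is omitted here for brevity. The function name is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct refcount_rec {
	uint64_t r_cpos;     /* physical offset, in clusters */
	uint32_t r_clusters; /* length, in clusters */
	uint32_t r_refcount;
};

/* recs[] is sorted by r_cpos and has room for one more entry.
 * Returns the new record count. */
static size_t refcount_inc(struct refcount_rec *recs, size_t nr,
			   uint64_t cpos, uint32_t clusters)
{
	size_t i;

	/* Exact match: just bump the count. */
	for (i = 0; i < nr; i++)
		if (recs[i].r_cpos == cpos && recs[i].r_clusters == clusters) {
			recs[i].r_refcount++;
			return nr;
		}

	/* Not seen before: insert a new record with a count of one. */
	for (i = 0; i < nr && recs[i].r_cpos < cpos; i++)
		;
	memmove(&recs[i + 1], &recs[i], (nr - i) * sizeof(*recs));
	recs[i] = (struct refcount_rec){ cpos, clusters, 1 };
	nr++;

	/* Merge neighbors that are now contiguous with equal counts. */
	for (i = 0; i + 1 < nr; ) {
		if (recs[i].r_cpos + recs[i].r_clusters == recs[i + 1].r_cpos &&
		    recs[i].r_refcount == recs[i + 1].r_refcount) {
			recs[i].r_clusters += recs[i + 1].r_clusters;
			memmove(&recs[i + 1], &recs[i + 2],
				(nr - i - 2) * sizeof(*recs));
			nr--;
		} else
			i++;
	}
	return nr;
}
```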


A refcount tree is an entity shared by more than one inode. It can't be protected by any particular inode's cluster lock. Thus, a new lock type will be created for refcount trees.

Note that all the simple operations above act on one inode and the refcount tree. Other inodes using this refcount tree are not affected.

Feature Bits

The refcount tree is an actual change to the physical disk format. The capability will be controlled by the OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE feature bit. While the filesystem is safe to read in read-only mode, the recovery code has to handle multi-linked clusters. Thus, it must be an INCOMPAT feature rather than an RO_COMPAT feature.


libocfs2 and fsck.ocfs2 will need changes to handle multi-linked clusters.

How Refcount Trees are Created

This is all well and good, but how does one create one of these shared trees? The answer is the REFLINK operation.
