This document describes the planned metadata to manage multiple physical devices underneath a Btrfs filesystem. These concepts are not finalized, but implementation is expected to start shortly.
The Btrfs requirements for multi-device support:
If Btrfs were to rely on device mapper or MD for mirroring, it would not be able to resolve checksum failures by checking the mirrored copy. The lower layers don't know the checksum or granularity of the filesystem blocks, and so they are not able to verify the data they return.
If Btrfs were to rely on device mapper for aggregating all of the physical devices into a single big address space, it would not have sufficient information to allocate mirrored copies on different devices. Keeping this information in sync between Btrfs and the device mapper would be difficult and error prone.
Because different Btrfs subvolumes may have different allocation policies, the management of multiple devices needs to be very flexible. Metadata mirroring may be on by default even in single spindle configurations, and so it is not sufficient to mirror between two devices of identical size.
Chunks are sections of storage with a logical address. All extent pointers will reference chunks instead of physical blocks on a drive. The super block will have a bootstrapping section that maps the chunks used by the chunk tree. This allows all parts of the filesystem, including the chunking code, to use the chunking code for extent address lookup.
Each chunk will have space allocated to it from one or more devices, and may be set to mirror or stripe IO to those physical devices. Normal sizes for a chunk would be at least 256MB and average 1% of the size of the device.
Each chunk will be owned by a single extent allocation tree, and will have a back reference to that tree.
Each device added to the filesystem will be given a 64 bit device id. Information about each chunk will be stored in a btree identified by a 64 bit device id. Each btree root in the filesystem will be tied to one and only one chunk tree for block translation lookups. The ID of this lookup tree will be duplicated in each btree so that lookups can be safely done during repair.
Each device added to the filesystem has an allocation tree recording which portions of the device are currently allocated to a chunk. Back references in this tree record which chunk is allocating which part of the device. Because there are relatively few device extents per device, this allocation tree can be shared between many devices.
Chunks are assigned to an extent allocation tree, and used to satisfy extent allocations for metadata and data. As space usage grows, more chunks are added to the allocation tree dynamically. The extent allocation tree will do sub-chunk sized allocations, track free space inside the chunk, and track back references to the extents allocated out of the chunk.
Logical addressing allows efficient chunk relocation. The extent allocation tree that owns the chunk can be consulted to see which parts of the chunk are in use, and it is then copied in bulk from source to destination.
Devices in the filesystem can be removed or balanced by relocating chunks. If new devices are added, chunks can either be individually restriped or the new device can be placed into the pool for new allocations only.
Rebuilding mirrors is done by consulting the allocation tree for the dead drive and following back references to the chunks allocated to it. Each chunk is repaired individually, optionally following back references to limit the repair to areas of the chunk that were actually in use.
IDs of phyiscal devices are stored in the device super block. These are scanned from userland to construct a list of all the devices contributing to a given Btrfs uuid. The chunk trees also carry information about each of the devices, so things can be verified at mount time.
This graphic gives a high level view of the components involved with recording space usage, and details which trees reference each other: