Btrfs now supports a range of blocksizes for btree metadata, including blocks larger than the page size. This is done by extending the extent mapping page cache code with a simple API for reading and writing buffers of arbitrary size. Unless a file is packed into the btree, its data is written into a separate extent, which is allocated in 4K units.
The buffers are backed by the traditional page cache, and every access to filesystem metadata goes through the buffer API, which required large changes to Btrfs. The current patches allow separate sizes for btree leaves and nodes, but so far I haven't benchmarked all the possible variations.
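To make the idea concrete, here is a minimal userspace sketch of a buffer that spans several page-sized chunks, with read and write helpers that hide the page boundaries from callers. The names (meta_buffer, meta_read, meta_write) and the layout are illustrative assumptions, not the actual Btrfs extent buffer code.

```c
/*
 * Minimal userspace sketch of the idea, not the actual Btrfs code: a
 * metadata buffer that may span several page-sized chunks, plus read
 * and write helpers that hide the page boundaries from callers.
 */
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL

struct meta_buffer {
	unsigned long len;	/* blocksize: 4K, 8K, 16K, 32K, 64K ... */
	char **pages;		/* one pointer per backing page */
};

static struct meta_buffer *meta_alloc(unsigned long len)
{
	unsigned long nr = (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	struct meta_buffer *mb = malloc(sizeof(*mb));
	unsigned long i;

	mb->len = len;
	mb->pages = malloc(nr * sizeof(char *));
	for (i = 0; i < nr; i++)
		mb->pages[i] = calloc(1, SKETCH_PAGE_SIZE);
	return mb;
}

/* Copy len bytes out of the buffer, starting at offset start. */
static void meta_read(struct meta_buffer *mb, void *dst,
		      unsigned long start, unsigned long len)
{
	while (len) {
		unsigned long page = start / SKETCH_PAGE_SIZE;
		unsigned long off = start % SKETCH_PAGE_SIZE;
		unsigned long cur = SKETCH_PAGE_SIZE - off;

		if (cur > len)
			cur = len;
		memcpy(dst, mb->pages[page] + off, cur);
		dst = (char *)dst + cur;
		start += cur;
		len -= cur;
	}
}

/* Copy len bytes into the buffer, starting at offset start. */
static void meta_write(struct meta_buffer *mb, const void *src,
		       unsigned long start, unsigned long len)
{
	while (len) {
		unsigned long page = start / SKETCH_PAGE_SIZE;
		unsigned long off = start % SKETCH_PAGE_SIZE;
		unsigned long cur = SKETCH_PAGE_SIZE - off;

		if (cur > len)
			cur = len;
		memcpy(mb->pages[page] + off, src, cur);
		src = (const char *)src + cur;
		start += cur;
		len -= cur;
	}
}
```

Because callers only ever go through helpers like these, the code above them doesn't care whether a metadata block is one page or sixteen.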
The Btrfs changes are not yet stable and they include a large disk format change, but they are available in a separate HG branch:
Kernel module HG tree
Progs HG tree
The first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata-intensive workloads, but overall read speeds are much better. For starters, I've done compilebench --makej tests across 4k, 8k, 16k, and 64k blocksizes.
Compilebench simulates the files read and written while creating, compiling, reading, and deleting kernel trees. The compile phase just creates files with the same sizes and names as the real .o files, but it does so in random order (simulating files created by make -j).
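For illustration, here is a rough sketch (not compilebench itself) of what that compile phase boils down to: take a list of .o names and sizes and create files of those sizes in random order, the way make -j finishes objects in an unpredictable sequence. The names and sizes below are made up.

```c
/*
 * Rough sketch of the compile phase, not compilebench itself: create
 * files with given names and sizes in random order, roughly the way
 * make -j finishes .o files in an unpredictable sequence.  The names
 * and sizes here are made up stand-ins.
 */
#include <stdio.h>
#include <stdlib.h>

struct obj {
	const char *name;
	size_t size;
};

static struct obj objs[] = {
	{ "ctree.o",     61440 },
	{ "disk-io.o",   40960 },
	{ "extent_io.o", 81920 },
	{ "inode.o",     98304 },
};

int main(void)
{
	size_t n = sizeof(objs) / sizeof(objs[0]);
	size_t i;

	srand(42);

	/* Fisher-Yates shuffle so the creation order is random. */
	for (i = n - 1; i > 0; i--) {
		size_t j = (size_t)rand() % (i + 1);
		struct obj tmp = objs[i];

		objs[i] = objs[j];
		objs[j] = tmp;
	}

	/* Write each file as a run of zero bytes of the recorded size. */
	for (i = 0; i < n; i++) {
		char *buf = calloc(1, objs[i].size);
		FILE *f = fopen(objs[i].name, "w");

		if (!f || !buf)
			return 1;
		fwrite(buf, 1, objs[i].size, f);
		fclose(f);
		free(buf);
	}
	return 0;
}
```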
If a given file fits in a Btrfs leaf and is less than 8K, the current patches pack it into the btree instead of allocating data blocks for it on disk. Btrfs has 148 bytes of header in each leaf, so the largest packed file is the leafsize minus 148. The 8K limit was an arbitrary choice.
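As a worked example of that limit, here is a short sketch of the packing size calculation. The 148-byte header figure and the 8K cap come from the text above; per-item overhead inside the leaf and the exact boundary handling are ignored, and the names are made up.

```c
/*
 * Worked example of the packing limit described above.  The 148-byte
 * leaf header and the 8K cap come from the text; per-item overhead
 * inside the leaf and the exact boundary handling are ignored here.
 */
#define LEAF_HEADER_BYTES	148
#define PACK_LIMIT		(8 * 1024)	/* arbitrary cap from the patches */

/* Largest file that can be packed inline for a given leafsize. */
static unsigned long max_packed_file(unsigned long leafsize)
{
	unsigned long room = leafsize - LEAF_HEADER_BYTES;

	return room < PACK_LIMIT ? room : PACK_LIMIT;
}

/* e.g. max_packed_file(4096) is 3948; larger leaves are capped at 8K. */
```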
| Operation | 4096 | 8192 | 16384 | 65536 | Ext3 | Graph |
| Create Trees | 34.26 MB/s | 34.36 MB/s | 31.69 MB/s | 31.49 MB/s | 21.52 MB/s | IO |
| Compiling Trees | 28.73 MB/s | 29.70 MB/s | 32.09 MB/s | 30.90 MB/s | 20.44 MB/s | IO |
| Read Trees | 10.30 MB/s | 12.01 MB/s | 12.74 MB/s | 11.95 MB/s | 17.22 MB/s | IO |
| Delete Trees | 14.87s | 13.94s | 12.26s | 12.06s | 23.86s | IO |
After some tuning for CPU usage, block sizes larger than 4K win each phase of this benchmark. Ext3 still wins the read phase, however.
Very few of the .o files created during the compile phase are packed into the tree. The larger leaves are faster there because the IO is more efficient in general, and because reading metadata back in to create the .o files is faster.
The read phase and delete phase are generally faster as the leaf size increases. However, the operations on large directories clearly show a CPU usage problem during deletes on large leaves. Zach Brown has some code to delay shifting leaf data during deletes; hopefully this will help once it is integrated.
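The cost being delayed is easy to see in a generic sketch: a leaf packs its items contiguously, so removing one means shifting everything after it to close the hole, and the amount of memmove work grows with the leaf size. This is an illustrative model, not the actual Btrfs leaf layout.

```c
/*
 * Generic sketch of why big leaves cost more CPU on delete, not the
 * actual Btrfs leaf layout: items are packed contiguously, so removing
 * one shifts the whole tail, and the work grows with the leaf size.
 */
#include <string.h>

struct sketch_leaf {
	char *data;		/* leafsize bytes of packed item data */
	unsigned long used;	/* bytes currently in use */
};

/* Remove len bytes at offset start by shifting the tail down. */
static void leaf_delete(struct sketch_leaf *leaf,
			unsigned long start, unsigned long len)
{
	memmove(leaf->data + start,
		leaf->data + start + len,
		leaf->used - start - len);
	leaf->used -= len;
}
```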
Operations on one million 512-byte files in one directory
| Blocksize | Create | Read | Delete |
| 4096 | 153s (123s sys) | 3m59s (1m40s sys) | 5m7s (2m18s sys) |
| 8192 | 145s (118s sys) | 3m1s (1m30s sys) | 4m7s (2m16s sys) |
| 16384 | 156s (131s sys) | 2m21s (1m8s sys) | 4m5s (2m38s sys) |
| 32768 | 189s (166s sys) | 2m19s (1m1s sys) | 6m2s (3m43s sys) |
| 65536 | 265s (241s sys) | 2m5s (1m4s sys) | 8m1s (5m54s sys) |
| Graph | IO | IO | IO |
Operations on one million 16K files in one directory
| Blocksize | Create | Read | Delete |
| 4096 | 480s (244s sys) | 17m14s (3m11s sys) | 4m31s (2m15s sys) |
| 8192 | 459s (238s sys) | 15m20s (3m8s sys) | 4m28s (2m29s sys) |
| 16384 | 470s (240s sys) | 14m47s (3m8s sys) | 5m2s (3m9s sys) |
| 32768 | 521s (270s sys) | 14m39s (3m16s sys) | 7m7s (4m41s sys) |
| 65536 | 663s (362s sys) | 14m31s (3m27s sys) | 11m22s (7m48s sys) |
| Graph | IO | IO | IO |
You can find compilebench here. Seekwatcher was used to create the graphs.