Blocksize benchmarks

Btrfs now supports a range of blocksizes for btree metadata, including blocks larger than the page size. This is done by extending the extent mapping page cache code with a simple API for reading and writing buffers of arbitrary size. Unless a file is packed into the btree, its data is written into a separate extent, which is allocated in 4K units.

The buffers are backed by the traditional page cache and have a simple API for reading and writing into them. Every access to filesystem metadata has to go through this API, which required large changes to Btrfs. The current patches allow separate sizes for btree leaves and nodes, but so far I haven't benchmarked all the possible variations.
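
To make the idea concrete, here is a minimal userspace sketch of a multi-page metadata buffer, assuming 4K pages. The struct and helper names (extent_buffer, eb_alloc, eb_read, eb_write) are illustrative only, not the actual kernel interfaces; the point is simply that callers address the buffer by offset and length, and the helpers split each copy at page boundaries so the backing pages can stay page-cache sized.

/*
 * Userspace model of a metadata buffer larger than the page size.
 * The buffer is backed by an array of page-sized chunks, and every
 * read or write goes through helpers that split the copy at page
 * boundaries.  All names here are illustrative, not kernel APIs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct extent_buffer {
    size_t len;       /* total block size, e.g. 16384 */
    char **pages;     /* len / PAGE_SIZE backing pages */
};

static struct extent_buffer *eb_alloc(size_t len)
{
    struct extent_buffer *eb = malloc(sizeof(*eb));
    size_t nr = (len + PAGE_SIZE - 1) / PAGE_SIZE;

    eb->len = len;
    eb->pages = calloc(nr, sizeof(char *));
    for (size_t i = 0; i < nr; i++)
        eb->pages[i] = calloc(1, PAGE_SIZE);
    return eb;
}

/* Copy bytes out of the buffer, crossing page boundaries as needed. */
static void eb_read(struct extent_buffer *eb, void *dst,
                    size_t offset, size_t len)
{
    char *out = dst;

    while (len) {
        size_t page = offset / PAGE_SIZE;
        size_t off = offset % PAGE_SIZE;
        size_t cur = PAGE_SIZE - off;

        if (cur > len)
            cur = len;
        memcpy(out, eb->pages[page] + off, cur);
        out += cur;
        offset += cur;
        len -= cur;
    }
}

/* Copy bytes into the buffer; the mirror of eb_read(). */
static void eb_write(struct extent_buffer *eb, const void *src,
                     size_t offset, size_t len)
{
    const char *in = src;

    while (len) {
        size_t page = offset / PAGE_SIZE;
        size_t off = offset % PAGE_SIZE;
        size_t cur = PAGE_SIZE - off;

        if (cur > len)
            cur = len;
        memcpy(eb->pages[page] + off, in, cur);
        in += cur;
        offset += cur;
        len -= cur;
    }
}

int main(void)
{
    struct extent_buffer *eb = eb_alloc(16384);
    char msg[] = "item spanning a page boundary";
    char back[sizeof(msg)];

    /* Write and read back across the first 4K page boundary. */
    eb_write(eb, msg, 4090, sizeof(msg));
    eb_read(eb, back, 4090, sizeof(back));
    printf("%s\n", back);
    return 0;
}

Keeping the backing store in page-sized chunks is what lets a 16K or 64K btree block coexist with a 4K page cache.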

The Btrfs changes are not yet stable and they are a large disk format change, but they are available in a separate HG branch:

Kernel module HG tree
Progs HG tree

The first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata-intensive workloads, but overall read speeds are much better. For starters, I've done compilebench --makej tests across 4K, 8K, 16K and 64K blocksizes.

Compilebench simulates the files read and written during a kernel compile by creating, compiling, reading and deleting kernel trees. The compile phase just creates files with the same sizes and names as the real .o files, but it does so in random order (simulating files created by make -j).

If a given file fits in the Btrfs leaf and is less than 8K, the current patches pack it into the btree instead of allocating data blocks for it on disk. Btrfs has 148 bytes of header in the leaf, so the largest file that can be packed is the leafsize minus 148 bytes. The 8K limit was an arbitrary choice.
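
As a rough illustration of that limit, the sketch below just applies the 148-byte header figure to each of the benchmarked leaf sizes; max_inline_file() is an invented name, not a real Btrfs helper, and per-item overhead is ignored.

/* Largest file that could be packed inline into a leaf, ignoring
 * per-item overhead and using the 148-byte leaf header mentioned
 * above.  max_inline_file() is illustrative, not a Btrfs function. */
#include <stdio.h>

#define LEAF_HEADER_BYTES 148

static unsigned int max_inline_file(unsigned int leafsize)
{
    return leafsize - LEAF_HEADER_BYTES;
}

int main(void)
{
    unsigned int sizes[] = { 4096, 8192, 16384, 32768, 65536 };

    for (int i = 0; i < 5; i++)
        printf("leafsize %5u -> largest packed file %5u bytes\n",
               sizes[i], max_inline_file(sizes[i]));
    return 0;
}

For a 4K leaf that works out to 3948 bytes.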

Operation         Btrfs 4096   Btrfs 8192   Btrfs 16384  Btrfs 65536  Ext3         Graph
Create Trees      34.26 MB/s   34.36 MB/s   31.69 MB/s   31.49 MB/s   21.52 MB/s   IO
Compiling Trees   28.73 MB/s   29.70 MB/s   32.09 MB/s   30.90 MB/s   20.44 MB/s   IO
Read Trees        10.30 MB/s   12.01 MB/s   12.74 MB/s   11.95 MB/s   17.22 MB/s   IO
Delete Trees      14.87s       13.94s       12.26s       12.06s       23.86s       IO

After some tuning for CPU usage, block sizes larger than 4K win each phase of this benchmark. Ext3 still wins the read phase, however.

Very few of the .o files during the compile phase are packed into the tree. The larger leaves are faster there because the IO is more efficient in general, and because reading metadata back in to create the .o files is faster.

The read and delete phases are generally faster as the leaf size increases. However, the operations on large directories clearly show a CPU usage problem during deletes with large leaves. Zach Brown has some code to delay shifting leaf data during deletes; hopefully this will help once it is integrated.
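
The sketch below is my guess at where that CPU goes, not a trace of the real code: items in a leaf are kept packed, so deleting one means memmove()-ing everything behind it forward, and the average amount moved grows with the leaf size. The fixed 32-byte item slots are an invented simplification of the real leaf layout.

/* Toy model of deleting packed items from a leaf: each delete shifts
 * the tail of the leaf with memmove() to keep it packed, so the total
 * bytes moved per leaf grows roughly with the square of the number of
 * items in it.  The 32-byte fixed slots are made up for illustration;
 * the real leaf format stores variable-sized items and headers. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITEM_SIZE 32

struct leaf {
    unsigned int nritems;
    char *data;            /* nritems * ITEM_SIZE packed slots */
};

/* Remove item 'slot' and keep the remaining items packed together. */
static void leaf_del_item(struct leaf *l, unsigned int slot)
{
    size_t tail = (size_t)(l->nritems - slot - 1) * ITEM_SIZE;

    memmove(l->data + (size_t)slot * ITEM_SIZE,
            l->data + (size_t)(slot + 1) * ITEM_SIZE, tail);
    l->nritems--;
}

int main(void)
{
    unsigned int leafsizes[] = { 4096, 16384, 65536 };

    for (int i = 0; i < 3; i++) {
        struct leaf l = { .nritems = leafsizes[i] / ITEM_SIZE };
        unsigned long long moved = 0;

        l.data = calloc(l.nritems, ITEM_SIZE);
        /* Delete from the front until empty, counting bytes shifted. */
        while (l.nritems) {
            moved += (unsigned long long)(l.nritems - 1) * ITEM_SIZE;
            leaf_del_item(&l, 0);
        }
        printf("leafsize %5u: %llu bytes shifted to empty one leaf\n",
               leafsizes[i], moved);
        free(l.data);
    }
    return 0;
}

Delaying or batching those shifts, as the patch mentioned above aims to do, would amortize the memmove() cost across several deletions.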

Operations on one million 512-byte files in one directory

Blocksize   Create            Read                Delete
4096        153s (123s sys)   3m59s (1m40s sys)   5m7s (2m18s sys)
8192        145s (118s sys)   3m1s (1m30s sys)    4m7s (2m16s sys)
16384       156s (131s sys)   2m21s (1m8s sys)    4m5s (2m38s sys)
32768       189s (166s sys)   2m19s (1m1s sys)    6m2s (3m43s sys)
65536       265s (241s sys)   2m5s (1m4s sys)     8m1s (5m54s sys)
Graph       IO                IO                  IO

Operations on one million 16K files in one directory

Blocksize   Create            Read                 Delete
4096        480s (244s sys)   17m14s (3m11s sys)   4m31s (2m15s sys)
8192        459s (238s sys)   15m20s (3m8s sys)    4m28s (2m29s sys)
16384       470s (240s sys)   14m47s (3m8s sys)    5m2s (3m9s sys)
32768       521s (270s sys)   14m39s (3m16s sys)   7m7s (4m41s sys)
65536       663s (362s sys)   14m31s (3m27s sys)   11m22s (7m48s sys)
Graph       IO                IO                   IO

You can find compilebench here. Seekwatcher was used to create the graphs.