Btrfs Benchmarks

So far, benchmarking has focused on workloads that target a specific aspect of the allocator algorithms or disk formats. Over the long term, this page will be updated to reflect all the benchmarking data that has been gathered. Database workloads and multi-process benchmarking have not been tested yet; both will perform very poorly until a few more items come off the TODO list.

ext3, xfs and btrfs are compared below. The benchmarking machine is a Dell desktop (2.4GHz, dual core) running v2.6.21, with a single SATA drive. Each FS was formatted on the same 40GB LVM volume. All three filesystems are using IO barriers to force cache flushes.

ext3 was mounted with -o data=writeback,barrier=1 and created with htree indexing.

xfs was formatted with mkfs.xfs -d agcount=1 -l size=128m,version=2. This creates a single allocation group on the disk, which gave the best XFS results on this machine (hint from SGI's Dave Chinner). xfs was mounted with -o logbsize=256k.
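
Putting the above together, the setup for each run looked roughly like the following sketch. The device and mount point names are placeholders, the mkfs.ext3 feature flag is inferred from the htree note above, and the btrfs line assumes default mkfs options:

    DEV=/dev/mapper/bench-vol   # placeholder for the 40GB LVM volume
    MNT=/mnt/test
    # run only the pair of commands for the filesystem being tested

    # ext3: htree directory indexing, writeback data mode, barriers on
    mkfs.ext3 -O dir_index $DEV
    mount -o data=writeback,barrier=1 $DEV $MNT

    # xfs: a single allocation group and a 128MB version 2 log
    mkfs.xfs -f -d agcount=1 -l size=128m,version=2 $DEV
    mount -o logbsize=256k $DEV $MNT

    # btrfs: defaults assumed
    mkfs.btrfs $DEV
    mount $DEV $MNT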


Creating kernel trees

This is a fairly boring benchmark, just creating enough kernel trees to fill up RAM a few times over. However, the results were not what I expected, so there is some discussion below about why the numbers come out this way. compilebench -i 20 -r 0 is used to create files with the same names and sizes as those in the 2.6.20 kernel tree, and the speeds of the 20 creation runs are averaged together for the result:

FS      Average throughput (20 runs)    Graph
Btrfs   28.77 MB/s                      IO graph
XFS     16.95 MB/s                      IO graph
Ext3    22.92 MB/s                      IO graph

It should be noted that throughput in the table above is calculated from the total file data written divided by the time the test required. Throughput in the IO graphs is derived from blktrace, and so it includes everything the FS actually writes to the drive. It may be higher than the speeds observed by the application.
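
The page does not spell out exactly how the IO graphs were captured, but a blktrace run alongside the benchmark is the general idea. A minimal sketch, assuming compilebench's -D option for the target directory and a placeholder device name:

    # capture block-level IO for the whole run (device name is a placeholder)
    blktrace -d /dev/sdb -o create-run &
    TRACE_PID=$!

    # 20 kernel-tree creation runs, no random operations
    compilebench -i 20 -r 0 -D /mnt/test

    # stop the trace and dump it to text for graphing or analysis
    kill $TRACE_PID
    blkparse -i create-run > create-run.txt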

The ext3 results look faster than XFS's, but the graph shows that toward the end of the run ext3 performance starts to degrade. Even though ext3 places metadata and data close together on disk, the two are not written at the same time. Right at 250 seconds on the graph, you can see the block device inode being flushed out in a series of seeky (although increasing in offset) writes. As the number of compilebench runs increases, ext3's average performance goes down.

XFS has the same problem of metadata vs data seeking, but the graph shows it is writing back the metadata much more frequently. The XFS numbers stay consistent as more compilebench runs are done.

Btrfs does not intermix metadata and data on the drive, and so it is able to run in this workload with much less seeking. The Btrfs numbers also stay consistent as more runs are done (Btrfs pays for this with read speed).


Compilebench

Copy-on-write filesystems are more likely to fragment as the FS ages. compilebench was developed to try to measure how an FS performs over a long run of file creates, deletions and modifications. It does this by simulating a kernel compile, creating and deleting files with the same names and sizes as they appear in the kernel tree (thanks to Matt Mackall for this idea). Times to read, stat, create, delete, patch and clean trees are also collected. Please check the compilebench homepage for more details.

This test used compilebench -i 90 -r 150, which creates 90 initial trees and then runs 150 random operations on them.
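
A sketch of how such a run, and the fsck timing in the last row of the table, might be driven; the -D option, device path and fsck tool names are assumptions rather than commands quoted from this page:

    # age the filesystem: 90 initial kernel trees, then 150 random operations
    compilebench -i 90 -r 150 -D /mnt/test

    # unmount and time a full check (use the line matching the FS under test)
    umount /mnt/test
    time e2fsck -f -y /dev/mapper/bench-vol    # ext3
    time xfs_repair /dev/mapper/bench-vol      # xfs
    time btrfsck /dev/mapper/bench-vol         # btrfs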

Operation             Btrfs           XFS             Ext3
Initial create        29.17 MB/s      15.92 MB/s      15.11 MB/s
Create tree           13.05 MB/s      9.22 MB/s       10.63 MB/s
Patch tree            4.73 MB/s       3.70 MB/s       5.14 MB/s
Compile tree          20.48 MB/s      23.51 MB/s      17.24 MB/s
Clean tree            96.50 MB/s      141.11 MB/s     47.27 MB/s
Read tree             6.91 MB/s       9.17 MB/s       9.04 MB/s
Read compiled tree    12.51 MB/s      15.46 MB/s      15.48 MB/s
Delete tree           17.12 seconds   17.10 seconds   19.26 seconds
Delete compiled tree  24.67 seconds   21.68 seconds   31.28 seconds
Stat tree             15.29 seconds   6.33 seconds    12.59 seconds
Stat compiled tree    15.29 seconds   7.93 seconds    14.04 seconds
Fsck time after run   1m44s           3m32s           11m0s

As expected, the numbers show that Btrfs scores higher on the write phases than on the read phases. Overall, XFS wins most of the phases.


Big directories

These tests stress the directory indexing, inode allocation and metadata writeback routines. One million files are created in a single directory on an empty FS, and then read and deleted. Unmounts are done between each operation.
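
The exact tooling for this test is not given on this page, so the following is only an illustrative sketch of its shape, using a plain (and slow) shell loop and placeholder paths:

    DEV=/dev/mapper/bench-vol   # placeholder device
    MNT=/mnt/test

    # create one million empty files in a single directory
    # (the other variants below write 512 bytes or 16k into each file instead)
    mkdir $MNT/bigdir
    for i in $(seq 0 999999); do
        : > $MNT/bigdir/file-$i
    done

    # unmount and remount between each phase so nothing is served from cache
    umount $MNT
    mount $DEV $MNT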

Tar is used to demonstrate the performance of reading files in the order that readdir returns them. acp also reads directories, but it sorts the files by inode number as it finds them and does large batches of open(2). acp then does readahead(2) to optimize data reads, and finally reads the data. acp is available in two versions, acp (syslets) and acp (readahead); since there is only one directory in this case, the two give about the same results.
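
The difference between the two read orders can be approximated from the shell. The sketch below assumes GNU find and xargs, and it only mimics acp's inode-number sort, not its readahead(2) batching:

    cd /mnt/test/bigdir

    # readdir order, roughly what the tar runs measure
    time tar cf - . > /dev/null

    # inode order: list inode numbers, sort numerically, strip the numbers,
    # then read the files in that order
    time find . -type f -printf '%i %p\n' | sort -n | cut -d' ' -f2- \
        | xargs -d '\n' cat > /dev/null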

Times for one million empty files in a single dir

Operation     Btrfs       XFS       Ext3
Create        1m49.83s    12m26s    1m54s
find .        8.77s       5.95s     48s
Read (tar)    3m56s       2m23s     9m19s
Read (acp)    3m45s       2m15s     1m57s

Times for one million files in a single dir, 512 bytes each

Operation     Btrfs      XFS       Ext3
Create        3m17s      16m6s     3m45s
find .        21.60s     10.85s    50.22s
Read (tar)    5m49s      8m7s      204m22s
Read (acp)    5m30s      4m2s      3m27s
Delete        9m25s      7m28s     18m27s

Times for one million files in a single dir, 16k each

Operation     Btrfs      XFS       Ext3
Create        6m52s      16m6s     7m21s
find .        16.19s     23.69s    N/A
Read (tar)    18m56s     12m17s    N/A
Read (acp)    12m11s     10m37s    8m19s
Delete        11m16s     9m48s     N/A

The numbers show that read and delete performance in these tests is dominated by how closely readdir order matches the order of inodes on disk. Presumably, the XFS readahead code for directories is the major reason it does so well. Btrfs hits a middle ground, but it is clear there is work to be done in avoiding seeks while fetching the inodes.

Btrfs is able to win the tar read run on 512-byte files because those small files are packed into the same btree block that stores the inode.

The ext3 numbers show that htree needs some help. While it is possible to create programs like acp that sort everything by inode number, this should not be required to avoid horrible performance during backups.