So far, benchmarking has focused on workloads that target a specific aspect of the allocator algorithms or disk formats. Over the long term, this page will be updated to reflect all the benchmarking data that has been gathered. Database workloads and multi-process benchmarking have not been tested yet; both will perform very poorly until a few more items come off the TODO list.
ext3, xfs and btrfs are compared below. The benchmarking machine is a Dell desktop (2.4GHz, dual core) with a single SATA drive, running kernel v2.6.21. Each FS was formatted on the same 40GB LVM volume. All three filesystems use IO barriers to force cache flushes.
ext3 is mounted -o data=writeback,barrier=1 and created with htree indexing.
xfs was formatted with mkfs.xfs -d agcount=1 -l size=128m,version=2. This creates one allocation group on the disk, which gave the best XFS results on this machine (a hint from SGI's Dave Chinner). xfs was mounted -o logbsize=256k.
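The format and mount invocations above can be collected into a short setup sketch. The device path and mount point below are placeholders, and `-O dir_index` is one way to get htree indexing at mkfs time; the xfs and ext3 options are the ones quoted above:

```shell
#!/bin/sh
# Sketch of the benchmark setup described above. /dev/vg/bench and
# /mnt/bench are hypothetical names; substitute your own LVM volume.
DEV=/dev/vg/bench
MNT=/mnt/bench

# ext3 with htree directory indexing, writeback data mode and barriers
mkfs.ext3 -O dir_index "$DEV"
mount -o data=writeback,barrier=1 "$DEV" "$MNT"

# xfs with a single allocation group and a 128MB version-2 log
# (run instead of the ext3 commands above when testing xfs):
# mkfs.xfs -f -d agcount=1 -l size=128m,version=2 "$DEV"
# mount -o logbsize=256k "$DEV" "$MNT"
```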
This is a fairly boring benchmark, just creating enough kernel trees to fill up ram a few times over. However, the results were not what I expected, so there is some discussion below about why the numbers come out this way. compilebench -i 20 -r 0 is used to create files with the same names and sizes as the 2.6.20 kernel tree, and the speeds of the 20 creation runs are averaged together for the result:
| FS | Average create speed | |
| --- | --- | --- |
| XFS | 16.95 MB/s | IO graph |
| Ext3 | 22.92 MB/s | IO graph |
It should be noted that throughput in the table above is calculated from the total file data written divided by the time the test required. Throughput in the IO graphs is derived from blktrace, and so it includes everything the FS actually writes to the drive. It may be higher than the speeds observed by the application.
The Ext3 results look faster than XFS, but the graph shows that toward the end of the run ext3 performance starts to degrade. Even though ext3 is placing the metadata and data close together on disk, the two are not being written at the same time. Right at 250 seconds on the graph, you can see someone come in and flush out the block device inode in a series of seeky (although increasing) writes. As the number of compilebench runs increases, ext3 average performance goes down.
XFS has the same problem of metadata vs data seeking, but the graph shows it is writing back the metadata much more frequently. The XFS numbers stay consistent as more compilebench runs are done.
Btrfs does not intermix metadata and data on the drive, and so it is able to run in this workload with much less seeking. The Btrfs numbers also stay consistent as more runs are done (Btrfs pays for this with read speed).
Copy on write filesystems are more likely to fragment as the FS ages. compilebench was developed to try to measure how an FS performs during a long run of file creates, deletions and modifications. It does this by simulating a kernel compile, creating and deleting files with the same names and sizes as they appear in the kernel (thanks to Matt Mackall for this idea). Times to read, stat, create, delete, patch and clean trees are also collected. Please check the compilebench homepage for more details.
This test used compilebench -i 90 -r 150, which creates 90 initial trees and then runs 150 random operations on them.
| Phase | Btrfs | XFS | Ext3 |
| --- | --- | --- | --- |
| Initial Create | 29.17 MB/s | 15.92 MB/s | 15.11 MB/s |
| Create tree | 13.05 MB/s | 9.22 MB/s | 10.63 MB/s |
| Patch tree | 4.73 MB/s | 3.70 MB/s | 5.14 MB/s |
| Compile Tree | 20.48 MB/s | 23.51 MB/s | 17.24 MB/s |
| Clean tree | 96.50 MB/s | 141.11 MB/s | 47.27 MB/s |
| Read Tree | 6.91 MB/s | 9.17 MB/s | 9.04 MB/s |
| Read Compiled Tree | 12.51 MB/s | 15.46 MB/s | 15.48 MB/s |
| Delete Tree | 17.12 seconds | 17.10 seconds | 19.26 seconds |
| Delete Compiled Tree | 24.67 seconds | 21.68 seconds | 31.28 seconds |
| Stat Tree | 15.29 seconds | 6.33 seconds | 12.59 seconds |
| Stat Compiled Tree | 15.29 seconds | 7.93 seconds | 14.04 seconds |
| Fsck time after run | 1m44s | 3m32s | 11m0s |
As expected, the numbers show that Btrfs scores higher on the write phases than on the read phases. Overall, XFS wins most of the phases.
These tests stress the directory indexing, inode allocation and metadata writeback routines. One million files are created in a single directory on an empty FS, and then read and deleted. Unmounts are done between each operation.
Tar is used to demonstrate performance from reading files in the order that readdir returns them. acp also reads directories, but it sorts the files by inode number as it finds them and does large batches of open(2). acp then does readahead(2) to optimize data reads, and finally reads the data. acp is available in two versions, acp (syslets) and acp (readahead). Since there is only one directory in this case, the two give about the same results.
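The trick acp exploits can be approximated with standard GNU tools: list each file with its inode number, sort numerically, and read the files in that order instead of raw readdir order. A rough sketch follows (a hypothetical pipeline, not the actual acp code, and without acp's readahead(2) batching):

```shell
#!/bin/sh
# Build a small test tree, then read its files in inode order rather
# than readdir order -- a crude approximation of what acp does.
set -e
dir=$(mktemp -d)
for i in 1 2 3; do echo "data $i" > "$dir/file$i"; done

# %i prints the inode number; sort numerically, drop the inode column,
# then open and read each file in that order.
find "$dir" -type f -printf '%i %p\n' \
  | sort -n \
  | cut -d' ' -f2- \
  | xargs -d '\n' cat > /dev/null

rm -rf "$dir"
```

On a large directory this keeps the reads of the inode table roughly sequential on disk, which is where tar's readdir-order reads lose time to seeks.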
Times for one million empty files in a single dir
Times for one million files in a single dir, 512 bytes each
Times for one million files in a single dir, 16k each
The numbers show that read and delete performance in these tests is dominated by how closely readdir order matches the order of inodes on disk. Presumably, the XFS readahead code for directories is the major reason it does so well. Btrfs hits a middle ground, but it is clear there is work to be done in avoiding seeks while fetching the inodes.
Btrfs is able to win the read with tar run on 512 byte files because those small files are packed into the same btree block that stores the inode.
The ext3 numbers show that htree needs some help. While it is possible to create programs like acp that sort everything by inode number, this should not be required to avoid horrible performance during backups.