Mason's Blog

September 15, 2008

Btrfs has been making great progress, including contributors from Red Hat, Intel, and others. Some of my favorite new features are improved data=ordered mode support (with v0.16) and fsync optimizations (queued for v0.17). Many Linux filesystems have had problems in the past with workloads that mix fsync() and lots of buffered IO. A recent writeup of this centered around the Firefox fsync bug.

The short version is that Ext3, when mounted in the default data=ordered mode, flushes all dirty data pages in the filesystem when you run fsync on any file. This can create significant latencies both for new IO and for the poor process waiting on fsync. XFS and Ext4 each have their own schemes to provide data=ordered semantics without waiting on every dirty page.
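To make the failure mode concrete, here is a rough sketch of that kind of mixed workload (not the actual Firefox test case; the file names and sizes are made up): one process dirties a pile of pages with plain buffered writes while another does tiny appends followed by fsync and times each fsync. On a filesystem that has to flush everyone's dirty data to satisfy the fsync, the timings balloon.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static char buf[1 << 20];   /* 1MB of junk to dirty pages with */

    int main(void)
    {
        memset(buf, 'a', sizeof(buf));

        if (fork() == 0) {
            /* child: generate lots of buffered (non-synchronous) IO */
            int big = open("bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            for (int i = 0; i < 1024; i++)      /* ~1GB of dirty data */
                write(big, buf, sizeof(buf));
            close(big);
            exit(0);
        }

        /* parent: the poor process waiting on fsync */
        int small = open("smallfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
        for (int i = 0; i < 20; i++) {
            write(small, "x", 1);
            double start = seconds();
            fsync(small);
            printf("fsync %2d took %.3f seconds\n", i, seconds() - start);
            sleep(1);
        }
        close(small);
        return 0;
    }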

With v0.16, Btrfs moved away from an Ext3 style data=ordered implementation to something that updates metadata only after the data extents are fully on disk. This means a transaction commit never has to wait on data extents, and makes it possible to provide atomic writes of any single write system call.
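As a loose illustration of what an atomic single write buys an application (the file name and record size here are invented for the example), a record spanning several blocks can be overwritten in place with one pwrite(), and after a crash you should see either the old record or the new one, never a mix:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char record[32 * 1024];          /* spans several 4k blocks */
        memset(record, 'B', sizeof(record));    /* the "new" version of the record */

        int fd = open("records.db", O_RDWR | O_CREAT, 0644);
        /* the whole overwrite is issued as one system call */
        pwrite(fd, record, sizeof(record), 128 * 1024);
        fsync(fd);
        close(fd);
        return 0;
    }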

The Btrfs unstable trees have new optimizations around fsync and O_SYNC writes. Before, any synchronous write would force a full transaction commit, which writes out all the dirty metadata blocks. The COW btrees are pretty efficient over long-running transactions, but for short transactions we end up writing significant amounts of metadata, especially in the extent allocation tree, due to frequent re-COWing of blocks.

The new code creates a write-ahead logging tree. An fsync on a single file or directory copies the items that have changed in the current transaction into the logging tree, and then pushes the logging tree down to disk. Crash recovery involves merging the items found in the logging tree with the metadata in the subvolume btree, and recording the data extent allocations.
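Here is a toy model of that flow. Every structure and function name below is invented for illustration; it is not the Btrfs code, just the shape of the idea: fsync copies only the items the inode dirtied in the current transaction into the log tree and pushes that to disk, and recovery merges the logged items back.

    #include <stdio.h>

    #define MAX_ITEMS 64

    /* all names here are invented for illustration, not real btrfs structures */
    struct item {
        long objectid;      /* which inode the item belongs to */
        long transid;       /* transaction that last changed it */
        const char *what;   /* inode item, file extent item, ... */
    };

    struct tree {
        struct item items[MAX_ITEMS];
        int nr;
    };

    static struct tree subvol;      /* the subvolume btree (simplified to a list) */
    static struct tree log_tree;    /* the write-ahead logging tree */
    static long cur_transid = 1;

    /* fsync path: copy just the items this inode changed in the current
     * transaction into the log tree, then write the log tree to disk.
     * No full transaction commit, so no extent tree churn. */
    static void log_fsync(long objectid)
    {
        for (int i = 0; i < subvol.nr; i++) {
            struct item *it = &subvol.items[i];
            if (it->objectid == objectid && it->transid == cur_transid)
                log_tree.items[log_tree.nr++] = *it;
        }
        printf("pushed %d log items for inode %ld to disk\n", log_tree.nr, objectid);
    }

    /* crash recovery: merge the items found in the log tree back into the
     * subvolume btree (here we just print what would be merged) */
    static void replay_log(void)
    {
        for (int i = 0; i < log_tree.nr; i++)
            printf("merging %s item for inode %ld\n",
                   log_tree.items[i].what, log_tree.items[i].objectid);
        log_tree.nr = 0;
    }

    int main(void)
    {
        /* pretend inode 257 dirtied two items in this transaction */
        subvol.items[subvol.nr++] = (struct item){ 257, cur_transid, "inode" };
        subvol.items[subvol.nr++] = (struct item){ 257, cur_transid, "file extent" };

        log_fsync(257);     /* cheap: only inode 257's items are written */
        replay_log();       /* what mount would do after a crash */
        return 0;
    }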

This is very efficient, and it completely decouples synchronous operations from other operations on the FS. The write-ahead logging tree uses the same btree code, item formats, and IO submission paths as the rest of the FS. This means all the RAID policies for metadata are applied to the logging tree without any special-case code.

Using the Firefox test case, the old Btrfs fsync code would trigger long stalls while loading pages. With the new code, you can't tell Firefox is running fsync at all. For some harder numbers, I ran Andrew Morton's synctest program to simulate a mail server on a single SATA drive with barriers enabled. The results show the new code is very effective. Diving into the data, it appears most of the differences between the filesystems center on data allocation schemes rather than logging speed.
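For reference, a rough approximation of the kind of mail-server workload synctest measures (this is not synctest itself, and the message size and count here are arbitrary) is just a loop that writes a small file and fsyncs it, reporting how many files per second complete:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        static char msg[16 * 1024];         /* one "mail message" */
        memset(msg, 'm', sizeof(msg));

        int count = 500;
        double start = seconds();
        for (int i = 0; i < count; i++) {
            char name[64];
            snprintf(name, sizeof(name), "msg.%d", i);
            int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            write(fd, msg, sizeof(msg));
            fsync(fd);                      /* the mail server's durability point */
            close(fd);
        }
        printf("%.1f files/sec\n", count / (seconds() - start));
        return 0;
    }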

October 19, 2007

Well, it took about three times longer than I expected, but the large blocksize code for Btrfs is finally working. My home directory is running on a freshly formatted 16k blocksize Btrfs. I don't think I'll get the huge blocks I was hoping for without changing the format of the data in the leaves, but first I'll try tuning for SSD via allocator optimizations.

I've collected some benchmarks, and the results are pretty good. I'll make the default 16K for starters, both because it performs well and because I'd like to shake out any large block bugs.

In other news, I'll visit Portland next month to brainstorm on filesystem features, and will meet with most of the Oracle kernel team, including Andy Grover who just joined us.

September 19, 2007

I've been meaning to set up a place where I can post status updates and other details. Since blogs.oracle.com is under reorg, I'll run it under oss.oracle.com for a little while.

2008 Storage Workshop: It is time to get rolling on the 2008 storage workshop, which I'll organize this year. You can see details about the last workshop on the USENIX page. This year it will be somewhat more hands-on, but still adjacent to FAST (late Feb '08). Please send me some email if you have comments on things you would like to see at LSF this year.

SSD tuning: I've started experimenting with tuning Btrfs on an SSD drive. In general Btrfs seems to squeeze more write speed out of the chips than other filesystems, but my read performance is somewhat lower. My current plan is to implement monster btree leaves (1MB or more), and 128k btree nodes.

This should better match the underlying blocksizes of the SSD, and it'll make for some interesting ammo in the current large block size flamefest. LWN has a good writeup as well.

But large blocksizes in Btrfs have an added cost: larger blocks mean more reference count changes as each block is COW'd. So there will be lots of interesting tuning involved.

Readdir experiments: I've been planning to implement backpointers from inodes to each directory that links to them. This will make it much easier to detect bogus link counts on inodes, and it will be a key part of implementing incremental backups on Btrfs. I realized it also allows me to implement a readdir optimization Andreas Dilger mentioned a few years back.

If the inodes linked into a directory are relatively sequential, and each inode has a backpointer that includes the filename, you can implement readdir by walking the inodes inside the directory. For Btrfs, this basically turns readdir into a btree walk over the inodes themselves. Most of the file metadata will end up in cache, and any later read (or delete) operations will be much faster.

Since I also have directories indexed by the inode numbers of the things linked into them, I can use a hybrid approach that switches between an inode walk and a directory walk as needed. I'm pretty anxious to try it out, but I think the large blocksize support will have to come first.
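To sketch the inode-walk half of that (the item layout below is invented for illustration and is not the Btrfs disk format): each inode record carries backpointers naming the directories that link to it, so readdir can scan the inode records in inode-number order and emit the names whose backpointer matches the directory. The hybrid version would fall back to a normal directory-index walk when the child inode numbers turn out to be widely scattered.

    #include <stdio.h>

    struct backref {
        long parent_dir;        /* inode number of a directory that links us */
        const char *name;       /* the filename used by that link */
    };

    struct inode_rec {
        long ino;
        struct backref refs[2]; /* backpointers stored with the inode */
        int nr_refs;
    };

    /* "subvolume btree": inode records, kept in inode-number order */
    static struct inode_rec inodes[] = {
        { 257, { { 256, "inbox"  } }, 1 },
        { 258, { { 256, "sent"   } }, 1 },
        { 259, { { 256, "drafts" } }, 1 },
    };

    /* readdir as a walk of the inodes themselves: when a directory's children
     * have roughly sequential inode numbers, producing the listing also brings
     * the file metadata into cache in one sequential pass */
    static void readdir_by_inode_walk(long dir_ino)
    {
        for (unsigned i = 0; i < sizeof(inodes) / sizeof(inodes[0]); i++)
            for (int r = 0; r < inodes[i].nr_refs; r++)
                if (inodes[i].refs[r].parent_dir == dir_ino)
                    printf("%ld %s\n", inodes[i].ino, inodes[i].refs[r].name);
    }

    int main(void)
    {
        readdir_by_inode_walk(256);     /* list the directory with inode 256 */
        return 0;
    }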