[Btrfs-devel] btrfs and Solid State Disks (SSD).

Mon Feb 11 19:55:58 PST 2008

On 2/12/08, Chris Mason <chris.mason at oracle.com> wrote:
> On Monday 11 February 2008, Miguel Figueiredo Mascarenhas Sousa Filipe wrote:
> > Hi there,
> >
> > This might sound stupid, but I cannot infer the answer to my doubts
> > from the documentation and features of btrfs, so here it goes:
> >
> > Is the on-disk data and metadata layout of btrfs favorable for SSDs?
>
> Yes, SSDs are a big target of mine, and so the parts that are not currently
> favorable for SSDs will be changed.  The big problem right now is that btrfs
> writes to a fixed super block for every commit.  That will change to a
> rotating set of fixed super blocks to lower wear and improve redundancy.
>
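
To make the wear argument concrete, here is a rough sketch of what
rotating the super block could look like. The round-robin rule, the slot
count of four, and the name superblock_slot are my assumptions for
illustration only; the mail above does not say how btrfs will pick a slot.

```python
from collections import Counter


def superblock_slot(commit_number: int, num_slots: int) -> int:
    """Pick a super block slot round-robin (hypothetical policy, not btrfs code)."""
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    return commit_number % num_slots


def writes_per_slot(num_commits: int, num_slots: int) -> Counter:
    """Count how many commits land on each super block slot."""
    return Counter(superblock_slot(c, num_slots) for c in range(num_commits))


fixed = writes_per_slot(1000, 1)     # a single fixed super block
rotating = writes_per_slot(1000, 4)  # four rotating super blocks (assumed count)
print(max(fixed.values()), max(rotating.values()))
```

With 1000 commits, the single fixed location takes all 1000 writes, while
each of the four rotating slots takes 250.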

I expect that the only thing that matters for the on-disk layout is
the alignment between the filesystem's logical allocation unit and
the erasure block of the flash-based storage.
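
A small sketch of why alignment matters: it counts how many erasure blocks
a single write overlaps. The function names are hypothetical, and the
128KB erasure block is just the figure quoted in this thread, not a
property of any particular device.

```python
KB = 1024
ERASE_BLOCK = 128 * KB  # figure quoted in this thread; real devices vary


def erase_blocks_touched(offset: int, length: int, erase_block: int = ERASE_BLOCK) -> int:
    """Number of erasure blocks overlapped by a write of `length` bytes at `offset`."""
    if length <= 0:
        return 0
    first = offset // erase_block
    last = (offset + length - 1) // erase_block
    return last - first + 1


def is_aligned(offset: int, length: int, erase_block: int = ERASE_BLOCK) -> bool:
    """True when the write starts and ends on erasure block boundaries."""
    return offset % erase_block == 0 and length % erase_block == 0


aligned = erase_blocks_touched(0, 128 * KB)          # starts on a boundary
misaligned = erase_blocks_touched(4 * KB, 128 * KB)  # shifted by 4KB
print(aligned, misaligned)
```

The same 128KB write overlaps one erasure block when aligned and two when
shifted by 4KB, so the device has to deal with twice as many blocks.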

> >
> > From what I read, current SSDs are characterized by:
> >
> > - poor performance in random writes (because of block erasure)
>
> Random writes are very fast, as long as they fill an entire erasure block
> (often 128KB).  There are rumors that SSDs will have smaller erasure blocks in
> the future.

Unfortunately, erasure blocks are more likely to get bigger in the future.
128KB is the size for SLC NAND, and 512KB for MLC.
Furthermore, an SSD groups multiple flash chips to exploit parallelism,
so the effective erasure block size is proportional to the number
of chips inside.
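
As a back-of-the-envelope illustration, assuming the effective size is
simply the per-chip erasure block times the number of chips (the chip
counts below are made up, and real controllers may group chips
differently):

```python
KB = 1024
# Per-chip erasure block sizes quoted above.
ERASE_BLOCK = {"SLC": 128 * KB, "MLC": 512 * KB}


def effective_erase_block(cell_type: str, num_chips: int) -> int:
    """Effective erasure block in bytes, assuming plain proportional scaling."""
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return ERASE_BLOCK[cell_type] * num_chips


for cell_type in ("SLC", "MLC"):
    for chips in (1, 4, 8):  # hypothetical chip counts
        size = effective_erase_block(cell_type, chips)
        print(f"{cell_type} x{chips}: {size // KB} KB")
```

Under that assumption, four MLC chips already give a 2048KB effective
erasure block.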

>
> > - require wear leveling, even those that emulate sata/ide/scsi disk
> > with onboard wear leveling logic.
>
> Yes, they all require wear leveling, and usually they do this internally.
> Btrfs will not do wear leveling.

I agree.

>
> > - excellent seek latency
> > - excellent read (random and sequential) performance.
> > - good at sequential writes.
> >
> > I've read that journaling file systems are usually bad for SSD because
> > of (from what I suppose are) two things:
> > - increased "random" write load (journal + proper data)
> > - write hot spot on the journal, causing lots of write cycles on a
> > given set of blocks.
> >
> > Theoretically, ext2 would be better for an SSD than ext3 because of these
> > issues.
>
> Journaled filesystems will definitely exercise the wear leveling firmware, but
> so will ext2.  The metadata and file data blocks are in a fixed location and
> use small block sizes.  So, metadata heavy workloads will hammer on the SSD
> either way.
>

I agree. The I/O pattern of a journaling fs tends to be unaligned, scattered,
and made up of small writes.

> >
> > So, is the design of btrfs a good match for the peculiarities of SSDs ?
> >
>
> Yes, because Btrfs is copy on write it is able to always cluster metadata and
> data writes in an optimal fashion on the SSD.  On traditional storage you get
> very bad performance with this type of allocation model because it will have
> many more seeks on reads.  But with SSD, there is no read penalty.
>

I expect that a COW or log-structured design is a good match for SSDs.
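
A toy comparison of the two allocation models, counting how many erasure
blocks a batch of small updates touches. The workload (64 updates of 4KB,
spaced 1MB apart) and the 128KB erasure block are assumptions chosen for
illustration; this is not a model of the btrfs allocator.

```python
KB = 1024
MB = 1024 * KB
ERASE_BLOCK = 128 * KB  # figure quoted in this thread
WRITE_SIZE = 4 * KB     # assumed small-block update size


def blocks_touched(offsets, write_size=WRITE_SIZE, erase_block=ERASE_BLOCK):
    """Count distinct erasure blocks overlapped by writes at the given offsets."""
    touched = set()
    for off in offsets:
        first = off // erase_block
        last = (off + write_size - 1) // erase_block
        touched.update(range(first, last + 1))
    return len(touched)


# Update-in-place: each update goes back to its fixed, scattered location.
scattered_offsets = [i * MB for i in range(64)]
in_place = blocks_touched(scattered_offsets)

# COW / log-structured: the same updates are written back-to-back elsewhere.
clustered_offsets = [i * WRITE_SIZE for i in range(len(scattered_offsets))]
clustered = blocks_touched(clustered_offsets)

print(in_place, clustered)
```

In this toy case the in-place updates touch 64 erasure blocks, while the
clustered writes fill exactly two.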

> With v0.12, I introduced a knob to tune the allocator for SSD (clearly there
> is much more work to do here).  You mount with -o ssd to enable the tuning.
> Here is an example graph on an SSD device with postmark, which basically does
> random writes to a bunch of files:
>
> http://oss.oracle.com/~mason/seekwatcher/pm-compare.png
>
> Here is the same workload on a traditional sata drive, but with ext3 in the
> results:
>
> http://oss.oracle.com/~mason/seekwatcher/postmark/postmark-compare.png
>
> Notice that on the spinning sata drive btrfs -o ssd isn't much faster overall.
> This is because write cache is enabled on the drive.  With the write cache
> off, -o ssd is about 2x faster than the defaults.
>

I've also run some benchmarks on several SSDs and found that
btrfs v0.12 (with the ssd option) outperforms journaling filesystems.

--
Dongjun