[Btrfs-devel] Re: Initial Planning document for multiple device
chris.mason at oracle.com
Wed Jan 23 07:07:57 PST 2008
On Wednesday 23 January 2008, Andi Kleen wrote:
> On Wednesday 23 January 2008 13:50:24 Chris Mason wrote:
> > On Wednesday 23 January 2008, Andi Kleen wrote:
[ why not use LVM or MD for mirroring ]
> > The key problem that requires most of this infrastructure is mirroring
> > metadata on a single spindle.
> You mean multiple spindles?
Even on a single drive, I'll duplicate btrees by default (this can be turned
off). The expectation is that drives die in pieces and that Btrfs will be able
to survive this on single spindles by using duplicate copies of metadata.
> So basically the problem is: the file system should allocate new blocks
> that should be on a different device than another block so that you
> can make sure that the metadata ends up on different devices?
Yes, we also want to be sure on multiple spindles that the duplicates end up
on different spindles.
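
As a rough illustration of that placement rule (not Btrfs code; the Device
class, place_copies() and the "most free space first" ordering are all
hypothetical choices made for this sketch): with several devices each copy
goes to a different one, and with a single device both copies land on it.

```python
# Illustrative sketch only, not Btrfs code. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    free_bytes: int


def place_copies(devices, size, copies=2):
    """Pick the device that should hold each copy of one metadata block."""
    # Only consider devices that can hold at least one copy, fullest last.
    usable = sorted(
        (d for d in devices if d.free_bytes >= size),
        key=lambda d: d.free_bytes,
        reverse=True,
    )
    if len(usable) >= copies:
        # Multiple spindles: every copy goes to a different device.
        chosen = usable[:copies]
    elif len(usable) == 1 and usable[0].free_bytes >= size * copies:
        # Single spindle: keep duplicate copies on the same device.
        chosen = usable * copies
    else:
        # The sketch does not handle mixed cases (e.g. 3 copies on 2 devices).
        raise RuntimeError("not enough free space for all copies")
    for d in chosen:
        d.free_bytes -= size
    return [d.name for d in chosen]
```

Calling place_copies() with three devices returns two different device names;
calling it with one device returns the same name twice, provided that device
has room for both copies.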
> If it's only that, I assume suitable interfaces to MD etc. could be created
> to ask the underlying block device for a suitable hint, although they would
> probably be somewhat difficult to implement with efficient lookup.
The efficient lookup part is where things get difficult because you want a
hint where the FS hasn't already used all the free space.
> What would be more difficult is to make sure that when blocks get
> migrated later for some reason, the invariant of metadata getting
> mirrored stays true.
This is much more true with complex LVM configurations where devices come and
go. Keeping the FS in sync with changes to the linear address space is a big
> On the other hand if that was implemented as hints it would be in
> theory possible to support such hints in larger RAID boxes that
> have internal redundancy already.
A SCSI command for read-other-mirror would be very helpful.
> But then again I might also miss some details here.
> > Chunks aren't required to solve it, but they
> > do add flexibility to do lots of other things. For example, relocating
> > hot blocks on to the SSD portion of a combined SSD/spindle drive, or
> > writing to the SSD when on battery and then transferring in bulk to the
> > spindle.
> LVM can do that already I thought.
It can, and it looks like they have improved the algorithm since I last used
it. But, this ties into the part where the FS has to know about the kinds of
storage that different parts of the linear address space are on.
My long term goal is to push as much of this out of the FS as I can. If we
can find practical ways to use it as a generic dm target, I'll work to make
enough hooks for other filesystems to take advantage of it.
With LVM today, you allocate and configure your storage (these PEs are
mirrored, these are striped etc), and then create an FS on top.
The Btrfs plan allocates storage early, but only configures it on demand. The
end result should be faster mirror rebuilds when devices die and much easier
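
One possible reading of "allocate early, configure on demand", sketched under
stated assumptions (this is not the Btrfs design; Chunk, Pool, configure() and
chunks_to_rebuild() are hypothetical names): all chunks are handed to the pool
up front, a chunk only gets a profile such as "mirror" when it is first needed,
and after a device failure only the configured chunks have to be rebuilt.

```python
# Illustrative sketch only, not Btrfs code. All names here are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    device: str
    profile: Optional[str] = None  # None: allocated but not yet configured


class Pool:
    def __init__(self, devices, chunks_per_device):
        # Allocate early: every chunk belongs to the pool from the start.
        self.chunks = [
            Chunk(dev) for dev in devices for _ in range(chunks_per_device)
        ]

    def configure(self, profile, count):
        """Configure on demand: set up `count` chunks on distinct devices."""
        picked, seen = [], set()
        for chunk in self.chunks:
            if chunk.profile is None and chunk.device not in seen:
                picked.append(chunk)
                seen.add(chunk.device)
                if len(picked) == count:
                    break
        if len(picked) < count:
            raise RuntimeError("not enough devices with unconfigured chunks")
        for chunk in picked:
            chunk.profile = profile
        return picked

    def chunks_to_rebuild(self, dead_device):
        # Unconfigured chunks hold no data, so a rebuild can skip them.
        return [
            c for c in self.chunks
            if c.device == dead_device and c.profile is not None
        ]
```

With two devices of four chunks each and a single configure("mirror", 2) call,
chunks_to_rebuild() for either device returns one chunk instead of four.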