[Btrfs-devel] Re: Initial Planning document for multiple device support

Wed Jan 23 06:07:47 PST 2008

On Wednesday 23 January 2008 13:50:24 Chris Mason wrote:
> On Wednesday 23 January 2008, Andi Kleen wrote:
> > Chris Mason <chris.mason at oracle.com> writes:
> >
> > Just commenting on something that tripped me while reading
> > the document.
> >
> > >If Btrfs were to rely on device mapper or MD for mirroring, it would
> > >not be able to resolve checksum failures by checking the mirrored
> > >copy. The lower layers don't know the checksum or granularity of the
> > >filesystem blocks, and so they are not able to verify the data they
> > >return.
> >
> > I cannot imagine it would be that difficult to add a new READ_OTHER_COPY
> > io operation that would cause MD/LVM/... to return the other copy
> > in a mirror set.
>
> This is something SGI recently proposed, and it is a very good idea I
> think. It also makes sense for hooks between MD and the FS to figure out
> which blocks are in use during a rebuild, and for the FS to tell LVM when
> blocks are freed to help make snapshots more efficient.
>
> > Even without btrfs that might be even generally useful for other
> > applications that do some checking on their files.
> >
> > e.g. I could well imagine a new system call to trigger this on the
> > page cache level.
> >
> > There might be other reasons to reinvent another storage manager
> > of course. Just that one above doesn't seem to be very convincing.
> > I admit I haven't thought too deeply about the other issues you
> > raise in the document.
>
> The key problem that requires most of this infrastructure is mirroring
> metadata on a single spindle.

You mean multiple spindles?

So basically the problem is this: the file system should be able to
allocate a new block on a different device than some other block, so
that you can make sure the metadata copies end up on different devices?

If that's all, I assume suitable interfaces to MD etc. could be created to
ask the underlying block device for a suitable hint, although they would
probably be somewhat difficult to implement with efficient lookup.
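
To make the hint idea concrete, here is a rough sketch of what such a
lookup could look like, modelled in user space. Every name in it (Extent,
MirrorHintMap, device_of, hint_other_device) is made up for illustration;
it is not an existing MD or LVM interface, and it assumes the block layer
can describe its layout as a sorted list of extents.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    device: str   # physical device backing this logical range (assumed name)
    start: int    # first logical block of the range
    length: int   # number of blocks in the range


class MirrorHintMap:
    """Hypothetical block-layer view mapping logical blocks to devices."""

    def __init__(self, extents):
        # Keep extents sorted by start so lookups can use binary search.
        self.extents = sorted(extents, key=lambda e: e.start)
        self.starts = [e.start for e in self.extents]

    def device_of(self, block):
        """Return the device holding a logical block (O(log n) lookup)."""
        i = bisect.bisect_right(self.starts, block) - 1
        if i < 0:
            raise ValueError("block is before the first extent")
        extent = self.extents[i]
        if block >= extent.start + extent.length:
            raise ValueError("block falls in an unmapped hole")
        return extent.device

    def hint_other_device(self, block):
        """Return a logical block on a different device, or None if there is none."""
        current = self.device_of(block)
        for extent in self.extents:  # linear scan; fine for a sketch
            if extent.device != current:
                return extent.start
        return None
```

The file system would call hint_other_device() with the location of the
first metadata copy and allocate the second copy at or near the returned
block. The per-block lookup is cheap here only because the layout is a
short sorted list; a striped or remapped layout would need a better index.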

What would be more difficult is making sure that when blocks get
migrated later for some reason, the invariant of metadata being
mirrored stays true.
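
As a sketch of what preserving that invariant would involve, the migration
path could refuse any move that puts two copies of the same metadata on one
device. The names below (mirror_invariant_holds, migrate, the layout dict)
are hypothetical, and the sketch assumes the lower layer knows which blocks
are mirror copies of each other, which is exactly the information it does
not have today.

```python
def mirror_invariant_holds(copies, device_of):
    """True if every copy in the group lives on a distinct device."""
    devices = [device_of(block) for block in copies]
    return len(set(devices)) == len(devices)


def migrate(layout, block, new_device, mirror_groups):
    """Move block to new_device unless that would co-locate mirror copies.

    layout        -- dict mapping logical block -> device name (assumed model)
    mirror_groups -- iterable of tuples of blocks that mirror each other
    Returns True if the move was applied, False if it was refused.
    """
    # Evaluate the move on a copy so a refused migration changes nothing.
    trial = dict(layout)
    trial[block] = new_device
    for copies in mirror_groups:
        if not mirror_invariant_holds(copies, trial.__getitem__):
            return False
    layout[block] = new_device
    return True
```

For example, with blocks 10 and 20 mirrored on sda and sdb, moving block 10
to sdb is refused, while moving it to a third device sdc is allowed.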

On the other hand if that was implemented as hints it would be in
theory possible to support such hints in larger RAID boxes that
have internal redundancy already.

But then again I might be missing some details here.

> Chunks aren't required to solve it, but they 
> do add flexibility to do lots of other things.  For example, relocating hot
> blocks on to the SSD portion of a combined SSD/spindle drive, or writing to
> the SSD when on battery and then transferring in bulk to the spindle.

LVM can do that already I thought.

-Andi