[Btrfs-devel] Re: Initial Planning document for multiple device support

Wed Jan 23 04:50:24 PST 2008

On Wednesday 23 January 2008, Andi Kleen wrote:
> Chris Mason <chris.mason at oracle.com> writes:
>
> Just commenting on something that tripped me while reading
> the document.
>
> >If Btrfs were to rely on device mapper or MD for mirroring, it would
> >not be able to resolve checksum failures by checking the mirrored
> >copy. The lower layers don't know the checksum or granularity of the
> >filesystem blocks, and so they are not able to verify the data they
> >return.
>
> I cannot imagine it would be that difficult to add a new READ_OTHER_COPY
> io operation that would cause MD/LVM/... to return the other copy
> in a mirror set.

SGI recently proposed something along these lines, and I think it is a very
good idea.  It also makes sense to add hooks between MD and the FS so MD can
figure out which blocks are in use during a rebuild, and so the FS can tell
LVM when blocks are freed, which would help make snapshots more efficient.
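
For illustration, here is a minimal sketch of how a filesystem might use a
"read the other copy" operation when a checksum fails. It is a toy model in
Python, not kernel code: ToyMirror, read_verified and the copy argument are
hypothetical names, no such interface is being described for MD or DM, and
zlib's CRC32 is only a stand-in for whatever checksum the filesystem stores.

```python
import zlib


class ToyMirror:
    """Hypothetical stand-in for a mirror set; not an MD/DM interface."""

    def __init__(self, copies):
        # One dict per mirror leg, mapping block number -> block contents.
        self.copies = copies

    def num_copies(self):
        return len(self.copies)

    def read(self, block, copy=0):
        # copy > 0 plays the role of a READ_OTHER_COPY style request.
        return self.copies[copy][block]


def read_verified(mirror, block, expected_crc):
    """Return (data, copy_index) for the first copy matching the checksum."""
    for copy in range(mirror.num_copies()):
        data = mirror.read(block, copy)
        if zlib.crc32(data) == expected_crc:
            return data, copy
    raise IOError(f"block {block}: no copy matches the stored checksum")


good = b"metadata block"
# Copy 0 has a flipped bit; copy 1 is intact.
mirror = ToyMirror([{7: b"metadata blocK"}, {7: good}])
data, used_copy = read_verified(mirror, 7, zlib.crc32(good))
```

The point of the sketch is that only the caller knows the expected checksum,
so the retry loop has to live above the mirroring layer, while the layer
itself only needs a way to be asked for a specific copy.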

>
> Even without btrfs that might be even generally useful for other
> applications that do some checking on their files.
>
> e.g. I could well imagine a new system call to trigger this on the
> page cache level.
>
> There might be other reasons to reinvent another storage manager
> of course. Just that one above doesn't seem to be very convincing.
> I admit I haven't thought too deeply about the other issues you
> raise in the document.

The key problem that requires most of this infrastructure is mirroring
metadata on a single spindle.  Chunks aren't required to solve it, but they
do add flexibility to do lots of other things.  For example, relocating hot
blocks onto the SSD portion of a combined SSD/spindle drive, or writing to
the SSD when on battery and then transferring in bulk to the spindle.
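
As a rough illustration of why a chunk layer gives this flexibility, the
sketch below models a chunk as a logical address range mapped to one or more
physical locations. Everything here (Stripe, Chunk, ChunkMap, the device
names and the sizes) is made up for the example and is not the planned
on-disk format; it only shows the indirection.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    device: str   # hypothetical device label
    offset: int   # byte offset of the chunk's copy on that device


@dataclass
class Chunk:
    logical_start: int
    length: int
    stripes: list  # one Stripe per copy of the chunk


class ChunkMap:
    def __init__(self):
        self.chunks = []

    def add(self, chunk):
        self.chunks.append(chunk)

    def _find(self, logical):
        for c in self.chunks:
            if c.logical_start <= logical < c.logical_start + c.length:
                return c
        raise KeyError(logical)

    def resolve(self, logical):
        """Return every (device, physical offset) holding this logical byte."""
        c = self._find(logical)
        delta = logical - c.logical_start
        return [(s.device, s.offset + delta) for s in c.stripes]

    def relocate(self, logical, new_stripes):
        """Repoint a chunk at new locations; logical addresses do not change."""
        self._find(logical).stripes = new_stripes


MiB = 1 << 20
cmap = ChunkMap()
# A metadata chunk with two copies on the same spindle.
cmap.add(Chunk(0, 64 * MiB, [Stripe("hdd", 1 * MiB), Stripe("hdd", 512 * MiB)]))
meta_copies = cmap.resolve(4096)
# Later, move the hot chunk to the SSD portion of the drive.
cmap.relocate(4096, [Stripe("ssd", 0)])
hot_copies = cmap.resolve(4096)
```

Because the FS only ever refers to logical addresses, both the single-spindle
mirror and the move to SSD are changes to the map, not to the FS metadata
that points at the blocks. A real implementation would also have to copy the
data before switching the mapping, which this sketch leaves out.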

The chunk code is basically a storage layer with three or four hooks into the 
FS.  Once I have it working, I'll take a hard look at pushing it down into DM 
where it can be used for other things.

-chris