[Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE
Mark Fasheh
mfasheh at suse.com
Thu Mar 19 17:30:22 PDT 2009
On Wed, Mar 18, 2009 at 09:57:48PM +0800, Tao Ma wrote:
> Hi Mark/Joel,
> I ran into some metadata allocation bugs while implementing reflink these
> days. After some investigation, I think we have the same
> problem whenever PAGE_SIZE > CLUSTER_SIZE. So I created a scenario
> on a ppc box today and tried it. The box panicked as I expected. ;)
>
> The scenario is this: create a file with a disk layout like the following
> (with bs=512 and cs=4K).
>
> debugfs: stat 15151
> Inode: 66072 Mode: 0644 Generation: 59969160 (0x3930e88)
> <snip>
> Tree Depth: 1 Count: 19 Next Free Rec: 2
> ## Offset Clusters Block#
> 0 0 258 86365
> 1 258 66 86367
> SubAlloc Bit: 21 SubAlloc Slot: 0
> Blknum: 86365 Next Leaf: 86367
> CRC32: N/A ECC: N/A
> Tree Depth: 0 Count: 28 Next Free Rec: 28
> ## Offset Clusters Block# Flags
> 0 0 1 116696 0x0
> <snip>
> 25 25 1 117096 0x0
> 26 256 1 117112 0x1
> 27 257 1 117120 0x0
> SubAlloc Bit: 23 SubAlloc Slot: 0
> Blknum: 86367 Next Leaf: 0
> CRC32: N/A ECC: N/A
> Tree Depth: 0 Count: 28 Next Free Rec: 2
> ## Offset Clusters Block# Flags
> 0 258 2 117128 0x1
> 1 260 64 117176 0x1
>
> Please note that the extent records from "26" through "0" of the next block
> were allocated contiguously as unwritten, and were then split by a write of
> 1 cluster at 256.
>
> Now if we try to write 40960 bytes at offset 256, we will panic. Why?
> The reason is:
> 1. On a ppc box, we have page_size=64K. So in one
> ocfs2_write_begin_nolock we will try to handle all 40960 bytes together.
> 2. In ocfs2_lock_allocators we will conclude that no metadata is needed since
> the 2nd extent block has so many empty extent recs.
So the core problem is that ocfs2_lock_allocators is not doing a good enough
job of calculating the metadata allocation this write needs.
> 3. Then write_begin proceeds one cluster at a time in ocfs2_write_cluster.
> 1) The 1st cluster (256): nothing special.
> 2) The 2nd (257): it will be merged with 256.
> 3) The 3rd (258): merged with 256.
> 4) The 4th (259): merged. Now 256-259 are merged into 1 extent
> rec, so the 2nd extent block will be removed, and we get:
> 26 256 4 117112 0x0
> 27 260 64 117176 0x1
> 5) Now comes 260. We need to split and call ocfs2_add_branch to
> allocate a new block. But wait, we have no metadata reserved. So we
> panic here.
Right, ok.
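
To make the sequence above concrete, here is a toy model in C that only tracks record counts per leaf extent block. Everything in it (toy_tree, toy_estimate_meta, toy_replay and friends) is hypothetical and is not ocfs2 code. The intermediate counts are a simplification; only the starting layout (28 and 2 records) and the end state (one full leaf, second leaf removed) come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RECS_PER_LEAF 28   /* "Count: 28" in the debugfs output */

/* Hypothetical toy state: record counts of the two leaf extent blocks. */
struct toy_tree {
    int leaf_used[2];      /* records in use per leaf */
    bool leaf_present[2];  /* false once a leaf block has been freed */
    int meta_reserved;     /* extent blocks reserved before the write */
};

/* Naive up-front estimate: reserve nothing if the target leaf has room. */
static int toy_estimate_meta(const struct toy_tree *t, int leaf, int new_recs)
{
    int free_recs = TOY_RECS_PER_LEAF - t->leaf_used[leaf];
    return free_recs >= new_recs ? 0 : 1;
}

/* A merge removes one record; an emptied leaf block is freed. */
static void toy_remove_rec(struct toy_tree *t, int leaf)
{
    assert(t->leaf_present[leaf] && t->leaf_used[leaf] > 0);
    if (--t->leaf_used[leaf] == 0)
        t->leaf_present[leaf] = false;
}

/* A split adds one record. Returns false when a new extent block is
 * required but nothing was reserved (the panic case in the scenario). */
static bool toy_add_rec(struct toy_tree *t)
{
    int i;

    for (i = 0; i < 2; i++) {
        if (t->leaf_present[i] && t->leaf_used[i] < TOY_RECS_PER_LEAF) {
            t->leaf_used[i]++;
            return true;
        }
    }
    for (i = 0; i < 2; i++) {
        if (!t->leaf_present[i] && t->meta_reserved > 0) {
            t->meta_reserved--;        /* consume one reserved block */
            t->leaf_present[i] = true;
            t->leaf_used[i] = 1;
            return true;
        }
    }
    return false;
}

/* Replay the scenario; extra_reserve models a smarter reservation. */
static bool toy_replay(int extra_reserve)
{
    struct toy_tree t = { {28, 2}, {true, true}, 0 };

    /* Splitting an unwritten extent may need up to 2 new records. */
    t.meta_reserved = toy_estimate_meta(&t, 1, 2) + extra_reserve;

    toy_remove_rec(&t, 0);  /* 256 and 257 merge: leaf 0 drops to 27 */
    toy_remove_rec(&t, 1);  /* 258-259 absorbed: leaf 1 drops to 1 */
    toy_remove_rec(&t, 1);  /* last record leaves leaf 1, which is freed */
    t.leaf_used[0]++;       /* ...and lands in leaf 0, full again at 28 */

    return toy_add_rec(&t); /* writing 260 splits "260 64" */
}
```

In this model toy_replay(0) fails at the final split because the estimate was taken while the second leaf still had 26 free records, whereas toy_replay(1) succeeds because one extent block was reserved up front.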
> So my thought is: can we reuse the freed extent block? I guess we
> can. We just need to store a pointer to the ocfs2_cached_dealloc_ctxt in
> ocfs2_alloc_context. Then whenever we allocate new metadata, we
> search ocfs2_cached_dealloc_ctxt first; if it holds a freed block, we use it
> directly and delete it from ocfs2_cached_dealloc_ctxt. The same could go
> for cluster allocation, I guess, although I don't know whether we have
> such a case for clusters.
>
> Does that make sense?
Yes, but I think a better approach is to fix the core problem instead of
working around it. What we want to do is make ocfs2_lock_allocators
smart enough to catch this case so that it can reserve the proper amount of
metadata. The good news is that we already scan the writable area
in ocfs2_populate_write_desc(). That scan seems like a good place
to store some information about contiguous extents that will be
merged. The only thing left for ocfs2_lock_allocators to do would be to
look at the remaining extent block counts.
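
As a rough illustration of that idea (not actual ocfs2 code), the sketch below scans a list of extent records for a write range, counts the record boundaries that would disappear through merging and the new boundaries created by splitting, and then derives a pessimistic extent block estimate. The names toy_rec, toy_scan_write and toy_meta_needed are hypothetical. It also assumes that merges happen before splits and that emptied leaf blocks are freed right away, as in the scenario quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCKS_PER_CLUSTER 8   /* bs=512, cs=4K as in the example */

/* Hypothetical, simplified extent record. */
struct toy_rec {
    unsigned cpos;        /* logical start, in clusters */
    unsigned clusters;    /* length in clusters, assumed > 0 */
    unsigned long blkno;  /* physical start, in blocks */
    bool unwritten;
};

/* What the scan would remember for the allocator estimate. */
struct toy_scan {
    unsigned merges;  /* record boundaries that vanish */
    unsigned splits;  /* record boundaries that appear */
};

static bool toy_in_range(unsigned c, unsigned ws, unsigned we)
{
    return c >= ws && c < we;
}

/* Scan records (sorted by cpos) for a write to clusters [ws, we). */
static struct toy_scan toy_scan_write(const struct toy_rec *r, size_t n,
                                      unsigned ws, unsigned we)
{
    struct toy_scan s = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned end = r[i].cpos + r[i].clusters;

        /* A write edge strictly inside an unwritten record splits it. */
        if (r[i].unwritten) {
            if (ws > r[i].cpos && ws < end)
                s.splits++;
            if (we > r[i].cpos && we < end)
                s.splits++;
        }
        if (i + 1 == n)
            continue;

        /* The boundary to the next record vanishes if both sides are
         * contiguous and end up written once the write completes. */
        {
            const struct toy_rec *a = &r[i], *b = &r[i + 1];
            bool contig = end == b->cpos &&
                a->blkno + (unsigned long)a->clusters *
                    TOY_BLOCKS_PER_CLUSTER == b->blkno;
            bool a_wr = !a->unwritten || toy_in_range(end - 1, ws, we);
            bool b_wr = !b->unwritten || toy_in_range(b->cpos, ws, we);

            if (contig && a_wr && b_wr && (a->unwritten || b->unwritten))
                s.merges++;
        }
    }
    return s;
}

static unsigned toy_leaves_for(unsigned recs, unsigned per_leaf)
{
    return (recs + per_leaf - 1) / per_leaf;
}

/* Pessimistic estimate: shrink the tree by all merges first, then
 * count how many leaf blocks the splits force us to add back. */
static unsigned toy_meta_needed(unsigned total_recs, struct toy_scan s,
                                unsigned per_leaf)
{
    unsigned after_merge = total_recs - s.merges;

    return toy_leaves_for(after_merge + s.splits, per_leaf) -
           toy_leaves_for(after_merge, per_leaf);
}
```

Fed the four records around cluster 256 from the debugfs output and a write covering clusters 256 through 265 (assuming the 40960-byte write spans ten 4K clusters), the scan reports two merges and one split. With 30 records in total and 28 records per leaf, the estimate comes out to one extent block, which is the block the panic case was missing.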
--Mark
--
Mark Fasheh