[Ocfs2-devel] [RFC] metadata alloc fix in machines which have PAGE_SIZE > CLUSTER_SIZE

Wed Mar 18 06:57:48 PDT 2009

Hi Mark/Joel,
	I ran into some metadata allocation bugs while implementing reflink
these days. After some investigation, I think we have the same problem
whenever PAGE_SIZE > CLUSTER_SIZE. So I created a scenario on a ppc box
today and tried it. The box panicked as I expected. ;)

The scenario is this: create a file with a disk layout like the
following (with bs=512 and cs=4K).

debugfs: stat 15151
	Inode: 66072   Mode: 0644   Generation: 59969160 (0x3930e88)
<snip>
	Tree Depth: 1   Count: 19   Next Free Rec: 2
	## Offset        Clusters       Block#
	0  0             258            86365
	1  258           66             86367
	SubAlloc Bit: 21   SubAlloc Slot: 0
	Blknum: 86365   Next Leaf: 86367
	CRC32: N/A   ECC: N/A
	Tree Depth: 0   Count: 28   Next Free Rec: 28
	## Offset        Clusters       Block#          Flags
	0  0             1              116696          0x0
<snip>
	25 25            1              117096          0x0
	26 256           1              117112          0x1
	27 257           1              117120          0x0
	SubAlloc Bit: 23   SubAlloc Slot: 0
	Blknum: 86367   Next Leaf: 0
	CRC32: N/A   ECC: N/A
	Tree Depth: 0   Count: 28   Next Free Rec: 2
	## Offset        Clusters       Block#          Flags
	0  258           2              117128          0x1
	1  260           64             117176          0x1

Please note that the extent records from "26" through "0" of the next
block were allocated contiguously as unwritten extents and were then
split by a 1-cluster write at cluster 256.
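
The contiguity can be checked from the dump itself. Below is a minimal
sketch, assuming 8 blocks per cluster (cs=4K / bs=512). The struct and
helper names are made up for illustration and are not the kernel's. It
compares only logical and physical adjacency and ignores the Flags
column.

```c
#include <assert.h>
#include <stdint.h>

/* cs=4K and bs=512, as in the scenario above. */
#define TOY_BLOCKS_PER_CLUSTER (4096 / 512)

/* Hypothetical, simplified view of one row of the debugfs dump. */
struct toy_rec {
	uint32_t cpos;     /* "Offset" column, in clusters */
	uint32_t clusters; /* "Clusters" column */
	uint64_t blkno;    /* "Block#" column */
};

/*
 * Return 1 if record b starts exactly where record a ends, both
 * logically (cluster offset) and physically (block number).
 */
static int toy_recs_contiguous(const struct toy_rec *a,
			       const struct toy_rec *b)
{
	assert(a && b);
	return a->cpos + a->clusters == b->cpos &&
	       a->blkno + (uint64_t)a->clusters * TOY_BLOCKS_PER_CLUSTER ==
	       b->blkno;
}
```

Fed with the rows above, records 26, 27 and 0 of the next block chain
together (117112 + 8 = 117120, 117120 + 8 = 117128), while record 1
does not follow record 0 physically (117128 + 16 = 117144, not 117176).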

Now if we try to write 40960 bytes starting at cluster offset 256, we
will panic. Why? The reason is:
1. On the ppc box we have page_size=64K, so a single
ocfs2_write_begin_no_lock will try to handle all 40960 bytes together.
2. In ocfs2_lock_allocators we conclude that no metadata is needed,
since the 2nd extent block has so many empty extent recs.
3. The write then proceeds one cluster at a time in ocfs2_write_cluster.
   1) The 1st cluster (256): nothing special.
   2) The 2nd (257): it is merged with 256.
   3) The 3rd (258): merged with 256.
   4) The 4th (259): merged. Now 256-259 are merged into 1 extent rec,
so the 2nd extent block is removed, and we get:
	26 256           4              117112          0x0
	27 260           64             117176          0x1
   5) Now comes 260. We need to split and call ocfs2_add_branch to
allocate a new block. But wait, we have no metadata reserved. So we
panic here.
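
The steps above can be replayed with a toy model. This is only a
sketch: the struct, the helper names and the "reserve one block only
when the rightmost leaf is full" rule are simplifying assumptions made
for illustration, not the real ocfs2_lock_allocators logic. It shows
that the estimate taken before the write (0 blocks) no longer holds once
the merges have removed the 2nd extent block.

```c
#include <assert.h>

/* Toy stand-in for an extent block header (not the kernel struct). */
struct toy_leaf {
	int count;         /* "Count" in the debugfs output */
	int next_free_rec; /* "Next Free Rec" in the debugfs output */
};

static int toy_free_recs(const struct toy_leaf *leaf)
{
	assert(leaf && leaf->next_free_rec <= leaf->count);
	return leaf->count - leaf->next_free_rec;
}

/*
 * Simplified upfront estimate (assumption): reserve one new extent
 * block only when the rightmost leaf has no free record left.
 */
static int toy_meta_to_reserve(const struct toy_leaf *rightmost)
{
	return toy_free_recs(rightmost) == 0 ? 1 : 0;
}

/*
 * Replay the scenario and return how many metadata blocks are
 * missing when the extent starting at cluster 260 has to be split.
 */
static int toy_replay_shortfall(void)
{
	struct toy_leaf first = { 28, 28 };  /* block 86365, full */
	struct toy_leaf second = { 28, 2 };  /* block 86367, 26 free */

	/* Reservation happens once, before any cluster is written. */
	int reserved = toy_meta_to_reserve(&second);

	/*
	 * Clusters 256..259 merge into record 26 of the first leaf,
	 * record 27 now holds the remaining (260, 64) extent, and the
	 * emptied second leaf is removed from the tree.
	 */
	second.next_free_rec = 0;
	const struct toy_leaf *rightmost = &first;

	/* The split at cluster 260 now sees a full rightmost leaf. */
	int needed = toy_meta_to_reserve(rightmost);

	return needed - reserved;
}
```

In this model toy_replay_shortfall() returns 1: one extent block is
needed at step 5, but none was reserved in step 2.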

So my thought is: can we reuse the freed extent block? I guess we can.
We just need to store a pointer to ocfs2_cached_dealloc_ctxt in
ocfs2_alloc_context. Then whenever we allocate new metadata, we search
ocfs2_cached_dealloc_ctxt first. If it holds a freed block, we use it
directly and delete it from ocfs2_cached_dealloc_ctxt. The same could
apply to cluster allocation, I guess, although I don't know whether we
have such a case for clusters.
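
A rough userspace sketch of the idea, to make the proposal concrete.
All types and names here are hypothetical stand-ins (toy_dealloc_ctxt
for ocfs2_cached_dealloc_ctxt, toy_alloc_context for
ocfs2_alloc_context). The real structures, locking and journaling are
not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one freed extent block waiting to be deallocated. */
struct toy_freed_block {
	uint64_t blkno;
	struct toy_freed_block *next;
};

/* Hypothetical stand-in for ocfs2_cached_dealloc_ctxt. */
struct toy_dealloc_ctxt {
	struct toy_freed_block *head;
};

/* Hypothetical stand-in for ocfs2_alloc_context. */
struct toy_alloc_context {
	unsigned int reserved;             /* blocks reserved up front */
	uint64_t next_reserved_blkno;      /* where reserved blocks start */
	struct toy_dealloc_ctxt *dealloc;  /* the proposed new pointer */
};

/*
 * Claim one metadata block. Blocks freed earlier in the same
 * operation are reused first; only then is the reservation consumed.
 * Returns 0 on success, -1 if nothing is available.
 */
static int toy_claim_metadata(struct toy_alloc_context *ac,
			      uint64_t *blkno)
{
	assert(ac && blkno);

	if (ac->dealloc && ac->dealloc->head) {
		struct toy_freed_block *fb = ac->dealloc->head;

		/* Unlink it so it is not freed again later. */
		ac->dealloc->head = fb->next;
		*blkno = fb->blkno;
		return 0;
	}

	if (ac->reserved > 0) {
		ac->reserved--;
		*blkno = ac->next_reserved_blkno++;
		return 0;
	}

	return -1;
}
```

With nothing reserved and extent block 86367 sitting in the dealloc
list, the first toy_claim_metadata() call hands back 86367 instead of
failing, and a second call returns -1 because the list is empty again.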

Does this make sense?

btw, this is critical because we often hit this type of issue in
reflink (the 1st step deletes a leaf extent block because of a merge,
while the 2nd step wants to create one because of a merge when no
metadata is reserved). Even worse, I hit a scenario where the
delete/add cycle repeated 6 times.

Regards,
Tao