ocfs2 de-fragmentation

wen.gang.wang@oracle.com Feb 2008

backgroud:

there is case of that the file system usage is 60% but on some nodes in the cluster space, creating a very small file fails with error indicating no space left. it's because of the fragmentation of the blobal_bitmap.

testcase results in this situation:

format a ocfs2 volume with opertions "-b 1024 -C 4096".
mount the ocfs2 volume on one node only.
on this mount point, fills the FS with very small(N bytes) files.
on this mount point, remove file every two contiguous files.

uses of inode_alloc for all slots

//inode_alloc:0000 Bitmap Total: 186368 Used: 93277 Free: 93091

//inode_alloc:0001 Bitmap Total: 0 Used: 0 Free: 0

//inode_alloc:0002 Bitmap Total: 0 Used: 0 Free: 0

//inode_alloc:0003 Bitmap Total: 0 Used: 0 Free: 0

What was the free% of inode_alloc for all slots? As in, was inode_alloc full for one slot but free on others?

goals:

make very node in cluster domain can make full use of the ocfs2 volume.

todos:

needs an offline de-fragmentation tool to coalesce and merge(merging is not implemented yet in the "detail" section) data extents on a ocfs2 volume.

Is the aim to coalesce data extents or to also pack the allocated inodes?

change the cluster reserving/claiming policy

offline de-fragmentation tool:

must guarantee safety when power down or other crashes during the procedure of de-fragmentation(not implemented yet in the "detail" section).

This is a MUST.

for performance consideration, defragment files that have more than N extents(not implemented yet in the "detail" section).

As users are not looking for a fully defragmented volume, performance can be "improved" by simply limiting the goal. As in, defragment the files that have more than N extents.

move data clusters to make more larger contiguous free bits on groups.

implementation detail:

this tool involves modifying on groups in global_bitmap, extent records for files and data clusters for files. since local_alloc is freed when umount, as a offline tool, it need not take local_alloc into consideration.

reads all groups into memory(also with memory mapping to groups on disk, since number of groups is not very large).
reads all regular files and directories)under / and saves their extents in memory.
on every group. moves bits for data clusters to front of this group -- makes larger contiguous free bits at the end of that group. in detail, for each extent
- In memory, frees the corresponding bits begining at offset A in the bitmap of this group, then allocates in the same bitmap the same count of bits begining at offset B. If B is not prior to A, undo the allocating and freeing in reverse order and ends the iteration for this extent.
- On disk, copies data clusters. to avoid overwrite clusters, copy them one by one.
- Fills the original data clusters with 0 for debugging.
- Modify the extent in memory.
looks for a group that has the largest free bits among the 'left' groups. takes this group out of the 'left' groups. moves data clusters from this group to other groups. --make much larger contiguous free bits on this group. For the destination group(s), a group which contains data clusters for the same file is selected. If no such group or no enough space on this group, another group in the 'left' groups is selected. For this moving policy, it is possible to move a same data cluster several times. This is to be improved( selecting the group that has minimal free bits is a choice). The detail for this step is similar with those in step 3, difference is freeing bits and allocating bits are not in the same bitmap.
runs from step 4, until there is only on group in the 'left' groups.
does what is done in step 3 again to pack data clusters and make larger contiguous free bits at the end of groups.
updates inodes and extent blocks to disk.
commit all groups to disk.
sync the writes on disk.

as seen, updating inodes/extents and committing groups to disk are done very late. this aims getting better performance of this tool. But in case of error, this may be done partially or not done at all. -- to be improved.

and this procedure doesn't take handlink into consideration. -- to be improved.

change the cluster reserving/claiming policy:

current policy:

for large allocation, it finds a chain with the most count of free bits and searches in this chain for a group that has N contiguous free bits. and allocates bits in this group.

This is true for all large allocs. Small allocs happen via suballocators.

new policy:

put all groups into group lists according to their max contiguous free bits. for example, group lists as below
list1: groups in the list has max-cong-bits <=1024; list2: 1024< max-cong-bits <= 2048; list3: 2048< max-cong-bits <= 4096 and so on. but this will lead to frequently scaning bitmpas which is cpu consuming. so just put this thought here for your information, don't implement it until found a way to avoid the cost of cpu.
For this to work, we will have to scan all the bits too frequently which is eat up cpu and affect performance. In the end, loss of performance is too steep a price for a bit more contiguity.
when allocating, get the proper list according the bits to allocate; allocates bits in the first group in the found list.
move the group to a proper list.

so that we can keep larger max contiguous free bits untouched. and we can deal with bigger claiming request later. but don't know how this can help unti-fragmentation, so just put here for your information.

Not sure how this is relevant. Small allocs are not made directly from the Global Bitmap directly. They happen via the localalloc. Only large allocs are made directly from the Global bitmap.

how to test:

before de-fragmentaion, make md5sum for all files under /;
while moving data clusters, fill the original data clusters with 0;
after de-fragmentation, check the md5sum again.

OCFS2/DesignDocs/defragmentation

ocfs2 de-fragmentation