ocfs2 de-fragmentation of files.

background

for files with too many fragments, I/O performance is bad because ocfs2 has to issue more I/O and the disk heads spend more time seeking.

de-fragment tool

this is a de-fragment tool which works on files (not on bitmaps) to reduce the number of fragments in each file.

details of merging extent-records for a file.

1) reads meta blocks (inode and extent-blocks) for a file into memory.

2) reads all extent-records in meta blocks into a list named all-extent-record-list, which is used to hold all extent-records.

3) if the number of extent-records <= THRESHOLD (a fixed value specified as a parameter, or a default), abort the merging for the file.

4) re-alloc data clusters.

4.1) set PIECE_SIZE to JNL_BLKS - META_BLKS - 10, where JNL_BLKS is the number of blocks reserved for journaling in the ocfs2 FS (it's not too small) and META_BLKS is the number of meta-blocks. in the following steps we merge PIECE_SIZE extent-records per loop. the reason for merging JNL_BLKS - META_BLKS - 10 extent-records per loop is the number of journaling blocks: at most META_BLKS meta-blocks have to be committed through the journal, and since 1 extent-record maps to only 1 group in the global bitmap, M extent-records map to at most M groups. so we need META_BLKS + M < JNL_BLKS, and to keep jbd from deadlocking when it runs short of blocks, we set PIECE_SIZE to JNL_BLKS - META_BLKS - 10.
4.2) if PIECE_SIZE < 2, abort merging on this file. in that case the file has almost 4MB of meta-blocks and the journaling blocks are not enough for them. any good idea for this problem?
4.3) for every PIECE_SIZE extent-records in all-extent-record-list, do
- 4.3.1) moves the extent-records out from all-extent-record-list to a temporary list named piece-orig-list;
- 4.3.2) clear all meta-blocks(in memory);
- 4.3.3) allocates data clusters from global bitmap(in memory), creates new extent-records(memory objects) and stores them in another temporary list named piece-new-list.
- 4.3.4) if the number of extent-records in piece-new-list >= that in piece-orig-list (in this case the re-allocation is no help for de-frag), undoes the allocation; moves extent-records in piece-orig-list back to all-extent-record-list; releases extent-records in piece-new-list; finishes this loop.
- 4.3.5) frees data clusters represented by the extent-records in piece-orig-list to global bitmap(in memory).
- 4.3.6) copies data clusters according to piece-orig-list and piece-new-list with syncing(to disk);
- 4.3.7) releases extent-records in piece-orig-list.
- 4.3.8) moves extent-records in piece-new-list to all-extent-record-list.
- 4.3.9) merges extent-records in all-extent-record-list.
- 4.3.10) inserts each extent-record in all-extent-record-list into meta-blocks. this does something like what ocfs2_insert_extent() does in ocfs2 (but no merge is needed at this step).
- 4.3.11) frees meta-blocks that are no longer in use to global bitmap(in memory) and releases the in-memory objects. updates META_BLKS.
- 4.3.12) commits all groups in global bitmap and meta-blocks to disk via journal.
4.4) repeat from step 4.1 until no real merging happens any more.
Note: if we did step 4.3.5 before step 4.3.3 (free first, then allocate), we might get a better chance of a successful allocation, but for safety we don't do it that way.
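
the arithmetic of steps 4.1/4.2, the check in 4.3.4 and the merge in 4.3.9 can be sketched in a few lines of Python. this is only an illustration under assumptions, not the tool's code: Extent, piece_size(), merge_extents() and piece_helps() are hypothetical names, all units are clusters, and two extent-records are assumed to be mergeable only when they are adjacent both logically and physically.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """Hypothetical in-memory extent-record (all units are clusters)."""
    cpos: int    # logical offset inside the file
    start: int   # physical start cluster on disk
    length: int  # number of clusters covered


SAFETY_MARGIN = 10  # the constant 10 from step 4.1


def piece_size(jnl_blks: int, meta_blks: int) -> Optional[int]:
    """Steps 4.1/4.2: extent-records handled per loop, or None to abort."""
    size = jnl_blks - meta_blks - SAFETY_MARGIN
    return size if size >= 2 else None


def merge_extents(extents: List[Extent]) -> List[Extent]:
    """Step 4.3.9: merge records that are adjacent logically and physically."""
    merged: List[Extent] = []
    for ext in sorted(extents, key=lambda e: e.cpos):
        if merged:
            last = merged[-1]
            if (last.cpos + last.length == ext.cpos
                    and last.start + last.length == ext.start):
                # extend the previous record instead of adding a new one
                merged[-1] = last._replace(length=last.length + ext.length)
                continue
        merged.append(ext)
    return merged


def piece_helps(orig: List[Extent], new: List[Extent]) -> bool:
    """Step 4.3.4: keep the new allocation only if it has fewer records."""
    return len(new) < len(orig)
```

for example, with three fragments of which the first two happen to be contiguous on disk, merge_extents([Extent(0, 100, 4), Extent(4, 104, 4), Extent(8, 300, 2)]) leaves two records, and piece_size(20, 9) returns None because 20 - 9 - 10 = 1 is below 2.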

journaling

to guarantee the safety of meta, a journal is needed. here, jbd is used: the journaling is implemented by a kernel module which makes use of jbd and offers an interface to user space via ioctls. jbd will use the journal file in ocfs2 as its journaling blocks. when started, let jbd do journaling if there are valid journaling blocks.

safety

for meta, journaling is used, which guarantees the safety of meta. for data, we write copies of the original, so the original data is not destroyed. and if a crash occurs while re-allocating in the bitmaps or copying data, meta has not been modified yet, so the file is still safe. so this tool is safe.
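
the ordering argument can be checked with a toy model. the sketch below is a simulation under assumptions, not the tool's code: the disk is a dict from cluster number to data, the on-disk meta is a plain list of cluster numbers, and defrag_piece(), read_file(), Crash and the crash_after parameter are hypothetical names. the on-disk meta is switched only at the last step, after the copy; in the real tool it is the journal that makes this step atomic.

```python
from typing import Dict, List, Optional


class Crash(Exception):
    """Raised to simulate a crash at a chosen point."""


def read_file(disk: Dict[int, str], meta: List[int]) -> str:
    """Read the file through the on-disk meta, as after a reboot."""
    return "".join(disk[c] for c in meta)


def defrag_piece(disk: Dict[int, str], ondisk: Dict[str, List[int]],
                 new_clusters: List[int],
                 crash_after: Optional[str] = None) -> None:
    """Move one piece to new_clusters; new clusters never overlap old ones
    because they are allocated before the old ones are freed."""
    old_clusters = ondisk["meta"]
    # allocating and freeing only touch the in-memory bitmap
    if crash_after == "alloc":
        raise Crash
    # copy the data; the old clusters are read but never written
    for old, new in zip(old_clusters, new_clusters):
        disk[new] = disk[old]
    if crash_after == "copy":
        raise Crash
    # journal commit: on-disk meta now points at the new clusters
    ondisk["meta"] = list(new_clusters)
```

whichever crash point is chosen, read_file() on the on-disk meta still returns the original content, because the old clusters and the old meta are untouched until the commit.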

limitations

1) this tool has to be run offline.

2) suited to large files -- for small files, copying out and copying in (which changes the inode number) solves the problem, as Sunil always suggests users do :P