[Ocfs2-tools-devel] [Fwd: Re: [Ocfs2-devel] [patch 1/1] offline
de-fragmentation tool]
wengang wang
wen.gang.wang at oracle.com
Mon Feb 4 01:38:09 PST 2008
-------- Original Message --------
Subject: Re: [Ocfs2-devel] [patch 1/1] offline de-fragmentation tool
Date: Mon, 04 Feb 2008 17:35:41 +0800
From: wengang wang <wen.gang.wang at oracle.com>
To: wengang wang <wen.gang.wang at oracle.com>
CC: Tao Ma <tao.ma at oracle.com>, Greg Marsden <greg.marsden at oracle.com>,
ocfs2-tools-devel at oss.oracle.com
References: <47A67D89.2030403 at oracle.com> <47A6D304.8060604 at oracle.com>
<47A6DA21.7050003 at oracle.com>
too bad indentation for groups layout. attach it.
first: before defrag,
second: after it
wengang wang wrote:
> Hi Tao Ma,
>
> yes, this a demo version:).
> I put it here just want our experts know it:). If it's useful, will
> commit the well written version :-P
>
> Tao Ma wrote:
>> Sunil,
>> What do you think of this tools? It is good I think.
>>
>> Thanks for your hard work, wengang.
>> It should really take you much time since the patch contains more
>> than 2000 lines. ;)
>> Have you written any design doc about it? Do you have any test script
>> that can show us how this fantastic tool work?
>>
> yea, I have one design doc at
> http://oss.oracle.com/osswiki/OCFS2/DesignDocs/defragmentation, but it
> needs to be revised.
> yea, I have test scripts to fill ocfs2 partition with data and do
> md5sum check on the files. but didn't paste them here.
> to fill data, I
> 1) copy files in /usr/bin, /usr/sbin and ~/ to ocfs2 partition.
> 2) delete one file very two files in order.
>
> before de-fragmetating, groups layout is like this:
> Total groups: 5 clusters per group: 7680, blocks per group: 245760
> No blkno total-bits free-bits max-cong-bits
> 0 32 7680 3955 1804 1 245760
> 7680 4963 2165 2 491520 7680 4945
> 2166 3 737280 7680 4961 2166 4 983040
> 2162 1366 792
> and running, it's like
> Total groups: 5 clusters per group: 7680, blocks per group: 245760
> No blkno total-bits free-bits max-cong-bits
> 0 32 7680 0 0 1 245760
> 7680 7679 7679 3 737280 7680 7679
> 7679 2 491520 7680 3482 3300 4 983040
> 2162 1350 1237
>> But next time please use the git tree to get the latest ocfs2-tools
>> source code and generate your patch against it, so that I can merge
>> your code quickly in my tree. This time, since Joel has moved the
>> include file to other places, I can't merge your patch successfully.
>> So I only read your patch and give some comments.
>>
> i see.
>> wengang wang wrote:
>>> I wrote a user space tool to de-fragmentate global bitmap, hope this
>>> tool is helpful.
>> Next time please send it to ocfs2-tools-devel since your patch only
>> contains the modification to ocfs2-tools.
> ok.
>>>
>>> for the case of storing non-DB data on ocfs2, there are not several
>>> very large file, but lots of relative small files.
>>> after a long time of using, --especially creating and deleting, the
>>> global bitmap is split into fragments. so that even there is enough
>>> free space(but not contiguous), creating a file may fail.
>>>
>>> there is a relative bug 6730723(on bugdb) though closed with "not
>>> supported". and I have made a scenario that "df" show the partition
>>> usage 51%, but on other nodes creating a file fails with "no space"
>>> error.
>>>
>>> this offline tool, o2defrag, can make larger contiguous free bits on
>>> global bitmap by moving data clusters of regular file and directories.
>>> it does:
>>> 1) for each group, move data clusters on the group to the front of
>>> the same group to make bigger free space at the end.
>>> 2) for groups that has more free space, move data clusters to other
>>> group(s) to make much more free space.
>>> a) firstly, it try to move data clusters to the group on which
>>> there are data clusters of the same file. if no such group or no
>>> space on these groups, goto b).
>>> b) move data clusters to a group on which there is no data
>>> clusters of the same file.
>>> 3) does step 1) again.
>>>
>>> stuff changed by this tool is
>>> I) moving of data clusters.
>>> II) moving of corresponding bits in global bitmap.
>>> III) extent record in ocfs2_dinode block or extension block
>>>
>>> this tool doesn't merge or split extent records for now. and it's
>>> nearly help nothing to fs performance.
>>> the feature to be added is trying to move all data clusters of a
>>> file together as possible. so that accessing to file on ocfs2 can
>>> get better performance.
>>>
>>> the patch is based on ocfs2-tool 1.2.6. for compiling, needs to add
>>> a symbolic link named "include" in o2defrag directory of
>>> ../debugfs.ocfs2/include.
>>> usage is "o2defrag <ocfs2 partition>"
>>>
>>>
>>> thanks,
>>> wengang.
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> diff -N -u -p -r ocfs2-tools-1.2.6.orig/libocfs2/chain.c
>>> ocfs2-tools-1.2.6/libocfs2/chain.c
>>> --- ocfs2-tools-1.2.6.orig/libocfs2/chain.c 2008-02-05
>>> 01:30:21.000000000 -0500
>>> +++ ocfs2-tools-1.2.6/libocfs2/chain.c 2008-02-05
>>> 02:00:53.000000000 -0500
>>> @@ -43,7 +43,33 @@ void ocfs2_swap_group_desc(struct ocfs2_
>>> gd->bg_parent_dinode = bswap_64(gd->bg_parent_dinode);
>>> gd->bg_blkno = bswap_64(gd->bg_blkno);
>>> }
>>> +errcode_t ocfs2_read_group_desc2(ocfs2_filesys *fs, uint64_t blkno,
>>> + char *gd_buf)
>>>
>> You added so many *2 functions. And the biggest difference is that
>> you don't allocate the spaces in it and read-write using the buffer
>> given by the caller. I am not sure whether it is OK. Even if it is
>> OK, you should put it into another patch, since it has no
>> relationship with your o2defrag patch. And it also makes your patch
>> smaller so that people can review it quickly.
>>
> the original functions do a duplicate for swap between cpu data and le
> data. not using it after swapping to le data is ok.
> yea, should do that for product version.
>
>>> +struct o2_group_list{
>>> + int group_size; /* count of total bits of a group */
>>> + int count; /* count of groups */
>>> + struct o2_group *head; /* the first group in list */
>>> + struct o2_group *last; /* the last group in list */
>>> +};
>>> +
>>> +struct o2_file_list all_file_list;
>>> +struct o2_group_list all_group_list;
>>> +
>>> +/*
>>> +struct current_name{
>>> + int current_layer;
>>> + char *name[MAX_FILE_DEP+1];
>>> +};
>>> +
>>> +struct current_name current_name;
>>> +*/
>>>
>> Could you please remove these unused code before you generate the
>> patch? thanks.
> sure, for product version. it's used for debugging only.
>>> +
>>> +struct o2_modify_meta{
>>> + uint64_t meta_blkno; /* inode number of a fle */
>>> + uint64_t old_data_blkno; + uint64_t new_data_blkno;
>>> + int needs_free; /* this is only for internal use */
>>> + struct o2_modify_meta *next;
>>> +};
>>> +
>>> +static void * local_malloc(size_t size, int *needs_free)
>>> +{
>>> + char *ret;
>>> + ssize_t cache_size;
>>> + int off;
>>> +
>>> + cache_size = sizeof(struct o2_modify_meta) *
>>> COUNT_OF_MODIFY_META;
>>> +
>>> + if (memory_object.current_meta == COUNT_OF_MODIFY_META) {
>>> + memory_object.buf = malloc(cache_size);
>>> + if (!memory_object.buf) {
>>> + printf("no mem\n");
>>>
>> Use com_err like other ocfs2-tools please.
> sure for product version.
>>> + return NULL;
>>> + }
>>> + memory_object.current_meta = 0;
>>> + }
>>> + + if (memory_object.current_meta == 0) {
>>> + *needs_free = 1;
>>> + } else {
>>> + *needs_free = 0;
>>> + }
>>> + off = sizeof(struct o2_modify_meta) * memory_object.current_meta;
>>> + ret = memory_object.buf + off;
>>> + memory_object.current_meta ++;
>>> +
>>> + return ret;
>>> +}
>>> +
>>> +static int worktree(char *basename, uint64_t blkno)
>>> + int res;
>>> + char *buf = NULL;
>>> + struct ocfs2_dinode *inode = NULL;
>>> + struct o2_file *file = NULL;
>>> + char type;
>>> +
>>> + res = push_name(basename);
>>> + if (res)
>>> + return -1;
>>> + res = ocfs2_malloc_block(gbls.fs->fs_io, &buf);
>>> + if (res) {
>>> + printf("ocfs2_malloc_block failed. %d\n", res);
>>> + return -1;
>>> + }
>>> + + res = ocfs2_read_inode2(gbls.fs, blkno, buf);
>>> + if (res) {
>>> + printf("ocfs2_read_inode error %d\n",res);
>>> + goto ret1;
>>> + }
>>> + inode = (struct ocfs2_dinode *)buf;
>>> + if (S_ISREG(inode->i_mode))
>>> + type='f';
>>> + else if (S_ISDIR(inode->i_mode))
>>> + type='d';
>>> + else {
>>> + // ignore other type
>>> + goto ret2;
>>> + }
>>> +
>>> + file = new_file(inode, type);
>>> + if (!file) {
>>> + goto ret1;
>>> + }
>>> + insert_file_to_list(file);
>>>
>> You don't handle the file which has link_count > 2. So here you may
>> insert it twice.
> yes. needs more consideration.
>>> + res = access_one_node(inode, file);
>>> + if (res)
>>> + goto ret2;
>>> +
>>> + if (file->type == 'd') {
>>> + res = ocfs2_dir_iterate(gbls.fs, blkno, 0, NULL,
>>> + do_with_childrens, NULL);
>>> + if (res) {
>>> + printf("ocfs2_dir_iterate failed. %d\n", res);
>>> + goto ret2;
>>> + }
>>> + }
>>> + goto ret1;
>>> +
>>> +ret2: + if (file && file->filename)
>>> + free(file->filename);
>>> + if (file)
>>> + free(file);
>>> +ret1:
>>> + if (buf)
>>> + ocfs2_free(&buf);
>>> + if (res) {
>>> + pop_name();
>>> + } else {
>>> + res = pop_name();
>>> + }
>>> +
>>> + return res;
>>> +}
>>> +
>>> +/* find the record --ocfs2_extent_rec, which blkno is
>>> old_data_blkno, change blkno to +new_data_blkno
>>> +*/
>>> +static inline int modify_file_meta(uint64_t file_blkno, uint64_t
>>> from_blkno,
>>> + uint64_t to_blkno)
>>>
>> Why go through the file once again? Since you have already iterate
>> all the files, you may record the information there. So that you
>> don't need to iterate it now the second time.
> can't hold all meta in memory for memory limitation. have to iterate
> it again.
>
>>> +{
>>> + struct ocfs2_dinode *inode;
>>> + struct ocfs2_extent_list *el;
>>> + int res;
>>> +
>>> + if (!update_disk)
>>> + return 0;
>>> +
>>> + res = ocfs2_read_inode2(gbls.fs, file_blkno,
>>> buf_modify_file_meta);
>>> + if (res) {
>>> + printf("ocfs2_read_inode2 %lu, failed. %d\n",file_blkno, res);
>>> + return -1;
>>> + }
>>> + inode = (struct ocfs2_dinode *)buf_modify_file_meta;
>>> + el= &(inode->id2.i_list);
>>> +
>>> + res = find_and_change_extent_rec(el, buf_modify_file_meta,
>>> file_blkno,
>>> + 1, from_blkno, to_blkno);
>>> + if (res == 1) {
>>> + printf("such meta block %lu not found\n", from_blkno);
>>> + }
>>> +
>>> + return res;
>>> +}
>>> +
>>> +/* move the clusters specified by file_group to front of the same
>>> group */
>>> +/* -1 -->error; 0 -->moved successfully; 1 -->no space */
>>> +static int move_file_group_on_group(struct o2_group *group,
>>> + struct o2_file_group *file_group)
>>> +{
>>> + struct ocfs2_group_desc *gd;
>>> + int alloc_start_bits_off;
>>> + int res;
>>> + int first_free_bit;
>>> + uint64_t fromblk, toblk;
>>> + + gd = group->gd;
>>> + + first_free_bit = get_first_free_bit(gd);
>>> + if (-1 == first_free_bit) { //group full
>>> + res = 1;
>>> + goto done;
>>> + }
>>> + + if (file_group->off < first_free_bit) {
>>> + res = 1;
>>> + goto done;
>>> + }
>>> + + /* clear in memory the bits owned by this file_group */
>>> + clear_bits_on_group(file_group->off, file_group->count, gd);
>>> +
>>> + /* try to find enough bits to move to */
>>> + res = get_N_contig_free_bits(gd, file_group->count,
>>> + &alloc_start_bits_off);
>>> + if (res) {
>>> + //no space
>>> + set_bits_on_group(file_group->off, file_group->count, gd);
>>> + res = 1;
>>> + goto done;
>>> + }
>>> + + if (alloc_start_bits_off > file_group->off)
>>>
>> Here you need to handle alloc_start_bits_off == file_group->off.
> yea. adding == will avoid some data copies.
>>> + {
>>> + // got a starting address later than the original
>>> + clear_bits_on_group(alloc_start_bits_off,
>>> file_group->count, gd);
>>> + set_bits_on_group(file_group->off, file_group->count, gd);
>>> + res = 1;
>>> + goto done;
>>> + }
>>> +/* now copy data clusters from old place to new place one by one */
>>> + fromblk = file_group->off<<get_shift_bits();
>>> + toblk = alloc_start_bits_off<<get_shift_bits();
>>> + if (!is_first_group_gd(group)) {
>>> + fromblk += group->blkno;
>>> + toblk += group->blkno;
>>> + } +
>>> + res = copy_N_clusters(fromblk, toblk, file_group->count);
>>> + if (res) {
>>> + //error, won't continue to access.
>>> + goto done;
>>> + }
>>> +
>>> + /* modify the meta data of the file on disk */
>>> + if (!new_and_insert_meta(file_group->file->blkno, fromblk,
>>> toblk)) {
>>> + //error, won't continue to access.
>>> + res = -1;
>>> + goto done;
>>> + }
>>> +
>>> + /* update the file group object */
>>> + file_group->off = alloc_start_bits_off;
>>> + file_group->blkno = file_group->off<<get_shift_bits();
>>> + if (!is_first_group_gd(file_group->group)) {
>>> + file_group->blkno += file_group->group->blkno;
>>> + }
>>> +
>>> +done:
>>> + return res;
>>> +}
>>> +
>>> +
>>> +/* move all data clusters of files in a group to the end of this
>>> group */
>>>
>> You move to the beginning, not the end, it seems. ;)
> yea, "to the end" is my very first thought. since it is difficult,
> gave up. "to front" is correct.
>>> +/* for all groups, + for all files that have data-clusters on one
>>> of the groups,
>>> + move the the data-clusters to front of this group, and move
>>> corresponding
>>> + bits to the front of the bitmap in this group.
>>> +
>>> + this moving is based on a ocfs2_extent_rec, no mering or
>>> spliting on ocfs2_extent_rec.
>>> +
>>> + what's modified on disk:
>>> + 1) data-clusters copying, (while debugging, set the old clusters
>>> to 0)
>>> + 2) file meta-data. --modifying the ocfs2_extent_rec.e_blkno.
>>> + 3) clearing bits and setting bits on bitmap of that group.
>>> +*/
>>> +static int defrag1_move_data_to_front_on_all_group()
>>> +{
>>> + int res;
>>> + struct o2_group *group;
>>> + struct o2_file_group *tmp;
>>> + int first_free_bit;
>>> +
>>> + group = all_group_list.head;
>>> + printf("process in defrag1 ...\n");
>>> + + while ( group) {
>>> + //move group 2 only for debug + printf("defrag1
>>> working on group %d/%d...", group->group_num, all_group_list.count);
>>> +
>>> + first_free_bit = get_first_free_bit(group->gd);
>>> + if (-1 == first_free_bit) {
>>> + //group full
>>> + group = group->next;
>>> + printf("done.\n");
>>> + continue;
>>> + }
>>> + + tmp = group->file_group_list.head;
>>> + while (tmp) {
>>> + res = move_file_group_on_group(group, tmp);
>>> + if (res == -1) {
>>> + printf("move_file_group_on_group failed. %d\n", res);
>>> + return res;
>>> + }
>>> + tmp = tmp->next_in_group;
>>> + }
>>> + printf("done.\n");
>>> + group = group->next;
>>> + }
>>> +
>>> + res = update_all_meta();
>>>
>> You can't do it here. You really should do it immediately after your
>> move the cluster in move_file_group_on_group. Otherwise if the system
>> panic before this function, all the files' content is corrupted.
>>
> this tool needs to be run offline.
>> There are many trailing white spaces in your patch. So please remove
>> them all.
>>
> sure.
>> anyway, thanks again for your work.
> thanks,
> wengang.
>
--
Wengang Wang
Member of Technical Staff
Oracle Asia R&D Center
Open Source Technologies Development
Tel: +86 10 8278 6265
Mobile: +86 13381078925
--
Wengang Wang
Member of Technical Staff
Oracle Asia R&D Center
Open Source Technologies Development
Tel: +86 10 8278 6265
Mobile: +86 13381078925
-------------- next part --------------
Total groups: 5 clusters per group: 7680, blocks per group: 245760
No blkno total-bits free-bits max-cong-bits
0 32 7680 3955 1804
1 245760 7680 4963 2165
2 491520 7680 4945 2166
3 737280 7680 4961 2166
4 983040 2162 1366 792
Total groups: 5 clusters per group: 7680, blocks per group: 245760
No blkno total-bits free-bits max-cong-bits
0 32 7680 0 0
1 245760 7680 7679 7679
3 737280 7680 7679 7679
2 491520 7680 3482 3300
4 983040 2162 1350 1237
More information about the Ocfs2-tools-devel
mailing list