[Ocfs2-tools-devel] [Fwd: Re: [Ocfs2-devel] [patch 1/1] offline de-fragmentation tool]

Mon Feb 4 01:38:09 PST 2008

-------- Original Message --------
Subject: 	Re: [Ocfs2-devel] [patch 1/1] offline de-fragmentation tool
Date: 	Mon, 04 Feb 2008 17:35:41 +0800
From: 	wengang wang <wen.gang.wang at oracle.com>
To: 	wengang wang <wen.gang.wang at oracle.com>
CC: 	Tao Ma <tao.ma at oracle.com>, Greg Marsden <greg.marsden at oracle.com>, 
ocfs2-tools-devel at oss.oracle.com
References: 	<47A67D89.2030403 at oracle.com> <47A6D304.8060604 at oracle.com> 
<47A6DA21.7050003 at oracle.com>

The indentation of the groups layout came out badly, so I am attaching it.
First: before defrag.
Second: after it.

wengang wang wrote:
> Hi Tao Ma,
>
> yes, this is a demo version :).
> I put it here just so our experts know about it :).  If it's useful, I will 
> commit a well-written version :-P
>
> Tao Ma wrote:
>> Sunil,
>>    What do you think of this tool? It is good, I think.
>>
>> Thanks for your hard work, wengang.
>> It must have taken you a lot of time, since the patch contains more 
>> than 2000 lines. ;)
>> Have you written any design doc about it? Do you have any test script 
>> that can show us how this fantastic tool works?
>>
> yea, I have one design doc at 
> http://oss.oracle.com/osswiki/OCFS2/DesignDocs/defragmentation, but it 
> needs to be revised.
> yea, I have test scripts to fill an ocfs2 partition with data and do an 
> md5sum check on the files, but I didn't paste them here.
> to fill data, I
> 1) copy files in /usr/bin, /usr/sbin and ~/ to the ocfs2 partition.
> 2) delete every second file, in order.
>
> before defragmenting, the groups layout is like this:
> Total groups: 5  clusters per group: 7680,  blocks per group: 245760
> No   blkno       total-bits free-bits max-cong-bits
> 0    32          7680       3955      1804
> 1    245760      7680       4963      2165
> 2    491520      7680       4945      2166
> 3    737280      7680       4961      2166
> 4    983040      2162       1366      792
> and after running, it's like this:
> Total groups: 5  clusters per group: 7680,  blocks per group: 245760
> No   blkno       total-bits free-bits max-cong-bits
> 0    32          7680       0         0
> 1    245760      7680       7679      7679
> 3    737280      7680       7679      7679
> 2    491520      7680       3482      3300
> 4    983040      2162       1350      1237
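
The free-bits and max-cong-bits columns above can be read as "number of free bits" and "longest run of contiguous free bits" in each group bitmap (the second reading is an assumption based on the column name). A minimal sketch in C of how such numbers could be computed follows. It is not code from the patch: the struct and function names are hypothetical, and the bitmap is assumed to store bit i in byte i/8, least significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one group bitmap (names are illustrative). */
struct bitmap_stats {
    unsigned free_bits;       /* number of clear (free) bits */
    unsigned max_contig_free; /* longest run of consecutive free bits */
};

/* Return nonzero if bit i is set, i.e. the cluster is allocated. */
static int bit_is_set(const uint8_t *map, unsigned i)
{
    return (map[i / 8] >> (i % 8)) & 1;
}

/* Walk the bitmap once, counting free bits and tracking the longest free run. */
static struct bitmap_stats scan_bitmap(const uint8_t *map, unsigned total_bits)
{
    struct bitmap_stats st = { 0, 0 };
    unsigned run = 0;
    unsigned i;

    for (i = 0; i < total_bits; i++) {
        if (bit_is_set(map, i)) {
            run = 0;                 /* an allocated bit ends the current run */
        } else {
            st.free_bits++;
            if (++run > st.max_contig_free)
                st.max_contig_free = run;
        }
    }
    return st;
}
```

With an alternating bitmap (every second bit set), this reports half of the bits free but a longest free run of only 1, which is the kind of layout where an allocation needing contiguous clusters can fail although "df" shows plenty of space.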
>> But next time please use the git tree to get the latest ocfs2-tools 
>> source code and generate your patch against it, so that I can merge 
>> your code quickly in my tree. This time, since Joel has moved the 
>> include file to other places, I can't merge your patch successfully. 
>> So I only read your patch and give some comments.
>>
> i see.
>> wengang wang wrote:
>>> I wrote a user space tool to defragment the global bitmap; I hope this 
>>> tool is helpful.
>> Next time please send it to ocfs2-tools-devel since your patch only 
>> contains the modification to ocfs2-tools.
> ok.
>>>
>>> for the case of storing non-DB data on ocfs2, there are not a few 
>>> very large files, but lots of relatively small files.
>>> after a long time of use, especially creating and deleting files, the 
>>> free space in the global bitmap is split into fragments, so that even 
>>> when there is enough free space (but not contiguous), creating a file 
>>> may fail.
>>>
>>> there is a related bug 6730723 (on bugdb), though it was closed as "not 
>>> supported". and I have made a scenario in which "df" shows the partition 
>>> usage at 51%, but creating a file on other nodes fails with a "no space" 
>>> error.
>>>
>>> this offline tool, o2defrag, can make larger contiguous runs of free 
>>> bits in the global bitmap by moving data clusters of regular files and 
>>> directories.
>>> it does:
>>> 1)  for each group, move the data clusters in the group to the front of 
>>> the same group to make a bigger free space at the end.
>>> 2)  for groups that have more free space, move data clusters to other 
>>> group(s) to make much more free space.
>>>     a)   first, it tries to move data clusters to a group that already 
>>> holds data clusters of the same file. if there is no such group or no 
>>> space on these groups, go to b).
>>>     b)   move data clusters to a group that holds no data clusters of 
>>> the same file.
>>> 3) do step 1) again.
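
Step 1) above can be pictured with a simplified model in C. This is a sketch under stated assumptions, not the patch's implementation: `struct extent` and `pack_to_front` are hypothetical names, each extent is a run of `count` bits starting at bit `off` within one group, and the extents are assumed to be sorted by offset and non-overlapping. The patch itself searches the bitmap for the first free run that fits and moves one extent at a time; the sketch only shows the resulting layout.

```c
#include <stddef.h>

/* Hypothetical description of one file extent inside a group bitmap. */
struct extent {
    unsigned off;   /* first bit (cluster) of the extent within the group */
    unsigned count; /* number of contiguous bits (clusters) */
};

/*
 * Pack extents toward bit 0, preserving their order.
 * Assumes ext[] is sorted by off and the extents do not overlap, so every
 * new offset is less than or equal to the old one.
 * Returns the first free bit after packing; everything from there to the
 * end of the group is one contiguous free run.
 */
static unsigned pack_to_front(struct extent *ext, size_t n)
{
    unsigned next = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        ext[i].off = next;      /* unchanged if the extent is already in place */
        next += ext[i].count;
    }
    return next;
}
```

For example, extents at (off 2, count 3), (off 10, count 2) and (off 20, count 1) end up at offsets 0, 3 and 5, and the function returns 6. A real implementation would also have to respect bits that cannot move, such as the group descriptor itself.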
>>>
>>> what this tool changes is:
>>> I)   the location of data clusters.
>>> II)  the corresponding bits in the global bitmap.
>>> III) the extent record in the ocfs2_dinode block or extent block.
>>>
>>> this tool doesn't merge or split extent records for now, and it does 
>>> almost nothing for fs performance.
>>> the feature to be added is to move all data clusters of a file 
>>> together as far as possible, so that file access on ocfs2 can get 
>>> better performance.
>>>
>>> the patch is based on ocfs2-tools 1.2.6.  to compile it, you need to 
>>> add a symbolic link named "include" in the o2defrag directory, pointing 
>>> to ../debugfs.ocfs2/include.
>>> usage is  "o2defrag <ocfs2 partition>"
>>>
>>>
>>> thanks,
>>> wengang.
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>> diff -N -u -p -r ocfs2-tools-1.2.6.orig/libocfs2/chain.c ocfs2-tools-1.2.6/libocfs2/chain.c
>>> --- ocfs2-tools-1.2.6.orig/libocfs2/chain.c    2008-02-05 01:30:21.000000000 -0500
>>> +++ ocfs2-tools-1.2.6/libocfs2/chain.c    2008-02-05 02:00:53.000000000 -0500
>>> @@ -43,7 +43,33 @@ void ocfs2_swap_group_desc(struct ocfs2_
>>>      gd->bg_parent_dinode = bswap_64(gd->bg_parent_dinode);
>>>      gd->bg_blkno = bswap_64(gd->bg_blkno);
>>>  }
>>> +errcode_t ocfs2_read_group_desc2(ocfs2_filesys *fs, uint64_t blkno,
>>> +                    char *gd_buf)
>>>   
>> You added so many *2 functions. And the biggest difference is that 
>> you don't allocate the spaces in it and read-write using the buffer 
>> given by the caller. I am not sure whether it is OK. Even if it is 
>> OK, you should put it into another patch, since it has no 
>> relationship with your o2defrag patch. And it also makes your patch 
>> smaller so that people can review it quickly.
>>
> the original functions make a duplicate buffer for swapping between CPU 
> and LE data.  skipping the duplicate is ok as long as the buffer is not 
> used after it has been swapped to LE data.
> yea, I should do that for the product version.
>
>>> +struct o2_group_list{
>>> +    int group_size;        /* count of total bits of a group */
>>> +    int count;        /* count of groups */
>>> +    struct o2_group *head;    /* the first group in list */
>>> +    struct o2_group *last;  /* the last group in list */
>>> +};
>>> +
>>> +struct o2_file_list all_file_list;
>>> +struct o2_group_list all_group_list;
>>> +
>>> +/*
>>> +struct current_name{
>>> +    int current_layer;
>>> +    char *name[MAX_FILE_DEP+1];
>>> +};
>>> +
>>> +struct current_name current_name;
>>> +*/
>>>   
>> Could you please remove this unused code before you generate the 
>> patch? thanks.
> sure, for product version.  it's used for debugging only.
>>> +
>>> +struct o2_modify_meta{
>>> +    uint64_t meta_blkno;        /* inode number of a file */
>>> +    uint64_t old_data_blkno;
>>> +    uint64_t new_data_blkno;
>>> +    int needs_free;            /* this is only for internal use */
>>> +    struct o2_modify_meta *next;
>>> +};
>>> +
>>> +static void * local_malloc(size_t size, int *needs_free)
>>> +{
>>> +    char *ret;
>>> +    ssize_t cache_size;
>>> +    int off;
>>> +
>>> +    cache_size  = sizeof(struct o2_modify_meta) * COUNT_OF_MODIFY_META;
>>> +
>>> +    if (memory_object.current_meta == COUNT_OF_MODIFY_META) {
>>> +        memory_object.buf = malloc(cache_size);
>>> +        if (!memory_object.buf) {
>>> +            printf("no mem\n");
>>>   
>> Use com_err like other ocfs2-tools please.
> sure for product version.
>>> +            return NULL;
>>> +        }
>>> +        memory_object.current_meta = 0;
>>> +    }
>>> +
>>> +    if (memory_object.current_meta == 0) {
>>> +        *needs_free = 1;
>>> +    } else {
>>> +        *needs_free = 0;
>>> +    }
>>> +    off = sizeof(struct o2_modify_meta) * memory_object.current_meta;
>>> +    ret =  memory_object.buf + off;
>>> +    memory_object.current_meta ++;
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static int worktree(char *basename, uint64_t blkno)
>>> +{
>>> +    int res;
>>> +    char *buf = NULL;
>>> +    struct ocfs2_dinode *inode = NULL;
>>> +    struct o2_file *file = NULL;
>>> +    char type;
>>> +
>>> +    res = push_name(basename);
>>> +    if (res)
>>> +        return -1;
>>> +    res = ocfs2_malloc_block(gbls.fs->fs_io, &buf);
>>> +    if (res) {
>>> +        printf("ocfs2_malloc_block failed. %d\n", res);
>>> +        return -1;
>>> +    }
>>> +
>>> +    res = ocfs2_read_inode2(gbls.fs, blkno, buf);
>>> +    if (res) {
>>> +        printf("ocfs2_read_inode error %d\n",res);
>>> +        goto ret1;
>>> +    }
>>> +    inode = (struct ocfs2_dinode *)buf;
>>> +    if (S_ISREG(inode->i_mode))
>>> +        type='f';
>>> +    else if (S_ISDIR(inode->i_mode))
>>> +        type='d';
>>> +    else {
>>> +        // ignore other type
>>> +        goto ret2;
>>> +    }
>>> +
>>> +    file = new_file(inode, type);
>>> +    if (!file) {
>>> +        goto ret1;
>>> +    }
>>> +    insert_file_to_list(file);
>>>   
>> You don't handle the file which has link_count > 2. So here you may 
>> insert it twice.
> yes. needs more consideration.
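
One way to avoid inserting a hard-linked file twice is to remember which inode block numbers have already been visited during the directory walk. A minimal sketch in C follows. It is not part of the patch: `struct seen_set` and the `seen_*` functions are hypothetical, the table has a fixed capacity and is never resized, and block number 0 is reserved as the "empty slot" marker on the assumption that it is never a valid inode block.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size set of inode block numbers (open addressing). */
struct seen_set {
    uint64_t *slots; /* 0 marks an empty slot */
    size_t cap;      /* number of slots */
    size_t used;     /* number of stored block numbers */
};

/* Allocate an empty set; returns 0 on success, -1 on failure. */
static int seen_init(struct seen_set *s, size_t cap)
{
    if (cap == 0)
        return -1;
    s->slots = calloc(cap, sizeof(*s->slots));
    s->cap = cap;
    s->used = 0;
    return s->slots ? 0 : -1;
}

/*
 * Record blkno. Returns 1 if it was newly inserted, 0 if it was already
 * present (for example a second hard link to the same inode), and -1 if
 * blkno is 0 or the table is full.
 */
static int seen_insert(struct seen_set *s, uint64_t blkno)
{
    size_t i, n;

    if (blkno == 0)
        return -1;
    i = (size_t)((blkno * 11400714819323198485ULL) % s->cap);
    for (n = 0; n < s->cap; n++, i = (i + 1) % s->cap) {
        if (s->slots[i] == blkno)
            return 0;
        if (s->slots[i] == 0) {
            s->slots[i] = blkno;
            s->used++;
            return 1;
        }
    }
    return -1;
}

static void seen_free(struct seen_set *s)
{
    free(s->slots);
    s->slots = NULL;
    s->cap = s->used = 0;
}
```

A walker could call seen_insert() with the inode's block number before creating the file object and skip the inode when the call returns 0.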
>>> +    res = access_one_node(inode, file);
>>> +    if (res)
>>> +        goto ret2;
>>> +
>>> +    if (file->type == 'd') {
>>> +        res = ocfs2_dir_iterate(gbls.fs, blkno, 0, NULL,
>>> +                do_with_childrens, NULL);
>>> +        if (res) {
>>> +            printf("ocfs2_dir_iterate failed. %d\n", res);
>>> +            goto ret2;
>>> +        }
>>> +    }
>>> +    goto ret1;
>>> +
>>> +ret2:
>>> +    if (file && file->filename)
>>> +        free(file->filename);
>>> +    if (file)
>>> +        free(file);
>>> +ret1:
>>> +    if (buf)
>>> +        ocfs2_free(&buf);
>>> +    if (res) {
>>> +        pop_name();
>>> +    } else {
>>> +        res = pop_name();
>>> +    }
>>> +
>>> +    return res;
>>> +}
>>> +
>>> +/* find the record --ocfs2_extent_rec, which blkno is old_data_blkno, change blkno to
>>> +new_data_blkno
>>> +*/
>>> +static inline int modify_file_meta(uint64_t file_blkno, uint64_t from_blkno,
>>> +                    uint64_t to_blkno)
>>>   
>> Why go through the file once again? Since you have already iterated 
>> over all the files, you could record the information there, so that 
>> you don't need to iterate a second time now.
> I can't hold all the metadata in memory because of memory limitations, 
> so I have to iterate again.
>
>>> +{
>>> +    struct ocfs2_dinode *inode;
>>> +    struct ocfs2_extent_list *el;
>>> +    int res;
>>> +
>>> +    if (!update_disk)
>>> +        return 0;
>>> +
>>> +    res = ocfs2_read_inode2(gbls.fs, file_blkno, buf_modify_file_meta);
>>> +    if (res) {
>>> +        printf("ocfs2_read_inode2 %lu, failed. %d\n",file_blkno, res);
>>> +        return -1;
>>> +    }
>>> +    inode = (struct ocfs2_dinode *)buf_modify_file_meta;
>>> +    el= &(inode->id2.i_list);
>>> +
>>> +    res = find_and_change_extent_rec(el, buf_modify_file_meta, file_blkno,
>>> +                        1, from_blkno, to_blkno);
>>> +    if (res == 1) {
>>> +        printf("such meta block %lu not found\n", from_blkno);
>>> +    }
>>> +
>>> +    return res;
>>> +}
>>> +
>>> +/* move the clusters specified by file_group to front of the same group */
>>> +/* -1 -->error; 0 -->moved successfully; 1 -->no space */
>>> +static int move_file_group_on_group(struct o2_group *group,
>>> +                struct o2_file_group *file_group)
>>> +{
>>> +    struct ocfs2_group_desc *gd;
>>> +    int alloc_start_bits_off;
>>> +    int res;
>>> +    int first_free_bit;
>>> +    uint64_t fromblk, toblk;
>>> +
>>> +    gd = group->gd;
>>> +
>>> +    first_free_bit = get_first_free_bit(gd);
>>> +    if (-1 == first_free_bit) {    //group full
>>> +        res = 1;
>>> +        goto done;
>>> +    }
>>> +
>>> +    if (file_group->off < first_free_bit) {
>>> +        res = 1;
>>> +        goto done;
>>> +    }
>>> +
>>> +    /* clear in memory the bits owned by this file_group */
>>> +    clear_bits_on_group(file_group->off, file_group->count, gd);
>>> +
>>> +    /* try to find enough bits to move to */
>>> +    res = get_N_contig_free_bits(gd, file_group->count,
>>> +                    &alloc_start_bits_off);
>>> +    if (res) {
>>> +        //no space
>>> +        set_bits_on_group(file_group->off, file_group->count, gd);
>>> +        res = 1;
>>> +        goto done;
>>> +    }
>>> +
>>> +    if (alloc_start_bits_off > file_group->off)
>>>   
>> Here you need to handle alloc_start_bits_off == file_group->off.
> yea. adding == will avoid some data copies.
>>> +    {
>>> +        // got a starting address later than the original
>>> +        clear_bits_on_group(alloc_start_bits_off, file_group->count, gd);
>>> +        set_bits_on_group(file_group->off, file_group->count, gd);
>>> +        res = 1;
>>> +        goto done;
>>> +    }
>>> +/* now copy data clusters from old place to new place one by one */
>>> +    fromblk = file_group->off<<get_shift_bits();
>>> +    toblk = alloc_start_bits_off<<get_shift_bits();
>>> +    if (!is_first_group_gd(group)) {
>>> +        fromblk += group->blkno;
>>> +        toblk += group->blkno;
>>> +    }
>>> +
>>> +    res = copy_N_clusters(fromblk, toblk, file_group->count);
>>> +    if (res) {
>>> +        //error, won't continue to access.
>>> +        goto done;
>>> +    }
>>> +
>>> +    /* modify the meta data of the file on disk */
>>> +    if (!new_and_insert_meta(file_group->file->blkno, fromblk, toblk)) {
>>> +        //error, won't continue to access.
>>> +        res = -1;
>>> +        goto done;
>>> +    }
>>> +
>>> +    /* update the file group object */
>>> +    file_group->off = alloc_start_bits_off;
>>> +    file_group->blkno = file_group->off<<get_shift_bits();
>>> +    if (!is_first_group_gd(file_group->group)) {
>>> +        file_group->blkno += file_group->group->blkno;
>>> +    }
>>> +
>>> +done:
>>> +    return res;
>>> +}
>>> +
>>> +
>>> +/* move all data clusters of files in a group to the end of this group */
>>>   
>> You move to the beginning, not the end, it seems. ;)
> yea, "to the end" was my very first thought. since it is difficult, I 
> gave it up. "to front" is correct.
>>> +/* for all groups,
>>> +   for all files that have data-clusters on one of the groups,
>>> +   move the data-clusters to front of this group, and move corresponding
>>> +   bits to the front of the bitmap in this group.
>>> +
>>> +   this moving is based on a ocfs2_extent_rec, no merging or splitting on ocfs2_extent_rec.
>>> +
>>> +   what's modified on disk:
>>> +   1) data-clusters copying, (while debugging, set the old clusters to 0)
>>> +   2) file meta-data. --modifying the ocfs2_extent_rec.e_blkno.
>>> +   3) clearing bits and setting bits on bitmap of that group.
>>> +*/
>>> +static int defrag1_move_data_to_front_on_all_group()
>>> +{
>>> +    int res;
>>> +    struct o2_group *group;
>>> +    struct o2_file_group *tmp;
>>> +    int first_free_bit;
>>> +
>>> +    group = all_group_list.head;
>>> +    printf("process in defrag1 ...\n");
>>> +
>>> +    while ( group) {
>>> +        //move group 2 only for debug
>>> +        printf("defrag1 working on group %d/%d...", group->group_num, all_group_list.count);
>>> +
>>> +        first_free_bit = get_first_free_bit(group->gd);
>>> +        if (-1 == first_free_bit) {
>>> +            //group full
>>> +            group = group->next;
>>> +            printf("done.\n");
>>> +            continue;
>>> +        }
>>> +
>>> +        tmp = group->file_group_list.head;
>>> +        while (tmp) {
>>> +            res = move_file_group_on_group(group, tmp);
>>> +            if (res == -1) {
>>> +                printf("move_file_group_on_group failed. %d\n", res);
>>> +                return res;
>>> +            }
>>> +            tmp = tmp->next_in_group;
>>> +        }
>>> +        printf("done.\n");
>>> +        group = group->next;
>>> +    }
>>> +
>>> +    res = update_all_meta();
>>>   
>> You can't do it here. You really should do it immediately after you 
>> move the clusters in move_file_group_on_group. Otherwise, if the system 
>> panics before this function runs, the content of all the files is corrupted.
>>
> this tool needs to be run offline.
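
To illustrate the ordering Tao suggests, here is a toy model in C. It is not code from the patch: `struct toy_disk` and the `step_*` functions are hypothetical, and a single `extent_start` field stands in for the ocfs2_extent_rec. The idea is to copy the data first, then repoint the extent record, and only then release the old clusters, so that after every step the record points at a valid copy of the data.

```c
#include <string.h>

#define NCLUSTERS    8
#define CLUSTER_SIZE 4

/* Hypothetical in-memory stand-in for a tiny volume holding one file. */
struct toy_disk {
    char data[NCLUSTERS][CLUSTER_SIZE]; /* cluster contents */
    unsigned char used[NCLUSTERS];      /* 1 = allocated in the bitmap */
    unsigned extent_start;              /* the file's single extent record */
};

/* Step 1: copy the file's cluster to its new place and mark it allocated. */
static void step_copy(struct toy_disk *d, unsigned to)
{
    memcpy(d->data[to], d->data[d->extent_start], CLUSTER_SIZE);
    d->used[to] = 1;
}

/* Step 2: point the extent record at the new place; returns the old place. */
static unsigned step_repoint(struct toy_disk *d, unsigned to)
{
    unsigned old = d->extent_start;

    d->extent_start = to;
    return old;
}

/* Step 3: only now release the old cluster in the bitmap. */
static void step_release(struct toy_disk *d, unsigned old)
{
    d->used[old] = 0;
}

/* What a reader of the file would see at this moment. */
static const char *file_data(const struct toy_disk *d)
{
    return d->data[d->extent_start];
}
```

If the old cluster were released, and possibly reused by a later move, before the record is updated, an interruption in between would leave the record pointing at overwritten data. On a real disk each step would also have to reach stable storage before the next one begins.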
>> There are many trailing white spaces in your patch. So please remove 
>> them all.
>>
> sure.
>> anyway, thanks again for your work.
> thanks,
> wengang.
>

-- 
Wengang Wang
Member of Technical Staff
Oracle Asia R&D Center
Open Source Technologies Development

Tel:      +86 10 8278 6265
Mobile:   +86 13381078925

-------------- next part --------------
Total groups: 5  clusters per group: 7680,  blocks per group: 245760
No   blkno       total-bits free-bits max-cong-bits
0    32          7680       3955      1804     
1    245760      7680       4963      2165     
2    491520      7680       4945      2166     
3    737280      7680       4961      2166     
4    983040      2162       1366      792     

Total groups: 5  clusters per group: 7680,  blocks per group: 245760
No   blkno       total-bits free-bits max-cong-bits
0    32          7680       0         0        
1    245760      7680       7679      7679     
3    737280      7680       7679      7679     
2    491520      7680       3482      3300     
4    983040      2162       1350      1237