[Ocfs2-tools-devel] Re: [Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

wengang wang wen.gang.wang at oracle.com
Mon Feb 4 01:38:23 PST 2008


Hi Tao Ma,

yes, this a demo version:).
I put it here just want our experts know it:).  If it's useful, will 
commit the well written version :-P

Tao Ma wrote:
> Sunil,
>    What do you think of this tools? It is good I think.
>
> Thanks for your hard work, wengang.
> It should really take you much time since the patch contains more than 
> 2000 lines. ;)
> Have you written any design doc about it? Do you have any test script 
> that can show us how this fantastic tool work?
>
yea, I have one design doc at 
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/defragmentation, but it 
needs to be revised.
yea, I have test scripts to fill ocfs2 partition with data and do md5sum 
check on the files.  but didn't  paste them here.
to fill data, I
1) copy files in /usr/bin, /usr/sbin  and ~/ to ocfs2 partition.
2) delete one file very two files in order.

before de-fragmetating, groups layout is like this:
Total groups: 5  clusters per group: 7680,  blocks per group: 245760
No   blkno       total-bits free-bits max-cong-bits
0    32              7680       3955      1804    
1    245760      7680       4963      2165    
2    491520      7680       4945      2166    
3    737280      7680       4961      2166    
4    983040      2162       1366      792   

and running, it's like
Total groups: 5  clusters per group: 7680,  blocks per group: 245760
No   blkno       total-bits free-bits max-cong-bits
0    32              7680       0            0       
1    245760      7680       7679      7679    
3    737280      7680       7679      7679    
2    491520      7680       3482      3300    
4    983040      2162       1350      1237  

> But next time please use the git tree to get the latest ocfs2-tools 
> source code and generate your patch against it, so that I can merge 
> your code quickly in my tree. This time, since Joel has moved the 
> include file to other places, I can't merge your patch successfully. 
> So I only read your patch and give some comments.
>
i see.
> wengang wang wrote:
>> I wrote a user space tool to de-fragmentate global bitmap, hope this 
>> tool is helpful.
> Next time please send it to ocfs2-tools-devel since your patch only 
> contains the modification to ocfs2-tools.
ok.
>>
>> for the case of storing non-DB data on ocfs2,  there are not several 
>> very large file, but lots of relative small files.
>> after a long time of using, --especially creating and deleting, the 
>> global bitmap is split into fragments. so that even there is enough 
>> free space(but not contiguous), creating a file may fail.
>>
>> there is a relative bug 6730723(on bugdb) though closed with "not 
>> supported". and I have made a scenario that "df" show the partition 
>> usage 51%, but on other nodes creating a file fails with "no space" 
>> error.
>>
>> this offline tool, o2defrag, can make larger contiguous free bits on 
>> global bitmap by moving data clusters of regular file and directories.
>> it does:
>> 1)  for each group, move data clusters on the group to the front of 
>> the same group to make bigger free space at the end.
>> 2)  for groups that has more free space, move data clusters to other 
>> group(s) to make much more free space.
>>     a)   firstly, it try to move data clusters to the group on which 
>> there are data clusters of the same file. if no such group or no 
>> space on these groups, goto b).
>>     b)   move data clusters to a group on which there is no data 
>> clusters of the same file.
>> 3) does step 1) again.
>>
>> stuff changed by this tool is
>> I)   moving of data clusters.
>> II)  moving of corresponding bits in global bitmap.
>> III) extent record in ocfs2_dinode block or extension block
>>
>> this tool doesn't merge or split extent records for now. and it's 
>> nearly help nothing to fs performance.
>> the feature to be added is trying to move all data clusters of a file 
>> together as possible. so that accessing to file on ocfs2 can get 
>> better performance.
>>
>> the patch is based on ocfs2-tool 1.2.6.  for compiling, needs to add 
>> a symbolic link named "include" in o2defrag directory of 
>> ../debugfs.ocfs2/include.
>> usage is  "o2defrag <ocfs2 partition>"
>>
>>
>> thanks,
>> wengang.
>>
>> ------------------------------------------------------------------------
>>
>> diff -N -u -p -r ocfs2-tools-1.2.6.orig/libocfs2/chain.c 
>> ocfs2-tools-1.2.6/libocfs2/chain.c
>> --- ocfs2-tools-1.2.6.orig/libocfs2/chain.c    2008-02-05 
>> 01:30:21.000000000 -0500
>> +++ ocfs2-tools-1.2.6/libocfs2/chain.c    2008-02-05 
>> 02:00:53.000000000 -0500
>> @@ -43,7 +43,33 @@ void ocfs2_swap_group_desc(struct ocfs2_
>>      gd->bg_parent_dinode = bswap_64(gd->bg_parent_dinode);
>>      gd->bg_blkno = bswap_64(gd->bg_blkno);
>>  }
>> +errcode_t ocfs2_read_group_desc2(ocfs2_filesys *fs, uint64_t blkno,
>> +                    char *gd_buf)
>>   
> You added so many *2 functions. And the biggest difference is that you 
> don't allocate the spaces in it and read-write using the buffer given 
> by the caller. I am not sure whether it is OK. Even if it is OK, you 
> should put it into another patch, since it has no relationship with 
> your o2defrag patch. And it also makes your patch smaller so that 
> people can review it quickly.
>
the original functions do a duplicate for swap between cpu data and le 
data.  not using it after swapping to le data is ok.
yea, should do that for product version.

>> +struct o2_group_list{
>> +    int group_size;        /* count of total bits of a group */
>> +    int count;        /* count of groups */
>> +    struct o2_group *head;    /* the first group in list */
>> +    struct o2_group *last;  /* the last group in list */
>> +};
>> +
>> +struct o2_file_list all_file_list;
>> +struct o2_group_list all_group_list;
>> +
>> +/*
>> +struct current_name{
>> +    int current_layer;
>> +    char *name[MAX_FILE_DEP+1];
>> +};
>> +
>> +struct current_name current_name;
>> +*/
>>   
> Could you please remove these unused code before you generate the 
> patch? thanks.
sure, for product version.  it's used for debugging only.
>> +
>> +struct o2_modify_meta{
>> +    uint64_t meta_blkno;        /* inode number of a fle */
>> +    uint64_t old_data_blkno;   
>> +    uint64_t new_data_blkno;
>> +    int needs_free;            /* this is only for internal use */
>> +    struct o2_modify_meta *next;
>> +};
>> +
>> +static void * local_malloc(size_t size, int *needs_free)
>> +{
>> +    char *ret;
>> +    ssize_t cache_size;
>> +    int off;
>> +
>> +    cache_size  = sizeof(struct o2_modify_meta) * COUNT_OF_MODIFY_META;
>> +
>> +    if (memory_object.current_meta == COUNT_OF_MODIFY_META) {
>> +        memory_object.buf = malloc(cache_size);
>> +        if (!memory_object.buf) {
>> +            printf("no mem\n");
>>   
> Use com_err like other ocfs2-tools please.
sure for product version.
>> +            return NULL;
>> +        }
>> +        memory_object.current_meta = 0;
>> +    }
>> +   
>> +    if (memory_object.current_meta == 0) {
>> +        *needs_free = 1;
>> +    } else {
>> +        *needs_free = 0;
>> +    }
>> +    off = sizeof(struct o2_modify_meta) * memory_object.current_meta;
>> +    ret =  memory_object.buf + off;
>> +    memory_object.current_meta ++;
>> +
>> +    return ret;
>> +}
>> +
>> +static int worktree(char *basename, uint64_t blkno)
>>   +    int res;
>> +    char *buf = NULL;
>> +    struct ocfs2_dinode *inode = NULL;
>> +    struct o2_file *file = NULL;
>> +    char type;
>> +
>> +    res = push_name(basename);
>> +    if (res)
>> +        return -1;
>> +    res = ocfs2_malloc_block(gbls.fs->fs_io, &buf);
>> +    if (res) {
>> +        printf("ocfs2_malloc_block failed. %d\n", res);
>> +        return -1;
>> +    }
>> +   
>> +    res = ocfs2_read_inode2(gbls.fs, blkno, buf);
>> +    if (res) {
>> +        printf("ocfs2_read_inode error %d\n",res);
>> +        goto ret1;
>> +    }
>> +    inode = (struct ocfs2_dinode *)buf;
>> +    if (S_ISREG(inode->i_mode))
>> +        type='f';
>> +    else if (S_ISDIR(inode->i_mode))
>> +        type='d';
>> +    else {
>> +        // ignore other type
>> +        goto ret2;
>> +    }
>> +
>> +    file = new_file(inode, type);
>> +    if (!file) {
>> +        goto ret1;
>> +    }
>> +    insert_file_to_list(file);
>>   
> You don't handle the file which has link_count > 2. So here you may 
> insert it twice.
yes. needs more consideration.
>> +    res = access_one_node(inode, file);
>> +    if (res)
>> +        goto ret2;
>> +
>> +    if (file->type == 'd') {
>> +        res = ocfs2_dir_iterate(gbls.fs, blkno, 0, NULL,
>> +                do_with_childrens, NULL);
>> +        if (res) {
>> +            printf("ocfs2_dir_iterate failed. %d\n", res);
>> +            goto ret2;
>> +        }
>> +    }
>> +    goto ret1;
>> +
>> +ret2:   
>> +    if (file && file->filename)
>> +        free(file->filename);
>> +    if (file)
>> +        free(file);
>> +ret1:
>> +    if (buf)
>> +        ocfs2_free(&buf);
>> +    if (res) {
>> +        pop_name();
>> +    } else {
>> +        res = pop_name();
>> +    }
>> +
>> +    return res;
>> +}
>> +
>> +/* find the record --ocfs2_extent_rec, which blkno is 
>> old_data_blkno, change blkno to +new_data_blkno
>> +*/
>> +static inline int modify_file_meta(uint64_t file_blkno, uint64_t 
>> from_blkno,
>> +                    uint64_t to_blkno)
>>   
> Why go through the file once again? Since you have already iterate all 
> the files, you may record the information there. So that you don't 
> need to iterate it now the second time.
can't hold all meta in memory for memory limitation. have to iterate it 
again.

>> +{
>> +    struct ocfs2_dinode *inode;
>> +    struct ocfs2_extent_list *el;
>> +    int res;
>> +
>> +    if (!update_disk)
>> +        return 0;
>> +
>> +    res = ocfs2_read_inode2(gbls.fs, file_blkno, buf_modify_file_meta);
>> +    if (res) {
>> +        printf("ocfs2_read_inode2 %lu, failed. %d\n",file_blkno, res);
>> +        return -1;
>> +    }
>> +    inode = (struct ocfs2_dinode *)buf_modify_file_meta;
>> +    el= &(inode->id2.i_list);
>> +
>> +    res = find_and_change_extent_rec(el, buf_modify_file_meta, 
>> file_blkno,
>> +                        1, from_blkno, to_blkno);
>> +    if (res == 1) {
>> +        printf("such meta block %lu not found\n", from_blkno);
>> +    }
>> +
>> +    return res;
>> +}
>> +
>> +/* move the clusters specified by file_group to front of the same 
>> group */
>> +/* -1 -->error; 0 -->moved successfully; 1 -->no space */
>> +static int move_file_group_on_group(struct o2_group *group,
>> +                struct o2_file_group *file_group)
>> +{
>> +    struct ocfs2_group_desc *gd;
>> +    int alloc_start_bits_off;
>> +    int res;
>> +    int first_free_bit;
>> +    uint64_t fromblk, toblk;
>> +   
>> +    gd = group->gd;
>> +   
>> +    first_free_bit = get_first_free_bit(gd);
>> +    if (-1 == first_free_bit) {    //group full
>> +        res = 1;
>> +        goto done;
>> +    }
>> +   
>> +    if (file_group->off < first_free_bit) {
>> +        res = 1;
>> +        goto done;
>> +    }
>> +   
>> +    /* clear in memory the bits owned by this file_group */
>> +    clear_bits_on_group(file_group->off, file_group->count, gd);
>> +
>> +    /* try to find enough bits to move to */
>> +    res = get_N_contig_free_bits(gd, file_group->count,
>> +                    &alloc_start_bits_off);
>> +    if (res) {
>> +        //no space
>> +        set_bits_on_group(file_group->off, file_group->count, gd);
>> +        res = 1;
>> +        goto done;
>> +    }
>> +   
>> +    if (alloc_start_bits_off > file_group->off)
>>   
> Here you need to handle alloc_start_bits_off == file_group->off.
yea. adding == will avoid some data copies.
>> +    {
>> +        // got a starting address later than the original
>> +        clear_bits_on_group(alloc_start_bits_off, file_group->count, 
>> gd);
>> +        set_bits_on_group(file_group->off, file_group->count, gd);
>> +        res = 1;
>> +        goto done;
>> +    }
>> +/* now copy data clusters from old place to new place one by one */
>> +    fromblk = file_group->off<<get_shift_bits();
>> +    toblk = alloc_start_bits_off<<get_shift_bits();
>> +    if (!is_first_group_gd(group)) {
>> +        fromblk += group->blkno;
>> +        toblk += group->blkno;
>> +    } +
>> +    res = copy_N_clusters(fromblk, toblk, file_group->count);
>> +    if (res) {
>> +        //error, won't continue to access.
>> +        goto done;
>> +    }
>> +
>> +    /* modify the meta data of the file on disk */
>> +    if (!new_and_insert_meta(file_group->file->blkno, fromblk, 
>> toblk)) {
>> +        //error, won't continue to access.
>> +        res = -1;
>> +        goto done;
>> +    }
>> +
>> +    /* update the file group object */
>> +    file_group->off = alloc_start_bits_off;
>> +    file_group->blkno = file_group->off<<get_shift_bits();
>> +    if (!is_first_group_gd(file_group->group)) {
>> +        file_group->blkno += file_group->group->blkno;
>> +    }
>> +
>> +done:
>> +    return res;
>> +}
>> +
>> +
>> +/* move all data clusters of files in a group to the end of this 
>> group */
>>   
> You move to the beginning, not the end, it seems. ;)
yea, "to the end" is my very first thought. since it is difficult, gave 
up. 
"to front" is correct.
>> +/* for all groups, +   for all files that have data-clusters on one 
>> of the groups,
>> +   move the the data-clusters to front of this group, and move 
>> corresponding
>> +   bits to the front of the bitmap in this group.
>> +
>> +   this moving is based on a ocfs2_extent_rec, no mering or spliting 
>> on ocfs2_extent_rec.
>> +
>> +   what's modified on disk:
>> +   1) data-clusters copying, (while debugging, set the old clusters 
>> to 0)
>> +   2) file meta-data. --modifying the ocfs2_extent_rec.e_blkno.
>> +   3) clearing bits and setting bits on bitmap of that group.
>> +*/
>> +static int defrag1_move_data_to_front_on_all_group()
>> +{
>> +    int res;
>> +    struct o2_group *group;
>> +    struct o2_file_group *tmp;
>> +    int first_free_bit;
>> +
>> +    group = all_group_list.head;
>> +    printf("process in defrag1 ...\n");
>> +   
>> +    while ( group) {
>> +        //move group 2 only for debug +        printf("defrag1 
>> working on group %d/%d...", group->group_num, all_group_list.count);
>> +
>> +        first_free_bit = get_first_free_bit(group->gd);
>> +        if (-1 == first_free_bit) {
>> +            //group full
>> +            group = group->next;
>> +            printf("done.\n");
>> +            continue;
>> +        }
>> +       
>> +        tmp = group->file_group_list.head;
>> +        while (tmp) {
>> +            res = move_file_group_on_group(group, tmp);
>> +            if (res == -1) {
>> +                printf("move_file_group_on_group failed. %d\n", res);
>> +                return res;
>> +            }
>> +            tmp = tmp->next_in_group;
>> +        }
>> +        printf("done.\n");
>> +        group = group->next;
>> +    }
>> +
>> +    res = update_all_meta();
>>   
> You can't do it here. You really should do it immediately after your 
> move the cluster in move_file_group_on_group. Otherwise if the system 
> panic before this function, all the files' content is corrupted.
>
this tool needs to be run offline.
> There are many trailing white spaces in your patch. So please remove 
> them all.
>
sure.
> anyway, thanks again for your work.
thanks,
wengang.




More information about the Ocfs2-tools-devel mailing list