
OCFS2/DesignDocs/RemoveSlotsTunefs

REMOVE SLOTS SUPPORT OF TUNEFS.OCFS2

Owner: TaoMa

OVERVIEW

mkfs.ocfs2 formats an ocfs2 volume with the number of node slots set by the user. Over time, the user may want to change that number to fit new needs. tunefs.ocfs2 can already add slots while the cluster is off-line; this design document describes how to decrease the number of slots.

In some cases the system has many unused node slots, each with a very large empty journal file. Once slot removal is implemented, we can remove the corresponding journal files and reuse the disk space, which is another reason for this tunefs.ocfs2 update.

TECHNICAL OVERVIEW

When we want to add slots, tunefs.ocfs2 performs the following steps: adding the system files, allocating and initializing the journal, and finally incrementing the slot count in the super block.

Removing slots is more involved. We may have to consider the following steps (just a sketch, more details will be filled in):

1. Set an in-progress incompat flag in the super block, so that if corruption occurs later the volume becomes inaccessible and fsck.ocfs2 must be run before the work can continue.

2. Groups in inode_alloc and extent_alloc will be linked into the remaining slots (maybe slot 0? And is that all we need to do?). We have to modify the inode and all the group descriptors to reflect the real change.

3. The Sub Alloc Slot of all inodes/extents in the moved groups will have to be updated. The Sub Alloc Bit does not need to change, since neither the fs nor fsck checks its value.

debugfs: stat /large_files
       Inode: 266966   Mode: 0755   Generation: 2939310569 (0xaf3251e9)
...
       Last Extblk: 0
       Sub Alloc Slot: 0   Sub Alloc Bit: 1720
       Tree Depth: 0   Count: 115   Next Free Rec: 1
       ## Offset        Clusters       Block#
       0  0             1              551940

4. Truncate the journal and orphan dir to release their clusters to the global bitmap.

5. The extra system dir entries should be removed.

6. The slot number is updated in the super block.

7. The inprog incompat in super should be cleared.

8. The backup super blocks are updated (the code is already there; we just need to make sure we refresh the super block before the update process).
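
Steps 2 and 3 can be pictured with a small in-memory model. The sketch below is only an illustration under stated assumptions: the `toy_*` types and `toy_relink_groups()` are hypothetical names, not the on-disk ocfs2 structures or libocfs2 calls, and it assumes all groups of the removed slot go to a single surviving slot. It moves each group to the surviving allocator's chain, rewrites the group's owner, and updates the Sub Alloc Slot of the inodes in the group while leaving the Sub Alloc Bit alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model, NOT the on-disk ocfs2 layout. */
struct toy_inode {
    uint16_t suballoc_slot;   /* slot whose allocator owns this inode */
    uint16_t suballoc_bit;    /* bit inside the group; never touched here */
};

struct toy_group {
    uint16_t owner_slot;      /* slot of the allocator this group hangs off */
    struct toy_group *next;   /* next group in the allocator's chain */
    struct toy_inode *inodes; /* inodes allocated from this group */
    size_t n_inodes;
};

struct toy_alloc {
    uint16_t slot;            /* slot number of this allocator */
    struct toy_group *head;   /* first group in the chain */
};

/*
 * Move every group from 'from' to 'to' (step 2) and update the
 * Sub Alloc Slot of the inodes inside each moved group (step 3).
 * Returns the number of groups moved.
 */
size_t toy_relink_groups(struct toy_alloc *from, struct toy_alloc *to)
{
    size_t moved = 0;

    assert(from != NULL && to != NULL && from != to);

    while (from->head != NULL) {
        struct toy_group *g = from->head;

        from->head = g->next;        /* unlink from the removed slot */
        g->owner_slot = to->slot;    /* descriptor now names the new parent */

        for (size_t i = 0; i < g->n_inodes; i++)
            g->inodes[i].suballoc_slot = to->slot;

        g->next = to->head;          /* link at the head of the new chain */
        to->head = g;
        moved++;
    }
    return moved;
}
```

In this model an inode that started with Sub Alloc Slot 3 and Sub Alloc Bit 1720 ends up with slot 0 and the same bit after its group is moved to slot 0, since the group moves as a whole and the bit is relative to the group.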

It should also be mentioned that if tunefs.ocfs2 fails while decreasing the slot count, fsck.ocfs2 must be able to recover from the resulting corruption, so that the volume is in good shape for another tunefs.ocfs2 attempt and for normal use.

INVESTIGATION ON CURRENT SYSTEM BEHAVIOR

tunefs.ocfs2 makes many modifications across the whole ocfs2 volume and may segfault after any of the steps described above, so we need to investigate how the current ocfs2 system responds in each case; future enhancements may also be needed.

Segfault after we link the groups in extent_alloc to other slots, but don't change the group descriptors

1. fsck.ocfs2 won't find any errors, since it does not check whether the Sub Alloc Slot is right.

2. If fsck.ocfs2 is not run first, mount works, but when we delete the files the fs finds the error and makes the volume read-only.

Segfault after we link the groups in inode_alloc to other slots, but don't change the group descriptors

1. fsck.ocfs2 will find the error "GROUP_PARENT" and fix it.

2. If fsck.ocfs2 is not run first, mount works, but when we delete the files the fs finds the error and makes the volume read-only.

Segfault after we finish all the work of link groups to other slots

Now all the inodes that were allocated from the old slots carry an invalid Sub Alloc Slot.

1. Run fsck.ocfs2 first, and this utility will find the error "INODE_SUBALLOC" and fix it.

2. If we mount the volume without running fsck.ocfs2, the ocfs2 kernel code cannot delete the inode. It just marks the inode as invalid and makes the file system read-only. The error message shows:

Segfault after we update the Sub Alloc Slot in all inodes in the moved groups

Now the system is in good condition. We have only changed the Sub Alloc Slot of some inodes and the position of some groups, which is fine for the volume as a whole.

Segfault after we release the clusters allocated to the orphan dir

1. If we mount the volume without running fsck.ocfs2, the kernel panics.

2. If we run fsck.ocfs2 first:

fsck.ocfs2 should be enhanced to allocate clusters to the orphan_dir rather than removing it. We may have to add a new pass that performs some special checks on the system dir, since currently system inodes are removed in pass1 by the same mechanism as normal inodes.

Segfault after we truncate the journal size to 0

1. fsck.ocfs2 does not detect the error, which is bad. We may need to add a mechanism that checks the journal header to see whether the journal file is in good shape.

2. mount fails and prints an error message.
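
The journal header check suggested in item 1 could look roughly like the sketch below. It is a minimal illustration, not fsck.ocfs2 code: the function names are hypothetical, and it assumes the journal begins with a JBD-style header whose first field is the big-endian magic number `0xc03b3998`. A journal truncated to size 0, or one whose first block lacks the magic, is reported as not sane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* JBD journal magic number; assumed to be the first big-endian field. */
#define TOY_JBD_MAGIC 0xc03b3998U

/* Read a big-endian 32-bit value from a byte buffer. */
uint32_t toy_be32(const unsigned char *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical sanity check for a journal file.
 * journal_size: size of the journal inode in bytes.
 * first_block:  contents of the journal's first block, or NULL.
 * Returns 1 if the journal looks usable, 0 if fsck should flag it.
 */
int toy_journal_looks_sane(uint64_t journal_size,
                           const unsigned char *first_block,
                           size_t block_len)
{
    if (journal_size == 0)
        return 0;                  /* truncated journal */
    if (first_block == NULL || block_len < 4)
        return 0;                  /* nothing to inspect */
    return toy_be32(first_block) == TOY_JBD_MAGIC;
}
```

A real check would likely also validate the block type and the journal length recorded in the journal super block; those are left out here to keep the sketch small.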

Segfault when removing the extra system dir entries

We have to carry out the two fixes above for the orphan_dir and the journal file.

Normal file's inode error: Sub Alloc Slot > initial_slots

What does the fs do when it is deleting an inode having a Sub Alloc slot > initial_slots?

Actually this can't happen, since we move all the inodes to other slots before decreasing the total slot count. But I have managed to create this scenario and investigate it. In this case the block groups have not been merged into another slot; only the initial slot count has been reduced.

1. Run fsck.ocfs2 first, and this utility will find the error "INODE_SUBALLOC" and fix it.

2. Don't run fsck.ocfs2 and mount the volume.

Duplicate groups or missing groups

When we relink the groups in extent_alloc and inode_alloc, there are two steps: deleting the group from the old inode and linking it to the new inode. Which should be carried out first, given that we may panic between the two steps?
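
The two possible orderings and the state each leaves behind after a crash can be made concrete with a toy model. This is only a sketch: the names are hypothetical and the model tracks nothing more than whether the old and the new allocator reference the group.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the two steps is performed first (hypothetical names). */
enum toy_order { TOY_LINK_FIRST, TOY_UNLINK_FIRST };

struct toy_state {
    bool in_old;   /* group still referenced by the removed slot's allocator */
    bool in_new;   /* group referenced by the surviving slot's allocator */
};

/*
 * Perform the two-step move. If crash_between is true, stop after the
 * first step to model a panic between the two steps.
 */
struct toy_state toy_move_group(enum toy_order order, bool crash_between)
{
    struct toy_state s = { .in_old = true, .in_new = false };

    assert(order == TOY_LINK_FIRST || order == TOY_UNLINK_FIRST);

    if (order == TOY_LINK_FIRST)
        s.in_new = true;     /* step one: link to the new inode */
    else
        s.in_old = false;    /* step one: delete from the old inode */

    if (crash_between)
        return s;

    s.in_new = true;         /* complete whichever step remains */
    s.in_old = false;
    return s;
}

/* Number of allocators that reference the group: 1 is the healthy value. */
int toy_refcount(struct toy_state s)
{
    return (int)s.in_old + (int)s.in_new;
}
```

In this model, a crash after linking first leaves the group referenced twice (a duplicate group), while a crash after unlinking first leaves it referenced by no allocator (a missing group). Which of the two states is easier for fsck.ocfs2 to detect and repair is the open question here.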

IMPLEMENTATION DETAIL

1. Divide the whole slot decrease into small repeated steps:

Since the user may reduce the slot count sharply, and we want to keep the work simple and make fsck.ocfs2's job easier, we break the process up so that only one slot is removed at a time. We set the incompat flag at the beginning (step 1), then run steps 2 to 6 repeatedly, removing one slot each time, until we reach the final slot count. Finally we clear the incompat flag and write the super block (steps 7 and 8). Because the slot count decreases gradually, the work already done is preserved if a panic or system file error occurs, and once fsck.ocfs2 has fixed the problem we can easily continue the process. This approach also makes fsck.ocfs2 more efficient: it knows the exact point of the panic, and its work is narrowed down to recovering a single slot. In general, the steps for removing one slot look like this.

     /* Link the last extent alloc file to other slots. */
     ret = relink_extent_alloc(fs, removed_slot, preserved_slots);
     if (ret)
             goto bail;

     /* Link the specified inode alloc file to others and
      * update all the Sub Alloc Slot in all inodes affected.
      */
     ret = relink_inode_alloc(fs, removed_slot, preserved_slots);
     if (ret)
             goto bail;

     /* Truncate the journal and orphan dir to release their
      * clusters to the global bitmap.
      */
     ret = truncate_journal_orphan_dir(fs, removed_slot);
     if (ret)
             goto bail;

     /* The extra system dir entries should be removed. */
     ret = remove_slot_entry(fs, removed_slot);
     if (ret)
             goto bail;

     /* The slot number is updated in the super block. */
     OCFS2_RAW_SB(fs->fs_super)->s_max_slots--;
     ret = ocfs2_write_super(fs);
     if (ret)
             goto bail;

2. Relinking inode_alloc and extent_alloc:

3. We will need a patch that forces an oops at each stage, which will allow us to test all the failure cases. As described above, fsck.ocfs2 also needs some improvements to work correctly with this new feature, so another test script is needed to run through each corruption scenario and test whether fsck.ocfs2 can fix all the problems.

4. fsck.ocfs2 has to be extended to check the Sub Alloc Slot of extent blocks. As shown above, an extent block with a wrong Sub Alloc Slot can cause the file system to go read-only.
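
The extent block check in item 4 amounts to a range test against the current slot count. The sketch below is a hypothetical illustration, not fsck.ocfs2 code: the `toy_*` names are invented, and it assumes every extent block is allocated from a per-slot extent_alloc, so a valid Sub Alloc Slot is always smaller than the slot count. It only counts offenders; how to repair them is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the extent block fields relevant to the check. */
struct toy_extent_block {
    uint64_t blkno;           /* block number of the extent block */
    uint16_t suballoc_slot;   /* slot whose extent_alloc owns the block */
    uint16_t suballoc_bit;    /* bit inside the owning group */
};

/* Returns 1 if the slot is valid on a volume with max_slots slots. */
int toy_suballoc_slot_valid(uint16_t suballoc_slot, uint16_t max_slots)
{
    return suballoc_slot < max_slots;
}

/* Count the extent blocks whose Sub Alloc Slot is out of range. */
size_t toy_count_bad_extent_blocks(const struct toy_extent_block *ebs,
                                   size_t n, uint16_t max_slots)
{
    size_t bad = 0;

    assert(ebs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!toy_suballoc_slot_valid(ebs[i].suballoc_slot, max_slots))
            bad++;
    }
    return bad;
}
```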


2011-12-23 01:01