

Add BLKDISCARD to mkfs.ocfs2


mkfs.ocfs2 should issue ioctl(BLKDISCARD) at the very start.

Allow custom uuid values (Checked in - commit ee646321d04df8cf3dabeee73e631c5cacd7f9e1)

Owner: Tiger Yang

Enhance tunefs.ocfs2 [-U | --uuid-reset ] to _possibly_ accept a uuid from the user. Currently it regenerates it automatically. With this change, the user will have the option of specifying a custom uuid. The tool will accept a hex string in the format shown below. It will validate each char as being a valid hex value but it will _not_ check for uniqueness.

# tunefs.ocfs2 -Q "%U\n" /dev/sda

# tunefs.ocfs2 --uuid-reset 1234567890ABCDEF1234567890ABCDEF /dev/sda

# tunefs.ocfs2 -Q "%U\n" /dev/sda
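The validation described above can be sketched roughly as follows. This is only an illustration, not the checked-in code: the helper name parse_uuid_hex and the UUID_BYTES constant are made up here, and the 32-character length is taken from the example command.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define UUID_BYTES 16   /* 32 hex characters, as in the example command */

/*
 * Hypothetical helper: convert a 32-character hex string into 16 raw
 * bytes.  Returns 0 on success, -1 if the length is wrong or any
 * character is not a hex digit.  Uniqueness is deliberately not checked.
 * On failure the contents of out are unspecified.
 */
int parse_uuid_hex(const char *str, unsigned char out[UUID_BYTES])
{
        size_t i;

        assert(out != NULL);
        if (str == NULL || strlen(str) != 2 * UUID_BYTES)
                return -1;

        for (i = 0; i < 2 * UUID_BYTES; i++) {
                unsigned char c = (unsigned char)str[i];
                unsigned char nibble;

                if (!isxdigit(c))
                        return -1;
                if (isdigit(c))
                        nibble = (unsigned char)(c - '0');
                else
                        nibble = (unsigned char)(toupper(c) - 'A' + 10);

                /* even index is the high nibble, odd index the low one */
                if (i % 2 == 0)
                        out[i / 2] = (unsigned char)(nibble << 4);
                else
                        out[i / 2] |= nibble;
        }
        return 0;
}
```

Both upper and lower case digits are accepted in this sketch; whether the real option should be that lenient is a separate decision.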

b-tree fsck


ocfs2 now uses b-trees as the basic structure for storing different kinds of data: xattrs, file data, refcount trees, etc. The b-tree checks in fsck.ocfs2 are therefore very important.

Currently, if we find an error in check_er, we set er->e_blkno = 0 so that check_el can find the record and remove it (by moving all the extent recs after this extent rec ahead one position). The code looks like this:

    if (ocfs2_block_out_of_range(ost->ost_fs, er->e_blkno) &&
        prompt(ost, PY, PR_EXTENT_BLKNO_RANGE,
               "Extent record %u in owner %"PRIu64" "
               "refers to a block that is out of range.  Remove "
               "this record from the extent list?", i, owner)) {

            if (!trust_next_free) {
                printf("Can't remove the record becuase "
                       "next_free_rec hasn't been fixed\n");
                continue;
            }
            cpy = (max_recs - i - 1) * sizeof(*er);
            /* shift the remaining recs into this ones place */
            if (cpy != 0) {
                memcpy(er, er + 1, cpy);
                memset(&el->l_recs[max_recs - 1], 0,
                       sizeof(*er));
            }
            el->l_next_free_rec--;
            *changed = 1;
    }

It has several problems:

1. In an ocfs2 b-tree, an empty extent is only allowed in ocfs2_extent_list.l_recs[0], so we should really move the records before the bad one to leave an empty record in recs[0].

2. It works when the tree has depth == 0, but for depth >= 1 it is totally broken. It changes l_next_free_rec, and that will later turn the mounted ocfs2 volume read-only if the following code is called.

        if (left_el->l_next_free_rec != left_el->l_count) {
                ocfs2_error(ocfs2_metadata_cache_get_super(et->et_ci),
                            "Inode %llu has non-full interior leaf node %llu"
                            "(next free = %u)",
                            (unsigned long long)ocfs2_metadata_cache_owner(et->et_ci),
                            (unsigned long long)left_leaf_bh->b_blocknr,
                            le16_to_cpu(left_el->l_next_free_rec));
                return -EROFS;
        }

So the right step, when we find a bad extent rec, is to leave the 1st record empty and not change l_next_free_rec. This is only the first step. What if the extent list already has an empty record? Please note that we can't run ocfs2_rotate_tree_left at this point to move all the extent recs in the later extent blocks ahead one position, since they haven't been checked yet. So the options I can think of are:

1. If this is a leaf record that has leaf_rec_clusters >= 2, we can split it to keep the tree coherent.

2. Record a flag so that, once we have finished checking the whole tree, we call ocfs2_rotate_tree_left to make the whole tree coherent.

The steps above are all about handling corruption in a leaf extent record. What if the problem is in a branch extent record? We have no choice but to rotate the tree left after the whole b-tree has been checked. Currently we have no code for rotating branch extent records, so either the b-tree rotation code needs some improvements or fsck.ocfs2 gets its own hacky code.
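To make the first step concrete (leave recs[0] empty and keep l_next_free_rec as is), here is a rough sketch on simplified structures. The struct and function names are stand-ins invented for this note, not the libocfs2 types, and treating clusters == 0 as "empty" is an assumption of the sketch. The case where recs[0] is already empty is refused, since that is exactly the case that needs a split or a later rotation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RECS 8

/* Simplified stand-ins for the on-disk extent record and extent list */
struct rec {
        uint32_t cpos;
        uint32_t clusters;      /* 0 means "empty" in this sketch */
        uint64_t blkno;
};

struct list {
        uint16_t count;         /* capacity of recs[] */
        uint16_t next_free_rec; /* left untouched by the removal */
        struct rec recs[MAX_RECS];
};

/*
 * Drop recs[bad] by sliding recs[0..bad-1] one slot to the right and
 * zeroing recs[0].  Returns 0 on success, -1 if bad is out of range or
 * if recs[0] is already empty (no room for a second empty record).
 */
int drop_rec_keep_count(struct list *el, unsigned int bad)
{
        assert(el != NULL);
        if (bad >= el->next_free_rec || bad >= MAX_RECS)
                return -1;
        if (bad != 0 && el->recs[0].clusters == 0)
                return -1;

        /* regions overlap, so memmove rather than memcpy */
        memmove(&el->recs[1], &el->recs[0], bad * sizeof(el->recs[0]));
        memset(&el->recs[0], 0, sizeof(el->recs[0]));
        return 0;
}
```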



So while cp(1) can work as a splitclone, I think a utility to convert a file in place would be very useful, especially if the file is "huge". An in-place conversion would not only be quicker, but would also not pollute the pagecache.

Are we going to do this in-kernel? If so, we need to try to share an API with btrfs, etc.

-- Joel

Yes, in-kernel. I don't see how we could do it safely in user-space. If so, do correct me. Yes, having a common api would be ideal. Also, we need a better name than split clone. Or, we have to rename reflink().

-- Sunil

Extend du(1)

Extend the du(1) utility to make it show extended usage. du --extended will call fiemap() to get the extent layout of the files. It will maintain a bitmap of allocated blocks and also maintain the count of the number of shared blocks. The output will show the allocated blocks and the shared blocks for each file. The summary will show the usage figure the user should expect if all files were "unshared".

Extend stat(1)


Extend stat(1) utility to make it display the usage. stat --usage will show the number of blocks allocated, unwritten, holes, shared, etc. The tool could use the fiemap() call to get the information for the regular file.
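The accounting that du --extended and stat --usage would do on top of the extent list can be sketched like this. It is only a sketch: the struct, the function name and the two flag bits are invented here, and a real tool would fill the extent array from fiemap and use whatever flags FIEMAP ends up defining. The extents are assumed to be sorted by logical offset and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the FIEMAP extent flags */
#define EXT_UNWRITTEN 0x1u
#define EXT_SHARED    0x2u

struct ext {
        uint64_t logical;       /* byte offset in the file */
        uint64_t length;        /* bytes */
        uint32_t flags;
};

struct usage {
        uint64_t allocated, unwritten, shared, holes;   /* all in bytes */
};

/* Walk the extents once, counting gaps between them as holes */
struct usage account_usage(const struct ext *e, size_t n, uint64_t file_size)
{
        struct usage u = {0, 0, 0, 0};
        uint64_t pos = 0;
        size_t i;

        assert(e != NULL || n == 0);
        for (i = 0; i < n; i++) {
                if (e[i].logical > pos)
                        u.holes += e[i].logical - pos;
                u.allocated += e[i].length;
                if (e[i].flags & EXT_UNWRITTEN)
                        u.unwritten += e[i].length;
                if (e[i].flags & EXT_SHARED)
                        u.shared += e[i].length;
                pos = e[i].logical + e[i].length;
        }
        /* anything between the last extent and i_size is a trailing hole */
        if (file_size > pos)
                u.holes += file_size - pos;
        return u;
}
```

For the du summary, the "unshared" figure would need each shared physical range counted once per file, which is why the text above talks about keeping a bitmap of allocated blocks; this sketch does not attempt that.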

FIEMAP extension in kernel

Owner: Sunil Mushran

Extend FIEMAP to make it identify shared extents. Patches have been emailed.


Owner: Sunil Mushran

Tool to punch holes in files. The tool will read the file in chunks and, if a chunk is all zeroes, punch a hole in it. By default, the utility should not punch holes wherever it can; it should be conservative and look for large areas. (The definition of large is still up for grabs.) --max-compact makes the tool punch holes wherever it can. --dry-run should give the user a feel for what to expect without actually making any changes. The tool expects the user not to be using the file; the tool takes no locks.
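A possible shape for the scanning side, as a sketch only. The function names and the min_run parameter are invented here; min_run stands in for the still undefined notion of "large", and min_run == 1 would correspond to --max-compact. The function only counts what would be punched, which is roughly what --dry-run needs.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the buffer contains only zero bytes */
int chunk_is_zero(const unsigned char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++)
                if (buf[i] != 0)
                        return 0;
        return 1;
}

/*
 * Scan data in chunk-sized pieces and return how many bytes would be
 * punched.  A run of consecutive all-zero chunks only counts if it is
 * at least min_run chunks long.  A trailing partial chunk is ignored.
 */
size_t punchable_bytes(const unsigned char *data, size_t len,
                       size_t chunk, size_t min_run)
{
        size_t off, run = 0, total = 0;

        assert(data != NULL || len == 0);
        if (chunk == 0)
                return 0;

        for (off = 0; off + chunk <= len; off += chunk) {
                if (chunk_is_zero(data + off, chunk)) {
                        run++;
                        continue;
                }
                if (run >= min_run)
                        total += run * chunk;
                run = 0;
        }
        /* account for a zero run that reaches the end of the data */
        if (run >= min_run)
                total += run * chunk;
        return total;
}
```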

Multipath-aware mounted.ocfs2


mounted.ocfs2 needs to be multipath aware.

See scandisk.c (among other source files) in the scan-new branch of oracleasm-support.



We need a tool to provide fs information to the user. Currently, we are providing some information via tunefs.ocfs2. But that's limited to scalar (superblock) values. What we are missing is information like global bitmap free space fragmentation or free inode info for each slot. On a related note, e2fsprogs recently added e2freefrag tool to provide the free space fragmentation information. Very useful.

Also, as this is an info tool, it should allow all users to use it and not just ones that have the read priv on the device. o2info /path/to/file/on/ocfs2/vol can use OCFS2_IOC_INFO ioctl to get the info from the fs. If the information is sensitive, it could be restricted to CAP_SYS_RAWIO. o2info /dev/sdX1 can use libocfs2 to access the same information.

We need an ioctl (as much as I've tried to think of an alternative solution) to allow non-privileged info gathering. This ioctl, OCFS2_IOC_INFO, must be forwards and backwards compatible. Any version of o2info must work with any version of ocfs2 that has OCFS2_IOC_INFO for sane behavior to result.

The way to do this is with distinct information requests. OCFS2_IOC_INFO is called with a NULL-terminated array of pointers to these requests. The kernel driver then fills in the requests it understands and ignores the ones it does not. The requests must be simple enough that they don't need extending. For example, if you had a "OCFS2_INFO_SUPERBLOCK" request that had all the superblock information, you'd have to extend it every time you changed the superblock. That's bad. Instead, you would have something like OCFS2_INFO_CLUSTERSIZE. It means more requests in the chain, but it also means you can add OCFS2_INFO_XATTR_HASH at a later time without breaking OCFS2_INFO_CLUSTERSIZE.

I envision the requests looking like this:

/* Magic number of all requests */
#define OCFS2_INFO_MAGIC            0x4F32494E  /* Magic number for requests */
/* Flags for struct ocfs2_info_request */
/* Filled by the caller */
#define OCFS2_INFO_FL_NON_COHERENT  0x00000001  /* Cluster coherency not required.
                                                   This is a hint.  It is up to
                                                   ocfs2 whether the request can
                                                   be fulfilled without locking. */
/* Filled by ocfs2 */
#define OCFS2_INFO_FL_FILLED        0x80000000  /* Filesystem understood this
                                                   request and filled in the
                                                   answer */

struct ocfs2_info_request {
        __u32 ir_magic;    /* Magic number */
        __u32 ir_code;     /* Info request code */
        __u32 ir_size;     /* Size of request */
        __u32 ir_flags;    /* Request flags */
        /* request-specified fields */
};

struct ocfs2_info_request_clustersize {
        struct ocfs2_info_request ir_request;
        __u32 ir_clustersize;
};

struct ocfs2_info_request_features {
        struct ocfs2_info_request ir_request;
        /* request-specified fields */
};

And you'd use it like this:

    struct ocfs2_info_request_clustersize crq = {0, };
    struct ocfs2_info_request_blocksize brq = {0, };
    /* pass the embedded base request so the pointer types match */
    struct ocfs2_info_request *info[3] = {&crq.ir_request, &brq.ir_request, NULL };

    rc = ioctl(fd, OCFS2_IOC_INFO, info);

While it is a little cumbersome, it means we never have problems adding new information in the future. Folks can test whether a filesystem supports a given request by checking for the FL_FILLED flag. Convenience functions can probably pre-do some of the boilerplate; imagine something that has static request structures and just returns you an array of the ones you specified by request code. Etc.
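To show how the "fill what you understand, skip the rest" rule and the FL_FILLED check could work together, here is a userspace mock. Nothing here is the real interface: fake_info_ioctl stands in for the driver, the request codes and the returned values are made up, and the names are shortened copies of the ones proposed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INFO_MAGIC      0x4F32494Eu
#define INFO_FL_FILLED  0x80000000u

/* Hypothetical request codes; 99 plays a code this "driver" predates */
enum { INFO_CLUSTERSIZE = 1, INFO_BLOCKSIZE = 2, INFO_FROM_THE_FUTURE = 99 };

struct info_request {
        uint32_t ir_magic, ir_code, ir_size, ir_flags;
};

/* One 32-bit answer per request, as with the clustersize example */
struct info_request_u32 {
        struct info_request ir_request;
        uint32_t ir_value;
};

/* Convenience initializer for the boilerplate fields */
void info_init(struct info_request_u32 *v, uint32_t code)
{
        assert(v != NULL);
        v->ir_request.ir_magic = INFO_MAGIC;
        v->ir_request.ir_code = code;
        v->ir_request.ir_size = sizeof(*v);
        v->ir_request.ir_flags = 0;
        v->ir_value = 0;
}

/* Mock driver: walks the NULL-terminated array of request pointers */
int fake_info_ioctl(struct info_request **reqs)
{
        for (; *reqs != NULL; reqs++) {
                struct info_request *r = *reqs;
                struct info_request_u32 *v = (struct info_request_u32 *)r;

                if (r->ir_magic != INFO_MAGIC)
                        return -1;
                if (r->ir_size != sizeof(*v))
                        continue;       /* not a layout we understand */

                switch (r->ir_code) {
                case INFO_CLUSTERSIZE:
                        v->ir_value = 8192;     /* made-up answer */
                        break;
                case INFO_BLOCKSIZE:
                        v->ir_value = 4096;     /* made-up answer */
                        break;
                default:
                        continue;       /* unknown request: leave it alone */
                }
                r->ir_flags |= INFO_FL_FILLED;
        }
        return 0;
}
```

A caller would build the array with info_init(), make the call, and then trust only the requests that came back with the filled flag set.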

-- Joel

Cluster (in)coherency can be an option for the mounted case. Right now I don't know which would be the better default.

The default should be cluster coherency. As this is a new tool, we don't have any previous behaviors to support. I think folks would expect cluster coherency in the information. If they are worried about performance, they can ask for non-coherent results.

-- Joel

List of options:

Free space fragmentation. See e2freefrag's manpage.
# o2info --freefrag 8192 /some/file/or/device
Blocksize: 4096 bytes
Clustersize: 4096 bytes
Total clusters: 1504085
Free clusters: 292995 (19.5%)

Min. free extent: 4 KB
Max. free extent: 24008 KB
Avg. free extent: 252 KB

Chunksize: 8388608 bytes (2048 clusters)
Total chunks: 242
Free chunks: 189 (78.1%)

Extent Size Range :   Free extents   Free Clusters Percent
    4K...    8K- :           704           704     0.2%
    8K...   16K- :           810          1979     0.7%
   16K...   32K- :           843          4467     1.5%
   32K...   64K- :           579          6263     2.1%
   64K...  128K- :           493         11067     3.8%
  128K...  256K- :           394         18097     6.2%
  256K...  512K- :           281         25477     8.7%
  512K... 1024K- :           253         44914    15.3%
    1M...    2M- :           143         51897    17.7%
    2M...    4M- :            73         50683    17.3%
    4M...    8M- :            37         52417    17.9%
    8M...   16M- :             7         19028     6.5%
   16M...   32M- :             1          6002     2.0%
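The histogram part of --freefrag could be computed along these lines. This is a sketch under simplifying assumptions: the bitmap is one byte per cluster rather than one bit, the names are invented, and bucket b simply holds free extents of 2^b up to (but not including) 2^(b+1) clusters, which matches the "4K... 8K-" style ranges above when multiplied by the cluster size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NBUCKETS 32

struct freefrag {
        unsigned long extents[NBUCKETS];   /* free extents per size range */
        unsigned long clusters[NBUCKETS];  /* free clusters per size range */
        unsigned long free_total;          /* all free clusters */
};

/* Bucket index is floor(log2(len)); len must be >= 1 */
static unsigned int bucket_of(unsigned long len)
{
        unsigned int b = 0;

        while (len > 1) {
                len >>= 1;
                b++;
        }
        return b < NBUCKETS ? b : NBUCKETS - 1;
}

static void record_run(struct freefrag *ff, unsigned long len)
{
        unsigned int b = bucket_of(len);

        ff->extents[b]++;
        ff->clusters[b] += len;
        ff->free_total += len;
}

/* bitmap[i] != 0 means cluster i is allocated (one byte per cluster) */
void scan_free_extents(const unsigned char *bitmap, size_t nclusters,
                       struct freefrag *ff)
{
        unsigned long run = 0;
        size_t i;

        assert(ff != NULL);
        memset(ff, 0, sizeof(*ff));
        for (i = 0; i < nclusters; i++) {
                if (!bitmap[i]) {
                        run++;
                        continue;
                }
                if (run)
                        record_run(ff, run);
                run = 0;
        }
        if (run)
                record_run(ff, run);
}
```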

Free inode space for all slots.
# o2info --freeinode /some/file/or/device
Slot     Space       Free
  1     500000       1000
  2       2000        100
  3      23000          0
  4          0          0
Total   525000       1100

Extended inode stat information. Standard stat + frag (see command in debugfs.ocfs2) + shared space (refcount) + unwritten space
(We should not need a special ioctl for this. Use standard stat and fiemap.)
# o2info --filestat Makefile
  File: `Makefile'
  Size: 13676           Blocks: 32         IO Block: 4096   regular file
 Frag%: 0.01          Clusters: 32          Extents: 2      Score: 1
Shared: 10           Unwritten: 0             Holes: 0
Device: 807h/2055d       Inode: 14521580      Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1009/smushran)   Gid: ( 1009/smushran)
Access: 2009-10-15 14:31:05.524111391 -0700
Modify: 2009-10-15 14:21:41.740122112 -0700
Change: 2009-10-15 14:21:41.752116012 -0700

# o2info --usage Makefile
Blocks: 32    Shared: 10    Unwritten: 0    Holes: 0

Compat, Incompat and RO Compat features.
# o2info --fs-features
backup-super strict-journal-super sparse inline-data unwritten

Blocksize, Clustersize, Numslots, Label and UUID.
# o2info --volinfo
  Block Size: 4096
Cluster Size: 8192
  Node Slots: 32
       Label: TestMe
        UUID: 84C09DEAF8384484ABC3D53F186299A8

Duplicate Prompt Codes in fsck.ocfs2 (Checked in - commit 52fdd9b46fc61fdd7c6ad5b4ea73104ab2fdf28d)

Owner: tristan.ye

fsck.ocfs2 is supposed to only use each prompt code once. However, about four of the codes are used in multiple places. For example, INODE_CLUSTERS is used for an inaccurate cluster count in three places. The thing is, an inaccurate cluster count is different for inline inodes, sparse inodes, and non-sparse inodes. We should make distinct codes for each case. INODE_CLUSTERS should stay for non-sparse inodes, because that was the original meaning. INODE_SPARSE_CLUSTERS and INODE_INLINE_CLUSTERS should be added for the newer cases.

The same should be done for the other duplicates. You can see them via the command:

$ make -C fsck.ocfs2/ check-prompt-dups

fswreck Bugfixes (Checked in - commit 1f774d917ef6ddb3cabd465a7c858ed3d3c63304)

Owner: tristan.ye

fswreck fails a lot on big endian platforms. In addition, it doesn't work in the presence of some new features, such as inline-data. These bugs need to be fixed. The big endian fixes might just be in the libocfs2 swap code. The known inline-data problems are in the directory code, where it tries to add directory entries to dirblocks even for inline directory inodes. Corruptions using add_ent_to_dir() function really should be using ocfs2_link() instead. The corruptions that directly modify dirents need to understand inline dirs.

Offline Defrag


We don't have to get too fancy as a first pass, just the ability to defragment the file system, especially getting rid of unused inode/extent block groups. We could even have a mode which allows for the utility to move inodes around (which it shouldn't do by default).


Owner: Goldwyn Rodrigues

e2fsprogs seems to get some defaults from /etc/mke2fs.conf; we might want to consider the same thing. It could be useful to system admins who create ocfs2 file systems often.

Cluster Verification Utility (Done)

Owner: Xiaowei.hu <xiaowei.hu@oracle.com>

check patch file: o2cb_vertify.sh

This yet-to-be-named tool is expected to read the /etc/ocfs2/cluster.conf and verify its contents. It should use ssh to ensure the validity of the hostnames and the ip addresses.

Directory Optimization in fsck.ocfs2

e2fsck has an option '-D' that performs directory compression. It removes empty directory entries and compresses the directory allocation where it can. We might want something like that, as we have the same directory format.

Support fsck codes in fswreck (Checked in - commit 8d847f1eccff77d013e57de25cea46cb4dad3673)

Owner: tristan.ye

The current fswreck corrupt codes do more than one corruption and have no relation to the fsck.ocfs2 codes. fswreck needs a new option that can take one or more fsck.ocfs2 codes and do exactly those corruptions.

What I mean is that I should be able to see a prompt code in fsck.ocfs2 and ask fswreck to create exactly that. If I see PR_ROOT_DIR_MISSING in the fsck code, I should be able to do:

# fswreck -k ROOT_DIR_MISSING /dev/sdb2

(I picked '-k' arbitrarily. If you have a better option character, speak up.)

The '-k' option should take a comma-separated list of prompts. So you could do any combination you want:

# fswreck -k ROOT_DIR_MISSING,CHAIN_COUNT /dev/sdb2

I suspect the best way to do this is to have fswreck include fsck's prompt-codes.h. We can even print warnings for prompt codes that aren't yet supported.
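Parsing the '-k' argument could look roughly like this. The table here is a hand-written stand-in with just the three codes mentioned on this page; the idea in the text is that the real list would come from fsck's prompt-codes.h. The function name and the 256-byte limit are invented for the sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset; the real list would come from prompt-codes.h */
static const char *const known_codes[] = {
        "ROOT_DIR_MISSING", "CHAIN_COUNT", "EXTENT_BLKNO_RANGE",
};
#define NUM_KNOWN (sizeof(known_codes) / sizeof(known_codes[0]))

/*
 * Parse "A,B,C" into indexes into known_codes[].  Unknown names get a
 * warning and are skipped.  Returns the number of codes stored in out,
 * or -1 if the argument is too long or out is too small.
 */
int parse_codes(const char *arg, int *out, int max_out)
{
        char buf[256];
        char *tok;
        int n = 0;

        assert(arg != NULL && out != NULL);
        if (strlen(arg) >= sizeof(buf))
                return -1;
        strcpy(buf, arg);       /* strtok modifies its input */

        for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
                size_t i;

                for (i = 0; i < NUM_KNOWN; i++)
                        if (strcmp(tok, known_codes[i]) == 0)
                                break;
                if (i == NUM_KNOWN) {
                        fprintf(stderr, "warning: prompt code %s is not "
                                "supported yet\n", tok);
                        continue;
                }
                if (n == max_out)
                        return -1;
                out[n++] = (int)i;
        }
        return n;
}
```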

Sparse Support in tools (Checked in - commit 09bb6a876fa19608654a6bd174eec221ecf77763)


tunefs, mkfs and debugfs need to be made sparse aware. Apart from the requirement that we add support to read sparse files in libocfs2, we also need to allow for users to convert an existing volume to support sparse files (it should clear out the data between i_size and i_clusters for all files and then set the compat flag). tunefs should also allow users to go back, as in detect files that have holes, make another tree with no holes, copy data from old to new, point the inode to the new tree and clean up old tree.

I/O read cache in tools (Checked in - commit 23b486d32c47b6f48ce659d308a8307fd5480990)

Owner: JoelBecker

The tools currently read and write blocks from O_DIRECT only. Some future work, such as the sparse file code, will want to reread blocks with some frequency. For performance, the library needs a read cache underneath those reads. We add a simple cache that is transparent to io_read/write_block() when enabled. It is not cluster safe; programs need to lock out the cluster before enabling it. This code is complete, but will not land in trunk until we're ready. It's on the iocache branch.
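For readers who have not looked at the iocache branch, the general idea of a transparent read cache can be sketched as below. This is not the code on that branch: it is a tiny direct-mapped cache with invented names, and demo_backend is a fake device that stands in for the O_DIRECT read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_BLOCK_SIZE 512
#define CACHE_SLOTS      4

/* Hypothetical backend read; a real tool would read from the device */
typedef int (*read_fn)(uint64_t blkno, void *buf);

struct cache_slot {
        int valid;
        uint64_t blkno;
        unsigned char data[CACHE_BLOCK_SIZE];
};

struct io_cache {
        read_fn backend;
        unsigned long hits, misses;
        struct cache_slot slots[CACHE_SLOTS];
};

/* Fake device for the sketch: every byte of block N is (N & 0xff) */
int demo_backend(uint64_t blkno, void *buf)
{
        memset(buf, (int)(blkno & 0xff), CACHE_BLOCK_SIZE);
        return 0;
}

void cache_init(struct io_cache *c, read_fn backend)
{
        assert(c != NULL && backend != NULL);
        memset(c, 0, sizeof(*c));
        c->backend = backend;
}

/* Serve from the cache when possible, otherwise read and remember */
int cached_read(struct io_cache *c, uint64_t blkno, void *buf)
{
        struct cache_slot *s = &c->slots[blkno % CACHE_SLOTS];
        int rc;

        if (s->valid && s->blkno == blkno) {
                c->hits++;
                memcpy(buf, s->data, CACHE_BLOCK_SIZE);
                return 0;
        }

        c->misses++;
        rc = c->backend(blkno, buf);
        if (rc)
                return rc;
        s->valid = 1;
        s->blkno = blkno;
        memcpy(s->data, buf, CACHE_BLOCK_SIZE);
        return 0;
}
```

A write path would also have to update or invalidate the matching slot, and nothing here knows about other nodes, which is the reason the cluster has to be locked out first.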

Next, someone needs to add caching to the tools. Each tool needs to be examined and caching enabled where appropriate.

Add heartbeat detection in mounted.ocfs2


The existing tool does two tasks. First, it scans all the partitions in /proc/partitions and detects and lists all the OCFS2 volumes. Its other task is to list all the nodes in the //slot_map as "active" nodes. The latter has a problem. If an OCFS2 volume was not cleanly unmounted, the data in the slot_map is invalid. The only way to fix this is to read the //heartbeat and detect whether there are any live nodes in the cluster. The //slot_map data should be disregarded if no nodes are found to be alive.
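One way the liveness check could be done, sketched under an assumption that is not spelled out above: that each node's heartbeat block carries a sequence number which a live node keeps changing, so reading the heartbeat area twice with a pause in between shows who is alive. The function names are invented, and reading the two samples from //heartbeat is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * before[] and after[] hold one heartbeat sequence number per node,
 * sampled some seconds apart.  A node counts as live if its sequence
 * number moved between the two samples.
 */
size_t count_live_nodes(const uint64_t *before, const uint64_t *after,
                        size_t nnodes)
{
        size_t i, live = 0;

        assert((before != NULL && after != NULL) || nnodes == 0);
        for (i = 0; i < nnodes; i++)
                if (before[i] != after[i])
                        live++;
        return live;
}

/* The slot_map is only worth reporting if somebody is heartbeating */
int slot_map_is_usable(const uint64_t *before, const uint64_t *after,
                       size_t nnodes)
{
        return count_live_nodes(before, after, nnodes) > 0;
}
```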

Commands in debugfs.ocfs2


debugfs.ocfs2 is missing support for some 20+ commands from standard debugfs. The "debugfs commands" page has the list of all the commands. As always, before implementing, ping the group because not all the commands are appropriate for OCFS2.

tunefs.ocfs2 update for remove slots (Completed as of r1384)

Owner: TaoMa

mkfs.ocfs2 formats an ocfs2 volume with the number of node slots set by the user. As time goes by, the user may want to change that number to fit his/her own needs. tunefs.ocfs2 can already add slots while the cluster is offline; this design document is about how to decrease the number of slots.

In some setups the volume has many unused slots, each with a very large empty journal file. By decreasing the number of slots we can remove the corresponding journal files and reuse the disk space, which is another reason for this tunefs.ocfs2 update.

fswreck (Completed as of r1266)

Owner: TaoMa

This is a file system wrecker tool and is envisioned as a test tool for fsck.ocfs2. The idea is that this tool will corrupt specific blocks as per the option provided by the user. The test tool will then run fsck.ocfs2 to ensure that it does detect and fix the corruption. The full list of "corruption types" is listed in fsck.ocfs2.checks.8.

Offline Resize (Completed as of r1251)

Owner: Sunil Mushran

This feature should allow one to grow an offlined OCFS2 volume. It should be able to handle SIGSEGVs or any other event(s) that can cause the resize to abort. A volume whose resize was aborted should not be allowed to mount until one explicitly fsck's it. This can be handled by stamping an INCOMPAT flag at the start of the resize that is cleared at the end. fsck, on detecting that flag, should trigger a full scan that not only fixes the global bitmap but also makes the superblock (i_clusters) consistent with it.

2012-11-08 13:01