What's New in OCFS2.

o Disk format now deals in blocks and clusters, not bytes.
  - Disks can only deal with blocks, so why keep byte offsets? Object
    addresses (inodes, extent metadata) are now in blocks, not bytes.
  - Files are allocated in clusters, so why keep track of anything
    smaller? Extents are now stored on disk as a
    (file-virtual-cluster-offset, num-clusters, disk-block-offset)
    tuple. Total file allocation is also kept in clusters.

o Blocksize is now variable, decided at mkfs time.
  - OCFS only supported blocks of 512 bytes. OCFS2 supports a
    blocksize of 512B, 1K, 2K, or 4K. The maximum is 4K because the
    minimum cluster size is 4K.
  - The filesystem truly deals with these blocksizes as a proper
    filesystem should. Block offsets are in terms of this blocksize.
  - With a 4K blocksize, systems like zSeries and iSeries work.
    Multinode zSeries is already part of the testing.

o Filesystem is now on-disk compatible with multiple architectures.
  - OCFS wrote on disk from x86 (32bit little-endian), but would not
    write the same thing from larger (64bit) or different (big-endian)
    systems.
  - OCFS2 will be endian and size neutral. All on-disk structures will
    be identical whether written from a 32bit or 64bit machine, little
    or big endian.
  - All OCFS2 metadata is little-endian, except for the network IPC
    configuration, which is in network byte order.

o Extent metadata is reorganized.
  - In addition to the new (cluster-off, num-clusters, disk-block-off)
    tuple, the list of extents is separated from the blocks that
    contain it. The structure describing a list of extents is now used
    in the inode and in an extent metadata block.
  - An extent metadata block now describes its place in the tree. All
    details about sub-blocks and child extents are contained in the
    extent list object at the end of the block.
  - The extent list at the end of an extent block or an inode now
    fills out the entire block.
    With the size reductions and this dynamic sizing, an inode, which
    used to hold 3 extent records in OCFS, now holds 20 if the
    blocksize is the same 512B. If the blocksize grows to 4K, the
    inode can hold around 150 extent records all by itself.
  - Extent blocks have a similar increase. In OCFS, a 512B extent
    block would hold 17 extent records. In OCFS2, it holds 28. With a
    4K blocksize, it goes near 170.
  - This means that the filesystem metadata tree uses far less space
    and stays far less complex.

o Allocation of data areas is now node-local.
  - A node-local allocation area exists for each node, so that file
    data allocation does not contend on the global allocation bitmap.
  - This allocation area is lockless (to the cluster), providing great
    speed.

o The on-disk inode is reorganized.
  - Allocations are in clusters. Modes are valid modes instead of the
    random flags of OCFS.
  - All objects in the filesystem are represented by the disk inode.
    The tail of the disk inode can be an extent list, a bitmap for a
    node-local allocation area, or the super block information.
  - This means that the global allocation bitmap is now just an extent
    of a system inode. If the disk grows, the filesystem can grow as
    well by adding more extents to this system inode.

o In-memory inodes are directly connected to on-disk inodes.
  - In OCFS, the in-memory inode was somewhat related to, but not
    connected to, the on-disk file entry.
  - OCFS2 has a 1:1 mapping, and the lifetime is exact.
  - Inode number is POSIX (it won't change underneath an open file).

o The superblock is completely new.
  - The superblock is an inode, with the superblock-specific
    information at the tail of the disk inode.
  - The superblock lives at block 3 on disk. This is a block offset,
    so the blocksize affects its location.
  - The first two sectors (512B sectors, not in terms of blocksize)
    are valid OCFS sectors with a version and signature that will
    cause OCFS to fail to mount the volume.
    This way, OCFS can cleanly avoid the volume, but OCFS2 can easily
    detect it.

o System inodes are now dynamically located.
  - OCFS put its system inodes at a magic offset. Each system file
    type had 32 inodes, one per possible node, at a given offset.
    OCFS2 has a directory of system inodes. Because it is a valid
    filesystem directory, the actual location of the inodes can be
    anywhere. The directory can be added to, adding system inodes for
    a larger node set when necessary.
  - All system objects are in this hidden system directory. The
    superblock contains the offset of the hidden system directory.

o No magic first mount.
  - OCFS had a magic first mount, where mkfs.ocfs would only do half
    the work of initializing the filesystem, and the first mount of
    the filesystem would finish the initialization.
  - OCFS2 finishes formatting in mkfs.ocfs2. It is a complete format.
  - Mounting the first time is identical to mounting the 1000th time,
    as far as the filesystem driver is concerned.

o System inodes are proper inodes.
  - There is far less special case code to handle system inodes
    differently than regular inodes.
  - Locking can be done just as with regular inodes.

o Everything-as-inode makes the code much simpler, conceptually.
  - All locking follows the same pattern.
  - All allocation follows the same pattern.
  - All journaling follows the same pattern.
  - All I/O and caching follow the same pattern.
  - Makes the system much easier to debug and extend.

o Memory usage is greatly reduced.
  - OCFS used many, many temporary buffers.
  - OCFS did many, many memory copies.
  - OCFS2 uses the Linux buffer cache directly, removing the copying
    and reducing the memory used.
  - Special hashtables for filesystem data are gone, as the in-memory
    inode linkage is all that is needed.
  - No more vmalloc.

o Vastly simpler DLM operation.
  - Allows holding of locks for long periods, releasing when other
    nodes need the locks.
  - Reduces the number of lock levels, simplifying operation.
  - Integrated with kjournald to keep locks held until operations are
    committed and visible to other nodes.

o Number of nodes is more flexible.
  - OCFS has a limit of 32 nodes. System inodes are allocated for all
    32 nodes at format time, even if two nodes are all that are ever
    used.
  - OCFS2 can support up to 256 nodes. This limit is in software only,
    and could be lifted if that is ever needed.
  - OCFS2 does not create system inodes for all nodes at format time.
    The number of initial nodes is specified to mkfs (this defaults to
    4) and can be increased offline via tuneocfs2.

o Journaling is done via the Journaling Block Device (JBD).
  - All operations are journaled, allowing full transactionality.
  - Each node has its own journal. If a node crashes, other nodes can
    recover its journal to bring the outstanding changes up to date.
  - With proper journaling, caching of information becomes possible.
  - OCFS only journaled a few operations, and often failed to recover
    dead nodes.
  - OCFS often relied on disk write ordering to protect consistency,
    resulting in synchronous I/O everywhere.

o Metadata and data can be cached.
  - Unlike OCFS, which only really provided consistency with O_DIRECT
    I/O, OCFS2 caches file data and metadata in memory.
  - Changes on one node are seen on other nodes, yet the originating
    node does not pay the penalty of the disk I/O, as the operation is
    done in cache.

o I/O is now asynchronous.
  - With caching and lock holding, sync I/Os are a thing of the past.
  - When an I/O has to go out, it is done in the background.

o More than one operation at a time.
  - OCFS only allowed one outstanding operation at a time on a node.
    This was the only way to safely access filesystem data.
  - OCFS2 now locks all I/O operations, allowing a fully threaded
    implementation. Many processes can run filesystem operations in
    parallel.
  - In-memory structures are properly protected, so parallel processes
    do not corrupt each other.
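To illustrate the (file-virtual-cluster-offset, num-clusters,
disk-block-offset) extent tuple described under the disk format
changes, here is a minimal Python sketch of how a byte offset within a
file could be resolved to a disk block. The Extent class, its field
names, and the file_offset_to_disk_block helper are hypothetical names
chosen for this illustration; they are not OCFS2 structures or
functions, and the flat list stands in for the real extent tree.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Extent:
    """Hypothetical in-memory view of one on-disk extent tuple."""
    file_cluster_off: int  # first file-virtual cluster this extent covers
    num_clusters: int      # length of the extent, in clusters
    disk_block_off: int    # first disk block of the extent, in blocks


def file_offset_to_disk_block(extents: Sequence[Extent], byte_off: int,
                              blocksize: int, clustersize: int) -> Optional[int]:
    """Return the disk block holding byte_off, or None if nothing maps it."""
    blocks_per_cluster = clustersize // blocksize
    vcluster = byte_off // clustersize  # file-virtual cluster of the byte
    for ext in extents:
        end = ext.file_cluster_off + ext.num_clusters
        if ext.file_cluster_off <= vcluster < end:
            # Whole clusters into the extent, then blocks into the cluster.
            clusters_in = vcluster - ext.file_cluster_off
            blocks_in = (byte_off % clustersize) // blocksize
            return ext.disk_block_off + clusters_in * blocks_per_cluster + blocks_in
    return None  # not covered by any extent
```

With a 512B blocksize and a 4K cluster size there are 8 blocks per
cluster, so byte offset 4096 in an extent that starts at disk block
1000 resolves to block 1008.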

o NO MORE LIFETIME ISSUES for in-memory structures.
  - OCFS had serious problems with lifetimes, because the lifetime of
    a structure representing an on-disk object would be completely
    disjoint from the in-memory object lifetime (e.g., struct inode
    vs. file_entry).
  - All OCFS2 lifetimes are exactly tied to kernel objects. This is
    *amazingly* simpler to debug and understand.

o Physical sizing limits are massively increased.
  - The global bitmap inode can refer to 4Gb of bitmap. Each bit
    represents a cluster. OCFS could only refer to 8Mb of bitmap.
  - A filesystem with a cluster size of 4K can grow to 16TB. A
    filesystem with a cluster size of 128K (the OCFS default) can grow
    to 512TB. Finally, a filesystem with a cluster size of 1MB can
    grow to 4PB.
  - OCFS, by contrast, maxed out at 32GB, 1TB, and 8TB for 4K, 128K,
    and 1MB cluster sizes, respectively. That's right, a 4K
    clustersize was 32GB, and is now 16TB.
  - Inodes have the same 4Gb of clusters. So an inode can grow to the
    size of the filesystem (minus any overhead).

o Software sizing limits are also larger, though not by as much.
  - Representing 4Gb of bitmap, and the other parts, is difficult. The
    current codebase cannot handle filesystems as large as the disk
    format can. However, that can be changed in the future with no
    disk format change.
  - The limit on the software size is the amount of memory required
    for block pointers. As the system can only allocate 128K of block
    pointers, that is 32768 pointers on a 32bit system and 16384
    pointers on a 64bit system.
  - With a blocksize of 512B, 32768 block pointers can represent 128Mb
    of clusters. With a 4K cluster size, that is 512GB of filesystem
    space. With a 1MB cluster size, that is 128TB. Still huge. A 64bit
    platform, with half as many block pointers, sees half the space
    (256GB filesystems on a 4K cluster size).
  - But wait, there's more! With a 4K block size, 32768 block pointers
    can represent 1Gb of clusters. With a 4K cluster size, that is a
    4TB filesystem.
  - In the end, the right combination of block size and cluster size
    can fill any current disk with the current codebase software
    limits.

o Ext2/3-style directories.
  - Removes the inode-inside-the-directory-data problem of OCFS
    directories.
  - Allows link(2).
  - Well-understood code. Allows htree when required.
  - Allows namespace operations without reading/locking directory
    child inodes. OCFS had to do that, and it was a locking nightmare.
  - In essence, namespace operations are separate from object
    operations, as they should be.
  - Readdir and other lookup operations are properly locked, avoiding
    the directory corruption seen in OCFS.

o Proper, clean CDSL (Context Dependent Symbolic Link).
  - There is no limitation on node maximum.
  - There are no node-specific recovery issues.
  - To userspace, it is really a symlink.
  - Behaves precisely like VMS/Tru64 CDSL.

o Orphan directory on unlink(2).
  - Allows POSIX unlink(2), where node 1 creates a file, node 2 opens
    it, then node 1 unlinks it. Node 2 can still access the open
    descriptor, but no other open() or readdir() will see it. When
    node 2 closes the descriptor, the inode is removed.
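The unlink(2) behaviour described above can be observed on a single
machine with a short Python sketch. It assumes a POSIX system and a
local temporary file, so it does not exercise OCFS2 or a second node;
it only shows the semantics that the orphan directory preserves across
the cluster. The function name read_after_unlink is made up for this
example.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload, unlink the file name, then read back through the fd."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)  # the name is gone from its directory
        # A fresh lookup by name can no longer find the file.
        assert not os.path.exists(path)
        # The already-open descriptor still reaches the data.
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))
    finally:
        os.close(fd)  # last reference dropped, so the inode can be freed
```

Calling read_after_unlink(b"still here") returns the same bytes even
though the file has no name by the time it is read.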
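The sizing figures quoted in the physical and software limits sections
all come from the same multiplication: one bitmap bit per cluster,
times the cluster size. The Python sketch below reproduces that
arithmetic. It assumes that each block pointer refers to one bitmap
block and that 128K of memory is available for pointers, as the text
states; the function and constant names are invented for this
illustration.

```python
# Binary multipliers: K = 2**10, M = 2**20, and so on.
K, M, G, T, P = (1 << shift for shift in (10, 20, 30, 40, 50))

POINTER_AREA = 128 * K  # bytes of memory available for block pointers


def format_limit(bitmap_bits: int, cluster_size: int) -> int:
    """Largest volume in bytes when each bitmap bit tracks one cluster."""
    return bitmap_bits * cluster_size


def software_limit(pointer_size: int, block_size: int, cluster_size: int) -> int:
    """Largest volume in bytes that the block-pointer budget can describe.

    Assumes each pointer refers to one bitmap block of block_size bytes.
    """
    pointers = POINTER_AREA // pointer_size  # 32768 on 32bit, 16384 on 64bit
    bitmap_bits = pointers * block_size * 8  # one bit per cluster
    return format_limit(bitmap_bits, cluster_size)
```

For example, format_limit(4 * G, 4 * K) equals 16 * T, matching the
16TB disk format limit for 4K clusters, and software_limit(4, 512,
4 * K) equals 512 * G for a 32bit system with 512B blocks. With 8-byte
pointers the same call gives 256 * G.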