Block-based Error Detection and Correction for OCFS2
Joel Becker, Oracle
January 18, 2007
Introduction
A common problem all filesystems face today is that disks fail; there will be errors and corruptions. While a complete description of all possible problems is beyond the scope of this document, it's well understood that corruption can be detected at the level of a single block.
Most filesystems have some sort of ability to detect corruption. OCFS2 already has a simple mechanism via the signatures written at the start of each metadata block. When OCFS2 detects an invalid signature, it forces the volume read-only and refuses to modify it. It disconnects from the cluster as well, so that other nodes can consider this node dead. This is just in case the corruption happened in the memory of the faulty node. If the corruption is on the physical media, other nodes will eventually discover it and take themselves offline as well.
This is far from optimal or robust. A filesystem with corruption in any frequently used structure will quickly go completely offline. No one will be able to use it until fsck(8) is run. While correcting the corruption is the best outcome, isolating it allows the uncorrupted portions of the filesystem to still be used.
This document describes a simple method that will detect most corruptions, correct any single-bit errors, and allow isolation of any corruptions that cannot be corrected.
Design Scope
Some modern filesystems go to great lengths to discover and correct corruption. They keep multiple copies of all metadata. They have strong checksums of objects stored both with the block and with any referring object. When corruption is detected, they can read an alternate and repair the corruption.
OCFS2 is already out in the wild. It was not designed for this sort of robustness. Modifying it to this level of safety isn't really possible without creating an entirely new filesystem. Nor is this necessary to achieve significant improvement over its current state.
This design allows for a lot of detection and a little correction without significantly modifying any on-disk structure or hierarchy. It makes use of unused space in every on-disk block. It will be backward compatible with older formats, and forward compatible if a single superblock bit is cleared.
Core Ideas
CRC32 is a simple and fast checksum with a very good probability of detecting corruption. iSCSI, for example, considers CRC32c strong enough for all data protection. We will be using the Ethernet (802.3) CRC32. It provides the same guarantees for any blocks smaller than 2^16 (RFC 3385). This includes OCFS2's maximum block size of 4K (2^12). Zach pointed me to a paper that reports CRC32 maxes out its detection capability at 2^13 bits. Thankfully, OCFS2 still fits. The kernel has fast implementations for most architectures, and it's pretty fast even in naive C code. This will use 32 bits of space.
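To make this concrete, here is a minimal userspace sketch of the checksum calculation, using zlib's crc32(), which implements the same 802.3 polynomial; the kernel side would use its own crc32 routines, and the helper name block_crc32e is purely illustrative.

#include <stdint.h>
#include <stddef.h>
#include <zlib.h>

/* zlib's crc32() implements the 802.3 CRC32, handling the standard
 * pre- and post-inversion internally. */
static uint32_t block_crc32e(const void *data, size_t blocksize)
{
	uLong crc = crc32(0L, Z_NULL, 0);	/* canonical initial value */

	return (uint32_t)crc32(crc, (const Bytef *)data, (uInt)blocksize);
}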
Hamming codes allow for single-bit error correction via a much smaller number of parity bits. OCFS2's largest blocksize, 4KB, requires only 16 bits of parity to cover 32768 bits of data. That's quite a scaling factor. This will use 16 bits of space.
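For a sense of the scaling: a Hamming code with r parity bits protects up to 2^r - r - 1 data bits, so r = 16 covers up to 65519 data bits, comfortably more than our 32768. Below is a hedged sketch of the encoding side; hamming_encode and its bit-order convention are my choices for illustration, not a settled interface.

#include <stdint.h>
#include <stddef.h>

/* Codeword positions that are powers of two hold parity bits;
 * data bits occupy all the other positions. */
static int is_parity_position(uint32_t pos)
{
	return (pos & (pos - 1)) == 0;
}

/* Compute the parity vector over nbits of data.  Each set data bit
 * XORs its 1-based codeword position into the vector, so a single
 * flipped bit later shows up as a syndrome equal to its position.
 * For 32768 data bits the largest position is 32768 + 16 = 32784,
 * which still fits in 16 bits.  Bit order within a byte is just a
 * convention; the encode and fix sides have to agree on it. */
static uint16_t hamming_encode(const uint8_t *data, size_t nbits)
{
	uint32_t parity = 0;
	uint32_t pos = 1;	/* 1-based codeword position */
	size_t i;

	for (i = 0; i < nbits; i++, pos++) {
		while (is_parity_position(pos))
			pos++;

		if (data[i / 8] & (1u << (i % 8)))
			parity ^= pos;
	}

	return (uint16_t)parity;
}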
On-Disk Structures
Every on-disk metadata structure fills an entire block. Each has reserved space set aside for future enhancement - like this design.
We will create a structure specific to error detection and correction:
struct ocfs2_block_check {
/*00*/	__le32 bc_crc32e;	/* 802.3 Ethernet II CRC32 */
	__le16 bc_ecc;		/* Single-error-correction parity
				   vector.  This is a simple Hamming
				   code dependent on the blocksize.
				   OCFS2's maximum blocksize, 4K,
				   requires 16 parity bits, so we fit
				   in __le16. */
	__le16 bc_reserved1;
/*08*/
};
Note that this structure is padded out to 64 bits for convenience. This will use one u64 of each metadata structure.
The only metadata blocks that look different are directory blocks. These are taken from Ext3, and do not have the usual structure of other OCFS2 metadata blocks. They are simply a list of ocfs2_dir_entry structures. As such, we can insert a "hidden" directory entry that has a zero-length name. The filesystem code will skip it when reading directories.
struct ocfs2_dir_check_entry {
/*00*/	__le64 inode;
	__le16 rec_len;
	__u8 name_len;
	__u8 file_type;
	__le32 reserved1;
/*10*/	struct ocfs2_block_check check;
/*18*/
	/* Actual on-disk length specified by rec_len, but it will
	   always be 0x18 */
};
The ocfs2_block_check structure lives where a name normally would go. This fake entry will follow the entries for '.' and '..' in the first block of a directory and at the start of every additional directory block.
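For illustration, here is roughly how a directory walk passes over the hidden entry. The ocfs2_dir_entry layout shown mirrors the check entry above, and process_entry() is a hypothetical placeholder.

struct ocfs2_dir_entry {
/*00*/	__le64 inode;		/* 0 if the slot is unused */
	__le16 rec_len;
	__u8 name_len;
	__u8 file_type;
	char name[];		/* name_len bytes follow */
};

static void walk_dir_block(char *block, unsigned int blocksize)
{
	unsigned int offset = 0;

	while (offset < blocksize) {
		struct ocfs2_dir_entry *de =
			(struct ocfs2_dir_entry *)(block + offset);

		/* The check entry has name_len == 0, so normal
		 * directory reads pass right over it. */
		if (de->name_len)
			process_entry(de);	/* hypothetical */

		offset += le16_to_cpu(de->rec_len);
	}
}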
All CRC32 and Hamming codes will be calculated with the ocfs2_block_check structure zeroed out.
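In sketch form, generating the check data would look like the following, reusing the hypothetical helpers sketched earlier. The function name is illustrative.

#include <string.h>

static void block_check_compute(void *block, size_t blocksize,
				struct ocfs2_block_check *bc)
{
	/* bc points into the block itself, so zero it before
	 * computing either check value. */
	memset(bc, 0, sizeof(*bc));

	bc->bc_crc32e = cpu_to_le32(block_crc32e(block, blocksize));
	bc->bc_ecc = cpu_to_le16(hamming_encode(block, blocksize * 8));
}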
Finally, an INCOMPAT feature flag, OCFS2_FEATURE_INCOMPAT_BLOCK_CHECK, will be set on the superblock.
Operation
When the OCFS2_FEATURE_INCOMPAT_BLOCK_CHECK flag is not set, the filesystem will behave as it always has. Newly created metadata will have the ocfs2_block_check structure zeroed. When metadata is read, the check values will be ignored. Thus, old drivers and new drivers will treat the filesystem equally. The same goes for the OCFS2 tool suite.
When the OCFS2_FEATURE_INCOMPAT_BLOCK_CHECK flag is set on the superblock, old filesystem drivers that do not support it will refuse to mount the filesystem. Older tools will also refuse to operate on the filesystem. This ensures that drivers and tools always write valid check information -- there will be no filesystem with half of its metadata checked and half not.
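The gate itself is the standard incompat-feature pattern. A kernel-style sketch, with OCFS2_FEATURE_INCOMPAT_SUPP standing in for whatever supported-feature mask the driver carries:

static int check_incompat(struct ocfs2_super_block *sb)
{
	u32 incompat = le32_to_cpu(sb->s_feature_incompat);

	/* An old driver never has OCFS2_FEATURE_INCOMPAT_BLOCK_CHECK
	 * in its supported mask, so it sees an unknown bit here and
	 * refuses the mount. */
	if (incompat & ~OCFS2_FEATURE_INCOMPAT_SUPP)
		return -EINVAL;

	return 0;
}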
Currently, the filesystem checks the signature of a block when it is read in. The new code will do that as well, but then utilize the check data. First, the CRC32 will be calculated. If it matches the stored CRC32, we know that the block is valid.
If the CRC32 fails, the Hamming code will be calculated and compared against the stored Hamming code. If a single bit has flipped (some sort of memory or wire corruption), the difference between the computed and stored codes identifies that bit, and we correct it. A second CRC32 calculation will show the block is now valid.
If the second CRC32 fails, the corruption affected more than one bit. We have reached the limit of our correction scheme, and we must mark this metadata broken.
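Putting the read-side flow together in sketch form, again with the hypothetical helpers from earlier; hamming_fix() is an assumed companion to hamming_encode() that flips the data bit named by a nonzero syndrome (and does nothing if the syndrome points at a parity position):

static int block_check_validate(void *block, size_t blocksize,
				struct ocfs2_block_check *bc)
{
	struct ocfs2_block_check saved = *bc;
	uint32_t crc;
	uint16_t syndrome;
	int rc = 0;

	/* The stored values were computed with this struct zeroed. */
	memset(bc, 0, sizeof(*bc));

	crc = block_crc32e(block, blocksize);
	if (crc == le32_to_cpu(saved.bc_crc32e))
		goto out;	/* block is valid as read */

	/* CRC mismatch: the XOR of the computed and stored parity
	 * vectors is the codeword position of a single flipped bit. */
	syndrome = hamming_encode(block, blocksize * 8) ^
		   le16_to_cpu(saved.bc_ecc);
	hamming_fix(block, blocksize * 8, syndrome);	/* hypothetical */

	crc = block_crc32e(block, blocksize);
	if (crc != le32_to_cpu(saved.bc_crc32e))
		rc = -EIO;	/* more than one bit was lost */

out:
	*bc = saved;	/* put the on-disk values back */
	return rc;
}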
However, because we have checks on all metadata, we can trust any other metadata that validates. In other words, we no longer need to take the entire filesystem offline. The routines that access a specific piece of broken metadata must return EIO, but routines accessing other metadata (e.g., other inodes) can continue successfully.
We will not tell other nodes about our discovery. If the corruption is on the storage media, they will discover it eventually. If the corruption occurred in our memory or our cabling, they can continue to access this metadata successfully. In fact, if the bad metadata expires from our cache, we may re-read it at a later time and get valid data.
Enabling
mkfs.ocfs2(8) should gain an option to enable block checking. Once the checking code has proven itself in production, the feature should be enabled by default.
Existing filesystems should be able to enable block checking via tunefs.ocfs2(8). When enabling check structures, tunefs.ocfs2(8) must go through the entire filesystem and generate the check data for all metadata. This will be expensive, especially for directories.
Block checking can always be disabled via tunefs.ocfs2(8) as well. Disabling merely clears the feature bit; the driver will then ignore the check data.
Current Status
Back in June of 2006, I cooked up changes in libocfs2 to do checking. This lives in the ecc branch of the ocfs2-tools Subversion repository. The URL to check it out is http://oss.oracle.com/projects/ocfs2-tools/src/branches/ecc/. This includes all of the changes to the on-disk structures in libocfs2/include/ocfs2_fs.h.
Here's the diff(1) from ocfs2-tools mainline code: ocfs2-tools-ecc.patch. This patch is from June of 2006, and the branch needs to be brought up to date with the latest libocfs2.
I think there is a bug in the handling of directory check blocks. I don't handle the first directory block and the '.' and '..' entries.
Pending Work
The filesystem needs to be modified to check all metadata blocks when they are read and calculate check values for blocks when they are written. This includes honoring the incompat feature bit.
On the tools side, fsck.ocfs2(8) needs to learn to validate and generate check blocks. tunefs.ocfs2(8) must learn to enable and disable the feature on existing filesystems.
Miscellaneous
The interweb doesn't seem to have ready-made routines to calculate Hamming codes for 4K blocks (32768 bits). They mostly stop at 32 bits. I had to write my own routine, which you can see in the ocfs2-tools branch. You can also read an email in which I describe my performance observations of my Hamming code algorithm. Be warned, it's long.