Oracle Open Source

Frequently Asked Questions


Q: The TCP/IP checksum algorithm is notoriously bad at detecting single-bit errors. Why didn't you pick a stronger algorithm?

A: Other options were contemplated, including Fletcher and XOR. The IP checksum was chosen because it was already implemented in the controller we were targeting.

Also, the purpose of the checksum isn't necessarily to detect bit errors: server-class systems already feature error-checking and -correcting (ECC) memory and buses. The main intent of the checksum is to allow verification that the data buffer matches the integrity metadata, and the IP checksum handles that fine.

The strength is not in the choice of algorithm but in the separation of the data and integrity metadata buffers.
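To make the separation concrete, here is a minimal sketch rather than the actual kernel or controller code. It assumes 512-byte sectors and keeps one 16-bit guard value per sector in a buffer of its own. The checksum is the one's-complement sum described in RFC 1071; the names `make_integrity_buffer` and `verify` are made up for illustration.

```python
import struct

SECTOR_SIZE = 512  # bytes per sector (an assumption for this sketch)


def ip_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f">{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def make_integrity_buffer(data: bytes) -> list[int]:
    """Build a separate list holding one guard value per sector."""
    return [ip_checksum(data[off:off + SECTOR_SIZE])
            for off in range(0, len(data), SECTOR_SIZE)]


def verify(data: bytes, guards: list[int]) -> list[int]:
    """Return the indices of sectors that no longer match their guard."""
    return [i for i, guard in enumerate(guards)
            if ip_checksum(data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]) != guard]


data = bytearray(range(256)) * 4           # two 512-byte sectors
guards = make_integrity_buffer(bytes(data))
data[700] ^= 0x01                          # damage one bit in sector 1 only
print(verify(bytes(data), guards))         # [1]
```

Because the guards live in their own buffer, anything that scribbles over the data buffer leaves the metadata untouched, and the mismatch shows up at the next verification point.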

That being said, we would like to explore the performance impact of CRC32C, which is due to become available as a processor instruction in SSE4.2 (Nehalem).


Q: Why didn't you simply use the T10 16-bit CRC?

A: The code allows you to choose between the IP checksum and the T10 CRC. Calculating a cyclic redundancy check is much more expensive than calculating the IP checksum, and benchmarks showed that the CRC had a significant impact on system performance. The IP checksum was the fastest of all the protection algorithms we benchmarked and also had the smallest impact on system performance.
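For a feel of where the extra cost comes from, here is a deliberately naive sketch of the 16-bit T10 CRC (polynomial 0x8BB7, initial value 0, no bit reflection). It is for illustration only: real implementations typically use lookup tables or hardware assistance, and the function name `crc_t10dif` is simply chosen for this example. Note the shift and conditional XOR for every input bit, where the IP checksum needs one addition per 16-bit word.

```python
T10_POLY = 0x8BB7  # generator polynomial of the 16-bit T10 DIF guard CRC


def crc_t10dif(data: bytes) -> int:
    """Bitwise CRC-16: polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):  # one shift and conditional XOR per input bit
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


sector = bytes(range(256)) * 2             # one 512-byte sector
guard = crc_t10dif(sector)

# With a zero initial value, appending the CRC (big-endian) to the data
# yields a CRC of zero over the whole message.
print(crc_t10dif(sector + guard.to_bytes(2, "big")) == 0)  # True
```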


Q: Isn't converting the checksum cheating? Doesn't it create a window of error?

A: The controller firmware verifies the IP checksum of the data before generating the CRC and sending it out on the wire. Any mismatch between the data buffer and the integrity metadata is detected during this check, and the write is aborted.
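The verify-then-convert step can be sketched as follows. This is an illustration of the ordering only, not the firmware's actual code; `convert_guard` and `GuardMismatch` are hypothetical names, and the two helper functions are plain textbook versions of the IP checksum and the T10 CRC.

```python
import struct


class GuardMismatch(Exception):
    """Raised when the data buffer and its integrity metadata disagree."""


def ip_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC 1071); assumes even length."""
    total = sum(struct.unpack(f">{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def crc_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def convert_guard(sector: bytes, ip_guard: int) -> int:
    """Verify the host's IP checksum first; only then compute the wire CRC."""
    if ip_checksum(sector) != ip_guard:
        raise GuardMismatch("IP checksum mismatch, aborting write")
    return crc_t10dif(sector)


sector = bytes(range(256)) * 2
host_guard = ip_checksum(sector)
wire_guard = convert_guard(sector, host_guard)   # passes, CRC goes on the wire

corrupted = b"\xff" + sector[1:]                 # data changed after the guard was made
try:
    convert_guard(corrupted, host_guard)
except GuardMismatch as exc:
    print("write aborted:", exc)
```

A CRC is only ever generated for data that has just passed the checksum test, so corruption that happened before the conversion is rejected instead of being covered by a freshly computed CRC.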


Q: A disk won't have to do any conversion/change the integrity metadata. Your controller requirements violate this fundamental design of the T10 DIF specification.

A: Simple disks don't have to change the integrity metadata, that is correct. Disk arrays, however, do. The logical volume sector number, for instance, isn't the same as the physical disk sector that the write needs to go to.

Another example: In the case of RAID5 the array will need to generate its own integrity metadata for the parity block.

So the protection data will inevitably be verified and modified on its path from the application to the platter.
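Both array examples can be sketched in a few lines. The sketch assumes the 8-byte T10 DIF tuple layout (16-bit guard tag, 16-bit application tag, 32-bit reference tag, all big-endian) and a reference tag holding the low 32 bits of the sector number. The volume offset `VOLUME_START` and the function names are hypothetical; a real array's mapping is far more involved.

```python
import struct

DIF_TUPLE = struct.Struct(">HHI")  # guard tag, application tag, reference tag
VOLUME_START = 1_000_000           # hypothetical physical offset of the volume


def crc_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, initial value 0 (T10 DIF guard)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc


def remap_tuple(pi: bytes, logical_lba: int) -> bytes:
    """Check the reference tag against the logical sector, then rewrite it."""
    guard, app, ref = DIF_TUPLE.unpack(pi)
    if ref != logical_lba & 0xFFFFFFFF:
        raise ValueError(f"reference tag {ref} does not match sector {logical_lba}")
    physical_lba = VOLUME_START + logical_lba
    return DIF_TUPLE.pack(guard, app, physical_lba & 0xFFFFFFFF)


def parity_with_tuple(blocks: list[bytes], parity_lba: int) -> tuple[bytes, bytes]:
    """XOR the data blocks and generate fresh metadata for the parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            parity[i] ^= value
    pi = DIF_TUPLE.pack(crc_t10dif(parity), 0, parity_lba & 0xFFFFFFFF)
    return bytes(parity), pi


block = bytes(range(256)) * 2
host_pi = DIF_TUPLE.pack(crc_t10dif(block), 0, 42)  # written for logical sector 42
disk_pi = remap_tuple(host_pi, 42)
print(DIF_TUPLE.unpack(disk_pi)[2])                 # 1000042
```

In both functions the array verifies or generates metadata itself, which is exactly the kind of modification a simple disk never has to perform.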


Q: Does this mean we have to trust controller/array/disk firmware?

A: You already do.


Q: But now there's more that can go wrong.

A: Yes, but when things go wrong with DIF enabled, we have a bigger chance of knowing about it. And when things go bad, writes get rejected.


Q: Does this mean btrfs checksumming is futile?

A: No. It solves a different problem, namely data going bad on disk. This corruption scenario is often caused by operator error rather than "physical" bit rot on the platter.

If btrfs detects corrupted data on disk it can retry the read from another mirror or ask the RAID to reconstruct from parity.
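The retry logic amounts to something like the sketch below. It is not btrfs code: `read_block` is a hypothetical helper, the mirrors are modelled as plain byte strings, and `zlib.crc32` stands in for the filesystem's own checksum (which uses a different polynomial) simply because it ships with Python.

```python
import zlib


def read_block(mirrors: list[bytes], expected_csum: int) -> bytes:
    """Return the first mirror copy whose checksum matches the stored one."""
    for copy in mirrors:
        if zlib.crc32(copy) == expected_csum:
            return copy
    raise OSError("all mirrors failed checksum verification")


good = b"important data" * 100
stored_csum = zlib.crc32(good)        # recorded by the filesystem at write time
bad = b"X" + good[1:]                 # first copy went bad on disk

print(read_block([bad, good], stored_csum) == good)  # True
```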

DIF protects against errors in the I/O path between application/kernel and the disk. That's orthogonal to the btrfs checksumming feature.

Another way of looking at it is that DIF protects physical data chunks, whereas filesystem checksumming protects logical data chunks (blocks/extents).