Hamming Code Optimization
From: Joel Becker <Joel.Becker@oracle.com>
To: martin.petersen@oracle.com, mark.fasheh@oracle.com, zach.brown@oracle.com
Date: Mon, 19 Jun 2006 00:35:27 -0700
Subject: A tale of optimization
X-Burt-Line: Trees are cool.
X-Red-Smith: Ninety feet between bases is perhaps as close as man has ever come to perfection.

[Ed note: I just realized I've typed out a long, boring story. It may or may not be interesting, but it will help anyone trying to optimize this further to understand where I've been. Feel free to ignore it if silly optimization results bore you when they take three pages.]

Folks,

Here's a fun story of optimization (as a fun story, feel free to ignore it for a while if you are busy :-).

I was cooking up a Hamming Code algorithm for ECC in OCFS2. With Hamming Code parity, we can correct single-bit errors with 16 bits of parity for a 4K block (less parity for smaller blocks).

The standard method to calculate the parity is with matrix math. This quickly generates the code block (an interleaved bitstream of parity and data bits). However, the matrix has dimensions [data bits x total bits]. So, for our 4K blocksize, that's [32768 x 32784]. We can't do that in memory, not nicely :-)

Lots of googling later, I understood enough to implement it mathematically. Details of the algorithm are handwaved here. The big problem is that parity is based on the offset in the code bitstream, not the data bitstream. In the code bitstream, all power-of-two positions are parity bits: bits 1, 2, 4, and so on. The data bits are interleaved between them. So in the code bitstream, data bit 1 is code bit 3 (skipping past code bits 1 and 2, which are for parity). Any time you want to work on a bit, you have to know its code bitstream offset.

Here's my original algorithm, the simple cut. I call it the "for" algorithm, for obvious reasons. In all the examples, 'i' is the data bit offset and 'b' is the code bit offset. I'm going to use 1-based arrays because that's what Hamming code uses.

    total_bits = total_data_bits + total_parity_bits;
    for (i = 1, b = 1; i <= total_data_bits; i++, b++)
        while (is_power_of_two(b))
            b++;
        if (test_bit(i))
            for (j = 1; j <= total_parity_bits; j++)
                if (b affects j)
                    parity[j] ^= 1;

I'm going to handwave right now how we determine whether bit b affects each parity bit, but it relies on b's number in the _code_ space, so we have to have that is_power_of_two() check.

To sum up, the for loop runs through _every_ data bit and calls test_bit() on it. If the bit is set, it loops through each parity bit. That's a lot of looping.

This naive implementation is slow. Surprise! My test program fills a blocksize buffer from /dev/urandom and runs N encodes. I have the same program do crc32 and similar for comparison. 10000 loops of a naive crc32 (the kernel's un-optimized fallback code) on my laptop take 0.300s. 10000 loops of zlib's optimized crc32 take 0.030s. 10000 loops of our parity encode take 6s! Even the naive crc32 is an order of magnitude better!

I wanted to find a way to move forward more than one bit at a time. Many algorithms (smarter crc32s, for example) work on words or larger chunks. But each bit affects different parity bits, so I don't yet see how to calculate more than one bit's parity in a single operation.
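[Ed note: for readers who want to poke at it, here's a compilable sketch of the 'for' algorithm above. The helper names and the packed parity word are illustrative, not the actual OCFS2 code; the "b affects j" test is spelled out using the Hamming rule that parity bit k covers the code positions with bit k set in their binary representation.]

    #include <stdint.h>

    /* Nonzero if x is a power of two (1, 2, 4, ...). */
    static int is_power_of_two(unsigned int x)
    {
        return x && !(x & (x - 1));
    }

    /* Test 1-based data bit i in the buffer. */
    static int buffer_test_bit(const uint8_t *data, unsigned int i)
    {
        return (data[(i - 1) / 8] >> ((i - 1) % 8)) & 1;
    }

    /*
     * Naive "for" encoder over d data bits with p parity bits.
     * Bit k of the returned word is the parity bit that lives
     * at code position 2^k.
     */
    static uint32_t hamming_encode_for(const uint8_t *data, unsigned int d,
                                       unsigned int p)
    {
        uint32_t parity = 0;
        unsigned int i, b, k;

        for (i = 1, b = 1; i <= d; i++, b++) {
            /* Skip the power-of-two (parity) positions. */
            while (is_power_of_two(b))
                b++;

            if (!buffer_test_bit(data, i))
                continue;

            /*
             * Parity bit k covers every code position with bit k
             * set, so "b affects j" is a mask test.  (The whole
             * loop collapses to parity ^= b, but it's written out
             * to match the pseudocode.)
             */
            for (k = 0; k < p; k++)
                if (b & (1u << k))
                    parity ^= 1u << k;
        }

        return parity;
    }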
I did realize one thing, though: we skip bits that are 0. Why can't we skip them faster? Instead of using the for loop to compute b's offset, I concocted a function that would take i and return b for any bit. I then changed the loop to use find_next_bit_set. Let's call this the 'ffs' algorithm (handwaving around 1-based arrays):

    for (i = 1;
         (i = ocfs2_find_next_bit_set(data, total_data_bits, i)) < total_data_bits;
         i++)
        b = calc_code_bit(i);
        for (j = 1; j <= total_parity_bits; j++)
            if (b affects j)
                parity[j] ^= 1;

Here's the fun part. I figured that by skipping zero bits, we'd get some speedup. When I ran it, it was SLOWER! What? It ran in 7s instead of 6s. So I profiled it. The 'for' algorithm's second-biggest time consumer was ocfs2_test_bit(), with ~8 million calls taking up 7s of (slower) profiled run time. The 'ffs' algorithm's ocfs2_find_next_bit_set() took 8s of profiled time with only ~4 million calls. Our naive ffs is *slow*.

This is where I was when I started this email. I thought it was a fun outcome. But since then I've had some thoughts and some success on the topic.

This led me to two conclusions. First, a fast find_next_bit(), like the assembly one the kernel has, will help a lot. Second, half of our test_bit() calls were returning true (4 million ffs calls vs 8 million test_bits). Of course they were: I'm filling the buffer from /dev/urandom. A real inode will have a ton of zeros, especially if it isn't fragmented.

I changed the test buffer. It still reads from /dev/urandom, but then it zeros the tail of the inode, where it will likely be empty. The 'for' algorithm speeds up, as it has fewer loops through the parity checks, running in about 2.5s. But the 'ffs' algorithm really shines: 1.3s for its run of 10000. An assembly-based ffs in-kernel will speed it up some more, but past that we'll have to do some tricks on the parity loop. I'm willing to call the code acceptable for now.

Please, if anyone has more optimization ideas, have at. :-)

Joel

--

"Here's something to think about: How come you never see a
 headline like ``Psychic Wins Lottery''?"
        - Jay Leno

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
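[Ed note: the email handwaves calc_code_bit(). For the curious, here's one way to write it, a sketch under the email's 1-based convention rather than OCFS2's actual implementation. The idea is that every parity position at or below b pushes the data bit one slot further out in the code bitstream.]

    #include <assert.h>

    /*
     * Map 1-based data bit offset i to its 1-based offset b in the
     * code bitstream, where every power-of-two position is reserved
     * for a parity bit.
     */
    static unsigned int calc_code_bit(unsigned int i)
    {
        unsigned int b = i;
        unsigned int p;

        /*
         * Each parity position p = 1, 2, 4, ... that lands at or
         * before b bumps the data bit one slot later.  Growing b
         * can pull more parity positions into range, so the loop
         * condition re-checks against the updated b.
         */
        for (p = 1; p <= b; p <<= 1)
            b++;

        return b;
    }

    int main(void)
    {
        /* Data bit 1 lands at code bit 3, past parity bits 1 and 2. */
        assert(calc_code_bit(1) == 3);
        assert(calc_code_bit(2) == 5);  /* code bit 4 is parity */
        assert(calc_code_bit(4) == 7);
        assert(calc_code_bit(5) == 9);  /* code bit 8 is parity */
        return 0;
    }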