Hamming Code Optimization
From: Joel Becker <Joel.Becker@oracle.com>
To: martin.petersen@oracle.com, mark.fasheh@oracle.com, zach.brown@oracle.com
Date: Mon, 19 Jun 2006 00:35:27 -0700
Subject: A tale of optimization
X-Burt-Line: Trees are cool.
X-Red-Smith: Ninety feet between bases is perhaps as close as man has ever come to perfection.

[Ed note: I just realized I've typed out a long, boring story. It may or may not be interesting, but it will help anyone trying to optimize this further to understand where I've been. Feel free to ignore it if silly optimization results bore you when they take three pages.]

Folks,

Here's a fun story of optimization (as a fun story, feel free to ignore it for a while if you are busy :-).

I was cooking up a Hamming Code algorithm for ECC in OCFS2. With Hamming Code parity, we can correct single-bit errors with 16 bits of parity for a 4K block (less parity for smaller blocks).

The standard method to calculate the parity is with matrix math. This quickly generates the code block (an interleaved bitstream of parity and data bits). However, the matrix has dimensions [data bits x total bits]. So, for our 4K blocksize, that's [32768 x 32784]. We can't do that in memory, not nicely :-)

Lots of googling later, I understood enough to implement it mathematically. Details of the algorithm are handwaved here. The big problem is that parity is based on the offset in the code bitstream, not the data bitstream. In the code bitstream, all power-of-two positions are parity bits: bits 1, 2, 4, and so on. The data bits are interleaved between them. So in the code bitstream, data bit 1 is code bit 3 (skipping past code bits 1 and 2, which are for parity). Any time you want to work on a bit, you have to know its code bitstream offset.

Here's my original algorithm, the simple cut. I call it the "for" algorithm, for obvious reasons. In all the examples, 'i' is the data bit offset and 'b' is the code bit offset. I'm going to use 1-based arrays because that's what Hamming code uses.

    total_bits = total_data_bits + total_parity_bits;
    for (i = 1, b = 1; i <= total_data_bits; i++, b++)
        while (is_power_of_two(b))
            b++;
        if (test_bit(i))
            for (j = 1; j <= total_parity_bits; j++)
                if (b affects j)
                    parity[j] ^= 1;

I'm going to handwave right now how we determine whether bit b affects each parity bit, but it relies on b's number in the _code_ space, so we have to have that is_power_of_two() check.

To sum up, the for loop runs through _every_ data bit and calls test_bit() on it. If the bit is set, it loops through each parity bit. That's a lot of looping.

This naive implementation is slow. Surprise! My test program fills a blocksize buffer from /dev/urandom and runs N encodes. I have the same program do crc32 and similar for comparison. 10000 loops of a naive crc32 (the kernel's un-optimized fallback code) on my laptop take 0.300s. 10000 loops of zlib's optimized crc32 take 0.030s. 10000 loops of our parity encode take 6s! Even the naive crc32 is an order of magnitude better!

I wanted to find a way to move forward more than one bit at a time. Many algorithms (smarter crc32s, for example) work on words or larger chunks. But each bit affects different parity bits, so I don't yet see how to calculate more than one bit's parity in a single operation.
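[Ed note: for readers who want to poke at it, here's a compilable sketch of the 'for' algorithm above. The helper names and the packed parity word are illustrative, not the actual OCFS2 code; the "b affects j" test is spelled out using the Hamming rule that parity bit k covers the code positions with bit k set in their binary representation.]

    #include <stdint.h>

    /* Nonzero if x is a power of two (1, 2, 4, ...). */
    static int is_power_of_two(unsigned int x)
    {
        return x && !(x & (x - 1));
    }

    /* Test 1-based data bit i in the buffer. */
    static int buffer_test_bit(const uint8_t *data, unsigned int i)
    {
        return (data[(i - 1) / 8] >> ((i - 1) % 8)) & 1;
    }

    /*
     * Naive "for" encoder over d data bits with p parity bits.
     * Bit k of the returned word is the parity bit that lives
     * at code position 2^k.
     */
    static uint32_t hamming_encode_for(const uint8_t *data, unsigned int d,
                                       unsigned int p)
    {
        uint32_t parity = 0;
        unsigned int i, b, k;

        for (i = 1, b = 1; i <= d; i++, b++) {
            /* Skip the power-of-two (parity) positions. */
            while (is_power_of_two(b))
                b++;

            if (!buffer_test_bit(data, i))
                continue;

            /*
             * Parity bit k covers every code position with bit k
             * set, so "b affects j" is a mask test.  (The whole
             * loop collapses to parity ^= b, but it's written out
             * to match the pseudocode.)
             */
            for (k = 0; k < p; k++)
                if (b & (1u << k))
                    parity ^= 1u << k;
        }

        return parity;
    }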
I did realize one thing, though: we skip bits that are 0. Why can't we skip them faster? Instead of using the for loop to compute b's offset, I concocted a function that would take i and return b for any bit. I then changed the loop to use find_next_bit_set. Let's call this the 'ffs' algorithm (handwaving around 1-based arrays):

    for (i = 1;
         (i = ocfs2_find_next_bit_set(data, total_data_bits, i)) < total_data_bits;
         i++)
        b = calc_code_bit(i);
        for (j = 1; j <= total_parity_bits; j++)
            if (b affects j)
                parity[j] ^= 1;

Here's the fun part. I figured that by skipping zero bits, we'd get some speedup. When I ran it, it was SLOWER! What? It ran in 7s instead of 6s. So I profiled it. The 'for' algorithm's second-biggest time consumer was ocfs2_test_bit(), with ~8 million calls taking up 7s of (slower) profiled run time. The 'ffs' algorithm's ocfs2_find_next_bit_set() took 8s of profiled time with only ~4 million calls. Our naive ffs is *slow*.

This is where I was when I started this email. I thought it was a fun outcome. But since then I've had some thoughts and some success on the topic.

This led me to two conclusions. First, a fast find_next_bit(), like the assembly one the kernel has, will help a lot. Second, half of our test_bit() calls were returning true (4 million ffs calls vs 8 million test_bits). Of course they were: I'm filling the buffer from /dev/urandom. A real inode will have a ton of zeros, especially if it isn't fragmented.

I changed the test buffer. It still reads from /dev/urandom, but then it zeros the tail of the inode, where it will likely be empty. The 'for' algorithm speeds up, as it has fewer loops through the parity checks, running in about 2.5s. But the 'ffs' algorithm really shines: 1.3s for its run of 10000. An assembly-based ffs in-kernel will speed it up some more, but past that we'll have to do some tricks on the parity loop. I'm willing to call the code acceptable for now.

Please, if anyone has more optimization ideas, have at. :-)

Joel

--

"Here's something to think about: How come you never see a
 headline like ``Psychic Wins Lottery''?"
        - Jay Leno

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
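[Ed note: the email handwaves calc_code_bit(). For the curious, here's one way to write it, a sketch under the email's 1-based convention rather than OCFS2's actual implementation. The idea is that every parity position at or below b pushes the data bit one slot further out in the code bitstream.]

    #include <assert.h>

    /*
     * Map 1-based data bit offset i to its 1-based offset b in the
     * code bitstream, where every power-of-two position is reserved
     * for a parity bit.
     */
    static unsigned int calc_code_bit(unsigned int i)
    {
        unsigned int b = i;
        unsigned int p;

        /*
         * Each parity position p = 1, 2, 4, ... that lands at or
         * before b bumps the data bit one slot later.  Growing b
         * can pull more parity positions into range, so the loop
         * condition re-checks against the updated b.
         */
        for (p = 1; p <= b; p <<= 1)
            b++;

        return b;
    }

    int main(void)
    {
        /* Data bit 1 lands at code bit 3, past parity bits 1 and 2. */
        assert(calc_code_bit(1) == 3);
        assert(calc_code_bit(2) == 5);  /* code bit 4 is parity */
        assert(calc_code_bit(4) == 7);
        assert(calc_code_bit(5) == 9);  /* code bit 8 is parity */
        return 0;
    }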