08-12-10 - The Lost Huffman Paper

"On the Implementation of Minimum Redundancy Prefix Codes" by Moffat and Turpin , 1997 , is a brilliant work that outlines very clearly the best ways to write a huffman decoder. It's a completely modern, practical work, and basically nobody has added anything beyond this technique in the last 13 years.

However, the importance of it was missed when it came out. For many years afterwards people continued to publish "improvements" to Huffman decoding (such as Sub-linear Decoding of Huffman Codes Almost In-Place (1998) ) which are just pure useless shit (I don't mean to single out that paper, there were probably hundreds of shitty foolish pointless papers on "efficient huffman" written after Moffat/Turpin).

Most people in the implementation community also missed this paper (eg. zlib, JPEG, etc. people who make important use of huffman decodes have missed these techniques).

I missed it too. Recently we did a lot of work on Huffman decoding at RAD, and after trying many techniques and lots of brainstorming, we came up with what we thought was a brilliant idea :

Store the code in your variable bit input word left-justified in a register. The Huffman codes are numerically arranged such that for codes of any given length, leaves (symbols) are lower values than nodes (branches). Then, the code for the first branch of each codelen can be left-justified in a word, and your entire Huffman decode consists of :

while ( bits >= huff_branchCodeLeftAligned[codeLen] )

return ( (bits>>(WORD_SIZE-codeLen)) - baseCode[ codeLen ] );

(this returns a symbol in "canonical order" - that is most probable is 0 ; if your symbols are not in order from most to least probable, you need an additional table lookup to reorder them).

This is really incredibly fucking hot. Of course it's obvious that it can be improved in various ways - you can use a fast table to skip the first few steps, you can use a nextCodeLen table to skip blanks in the codeLen sequence, and you can use a binary search instead of linear search. For known-at-compile-time huffman trees you could even optimize the binary search for the probability distribution of the codes and generate machine code for the decoder directly.

All of those ideas are in the Moffat+Turpin paper.

ADDENDUM : this post happened to get linked up, so I thought I'd flesh it out and fill in some of the blanks I'm implying above, since I'm sure you blog readers aren't actually going and reading the Moffat/Turpin paper like you should.

Here are some other posts on Huffman codes :

cbloom rants 08-11-10 - Huffman - Arithmetic Equivalence
cbloom rants 08-10-10 - Transmission of Huffman Trees
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 05-20-09 - Some Huffman notes
cbloom rants 05-18-09 - Lagrange Space-Speed Optimization (and 05-25-09 - Using DOT graphviz for some Huffman space-speed SVG's)

In particular : cbloom rants 05-22-09 - A little more Huffman describes a 1 bit at a time Huffman decoder with the code values right-justified in the word.

And cbloom rants 05-18-09 - Lagrange Space-Speed Optimization describes first Dmitry Shkarin's standard table-walking Huffman decoder and then a generalization of it; both use N-bit reads of right-justified code words and table stepping.

In all cases a practical Huffman decoder should use an initial table lookup to accelerate the first N bit step. (N usually 8-12 depending on application). The reality is that what you do after that is not super important because it is rare (the majority of Huffman codes are short). Because of this, there are advantages to using a right-justified "upside down" huffman code word, with the MSB at the bottom (I believe Zip does this) because it means you can get the first N bits by doing just an AND with a constant (eg. get the "top" 8 bits by doing &0xFF).

There are two key efficiency issues for Huffman decoder implementation : 1. Reducing branches, and reducing the dependency-chain that leads to branches. That is, doing math is not bad, but doing math to decide to branch is bad. 2. Avoiding variable shifts, because variable shifts are catastrophically slow on some important platforms.

Finally, let's look at a real implementation of the Moffat/Turpin Huffman decoder. The variable bit input word is stored in a word left-justified with top bits at the left. The first branch code at each code length is also left-aligned.

We start with table-accelerating the first N bits :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;

In practice you might use two tables, and Consume() is an unavoidable variable shift.

Next you have to handle the cases of code lens > TABLE_N_BITS. In that case rather than the loop in the pseudo-code above, you would actually unroll :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+1] )
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+1))) - baseCode[ (TABLE_N_BITS+1) ] ];
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS+2] )
    return symbolUnsort[ (bits>>(WORD_SIZE-(TABLE_N_BITS+2))) - baseCode[ (TABLE_N_BITS+2) ] ];

this does a branch on each code length, but avoids variable shifts. In some extreme cases (platforms with very slow variable shift, and huffman trees with very low minimum code lengths), this unrolled branching version can even be faster than the peek table, in which case you would simply set TABLE_N_BITS to zero.

In some cases, certain code lengths might not occur at all, and you can avoid checking them by having an additional table of which codelengths actually occur (in practice this rarely helps). This would look like :

if ( bits < huff_branchCodeLeftAligned[11] )
    return symbolUnsort[ (bits>>(WORD_SIZE-codeLenTable[11])) - baseCode[ 11 ] ];

where the 11 is not a codelen but the 11th code len which actually occurs, and you have the extra codeLenTable[] lookup.

Obviously you could just unroll directly starting at codelen=1 , and obviously this is also just a search. You are just trying to find where bits lies in the sorted huff_branchCodeLeftAligned table. So instead of just a linear search you could binary search. However note that lower code lens are much more likely, so you don't want to just binary search at the beginning. And the binary search makes the branches much less predictable, so it's not always a win. However, as Moffat/Turpin describes, in some cases, for example if you have a hard-coded Huffman tree, the huff_branchCodeLeftAligned can be constants and you can optimize the binary tree branch structure, so you can do codegen to make an optimal decoder, like :

if ( bits < 0xA01230000 )
  if ( bits < 0x401230000 )
    // decode codeLen = 4 
    // decode codeLen = 5

There's one final issue. Because the bit word is left aligned in the machine word, we can't make any branchCode value for "all bits are branches". In particular, with this compare :

if ( bits < huff_branchCodeLeftAligned[11] )

when bits is all 1's (0xFFF...) we can't make a branchCodeLeftAligned that returns true. There are a few solutions for this, one is to use <= branchCodeMinusOne , but then you have to make sure that you start with codeLen >= minCodeLen , because below that branchCode is zero and you have a problem. Another solution is to make sure bits is never full; that is, if you have a 32 bit word, then only allow 31 bits (or less) in your variable bit register. The final solution is the one I prefer :

The actual case of bits = all 1's in a 32 bit register should only occur 1 in 4 billion times, so we don't have to handle it fast, we just have to handle it correctly. So I suggest you do the unrolled checks for decode, and then after you have checked all codelens up to maximum allowed (24 or 32 or whatever your max codelen is), if bits was ~0 , it will have not decoded, so you can do :

if ( bits < huff_branchCodeLeftAligned[21] ) .. return decoded 21 bit code
if ( bits < huff_branchCodeLeftAligned[22] ) .. return decoded 22 bit code
if ( bits < huff_branchCodeLeftAligned[23] ) .. return decoded 23 bit code
if ( bits < huff_branchCodeLeftAligned[24] ) .. return decoded 24 bit code
// 24 is my max allowed code len

// failed to do any decode ! must be the bad case
assert( bits == (~0) );
// huff code must be maxCodeLen (not necessarily 24 - the maximum actually used in this huffman table)
// and symbol must be the last ( least likely one )
// return decoded maxCodeLen code;
return symbolUnsort[ numSymbols-1 ];

Finally note that in most cases it is slightly more efficient to use a "semi-huffman" code rather than a true huffman code. The semi-huffman code I propose is huffman up to codelen = 16 or so, and then simply flat after that. In most cases this affects compression ratio microscopically (because the long code lengths are very rare) but can reduce complexity a lot. How does it reduce complexity?

1. You don't have to do the proper length-limitted huffman tree construction. Instead you just build a normal unlimitted huffman code length set, and then anything with code length >= 16 is flagged as part of the semi-huffman set.

2. When you transmit your code lengths, you don't have to send the lengths > 16, you just send lengths in [0,16] (0 means doesn't occur).

3. Your unrolled decoder only has to go up to 16 branches (if you table-accelerate, you do 8 bits by table then 8 more branches).

4. Then in the final case instead of just handling the ~0 case you handle all the "long" symbols with a flat code.


nothings said...

did you lose a while, and a shift in the final table lookup?

cbloom said...

eh, yeah..

Thatcher Ulrich said...

When did you discover the paper?

cbloom said...

A few weeks ago, after we had reinvented the same idea. I actually found it when I was looking for length-limited huffman code building stuff cuz I knew Moffat had a good paper on that topic.

RichG said...

FWIW, this is the same technique used by the LZHAM codec's decoder:

cbloom said...

Hi Rich, I just downloaded LZHAM.

It looks pretty much like LZMA but with the arithmetic for match lens and literals replaced with quasi-adaptive-huffman , yes?

The code is actually readable for one thing which is nice.

BTW it would be nice if you write a little about it. A few things in particular have always befuddled me in LZMA :

1. Where does the state transition table come from exactly? Do you understand it intuitively?

2. How does his optimal parse work exactly? It looks like you are doing a full forward Dijkstra on some # of bytes forward (up to 1024), which is pretty damn slow. LZMA's is much faster than that so it must be something else.

cbloom said...

BTW I suspect that the "polar codes" is actually a Shannon-Fano or BRE code. You're using a pow2 "deferred summation" model.

RichG said...

Hi Charles,
I copied the state transition table straight from Igor's decoder (lzmaDec.c). I haven't spent any time trying to really understand it yet.

His coding style makes my head hurt. I've tried several times to comprehend wtf is really going on in there.

Yea, the forward Dijkstra search is very expensive, but without it lzham's ratio isn't competitive. My flexible parsing experiments where mostly failures (in a comp. ratio sense). I'm still looking for a smarter way to do this.

The latest development version of lzham accelerates parsing by breaking up the lookahead buffer into (up to) four 3KB blocks and assigns each block to a worker thread to be independently parsed starting from the same state. After all blocks are parsed the main thread takes the lz decisions from each block and codes them using the real state. I added a special code that is issued after each block that tells the decompressor to reset its cur_state and match history.

I also moved a lot of redundant get_cost() calculations out of the innermost matchlen truncation loop, and simplified/tweaked a bunch of stuff.

The matchlen truncation loop is probably the next area I'm going to explore. Exhaustively trying all possible truncations doesn't seem worth the effort.

With these changes the new compressor is 3-4x faster on a core i7, and is now faster than 7za -mx9. But I'm still using way more overall CPU than LZMA.


cbloom said...

"The matchlen truncation loop is probably the next area I'm going to explore. Exhaustively trying all possible truncations doesn't seem worth the effort."

I have a good solution to this for the backwards optimal parse, but it doesn't work with forwards parse. It's "LowestCostPrecedingPos"
( http://cbloomrants.blogspot.com/2008/10/10-10-08-2.html ). I think a similar idea might work for forwards actually, but it requires a bit more book-keeping.

old rants