Comments on cbloom rants: 05-20-09 - Some Huffman notes

"You're avoiding branches on the input da...

2010-06-02T21:14:19.058-07:00

"You're avoiding branches on the input data, but don't you get those branches back (almost as unpredictable, although fewer since they're per-symbol instead of per-bit)"

It is inherently impossible to eliminate unpredictable branches in a decompressor. (in fact the unpredictability of the branches is related to the efficiency of the compressor! the memcpy decoder is the only decoder without branches and it also gets zero compression)

I believe this way is the minimum # of branches in a huffman decode (or very close anyway).

The IJG method actually has lots of branches that you aren't accounting for. Filling the bit buffer has branches, how many bits of the buffer are used to decode a symbol has branches, etc.

You're avoiding branches on the input data, bu...

2009-06-03T05:30:54.515-07:00

You're avoiding branches on the input data, but don't you get those branches back (almost as unpredictable, although fewer since they're per-symbol instead of per-bit) in the for loop since you can get multiple outputs from each byte?

libjpeg uses an 8-bit lookup table but without the same kind of state, just filling the buffer to make sure there's always enough. Of course, then you get a branch on “do we need to refill the buffer?”...

Your approach does has use even in a "normal&...

2009-05-21T17:17:29.064-07:00

Your approach does has use even in a "normal" one-symbol eating canonical huffman decoder (I take the IJG-decoder as an example). You normally have to maintain a little buffer-register holding enough left-align bits for the LUT, then you shift the buffer bit-wise to the left (as much bits as you consumed). Your approach allows to get rid of the buffer-register cycling, which means (algorithmically) a lot of conditional bit-shifting would be converted to conditional loops (namely all the shifting into the register, the shifting out of the register is still necessary). That allows to do full adress & size aligned cache-bypass reads (huffman-code streams are hardly ever write-back) from the code-streams.

So under the condition that you have a better instruction-pipe than ALU (loops faster than bit-ops), or the ratio register to cache/mem is high (read-manipulate-write oncache takes long, caches are too full, reading from memory takes long), you have a gain here even without a multi-symbol pushing decoder.

I couldn't say the gain is in fantastic dimensions, but it may work given the right architectural conditions.