# cbloom rants

## 5/20/2009

### 05-20-09 - Some Huffman notes

For the record, you can do a very fast Huffman decoder by always reading 8 bits and storing the bit-reading state in the table pointer. The way you do this is by creating a seperate 8-bit lookup table for each possible current bit-reading state.

You have some # of bits [0,8] that have not yet been used for output. For each # of bits there are various values those bits can have, for 0 bits its {} , for 1 bit it's {0,1}, etc. So your total number of states is 256+128+64... = 512.

So at each step you have your current table pointer which is your decode table plus holds your current bit buffer state. You read 8 more bits and look up in that table. The table tells you what symbols can be made from the combination of {past bit buffer + new 8 bits}. Those symbols may not use up all the bits that you gave it. Whatever bits are left specify a new state.

I'll show an example for clarity :

```
Say you read 3 bits at a time instead of 8 (obviously we picked 8 because it's byte aligned)

If your Huffman code is :

0 - a
10 - b
11 - c

You make these tables :

{no bits} :

000 : aaa + {no bits}
001 : aa  + {1}
010 : ab  + {no bits}
011 : ac  + {no bits}
100 : ba  + {no bits}
101 : b   + {1}
110 : ca  + {no bits}
111 : c   + {1}

{1} :  (means ther's 1 bit pending of value 1)

000 : baa + {no bits}
001 : ba  + {1}
010 : bb  + {no bits}
011 : bc  + {no bits}
100 : caa + {no bits}
101 : ca  + {1}
110 : cb  + {no bits}
111 : cc  + {no bits}

```

The decoder code looks like this :

```
struct DecodeTableItem
{
U8 numOut;
U8 output[MAX_OUT];
DecodeTableItem * nextTable;
};

DecodeTableItem * pTable = table_no_bits_pending;

for(;;)
{
U8 byte = *inptr++;

int num = pTable[byte].numOut;

for(int i=0;i < num;i++)
*outptr++ = pTable[byte].output[i];

pTable = pTable[byte].nextTable;
}

```

It's actually a tiny bit more complicated than this because you have to handle codes longer than 8 bits. That just means "numOut" is 0 and you have to fall out to a brute force loop. You could handle that case in this same system if you make tables for 9,10,11+ bits in the queue, but you don't want to make all those tables. (you also don't actually need to make the 8 bits in the queue table either). The "numOut = 0" loop will read more bytes until it has enough bits to decode a symbol, output that symbol, and then there will be 0-7 remainder bits that will select the next table.

BTW in practice this is pretty useless because we almost never want to do one giant huffman array decode, we have our huffman codes interleaved with other kinds of coding, or other huffman trees, dependent lookups, etc. (this is also useless because it's too many tables and they are too slow to make on the fly unless you are decoding a ton of data).

BTW old-school compression people might recognize that this is basically a Howard-Vitter Quasi-Arithmetic coder. In the end, every decompressor is just a state machine that's fed by input bits or bytes from the coded stream. What we are doing here is basically explicitly turning our algorithm into that state machine. The Howard-Vitter QA did the same thing for a simplified version of arithmetic coding.

There are a lot of wins for explicitly modeling your decoder as a state machine triggered on input bits. By design of the compressor, the bits in the coded stream are random, which means any branch on them in the decoder is totally unpredictable. You have to eat that unpredictable branch - but you only want one of them, not a whole bunch. In normal function-abstracted decompressors you read the bits in some coder and generate a result and then act on that result outside - you're essentially eating the branch twice or more.

A typical arithmetic decoder does something like :

```
Read bits from stream into "Code"

Test "Code" against Range and probabilities to find Symbol
-> this is the main branch on the bit stream

if ( Range too small )
do Renormalization

branch on Symbol to update probabilities

do conditional output using Symbol
eg. Symbol might be a runlength or something that makes us branch again

```
The state-based binary arithmetic coders like the QM or MQ coder combine the bit-reading, renormalization, and model updates into a single state machine.

Ethatron said...

Your approach does has use even in a "normal" one-symbol eating canonical huffman decoder (I take the IJG-decoder as an example). You normally have to maintain a little buffer-register holding enough left-align bits for the LUT, then you shift the buffer bit-wise to the left (as much bits as you consumed). Your approach allows to get rid of the buffer-register cycling, which means (algorithmically) a lot of conditional bit-shifting would be converted to conditional loops (namely all the shifting into the register, the shifting out of the register is still necessary). That allows to do full adress & size aligned cache-bypass reads (huffman-code streams are hardly ever write-back) from the code-streams.

So under the condition that you have a better instruction-pipe than ALU (loops faster than bit-ops), or the ratio register to cache/mem is high (read-manipulate-write oncache takes long, caches are too full, reading from memory takes long), you have a gain here even without a multi-symbol pushing decoder.

I couldn't say the gain is in fantastic dimensions, but it may work given the right architectural conditions.

Unknown said...

You're avoiding branches on the input data, but don't you get those branches back (almost as unpredictable, although fewer since they're per-symbol instead of per-bit) in the for loop since you can get multiple outputs from each byte?

libjpeg uses an 8-bit lookup table but without the same kind of state, just filling the buffer to make sure there's always enough. Of course, then you get a branch on “do we need to refill the buffer?”...

cbloom said...

"You're avoiding branches on the input data, but don't you get those branches back (almost as unpredictable, although fewer since they're per-symbol instead of per-bit)"

It is inherently impossible to eliminate unpredictable branches in a decompressor. (in fact the unpredictability of the branches is related to the efficiency of the compressor! the memcpy decoder is the only decoder without branches and it also gets zero compression)

I believe this way is the minimum # of branches in a huffman decode (or very close anyway).

The IJG method actually has lots of branches that you aren't accounting for. Filling the bit buffer has branches, how many bits of the buffer are used to decode a symbol has branches, etc.