
05-22-09 - A little more Huffman

.. okay since I got started on this, a few other notes. This is ancient stuff, in fact it's been in the "Huffman2" code in crblib for 10+ years, and it used to be common knowledge, but it's some of that art that's sort of drifted away.

For one thing, Huffman codes are specified only by the code lengths. You don't need to store the bit patterns. Of course then you need a convention for how the bit patterns are made.

The "canonical" (aka standard way) to make the codes is to put numerically increasing codes in order with numerically increasing symbols of the same code length. I helps to think about the bit codes as if the top bit is transmitted first , even if you don't actually do that. That is, if the Huffman code is 011 that means first we send a 0 bit, then two ones, and that has value "3" in the integer register.

You can create the canonical code from the lengths in O(N) for N symbols using a simple loop like this :


nextCode = 0;

for( len between min and max )
{
    codePrefix[len] = nextCode;   // first canonical code value of this length
    nextCode = (nextCode + numCodesOfLen[len]) << 1;
}

for( all symbols, in increasing symbol order )
{
    code[sym] = codePrefix[ codeLen[sym] ];
    codePrefix[ codeLen[sym] ] ++;
}

this looks a bit complex, but it's easy to see what's going on. If you imagine that you had your symbols sorted first by code length, and then by symbol id, then you are simply handing out codes in numerical order, and shifting left by the length difference each time the code length bumps up. That is :

curCode = 0;
for( all symbols, sorted by codeLen then by symbol id )
{
    code[sym] = curCode;
    curCode ++;
    // here sym+1 means the next symbol in that sorted order
    curCode <<= codeLen[sym+1] - codeLen[sym];
}

An example :

code lens :

c : 1
b : 2
a : 3
d : 3

[code=0]

c : [0]
    code ++ [1]
    len diff = (2-1) so code <<= 1 [10]
b : [10]
    code ++ [11]
    len diff = (3-2) so code <<= 1 [110]
a : [110]
    code ++ [111]
    no len diff
d : [111]

very neat.
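To make that concrete, here's a compilable C sketch of those two passes run on the example above (a standalone illustration with made-up names, not the crblib code) :


#include <stdio.h>

#define MAX_CODE_LEN 16

int main(void)
{
    // the example alphabet : code lens for a,b,c,d
    const int numSymbols = 4;
    int codeLen[4] = { 3, 2, 1, 3 };
    int code[4];

    // count the # of codes of each length
    int numCodesOfLen[MAX_CODE_LEN+1] = { 0 };
    for( int sym = 0; sym < numSymbols; sym++ )
        numCodesOfLen[ codeLen[sym] ] ++;

    // pass 1 : first canonical code value at each length
    int codePrefix[MAX_CODE_LEN+1];
    int nextCode = 0;
    for( int len = 1; len <= MAX_CODE_LEN; len++ )
    {
        codePrefix[len] = nextCode;
        nextCode = (nextCode + numCodesOfLen[len]) << 1;
    }

    // pass 2 : hand out codes in increasing symbol order
    for( int sym = 0; sym < numSymbols; sym++ )
        code[sym] = codePrefix[ codeLen[sym] ] ++;

    // prints c = 0 , b = 10 (2) , a = 110 (6) , d = 111 (7)
    for( int sym = 0; sym < numSymbols; sym++ )
        printf("%c : code %d , len %d\n", 'a' + sym, code[sym], codeLen[sym]);

    return 0;
}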

The other thing is that to decode you don't actually need to build any tree structure. Huffman trees are specified entirely by the # of codes of each length, so you only need that information, not all the branches. It's like a balanced heap kind of a thing : you don't actually build a tree structure, you just store the nodes in an array and the tree structure is implicit.

You can do an O(H) decoder (H is the entropy, which is the average # of bits needed to decode a symbol) using only 2 bytes per symbol of table (for 12 bit symbols and max code lengths of 16). (Note that H <= log2(N) ; a perfectly balanced alphabet is the slowest case, and it takes log2(N) bits.)


a node is :

struct Node
{
    unsigned short CodeLen : 4;    // 4 bits holds lens up to 15 ; store len-1 here if you really need 16
    unsigned short Symbol  : 12;
};

You just store them sorted by codelen and then by code within that len (aka "canonical Huffman order"). So they are sitting in memory in the same order you would draw the tree :

0 : c
10 : b
11 : a

would be

[ 1 : c ]
[ 2 : b ]
[ 2 : a ]

then the tree descent is completely determined; you don't need to store any child pointers, because the bits you have read so far are the index into the tree. You can find the node you should be at, at any time, by doing :

index = StartIndex[ codeLen ] + code;

StartIndex[codeLen] contains the index of the first code of that length *minus* the value of the first code bits at that len. You could compute each StartIndex on the fly, but in practice you just put them in a little table; you only need 16 of them, one for each code len.
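Building StartIndex is the same sort of loop as the canonical code construction; here's a sketch (numCodesOfLen as before; note the entries can go negative, so they must be signed) :


// StartIndex[len] = (index in Table[] of the first node of that len)
//                 - (value of the first canonical code of that len)
void BuildStartIndex( int StartIndex[], const int numCodesOfLen[], int maxCodeLen )
{
    int firstIndex = 0;  // where nodes of this len start in Table[]
    int firstCode  = 0;  // first canonical code value at this len
    for( int len = 1; len <= maxCodeLen; len++ )
    {
        StartIndex[len] = firstIndex - firstCode;
        firstIndex += numCodesOfLen[len];
        firstCode  = (firstCode + numCodesOfLen[len]) << 1;
    }
}

On the c,b,a,d example this gives StartIndex[1] = 0, StartIndex[2] = -1 (the first len-2 node is at index 1, and the first len-2 code is 10 = 2), and StartIndex[3] = -4.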

So the full decoder looks like :


code = 0;
codeLen = 0;

for(;;)
{
    code <<= 1;
    code |= readbit();
    codeLen ++;

    // is this a real code of this length ? if so its node is right there
    if ( Table[ StartIndex[ codeLen ] + code ].CodeLen == codeLen )
        return Table[ StartIndex[ codeLen ] + code ].Symbol;
}

obviously you can improve this in various ways, but it's interesting to understand the Huffman tree structure this way.
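One step the decoder glosses over is getting Table[] into canonical order in the first place. Since the sort key is just (codeLen, symbol), a stable counting sort does it in O(N); a sketch, using the Node struct from above :


// lay the nodes out in canonical Huffman order :
// sorted by codeLen, and by symbol id within each len
void BuildTable( struct Node * Table, const int * codeLen, int numSymbols,
                 const int numCodesOfLen[], int maxCodeLen )
{
    int next[MAX_CODE_LEN+1];  // next free Table[] slot for each code length
    int firstIndex = 0;
    for( int len = 1; len <= maxCodeLen; len++ )
    {
        next[len] = firstIndex;
        firstIndex += numCodesOfLen[len];
    }

    // walking symbols in increasing id keeps equal-len symbols in symbol
    // order, which matches how the canonical codes were handed out
    for( int sym = 0; sym < numSymbols; sym++ )
    {
        struct Node * n = & Table[ next[ codeLen[sym] ] ++ ];
        n->CodeLen = codeLen[sym];
        n->Symbol  = sym;
    }
}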

The thing is, if you just list out the Huffman codes in order, they numerically increase :


0
10
110
111

= 0, 2, 6, 7

If you have a prefix that doesn't yet decode to anything, e.g. if you have read "11" so far, that will always be a value that numerically doesn't exist as a code of that length in the table.

Finally, if your symbols naturally occur in sorted order, e.g. symbol 0 is most likely, then 1, etc. (as AC coefficients in the DCT do, since they have a roughly geometric distribution), then you can do a Huffman decoder that's O(H) with just the # of codes of each length !
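Here's a sketch of that : when symbol id == canonical order index, Table[] is the identity and disappears entirely; you keep only the per-length counts and the two little arrays derived from them (codePrefix and StartIndex, built as above) :


extern int readbit(void);  // bit source, as in the decoder above

// decode one symbol using only the per-length code counts ;
// valid only when symbol ids are already in canonical order
int DecodeSorted( const int codePrefix[], const int numCodesOfLen[],
                  const int StartIndex[] )
{
    int code = 0;
    for( int codeLen = 1; ; codeLen++ )
    {
        code = (code << 1) | readbit();
        // while decoding a valid stream, code is always >= codePrefix[codeLen] ,
        // so one compare tells you if it's a real code of this length
        if ( code < codePrefix[codeLen] + numCodesOfLen[codeLen] )
            return StartIndex[codeLen] + code;  // canonical index == symbol id
    }
}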

