7/02/2010

07-02-10 - Length-Limited Huffman Codes

I have something interesting to write about Huffman decoders, but that will have to wait a bit.

In the meantime I finally wrote a length-limited huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia, or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et al.; the Larmore/Hirschberg paper is impenetrable).

Here's my summary :


Explicitly, what we are trying to solve is :

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = huffman code length

given C_i, find L_i to minimize Cost

constrained such that L_i <= L_max

and 

Sum[ 2 ^ - L_i ] = 1
(Kraft prefix constraint)
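
To pin down the two quantities being traded off, here's a tiny sketch (Python, my own helper names, not from any library) that the later code examples in this post reuse :

    from fractions import Fraction

    def cost(counts, lengths):
        # total coded size in bits : Cost = Sum[ L_i * C_i ]
        return sum(L * C for L, C in zip(lengths, counts))

    def kraft_sum(lengths):
        # exact Kraft sum : Sum[ 2^-L_i ] ; equals 1 for a full prefix code
        return sum(Fraction(1, 2 ** L) for L in lengths)

(Fractions keep the Kraft sum exact, so the "= 1" constraint can be checked with no floating point slop.)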

This is solved by construction of the "coin collector problem"

The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out
C_i is the numismatic value of the ith coin
L_i is the number of times the collector uses a coin of type i
so Cost = Sum[ L_i * C_i ] is his total cost.

For each symbol with numismatic value C_i, there are coins of face value 2^-1, 2^-2, 2^-3, ..., 2^-L_max
If the coin collector pays out total face value of (n-1) , then he creates a Kraft correct prefix code

The coin collector problem is simple & obvious ; you just want to pay out from your 2^-1 value items.
An item of face value 2^-1 is either an actual 2^-1 coin, or a package of two 2^-2 value items (and so on recursively down the levels) ; always pick the option with lower numismatic value.
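
To make that concrete, here's a minimal sketch of the whole construction in Python (my own illustration, not code from the paper). Each item carries the tuple of symbols it covers, purely so the appearance counts can be read off at the end :

    def package_merge(counts, max_len):
        # counts : positive symbol frequencies ; returns code lengths L_i
        n = len(counts)
        assert n >= 2 and (1 << max_len) >= n  # need 2^L_max >= N to be feasible
        # one coin per symbol per level ; numismatic value = C_i
        leaves = sorted((c, (i,)) for i, c in enumerate(counts))
        # start at the deepest level (face value 2^-max_len) : leaves only
        level = list(leaves)
        for _ in range(max_len - 1):
            # "package" : pair up adjacent items of the current level
            packages = [(level[j][0] + level[j + 1][0], level[j][1] + level[j + 1][1])
                        for j in range(0, len(level) - 1, 2)]
            # "merge" : combine the packages with the leaves of the next level up
            level = sorted(leaves + packages)
        # pay out total face value (n-1) from the 2^-1 list :
        # take the 2n-2 cheapest items ; L_i = number of appearances of symbol i
        lengths = [0] * n
        for value, syms in level[:2 * n - 2]:
            for s in syms:
                lengths[s] += 1
        return lengths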

The fact that this creates a prefix code is a bit more obscure,
but you can prove it via the Kraft property :

If you start with all lengths = 0 , then K = sum[2^-L] = N
Now add an item from the 2^-1 list
if it's a leaf, L changes from 0 to 1, so K -= 1/2
if it's a package, then it brings in two items at the next level down,
    equivalent to two leaves at that level, so L changes from 1 to 2 twice, and K -= 1/2 there too
    
so if you select (2N-2) items from the last list, you get K -= 1/2 * (2N-2) , or K -= N-1 , hence K = 1 afterward
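
With the sketches above you can check this numerically ; the counts here are invented, not from any real file :

    counts = [20, 17, 6, 3, 2, 2, 1, 1]        # made-up example frequencies
    lengths = package_merge(counts, max_len=4)
    assert kraft_sum(lengths) == 1             # K = 1 exactly, as argued above
    print(lengths, cost(counts, lengths))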

-----------------------------------
BTW you can do this in a dynamic programming sort of way where only the active front is needed; it has the same
run time but lower storage requirements.

You start at the 2^-1 (final) list.  You ask : what's the next node of this list?  It's either a symbol or
  made from the first two nodes of the prior list.  So you get the first two nodes of the prior list.
When you select a node into the final list, that is committed, and all its children in the earlier lists
  become final; they can now just do their increments onto CodeLen and be deleted.
If you select a symbol into the final list, then the nodes that you looked at earlier stick around so you
  can look at them next time.
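
One way to sketch that front-only variant is with lazy streams (my own generator-based take on the idea, not the committed-increment scheme described above) : each list is a sorted stream that only materializes its head, pulling from the deeper list on demand.

    import heapq
    from itertools import islice

    def level_stream(leaves, depth, max_len):
        # lazily yield the sorted items of the 2^-depth list
        if depth == max_len:
            yield from leaves
            return
        def packages():
            it = level_stream(leaves, depth + 1, max_len)
            for a, b in zip(it, it):   # pair consecutive items of the deeper list
                yield (a[0] + b[0], a[1] + b[1])
        # the 2^-depth list = leaves merged with packages of the 2^-(depth+1) list
        yield from heapq.merge(iter(leaves), packages())

    def package_merge_lazy(counts, max_len):
        n = len(counts)
        leaves = sorted((c, (i,)) for i, c in enumerate(counts))
        lengths = [0] * n
        # only the first 2n-2 items of the final (2^-1) list are ever forced
        for value, syms in islice(level_stream(leaves, 1, max_len), 2 * n - 2):
            for s in syms:
                lengths[s] += 1
        return lengths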

Okay, so it all works fine, but it bothers me.

I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious; it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in the "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and roundabout.

What I can't see is a way to understand the "package-merge" algorithm directly in terms of building huffman codes. Obviously you can see certain things that are suggestive - making pairs of items with minimum cost is a lot like how you would build a huffman tree. The funny thing is that the pairing here is not actually building the huffman tree - the huffman tree is never actually made; instead we make code lengths by counting the number of times each symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has a very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. That sort of makes sense, but it just makes me even more unhappy that I can't quite see it.

9 comments:

Anonymous said...

I've never actually looked at this either--are the codes significantly different from what you'd get by building the regular huffman tree and then shuffling the leaves upwards?

cbloom said...

"--are the codes significantly different from what you'd get by building the regular huffman tree and then shuffling the leaves upwards?"

Well of course "it depends".

The short answer is that yes, they are very different, but it also doesn't matter.

The question is how much pressure you are putting on the code length limit. eg. what % of the original symbols were over length limit. If the length limit is reasonably long, then that % will be very small because the least likely symbols are the long ones.

So because of that, even though the heuristic way may be far from optimal, it just doesn't matter.

cbloom said...

For example, on a real file ("pic") with limit = 16, these are the number of codes of each length generated by the heuristic or optimal builder (listed as code length : count of codes , percent of symbols) :

Heuristic :

1 : 1 , 0.628931%
4 : 1 , 0.628931%
5 : 2 , 1.257862%
6 : 14 , 8.805032%
7 : 11 , 6.918239%
8 : 9 , 5.660378%
9 : 9 , 5.660378%
10 : 7 , 4.402516%
11 : 10 , 6.289308%
12 : 9 , 5.660378%
13 : 13 , 8.176101%
14 : 20 , 12.578616%
15 : 3 , 1.886792%
16 : 50 , 31.446541%

Optimal :

1 : 1 , 0.628931%
4 : 1 , 0.628931%
5 : 2 , 1.257862%
6 : 14 , 8.805032%
7 : 11 , 6.918239%
8 : 9 , 5.660378%
9 : 9 , 5.660378%
10 : 7 , 4.402516%
11 : 10 , 6.289308%
12 : 9 , 5.660378%
13 : 11 , 6.918239%
14 : 21 , 13.207547%
15 : 14 , 8.805032%
16 : 40 , 25.157232%

you can see they are in fact quite different, but in both cases the actual coded entropy is exactly 1.661

cbloom said...

If you artificially put a lot of pressure on the system you can get big differences. For example with a code length limit of 8 :

heuristic :

2 : 1 , 0.628931%
7 : 34 , 21.383648%
8 : 124 , 77.987419%
entropy : 2.655

optimal :

2 : 1 , 0.628931%
5 : 1 , 0.628931%
6 : 5 , 3.144654%
7 : 12 , 7.547170%
8 : 140 , 88.050316%
entropy : 2.607

Anonymous said...

What does the unconstrained version of that look like? I'm confused why a heuristic algorithm wouldn't have preserved a shorter code for the second symbol.

cbloom said...

" I'm confused why a heuristic algorithm wouldn't have preserved a shorter code for the second symbol. "

Because it's a heuristic, it's not ideal.

It clamps all the over-long code lens to 8. Then it has to fix the Kraft constraint by making other codes longer. It starts with the lowest count codes. In this case there is a real problem with that length 1 code, but before it gets there it first bumps the length 4 code all the way up to 7. But that isn't enough, so it winds up making the length 1 code into length 2.

Then it does the backwards pass to shorten some codes. You'd like it to shorten that second most likely one back to 5, but it can't do that unless it also lengthens some of the 7's up to 8, which it isn't considering any more.

The way to solve this correctly of course is just to do package merge, which is addressed in the follow-up to this post.
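
For reference, that kind of clamp-and-repair heuristic looks something like this sketch (my own reconstruction of the general idea; real implementations differ in the details) :

    from fractions import Fraction

    def limit_lengths_heuristic(counts, lengths, max_len):
        # clamp over-long codes, then repair the Kraft sum
        L = [min(l, max_len) for l in lengths]
        K = sum(Fraction(1, 2 ** l) for l in L)    # K > 1 now if anything was clamped
        order = sorted(range(len(counts)), key=lambda i: counts[i])
        # forward pass : lengthen the lowest-count codes until Kraft fits
        for i in order:
            while K > 1 and L[i] < max_len:
                K -= Fraction(1, 2 ** (L[i] + 1))  # lengthening by 1 shrinks K by 2^-(L+1)
                L[i] += 1
        # backward pass : shorten high-count codes while Kraft stays <= 1
        for i in reversed(order):
            while L[i] > 1 and K + Fraction(1, 2 ** L[i]) <= 1:
                K += Fraction(1, 2 ** L[i])        # shortening by 1 grows K by 2^-L
                L[i] -= 1
        return L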

Anonymous said...

"In this case there is a real problem with that length 1 code, but before it gets there it first bumps the length 4 code all the way up to 7."

This is what I missed, thanks.

Anonymous said...

I wonder if there's a better heuristic algorithm where you figure out the highest leaf you'll need to split/move-down to have enough room, you move that down, repeat, so that you don't move things down too far.

It's just it seems like, even if you throw out the actual frequency data, the tree itself gives you information about the approximate frequencies that you ought to be able to leverage straightforwardly. (Maybe nobody's bothered looking at it because the gain is minimal in practice?)

cbloom said...

Yeah there's a better algorithm, it's called package-merge ;)
