In the meantime I finally wrote a length-limited Huffman code builder. Everyone uses the "package-merge" algorithm (see Wikipedia, or the paper "A Fast and Space-Economical Algorithm for Length-Limited Coding" by Moffat et al.; the Larmore/Hirschberg paper is impenetrable).
Here's my summary :
Explicitly, what we are trying to do is solve:

Cost = Sum[ L_i * C_i ]

C_i = count of ith symbol
L_i = Huffman code length

given C_i, find L_i to minimize Cost, constrained such that L_i <= L_max and Sum[ 2^-L_i ] = 1 (the Kraft prefix constraint).

This is solved by construction of the "coin collector problem". The Cost that we minimize is the real (numismatic) value of the coins that the collector pays out: C_i is the numismatic value of the ith coin, and L_i is the number of times the collector uses a coin of type i, so Cost = Sum[ L_i * C_i ] is his total cost. For each value C_i, the coins have face values 2^-1, 2^-2, 2^-3, ..., down to 2^-L_max. If the coin collector pays out a total face value of (N-1), then he creates a Kraft-correct prefix code.

The coin collector problem itself is simple & obvious; you just want to pay out from your 2^-1 value items; an item is either a 2^-1 value coin, or a pair of 2^-2 value items; pick the one with lower numismatic value.

The fact that this creates a prefix code is a bit more obscure, but you can prove it by the Kraft property. If you start with all lengths = 0, then K = Sum[ 2^-L ] = N. Now add an item from the 2^-1 list. If it's a leaf, L changes from 0 to 1, so K -= 1/2. If it's a node, then it brings in two nodes at a lower level, equivalent to two leaves at that level, so L changes from 1 to 2 twice, and K -= 1/2 then too. So when the payout takes (2N-2) items from the last list, you get K -= 1/2 * (2N-2), or K -= (N-1), hence K = 1 afterward.

-----------------------------------

BTW you can do this in a dynamic programming sort of way where only the active front is needed; it has the same run time but lower storage requirements. You start at the 2^-1 (final) list. You ask: what's the next node of this list? It's either a symbol, or it's made from the first two nodes of the prior list. So you get the first two nodes of the prior list. When you select a node into the final list, that choice is committed, and all its children in the earlier lists become final; they can now just do their increments onto CodeLen and be deleted. If you select a symbol into the final list, then the nodes you looked at earlier stick around so you can look at them next time.
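Here's a sketch in Python of the list construction just described, under the simplifying assumption that each item carries the full multiset of symbols packaged inside it (the space-economical variants avoid exactly that; all names here are mine):

```python
def package_merge(counts, L_max):
    """Length-limited Huffman code lengths via package-merge.

    Build the 2^-L_max list from the leaves, repeatedly package pairs
    and merge with a fresh set of leaves, then pay out 2*(n-1) items
    from the final 2^-1 list and count how often each symbol appears.
    """
    n = len(counts)
    assert n <= (1 << L_max), "need n <= 2^L_max or no prefix code fits"
    # an item is (numismatic value, tuple of the symbols it contains)
    leaves = sorted((c, (i,)) for i, c in enumerate(counts))
    items = leaves                       # the 2^-L_max list
    for _ in range(L_max - 1):           # build lists up to face value 2^-1
        # "package": pair adjacent items; a dangling odd item is dropped
        packages = [(a[0] + b[0], a[1] + b[1])
                    for a, b in zip(items[0::2], items[1::2])]
        # "merge": next list is leaves plus packages, by numismatic value
        items = sorted(leaves + packages)
    # pay out total face value (n-1): take 2*(n-1) items from the 2^-1 list
    lengths = [0] * n
    for _, symbols in items[:2 * (n - 1)]:
        for s in symbols:
            lengths[s] += 1
    return lengths
```

For example, counts [1, 1, 2, 4] with L_max = 3 come out as lengths [3, 3, 2, 1], which satisfy the Kraft constraint with equality.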
Okay, so it all works fine, but it bothers me.
I can see that "package-merge" solves the "coin collector problem". In fact, that's obvious, it's the obvious way to solve that problem. I can also see that the minimization of the real value cost in the "coin collector problem" can be made equivalent to the minimization of the total code length, which is what we want for Huffman code building. Okay. And I can understand the proof that the codes built in this way are prefix. But it's all very indirect and roundabout.
What I can't see is a way to understand the "package-merge" algorithm directly in terms of building Huffman codes. Obviously you can see certain things that are suggestive - making pairs of items with minimum cost is a lot like how you would build a Huffman tree. The funny thing is that the pairing here is not actually building the Huffman tree - the Huffman tree is never actually made; instead we make code lengths by counting the number of times each symbol appears in the active set. Even that we can sort of understand intuitively - if a symbol has a very low count, it will appear in all L lists, so it will have a code length of L, the max. If it has a higher count, it will get bumped out of some of the lists by packages of lower-count symbols, so it will have a length < L. So that sort of makes sense, but it just makes me even more unhappy that I can't quite see it.
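One way to sanity-check that intuition without package-merge at all: for a tiny alphabet you can brute-force the original minimization directly and watch the low-count symbols land at the max length. A sketch (the example counts are mine, and this is exponential, so toy sizes only):

```python
from itertools import product
from fractions import Fraction

def best_limited_lengths(counts, L_max):
    # Brute force the original problem: minimize Sum[ L_i * C_i ]
    # over all assignments with L_i <= L_max and Sum[ 2^-L_i ] = 1.
    best_cost, best_ls = None, None
    for ls in product(range(1, L_max + 1), repeat=len(counts)):
        # exact rational Kraft sum, to test equality safely
        if sum(Fraction(1, 2 ** l) for l in ls) == 1:
            cost = sum(l * c for l, c in zip(ls, counts))
            if best_cost is None or cost < best_cost:
                best_cost, best_ls = cost, ls
    return best_cost, best_ls

cost, ls = best_limited_lengths([1, 2, 4, 8, 16], 3)
# the four rarest symbols all get length 3 = L_max; the common one gets 1
```

With counts [1, 2, 4, 8, 16] and L_max = 3 the unique optimum is lengths (3, 3, 3, 3, 1) at cost 61 - exactly the pattern the counting argument predicts.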