
08-10-10 - Transmission of Huffman Trees

Transmission of Huffman Trees is one of those peripheral problems of compression that has never been properly addressed. There's not really any research literature on it, because in the N -> infinity case it disappears ; as the data gets large, the cost of sending the tree amortizes to nothing.

Of course in practice it can be quite important, particularly because we don't actually just send one huffman tree per file. All serious compressors that use huffman resend the tree every so often. For example, to compress bytes, you might extend your alphabet to [0..256] inclusive, where 256 means "end of block" ; when you decode a 256, you are either at the end of the file, or you read another huffman tree and start on the next block. (I wrote about how the encoder might make these block split decisions here ).
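
Concretely, the decode side of that block loop might look like the following sketch. This is not any particular codec's API ; BitReader, HuffTree, bits_eof(), huff_read_tree(), huff_decode_symbol() and output_byte() are all hypothetical stand-ins :

#define EOB_SYMBOL 256   /* alphabet is [0..256] ; 256 = "end of block" */

void decode_blocks( BitReader * br )
{
    while ( ! bits_eof(br) )    /* more blocks remain */
    {
        HuffTree tree;
        huff_read_tree( br, &tree );    /* a fresh tree per block */

        for(;;)
        {
            int sym = huff_decode_symbol( br, &tree );
            if ( sym == EOB_SYMBOL )
                break;      /* end of block ; maybe end of file */
            output_byte( (unsigned char) sym );
        }
    }
}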

So how might you send a Huffman tree?

For background : you obviously do not want to actually send the code values themselves. The Huffman code value should be implied by the symbol identity and the code length. The so-called "canonical" codes are created by assigning codes in numerically increasing order to the symbols of each length, taken in their alphabetical order. You also don't need to send the character counts and have the decoder make its own tree ; you send the tree directly in the form of code lengths.
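
For example, here's a minimal sketch of the canonical construction, essentially the scheme DEFLATE (RFC 1951) uses ; codeLen[s] == 0 means the symbol does not occur, and NUM_SYMBOLS / MAX_CODELEN are just illustrative bounds :

#define NUM_SYMBOLS 256
#define MAX_CODELEN 16

void make_canonical_codes( const int codeLen[NUM_SYMBOLS], unsigned code[NUM_SYMBOLS] )
{
    int lenCount[MAX_CODELEN+1] = {0};
    for (int s = 0; s < NUM_SYMBOLS; s++)
        lenCount[ codeLen[s] ]++;
    lenCount[0] = 0;    /* len 0 = "does not occur" , gets no code */

    /* first code of each length : count up, shifting left as the length grows */
    unsigned nextCode[MAX_CODELEN+1] = {0};
    unsigned c = 0;
    for (int len = 1; len <= MAX_CODELEN; len++)
    {
        c = ( c + lenCount[len-1] ) << 1;
        nextCode[len] = c;
    }

    /* symbols of equal length get consecutive codes in symbol (alphabetical) order */
    for (int s = 0; s < NUM_SYMBOLS; s++)
        if ( codeLen[s] != 0 )
            code[s] = nextCode[ codeLen[s] ]++;
}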

So in order to send a canonical tree, you just have to send the code lens. Now, not all symbols in the alphabet necessarily occur in the block. Those technically have a code length of "infinite" , but most people store them as code length 0, which is invalid for characters that do occur. So you have to code :


which symbols occur at all
which code lengths occur
which symbols that do occur have which code length

Now I'll go over the standard ways of doing this and some ideas for the future.

The most common way is to make the code lengths into an array indexed by symbol and transmit that array. Code lengths are typically in [1,31] (or even less, [1,16] ; by far the most common range is [4,12]), and you use 0 to indicate "symbol does not occur". So you have an array like :


{ 0 , 0 , 0 , 4 , 5 , 7 , 6, 0 , 12 , 5 , 0, 0, 0 ... }

1. Huffman the huffman tree ! This code length array is just another array of symbols to compress - you can of course just run your huffman encoder on that array. In a typical case you might have a symbol count of 256 or 512 or so, so you have to compress that many bytes, and then your "huffman of huffmans" will have a symbol count of only 16 or so, so you can then send the tree for the secondary huffman in a simpler scheme.

2. Delta from neighbor. The code lens tend to be "clumpy" , that is , they have correlation with their neighbors. The typical way to model this is to take the difference of each (non-zero) code len from the previous (non-zero) code len, thus turning them into deltas from neighbors. You can then take these signed deltas, "fold up" the negatives to make them unsigned, and then use one of the other schemes for transmitting them (such as huffman of huffmans). (actually delta from an FIR or IIR filter of the previous values is better). (a sketch of the delta + fold is below, after this list)

3. Runlens for zeros. The zeros (does not occur) in particular tend to be clumpy, so most people send them with a runlen encoder.

4. Runlens of "same". LZX has a special flag to send a bunch of codelens in a row with the same value.

5. Golomb or other "variable length coding" scheme. The advantage of this over Huffman-of-huffmans is that it can be adaptive, by adjusting the golomb parameter as you go (for example by estimating it from the running mean of the values seen so far). The other advantage is you don't have to send a tree for the tree. (a sketch of an adaptive Rice coder, the power-of-2 special case of Golomb, is below after this list)

6. Adaptive Arithmetic Code the tree! Of course if you can Huffman or Golomb code the tree you can arithmetic code it. This actually is not insane ; the reason you're using Huffman over Arithmetic is for speed, but the Huffman will be used on 32k symbols or so, while the arithmetic coder will only be used on the 256-512 or so Huffman code lengths. I don't like this just because it brings in a bunch more code that I then have to maintain and port to all the platforms, but it is appealing because it's much easier to write an efficient adaptive arithmetic coder than to tune any of these other schemes.
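
Here's a minimal sketch of the delta + fold from idea 2, assuming the zeros are handled separately (as in idea 3) ; the initial predictor of 8 is just an assumed typical codelen :

/* zigzag fold : 0,-1,+1,-2,+2,... -> 0,1,2,3,4,... */
static unsigned fold_signed( int delta )
{
    return ( delta >= 0 ) ? 2u*(unsigned)delta : 2u*(unsigned)(-delta) - 1;
}

/* writes one folded delta per non-zero codelen ; returns how many were written */
int delta_fold_lens( const int * codeLen, int numSymbols, unsigned * out )
{
    int prev = 8;   /* initial predictor */
    int n = 0;
    for (int s = 0; s < numSymbols; s++)
    {
        if ( codeLen[s] == 0 ) continue;    /* absent ; sent by the zero runlen coder */
        out[n++] = fold_signed( codeLen[s] - prev );
        prev = codeLen[s];
    }
    return n;
}

The decoder just unfolds each value and adds it back onto its own copy of prev.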
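
And a sketch of idea 5 with Rice codes (Golomb with a power-of-2 parameter), re-estimating the parameter from a running mean ; putbit()/putbits() here just print characters for illustration, and the seed statistics are a guess :

#include <stdio.h>

static void putbit( int b )              { putchar( '0' + b ); }
static void putbits( unsigned v, int k ) { while ( k-- ) putbit( (v>>k)&1 ); }

/* Rice code : unary quotient, then k raw bits of remainder */
static void rice_put( unsigned val, int k )
{
    for ( unsigned q = val >> k; q > 0; q-- ) putbit(1);
    putbit(0);
    putbits( val & ((1u<<k)-1), k );
}

/* code an array of (folded) values, adapting k as we go */
void rice_code_vals( const unsigned * vals, int count )
{
    unsigned sum = 4, n = 1;    /* seed statistics */
    for (int i = 0; i < count; i++)
    {
        int k = 0;
        while ( (n << (k+1)) < sum ) k++;   /* pick k so that 2^k ~= running mean */
        rice_put( vals[i], k );
        sum += vals[i]; n++;
    }
}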

BTW that's a general point that I think is worth stressing : often you can come up with some kind of clever heuristic bit packing compression scheme that is close to optimal. The real win of adaptive arithmetic coding is not the slight gain in efficiency, it's the fact that it is *so* much easier to compress anything you throw at it. It's much more systematic and scientific : you have tools, you make models, you estimate probabilities and compress them. You don't have to sit around fiddling with "oh I'll combine these two symbols, then I'll write a run length, and this code will mean switch to a different coding" , etc.

Okay, that's all standard review stuff, now let's talk about some new ideas.

One issue that I've been facing is that coding the huffman tree in this way is not actually very nice for a decoder that wants to construct trees very quickly. (I wind up seeing the build-tree time show up in my profiles, even though I only build the tree 5-10 times per 256k symbols.) The issue is that it's in the wrong order. To build the canonical huffman code, what you need is the symbols in order of codelen, from lowest codelen to highest, with the symbols sorted by id within each codelen. That is, something like :


codeLen 4 : symbols 7, 33, 48
codeLen 5 : symbols 1, 6, 8, 40 , 44
codeLen 7 : symbols 3, 5, 22
...

Obviously you can generate this from the list of codelens per symbol, but it requires a reshuffle which takes time.
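
The reshuffle itself is just a counting sort ; a minimal sketch, with the same assumed bounds as before :

#define NUM_SYMBOLS 256
#define MAX_CODELEN 16

/* fills sortedSyms with the symbols ordered by (codeLen, symbol id) ; returns count */
int sort_syms_by_len( const int codeLen[NUM_SYMBOLS], int sortedSyms[NUM_SYMBOLS] )
{
    int lenStart[MAX_CODELEN+2] = {0};

    /* histogram : lenStart[len+1] = # of symbols with that len */
    for (int s = 0; s < NUM_SYMBOLS; s++)
        if ( codeLen[s] != 0 )
            lenStart[ codeLen[s] + 1 ]++;

    /* prefix sum : lenStart[len] = start of that len's run */
    for (int len = 1; len <= MAX_CODELEN+1; len++)
        lenStart[len] += lenStart[len-1];

    int count = lenStart[MAX_CODELEN+1];

    /* scatter ; walking s in increasing order keeps ids sorted within each len */
    for (int s = 0; s < NUM_SYMBOLS; s++)
        if ( codeLen[s] != 0 )
            sortedSyms[ lenStart[ codeLen[s] ]++ ] = s;

    return count;
}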

So, maybe we could send the tree directly in this way?

One approach is counting combinations / enumeration. For each codeLen, you send the # of symbols that have that codeLen. Then you have to select which symbols have that codeLen. If there are M symbols of that codeLen and N remaining unclaimed symbols, the number of ways is N!/(M!*(N-M)!) , and the number of bits needed to send the combination index is log2 of that. Note that in this scheme you should also send the positions of the "not present" codeLen=0 group, but you can skip sending whichever group is largest entirely, since its members are implied by whatever symbols remain unclaimed at the end. You should also send the groups in order from smallest to largest (actually in order of size or *complement* size ; a group that's nearly full is as good as a group that's nearly empty).
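
A minimal sketch of the enumeration for one group, using the combinatorial number system (colex ranking). The names here are mine, and note that choose() overflows 64 bits for large (N,M), so a real version would have to send the index piecewise or with arbitrary precision arithmetic :

#include <stdint.h>

/* C(n,k) ; exact in integers at every step */
static uint64_t choose( int n, int k )
{
    if ( k < 0 || k > n ) return 0;
    if ( k > n-k ) k = n-k;
    uint64_t c = 1;
    for (int i = 1; i <= k; i++)
        c = c * (uint64_t)(n-k+i) / (uint64_t)i;
    return c;
}

/* rank of the combination pos[0] < pos[1] < ... < pos[M-1] , in [0, C(N,M)) ;
   the rank can be sent in ceil(log2(C(N,M))) bits */
uint64_t combination_rank( const int * pos, int M )
{
    uint64_t rank = 0;
    for (int i = 0; i < M; i++)
        rank += choose( pos[i], i+1 );
    return rank;
}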

I think this is an okay way to send huffman trees, but there are two problems : 1. it's slow to decode a combination index, and 2. it takes no advantage of clumpiness.

Another similar approach is binary group flagging. For each codeLen, you want to specify which of the remaining symbols are of that codelen and which are not. This is just a bunch of binary off/on flags. You could send them with a binary RLE coder, or the elegant thing would be Group Testing. Again the problem is that you have to make many passes over the stream, and each pass has to exclude the symbols that are already done.

(ADDENDUM : a better way to do this which takes more advantage of "clumpiness" is like this : first code a binary event for each symbol to indicate codeLen >= 1 (vs. codeLen < 1). Then, on the subset that is >= 1, code an event for whether it is >= 2, and so on. This is the same number of binary flags as the above method, but when the clumpiness assumption holds, this gives you flags that are very well grouped together, so they will code well with a method that exploits coherent binary data (such as runlengths).)
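
A sketch of that layered flagging on the encode side ; flush_flags() is just a stand-in for whatever coherent-binary coder (runlens, etc.) actually transmits each layer :

#include <stdio.h>

#define NUM_SYMBOLS 256
#define MAX_CODELEN 16

static void flush_flags( int level, const int * flags, int count )
{
    /* stand-in : print the layer instead of runlen-coding it */
    printf( "level %2d : ", level );
    for (int i = 0; i < count; i++) putchar( '0' + flags[i] );
    putchar( '\n' );
}

void send_len_layers( const int codeLen[NUM_SYMBOLS] )
{
    int subset[NUM_SYMBOLS];    /* symbols still in play */
    int n = NUM_SYMBOLS;
    for (int i = 0; i < n; i++) subset[i] = i;

    for (int level = 1; level <= MAX_CODELEN && n > 0; level++)
    {
        int flags[NUM_SYMBOLS];
        int m = 0;
        for (int i = 0; i < n; i++)
        {
            flags[i] = ( codeLen[ subset[i] ] >= level );
            if ( flags[i] ) subset[m++] = subset[i];
        }
        flush_flags( level, flags, n );
        n = m;    /* only the >= level symbols go on to the next pass */
    }
}

A symbol's codelen is simply the last level at which it was flagged on, so the decoder can reconstruct everything from the layers.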

Note that there's another level of redundancy that's not being exploited by any of these coders. In particular, we know that the tree must be a "prefix code" , that is, satisfy the Kraft equality (Sum 2^-L = 1). This constrains the code lengths. (This is most extreme for small trees ; for example, with a two symbol tree the code lengths are completely specified by this ; with a three symbol tree you only have one free choice - which symbol gets the length 1 - etc.)
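
To make the constraint concrete : a prefix code is exactly "full" when the Kraft sum is 1, which you can check exactly in integers ; a sketch, with MAX_CODELEN as an assumed bound on the lens :

#include <stdint.h>

#define MAX_CODELEN 16

/* Sum 2^-L == 1 exactly when Sum 2^(MAX_CODELEN - L) == 2^MAX_CODELEN */
int kraft_is_full( const int * codeLen, int numSymbols )
{
    uint32_t sum = 0;
    for (int s = 0; s < numSymbols; s++)
        if ( codeLen[s] != 0 )
            sum += 1u << ( MAX_CODELEN - codeLen[s] );
    return sum == ( 1u << MAX_CODELEN );
}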

Another idea is to use MTF (move-to-front) for the codelengths instead of delta from previous. I think that this would be better, but it's also slower.

Finally, when you're sending multiple trees in a file you should be able to get some win by coding each tree relative to the previous one, but I've found this is not trivial.

I've tried about 10 different schemes for huffman tree coding, but I'm not going to have time to really solve this issue, so it will remain neglected as it always has.

8 comments:

Unknown said...

Range coder is faster than Huffman. There's really no reason to use Huffman since the majority of range coder related IBM patents have now expired.

cbloom said...

"Range coder is faster than Huffman. "

That is 110% wrong.

cbloom said...

Maybe more.

Sam said...

Round one! Fight!

cbloom said...

I'm taking your damn trolling and turning it into constructive and interesting gold. Gold, Jerry, gold!

Unknown said...

I'm talking about decompression only and assuming a fixed probability table.

cbloom said...

"I'm talking about decompression only and assuming a fixed probability table."

Basically this is just not right.

As I demonstrated in great detail, the fastest arithmetic decoder would be one in which the total of the probabilities is a power of 2, and each individual probability is a power of 2. That's a Huffman code.

If you make the total a power of 2 but let each individual probability be a sum of two powers of two, that's slightly slower (ala Rissanen-Mohiuddin / DCC95).

Anything more general is slower still.
