4/13/2017

Tunstall Context

Looking at Marlin today, some reminders to self/all about context :

Tunstall was originally designed with binary alphabets in mind; variable-to-fixed on *binary*. In that context, doing full child trees (so dictionaries are full prefix trees and encoding is unique) makes sense. As soon as you do variable-to-fixed (hence VTF) on large alphabets, "plural" trees are obviously better, and have been written about extensively in the past. With plural trees, encoding is not unique ("a" and "ab" are both in the dictionary).
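
To make the non-uniqueness concrete, here's a toy sketch of the standard greedy longest-match parse against a plural dictionary (the std::set and flat substr scan are just for clarity; a real implementation would walk the prefix tree) :


#include <string>
#include <vector>
#include <set>

// greedy longest-match parse against a plural dictionary
// assumes all single chars are in the dictionary, so the parse
// can always make progress
std::vector<std::string> greedy_parse(const std::string & src,
                                      const std::set<std::string> & dict)
{
    std::vector<std::string> out;
    size_t pos = 0;
    while ( pos < src.size() )
    {
        size_t best = 1; // fall back to the single char
        for ( size_t len = 2; pos + len <= src.size(); len++ )
        {
            if ( dict.count( src.substr(pos,len) ) )
                best = len;
        }
        out.push_back( src.substr(pos,best) );
        pos += best;
    }
    return out;
}


With dict = { "a","b","ab" }, the string "ab" parses as the single word "ab", not as "a","b" ; both are valid encodings, greedy just picks one deterministically.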

There's a big under-specified distinction between VTF dictionaries that model higher-level correlation and those that don't. eg. does P("ab") = P(a)*P(b), or is there order-1 correlation?
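
For concreteness, with made-up numbers :


order-0 (iid) model : P("ab") = P(a) * P(b)
  eg. P(a) = 0.5 , P(b) = 0.25 -> P("ab") = 0.125
order-1 model : P("ab") = P(a) * P(b|a)
  eg. if "b" very often follows "a", maybe P(b|a) = 0.9
  -> P("ab") = 0.45 ; very different word probabilities,
  hence a very different dictionary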

I looked at Tunstall codes a while ago (TR "failed experiment : Tunstall Codes" 12/4/2015). I didn't make this clear, but in my experiment I was looking at a specific scenario :

symbols are assumed to have only order-0 entropy
(eg. symbol probabilities describe their full statistics)

encoder transmits symbol probabilities (or the dictionary of words)

But there are other possibilities that some of the literature addresses. There are a lot of papers on "improved Tunstall" that use the order-N probabilities (the true N-gram counts for the words, rather than multiplying the probability of each character). Whether or not this works in practice depends on context; eg. on LZ literals the characters are non-adjacent in the source, so this might not make sense.
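
eg. the two ways of assigning a word probability, as a sketch (the training string and the flat scan are just for illustration; real code would use suffix structures or rolling counts) :


#include <string>
#include <cstdint>

// order-0 : word probability is just the product of char probabilities
double order0_word_prob(const std::string & word,
                        const double char_prob[256])
{
    double p = 1.0;
    for ( unsigned char c : word )
        p *= char_prob[c];
    return p;
}

// order-N / "improved Tunstall" : use the true N-gram count of the
// word in training data
double ngram_word_prob(const std::string & word,
                       const std::string & train)
{
    int64_t count = 0;
    for ( size_t pos = 0; pos + word.size() <= train.size(); pos++ )
        if ( train.compare(pos, word.size(), word) == 0 )
            count++;
    int64_t positions = (int64_t)train.size() - (int64_t)word.size() + 1;
    return positions > 0 ? (double)count / (double)positions : 0.0;
}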

There's a fundamental limitation with Tunstall in practice and a very narrow window where it makes sense.

On current chips, 12-bit words are ideal (because 4096 dwords = 16k, which fits in L1). 16-bit words can sometimes give much better compression, but falling out of L1 is a disaster for speed.
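
A sketch of why the dword table works out (this entry layout is just one plausible option, not any particular implementation's) :


#include <cstdint>
#include <cstring>

// one plausible layout matching "4096 dwords = 16k" : each entry is an
// (offset,len) pair into a separate word pool, 4 bytes per entry, so
// the hot table is 4096*4 = 16k and stays resident in L1
struct Entry { uint16_t offset; uint16_t len; }; // 4 bytes = 1 dword

uint8_t * tunstall_decode(const Entry table[4096],
                          const uint8_t * pool,   // concatenated dictionary words
                          const uint16_t * codes, // 12-bit indices, pre-unpacked here for simplicity
                          int ncodes,
                          uint8_t * out)
{
    for ( int i = 0; i < ncodes; i++ )
    {
        const Entry & e = table[ codes[i] & 0xFFF ];
        memcpy(out, pool + e.offset, e.len);
        out += e.len;
    }
    return out;
}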

12-bit VTF words work great if the entropy of the source is <= 5 bits or so. As it goes over 5, you have too many bigrams that don't pack well into 12 bits, and the compression ratio starts to suffer badly (and decode speed suffers a bit).
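
Rough arithmetic on where the knee is :


each 12-bit code ideally carries ~12 bits of entropy
at H = 4 bpc : avg word len ~ 12/4 = 3.0 symbols per code
at H = 5 bpc : avg word len ~ 12/5 = 2.4 symbols per code
at H = 7 bpc : ~ 1.7 ; words are mostly single chars and bigrams,
  and the 4096 slots (minus 256 singles) can only hold 3840 of the
  up-to-65536 possible bigrams, so you fall well short of entropy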

I was investigating Tunstall in the case of normal LZ literals, where entropy is always in the 6-8 bpc range (because any more compressibility has been removed by the string-match portion of the LZ). In that case Tunstall just doesn't work.

Tunstall is best when entropy <= 3 bits or so. Not only do you get compression closer to entropy, you also get more decode speed.

Now for context, that's a bit of a weird place to just do entropy coding. Normally in low-entropy scenarios, you would have some kind of coder before just tossing entropy coding at it. eg. take DCT residuals, or any image-residual situation : you will have lots of 0's and 1's, so it looks like a very low entropy scenario for order-0 entropy, but typically you would remove that by doing RLE or something else, so that the alphabet you hand to the entropy coder is higher entropy. (eg. JPEG does RLE of 0's and EOB.)
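
A toy version of that kind of front end, zero-run-length before the entropy coder (the stream format is made up, just to show the idea) :


#include <vector>
#include <cstdint>

// toy zero-RLE front end : turn runs of 0's in a residual stream into
// (escape, run length) pairs, so the alphabet handed to the entropy
// coder is higher entropy (JPEG's AC coding is a fancier version,
// with EOB)
std::vector<uint8_t> rle_zeros(const std::vector<uint8_t> & in)
{
    std::vector<uint8_t> out;
    size_t i = 0;
    while ( i < in.size() )
    {
        if ( in[i] == 0 )
        {
            size_t run = 0;
            while ( i < in.size() && in[i] == 0 && run < 255 ) { i++; run++; }
            out.push_back(0);            // escape : 0 only occurs as the escape
            out.push_back((uint8_t)run); // run length 1..255
        }
        else
        {
            out.push_back(in[i++]);      // nonzero literals pass through
        }
    }
    return out;
}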

Even if you did just entropy code on a low-entropy source, you might instead use a kind of cascaded coder. Again assuming something like prediction residuals where the 0's and 1's are very common, you might make a two-stage alphabet that's something like :


alphabet 1 : {0,1,2,3+}
alphabet 2 : {3,4,...255}

Then with alphabet 1 you could pack 4 symbols per byte and do normal Huffman. Obviously a Huffman decode is a little slower than Tunstall, but you always get 4 symbols per decode, so output length is not variable, and compression ratio is better.
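
A sketch of the split (the stream layout is made up; you'd also need to transmit lengths, pad the tail, etc.) :


#include <vector>
#include <cstdint>
#include <algorithm>

// each source byte maps to a 2-bit symbol in alphabet 1 ( {0,1,2,3+} );
// 4 of those pack into one byte, which goes to a normal 256-entry
// Huffman. values >= 3 also emit the actual value into the
// alphabet-2 stream.
void split_streams(const std::vector<uint8_t> & in,
                   std::vector<uint8_t> & packed, // alphabet 1, 4 syms/byte
                   std::vector<uint8_t> & high)   // alphabet 2 : {3..255}
{
    for ( size_t i = 0; i < in.size(); i += 4 )
    {
        uint8_t b = 0;
        for ( size_t j = 0; j < 4 && i+j < in.size(); j++ )
        {
            uint8_t v = in[i+j];
            uint8_t s = std::min<uint8_t>(v, 3); // 3 means "3 or more"
            if ( v >= 3 ) high.push_back(v);
            b |= s << (j*2);
        }
        packed.push_back(b);
    }
}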

Tunstall for LZ literals might be interesting in a very fast LZ with MML 8 or so (eg. things like ZStd level 1, which is also where multi-symbol-output Huff works well).

Point is - the application window here is pretty narrow, and there are other techniques that also address the same problem.
