Classical Tunstall

Before continuing with Marlin, I want to take a brief digression to review "classical" or "true" Tunstall.

The classical Tunstall algorithm constructs VTF (variable to fixed) codes for binary memoryless (order-0) sources. It constructs the optimal code.

You start with dictionary = { "0", "1" }, the single-bit binary strings (or dictionary = { the null string } if you prefer).

You then split one word W in the dictionary to make two new words "W0" and "W1" ; when you split W, it is removed since all possible following symbols now have words in the dictionary.

The algorithm is simple and iterative :

while dic size < desired
    find best word W to split
    remove W
    add W0 and W1

each step increments dictionary size by +1

What is the best word to split?

Our goal is to maximize average code length, that is the average number of source bits consumed per output codeword :

A = Sum[words] P(W) * L(W)

under the split operation, what happens to A ?

W -> W0, W1

delta(A) = P(W0) * L(W0) + P(W1) * L(W1) - P(W) * L(W)

P(W0) = P(W)*P(0)
P(W1) = P(W)*P(1)
L(W0) = L(W1) = L(W)+1

.. simplify, using P(0) + P(1) = 1 ..

delta(A) = P(W) * (L(W)+1) - P(W) * L(W) = P(W)

So to get the best gain in A, you just split the word with maximum probability. Of course this is just a greedy optimization of A, which might not give the true optimum; in fact it does, and the proof is pretty neat, but I won't do it here.
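A quick numeric check of delta(A) = P(W), as a minimal sketch. The dictionary here is just a map from word to probability, and the helper names avg_len and split are made up for this illustration.

```python
def avg_len(dic):
    """A = Sum[words] P(W) * L(W) for a dict mapping word -> probability."""
    return sum(p * len(w) for w, p in dic.items())


def split(dic, w, p0):
    """Return a new dictionary with W replaced by W0 and W1 (p0 is P(0))."""
    out = dict(dic)
    pw = out.pop(w)                 # remove W
    out[w + "0"] = pw * p0          # P(W0) = P(W)*P(0)
    out[w + "1"] = pw * (1.0 - p0)  # P(W1) = P(W)*P(1)
    return out


dic = {"0": 0.75, "1": 0.25}
after = split(dic, "0", 0.75)
# The gain in A should equal P("0"), the probability of the word that was split.
print(avg_len(after) - avg_len(dic))
```

With P(0) = 0.75, splitting "0" takes A from 1.0 to 1.75, a gain of 0.75 = P("0").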

You can naively build the optimal Tunstall code in O(N log N) time with a heap, or slightly more cleverly you can use two linear queues for left and right children and do it in O(N) time.
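Here's a rough sketch of both builds, assuming a binary source with P(0) = p0. The function names are mine. The two-queue version relies on the fact that words get split in non-increasing probability order, so the W0 children arrive in sorted order, and likewise the W1 children; the most probable word is always at the front of one of the two queues.

```python
import heapq
from collections import deque


def tunstall_heap(p0, size):
    """O(N log N) build; returns a dict mapping word -> probability."""
    # heapq is a min-heap, so store negated probabilities to pop the maximum.
    heap = [(-p0, "0"), (-(1.0 - p0), "1")]
    heapq.heapify(heap)
    while len(heap) < size:
        neg_pw, w = heapq.heappop(heap)                       # remove W
        heapq.heappush(heap, (neg_pw * p0, w + "0"))          # add W0
        heapq.heappush(heap, (neg_pw * (1.0 - p0), w + "1"))  # add W1
    return {w: -neg_pw for neg_pw, w in heap}


def tunstall_two_queues(p0, size):
    """O(N) build; q0 holds the W0 children, q1 holds the W1 children."""
    q0 = deque([(p0, "0")])
    q1 = deque([(1.0 - p0, "1")])
    while len(q0) + len(q1) < size:
        # Each queue is sorted most-probable-first, so only compare the fronts.
        # Neither queue is ever empty: every split appends one word to each.
        if q0[0][0] >= q1[0][0]:
            pw, w = q0.popleft()
        else:
            pw, w = q1.popleft()
        q0.append((pw * p0, w + "0"))
        q1.append((pw * (1.0 - p0), w + "1"))
    return {w: pw for pw, w in list(q0) + list(q1)}


print(sorted(tunstall_heap(0.75, 4)))
print(sorted(tunstall_two_queues(0.75, 4)))
```

When there are ties in probability the two builds may pick different words to split, but the set of word probabilities (and so A) comes out the same.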

Easy peasy, nice and neat. But this doesn't work the same way for the large-alphabet scenario.

Now onto something that is a bit messy that I haven't figured out.

For "plural Tunstall" we aren't considering adding all children, we're only considering adding the next child.

A "split" operation is like :

start with word W with no children
W ends in state 0 (all chars >= 0 are possible)

the next child of W to consider is "W0"
(symbols sorted so most probable is first)

if we add "W0" then W goes to state 1 (only chars >= 1 possible)

W_S0 -> "W0" + W_S1

W_S1 -> "W1" + W_S2


again, we want to maximize A, the average codelen. What is delta(A) under a split operation?

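delta(A) = P("W0") * L("W0") + P(W_S1) * L(W) - P(W_S0) * L(W)

using L("W0") = L(W) + 1 :

delta(A) = P("W0") + (P("W0") + P(W_S1) - P(W_S0)) * L(W)

W in state 0 is followed either by char 0 (making "W0") or by some char >= 1 (leaving W_S1), so :

P("W0") + P(W_S1) - P(W_S0) = 0

delta(A) = P("W0")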

it seems like in plural Tunstall you should "split" the word that has maximum P("W0") ; that is maximize the probability of the word you *create* not the one you *remove*. This difference arises from the fact that we are only making one child of longer length - the other "child" in the pseudo-split here is actually the same parent node again, just with a reduced exit state.
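To check the algebra numerically, here's a minimal sketch under a simplified probability model that I'm assuming for illustration: P(W in state s) = (product of the symbol probabilities in W) * (total probability of chars >= s). In particular it ignores the exit state of whatever word came before W, which a real plural Tunstall coder would have to account for. Words are tuples of symbol indices, and all the helper names are hypothetical.

```python
def word_prob(word, state, p):
    """P(W in exit state s) under the simplified model (an assumption)."""
    base = 1.0
    for c in word:
        base *= p[c]
    return base * sum(p[state:])  # only chars >= state can still follow W


def avg_len(dic, p):
    """A = Sum[words] P(W) * L(W); dic maps word -> exit state."""
    return sum(word_prob(w, s, p) * len(w) for w, s in dic.items())


def add_next_child(dic, word):
    """The pseudo-split: W_Ss -> "Ws" + W_S(s+1)."""
    out = dict(dic)
    s = out[word]
    out[word + (s,)] = 0   # the new child starts with no children, in state 0
    out[word] = s + 1      # the parent stays, with a reduced exit state
    return out


p = [0.5, 0.25, 0.25]                 # symbols sorted most probable first
d0 = {(0,): 0, (1,): 0, (2,): 0}      # all single-symbol words, no children
d1 = add_next_child(d0, (0,))
# The gain in A should equal the probability of the created child (0, 0).
print(avg_len(d1, p) - avg_len(d0, p), word_prob((0, 0), 0, p))
```

This only confirms that delta(A) = P("W0") holds inside that simplified model; it says nothing about whether the model matches what the coder actually sees.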

In practice that doesn't seem to be so. I experimentally measured that choosing to split the word with maximum P(W) is better than splitting the word with maximum P(child).

I'm not sure what's going wrong with this analysis. In the Marlin code they just split the word with maximum P(W) by analogy to true Tunstall, which I'm not convinced is well justified in plural Tunstall.

While I'm bringing up mysteries, I tried optimal-parsing plural Tunstall. Obviously with "true Tunstall" or any prefix-free code that's silly; the greedy parse is the only parse. But with plural Tunstall, you might have "aa" and also "aaa" in the tree. In this scenario, by analogy to LZ, the greedy parse is usually imperfect because it is sometimes better to take a shorter match now to get a longer one on the next word. So maybe some kind of lazy parse, or heck, a full optimal parse (the trivial LZSS backward parse works well here).

Result : optimal-parsed plural Tunstall is identical to greedy. Exactly, so it must be provable. I don't see an easy way to show that the greedy parse is optimal in the plural case. Is it true for all plural dictionaries? (I doubt it) What are the conditions on the dictionary that guarantee it?
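For reference, here's a minimal sketch of the two parsers, assuming every word costs the same (one fixed-size codeword) so the optimal parse is just the one with the fewest words. The example dictionary is an arbitrary non-prefix-free word set I made up to show the parsers can disagree in general; it is not the output of any plural Tunstall construction.

```python
def greedy_parse(text, words):
    """Always take the longest dictionary word that matches at the cursor."""
    maxlen = max(len(w) for w in words)
    out, i = [], 0
    while i < len(text):
        for n in range(min(maxlen, len(text) - i), 0, -1):
            if text[i:i + n] in words:
                out.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError("no dictionary word matches at position %d" % i)
    return out


def optimal_parse(text, words):
    """Backward (LZSS-style) parse minimizing the number of words emitted."""
    n = len(text)
    maxlen = max(len(w) for w in words)
    inf = float("inf")
    cost = [inf] * (n + 1)  # cost[i] = fewest words needed to code text[i:]
    step = [0] * (n + 1)    # step[i] = length of the word to take at i
    cost[n] = 0
    for i in range(n - 1, -1, -1):
        for k in range(1, min(maxlen, n - i) + 1):
            if text[i:i + k] in words and 1 + cost[i + k] < cost[i]:
                cost[i] = 1 + cost[i + k]
                step[i] = k
    if cost[0] == inf:
        raise ValueError("text cannot be parsed with this dictionary")
    out, i = [], 0
    while i < n:
        out.append(text[i:i + step[i]])
        i += step[i]
    return out


words = {"a", "b", "c", "ab", "bcc"}
print(greedy_parse("abcc", words))   # greedy grabs "ab" and is stuck with "c", "c"
print(optimal_parse("abcc", words))  # taking the shorter "a" first allows "bcc"
```

So for a general dictionary greedy can lose (three words versus two here), which makes the exact tie on the plural Tunstall dictionaries look like a property of how those dictionaries were built.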

I think that this is because for any string in the dictionary, all shorter substrings of that string are in the dictionary too. This makes optimal parsing useless. But I think that property is a coincidence/bug of how Marlin and I did the dictionary construction, which brings me to :

Marlin's dictionary construction method and the one I was using before, which is slightly different, both have the property that they never remove parent nodes when they make children. I believe this is wrong but I haven't been able to make it work a different way.

The case goes like this :

you have word W in the dictionary with no children

you have following chars a,b,c,d.  a and b are very probable, c and d are very rare.

P(W) initially = P_init(W)

you add child 'a' ; W -> Wa , W(b+)
P(W) -= P(Wa)
add child 'b'
W(b+) -> Wb , W(c+)
P(W) -= P(Wb)

now the word W in the dictionary has
P(W) = P(Wc) + P(Wd)

these are quite rare, so P(W) now is very small

W is no longer a desirable dictionary entry.

We got all the usefulness of W out in Wa and Wb, we don't want to keep W in the dictionary just to be able to code it with rare following c's and d's - we'd like to now remove W.

In particular, if the current P(W) of the parent word is now lower than a child we could make somewhere else by splitting, remove W and split the other node. Or something like that - here's where I haven't quite figured out how to make this idea work in practice.

So I believe that both Marlin and my code are NOT making optimal general VTF plural dictionaries, they are making them under the (unnecessary) constraint of the shorter-substring-is-present property.
