Your dictionary is like the probability interval. You start with the full interval [0,1]. You put in all the single-character words, with P(c) for each char, so that the intervals sum to one. Then you iteratively split the largest interval, subdividing its range.

In binary the "split" operation is :

W -> W0 , W1
P(W) = P(W)*P(0) + P(W)*P(1)

In N-ary the split is :

W -> Wa , W[b+]
P(W) = P(W)*P(a) + P(W)*Ptail(b)

W[b+] means just the word W, but in the state "b+" (aka state "1") : the following sym must be >= b. (P(W) here means P_naive(W), just the char probability product.)

W[b+] -> Wb , W[c+]
P(W)*Ptail(b) = P(W)*P(b) + P(W)*Ptail(c)

(recall Ptail(c) = tail cumprob, sum of P(char >= c) ; Ptail(a) = 1.0)
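The split identities above are easy to sanity-check numerically. A minimal sketch, assuming a hypothetical 4-symbol alphabet with made-up probabilities :

```python
# Hypothetical alphabet and probabilities, assumed for illustration only.
P = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
syms = sorted(P)  # alphabet order : a < b < c < d

def ptail(s):
    """Ptail(s) = tail cumprob, sum of P(char >= s); Ptail('a') = 1.0."""
    return sum(P[c] for c in syms if c >= s)

pw = 0.3  # P(W) for some word W; any value works, the identity is linear in P(W)

# N-ary split : W -> Wa , W[b+]
assert abs(pw - (pw * P["a"] + pw * ptail("b"))) < 1e-12

# refining the tail state : W[b+] -> Wb , W[c+]
assert abs(pw * ptail("b") - (pw * P["b"] + pw * ptail("c"))) < 1e-12
```

Each split just partitions a word's probability mass, so total mass is conserved at every step.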

So we can draw a picture like an arithmetic coder does, splitting ranges and specifying cumprob intervals to choose our string :

You just keep splitting the largest interval until the number of intervals equals the desired number of codes (8 here, for 3-bit Tunstall).
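That greedy loop is a few lines with a heap. A sketch for the binary-alphabet case (function name and the p0/p1 values are my own, for illustration) :

```python
import heapq

def build_tunstall(p0, p1, num_codes):
    # max-heap via negated probabilities; start from the root (empty word, P = 1)
    heap = [(-1.0, "")]
    while len(heap) < num_codes:
        # pop the largest interval and split it : W -> W0 , W1
        negp, w = heapq.heappop(heap)
        heapq.heappush(heap, (negp * p0, w + "0"))
        heapq.heappush(heap, (negp * p1, w + "1"))
    return sorted((w, -negp) for negp, w in heap)

# 8 codes = 3-bit Tunstall, as in the drawing
for word, prob in build_tunstall(0.7, 0.3, 8):
    print(word, round(prob, 4))
```

Each split removes one leaf and adds two, so the dictionary grows by exactly one word per iteration and the leaf probabilities always sum to 1.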

At that point, you still have an arithmetic coder - the intervals have fractional sizes and are correctly proportioned to the probabilities of each code word. e.g. the interval for 01 is P(0)*P(1).

In the final step, each interval is just mapped to a dictionary codeword index. This gives each codeword an equal 1/dictionary_size probability interval, which in general is not quite right.

This is where the coding inefficiency comes from - it's the difference between the correct interval sizes and the size that we snap them to.
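That loss is easy to put a number on : every codeword is sent in log2(dictionary_size) bits flat, while its ideal cost is -log2(P_i), so the expected overhead per dictionary word is log2(N) minus the entropy of the codeword distribution. A sketch, with interval sizes that are assumed for illustration (any set summing to 1 works) :

```python
import math

# interval sizes for an 8-word dictionary; these values are made up for illustration
probs = [0.21, 0.168, 0.147, 0.147, 0.103, 0.09, 0.072, 0.063]
N = len(probs)

entropy = -sum(p * math.log2(p) for p in probs)  # ideal bits per codeword
actual  = math.log2(N)                           # flat cost : 3 bits per codeword
print(f"ideal {entropy:.4f} bits, actual {actual} bits, "
      f"overhead {actual - entropy:.4f} bits per codeword")
```

The overhead is always >= 0, since the uniform distribution maximizes entropy for a fixed N; it hits zero only when every interval is exactly 1/N.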

(ADD : in the drawing I wrote "snap to powers of 2" ; that's not the best way of describing it; they're just snapping to the successive points i/dictionary_size. In this case with dictionary_size = 8 those points are {0,1/8,1/4,3/8,1/2,..}, which is why I was thinking about powers-of-2 intervals.)
