Marlin Summary

Sum up with simple take-away.

Plural Tunstall VTF coding in general is extremely fast to decode. It works best with 12-bit tables (must stay in L1), which means it only works well at entropy <= 4 bpb.

Marlin introduces an improved word probability model that captures the way the first letter probability is skewed by the longer-string exclusion. (this is just like the LAM exclusion in LZ)

That is, after coding with word W, subsequent chars that exist in the dictionary (Wa,Wb) cannot then be the next character to start a word, so the probably of the first char of the word being >= c is increased.

The Marlin word probability model improves compression by 1-4% over naive plural Tunstall.

The simplest implementation of the Marlin probability adjustment is like this :

P_naive(W) = Prod[chars c in W] P('c')

P_word(W) = P_scale_first_char( W[0] ) * P_naive(W) * Ptail( num_children(W) )

(where Ptail is the tail-cumulative-proability :
Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)

(instead of scaling by Ptail you can subtract off the child node probabilities as you make them)

on the first iteration of dictionary building, set P_scale_first_char() = 1.0

this is "naive plural Tunstall"

after building the dictionary, you now have a set of words and P(W) for each
compute :

P_state(i) = Sum[ words W with i children ] P(W)

(state i means only chars >= i can follow)

(iterating P_state -> P(W) a few times here is optional but unnecessary)

P_scale_first_char(c) = Sum[ i <= c ] P_state(i) / P_tail(i)

(P_scale_first_char = P_state_tail)

then repeat dic building one more time
(optionally repeat more but once seems fine)

And that's it!

What do these values actually look like? I thought it might be illuminating to dump them. This is on a pretty skewed file so the effect is large, the larger the MPS probability the bigger the effect.

filelen = 1572918
H = 2.917293

P of chars =

        [0] 0.46841475525106835 double
        [1] 0.11553621994280693 double
        [2] 0.11508546535801611 double
        [3] 0.059216055763873253    double
        [4] 0.058911526220693004    double
        [5] 0.036597584870921435    double
        [6] 0.036475518749229136    double
        [7] 0.018035269480036465    double
        [8] 0.017757441900976400    double
        [9] 0.010309501194595012    double
        [10]    0.0097379520102128646   double

P_state =

        [0] 0.62183816678155190 double
        [1] 0.15374679894466811 double
        [2] 0.032874234239563829    double
        [3] 0.063794018822874776    double
        [4] 0.026001940955215786    double
        [5] 0.011274295764837820    double
        [6] 0.028098911350290755    double
        [7] 0.012986055279597277    double
        [8] 0.0013397794289329405   double
        .. goes to zero pretty fast ..

P_scale_first_char =

        [0] 0.62183816678155202 double
        [1] 0.91106139196208336 double
        [2] 0.99007668169839014 double
        [3] 1.2020426052137052  double
        [4] 1.3096008656256881  double
        [5] 1.3712643080249607  double
        [6] 1.5634088663186672  double
        [7] 1.6817189544649160  double
        [8] 1.6963250203077103  double
        [9] 1.9281295477496172  double
        [10]    1.9418127462353780  double
        [11]    2.0234438458996773  double
        [12]    2.0540542047979415  double
        [13]    2.1488636999462676  double
        [14]    2.2798060244386895  double
        [15]    2.2798060244386895  double
        [16]    2.2798060244386895  double
        [17]    2.3660205062350039  double
        [18]    2.3840557757150402  double
        [19]    2.3840557757150402  double
        [20]    2.3840557757150402  double
        [21]    2.4066061022686838  double
        [22]    2.5584098628550294  double
        [23]    2.5584098628550294  double
        [24]    2.6690752676624321  double
        .. gradually goes up to 4.14

estimate real first char probability 
= P(c) * P_scale_first_char(c) =

        [0] 0.29127817269875372 double
        [1] 0.10526058936313110 double
        [2] 0.11394343565337962 double
        [3] 0.071180221940886246    double
        [4] 0.077150585733949978    double
        [5] 0.050184961893408854    double
        [6] 0.057026149416117611    double
        [7] 0.030330254533459933    double

the effect is to reduce the skewing of the probabilities in the post-word alphabet.

I think this problem space is not fully explored yet and I look forward to seeing more work in this domain in the future.

I'm not convinced that the dictionary building scheme here is optimal. Is there an optimal plural VTF dictionary?

I think maybe there's some space between something like RANS and Tunstall. Tunstall inputs blocks of N bits and doesn't carry any state between them. RANS pulls blocks of N bits and carries full state of unused value range between them (which is therefore a lot slower because of dependency chains). Maybe there's something in between?

Another way to think about this Marlin ending state issue is a bit like an arithmetic coder. When you send the index of a word that has children, you have not only sent that word, you've also sent information about the *next* word. That is, some fractional bits of information that you'd like to carry forward.

Say you have {W,Wa,Wb} in the dictionary and you send a W. You've also sent that the next word start with >= c. That's like saying your arithmetic coder cumprob is >= (Pa+Pb). You've refined the range of the next interval.

This could be done with Tunstall by doing the multi-table method (separate tables for state 0,1, and 2+), but unfortunately that doesn't fit in L1.

BTW you can think of Tunstall in an arithmetic codey way. Maybe I'll draw a picture because it's easier to show that way...

1 comment:

cbloom said...

I noted this in one of the posts, but I think it bears repeating :

The Marlin probability adjustment is not the only way to handle the tail-cant-be-head exclusion. An alternative that was previously in the literature is to have a separate Tunstall table for each tail exclusion case.

A hybrid approach is also possible & interesting.

When the alphabet is quite skew; the MPS (most probably symbol) has high probability.

In that case you might like to make two Tunstall tables. One where MPS can be the first symbol, and one where it can't. (MPS can't be the first symbol is common and important because it will often be available as a tail symbol). You could then do the Marlin probability adjustment on both of those tables, which would compensate for the non-MPS exclusion cases.

The tail-cant-be-head is similar to the LAM exclusion in LZ77.

old rants