cbloom rants

4/14/2017

Some commentary on Marlin and the code

WARNING : I'm a bit delirious with flu and lack of sleep at the moment so I'm not entirely sure what I'm writing. Apologies if it's a mess!

First, there's no point in making the T_ij transition matrix they talk about.

Back in "Understanding Marlin" you may recall I presented the algorithm thusly :

P_state(i) is given from a previous iteration and is constant

build dictionary using Marlin word model

we now have P(W) for all words in our dic

Use P(W) to compute new P_state(i)

optionally iterate a few times (~ 10 times) :
use that P_state to compute adjusted P(W)
use P(W) to compute new P_state

iteration dictionary building again (3-4 times)

in the paper (and code) they do it a bit differently. They compute the state transition matrix, which is :

T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)

this is the probability that if you started in state j you will wind up in state i

instead of iterating P_state -> P(W) , they iterate :

T <- T * T

and then P_state(i) = T_i0

I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)

Also for the record - I have yet to find a case where iterating to convergence here actually helps. If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :

laplacian distribution :

no iteration :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1

iterate 10X :

0.67952             :  1,000,000 ->   503,688 =  4.030 bpb =  1.985 to 1

You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though, heck 2 is probably fine.

4: 0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1

3: 0.67952             :  1,000,000 ->   503,721 =  4.030 bpb =  1.985 to 1

2: 0.67952             :  1,000,000 ->   503,817 =  4.031 bpb =  1.985 to 1

The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second iteration does the first "Marlin" dictionary build.

In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.

There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be. Some is just over-C++ madness, some is failure to factor out constant expressions.

the word is :

struct Word : public std::vector<uint8_t> {
};

and the dictionary is :

std::vector<Word> W;

(with no reserves anywhere)
yeah that's a problem.  May I suggest :

struct Word {
uint64 chars;
int len;
};

also reserve() is good and calling size() in loops is bad.

This is ouchy :

virtual double phi(const Word &) const = 0;

and this is even more ouchy :

virtual std::vector<Word> split(const Word &) const = 0;

The P(w) function that's used frequently is bad.

The key part is :

for (size_t t = 0; t<=w; t++) {
double p = PcurrState[t];
p *= P[w]/PnextState[t];
ret += p;
}

which you can easily factor out the common P[w] from :

int c0 = w;
for (size_t t = 0; t<= c0; t++) {
ret += PcurrState[t]/PnextState[t];
}
ret *= P[c0]

but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all!  They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :

int c0 = w;
ret = PcurrState_over_PnextState_partial_sum[c0];
ret *= P[c0]

that also gives us a chance to state clearly (again) the difference between "Marlin" and naive plural Tunstall. It's that one table right there.

int c0 = w;
ret = 1.0;
ret *= P[c0]

this is naive plural Tunstall. It comes down to a modifed probability table for the first letter in the word.

Recall that :

P_naive(W) = Prod[ chars c in W ] P(c)

simple P_word(W) = P_naive(W) * Ptail( num_children(W) )

Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".

They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partical nodes leave you in a state where some symbols are excluded).

They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.

This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.

In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.

There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.

Phhlurge.

I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.

two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)

r = 0.67952

H = 3.798339

my version of Marlin-probability plural Tunstall :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1

Marlin reference code : 1,000,000 -> 507,412

naive plural Tunstall :

0.67952             :  1,000,000 ->   508,071 =  4.065 bpb =  1.968 to 1

I presume the reason they compress worse than my version is because they make dictionaries for a handfull of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is mis-matching the actual distribution slightly.