Context on Chroma Subsampling

Poked by Guetzli into thinking about pros/cons of chroma subsampling -

Chroma downsampling (as in standard JPEG YCbCr 420) is a big ugly hammer. It just throws away a ton of bits of information. That's pretty much always a bad thing in compression.

Now, preferring to throw away chroma detail before luma detail is valid and good. So if you are not chroma subsampling, then you need a perceptually optimizing encoder that knows to give fewer bits to high frequency chroma. You have much more control doing this through bit allocation than you do by just smashing the chroma planes. (for example, on blocks where there is almost no luma signal, then you might keep more of the high frequency chroma, on blocks with luma masking you can throw away lots of the high chroma AC bits - you just have way more precise control).

The chroma subsample is just a convenient way to get decent perceptual tradeoffs in a *non* optimizing encoder.

Chroma subsample is of course an R-D choice. It throws away signal, giving a certain disortion, in exchange it saves you some rate. This particular move is a good R-D choice at some tradeoff zone. Generally at high bit rate, it's a bad move. In most encoders, it becomes a good move at some lower quality. (in JPEG the tradeoff point is usually somewhere around 85). Measuring this D in RMSE is easy, but measuring it perceptually is rather tricky (when luma masking is present it may be near zero D perceptually, but without luma masking it can be much worse).

There are other issues.

In non-subsampled world, the approximate important weights for YCbCr are something like {0.7,0.13,0.17} . If you do subsample, then the chroma weights per-pixel need to go up by 4X , in which case they become pretty close to all being the same.

Many people mistakenly say the "eye does not see blue levels well". Not true, the eye can see the overall level of blue perfectly well, just as well as red or green. (eg for images where the whole thing is a single solid color). What the eye has is very poor spatial resolution in blue.

One of the issues is that chroma subsample is a nice win for *speed*. It gives you 4X less pixels to work on in two of your planes, half as many pixels overall. This means that subsampled chroma images are almost 2X faster to decode.

I used to be anti-chroma-subsample in my early years. For example in wavelets it's much neater to keep all your color planes full res, but send your bitplanes in [YUV] order. That way when you truncate the bottom bit planes, you drop the highest frequency chroma first. But then I realized that the 2X speedup from chroma subsample was nothing to sneeze at, and in general I'm now pro-chroma-subsample.

Another reminder : if you don't chroma subsample, then you may as well do a KLT on the color planes, rather than just use YUV or whatever. (maybe even KLT per region). The advantage of using standard YUV is that the chroma are known to be perceptually okay to downsample (you can't downsample the two minor components of the KLT transformed planes because you have no guarantee that they are of a type that the eye can't perceive high frequency data).

You can obviously construct adversarial images where the detail is all in chroma (the whole image has a constant luma). In that case chroma downsampling looks really bad and is perceptually a big mistake.

Chroma-from-luma in the decoder fixes all the color fringing that people associate with JPEG, but obviously it doesn't help in the adversarial cases where there is no luma detail to boost the chroma with.

I should also note while I'm at it that there are many codecs out there that just have bugs and/or mistakes in their downsamplers and/or upsamplers that cause this operation to produce way more error than it should.

ADD : Won sent me an email with an interesting idea I'd never thought about. Rather than just jumping between not downsampling chroma and doing 2x2 downsample, you could take more progressive steps, such as going to a checkerboard of chroma (half as many pixels) or a Bayer pattern. It's probably too complex to support these well and make good encoder decisions in practice, but they're interesting in theory.


Tunstall in an arithmetic way

You can think of the Tunstall dictionary build in an arithmetic codery way.

Your dictionary is like the probability interval. You start with a full interval [0,1]. You put in all the single-character words, with P(c) for each char, so that all sums to one. You iteratively split the largest interval, and subdivide the range.

In binary the "split" operation is :

W -> W0 , W1

P(W) = P(W)*P(0) + P(W)*P(1)

In N-ary the split is :

W -> Wa , W[b+]

P(W) = P(W)*P(a) + P(w)*Ptail(b)

W[b+] means just the word W, but in the state "b+" (aka state "1"), following sym must be >= b
(P(w) here means P_naive(w), just the char probability product)

W[b+] -> Wb , W[c+]

P(w)*Ptail(b) = P(w)*P(b) + P(w)*Ptail(c)

(recall Ptail(c) = tail cumprob, sum of P(char >= c))
(Ptail(a) = 1.0)

So we can draw a picture like an arithmetic coder does, spliting ranges and specifying cumprob intervals to choose our string :

You just keep splitting the largest interval until you have a # of intervals = to the desired number of codes. (8 here for 3-bit Tunstall).

At that point, you still have an arithmetic coder - the intervals are fractional sizes and are correctly sized to the probabilities of each code word. eg. the interval for 01 is P(0)*P(1).

In the final step, each interval is just mapped to a dictionary codeword index. This gives each codeword an equal 1/|dictionary_size| probability interval, which in general is not quite right.

This is where the coding inefficiency comes from - it's the difference between the correct interval sizes and the size that we snap them to.

(ADD : in the drawing I wrote "snap to powers of 2" ; that's not the best way of describing that; they're just snapping to the subsequent i/|dictionary_size|. In this case with dictionary_size = 8 those points are {0,1/8,1/4,3/8,1/2,..} which is why I was thinking about powers of 2 intervals.)


Classical Tunstall

Before continuing with Marlin, I want to take a brief digression to review "classical" or "true" Tunstall.

The classical Tunstall algorithm constructs VTF (variable to fixed) codes for binary memoryless (order-0) sources. It constructs the optimal code.

You start with dictionary = { "0","1" } , the single bit binary strings. (or dictionary = the null string if you prefer)

You then split one word W in the dictionary to make two new words "W0" and "W1" ; when you split W, it is removed since all possible following symbols now have words in the dictionary.

The algorithm is simple and iterative :

while dic size < desired
find best word W to split
remove W
add W0 and W1

each step increments dictionary size by +1

What is the best word to split?

Our goal is to maximize average code length :

A = Sum[words] P(W) * L(W)

under the split operation, what happens to A ?

W -> W0, W1

delta(A) = P(W0) * L(W0) + P(W1) * L(W1) - P(W) * L(W)

P(W0) = P(W)*P(0)
P(W1) = P(W)*P(1)
L(W0) = L(W)+1

.. simplify ..

delta(A) = P(W)

so to get the best gain of A, you just split the word with maximum probability. Note of course this is just greedy optimization of A and that might not be the true optimum, but in fact it is and the proof is pretty neat but I won't do it here.

You can naively build the optimal Tunstall code in NlogN time with a heap, or slightly more cleverly you can use two linear queues for left and right children and do it in O(N) time.

Easy peasy, nice and neat. But this doesn't work the same way for the large-alphabet scenario.

Now onto something that is a bit messy that I haven't figured out.

For "plural Tunstall" we aren't considering adding all children, we're only considering adding the next child.

A "split" operation is like :

start with word W with no children
W ends in state 0 (all chars >= 0 are possible)

the next child of W to consider is "W0"
(symbols sorted so most probable is first)

if we add "W0" then W goes to state 1 (only chars >= 1 possible)

W_S0 -> "W0" + W_S1

W_S1 -> "W1" + W_S2


again, we want to maximize A, the average codelen. What is delta(A) under a split operation?

delta(A) = P("W0") * L("W0") + P(W_S1) * L(W) - P(W_S0) * L(W)

delta(A) = P("W0") + (P("W0") + P(W_S1) - P(W_S0)) * L(W)

P("W0") + P(W_S1) - P(W_S0) = 0


delta(A) = P("W0") 

it seems like in plural Tunstall you should "split" the word that has maximum P("W0") ; that is maximize the probability of the word you *create* not the one you *remove*. This difference arises from the fact that we are only making one child of longer length - the other "child" in the pseudo-split here is actually the same parent node again, just with a reduced exit state.

In practice that doesn't seem to be so. I experimentally measured that choosing to split the word with maximum P(W) is better than splitting the word with maximum P(child).

I'm not sure what's going wrong with this analysis. In the Marlin code they just split the word with maximum P(W) by analogy to true Tunstall, which I'm not convinced is well justified in plural Tunstall.

While I'm bringing up mysteries, I tried optimal-parsing plural Tunstall. Obviously with "true tunstall" or any prefix-free code that's silly, the greedy parse is the only parse. But with plural Tunstall, you might have "aa" and also "aaa" in the tree. In this scenario, by analogy to LZ, the greedy parse is usually imperfect because it is sometimes better to take a shorter match now to get a longer one on the next work. So maybe some kind of lazy , or heck full optimal parse. (the trivial LZSS backward parse works well here).

Result : optimal-parsed plural Tunstall is identical to greedy. Exactly, so it must be provable. I don't see an easy way to show that the greedy parse is optimal in the plural case. Is it true for all plural dictionaries? (I doubt it) What are the conditions on the dictionary that guarantee it?

I think that this is because for any string in the dictionary, all shorter substrings of that string are in the dictionary too. This makes optimal parsing useless. But I think that property is a coincidence/bug of how Marlin and I did the dictionary construction, which brings me to :

Marlin's dictionary construction method and the one I was using before, which is slightly different, both have the property that they never remove parent nodes when they make children. I believe this is wrong but I haven't been able to make it work a different way.

The case goes like this :

you have word W in the dictionary with no children

you have following chars a,b,c,d.  a and b are very probable, c and d are very rare.

P(W) initially = P_init(W)

you add child 'a' ; W -> Wa , W(b+)
P(W) -= P(Wa)
add child 'b'
W(b+) -> Wb , W(c+)
P(W) -= P(Wb)

now the word W in the dictionary has
P(W) = P(Wc) + P(Wd)

these are quite rare, so P(W) now is very small

W is no longer a desirable dictionary entry.

We got all the usefulness of W out in Wa and Wb, we don't want to keep W in the dictionary just to be able to code it with rare following c's and d's - we'd like to now remove W.

In particular, if the current P(W) of the parent word is now lower than a child we could make somewhere else by splitting, remove W and split the other node. Or something like that - here's where I haven't quite figured out how to make this idea work in practice.

So I believe that both Marlin and my code are NOT making optimal general VTF plural dictionaries, they are making them under the (unnecessary) constraint of the shorter-substring-is-present property.

Tunstall vs Marlin Results Part 1

For some reason I keep trying to type "marling". Anyway...

Geometric distribution , P(n) = r^n

I am comparing "Marlin" = plural Tunstall with P_state word probability model vs. naive plural Tunstall (P_word = P_naive). In both cases 8-byte output words, 12-bit codes.

Marlin :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.990             :  1,000,000 -> 1,231,686 =  9.853 bpb =  0.812 to 1 
decode_time2 : seconds:0.0018 ticks per: 3.064 b/kc : 326.42 MB/s : 564.38
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.985             :  1,000,000 -> 1,126,068 =  9.009 bpb =  0.888 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.840 b/kc : 352.15 MB/s : 608.87
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.978             :  1,000,000 ->   990,336 =  7.923 bpb =  1.010 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.497 b/kc : 400.54 MB/s : 692.53
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.967             :  1,000,000 ->   862,968 =  6.904 bpb =  1.159 to 1 
decode_time2 : seconds:0.0013 ticks per: 2.227 b/kc : 449.08 MB/s : 776.45
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.950             :  1,000,000 ->   779,445 =  6.236 bpb =  1.283 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.021 b/kc : 494.83 MB/s : 855.57
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.927             :  1,000,000 ->   701,049 =  5.608 bpb =  1.426 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.821 b/kc : 549.09 MB/s : 949.37
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.892             :  1,000,000 ->   611,238 =  4.890 bpb =  1.636 to 1 
decode_time2 : seconds:0.0009 ticks per: 1.577 b/kc : 633.93 MB/s : 1096.07
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.773             :  1,000,000 ->   450,585 =  3.605 bpb =  2.219 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.237 b/kc : 808.48 MB/s : 1397.86
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.680             :  1,000,000 ->   373,197 =  2.986 bpb =  2.680 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.053 b/kc : 950.07 MB/s : 1642.67
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.560             :  1,000,000 ->   298,908 =  2.391 bpb =  3.346 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.891 b/kc : 1122.53 MB/s : 1940.85

vs. plural Tunstall :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.99000           :  1,000,000 -> 1,239,435 =  9.915 bpb =  0.807 to 1 
decode_time2 : seconds:0.0017 ticks per: 2.929 b/kc : 341.46 MB/s : 590.39
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.98504           :  1,000,000 -> 1,130,025 =  9.040 bpb =  0.885 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.814 b/kc : 355.36 MB/s : 614.41
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.97764           :  1,000,000 ->   990,855 =  7.927 bpb =  1.009 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.416 b/kc : 413.96 MB/s : 715.73
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.96665           :  1,000,000 ->   861,900 =  6.895 bpb =  1.160 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.096 b/kc : 477.19 MB/s : 825.07
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.95039           :  1,000,000 ->   782,118 =  6.257 bpb =  1.279 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.898 b/kc : 526.96 MB/s : 911.12
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.92652           :  1,000,000 ->   704,241 =  5.634 bpb =  1.420 to 1 
decode_time2 : seconds:0.0010 ticks per: 1.681 b/kc : 594.73 MB/s : 1028.29
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.89183           :  1,000,000 ->   614,061 =  4.912 bpb =  1.629 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.457 b/kc : 686.27 MB/s : 1186.57
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.254 b/kc : 797.33 MB/s : 1378.58
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.77292           :  1,000,000 ->   454,059 =  3.632 bpb =  2.202 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.078 b/kc : 928.04 MB/s : 1604.58
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.67952           :  1,000,000 ->   377,775 =  3.022 bpb =  2.647 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.935 b/kc : 1069.85 MB/s : 1849.77
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.56015           :  1,000,000 ->   304,887 =  2.439 bpb =  3.280 to 1 
decode_time2 : seconds:0.0004 ticks per: 0.724 b/kc : 1381.21 MB/s : 2388.11

Very very small difference. eg :

plural Tunstall :

H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 

Marlin :

H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51

Yes the Marlin word probability estimator helps a little bit, but it's not massive.

I'm not surprised but a bit sad to say that once again the Marlin paper compares to ridiculous straw men and doesn't compare to the most obvious, naive, and well known (see Savari for example, or Yamamoto & Yokoo) similar alternative - just doing plural Tunstall/VTF without the Marlin word probability model.

Entropy above 4 or so is terrible for 12-bit VTF codes.

The Marlin paper uses a "percent efficiency" scale which I find rather misleading. For example, this :

H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 

is what I would consider pretty poor entropy coding. Entropy of 3.98 -> 4.23 bpb is way off. But as a "percent efficiency" it's 94% , which is really high on their graphs.

The more standard and IMO useful way to show this is a delta of your output bits minus the entropy, eg.

excess = 4.238 - 3.986 = 0.252

half a bit per byte wasted. A true arithmetic coder has an excess around 0.001 bpb typically. The worst you can ever do is an excess of 1.0 which occurs in any integer-bit entroy coder as the probability of the MPS goes towards 1.0

Part of my hope / curiosity in investigating this was wondering whether the Marlin procedure would help at all with the way Tunstall VTF codes really collapse in the H > 4 range , and the answer is - no , it doesn't help with that problem at all.

Anyway, on to more results.

Tunstall vs Marlin Results Part 2

Testing on some real files.

Marlin :

loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,286,166 =  9.256 bpb =  0.864 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.467 b/kc : 288.41 MB/s : 498.66
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   618,447 =  3.145 bpb =  2.543 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.281 b/kc : 780.92 MB/s : 1350.21
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,763 =  5.382 bpb =  1.486 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.988 b/kc : 503.06 MB/s : 869.78
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   140,391 =  2.188 bpb =  3.656 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.800 b/kc : 1250.71 MB/s : 2162.48
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    68,511 =  2.878 bpb =  2.780 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.031 b/kc : 969.81 MB/s : 1676.80
total bytes out : 1974785
naive plural Tunstall :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,290,015 =  9.283 bpb =  0.862 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.443 b/kc : 290.45 MB/s : 502.18
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   627,747 =  3.193 bpb =  2.506 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.284 b/kc : 779.08 MB/s : 1347.03
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,934 =  5.408 bpb =  1.479 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.998 b/kc : 500.61 MB/s : 865.56
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   145,980 =  2.276 bpb =  3.516 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.826 b/kc : 1211.09 MB/s : 2093.97
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    74,169 =  3.116 bpb =  2.567 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.103 b/kc : 906.80 MB/s : 1567.86
total bytes out : 1995503

About the files :

lzt24.literals are the literals left over after LZ-parsing (LZQ1) lzt24
  like all LZ literals they are high entropy and thus do terribly in Tunstall

monarch.tga.rrz_filtered.bmp is the image residual after filtering with my DPCM
  (it actually has a BMP header on it which is giving Tunstall a harder time
   than if I stripped the header)

paper1 & pic are standard

tabdir.tab is a text file of a dir listing with lots of tabs in it

For speed comparison, this is the Oodle Huffman on the same files :
loading file (0/5) : lzt24.literals
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,111,673 -> 1,036,540 =  7.459 bpb =  1.072 to 1
encode           : 8.405 millis, 13.07 c/b, rate= 132.26 mb/s
decode           : 1.721 millis, 2.68 c/b, rate= 645.81 mb/s
loading file (1/5) : monarch.tga.rrz_filtered.bmp
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,572,918 ->   586,839 =  2.985 bpb =  2.680 to 1
encode           : 7.570 millis, 8.32 c/b, rate= 207.80 mb/s
decode           : 2.348 millis, 2.58 c/b, rate= 669.94 mb/s
loading file (2/5) : paper1
ooHuffman1 :     53,161 ->    33,427 =  5.030 bpb =  1.590 to 1
encode           : 0.268 millis, 8.70 c/b, rate= 198.67 mb/s
decode           : 0.080 millis, 2.60 c/b, rate= 665.07 mb/s
loading file (3/5) : PIC
ooHuffman1 :    513,216 ->   106,994 =  1.668 bpb =  4.797 to 1
encode           : 2.405 millis, 8.10 c/b, rate= 213.41 mb/s
decode           : 0.758 millis, 2.55 c/b, rate= 677.32 mb/s
loading file (4/5) : tabdir.tab
ooHuffman1 :    190,428 ->    58,307 =  2.450 bpb =  3.266 to 1
encode           : 0.926 millis, 8.41 c/b, rate= 205.70 mb/s
decode           : 0.279 millis, 2.54 c/b, rate= 681.45 mb/s

Tunstall is crazy fast. And of course that's a rather basic implementation of the decoder, I'm sure it could get faster.

Is there an application for plural Tunstall? I'm not sure. I tried it back in 2015 as an idea for literals in Mermaid/Selkie and abandoned it as not very relevant there. It works on low-entropy order-0 data (like image prediction residuals).

Of course if you wanted to test it against the state of the art you should consider SIMD Ryg RANS or GPU RANS. You should consider something like TANS with multiple symbols in the output table. You should consider merged-symbol codes, perhaps using escapes, perhaps runlen transforms. See for example "crblib/huffa.c" for a survey of Huffman ideas from 1996 (pre-runtransform, blocking MPS's, order-1-huff, multisymbol output, etc.)

Some commentary on Marlin and the code

WARNING : I'm a bit delirious with flu and lack of sleep at the moment so I'm not entirely sure what I'm writing. Apologies if it's a mess!

First, there's no point in making the T_ij transition matrix they talk about.

Back in "Understanding Marlin" you may recall I presented the algorithm thusly :

P_state(i) is given from a previous iteration and is constant

build dictionary using Marlin word model

we now have P(W) for all words in our dic

Use P(W) to compute new P_state(i)

optionally iterate a few times (~ 10 times) :
  use that P_state to compute adjusted P(W)
  use P(W) to compute new P_state

iteration dictionary building again (3-4 times)

in the paper (and code) they do it a bit differently. They compute the state transition matrix, which is :

T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)

this is the probability that if you started in state j you will wind up in state i

instead of iterating P_state -> P(W) , they iterate :

T <- T * T

and then P_state(i) = T_i0

I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)

Also for the record - I have yet to find a case where iterating to convergence here actually helps. If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :

laplacian distribution :

no iteration :

     0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

iterate 10X :

     0.67952             :  1,000,000 ->   503,688 =  4.030 bpb =  1.985 to 1 

You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though, heck 2 is probably fine.

4: 0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

3: 0.67952             :  1,000,000 ->   503,721 =  4.030 bpb =  1.985 to 1 

2: 0.67952             :  1,000,000 ->   503,817 =  4.031 bpb =  1.985 to 1 

The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second iteration does the first "Marlin" dictionary build.

In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.

There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be. Some is just over-C++ madness, some is failure to factor out constant expressions.

the word is :

        struct Word : public std::vector<uint8_t> {

and the dictionary is :

std::vector<Word> W;

 (with no reserves anywhere)
yeah that's a problem.  May I suggest :

struct Word {
    uint64 chars;
    int len;

also reserve() is good and calling size() in loops is bad.

This is ouchy :

        virtual double phi(const Word &) const = 0;

and this is even more ouchy :

        virtual std::vector<Word> split(const Word &) const = 0;

The P(w) function that's used frequently is bad.

The key part is :

    for (size_t t = 0; t<=w[0]; t++) {
        double p = PcurrState[t]; 
        p *= P[w[0]]/PnextState[t];
        ret += p;

which you can easily factor out the common P[w[0]] from :

    int c0 = w[0];
    for (size_t t = 0; t<= c0; t++) {
        ret += PcurrState[t]/PnextState[t];
    ret *= P[c0]

but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all!  They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :

    int c0 = w[0];
    ret = PcurrState_over_PnextState_partial_sum[c0];
    ret *= P[c0]

that also gives us a chance to state clearly (again) the difference between "Marlin" and naive plural Tunstall. It's that one table right there.

    int c0 = w[0];
    ret = 1.0;
    ret *= P[c0]

this is naive plural Tunstall. It comes down to a modifed probability table for the first letter in the word.

Recall that :

P_naive(W) = Prod[ chars c in W ] P(c)

simple P_word(W) = P_naive(W) * Ptail( num_children(W) )

Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".

They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partical nodes leave you in a state where some symbols are excluded).

They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.

This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.

In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.

There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.


I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.

two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)

r = 0.67952

H = 3.798339

my version of Marlin-probability plural Tunstall :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

Marlin reference code : 1,000,000 -> 507,412

naive plural Tunstall :

0.67952             :  1,000,000 ->   508,071 =  4.065 bpb =  1.968 to 1 

I presume the reason they compress worse than my version is because they make dictionaries for a handfull of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is mis-matching the actual distribution slightly.

Marlin Summary

Sum up with simple take-away.

Plural Tunstall VTF coding in general is extremely fast to decode. It works best with 12-bit tables (must stay in L1), which means it only works well at entropy <= 4 bpb.

Marlin introduces an improved word probability model that captures the way the first letter probability is skewed by the longer-string exclusion. (this is just like the LAM exclusion in LZ)

That is, after coding with word W, subsequent chars that exist in the dictionary (Wa,Wb) cannot then be the next character to start a word, so the probably of the first char of the word being >= c is increased.

The Marlin word probability model improves compression by 1-4% over naive plural Tunstall.

The simplest implementation of the Marlin probability adjustment is like this :

P_naive(W) = Prod[chars c in W] P('c')

P_word(W) = P_scale_first_char( W[0] ) * P_naive(W) * Ptail( num_children(W) )

(where Ptail is the tail-cumulative-proability :
Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)

(instead of scaling by Ptail you can subtract off the child node probabilities as you make them)

on the first iteration of dictionary building, set P_scale_first_char() = 1.0

this is "naive plural Tunstall"

after building the dictionary, you now have a set of words and P(W) for each
compute :

P_state(i) = Sum[ words W with i children ] P(W)

(state i means only chars >= i can follow)

(iterating P_state -> P(W) a few times here is optional but unnecessary)

P_scale_first_char(c) = Sum[ i <= c ] P_state(i) / P_tail(i)

(P_scale_first_char = P_state_tail)

then repeat dic building one more time
(optionally repeat more but once seems fine)

And that's it!

What do these values actually look like? I thought it might be illuminating to dump them. This is on a pretty skewed file so the effect is large, the larger the MPS probability the bigger the effect.

filelen = 1572918
H = 2.917293

P of chars =

        [0] 0.46841475525106835 double
        [1] 0.11553621994280693 double
        [2] 0.11508546535801611 double
        [3] 0.059216055763873253    double
        [4] 0.058911526220693004    double
        [5] 0.036597584870921435    double
        [6] 0.036475518749229136    double
        [7] 0.018035269480036465    double
        [8] 0.017757441900976400    double
        [9] 0.010309501194595012    double
        [10]    0.0097379520102128646   double

P_state =

        [0] 0.62183816678155190 double
        [1] 0.15374679894466811 double
        [2] 0.032874234239563829    double
        [3] 0.063794018822874776    double
        [4] 0.026001940955215786    double
        [5] 0.011274295764837820    double
        [6] 0.028098911350290755    double
        [7] 0.012986055279597277    double
        [8] 0.0013397794289329405   double
        .. goes to zero pretty fast ..

P_scale_first_char =

        [0] 0.62183816678155202 double
        [1] 0.91106139196208336 double
        [2] 0.99007668169839014 double
        [3] 1.2020426052137052  double
        [4] 1.3096008656256881  double
        [5] 1.3712643080249607  double
        [6] 1.5634088663186672  double
        [7] 1.6817189544649160  double
        [8] 1.6963250203077103  double
        [9] 1.9281295477496172  double
        [10]    1.9418127462353780  double
        [11]    2.0234438458996773  double
        [12]    2.0540542047979415  double
        [13]    2.1488636999462676  double
        [14]    2.2798060244386895  double
        [15]    2.2798060244386895  double
        [16]    2.2798060244386895  double
        [17]    2.3660205062350039  double
        [18]    2.3840557757150402  double
        [19]    2.3840557757150402  double
        [20]    2.3840557757150402  double
        [21]    2.4066061022686838  double
        [22]    2.5584098628550294  double
        [23]    2.5584098628550294  double
        [24]    2.6690752676624321  double
        .. gradually goes up to 4.14

estimate real first char probability 
= P(c) * P_scale_first_char(c) =

        [0] 0.29127817269875372 double
        [1] 0.10526058936313110 double
        [2] 0.11394343565337962 double
        [3] 0.071180221940886246    double
        [4] 0.077150585733949978    double
        [5] 0.050184961893408854    double
        [6] 0.057026149416117611    double
        [7] 0.030330254533459933    double

the effect is to reduce the skewing of the probabilities in the post-word alphabet.

I think this problem space is not fully explored yet and I look forward to seeing more work in this domain in the future.

I'm not convinced that the dictionary building scheme here is optimal. Is there an optimal plural VTF dictionary?

I think maybe there's some space between something like RANS and Tunstall. Tunstall inputs blocks of N bits and doesn't carry any state between them. RANS pulls blocks of N bits and carries full state of unused value range between them (which is therefore a lot slower because of dependency chains). Maybe there's something in between?

Another way to think about this Marlin ending state issue is a bit like an arithmetic coder. When you send the index of a word that has children, you have not only sent that word, you've also sent information about the *next* word. That is, some fractional bits of information that you'd like to carry forward.

Say you have {W,Wa,Wb} in the dictionary and you send a W. You've also sent that the next word start with >= c. That's like saying your arithmetic coder cumprob is >= (Pa+Pb). You've refined the range of the next interval.

This could be done with Tunstall by doing the multi-table method (separate tables for state 0,1, and 2+), but unfortunately that doesn't fit in L1.

BTW you can think of Tunstall in an arithmetic codey way. Maybe I'll draw a picture because it's easier to show that way...


Tunstall Context

Looking at Marlin today, some reminders to self/all about context :

Tunstall was originally designed with binary alphabets in mind; variable to fixed on *binary*. In that context, doing full child trees (so dictionaries are full prefix trees and encoding is unique) makes sense. As soon as you do variable-to-fixed (hence VTF) on large alphabets, "plural" trees are obviously better and have been written about much in the past. With plural trees, encoding is not unique ("a" and "ab" are both in the dictionary).

There's a big under-specified distinction between VTF dictionaries that model higher level correlation and those that don't. eg. does P("ab") = P(a)*P(b) or is there order-1 correlation?

I looked at Tunstall codes a while ago (TR "failed experiment : Tunstall Codes" 12/4/2015). I didn't make this clear but in my experiment I was looking at a specific scenario :

symbols are assumed to have only order-0 entropy
(eg. symbol probabilities describe their full statistics)

encoder transmits symbol probabilities (or the dictionary of words)
but there are other possibilities that some of the literature addresses. There are at lot of papers on "improved Tunstall" that use the order-N probabilities (the true N-gram counts for the words rather than multiplying the probability of each character). Whether or not this works in practice depends on context, eg. on LZ literals the characters are non-adjacent in the source so this might not make sense.

There's a fundamental limitation with Tunstall in practice and a very narrow window where it makes sense.

On current chips, 12-bit words is ideal (because 4096 dwords = 16k = fits in L1). 16 bit can sometimes give much better compression, but falling out of L1 is a disaster for speed.

12-bit VTF words works great if the entropy of the source is <= 5 bits or so. As it goes over 5, you have too many bigrams that don't pack well into 12, and the compression ratio starts to suffer badly (and decode speed suffers a bit).

I was investigating Tunstall in the case of normal LZ literals, where entropy is always in the 6-8 bpc range (because any more compressability has been removed by the string-match portion of the LZ). In that case Tunstall just doesn't work.

Tunstall is best when entropy <= 3 bits or so. Not only do you get compression closer to entropy, you also get more decode speed.

Now for context, that's a bit of a weird place to just do entropy coding. Normally in low-entropy scenarios, you would have some kind of coder before just tossing entropy coding at it. eg. take DCT residuals, or any image residual situation. You will have lots of 0's and 1's so it looks like a very low entropy scenario for order-0 entropy, but typically you would remove that by doing RLE or something else so that the alphabet you hand to the entropy coder is higher entropy. (eg. JPEG does RLE of 0's and EOB).

Even if you did just entropy code on a low-entropy source, you might instead use a kind of cascaded coder. Again assuming something like prediction residuals where the 0's and 1's are very common, you might make a two-stage alphabet that's something like :

alphabet 1 : {0,1,2,3+}
alphabet 2 : {3,4,...255}

Then with alphabet 1 you could pack 4 symbols per byte and do normal Huffman. Obviously a Huffman decode is a little slower than Tunstall, but you're getting always 4 symbols per decode so output len is not variable, and compression ratio is better.

Tunstall for LZ literals might be interesting in a very fast LZ with MML 8 or so. (eg. things like ZStd level 1, which is also where multi-symbol-output Huff works well).

Point is - the application window here is pretty narrow, and there are other techniques that also address the same problem.

Understanding Marlin

I will be using "Tunstall" to mean any variable to fixed coder. I am considering large alphabet (eg. 8-bit input alphabet), "plural" (eg. non-prefix-free dictionary). I am also considering only modeling order-0 statistics.

I label the symbols 'a','b','c' from most probable to least probable. I will use single quotes for symbols, and double quotes for words in the dictionary. So :

P('a') , P('b'), etc. are given

P('a') >= P('b') >= P('c') are ordered

I will use the term "Marlin" to describe the way they estimate the probability of dictionary words. (everything else in the paper is either obvious or well known (eg. the way the decoder works), so the innovation and interesting part is the word probability estimation, so that is what I will call "Marlin" , the rest is just "Tunstall").

Ok. To build a Tunstall dictionary your goal is to maximize the average input length, which is :

average input length = Sum[words] { P(word) * L(word) }

since the output length is fixed, maximizing the input length maximizes compression ratio.

In the original Tunstall algorithm on binary input alphabet, this is easily optimized by splitting the most probable word, and adding its two children. This can be done in linear time using two queues for left and right (0 and 1) children.

The Marlin algorithm is all about estimating P(word).

The first naive estimate (what I did in my 12/4/2015 report) is just to multiply the character probabilities :

P(word) = Prod[c in word] P('c')

that is

P("xyz") = P('x') * P('y') * P('z')

but that's obviously not right. The reason is that the existence of words in the dictionary affects the probability of other words.

In particular, the general trend is that the dictionary will accumulate words with the most probable characters (a,b,c) which will make the effective probability of the other letters in the remainder greater.

For example :

Start the dictionary with the 256 single-letter words

At this point the naive probabilities are exact, that is :

P("a") (the word "a") = P('a') (the letter 'a')

Now add the most probable bigram "aa" to the dictionary.

We now have a 257-word dictionary.  What are the probabilities when we code from it ?

Some of the occurrances of the letter 'a' will now be coded with the word "aa"

That means P("a") in the dictionary is now LESS than P('a')

Now add the next most probable words, "ab" and "ba"

The probability of "a" goes down more, as does P("b")

Now if we consider the choice of what word to add next - is it "ac" or "bb" ?

The fact that some of the probability of those letters has been used by words in the dictionary affects
our estimate, which affects our choice of how to build the dictionary.

so that's the intuition of the problem, and the Marlin algorithm is one way to solve it.

Let's do it intuitively again in a bit more detail.

There are two issues : the way the probability of a shorter word is reduced by the presence of longer words, and the way the probability of raw characters that start words is changed by the probability of them coming after words in the dictionary.

Say you have word W in your dictionary

and also some of the most probable children.

W, Wa, Wb are in dictionary

Wc, Wd are not

We'll say word "W" has 2 children (Wa and Wb).

So word "W" will only be used from the dictionary if the children are NOT a or b
(since in that case the longer word would be used).

So if you have seen word "W" so far, to use word W, the next character must be terminal,
eg. one that doesn't correspond to another child.

So the probability of word W should be adjusted by :

P(W) *= P(c) + P(d)

Because we are dealing with sorted probability alphabets, we can describe the child set with just one integer to indicate which ones are in the dictionary. In Marlin terminology this is c(w), and corresponds to the state Si.

If we make the tail cumulative probability sum :

Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
Ptail(255) = P(255)
Ptail(254) = P(254) + P(255)
Ptail(0) = sum of all P('c')  = 1.0

then the adjustment is :

P(W) *= P(c) + P(d)
P(W) *= Ptail('c')
P(W) *= Ptail(first char that's not a child)

P(W) *= Ptail( num_children(W) )
(I'm zero-indexing, so no +1 here as in the Marlin paper, they 1-base-index)

ADD : I realized there's a simpler way to think about this. When you add a child word, you simply remove that probability from the parent. That is :

let P_init(W) be the initial probability of word W
when it is first added to the dictionary and has no children

track running estimate of P(W)

when you add child 'x' making word "Wx"

The child word probability is initialized from the parent's whole probability :

P_init(Wx) = P_init(W) * P('x')

And remove that from the running P(W) :

P(W) -= P_init(Wx)

That is you just make P(W) the probability of word W, excluding children that exist.
Once you add all children, P(W) will be zero and the word is useless.

Okay, so that does the first issue (probability of words reduced by the presence of longer words). Now the next issue. Consider the same simple example case first :

W,Wa,Wb are in dictionary, Wc,Wd are not
(no longer children of W are either)

Say you reach node "Wa"

there are no children of "Wa" in the dictionary, so all following characters are equally likely

This means that starting the next word, the character probabilities are equal to their original true probabilities

But say you reach node "W" and leave via 'c' or 'd'

In that case the next character must be 'c' or 'd' , it can never be 'a' or 'b'

So the probability of the next character being 'c' goes up by the probability of using word "W" and the
probability of being a 'c' after "W" , that's :

estimate_P('c') +=  P("W") * P('c') / ( P('c') + P('d') )


estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) )

now you have to do this for all paths through the dictionary. But all ways to exit with a certain child count are similar, so you can merge those paths to reduce the work. All words with 2 children will be in the same exit probability state ('a' and 'b' can't occur but chars >= 'c' can).

This is the Marlin state S_i. S_i means that character is >= i. It happens because you left the tree with a word that had i children.

When you see character 2 that can happen from state 0 or 1 or 2 but never states >= 3.

for estimating probability of word W

W can only occur in states where the first character W[0] is possible

that is state S_i with i <= W[0]

When character W[0] does occur in state S_i , the probability of that character is effectively higher,
because we know that chars < i can't occur.

Instead of just being P(char W[0]) , it's divided by Ptail(i)

(as in estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) ) above)

So :

P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

the probability of W is the sum of the probability of states it can start from
(recall states = certain terminal character sets)
times the probability of W given that state


P_naive(W) = Product[ chars c in W ] P(char 'c')

be the naive word probability, then :

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )

is what we need.  This is equation (1) in the Marlin paper.

The first term increases the probability of W for higher chars, because we know the more probable lower chars can't occur in this state (because they found longer words in the dictionary)

The last term decreases the probability of W because it will only be used when the following character doesn't cause a longer word in the dictionary to be used

Now of course there's a problem. This P(W) probability estimate for words requires the probability of starting the word from state S_i, which we don't know. If we had the P(W) then the P of states is just :

P(S_i) = Sum[ words W that have i children ] * P(W)

so to solve this you can just iterate. Initialize the P(S_i) to some guess; the Marlin code just does :

P(state 0) = 1.0
P(all others) = 0.0

(recall state 0 is the state where all chars are possible, no exclusions, so
characters just have their original order-0 probability)

feeds that in to get P(W), feeds that to update P(S_i), and repeats to convergence.

To build the dictionary you simply find the word W with highest P(W) and split it (adding its next most probable child and increasing its child count by 1).

The Marlin code does this :

seed state probabilities


build dictionary greedily using fixed state probabilities

update state probabilities


That is, during the dictionary creation, word probabilities are estimated using state probabilities from the previous iteration. They hard-code this to 3 iterations.

There is an alternative, which is to update the state probabilities as you go. Any time you do a greedy word split, you're changing 3 state probabilities, so that's not terrible. But changing the state probabilities means all your previous word estimate probabilities are now wrong, so you have to go back through them and recompute them. This makes it O(N^2) in the dictionary size, which is bad.

For reference, combining our above equations to make the word probability estimate :

P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )


P(W) = P_naive(W) * Ptail( num_children(W) ) * Sum[ i <= W[0] ] P( state S_i ) / Ptail(i)

the second half of that can be tabulated between iterations :

P_state_tail(j) = Sum[ i <= j ] P( state S_i ) / Ptail(i)

so :

P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )

you can see the "Marlin" aspect of all this is just in using this P(W) rather than P_naive(W) . How important is it exactly to get this P(W) right? (and is it right?) We'll find out next time...

old rants