8/22/2017

Oodle 2.5.5 - encoder bug fix

Oodle 2.5.5 fixes a bug in the Kraken & Mermaid encoders which could cause them to make compressed data that decodes incorrectly (producing output different than the original) or could cause the decoder to return failure.

This bug was present from Oodle 2.5.0 to 2.5.4 ; if you use those versions you should update to 2.5.5

When the bug occurs, the OodleLZ_Compress call returns success, thinking it made valid compressed data, but it has actually made a damaged bit stream. When you call Decompress it might return failure, or it might return success but produce decompressed output that does not match the original bits.

Any compressed data that you have made which decodes successfully (and matches the original uncompressed data) is fine. The presence of the bug can only be detected by attempting to decode compressed data and checking that it matches the original uncompressed data.

The decoder is not affected by this bug, so if you have shipped user installations that only do decoding, they don't need to be updated. If you have compressed files which were made incorrectly because of this bug, you can patch only those individual compressed files.


Technical details :

This bug was caused by one of the internal bit stream write pointers writing past the end of its bits, potentially over-writing another previously written bit stream. This caused some of the previously written bits to become garbage, causing them to decode into something other than what they had been encoded from.

This only occured with 64-bit encoders. Any data written by 32-bit encoders is not affected by this bug.

This bug could in theory occur on any Kraken & Mermaid compressed data. In practice it's very rare and I've only seen it in one particular case - "whole huff chunks" on data that is only getting a little bit of compression, with uncompressed data that has a trinary byte structure (such as 24-bit RGB). It's also much more likely in pre-2.3.0 compatibility mode (eg. with OodleLZ_BackwardsCompatible_MajorVersion=2 or lower).


BTW it's probably a good idea in general to decode and verify the data after every compress.

I don't do it automatically in Oodle because it would add to encode time, but on second thought that might be a mistake. Pretty much all the Oodle codecs are so asymmetric, that doing a full decode every time wouldn't add much to the encode time. For example :


Kraken Normal level encodes at 50 MB/s
Kraken decodes at 1000 MB/s

To encode 1 MB is 0.02 s
To decode 1 MB is 0.001 s

To decode after every encode changes the encode time to 0.021 s = 47.6 MB/s

it's not a very significant penalty to encode time, and it's worth it to verify that your data definitely decodes correctly. I think it's a good idea to go ahead and add this to your tools.

I may add a "verify" option to the Compress API in the future to automate this.

8/08/2017

Oodle 2.5.4 - now with Windows UWP

Oodle 2.5.4 is out. There's now a separate Windows UWP SDK (separate from Win32).

Oodle for Windows UWP comes with only the "core" library that does memory to memory compression. The Oodle Core library uses no threads, has minimal dependencies (just the CRT), no funny business, making it very portable.

For full details see the Oodle Change Log

7/03/2017

Well Crap

I was cleaning my blog, deleting a bunch old posts, and accidentally deleted some I didn't want to. I'm going to repost a few, so if you have a subscription you may see odd old posts floating in because of that.

Unfortunately there's no blogger recover or trash can feature that I can just undo the delete. Frowny face. Also, while I can repost them, the comments are gone. And unfortunately it seems I can't post them to the same URL. The blogger post URL seems to be irrevocably marked with the post date, and even if I retro-date the post, it munges the URL to not be the same as the original.

ADD : I reposted a few of the ones I wanted to save. The new links are :

cbloom rants 09-27-08 On LZ and ACB
cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 Random file stuff I've learned
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 10-10-08 On LZ Optimal Parsing
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use

5/13/2017

How we used Exceptions on Oddworld : Stranger's Wrath

Apparently I talked about this before in my Game Tech talk in 2004, but I never wrote it on my bloggy blog, so here goes.

On Stranger we used exceptions as a last gasp measure during dev to try to keep the game running for our content creation team. It worked great and I think everyone should use a similar system in game development.

We did not ship with exceptions. They were only used during development. To be clear, what we did NOT do :


We did NOT :

Use C++ exceptions (we used SEH with __try , __throw , __except)

Try to do proper "exception-safe" C++ ala Herb Sutter
  (this is a bizarre and very tricky complex way of writing C++ that requires
  doing everything in a different way than the normal linear imperative code; it
  uses lots of swaps and temp objects)

Return every error with exceptions ; most errors were via return value

Try to unwind exceptions cleanly/robustly

Just kill the game on exceptions

Any error that we expected to happen, or could happen in ship, such as files not found, media errors, etc. were handled with return codes. The standard way of functions returning codes and the calling code checking it and handling it somehow.

Also, errors that we could detect and just fix immediately, but not return a code, we would just fix. So, like say you tried to create an Actor and the pref file describing that actor didn't exist, we'd just print an error (and automatically email it to dev@oddworld) and just not create that Actor. Hey, shit's wrong, but you can continue.

The principle is : don't block artist A from working on their stuff just because the programmers or some other artist checked in other broken stuff. If possible, just disable the broken stuff, because artist A can probably continue.

Say the guys working on particle systems keep checking in broken stuff that would crash the game or cause lots of errors - fine. The rest of the art team can still be syncing to new builds, and they will just see an error printed about "particle system XX failed ; disabled" and then they can continue working on their other stuff.

Blocking the art/design team (potentially a lot of people) for even 5 minutes while you try to roll things back or whatever to fix it is really a HUGE HUGE disaster and should never ever happen.

Any time your artists/designers have to get up and go get coffee/snacks in the kitchen because things are broken and they can't work - you massively fucked up and you should endeavor to never do that again.

But there are inevitably problems that we didn't just detect and disable the object (like the pref not found above). Maybe you just get a crash in some code due to an array ref out of bounds, or somewhere deep in the code you detect a bad fault that you can't fix.

So, as a catch of last measure we used exceptions. The way we did it was to wrap a try/catch around each game object creation & update, and if it caught an exception, that object was removed.


for each object O in the world list
{

__try
{
  O->Update();
}
__except
{
  show & email error about O throwing
  remove O from world list
  // don't delete the object O since it could still be pointed at by the world, or could be corrupt
}

}

Removing O prevents it from trying to Update again and thus spamming. We assume that once it throws, something is badly broken there and we'll just get rid of it.

As I said before, this is NOT trying to catch every error and handle it in a robust way. Obviously O may have been partially through with its Update and put the world in a weird state, it may not keep the game from crashing to just remove O, there are lots of possible problems and we don't try to handle them. It's "optimistic" in that sense that we sort of expect it to fail and cause problems, but if it ever does work, then awesome, great, it saved an artist from crashing. In practice it actually works fine 90% of the time.

We specifically do *not* want to be robust, because writing fully robust exception-safe code (that would have to roll back all the partial changes to the world if there was a throw somewhere through the update) is too onerous. The idea of this system is that it imposes *zero* extra work on programmers writing normal game code.

We could also manually __throw in some places where appropriate. The criterion for doing that is : not an error you should ever get in the final game, it's a spot where you can't return an error code or just show a failure measure and do some kind of default fallback. You also don't need to __throw if it's a spot where the CPU will throw an interrupt for you.

For example, places where we might manually __throw : inside a vector push_back if the malloc to extend failed. In an array out of bounds deref. In the smart-pointer deref if the pointer is null.

Places where we don't __throw : trying to normalize a zero vector or orthonormalize a degenerate frame. These are better to detect, log an error message, and just stuff in unitZ or something, because while that is broken it's a better way to break than the throw-catch mechanism which should only be used if there's no nicer way to stub-out the broken behavior.

Some (not particularly related) links :

cbloom rants 02-04-09 - Exceptions
cbloom rants 06-07-10 - Exceptions
cbloom rants 11-09-11 - Weird shite about Exceptions in Windows

4/18/2017

Context on Chroma Subsampling

Poked by Guetzli into thinking about pros/cons of chroma subsampling -

Chroma downsampling (as in standard JPEG YCbCr 420) is a big ugly hammer. It just throws away a ton of bits of information. That's pretty much always a bad thing in compression.

Now, preferring to throw away chroma detail before luma detail is valid and good. So if you are not chroma subsampling, then you need a perceptually optimizing encoder that knows to give fewer bits to high frequency chroma. You have much more control doing this through bit allocation than you do by just smashing the chroma planes. (for example, on blocks where there is almost no luma signal, then you might keep more of the high frequency chroma, on blocks with luma masking you can throw away lots of the high chroma AC bits - you just have way more precise control).

The chroma subsample is just a convenient way to get decent perceptual tradeoffs in a *non* optimizing encoder.

Chroma subsample is of course an R-D choice. It throws away signal, giving a certain disortion, in exchange it saves you some rate. This particular move is a good R-D choice at some tradeoff zone. Generally at high bit rate, it's a bad move. In most encoders, it becomes a good move at some lower quality. (in JPEG the tradeoff point is usually somewhere around 85). Measuring this D in RMSE is easy, but measuring it perceptually is rather tricky (when luma masking is present it may be near zero D perceptually, but without luma masking it can be much worse).

There are other issues.

In non-subsampled world, the approximate important weights for YCbCr are something like {0.7,0.13,0.17} . If you do subsample, then the chroma weights per-pixel need to go up by 4X , in which case they become pretty close to all being the same.

Many people mistakenly say the "eye does not see blue levels well". Not true, the eye can see the overall level of blue perfectly well, just as well as red or green. (eg for images where the whole thing is a single solid color). What the eye has is very poor spatial resolution in blue.

One of the issues is that chroma subsample is a nice win for *speed*. It gives you 4X less pixels to work on in two of your planes, half as many pixels overall. This means that subsampled chroma images are almost 2X faster to decode.

I used to be anti-chroma-subsample in my early years. For example in wavelets it's much neater to keep all your color planes full res, but send your bitplanes in [YUV] order. That way when you truncate the bottom bit planes, you drop the highest frequency chroma first. But then I realized that the 2X speedup from chroma subsample was nothing to sneeze at, and in general I'm now pro-chroma-subsample.

Another reminder : if you don't chroma subsample, then you may as well do a KLT on the color planes, rather than just use YUV or whatever. (maybe even KLT per region). The advantage of using standard YUV is that the chroma are known to be perceptually okay to downsample (you can't downsample the two minor components of the KLT transformed planes because you have no guarantee that they are of a type that the eye can't perceive high frequency data).

You can obviously construct adversarial images where the detail is all in chroma (the whole image has a constant luma). In that case chroma downsampling looks really bad and is perceptually a big mistake.

Chroma-from-luma in the decoder fixes all the color fringing that people associate with JPEG, but obviously it doesn't help in the adversarial cases where there is no luma detail to boost the chroma with.

I should also note while I'm at it that there are many codecs out there that just have bugs and/or mistakes in their downsamplers and/or upsamplers that cause this operation to produce way more error than it should.


ADD : Won sent me an email with an interesting idea I'd never thought about. Rather than just jumping between not downsampling chroma and doing 2x2 downsample, you could take more progressive steps, such as going to a checkerboard of chroma (half as many pixels) or a Bayer pattern. It's probably too complex to support these well and make good encoder decisions in practice, but they're interesting in theory.

4/15/2017

Tunstall in an arithmetic way

You can think of the Tunstall dictionary build in an arithmetic codery way.

Your dictionary is like the probability interval. You start with a full interval [0,1]. You put in all the single-character words, with P(c) for each char, so that all sums to one. You iteratively split the largest interval, and subdivide the range.


In binary the "split" operation is :

W -> W0 , W1

P(W) = P(W)*P(0) + P(W)*P(1)

In N-ary the split is :

W -> Wa , W[b+]

P(W) = P(W)*P(a) + P(w)*Ptail(b)

W[b+] means just the word W, but in the state "b+" (aka state "1"), following sym must be >= b
(P(w) here means P_naive(w), just the char probability product)

W[b+] -> Wb , W[c+]

P(w)*Ptail(b) = P(w)*P(b) + P(w)*Ptail(c)

(recall Ptail(c) = tail cumprob, sum of P(char >= c))
(Ptail(a) = 1.0)

So we can draw a picture like an arithmetic coder does, spliting ranges and specifying cumprob intervals to choose our string :

You just keep splitting the largest interval until you have a # of intervals = to the desired number of codes. (8 here for 3-bit Tunstall).

At that point, you still have an arithmetic coder - the intervals are fractional sizes and are correctly sized to the probabilities of each code word. eg. the interval for 01 is P(0)*P(1).

In the final step, each interval is just mapped to a dictionary codeword index. This gives each codeword an equal 1/|dictionary_size| probability interval, which in general is not quite right.

This is where the coding inefficiency comes from - it's the difference between the correct interval sizes and the size that we snap them to.

(ADD : in the drawing I wrote "snap to powers of 2" ; that's not the best way of describing that; they're just snapping to the subsequent i/|dictionary_size|. In this case with dictionary_size = 8 those points are {0,1/8,1/4,3/8,1/2,..} which is why I was thinking about powers of 2 intervals.)

4/14/2017

Classical Tunstall

Before continuing with Marlin, I want to take a brief digression to review "classical" or "true" Tunstall.

The classical Tunstall algorithm constructs VTF (variable to fixed) codes for binary memoryless (order-0) sources. It constructs the optimal code.

You start with dictionary = { "0","1" } , the single bit binary strings. (or dictionary = the null string if you prefer)

You then split one word W in the dictionary to make two new words "W0" and "W1" ; when you split W, it is removed since all possible following symbols now have words in the dictionary.

The algorithm is simple and iterative :


while dic size < desired
{
find best word W to split
remove W
add W0 and W1
}

each step increments dictionary size by +1

What is the best word to split?

Our goal is to maximize average code length :


A = Sum[words] P(W) * L(W)

under the split operation, what happens to A ?

W -> W0, W1

delta(A) = P(W0) * L(W0) + P(W1) * L(W1) - P(W) * L(W)

P(W0) = P(W)*P(0)
P(W1) = P(W)*P(1)
L(W0) = L(W)+1

.. simplify ..

delta(A) = P(W)

so to get the best gain of A, you just split the word with maximum probability. Note of course this is just greedy optimization of A and that might not be the true optimum, but in fact it is and the proof is pretty neat but I won't do it here.

You can naively build the optimal Tunstall code in NlogN time with a heap, or slightly more cleverly you can use two linear queues for left and right children and do it in O(N) time.

Easy peasy, nice and neat. But this doesn't work the same way for the large-alphabet scenario.


Now onto something that is a bit messy that I haven't figured out.

For "plural Tunstall" we aren't considering adding all children, we're only considering adding the next child.

A "split" operation is like :


start with word W with no children
W ends in state 0 (all chars >= 0 are possible)

the next child of W to consider is "W0"
(symbols sorted so most probable is first)

if we add "W0" then W goes to state 1 (only chars >= 1 possible)

W_S0 -> "W0" + W_S1

W_S1 -> "W1" + W_S2

etc.

again, we want to maximize A, the average codelen. What is delta(A) under a split operation?

delta(A) = P("W0") * L("W0") + P(W_S1) * L(W) - P(W_S0) * L(W)

delta(A) = P("W0") + (P("W0") + P(W_S1) - P(W_S0)) * L(W)

P("W0") + P(W_S1) - P(W_S0) = 0

so

delta(A) = P("W0") 

it seems like in plural Tunstall you should "split" the word that has maximum P("W0") ; that is maximize the probability of the word you *create* not the one you *remove*. This difference arises from the fact that we are only making one child of longer length - the other "child" in the pseudo-split here is actually the same parent node again, just with a reduced exit state.

In practice that doesn't seem to be so. I experimentally measured that choosing to split the word with maximum P(W) is better than splitting the word with maximum P(child).

I'm not sure what's going wrong with this analysis. In the Marlin code they just split the word with maximum P(W) by analogy to true Tunstall, which I'm not convinced is well justified in plural Tunstall.


While I'm bringing up mysteries, I tried optimal-parsing plural Tunstall. Obviously with "true tunstall" or any prefix-free code that's silly, the greedy parse is the only parse. But with plural Tunstall, you might have "aa" and also "aaa" in the tree. In this scenario, by analogy to LZ, the greedy parse is usually imperfect because it is sometimes better to take a shorter match now to get a longer one on the next work. So maybe some kind of lazy , or heck full optimal parse. (the trivial LZSS backward parse works well here).

Result : optimal-parsed plural Tunstall is identical to greedy. Exactly, so it must be provable. I don't see an easy way to show that the greedy parse is optimal in the plural case. Is it true for all plural dictionaries? (I doubt it) What are the conditions on the dictionary that guarantee it?

I think that this is because for any string in the dictionary, all shorter substrings of that string are in the dictionary too. This makes optimal parsing useless. But I think that property is a coincidence/bug of how Marlin and I did the dictionary construction, which brings me to :


Marlin's dictionary construction method and the one I was using before, which is slightly different, both have the property that they never remove parent nodes when they make children. I believe this is wrong but I haven't been able to make it work a different way.

The case goes like this :


you have word W in the dictionary with no children

you have following chars a,b,c,d.  a and b are very probable, c and d are very rare.

P(W) initially = P_init(W)

you add child 'a' ; W -> Wa , W(b+)
P(W) -= P(Wa)
add child 'b'
W(b+) -> Wb , W(c+)
P(W) -= P(Wb)

now the word W in the dictionary has
P(W) = P(Wc) + P(Wd)

these are quite rare, so P(W) now is very small

W is no longer a desirable dictionary entry.

We got all the usefulness of W out in Wa and Wb, we don't want to keep W in the dictionary just to be able to code it with rare following c's and d's - we'd like to now remove W.

In particular, if the current P(W) of the parent word is now lower than a child we could make somewhere else by splitting, remove W and split the other node. Or something like that - here's where I haven't quite figured out how to make this idea work in practice.

So I believe that both Marlin and my code are NOT making optimal general VTF plural dictionaries, they are making them under the (unnecessary) constraint of the shorter-substring-is-present property.

Tunstall vs Marlin Results Part 1

For some reason I keep trying to type "marling". Anyway...

Geometric distribution , P(n) = r^n

I am comparing "Marlin" = plural Tunstall with P_state word probability model vs. naive plural Tunstall (P_word = P_naive). In both cases 8-byte output words, 12-bit codes.

Marlin :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.990             :  1,000,000 -> 1,231,686 =  9.853 bpb =  0.812 to 1 
decode_time2 : seconds:0.0018 ticks per: 3.064 b/kc : 326.42 MB/s : 564.38
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.985             :  1,000,000 -> 1,126,068 =  9.009 bpb =  0.888 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.840 b/kc : 352.15 MB/s : 608.87
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.978             :  1,000,000 ->   990,336 =  7.923 bpb =  1.010 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.497 b/kc : 400.54 MB/s : 692.53
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.967             :  1,000,000 ->   862,968 =  6.904 bpb =  1.159 to 1 
decode_time2 : seconds:0.0013 ticks per: 2.227 b/kc : 449.08 MB/s : 776.45
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.950             :  1,000,000 ->   779,445 =  6.236 bpb =  1.283 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.021 b/kc : 494.83 MB/s : 855.57
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.927             :  1,000,000 ->   701,049 =  5.608 bpb =  1.426 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.821 b/kc : 549.09 MB/s : 949.37
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.892             :  1,000,000 ->   611,238 =  4.890 bpb =  1.636 to 1 
decode_time2 : seconds:0.0009 ticks per: 1.577 b/kc : 633.93 MB/s : 1096.07
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.773             :  1,000,000 ->   450,585 =  3.605 bpb =  2.219 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.237 b/kc : 808.48 MB/s : 1397.86
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.680             :  1,000,000 ->   373,197 =  2.986 bpb =  2.680 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.053 b/kc : 950.07 MB/s : 1642.67
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.560             :  1,000,000 ->   298,908 =  2.391 bpb =  3.346 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.891 b/kc : 1122.53 MB/s : 1940.85

vs. plural Tunstall :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.99000           :  1,000,000 -> 1,239,435 =  9.915 bpb =  0.807 to 1 
decode_time2 : seconds:0.0017 ticks per: 2.929 b/kc : 341.46 MB/s : 590.39
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.98504           :  1,000,000 -> 1,130,025 =  9.040 bpb =  0.885 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.814 b/kc : 355.36 MB/s : 614.41
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.97764           :  1,000,000 ->   990,855 =  7.927 bpb =  1.009 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.416 b/kc : 413.96 MB/s : 715.73
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.96665           :  1,000,000 ->   861,900 =  6.895 bpb =  1.160 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.096 b/kc : 477.19 MB/s : 825.07
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.95039           :  1,000,000 ->   782,118 =  6.257 bpb =  1.279 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.898 b/kc : 526.96 MB/s : 911.12
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.92652           :  1,000,000 ->   704,241 =  5.634 bpb =  1.420 to 1 
decode_time2 : seconds:0.0010 ticks per: 1.681 b/kc : 594.73 MB/s : 1028.29
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.89183           :  1,000,000 ->   614,061 =  4.912 bpb =  1.629 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.457 b/kc : 686.27 MB/s : 1186.57
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.254 b/kc : 797.33 MB/s : 1378.58
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.77292           :  1,000,000 ->   454,059 =  3.632 bpb =  2.202 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.078 b/kc : 928.04 MB/s : 1604.58
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.67952           :  1,000,000 ->   377,775 =  3.022 bpb =  2.647 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.935 b/kc : 1069.85 MB/s : 1849.77
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.56015           :  1,000,000 ->   304,887 =  2.439 bpb =  3.280 to 1 
decode_time2 : seconds:0.0004 ticks per: 0.724 b/kc : 1381.21 MB/s : 2388.11

Very very small difference. eg :


plural Tunstall :

H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 

Marlin :

H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51

Yes the Marlin word probability estimator helps a little bit, but it's not massive.

I'm not surprised but a bit sad to say that once again the Marlin paper compares to ridiculous straw men and doesn't compare to the most obvious, naive, and well known (see Savari for example, or Yamamoto & Yokoo) similar alternative - just doing plural Tunstall/VTF without the Marlin word probability model.

Entropy above 4 or so is terrible for 12-bit VTF codes.

The Marlin paper uses a "percent efficiency" scale which I find rather misleading. For example, this :


H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 

is what I would consider pretty poor entropy coding. Entropy of 3.98 -> 4.23 bpb is way off. But as a "percent efficiency" it's 94% , which is really high on their graphs.

The more standard and IMO useful way to show this is a delta of your output bits minus the entropy, eg.


excess = 4.238 - 3.986 = 0.252

half a bit per byte wasted. A true arithmetic coder has an excess around 0.001 bpb typically. The worst you can ever do is an excess of 1.0 which occurs in any integer-bit entroy coder as the probability of the MPS goes towards 1.0

Part of my hope / curiosity in investigating this was wondering whether the Marlin procedure would help at all with the way Tunstall VTF codes really collapse in the H > 4 range , and the answer is - no , it doesn't help with that problem at all.

Anyway, on to more results.

Tunstall vs Marlin Results Part 2

Testing on some real files.

Marlin :

loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,286,166 =  9.256 bpb =  0.864 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.467 b/kc : 288.41 MB/s : 498.66
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   618,447 =  3.145 bpb =  2.543 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.281 b/kc : 780.92 MB/s : 1350.21
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,763 =  5.382 bpb =  1.486 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.988 b/kc : 503.06 MB/s : 869.78
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   140,391 =  2.188 bpb =  3.656 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.800 b/kc : 1250.71 MB/s : 2162.48
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    68,511 =  2.878 bpb =  2.780 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.031 b/kc : 969.81 MB/s : 1676.80
total bytes out : 1974785
naive plural Tunstall :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,290,015 =  9.283 bpb =  0.862 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.443 b/kc : 290.45 MB/s : 502.18
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   627,747 =  3.193 bpb =  2.506 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.284 b/kc : 779.08 MB/s : 1347.03
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,934 =  5.408 bpb =  1.479 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.998 b/kc : 500.61 MB/s : 865.56
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   145,980 =  2.276 bpb =  3.516 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.826 b/kc : 1211.09 MB/s : 2093.97
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    74,169 =  3.116 bpb =  2.567 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.103 b/kc : 906.80 MB/s : 1567.86
total bytes out : 1995503

About the files :


lzt24.literals are the literals left over after LZ-parsing (LZQ1) lzt24
  like all LZ literals they are high entropy and thus do terribly in Tunstall

monarch.tga.rrz_filtered.bmp is the image residual after filtering with my DPCM
  (it actually has a BMP header on it which is giving Tunstall a harder time
   than if I stripped the header)

paper1 & pic are standard

tabdir.tab is a text file of a dir listing with lots of tabs in it

For speed comparison, this is the Oodle Huffman on the same files :
loading file (0/5) : lzt24.literals
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,111,673 -> 1,036,540 =  7.459 bpb =  1.072 to 1
encode           : 8.405 millis, 13.07 c/b, rate= 132.26 mb/s
decode           : 1.721 millis, 2.68 c/b, rate= 645.81 mb/s
ooHuffman1,1036540,8405444,1721363
loading file (1/5) : monarch.tga.rrz_filtered.bmp
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,572,918 ->   586,839 =  2.985 bpb =  2.680 to 1
encode           : 7.570 millis, 8.32 c/b, rate= 207.80 mb/s
decode           : 2.348 millis, 2.58 c/b, rate= 669.94 mb/s
ooHuffman1,586839,7569562,2347859
loading file (2/5) : paper1
ooHuffman1 :     53,161 ->    33,427 =  5.030 bpb =  1.590 to 1
encode           : 0.268 millis, 8.70 c/b, rate= 198.67 mb/s
decode           : 0.080 millis, 2.60 c/b, rate= 665.07 mb/s
ooHuffman1,33427,267579,79933
loading file (3/5) : PIC
ooHuffman1 :    513,216 ->   106,994 =  1.668 bpb =  4.797 to 1
encode           : 2.405 millis, 8.10 c/b, rate= 213.41 mb/s
decode           : 0.758 millis, 2.55 c/b, rate= 677.32 mb/s
ooHuffman1,106994,2404854,757712
loading file (4/5) : tabdir.tab
ooHuffman1 :    190,428 ->    58,307 =  2.450 bpb =  3.266 to 1
encode           : 0.926 millis, 8.41 c/b, rate= 205.70 mb/s
decode           : 0.279 millis, 2.54 c/b, rate= 681.45 mb/s
ooHuffman1,58307,925742,279447

Tunstall is crazy fast. And of course that's a rather basic implementation of the decoder, I'm sure it could get faster.

Is there an application for plural Tunstall? I'm not sure. I tried it back in 2015 as an idea for literals in Mermaid/Selkie and abandoned it as not very relevant there. It works on low-entropy order-0 data (like image prediction residuals).

Of course if you wanted to test it against the state of the art you should consider SIMD Ryg RANS or GPU RANS. You should consider something like TANS with multiple symbols in the output table. You should consider merged-symbol codes, perhaps using escapes, perhaps runlen transforms. See for example "crblib/huffa.c" for a survey of Huffman ideas from 1996 (pre-runtransform, blocking MPS's, order-1-huff, multisymbol output, etc.)

Some commentary on Marlin and the code

WARNING : I'm a bit delirious with flu and lack of sleep at the moment so I'm not entirely sure what I'm writing. Apologies if it's a mess!

First, there's no point in making the T_ij transition matrix they talk about.

Back in "Understanding Marlin" you may recall I presented the algorithm thusly :


P_state(i) is given from a previous iteration and is constant

build dictionary using Marlin word model

we now have P(W) for all words in our dic

Use P(W) to compute new P_state(i)

optionally iterate a few times (~ 10 times) :
  use that P_state to compute adjusted P(W)
  use P(W) to compute new P_state

iteration dictionary building again (3-4 times)

in the paper (and code) they do it a bit differently. They compute the state transition matrix, which is :

T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)

this is the probability that if you started in state j you will wind up in state i

instead of iterating P_state -> P(W) , they iterate :

T <- T * T

and then P_state(i) = T_i0

I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)

Also for the record - I have yet to find a case where iterating to convergence here actually helps. If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :


laplacian distribution :

no iteration :

     0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

iterate 10X :

     0.67952             :  1,000,000 ->   503,688 =  4.030 bpb =  1.985 to 1 

You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though, heck 2 is probably fine.

4: 0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

3: 0.67952             :  1,000,000 ->   503,721 =  4.030 bpb =  1.985 to 1 

2: 0.67952             :  1,000,000 ->   503,817 =  4.031 bpb =  1.985 to 1 

The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second iteration does the first "Marlin" dictionary build.

In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.

There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be. Some is just over-C++ madness, some is failure to factor out constant expressions.


the word is :

        struct Word : public std::vector<uint8_t> {
};

and the dictionary is :

std::vector<Word> W;

 (with no reserves anywhere)
yeah that's a problem.  May I suggest :

struct Word {
    uint64 chars;
    int len;
};

also reserve() is good and calling size() in loops is bad.

This is ouchy :

        virtual double phi(const Word &) const = 0;

and this is even more ouchy :

        virtual std::vector<Word> split(const Word &) const = 0;

The P(w) function that's used frequently is bad.

The key part is :

    for (size_t t = 0; t<=w[0]; t++) {
        double p = PcurrState[t]; 
        p *= P[w[0]]/PnextState[t];
        ret += p;
    }

which you can easily factor out the common P[w[0]] from :

    int c0 = w[0];
    for (size_t t = 0; t<= c0; t++) {
        ret += PcurrState[t]/PnextState[t];
    }
    ret *= P[c0]

but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all!  They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :

    int c0 = w[0];
    ret = PcurrState_over_PnextState_partial_sum[c0];
    ret *= P[c0]

that also gives us a chance to state clearly (again) the difference between "Marlin" and naive plural Tunstall. It's that one table right there.


    int c0 = w[0];
    ret = 1.0;
    ret *= P[c0]

this is naive plural Tunstall. It comes down to a modifed probability table for the first letter in the word.

Recall that :


P_naive(W) = Prod[ chars c in W ] P(c)

simple P_word(W) = P_naive(W) * Ptail( num_children(W) )


Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".

They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partical nodes leave you in a state where some symbols are excluded).

They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.

This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.

In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.

There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.


Phhlurge.

I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.


two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)

r = 0.67952

H = 3.798339

my version of Marlin-probability plural Tunstall :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

Marlin reference code : 1,000,000 -> 507,412

naive plural Tunstall :

0.67952             :  1,000,000 ->   508,071 =  4.065 bpb =  1.968 to 1 

I presume the reason they compress worse than my version is because they make dictionaries for a handfull of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is mis-matching the actual distribution slightly.

old rants