8/08/2017

Oodle 2.5.4 - now with Windows UWP

Oodle 2.5.4 is out. There's now a separate Windows UWP SDK (separate from Win32).

Oodle for Windows UWP comes with only the "core" library that does memory to memory compression. The Oodle Core library uses no threads, has minimal dependencies (just the CRT), no funny business, making it very portable.

For full details see the Oodle Change Log

7/03/2017

Well Crap

I was cleaning my blog, deleting a bunch old posts, and accidentally deleted some I didn't want to. I'm going to repost a few, so if you have a subscription you may see odd old posts floating in because of that.

Unfortunately there's no blogger recover or trash can feature that I can just undo the delete. Frowny face. Also, while I can repost them, the comments are gone. And unfortunately it seems I can't post them to the same URL. The blogger post URL seems to be irrevocably marked with the post date, and even if I retro-date the post, it munges the URL to not be the same as the original.

ADD : I reposted a few of the ones I wanted to save. The new links are :

cbloom rants 09-27-08 On LZ and ACB
cbloom rants 10-05-08 Rant on New Arithmetic Coders
cbloom rants 10-06-08 Followup on the Russian Range Coder
cbloom rants 10-07-08 Random file stuff I've learned
cbloom rants 10-07-08 A little more on arithmetic coding ...
cbloom rants 10-08-08 Arithmetic coders throw away accuracy in lots of little places.
cbloom rants 10-10-08 On LZ Optimal Parsing
cbloom rants 10-10-08 On the Art of Good Arithmetic Coder Use

5/13/2017

How we used Exceptions on Oddworld : Stranger's Wrath

Apparently I talked about this before in my Game Tech talk in 2004, but I never wrote it on my bloggy blog, so here goes.

On Stranger we used exceptions as a last gasp measure during dev to try to keep the game running for our content creation team. It worked great and I think everyone should use a similar system in game development.

We did not ship with exceptions. They were only used during development. To be clear, what we did NOT do :


We did NOT :

Use C++ exceptions (we used SEH with __try , __throw , __except)

Try to do proper "exception-safe" C++ ala Herb Sutter
  (this is a bizarre and very tricky complex way of writing C++ that requires
  doing everything in a different way than the normal linear imperative code; it
  uses lots of swaps and temp objects)

Return every error with exceptions ; most errors were via return value

Try to unwind exceptions cleanly/robustly

Just kill the game on exceptions

Any error that we expected to happen, or could happen in ship, such as files not found, media errors, etc. were handled with return codes. The standard way of functions returning codes and the calling code checking it and handling it somehow.

Also, errors that we could detect and just fix immediately, but not return a code, we would just fix. So, like say you tried to create an Actor and the pref file describing that actor didn't exist, we'd just print an error (and automatically email it to dev@oddworld) and just not create that Actor. Hey, shit's wrong, but you can continue.

The principle is : don't block artist A from working on their stuff just because the programmers or some other artist checked in other broken stuff. If possible, just disable the broken stuff, because artist A can probably continue.

Say the guys working on particle systems keep checking in broken stuff that would crash the game or cause lots of errors - fine. The rest of the art team can still be syncing to new builds, and they will just see an error printed about "particle system XX failed ; disabled" and then they can continue working on their other stuff.

Blocking the art/design team (potentially a lot of people) for even 5 minutes while you try to roll things back or whatever to fix it is really a HUGE HUGE disaster and should never ever happen.

Any time your artists/designers have to get up and go get coffee/snacks in the kitchen because things are broken and they can't work - you massively fucked up and you should endeavor to never do that again.

But there are inevitably problems that we didn't just detect and disable the object (like the pref not found above). Maybe you just get a crash in some code due to an array ref out of bounds, or somewhere deep in the code you detect a bad fault that you can't fix.

So, as a catch of last measure we used exceptions. The way we did it was to wrap a try/catch around each game object creation & update, and if it caught an exception, that object was removed.


for each object O in the world list
{

__try
{
  O->Update();
}
__except
{
  show & email error about O throwing
  remove O from world list
  // don't delete the object O since it could still be pointed at by the world, or could be corrupt
}

}

Removing O prevents it from trying to Update again and thus spamming. We assume that once it throws, something is badly broken there and we'll just get rid of it.

As I said before, this is NOT trying to catch every error and handle it in a robust way. Obviously O may have been partially through with its Update and put the world in a weird state, it may not keep the game from crashing to just remove O, there are lots of possible problems and we don't try to handle them. It's "optimistic" in that sense that we sort of expect it to fail and cause problems, but if it ever does work, then awesome, great, it saved an artist from crashing. In practice it actually works fine 90% of the time.

We specifically do *not* want to be robust, because writing fully robust exception-safe code (that would have to roll back all the partial changes to the world if there was a throw somewhere through the update) is too onerous. The idea of this system is that it imposes *zero* extra work on programmers writing normal game code.

We could also manually __throw in some places where appropriate. The criterion for doing that is : not an error you should ever get in the final game, it's a spot where you can't return an error code or just show a failure measure and do some kind of default fallback. You also don't need to __throw if it's a spot where the CPU will throw an interrupt for you.

For example, places where we might manually __throw : inside a vector push_back if the malloc to extend failed. In an array out of bounds deref. In the smart-pointer deref if the pointer is null.

Places where we don't __throw : trying to normalize a zero vector or orthonormalize a degenerate frame. These are better to detect, log an error message, and just stuff in unitZ or something, because while that is broken it's a better way to break than the throw-catch mechanism which should only be used if there's no nicer way to stub-out the broken behavior.

Some (not particularly related) links :

cbloom rants 02-04-09 - Exceptions
cbloom rants 06-07-10 - Exceptions
cbloom rants 11-09-11 - Weird shite about Exceptions in Windows

4/18/2017

Context on Chroma Subsampling

Poked by Guetzli into thinking about pros/cons of chroma subsampling -

Chroma downsampling (as in standard JPEG YCbCr 420) is a big ugly hammer. It just throws away a ton of bits of information. That's pretty much always a bad thing in compression.

Now, preferring to throw away chroma detail before luma detail is valid and good. So if you are not chroma subsampling, then you need a perceptually optimizing encoder that knows to give fewer bits to high frequency chroma. You have much more control doing this through bit allocation than you do by just smashing the chroma planes. (for example, on blocks where there is almost no luma signal, then you might keep more of the high frequency chroma, on blocks with luma masking you can throw away lots of the high chroma AC bits - you just have way more precise control).

The chroma subsample is just a convenient way to get decent perceptual tradeoffs in a *non* optimizing encoder.

Chroma subsample is of course an R-D choice. It throws away signal, giving a certain disortion, in exchange it saves you some rate. This particular move is a good R-D choice at some tradeoff zone. Generally at high bit rate, it's a bad move. In most encoders, it becomes a good move at some lower quality. (in JPEG the tradeoff point is usually somewhere around 85). Measuring this D in RMSE is easy, but measuring it perceptually is rather tricky (when luma masking is present it may be near zero D perceptually, but without luma masking it can be much worse).

There are other issues.

In non-subsampled world, the approximate important weights for YCbCr are something like {0.7,0.13,0.17} . If you do subsample, then the chroma weights per-pixel need to go up by 4X , in which case they become pretty close to all being the same.

Many people mistakenly say the "eye does not see blue levels well". Not true, the eye can see the overall level of blue perfectly well, just as well as red or green. (eg for images where the whole thing is a single solid color). What the eye has is very poor spatial resolution in blue.

One of the issues is that chroma subsample is a nice win for *speed*. It gives you 4X less pixels to work on in two of your planes, half as many pixels overall. This means that subsampled chroma images are almost 2X faster to decode.

I used to be anti-chroma-subsample in my early years. For example in wavelets it's much neater to keep all your color planes full res, but send your bitplanes in [YUV] order. That way when you truncate the bottom bit planes, you drop the highest frequency chroma first. But then I realized that the 2X speedup from chroma subsample was nothing to sneeze at, and in general I'm now pro-chroma-subsample.

Another reminder : if you don't chroma subsample, then you may as well do a KLT on the color planes, rather than just use YUV or whatever. (maybe even KLT per region). The advantage of using standard YUV is that the chroma are known to be perceptually okay to downsample (you can't downsample the two minor components of the KLT transformed planes because you have no guarantee that they are of a type that the eye can't perceive high frequency data).

You can obviously construct adversarial images where the detail is all in chroma (the whole image has a constant luma). In that case chroma downsampling looks really bad and is perceptually a big mistake.

Chroma-from-luma in the decoder fixes all the color fringing that people associate with JPEG, but obviously it doesn't help in the adversarial cases where there is no luma detail to boost the chroma with.

I should also note while I'm at it that there are many codecs out there that just have bugs and/or mistakes in their downsamplers and/or upsamplers that cause this operation to produce way more error than it should.


ADD : Won sent me an email with an interesting idea I'd never thought about. Rather than just jumping between not downsampling chroma and doing 2x2 downsample, you could take more progressive steps, such as going to a checkerboard of chroma (half as many pixels) or a Bayer pattern. It's probably too complex to support these well and make good encoder decisions in practice, but they're interesting in theory.

4/15/2017

Tunstall in an arithmetic way

You can think of the Tunstall dictionary build in an arithmetic codery way.

Your dictionary is like the probability interval. You start with a full interval [0,1]. You put in all the single-character words, with P(c) for each char, so that all sums to one. You iteratively split the largest interval, and subdivide the range.


In binary the "split" operation is :

W -> W0 , W1

P(W) = P(W)*P(0) + P(W)*P(1)

In N-ary the split is :

W -> Wa , W[b+]

P(W) = P(W)*P(a) + P(w)*Ptail(b)

W[b+] means just the word W, but in the state "b+" (aka state "1"), following sym must be >= b
(P(w) here means P_naive(w), just the char probability product)

W[b+] -> Wb , W[c+]

P(w)*Ptail(b) = P(w)*P(b) + P(w)*Ptail(c)

(recall Ptail(c) = tail cumprob, sum of P(char >= c))
(Ptail(a) = 1.0)

So we can draw a picture like an arithmetic coder does, spliting ranges and specifying cumprob intervals to choose our string :

You just keep splitting the largest interval until you have a # of intervals = to the desired number of codes. (8 here for 3-bit Tunstall).

At that point, you still have an arithmetic coder - the intervals are fractional sizes and are correctly sized to the probabilities of each code word. eg. the interval for 01 is P(0)*P(1).

In the final step, each interval is just mapped to a dictionary codeword index. This gives each codeword an equal 1/|dictionary_size| probability interval, which in general is not quite right.

This is where the coding inefficiency comes from - it's the difference between the correct interval sizes and the size that we snap them to.

(ADD : in the drawing I wrote "snap to powers of 2" ; that's not the best way of describing that; they're just snapping to the subsequent i/|dictionary_size|. In this case with dictionary_size = 8 those points are {0,1/8,1/4,3/8,1/2,..} which is why I was thinking about powers of 2 intervals.)

4/14/2017

Classical Tunstall

Before continuing with Marlin, I want to take a brief digression to review "classical" or "true" Tunstall.

The classical Tunstall algorithm constructs VTF (variable to fixed) codes for binary memoryless (order-0) sources. It constructs the optimal code.

You start with dictionary = { "0","1" } , the single bit binary strings. (or dictionary = the null string if you prefer)

You then split one word W in the dictionary to make two new words "W0" and "W1" ; when you split W, it is removed since all possible following symbols now have words in the dictionary.

The algorithm is simple and iterative :


while dic size < desired
{
find best word W to split
remove W
add W0 and W1
}

each step increments dictionary size by +1

What is the best word to split?

Our goal is to maximize average code length :


A = Sum[words] P(W) * L(W)

under the split operation, what happens to A ?

W -> W0, W1

delta(A) = P(W0) * L(W0) + P(W1) * L(W1) - P(W) * L(W)

P(W0) = P(W)*P(0)
P(W1) = P(W)*P(1)
L(W0) = L(W)+1

.. simplify ..

delta(A) = P(W)

so to get the best gain of A, you just split the word with maximum probability. Note of course this is just greedy optimization of A and that might not be the true optimum, but in fact it is and the proof is pretty neat but I won't do it here.

You can naively build the optimal Tunstall code in NlogN time with a heap, or slightly more cleverly you can use two linear queues for left and right children and do it in O(N) time.

Easy peasy, nice and neat. But this doesn't work the same way for the large-alphabet scenario.


Now onto something that is a bit messy that I haven't figured out.

For "plural Tunstall" we aren't considering adding all children, we're only considering adding the next child.

A "split" operation is like :


start with word W with no children
W ends in state 0 (all chars >= 0 are possible)

the next child of W to consider is "W0"
(symbols sorted so most probable is first)

if we add "W0" then W goes to state 1 (only chars >= 1 possible)

W_S0 -> "W0" + W_S1

W_S1 -> "W1" + W_S2

etc.

again, we want to maximize A, the average codelen. What is delta(A) under a split operation?

delta(A) = P("W0") * L("W0") + P(W_S1) * L(W) - P(W_S0) * L(W)

delta(A) = P("W0") + (P("W0") + P(W_S1) - P(W_S0)) * L(W)

P("W0") + P(W_S1) - P(W_S0) = 0

so

delta(A) = P("W0") 

it seems like in plural Tunstall you should "split" the word that has maximum P("W0") ; that is maximize the probability of the word you *create* not the one you *remove*. This difference arises from the fact that we are only making one child of longer length - the other "child" in the pseudo-split here is actually the same parent node again, just with a reduced exit state.

In practice that doesn't seem to be so. I experimentally measured that choosing to split the word with maximum P(W) is better than splitting the word with maximum P(child).

I'm not sure what's going wrong with this analysis. In the Marlin code they just split the word with maximum P(W) by analogy to true Tunstall, which I'm not convinced is well justified in plural Tunstall.


While I'm bringing up mysteries, I tried optimal-parsing plural Tunstall. Obviously with "true tunstall" or any prefix-free code that's silly, the greedy parse is the only parse. But with plural Tunstall, you might have "aa" and also "aaa" in the tree. In this scenario, by analogy to LZ, the greedy parse is usually imperfect because it is sometimes better to take a shorter match now to get a longer one on the next work. So maybe some kind of lazy , or heck full optimal parse. (the trivial LZSS backward parse works well here).

Result : optimal-parsed plural Tunstall is identical to greedy. Exactly, so it must be provable. I don't see an easy way to show that the greedy parse is optimal in the plural case. Is it true for all plural dictionaries? (I doubt it) What are the conditions on the dictionary that guarantee it?

I think that this is because for any string in the dictionary, all shorter substrings of that string are in the dictionary too. This makes optimal parsing useless. But I think that property is a coincidence/bug of how Marlin and I did the dictionary construction, which brings me to :


Marlin's dictionary construction method and the one I was using before, which is slightly different, both have the property that they never remove parent nodes when they make children. I believe this is wrong but I haven't been able to make it work a different way.

The case goes like this :


you have word W in the dictionary with no children

you have following chars a,b,c,d.  a and b are very probable, c and d are very rare.

P(W) initially = P_init(W)

you add child 'a' ; W -> Wa , W(b+)
P(W) -= P(Wa)
add child 'b'
W(b+) -> Wb , W(c+)
P(W) -= P(Wb)

now the word W in the dictionary has
P(W) = P(Wc) + P(Wd)

these are quite rare, so P(W) now is very small

W is no longer a desirable dictionary entry.

We got all the usefulness of W out in Wa and Wb, we don't want to keep W in the dictionary just to be able to code it with rare following c's and d's - we'd like to now remove W.

In particular, if the current P(W) of the parent word is now lower than a child we could make somewhere else by splitting, remove W and split the other node. Or something like that - here's where I haven't quite figured out how to make this idea work in practice.

So I believe that both Marlin and my code are NOT making optimal general VTF plural dictionaries, they are making them under the (unnecessary) constraint of the shorter-substring-is-present property.

Tunstall vs Marlin Results Part 1

For some reason I keep trying to type "marling". Anyway...

Geometric distribution , P(n) = r^n

I am comparing "Marlin" = plural Tunstall with P_state word probability model vs. naive plural Tunstall (P_word = P_naive). In both cases 8-byte output words, 12-bit codes.

Marlin :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.990             :  1,000,000 -> 1,231,686 =  9.853 bpb =  0.812 to 1 
decode_time2 : seconds:0.0018 ticks per: 3.064 b/kc : 326.42 MB/s : 564.38
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.985             :  1,000,000 -> 1,126,068 =  9.009 bpb =  0.888 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.840 b/kc : 352.15 MB/s : 608.87
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.978             :  1,000,000 ->   990,336 =  7.923 bpb =  1.010 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.497 b/kc : 400.54 MB/s : 692.53
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.967             :  1,000,000 ->   862,968 =  6.904 bpb =  1.159 to 1 
decode_time2 : seconds:0.0013 ticks per: 2.227 b/kc : 449.08 MB/s : 776.45
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.950             :  1,000,000 ->   779,445 =  6.236 bpb =  1.283 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.021 b/kc : 494.83 MB/s : 855.57
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.927             :  1,000,000 ->   701,049 =  5.608 bpb =  1.426 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.821 b/kc : 549.09 MB/s : 949.37
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.892             :  1,000,000 ->   611,238 =  4.890 bpb =  1.636 to 1 
decode_time2 : seconds:0.0009 ticks per: 1.577 b/kc : 633.93 MB/s : 1096.07
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.773             :  1,000,000 ->   450,585 =  3.605 bpb =  2.219 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.237 b/kc : 808.48 MB/s : 1397.86
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.680             :  1,000,000 ->   373,197 =  2.986 bpb =  2.680 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.053 b/kc : 950.07 MB/s : 1642.67
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.560             :  1,000,000 ->   298,908 =  2.391 bpb =  3.346 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.891 b/kc : 1122.53 MB/s : 1940.85

vs. plural Tunstall :

filelen = 1000000
H = 7.658248
sym_count = 256
r=0.99000           :  1,000,000 -> 1,239,435 =  9.915 bpb =  0.807 to 1 
decode_time2 : seconds:0.0017 ticks per: 2.929 b/kc : 341.46 MB/s : 590.39
filelen = 1000000
H = 7.345420
sym_count = 256
r=0.98504           :  1,000,000 -> 1,130,025 =  9.040 bpb =  0.885 to 1 
decode_time2 : seconds:0.0016 ticks per: 2.814 b/kc : 355.36 MB/s : 614.41
filelen = 1000000
H = 6.878983
sym_count = 256
r=0.97764           :  1,000,000 ->   990,855 =  7.927 bpb =  1.009 to 1 
decode_time2 : seconds:0.0014 ticks per: 2.416 b/kc : 413.96 MB/s : 715.73
filelen = 1000000
H = 6.323152
sym_count = 256
r=0.96665           :  1,000,000 ->   861,900 =  6.895 bpb =  1.160 to 1 
decode_time2 : seconds:0.0012 ticks per: 2.096 b/kc : 477.19 MB/s : 825.07
filelen = 1000000
H = 5.741045
sym_count = 226
r=0.95039           :  1,000,000 ->   782,118 =  6.257 bpb =  1.279 to 1 
decode_time2 : seconds:0.0011 ticks per: 1.898 b/kc : 526.96 MB/s : 911.12
filelen = 1000000
H = 5.155050
sym_count = 150
r=0.92652           :  1,000,000 ->   704,241 =  5.634 bpb =  1.420 to 1 
decode_time2 : seconds:0.0010 ticks per: 1.681 b/kc : 594.73 MB/s : 1028.29
filelen = 1000000
H = 4.572028
sym_count = 109
r=0.89183           :  1,000,000 ->   614,061 =  4.912 bpb =  1.629 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.457 b/kc : 686.27 MB/s : 1186.57
filelen = 1000000
H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 
decode_time2 : seconds:0.0007 ticks per: 1.254 b/kc : 797.33 MB/s : 1378.58
filelen = 1000000
H = 3.405910
sym_count = 47
r=0.77292           :  1,000,000 ->   454,059 =  3.632 bpb =  2.202 to 1 
decode_time2 : seconds:0.0006 ticks per: 1.078 b/kc : 928.04 MB/s : 1604.58
filelen = 1000000
H = 2.823256
sym_count = 36
r=0.67952           :  1,000,000 ->   377,775 =  3.022 bpb =  2.647 to 1 
decode_time2 : seconds:0.0005 ticks per: 0.935 b/kc : 1069.85 MB/s : 1849.77
filelen = 1000000
H = 2.250632
sym_count = 23
r=0.56015           :  1,000,000 ->   304,887 =  2.439 bpb =  3.280 to 1 
decode_time2 : seconds:0.0004 ticks per: 0.724 b/kc : 1381.21 MB/s : 2388.11

Very very small difference. eg :


plural Tunstall :

H = 3.986386
sym_count = 78
r=0.84222           :  1,000,000 ->   534,300 =  4.274 bpb =  1.872 to 1 

Marlin :

H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51

Yes the Marlin word probability estimator helps a little bit, but it's not massive.

I'm not surprised but a bit sad to say that once again the Marlin paper compares to ridiculous straw men and doesn't compare to the most obvious, naive, and well known (see Savari for example, or Yamamoto & Yokoo) similar alternative - just doing plural Tunstall/VTF without the Marlin word probability model.

Entropy above 4 or so is terrible for 12-bit VTF codes.

The Marlin paper uses a "percent efficiency" scale which I find rather misleading. For example, this :


H = 3.986386
sym_count = 78
r=0.842             :  1,000,000 ->   529,743 =  4.238 bpb =  1.888 to 1 

is what I would consider pretty poor entropy coding. Entropy of 3.98 -> 4.23 bpb is way off. But as a "percent efficiency" it's 94% , which is really high on their graphs.

The more standard and IMO useful way to show this is a delta of your output bits minus the entropy, eg.


excess = 4.238 - 3.986 = 0.252

half a bit per byte wasted. A true arithmetic coder has an excess around 0.001 bpb typically. The worst you can ever do is an excess of 1.0 which occurs in any integer-bit entroy coder as the probability of the MPS goes towards 1.0

Part of my hope / curiosity in investigating this was wondering whether the Marlin procedure would help at all with the way Tunstall VTF codes really collapse in the H > 4 range , and the answer is - no , it doesn't help with that problem at all.

Anyway, on to more results.

Tunstall vs Marlin Results Part 2

Testing on some real files.

Marlin :

loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,286,166 =  9.256 bpb =  0.864 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.467 b/kc : 288.41 MB/s : 498.66
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   618,447 =  3.145 bpb =  2.543 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.281 b/kc : 780.92 MB/s : 1350.21
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,763 =  5.382 bpb =  1.486 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.988 b/kc : 503.06 MB/s : 869.78
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   140,391 =  2.188 bpb =  3.656 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.800 b/kc : 1250.71 MB/s : 2162.48
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    68,511 =  2.878 bpb =  2.780 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.031 b/kc : 969.81 MB/s : 1676.80
total bytes out : 1974785
naive plural Tunstall :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,290,015 =  9.283 bpb =  0.862 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.443 b/kc : 290.45 MB/s : 502.18
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   627,747 =  3.193 bpb =  2.506 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.284 b/kc : 779.08 MB/s : 1347.03
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,934 =  5.408 bpb =  1.479 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.998 b/kc : 500.61 MB/s : 865.56
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   145,980 =  2.276 bpb =  3.516 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.826 b/kc : 1211.09 MB/s : 2093.97
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    74,169 =  3.116 bpb =  2.567 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.103 b/kc : 906.80 MB/s : 1567.86
total bytes out : 1995503

About the files :


lzt24.literals are the literals left over after LZ-parsing (LZQ1) lzt24
  like all LZ literals they are high entropy and thus do terribly in Tunstall

monarch.tga.rrz_filtered.bmp is the image residual after filtering with my DPCM
  (it actually has a BMP header on it which is giving Tunstall a harder time
   than if I stripped the header)

paper1 & pic are standard

tabdir.tab is a text file of a dir listing with lots of tabs in it

For speed comparison, this is the Oodle Huffman on the same files :
loading file (0/5) : lzt24.literals
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,111,673 -> 1,036,540 =  7.459 bpb =  1.072 to 1
encode           : 8.405 millis, 13.07 c/b, rate= 132.26 mb/s
decode           : 1.721 millis, 2.68 c/b, rate= 645.81 mb/s
ooHuffman1,1036540,8405444,1721363
loading file (1/5) : monarch.tga.rrz_filtered.bmp
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,572,918 ->   586,839 =  2.985 bpb =  2.680 to 1
encode           : 7.570 millis, 8.32 c/b, rate= 207.80 mb/s
decode           : 2.348 millis, 2.58 c/b, rate= 669.94 mb/s
ooHuffman1,586839,7569562,2347859
loading file (2/5) : paper1
ooHuffman1 :     53,161 ->    33,427 =  5.030 bpb =  1.590 to 1
encode           : 0.268 millis, 8.70 c/b, rate= 198.67 mb/s
decode           : 0.080 millis, 2.60 c/b, rate= 665.07 mb/s
ooHuffman1,33427,267579,79933
loading file (3/5) : PIC
ooHuffman1 :    513,216 ->   106,994 =  1.668 bpb =  4.797 to 1
encode           : 2.405 millis, 8.10 c/b, rate= 213.41 mb/s
decode           : 0.758 millis, 2.55 c/b, rate= 677.32 mb/s
ooHuffman1,106994,2404854,757712
loading file (4/5) : tabdir.tab
ooHuffman1 :    190,428 ->    58,307 =  2.450 bpb =  3.266 to 1
encode           : 0.926 millis, 8.41 c/b, rate= 205.70 mb/s
decode           : 0.279 millis, 2.54 c/b, rate= 681.45 mb/s
ooHuffman1,58307,925742,279447

Tunstall is crazy fast. And of course that's a rather basic implementation of the decoder, I'm sure it could get faster.

Is there an application for plural Tunstall? I'm not sure. I tried it back in 2015 as an idea for literals in Mermaid/Selkie and abandoned it as not very relevant there. It works on low-entropy order-0 data (like image prediction residuals).

Of course if you wanted to test it against the state of the art you should consider SIMD Ryg RANS or GPU RANS. You should consider something like TANS with multiple symbols in the output table. You should consider merged-symbol codes, perhaps using escapes, perhaps runlen transforms. See for example "crblib/huffa.c" for a survey of Huffman ideas from 1996 (pre-runtransform, blocking MPS's, order-1-huff, multisymbol output, etc.)

Some commentary on Marlin and the code

WARNING : I'm a bit delirious with flu and lack of sleep at the moment so I'm not entirely sure what I'm writing. Apologies if it's a mess!

First, there's no point in making the T_ij transition matrix they talk about.

Back in "Understanding Marlin" you may recall I presented the algorithm thusly :


P_state(i) is given from a previous iteration and is constant

build dictionary using Marlin word model

we now have P(W) for all words in our dic

Use P(W) to compute new P_state(i)

optionally iterate a few times (~ 10 times) :
  use that P_state to compute adjusted P(W)
  use P(W) to compute new P_state

iteration dictionary building again (3-4 times)

in the paper (and code) they do it a bit differently. They compute the state transition matrix, which is :

T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)

this is the probability that if you started in state j you will wind up in state i

instead of iterating P_state -> P(W) , they iterate :

T <- T * T

and then P_state(i) = T_i0

I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)

Also for the record - I have yet to find a case where iterating to convergence here actually helps. If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :


laplacian distribution :

no iteration :

     0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

iterate 10X :

     0.67952             :  1,000,000 ->   503,688 =  4.030 bpb =  1.985 to 1 

You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though, heck 2 is probably fine.

4: 0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

3: 0.67952             :  1,000,000 ->   503,721 =  4.030 bpb =  1.985 to 1 

2: 0.67952             :  1,000,000 ->   503,817 =  4.031 bpb =  1.985 to 1 

The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second iteration does the first "Marlin" dictionary build.

In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.

There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be. Some is just over-C++ madness, some is failure to factor out constant expressions.


the word is :

        struct Word : public std::vector<uint8_t> {
};

and the dictionary is :

std::vector<Word> W;

 (with no reserves anywhere)
yeah that's a problem.  May I suggest :

struct Word {
    uint64 chars;
    int len;
};

also reserve() is good and calling size() in loops is bad.

This is ouchy :

        virtual double phi(const Word &) const = 0;

and this is even more ouchy :

        virtual std::vector<Word> split(const Word &) const = 0;

The P(w) function that's used frequently is bad.

The key part is :

    for (size_t t = 0; t<=w[0]; t++) {
        double p = PcurrState[t]; 
        p *= P[w[0]]/PnextState[t];
        ret += p;
    }

which you can easily factor out the common P[w[0]] from :

    int c0 = w[0];
    for (size_t t = 0; t<= c0; t++) {
        ret += PcurrState[t]/PnextState[t];
    }
    ret *= P[c0]

but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all!  They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :

    int c0 = w[0];
    ret = PcurrState_over_PnextState_partial_sum[c0];
    ret *= P[c0]

that also gives us a chance to state clearly (again) the difference between "Marlin" and naive plural Tunstall. It's that one table right there.


    int c0 = w[0];
    ret = 1.0;
    ret *= P[c0]

this is naive plural Tunstall. It comes down to a modifed probability table for the first letter in the word.

Recall that :


P_naive(W) = Prod[ chars c in W ] P(c)

simple P_word(W) = P_naive(W) * Ptail( num_children(W) )


Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".

They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partical nodes leave you in a state where some symbols are excluded).

They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.

This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.

In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.

There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.


Phhlurge.

I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.


two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)

r = 0.67952

H = 3.798339

my version of Marlin-probability plural Tunstall :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

Marlin reference code : 1,000,000 -> 507,412

naive plural Tunstall :

0.67952             :  1,000,000 ->   508,071 =  4.065 bpb =  1.968 to 1 

I presume the reason they compress worse than my version is because they make dictionaries for a handfull of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is mis-matching the actual distribution slightly.

Marlin Summary

Sum up with simple take-away.

Plural Tunstall VTF coding in general is extremely fast to decode. It works best with 12-bit tables (must stay in L1), which means it only works well at entropy <= 4 bpb.

Marlin introduces an improved word probability model that captures the way the first letter probability is skewed by the longer-string exclusion. (this is just like the LAM exclusion in LZ)

That is, after coding with word W, subsequent chars that exist in the dictionary (Wa,Wb) cannot then be the next character to start a word, so the probably of the first char of the word being >= c is increased.

The Marlin word probability model improves compression by 1-4% over naive plural Tunstall.

The simplest implementation of the Marlin probability adjustment is like this :


P_naive(W) = Prod[chars c in W] P('c')

P_word(W) = P_scale_first_char( W[0] ) * P_naive(W) * Ptail( num_children(W) )

(where Ptail is the tail-cumulative-proability :
Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
)

(instead of scaling by Ptail you can subtract off the child node probabilities as you make them)


on the first iteration of dictionary building, set P_scale_first_char() = 1.0

this is "naive plural Tunstall"

after building the dictionary, you now have a set of words and P(W) for each
compute :

P_state(i) = Sum[ words W with i children ] P(W)

(state i means only chars >= i can follow)

(iterating P_state -> P(W) a few times here is optional but unnecessary)

P_scale_first_char(c) = Sum[ i <= c ] P_state(i) / P_tail(i)

(P_scale_first_char = P_state_tail)

then repeat dic building one more time
(optionally repeat more but once seems fine)

And that's it!


What do these values actually look like? I thought it might be illuminating to dump them. This is on a pretty skewed file so the effect is large, the larger the MPS probability the bigger the effect.


R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293

P of chars =

        [0] 0.46841475525106835 double
        [1] 0.11553621994280693 double
        [2] 0.11508546535801611 double
        [3] 0.059216055763873253    double
        [4] 0.058911526220693004    double
        [5] 0.036597584870921435    double
        [6] 0.036475518749229136    double
        [7] 0.018035269480036465    double
        [8] 0.017757441900976400    double
        [9] 0.010309501194595012    double
        [10]    0.0097379520102128646   double

P_state =

        [0] 0.62183816678155190 double
        [1] 0.15374679894466811 double
        [2] 0.032874234239563829    double
        [3] 0.063794018822874776    double
        [4] 0.026001940955215786    double
        [5] 0.011274295764837820    double
        [6] 0.028098911350290755    double
        [7] 0.012986055279597277    double
        [8] 0.0013397794289329405   double
        .. goes to zero pretty fast ..

P_scale_first_char =

        [0] 0.62183816678155202 double
        [1] 0.91106139196208336 double
        [2] 0.99007668169839014 double
        [3] 1.2020426052137052  double
        [4] 1.3096008656256881  double
        [5] 1.3712643080249607  double
        [6] 1.5634088663186672  double
        [7] 1.6817189544649160  double
        [8] 1.6963250203077103  double
        [9] 1.9281295477496172  double
        [10]    1.9418127462353780  double
        [11]    2.0234438458996773  double
        [12]    2.0540542047979415  double
        [13]    2.1488636999462676  double
        [14]    2.2798060244386895  double
        [15]    2.2798060244386895  double
        [16]    2.2798060244386895  double
        [17]    2.3660205062350039  double
        [18]    2.3840557757150402  double
        [19]    2.3840557757150402  double
        [20]    2.3840557757150402  double
        [21]    2.4066061022686838  double
        [22]    2.5584098628550294  double
        [23]    2.5584098628550294  double
        [24]    2.6690752676624321  double
        .. gradually goes up to 4.14

estimate real first char probability 
= P(c) * P_scale_first_char(c) =

        [0] 0.29127817269875372 double
        [1] 0.10526058936313110 double
        [2] 0.11394343565337962 double
        [3] 0.071180221940886246    double
        [4] 0.077150585733949978    double
        [5] 0.050184961893408854    double
        [6] 0.057026149416117611    double
        [7] 0.030330254533459933    double

the effect is to reduce the skewing of the probabilities in the post-word alphabet.


I think this problem space is not fully explored yet and I look forward to seeing more work in this domain in the future.

I'm not convinced that the dictionary building scheme here is optimal. Is there an optimal plural VTF dictionary?

I think maybe there's some space between something like RANS and Tunstall. Tunstall inputs blocks of N bits and doesn't carry any state between them. RANS pulls blocks of N bits and carries full state of unused value range between them (which is therefore a lot slower because of dependency chains). Maybe there's something in between?

Another way to think about this Marlin ending state issue is a bit like an arithmetic coder. When you send the index of a word that has children, you have not only sent that word, you've also sent information about the *next* word. That is, some fractional bits of information that you'd like to carry forward.

Say you have {W,Wa,Wb} in the dictionary and you send a W. You've also sent that the next word start with >= c. That's like saying your arithmetic coder cumprob is >= (Pa+Pb). You've refined the range of the next interval.

This could be done with Tunstall by doing the multi-table method (separate tables for state 0,1, and 2+), but unfortunately that doesn't fit in L1.

BTW you can think of Tunstall in an arithmetic codey way. Maybe I'll draw a picture because it's easier to show that way...

4/13/2017

Tunstall Context

Looking at Marlin today, some reminders to self/all about context :

Tunstall was originally designed with binary alphabets in mind; variable to fixed on *binary*. In that context, doing full child trees (so dictionaries are full prefix trees and encoding is unique) makes sense. As soon as you do variable-to-fixed (hence VTF) on large alphabets, "plural" trees are obviously better and have been written about much in the past. With plural trees, encoding is not unique ("a" and "ab" are both in the dictionary).

There's a big under-specified distinction between VTF dictionaries that model higher level correlation and those that don't. eg. does P("ab") = P(a)*P(b) or is there order-1 correlation?

I looked at Tunstall codes a while ago (TR "failed experiment : Tunstall Codes" 12/4/2015). I didn't make this clear but in my experiment I was looking at a specific scenario :

symbols are assumed to have only order-0 entropy
(eg. symbol probabilities describe their full statistics)

encoder transmits symbol probabilities (or the dictionary of words)
but there are other possibilities that some of the literature addresses. There are at lot of papers on "improved Tunstall" that use the order-N probabilities (the true N-gram counts for the words rather than multiplying the probability of each character). Whether or not this works in practice depends on context, eg. on LZ literals the characters are non-adjacent in the source so this might not make sense.

There's a fundamental limitation with Tunstall in practice and a very narrow window where it makes sense.

On current chips, 12-bit words is ideal (because 4096 dwords = 16k = fits in L1). 16 bit can sometimes give much better compression, but falling out of L1 is a disaster for speed.

12-bit VTF words works great if the entropy of the source is <= 5 bits or so. As it goes over 5, you have too many bigrams that don't pack well into 12, and the compression ratio starts to suffer badly (and decode speed suffers a bit).

I was investigating Tunstall in the case of normal LZ literals, where entropy is always in the 6-8 bpc range (because any more compressability has been removed by the string-match portion of the LZ). In that case Tunstall just doesn't work.

Tunstall is best when entropy <= 3 bits or so. Not only do you get compression closer to entropy, you also get more decode speed.

Now for context, that's a bit of a weird place to just do entropy coding. Normally in low-entropy scenarios, you would have some kind of coder before just tossing entropy coding at it. eg. take DCT residuals, or any image residual situation. You will have lots of 0's and 1's so it looks like a very low entropy scenario for order-0 entropy, but typically you would remove that by doing RLE or something else so that the alphabet you hand to the entropy coder is higher entropy. (eg. JPEG does RLE of 0's and EOB).

Even if you did just entropy code on a low-entropy source, you might instead use a kind of cascaded coder. Again assuming something like prediction residuals where the 0's and 1's are very common, you might make a two-stage alphabet that's something like :


alphabet 1 : {0,1,2,3+}
alphabet 2 : {3,4,...255}

Then with alphabet 1 you could pack 4 symbols per byte and do normal Huffman. Obviously a Huffman decode is a little slower than Tunstall, but you're getting always 4 symbols per decode so output len is not variable, and compression ratio is better.

Tunstall for LZ literals might be interesting in a very fast LZ with MML 8 or so. (eg. things like ZStd level 1, which is also where multi-symbol-output Huff works well).

Point is - the application window here is pretty narrow, and there are other techniques that also address the same problem.

Understanding Marlin

I will be using "Tunstall" to mean any variable to fixed coder. I am considering large alphabet (eg. 8-bit input alphabet), "plural" (eg. non-prefix-free dictionary). I am also considering only modeling order-0 statistics.

I label the symbols 'a','b','c' from most probable to least probable. I will use single quotes for symbols, and double quotes for words in the dictionary. So :


P('a') , P('b'), etc. are given

P('a') >= P('b') >= P('c') are ordered

I will use the term "Marlin" to describe the way they estimate the probability of dictionary words. (everything else in the paper is either obvious or well known (eg. the way the decoder works), so the innovation and interesting part is the word probability estimation, so that is what I will call "Marlin" , the rest is just "Tunstall").

Ok. To build a Tunstall dictionary your goal is to maximize the average input length, which is :


average input length = Sum[words] { P(word) * L(word) }

since the output length is fixed, maximizing the input length maximizes compression ratio.

In the original Tunstall algorithm on binary input alphabet, this is easily optimized by splitting the most probable word, and adding its two children. This can be done in linear time using two queues for left and right (0 and 1) children.

The Marlin algorithm is all about estimating P(word).

The first naive estimate (what I did in my 12/4/2015 report) is just to multiply the character probabilities :


P(word) = Prod[c in word] P('c')

that is

P("xyz") = P('x') * P('y') * P('z')

but that's obviously not right. The reason is that the existence of words in the dictionary affects the probability of other words.

In particular, the general trend is that the dictionary will accumulate words with the most probable characters (a,b,c) which will make the effective probability of the other letters in the remainder greater.

For example :


Start the dictionary with the 256 single-letter words

At this point the naive probabilities are exact, that is :

P("a") (the word "a") = P('a') (the letter 'a')

Now add the most probable bigram "aa" to the dictionary.

We now have a 257-word dictionary.  What are the probabilities when we code from it ?

Some of the occurrances of the letter 'a' will now be coded with the word "aa"

That means P("a") in the dictionary is now LESS than P('a')

Now add the next most probable words, "ab" and "ba"

The probability of "a" goes down more, as does P("b")

Now if we consider the choice of what word to add next - is it "ac" or "bb" ?

The fact that some of the probability of those letters has been used by words in the dictionary affects
our estimate, which affects our choice of how to build the dictionary.

so that's the intuition of the problem, and the Marlin algorithm is one way to solve it.

Let's do it intuitively again in a bit more detail.

There are two issues : the way the probability of a shorter word is reduced by the presence of longer words, and the way the probability of raw characters that start words is changed by the probability of them coming after words in the dictionary.


Say you have word W in your dictionary

and also some of the most probable children.

W, Wa, Wb are in dictionary

Wc, Wd are not

We'll say word "W" has 2 children (Wa and Wb).

So word "W" will only be used from the dictionary if the children are NOT a or b
(since in that case the longer word would be used).

So if you have seen word "W" so far, to use word W, the next character must be terminal,
eg. one that doesn't correspond to another child.

So the probability of word W should be adjusted by :

P(W) *= P(c) + P(d)

Because we are dealing with sorted probability alphabets, we can describe the child set with just one integer to indicate which ones are in the dictionary. In Marlin terminology this is c(w), and corresponds to the state Si.

If we make the tail cumulative probability sum :


Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
Ptail(255) = P(255)
Ptail(254) = P(254) + P(255)
Ptail(0) = sum of all P('c')  = 1.0

then the adjustment is :

P(W) *= P(c) + P(d)
P(W) *= Ptail('c')
P(W) *= Ptail(first char that's not a child)

P(W) *= Ptail( num_children(W) )
(I'm zero-indexing, so no +1 here as in the Marlin paper, they 1-base-index)

ADD : I realized there's a simpler way to think about this. When you add a child word, you simply remove that probability from the parent. That is :

let P_init(W) be the initial probability of word W
when it is first added to the dictionary and has no children

track running estimate of P(W)

when you add child 'x' making word "Wx"

The child word probability is initialized from the parent's whole probability :

P_init(Wx) = P_init(W) * P('x')

And remove that from the running P(W) :

P(W) -= P_init(Wx)

That is you just make P(W) the probability of word W, excluding children that exist.
Once you add all children, P(W) will be zero and the word is useless.

Okay, so that does the first issue (probability of words reduced by the presence of longer words). Now the next issue. Consider the same simple example case first :


W,Wa,Wb are in dictionary, Wc,Wd are not
(no longer children of W are either)

Say you reach node "Wa"

there are no children of "Wa" in the dictionary, so all following characters are equally likely

This means that starting the next word, the character probabilities are equal to their original true probabilities

But say you reach node "W" and leave via 'c' or 'd'

In that case the next character must be 'c' or 'd' , it can never be 'a' or 'b'

So the probability of the next character being 'c' goes up by the probability of using word "W" and the
probability of being a 'c' after "W" , that's :

estimate_P('c') +=  P("W") * P('c') / ( P('c') + P('d') )

or

estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) )

now you have to do this for all paths through the dictionary. But all ways to exit with a certain child count are similar, so you can merge those paths to reduce the work. All words with 2 children will be in the same exit probability state ('a' and 'b' can't occur but chars >= 'c' can).

This is the Marlin state S_i. S_i means that character is >= i. It happens because you left the tree with a word that had i children.

When you see character 2 that can happen from state 0 or 1 or 2 but never states >= 3.


for estimating probability of word W

W can only occur in states where the first character W[0] is possible

that is state S_i with i <= W[0]

When character W[0] does occur in state S_i , the probability of that character is effectively higher,
because we know that chars < i can't occur.

Instead of just being P(char W[0]) , it's divided by Ptail(i)

(as in estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) ) above)


So :

P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

the probability of W is the sum of the probability of states it can start from
(recall states = certain terminal character sets)
times the probability of W given that state

let

P_naive(W) = Product[ chars c in W ] P(char 'c')

be the naive word probability, then :

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )


is what we need.  This is equation (1) in the Marlin paper.

The first term increases the probability of W for higher chars, because we know the more probable lower chars can't occur in this state (because they found longer words in the dictionary)

The last term decreases the probability of W because it will only be used when the following character doesn't cause a longer word in the dictionary to be used

Now of course there's a problem. This P(W) probability estimate for words requires the probability of starting the word from state S_i, which we don't know. If we had the P(W) then the P of states is just :


P(S_i) = Sum[ words W that have i children ] * P(W)

so to solve this you can just iterate. Initialize the P(S_i) to some guess; the Marlin code just does :

P(state 0) = 1.0
P(all others) = 0.0

(recall state 0 is the state where all chars are possible, no exclusions, so
characters just have their original order-0 probability)

feeds that in to get P(W), feeds that to update P(S_i), and repeats to convergence.

To build the dictionary you simply find the word W with highest P(W) and split it (adding its next most probable child and increasing its child count by 1).

The Marlin code does this :


seed state probabilities

iterate
{

build dictionary greedily using fixed state probabilities

update state probabilities

}

That is, during the dictionary creation, word probabilities are estimated using state probabilities from the previous iteration. They hard-code this to 3 iterations.

There is an alternative, which is to update the state probabilities as you go. Any time you do a greedy word split, you're changing 3 state probabilities, so that's not terrible. But changing the state probabilities means all your previous word estimate probabilities are now wrong, so you have to go back through them and recompute them. This makes it O(N^2) in the dictionary size, which is bad.

For reference, combining our above equations to make the word probability estimate :


P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )

so:

P(W) = P_naive(W) * Ptail( num_children(W) ) * Sum[ i <= W[0] ] P( state S_i ) / Ptail(i)

the second half of that can be tabulated between iterations :

P_state_tail(j) = Sum[ i <= j ] P( state S_i ) / Ptail(i)

so :

P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )

you can see the "Marlin" aspect of all this is just in using this P(W) rather than P_naive(W) . How important is it exactly to get this P(W) right? (and is it right?) We'll find out next time...

3/24/2017

JPEG2

JPEG2 proposal / rough principles :

1. Simple simple simple. The decoder should be implementable in ~5000 lines as a single file stb.h style header. Keep it simple!

2. It should be losslessly transcodable from JPEG , ala packJPG/Lepton. That is, JPEG1 should be contained as a subset. (this just means having 8x8 DCT mode, quantization matrix). You could have other block modes in JPEG2 that simply aren't used when you transcode JPEG. You replace the entropy coded back-end with JPEG2 and should get about 20% file size reduction.

IMO this is crucial for rolling out a new format, nobody should ever be trancoding existing JPEGs and thereby introducing new error.

3. Reasonably fast to decode. Slower than JPEG1 by maybe 2X is okay, but not by 10X. eg. JPEG-ANS is okay, JPEG-Ari is probably not okay. Also think about parallelism and GPU decoding for huge images (100 MP). Keeping decoding local is important (eg. each 32x32 block or so should be independently decodable).

4. Decent quality encoding without crazy optimizing encoders. The straightforward encode without big R-D optimizing searches should still beat JPEG.

5. Support for per-block Q , so that sophisticated encoders can do bit rate allocation.

6. Support alpha, HDR. Make a clean definition of color space and gamma. But *don't* go crazy with supporting ICC profiles and lots of bit depths and so on. Needs to be the smallest set of features here. You don't want to get into the situation that's so common where the format is too complex and nobody actually supports it right in practice, so there becomes a "spec standard" and a "de-facto standard" that don't parse lots of the optional modes correctly.

7. Support larger blocks & non-square blocks; certainly 16x16 , maybe 32x32 ? Things like 16x8 , etc. This is important for increasingly large images.

Most of all keep it simple, keep it close to JPEG, because JPEG actually works and basically everything else in lossy image compression doesn't.

Anything that's not just DCT + quantize + entropy is IMO a big mistake, very suspicious and likely to be vaporware in the sense that you can make it look good on paper but it won't work well in reality.


ADD :

I have in the past posted many times about how plain old baseline JPEG + decent back-end entropy (eg. packJPG/Lepton) is surprisingly competitive with almost every modern image codec.

That's actually quite surprising.

The issue is that baseline JPEG is doing *zero* R-D optimization. Even if you use something like mozjpeg which is doing a bit of R-D optimization, it's doing it for the *wrong* rate model (assuming baseline JPEG coding, not the packjpg I then actually use).

It's well known that doing R-D optimization correctly (with the right rate model) provides absolutely enormous wins in lossy compression, so the fact that baseline JPEG + packJPG without any R-D at all can perform so well is really an indictment of everything it beats. This tells us there is a lot of room for easy improvement.

3/08/2017

Kraken Perf with Simultaneous Threaded Decodes

I had a report from a customer of poor Kraken decode performance on PS4 when using 10 simultaneous threads for decoding, and it occurred to me I had never tested that thoroughly. (Their issue turned out to be something else; see end of post).

There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, perhaps share memory busses or TLBs. That means while you have N* the compute performance, you may have cache conflicts, and you could wind up bottlenecking on some of the memory subsystem. (generally we don't run into bandwidth bottlenecks, but there are lots of other limitted resources, like queue sizes, etc.)

Anyhoo, onto the testing -

I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Wiped caches before runs. I then wait on all N decodes being done and time that.

The graphs show total time for all N decodes, and time per decode (total/N).

If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.

Once you hit the limits of the system, the "time per" (blue) should be constant, and total then should go up linearly. (actually not quite, because when you are off the core # modulo, the threads don't all complete at the same time so you get wasted idle time; see the jump on lappy from 4-6 cores then how flat it is from 6-8, same on PS4 from 6-8 cores then flat from 9-12). If you have the threads to spare, then you can maximize throughput by minimizing "time per".


Conclusion :

No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.

Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 - 0.0111) ; on lappy it's not as good but still provides benefit to the time per decode up to 4 threads (total time from 0.0060 - 0.0095).

It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well, I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.


Charts :

Kraken on PS4 (6 cores; 4 cores per 2MB L2) :

lzt24 :

lzt99:

Almost perfect threading from 1-6 cores (total time constant) even with large binary file.

webster:

webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have very different character than binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2, there's falloff in perf from 1-4 cores.

webster Selkie :

Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) no turbo :

lzt24 :

lzt99 :

webster :

Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) WITH TURBO :

lzt24 :

lzt99 :

I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.

Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).


raw data : (lzt24)


lappy : no turbo : (*1000)
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475
5,  14.6867,   2.9373
6,  16.6759,   2.7793
7,  19.1105,   2.7301
8,  20.1687,   2.5211
9,  23.6391,   2.6266
10,  25.9279,   2.5928
11,  27.7395,   2.5218
12,  27.6459,   2.3038
13,  30.7935,   2.3687
14,  31.8541,   2.2753
15,  33.7883,   2.2526
16,  34.8252,   2.1766

lappy : with turbo :
1,   0.0060,   0.0060
2,   0.0070,   0.0035
3,   0.0087,   0.0029
4,   0.0095,   0.0024 <- 4
5,   0.0133,   0.0027
6,   0.0170,   0.0028
7,   0.0175,   0.0025
8,   0.0193,   0.0024 <- 8
9,   0.0228,   0.0025
10,   0.0252,   0.0025
11,   0.0262,   0.0024
12,   0.0278,   0.0023 <- 12
13,   0.0318,   0.0024
14,   0.0310,   0.0022
15,   0.0325,   0.0022
16,   0.0346,   0.0022 <- 16

PS4 :
1,   0.0099,   0.0099
2,   0.0102,   0.0051
3,   0.0104,   0.0035
4,   0.0106,   0.0027
5,   0.0110,   0.0022
6,   0.0111,   0.0018 <- min
7,   0.0147,   0.0021
8,   0.0180,   0.0022
9,   0.0204,   0.0023
10,   0.0214,   0.0021
11,   0.0217,   0.0020
12,   0.0220,   0.0018 <- same min again
13,   0.0257,   0.0020
14,   0.0297,   0.0021
15,   0.0310,   0.0021
16,   0.0319,   0.0020



comparing just lappy turbo to no-turbo :

lappy : no turbo :
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475

lappy : with turbo :
1,   6.0,   6.0
2,   7.0,   3.5
3,   8.7,   2.9
4,   9.5,   2.4

You can see with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo
With 4 cores they are getting close to the same speed, (10.2 vs 9.5), the turbo
has almost completely clocked down


The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.

On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).

2/24/2017

Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

The larger the block = more compression, and can help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :


Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s


16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limitting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.


window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.


If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.

2/09/2017

strtod

I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.

The marginal rounding value is like this in binary :


1|101010....101010|10000000000000000

^ ^              ^ ^- bits below mantissa are exactly at rounding threshold
| |              |
| |              +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
|
+- implicit 1 bit at top of double

in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :


hi = a u64 with first 64 bits
top bit of "hi" is on

(hi>>11) is the 52 bits of mantissa + implicit top bit

hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit

hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits

At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on

At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :


lo = current u64 from first 19 digits

hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)

final value is in [lo,hi]

so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :


34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this is right on a rounding threshold; because the banker bit is off it should round down

the correct closest double is 
34791611969279739000000.000000

but if you bias that up beyond the digits that MSVC reads :

34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this should round up; the closest double is
34791611969279743000000.000000

but old MSVC gets it wrong

-----------

Another example :


2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000

is the double threshold

2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110

is definitely above the threshold, should round up

but if you only read the first 19 base-10 digits :

2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000

you see a value that is below the threshold and would round down

You have to look at the interval of uncertainty -

2022951805990391198000000
2022951805990391199000000

and see that you straddle and boundary and must get more digits.

There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can interatively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.

BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where your coding interval looks like, in binary :


hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
     ^               ^               ^- active coding zone, new symbols add at this fixed point position
     |               |
     |               +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
     |
     +-- hi and lo the same in top bits, these bits have been sent & renormalized out

anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding more, control word, precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE). The best solution to all that is to just not use floats, use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particulary true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingess of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

"dtoa" (which contains a strtod) by David M Gay :

dtoa.c from http://www.netlib.org/fp/

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some apalling spaghetti. This is actual code from dtoa :


    if (k &= 0x1f) {
        k1 = 32 - k;
        z = 0;
        do {
            *x1++ = *x << k | z;
            z = *x++ >> k1;
            }
            while(x < xe);
        if ((*x1 = z))
            ++n1;
        }
#else
    if (k &= 0xf) {
        k1 = 16 - k;
        z = 0;
        do {
            *x1++ = *x << k  & 0xffff | z;
            z = *x++ >> k1;
            }

holy order-of-operations reliance batman! Jeebus.

To build dtoa I use :

#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions (always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

Here's my timing on random large (20-300 decimal digits) strings :


strtod dtoa.c
ticks = 353

strtod mine
ticks = 267

MSVC 2005 CRT atof
ticks = 1921

MUSL floatscan
ticks = 11445

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.

The MUSL floatscan slowness is mmrmm. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.

Tips for using Oodle as Efficiently as Possible

Tips for using Oodle as efficiently and robustly as possible. (wrst your game runtime loading + decoding only).

1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from being great performance, they have the most robust and well-tested fuzz safety.

2. Pass FuzzSafe_Yes to the OodleLZ_Decompress option. The KMS decoders are always internally fuzz safe. What the FuzzSafe_Yes option is add some checks to the initial header decode which will cause the decode to fail if the header byte has been tampered with to change it to a non-KMS compressor.

3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no init, and will do no allocations to decode (if you pass in the memory).

4. Disable Oodle's callbacks :


    OodlePlugins_SetAllocators(NULL,NULL);
    OodlePlugins_SetAssertion(NULL);
    OodlePlugins_SetPrintf(NULL);

(note that disabling the allocators only works because you are doing decoding only (no encoding) and because you do #5 below - pass in scratch memory for the decoder)

5. Because you disabled Oodle's access to an allocator, instead pass in the memory needed. You can ask the data for the compressor if you don't always know it.


    compressor = OodleLZ_GetChunkCompressor(in_comp,NULL);
                                    
    decoderMemorySize = OodleLZDecoder_MemorySizeNeeded(compressor,in_raw_length);
    decoderMemory = malloc( decoderMemorySize );

6. If possible keep the Decoder memory around or pull from some shared scratch space rather than allocating & freeing it every time.

7. If you're on a platform where Oodle is in a DLL or .so , sign it with a mechanism like authentisign to ensure it isn't tampered with.

8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are maximally using the IO bus and the CPU at the same time. If loading and decoding large buffers, consider running the decoders "ThreadPhased".

9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes. The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously, rather than trying ti multi-thread a single encode/decode.

10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors only when they get big size wins, and perhaps tune the compression to the game's latency needs - random access data that needs to get loaded as fast as possible, use faster decoders, background data that streams in slowly can use slower decoders. Hydra is the best way to do this, it will automatically measure decode speed and tune to maximize the space-speed tradeoff.

11. Don't load tiny buffers. Combine them into larger paging units.

12. Consider in-memory compressed data as a paging cache.

As always, feel free to contact me and ask questions.

2/01/2017

Oodle Hydra


02-01-17 | Oodle Hydra

Oodle Hydra - the many headed beast.

Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to do a lagrangian space-speed optimization decision about which compressor is maximizing the desired lagrange cost (size + lambda*time).

It turns out to be quite interesting.

(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)

Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, that corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" , Oodle makes the right decision for you that optimizes the goal.

Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between. That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.

I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good, but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.

It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference. (I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they are slower it's because the time was worth it.

And now some reports. We're going to look at 3 copora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D, something magic happens ...

(Oodle 2.4.2 , level 7, Core i7-3770 x64)

Silesia :

total                : Kraken     : 4.106 to 1 : 994.036 MB/s
total                : Mermaid    : 3.581 to 1 : 1995.919 MB/s
total                : Hydra200   : 4.096 to 1 : 1007.692 MB/s
total                : Hydra288   : 4.040 to 1 : 1082.211 MB/s
total                : Hydra416   : 3.827 to 1 : 1474.452 MB/s
total                : Hydra601   : 3.685 to 1 : 1780.476 MB/s
total                : Hydra866   : 3.631 to 1 : 1906.823 MB/s
total                : Hydra1250  : 3.572 to 1 : 2002.683 MB/s

gametestset :

total                : Kraken     : 2.593 to 1 : 1309.865 MB/s
total                : Mermaid    : 2.347 to 1 : 2459.442 MB/s
total                : Hydra200   : 2.593 to 1 : 1338.429 MB/s
total                : Hydra288   : 2.581 to 1 : 1397.465 MB/s
total                : Hydra416   : 2.542 to 1 : 1581.959 MB/s
total                : Hydra601   : 2.484 to 1 : 1836.988 MB/s
total                : Hydra866   : 2.431 to 1 : 2078.516 MB/s
total                : Hydra1250  : 2.366 to 1 : 2376.828 MB/s

PD3D :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s
total                : Mermaid    : 3.403 to 1 : 1814.660 MB/s
total                : Hydra200   : 3.755 to 1 : 1218.745 MB/s
total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s
total                : Hydra601   : 3.574 to 1 : 1518.151 MB/s
total                : Hydra866   : 3.487 to 1 : 1666.634 MB/s
total                : Hydra1250  : 3.279 to 1 : 1965.039 MB/s

Whoah! Magic!

On PD3D, Hydra finds big free wins - it not only compresses more than Kraken, it decodes significantly faster, repeating the above to point it out :


total                : Kraken     : 3.678 to 1 : 1054.380 MB/s

total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
 Kraken compression ratio is in between here, around 1300 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s

You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that line (more compression) and to the right (faster).

What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.

Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).

Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically don't expect to be good in general on all data, but that might be amazing once in a while on certain data.

(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize a switching choice with compressors that carry state requires a trellis-quantization like lattice dynamic programming optimization and is rather more complex to do quickly)

old rants