4/14/2017

Tunstall vs Marlin Results Part 2

Testing on some real files.

Marlin :

loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,286,166 =  9.256 bpb =  0.864 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.467 b/kc : 288.41 MB/s : 498.66
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   618,447 =  3.145 bpb =  2.543 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.281 b/kc : 780.92 MB/s : 1350.21
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,763 =  5.382 bpb =  1.486 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.988 b/kc : 503.06 MB/s : 869.78
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   140,391 =  2.188 bpb =  3.656 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.800 b/kc : 1250.71 MB/s : 2162.48
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    68,511 =  2.878 bpb =  2.780 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.031 b/kc : 969.81 MB/s : 1676.80
total bytes out : 1974785
naive plural Tunstall :
loading : R:\tunstall_test\lzt24.literals
filelen = 1111673
H = 7.452694
sym_count = 256
lzt24.literals      :  1,111,673 -> 1,290,015 =  9.283 bpb =  0.862 to 1 
decode_time2 : seconds:0.0022 ticks per: 3.443 b/kc : 290.45 MB/s : 502.18
loading : R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293
sym_count = 236
monarch.tga.rrz_filtered.bmp:  1,572,918 ->   627,747 =  3.193 bpb =  2.506 to 1 
decode_time2 : seconds:0.0012 ticks per: 1.284 b/kc : 779.08 MB/s : 1347.03
loading : R:\tunstall_test\paper1
filelen = 53161
H = 4.982983
sym_count = 95
paper1              :     53,161 ->    35,934 =  5.408 bpb =  1.479 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.998 b/kc : 500.61 MB/s : 865.56
loading : R:\tunstall_test\PIC
filelen = 513216
H = 1.210176
sym_count = 159
PIC                 :    513,216 ->   145,980 =  2.276 bpb =  3.516 to 1 
decode_time2 : seconds:0.0002 ticks per: 0.826 b/kc : 1211.09 MB/s : 2093.97
loading : R:\tunstall_test\tabdir.tab
filelen = 190428
H = 2.284979
sym_count = 77
tabdir.tab          :    190,428 ->    74,169 =  3.116 bpb =  2.567 to 1 
decode_time2 : seconds:0.0001 ticks per: 1.103 b/kc : 906.80 MB/s : 1567.86
total bytes out : 1995503

About the files :


lzt24.literals are the literals left over after LZ-parsing (LZQ1) lzt24
  like all LZ literals they are high entropy and thus do terribly in Tunstall

monarch.tga.rrz_filtered.bmp is the image residual after filtering with my DPCM
  (it actually has a BMP header on it which is giving Tunstall a harder time
   than if I stripped the header)

paper1 & pic are standard

tabdir.tab is a text file of a dir listing with lots of tabs in it

For speed comparison, this is the Oodle Huffman on the same files :
loading file (0/5) : lzt24.literals
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,111,673 -> 1,036,540 =  7.459 bpb =  1.072 to 1
encode           : 8.405 millis, 13.07 c/b, rate= 132.26 mb/s
decode           : 1.721 millis, 2.68 c/b, rate= 645.81 mb/s
ooHuffman1,1036540,8405444,1721363
loading file (1/5) : monarch.tga.rrz_filtered.bmp
ooHuffman1 : ed...........................................................
ooHuffman1 :  1,572,918 ->   586,839 =  2.985 bpb =  2.680 to 1
encode           : 7.570 millis, 8.32 c/b, rate= 207.80 mb/s
decode           : 2.348 millis, 2.58 c/b, rate= 669.94 mb/s
ooHuffman1,586839,7569562,2347859
loading file (2/5) : paper1
ooHuffman1 :     53,161 ->    33,427 =  5.030 bpb =  1.590 to 1
encode           : 0.268 millis, 8.70 c/b, rate= 198.67 mb/s
decode           : 0.080 millis, 2.60 c/b, rate= 665.07 mb/s
ooHuffman1,33427,267579,79933
loading file (3/5) : PIC
ooHuffman1 :    513,216 ->   106,994 =  1.668 bpb =  4.797 to 1
encode           : 2.405 millis, 8.10 c/b, rate= 213.41 mb/s
decode           : 0.758 millis, 2.55 c/b, rate= 677.32 mb/s
ooHuffman1,106994,2404854,757712
loading file (4/5) : tabdir.tab
ooHuffman1 :    190,428 ->    58,307 =  2.450 bpb =  3.266 to 1
encode           : 0.926 millis, 8.41 c/b, rate= 205.70 mb/s
decode           : 0.279 millis, 2.54 c/b, rate= 681.45 mb/s
ooHuffman1,58307,925742,279447

Tunstall is crazy fast. And of course that's a rather basic implementation of the decoder, I'm sure it could get faster.

Is there an application for plural Tunstall? I'm not sure. I tried it back in 2015 as an idea for literals in Mermaid/Selkie and abandoned it as not very relevant there. It works on low-entropy order-0 data (like image prediction residuals).

Of course if you wanted to test it against the state of the art you should consider SIMD Ryg RANS or GPU RANS. You should consider something like TANS with multiple symbols in the output table. You should consider merged-symbol codes, perhaps using escapes, perhaps runlen transforms. See for example "crblib/huffa.c" for a survey of Huffman ideas from 1996 (pre-runtransform, blocking MPS's, order-1-huff, multisymbol output, etc.)

Some commentary on Marlin and the code

WARNING : I'm a bit delirious with flu and lack of sleep at the moment so I'm not entirely sure what I'm writing. Apologies if it's a mess!

First, there's no point in making the T_ij transition matrix they talk about.

Back in "Understanding Marlin" you may recall I presented the algorithm thusly :


P_state(i) is given from a previous iteration and is constant

build dictionary using Marlin word model

we now have P(W) for all words in our dic

Use P(W) to compute new P_state(i)

optionally iterate a few times (~ 10 times) :
  use that P_state to compute adjusted P(W)
  use P(W) to compute new P_state

iteration dictionary building again (3-4 times)

in the paper (and code) they do it a bit differently. They compute the state transition matrix, which is :

T_ij = Sum[ all words W that end in state S_i ] P(W|S_j)

this is the probability that if you started in state j you will wind up in state i

instead of iterating P_state -> P(W) , they iterate :

T <- T * T

and then P_state(i) = T_i0

I tested both ways and they produce the exact same result, but just doing it through the P(W) computation is far simpler and faster. The matrix multiply is O(alphabet^3) while the P way is only O(alphabet+dic_size)

Also for the record - I have yet to find a case where iterating to convergence here actually helps. If you just make P_State from PW once and don't iterate, you get 99% of the win. eg :


laplacian distribution :

no iteration :

     0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

iterate 10X :

     0.67952             :  1,000,000 ->   503,688 =  4.030 bpb =  1.985 to 1 

You *do* need to iterate the dictionary build. I do it 4 times. 3 times would be fine though, heck 2 is probably fine.

4: 0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

3: 0.67952             :  1,000,000 ->   503,721 =  4.030 bpb =  1.985 to 1 

2: 0.67952             :  1,000,000 ->   503,817 =  4.031 bpb =  1.985 to 1 

The first iteration builds a "naive plural Tunstall" dictionary; the P_state is made from that, second iteration does the first "Marlin" dictionary build.

In general I think they erroneously come to the conclusion that plural Tunstall dictionaries are really slow to create. They're only 1 or 2 orders of magnitude slower than building a Huffman tree, certainly not slow compared to many encoder speeds. Sure sure if you want super fast encoding you wouldn't want to do it, but otherwise it's totally possible to build the dictionaries for each use.

There's a lot of craziness in the Marlin code that makes their dic build way slower than it should be. Some is just over-C++ madness, some is failure to factor out constant expressions.


the word is :

        struct Word : public std::vector<uint8_t> {
};

and the dictionary is :

std::vector<Word> W;

 (with no reserves anywhere)
yeah that's a problem.  May I suggest :

struct Word {
    uint64 chars;
    int len;
};

also reserve() is good and calling size() in loops is bad.

This is ouchy :

        virtual double phi(const Word &) const = 0;

and this is even more ouchy :

        virtual std::vector<Word> split(const Word &) const = 0;

The P(w) function that's used frequently is bad.

The key part is :

    for (size_t t = 0; t<=w[0]; t++) {
        double p = PcurrState[t]; 
        p *= P[w[0]]/PnextState[t];
        ret += p;
    }

which you can easily factor out the common P[w[0]] from :

    int c0 = w[0];
    for (size_t t = 0; t<= c0; t++) {
        ret += PcurrState[t]/PnextState[t];
    }
    ret *= P[c0]

but even more significant would be to realize that PcurrState (my P_state) and
PnextState (my P_tail) are not updated during dic building at all!  They're constant
during that phase, so that whole thing can be precomputed and put in a table.
Then this is just :

    int c0 = w[0];
    ret = PcurrState_over_PnextState_partial_sum[c0];
    ret *= P[c0]

that also gives us a chance to state clearly (again) the difference between "Marlin" and naive plural Tunstall. It's that one table right there.


    int c0 = w[0];
    ret = 1.0;
    ret *= P[c0]

this is naive plural Tunstall. It comes down to a modifed probability table for the first letter in the word.

Recall that :


P_naive(W) = Prod[ chars c in W ] P(c)

simple P_word(W) = P_naive(W) * Ptail( num_children(W) )


Reading Yamamoto and Yokoo "Average-Sense Optimality and Competitive Optimality for Almost Instantaneous VF Codes".

They construct the naive plural Tunstall VF dictionary. They are also aware of the Marlin-style state transition problem. (that is, partical nodes leave you in a state where some symbols are excluded).

They address the problem by constructing multiple parse trees, one for each initial state S_i. In tree T_i you know that the first character is >= i so all words that start with lower symbols are excluded.

This should give reasonably more compression than the Marlin approach, obviously with the cost of having multiple dictionaries.

In skewed alphabet cases where the MPS is very probable, this should be significant because words that start with the MPS dominate the dictionary, but in all states S_1 and higher those words cannot be used. In fact I conjecture that even having just 2 or 3 trees should give most of the win. One tree for state S_0, one for S_1 and the last for all states >= S_2. In practice this problematic because the multiple code sets would fall out of cache and it adds a bit of decoder complexity to select the following tree.

There's also a continuity between VTF codes and blocked arithmetic coders. The Yamamoto-Yokoo scheme is like a way of carrying the residual information between blocked transmissions, similar to multi-table arithmetic coding schemes.


Phhlurge.

I just went and got the marlin code to compile in VS 2015. Bit of a nightmare. I wanted to confirm I didn't screw anything up in my implementation.


two-sided Laplacian distribution centered at 0
(this is what the Marlin code assumes)

r = 0.67952

H = 3.798339

my version of Marlin-probability plural Tunstall :

0.67952             :  1,000,000 ->   503,694 =  4.030 bpb =  1.985 to 1 

Marlin reference code : 1,000,000 -> 507,412

naive plural Tunstall :

0.67952             :  1,000,000 ->   508,071 =  4.065 bpb =  1.968 to 1 

I presume the reason they compress worse than my version is because they make dictionaries for a handfull of Laplacian distributions and then pick the closest one. I make a dictionary for the actual char counts in the array, so their dictionary is mis-matching the actual distribution slightly.

Marlin Summary

Sum up with simple take-away.

Plural Tunstall VTF coding in general is extremely fast to decode. It works best with 12-bit tables (must stay in L1), which means it only works well at entropy <= 4 bpb.

Marlin introduces an improved word probability model that captures the way the first letter probability is skewed by the longer-string exclusion. (this is just like the LAM exclusion in LZ)

That is, after coding with word W, subsequent chars that exist in the dictionary (Wa,Wb) cannot then be the next character to start a word, so the probably of the first char of the word being >= c is increased.

The Marlin word probability model improves compression by 1-4% over naive plural Tunstall.

The simplest implementation of the Marlin probability adjustment is like this :


P_naive(W) = Prod[chars c in W] P('c')

P_word(W) = P_scale_first_char( W[0] ) * P_naive(W) * Ptail( num_children(W) )

(where Ptail is the tail-cumulative-proability :
Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
)

(instead of scaling by Ptail you can subtract off the child node probabilities as you make them)


on the first iteration of dictionary building, set P_scale_first_char() = 1.0

this is "naive plural Tunstall"

after building the dictionary, you now have a set of words and P(W) for each
compute :

P_state(i) = Sum[ words W with i children ] P(W)

(state i means only chars >= i can follow)

(iterating P_state -> P(W) a few times here is optional but unnecessary)

P_scale_first_char(c) = Sum[ i <= c ] P_state(i) / P_tail(i)

(P_scale_first_char = P_state_tail)

then repeat dic building one more time
(optionally repeat more but once seems fine)

And that's it!


What do these values actually look like? I thought it might be illuminating to dump them. This is on a pretty skewed file so the effect is large, the larger the MPS probability the bigger the effect.


R:\tunstall_test\monarch.tga.rrz_filtered.bmp
filelen = 1572918
H = 2.917293

P of chars =

        [0] 0.46841475525106835 double
        [1] 0.11553621994280693 double
        [2] 0.11508546535801611 double
        [3] 0.059216055763873253    double
        [4] 0.058911526220693004    double
        [5] 0.036597584870921435    double
        [6] 0.036475518749229136    double
        [7] 0.018035269480036465    double
        [8] 0.017757441900976400    double
        [9] 0.010309501194595012    double
        [10]    0.0097379520102128646   double

P_state =

        [0] 0.62183816678155190 double
        [1] 0.15374679894466811 double
        [2] 0.032874234239563829    double
        [3] 0.063794018822874776    double
        [4] 0.026001940955215786    double
        [5] 0.011274295764837820    double
        [6] 0.028098911350290755    double
        [7] 0.012986055279597277    double
        [8] 0.0013397794289329405   double
        .. goes to zero pretty fast ..

P_scale_first_char =

        [0] 0.62183816678155202 double
        [1] 0.91106139196208336 double
        [2] 0.99007668169839014 double
        [3] 1.2020426052137052  double
        [4] 1.3096008656256881  double
        [5] 1.3712643080249607  double
        [6] 1.5634088663186672  double
        [7] 1.6817189544649160  double
        [8] 1.6963250203077103  double
        [9] 1.9281295477496172  double
        [10]    1.9418127462353780  double
        [11]    2.0234438458996773  double
        [12]    2.0540542047979415  double
        [13]    2.1488636999462676  double
        [14]    2.2798060244386895  double
        [15]    2.2798060244386895  double
        [16]    2.2798060244386895  double
        [17]    2.3660205062350039  double
        [18]    2.3840557757150402  double
        [19]    2.3840557757150402  double
        [20]    2.3840557757150402  double
        [21]    2.4066061022686838  double
        [22]    2.5584098628550294  double
        [23]    2.5584098628550294  double
        [24]    2.6690752676624321  double
        .. gradually goes up to 4.14

estimate real first char probability 
= P(c) * P_scale_first_char(c) =

        [0] 0.29127817269875372 double
        [1] 0.10526058936313110 double
        [2] 0.11394343565337962 double
        [3] 0.071180221940886246    double
        [4] 0.077150585733949978    double
        [5] 0.050184961893408854    double
        [6] 0.057026149416117611    double
        [7] 0.030330254533459933    double

the effect is to reduce the skewing of the probabilities in the post-word alphabet.


I think this problem space is not fully explored yet and I look forward to seeing more work in this domain in the future.

I'm not convinced that the dictionary building scheme here is optimal. Is there an optimal plural VTF dictionary?

I think maybe there's some space between something like RANS and Tunstall. Tunstall inputs blocks of N bits and doesn't carry any state between them. RANS pulls blocks of N bits and carries full state of unused value range between them (which is therefore a lot slower because of dependency chains). Maybe there's something in between?

Another way to think about this Marlin ending state issue is a bit like an arithmetic coder. When you send the index of a word that has children, you have not only sent that word, you've also sent information about the *next* word. That is, some fractional bits of information that you'd like to carry forward.

Say you have {W,Wa,Wb} in the dictionary and you send a W. You've also sent that the next word start with >= c. That's like saying your arithmetic coder cumprob is >= (Pa+Pb). You've refined the range of the next interval.

This could be done with Tunstall by doing the multi-table method (separate tables for state 0,1, and 2+), but unfortunately that doesn't fit in L1.

BTW you can think of Tunstall in an arithmetic codey way. Maybe I'll draw a picture because it's easier to show that way...

4/13/2017

Tunstall Context

Looking at Marlin today, some reminders to self/all about context :

Tunstall was originally designed with binary alphabets in mind; variable to fixed on *binary*. In that context, doing full child trees (so dictionaries are full prefix trees and encoding is unique) makes sense. As soon as you do variable-to-fixed (hence VTF) on large alphabets, "plural" trees are obviously better and have been written about much in the past. With plural trees, encoding is not unique ("a" and "ab" are both in the dictionary).

There's a big under-specified distinction between VTF dictionaries that model higher level correlation and those that don't. eg. does P("ab") = P(a)*P(b) or is there order-1 correlation?

I looked at Tunstall codes a while ago (TR "failed experiment : Tunstall Codes" 12/4/2015). I didn't make this clear but in my experiment I was looking at a specific scenario :

symbols are assumed to have only order-0 entropy
(eg. symbol probabilities describe their full statistics)

encoder transmits symbol probabilities (or the dictionary of words)
but there are other possibilities that some of the literature addresses. There are at lot of papers on "improved Tunstall" that use the order-N probabilities (the true N-gram counts for the words rather than multiplying the probability of each character). Whether or not this works in practice depends on context, eg. on LZ literals the characters are non-adjacent in the source so this might not make sense.

There's a fundamental limitation with Tunstall in practice and a very narrow window where it makes sense.

On current chips, 12-bit words is ideal (because 4096 dwords = 16k = fits in L1). 16 bit can sometimes give much better compression, but falling out of L1 is a disaster for speed.

12-bit VTF words works great if the entropy of the source is <= 5 bits or so. As it goes over 5, you have too many bigrams that don't pack well into 12, and the compression ratio starts to suffer badly (and decode speed suffers a bit).

I was investigating Tunstall in the case of normal LZ literals, where entropy is always in the 6-8 bpc range (because any more compressability has been removed by the string-match portion of the LZ). In that case Tunstall just doesn't work.

Tunstall is best when entropy <= 3 bits or so. Not only do you get compression closer to entropy, you also get more decode speed.

Now for context, that's a bit of a weird place to just do entropy coding. Normally in low-entropy scenarios, you would have some kind of coder before just tossing entropy coding at it. eg. take DCT residuals, or any image residual situation. You will have lots of 0's and 1's so it looks like a very low entropy scenario for order-0 entropy, but typically you would remove that by doing RLE or something else so that the alphabet you hand to the entropy coder is higher entropy. (eg. JPEG does RLE of 0's and EOB).

Even if you did just entropy code on a low-entropy source, you might instead use a kind of cascaded coder. Again assuming something like prediction residuals where the 0's and 1's are very common, you might make a two-stage alphabet that's something like :


alphabet 1 : {0,1,2,3+}
alphabet 2 : {3,4,...255}

Then with alphabet 1 you could pack 4 symbols per byte and do normal Huffman. Obviously a Huffman decode is a little slower than Tunstall, but you're getting always 4 symbols per decode so output len is not variable, and compression ratio is better.

Tunstall for LZ literals might be interesting in a very fast LZ with MML 8 or so. (eg. things like ZStd level 1, which is also where multi-symbol-output Huff works well).

Point is - the application window here is pretty narrow, and there are other techniques that also address the same problem.

Understanding Marlin

I will be using "Tunstall" to mean any variable to fixed coder. I am considering large alphabet (eg. 8-bit input alphabet), "plural" (eg. non-prefix-free dictionary). I am also considering only modeling order-0 statistics.

I label the symbols 'a','b','c' from most probable to least probable. I will use single quotes for symbols, and double quotes for words in the dictionary. So :


P('a') , P('b'), etc. are given

P('a') >= P('b') >= P('c') are ordered

I will use the term "Marlin" to describe the way they estimate the probability of dictionary words. (everything else in the paper is either obvious or well known (eg. the way the decoder works), so the innovation and interesting part is the word probability estimation, so that is what I will call "Marlin" , the rest is just "Tunstall").

Ok. To build a Tunstall dictionary your goal is to maximize the average input length, which is :


average input length = Sum[words] { P(word) * L(word) }

since the output length is fixed, maximizing the input length maximizes compression ratio.

In the original Tunstall algorithm on binary input alphabet, this is easily optimized by splitting the most probable word, and adding its two children. This can be done in linear time using two queues for left and right (0 and 1) children.

The Marlin algorithm is all about estimating P(word).

The first naive estimate (what I did in my 12/4/2015 report) is just to multiply the character probabilities :


P(word) = Prod[c in word] P('c')

that is

P("xyz") = P('x') * P('y') * P('z')

but that's obviously not right. The reason is that the existence of words in the dictionary affects the probability of other words.

In particular, the general trend is that the dictionary will accumulate words with the most probable characters (a,b,c) which will make the effective probability of the other letters in the remainder greater.

For example :


Start the dictionary with the 256 single-letter words

At this point the naive probabilities are exact, that is :

P("a") (the word "a") = P('a') (the letter 'a')

Now add the most probable bigram "aa" to the dictionary.

We now have a 257-word dictionary.  What are the probabilities when we code from it ?

Some of the occurrances of the letter 'a' will now be coded with the word "aa"

That means P("a") in the dictionary is now LESS than P('a')

Now add the next most probable words, "ab" and "ba"

The probability of "a" goes down more, as does P("b")

Now if we consider the choice of what word to add next - is it "ac" or "bb" ?

The fact that some of the probability of those letters has been used by words in the dictionary affects
our estimate, which affects our choice of how to build the dictionary.

so that's the intuition of the problem, and the Marlin algorithm is one way to solve it.

Let's do it intuitively again in a bit more detail.

There are two issues : the way the probability of a shorter word is reduced by the presence of longer words, and the way the probability of raw characters that start words is changed by the probability of them coming after words in the dictionary.


Say you have word W in your dictionary

and also some of the most probable children.

W, Wa, Wb are in dictionary

Wc, Wd are not

We'll say word "W" has 2 children (Wa and Wb).

So word "W" will only be used from the dictionary if the children are NOT a or b
(since in that case the longer word would be used).

So if you have seen word "W" so far, to use word W, the next character must be terminal,
eg. one that doesn't correspond to another child.

So the probability of word W should be adjusted by :

P(W) *= P(c) + P(d)

Because we are dealing with sorted probability alphabets, we can describe the child set with just one integer to indicate which ones are in the dictionary. In Marlin terminology this is c(w), and corresponds to the state Si.

If we make the tail cumulative probability sum :


Ptail(x) = Sum[ c >= x ] P('c')  (sum of character probabilities to end)
Ptail(255) = P(255)
Ptail(254) = P(254) + P(255)
Ptail(0) = sum of all P('c')  = 1.0

then the adjustment is :

P(W) *= P(c) + P(d)
P(W) *= Ptail('c')
P(W) *= Ptail(first char that's not a child)

P(W) *= Ptail( num_children(W) )
(I'm zero-indexing, so no +1 here as in the Marlin paper, they 1-base-index)

ADD : I realized there's a simpler way to think about this. When you add a child word, you simply remove that probability from the parent. That is :

let P_init(W) be the initial probability of word W
when it is first added to the dictionary and has no children

track running estimate of P(W)

when you add child 'x' making word "Wx"

The child word probability is initialized from the parent's whole probability :

P_init(Wx) = P_init(W) * P('x')

And remove that from the running P(W) :

P(W) -= P_init(Wx)

That is you just make P(W) the probability of word W, excluding children that exist.
Once you add all children, P(W) will be zero and the word is useless.

Okay, so that does the first issue (probability of words reduced by the presence of longer words). Now the next issue. Consider the same simple example case first :


W,Wa,Wb are in dictionary, Wc,Wd are not
(no longer children of W are either)

Say you reach node "Wa"

there are no children of "Wa" in the dictionary, so all following characters are equally likely

This means that starting the next word, the character probabilities are equal to their original true probabilities

But say you reach node "W" and leave via 'c' or 'd'

In that case the next character must be 'c' or 'd' , it can never be 'a' or 'b'

So the probability of the next character being 'c' goes up by the probability of using word "W" and the
probability of being a 'c' after "W" , that's :

estimate_P('c') +=  P("W") * P('c') / ( P('c') + P('d') )

or

estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) )

now you have to do this for all paths through the dictionary. But all ways to exit with a certain child count are similar, so you can merge those paths to reduce the work. All words with 2 children will be in the same exit probability state ('a' and 'b' can't occur but chars >= 'c' can).

This is the Marlin state S_i. S_i means that character is >= i. It happens because you left the tree with a word that had i children.

When you see character 2 that can happen from state 0 or 1 or 2 but never states >= 3.


for estimating probability of word W

W can only occur in states where the first character W[0] is possible

that is state S_i with i <= W[0]

When character W[0] does occur in state S_i , the probability of that character is effectively higher,
because we know that chars < i can't occur.

Instead of just being P(char W[0]) , it's divided by Ptail(i)

(as in estimate_P('c') +=  P("W") * P('c') / Ptail( num_children(W) ) above)


So :

P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

the probability of W is the sum of the probability of states it can start from
(recall states = certain terminal character sets)
times the probability of W given that state

let

P_naive(W) = Product[ chars c in W ] P(char 'c')

be the naive word probability, then :

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )


is what we need.  This is equation (1) in the Marlin paper.

The first term increases the probability of W for higher chars, because we know the more probable lower chars can't occur in this state (because they found longer words in the dictionary)

The last term decreases the probability of W because it will only be used when the following character doesn't cause a longer word in the dictionary to be used

Now of course there's a problem. This P(W) probability estimate for words requires the probability of starting the word from state S_i, which we don't know. If we had the P(W) then the P of states is just :


P(S_i) = Sum[ words W that have i children ] * P(W)

so to solve this you can just iterate. Initialize the P(S_i) to some guess; the Marlin code just does :

P(state 0) = 1.0
P(all others) = 0.0

(recall state 0 is the state where all chars are possible, no exclusions, so
characters just have their original order-0 probability)

feeds that in to get P(W), feeds that to update P(S_i), and repeats to convergence.

To build the dictionary you simply find the word W with highest P(W) and split it (adding its next most probable child and increasing its child count by 1).

The Marlin code does this :


seed state probabilities

iterate
{

build dictionary greedily using fixed state probabilities

update state probabilities

}

That is, during the dictionary creation, word probabilities are estimated using state probabilities from the previous iteration. They hard-code this to 3 iterations.

There is an alternative, which is to update the state probabilities as you go. Any time you do a greedy word split, you're changing 3 state probabilities, so that's not terrible. But changing the state probabilities means all your previous word estimate probabilities are now wrong, so you have to go back through them and recompute them. This makes it O(N^2) in the dictionary size, which is bad.

For reference, combining our above equations to make the word probability estimate :


P(W) = Sum[ i <= W[0] ] P( state S_i ) * P( W | S_i )

P(W | S_i) = (1 / Ptail(i)) * P_naive(W) * Ptail( num_children(W) )

so:

P(W) = P_naive(W) * Ptail( num_children(W) ) * Sum[ i <= W[0] ] P( state S_i ) / Ptail(i)

the second half of that can be tabulated between iterations :

P_state_tail(j) = Sum[ i <= j ] P( state S_i ) / Ptail(i)

so :

P(W) = P_naive(W) * Ptail( num_children(W) ) * P_state_tail( W[0] )

you can see the "Marlin" aspect of all this is just in using this P(W) rather than P_naive(W) . How important is it exactly to get this P(W) right? (and is it right?) We'll find out next time...

3/24/2017

JPEG2

JPEG2 proposal / rough principles :

1. Simple simple simple. The decoder should be implementable in ~5000 lines as a single file stb.h style header. Keep it simple!

2. It should be losslessly transcodable from JPEG , ala packJPG/Lepton. That is, JPEG1 should be contained as a subset. (this just means having 8x8 DCT mode, quantization matrix). You could have other block modes in JPEG2 that simply aren't used when you transcode JPEG. You replace the entropy coded back-end with JPEG2 and should get about 20% file size reduction.

IMO this is crucial for rolling out a new format, nobody should ever be trancoding existing JPEGs and thereby introducing new error.

3. Reasonably fast to decode. Slower than JPEG1 by maybe 2X is okay, but not by 10X. eg. JPEG-ANS is okay, JPEG-Ari is probably not okay. Also think about parallelism and GPU decoding for huge images (100 MP). Keeping decoding local is important (eg. each 32x32 block or so should be independently decodable).

4. Decent quality encoding without crazy optimizing encoders. The straightforward encode without big R-D optimizing searches should still beat JPEG.

5. Support for per-block Q , so that sophisticated encoders can do bit rate allocation.

6. Support alpha, HDR. Make a clean definition of color space and gamma. But *don't* go crazy with supporting ICC profiles and lots of bit depths and so on. Needs to be the smallest set of features here. You don't want to get into the situation that's so common where the format is too complex and nobody actually supports it right in practice, so there becomes a "spec standard" and a "de-facto standard" that don't parse lots of the optional modes correctly.

7. Support larger blocks & non-square blocks; certainly 16x16 , maybe 32x32 ? Things like 16x8 , etc. This is important for increasingly large images.

Most of all keep it simple, keep it close to JPEG, because JPEG actually works and basically everything else in lossy image compression doesn't.

Anything that's not just DCT + quantize + entropy is IMO a big mistake, very suspicious and likely to be vaporware in the sense that you can make it look good on paper but it won't work well in reality.


ADD :

I have in the past posted many times about how plain old baseline JPEG + decent back-end entropy (eg. packJPG/Lepton) is surprisingly competitive with almost every modern image codec.

That's actually quite surprising.

The issue is that baseline JPEG is doing *zero* R-D optimization. Even if you use something like mozjpeg which is doing a bit of R-D optimization, it's doing it for the *wrong* rate model (assuming baseline JPEG coding, not the packjpg I then actually use).

It's well known that doing R-D optimization correctly (with the right rate model) provides absolutely enormous wins in lossy compression, so the fact that baseline JPEG + packJPG without any R-D at all can perform so well is really an indictment of everything it beats. This tells us there is a lot of room for easy improvement.

3/08/2017

Kraken Perf with Simultaneous Threaded Decodes

I had a report from a customer of poor Kraken decode performance on PS4 when using 10 simultaneous threads for decoding, and it occurred to me I had never tested that thoroughly. (Their issue turned out to be something else; see end of post).

There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, perhaps share memory busses or TLBs. That means while you have N* the compute performance, you may have cache conflicts, and you could wind up bottlenecking on some of the memory subsystem. (generally we don't run into bandwidth bottlenecks, but there are lots of other limitted resources, like queue sizes, etc.)

Anyhoo, onto the testing -

I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Wiped caches before runs. I then wait on all N decodes being done and time that.

The graphs show total time for all N decodes, and time per decode (total/N).

If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.

Once you hit the limits of the system, the "time per" (blue) should be constant, and total then should go up linearly. (actually not quite, because when you are off the core # modulo, the threads don't all complete at the same time so you get wasted idle time; see the jump on lappy from 4-6 cores then how flat it is from 6-8, same on PS4 from 6-8 cores then flat from 9-12). If you have the threads to spare, then you can maximize throughput by minimizing "time per".


Conclusion :

No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.

Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 - 0.0111) ; on lappy it's not as good but still provides benefit to the time per decode up to 4 threads (total time from 0.0060 - 0.0095).

It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well, I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.


Charts :

Kraken on PS4 (6 cores; 4 cores per 2MB L2) :

lzt24 :

lzt99:

Almost perfect threading from 1-6 cores (total time constant) even with large binary file.

webster:

webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have very different character than binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2, there's falloff in perf from 1-4 cores.

webster Selkie :

Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) no turbo :

lzt24 :

lzt99 :

webster :

Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) WITH TURBO :

lzt24 :

lzt99 :

I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.

Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).


raw data : (lzt24)


lappy : no turbo : (*1000)
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475
5,  14.6867,   2.9373
6,  16.6759,   2.7793
7,  19.1105,   2.7301
8,  20.1687,   2.5211
9,  23.6391,   2.6266
10,  25.9279,   2.5928
11,  27.7395,   2.5218
12,  27.6459,   2.3038
13,  30.7935,   2.3687
14,  31.8541,   2.2753
15,  33.7883,   2.2526
16,  34.8252,   2.1766

lappy : with turbo :
1,   0.0060,   0.0060
2,   0.0070,   0.0035
3,   0.0087,   0.0029
4,   0.0095,   0.0024 <- 4
5,   0.0133,   0.0027
6,   0.0170,   0.0028
7,   0.0175,   0.0025
8,   0.0193,   0.0024 <- 8
9,   0.0228,   0.0025
10,   0.0252,   0.0025
11,   0.0262,   0.0024
12,   0.0278,   0.0023 <- 12
13,   0.0318,   0.0024
14,   0.0310,   0.0022
15,   0.0325,   0.0022
16,   0.0346,   0.0022 <- 16

PS4 :
1,   0.0099,   0.0099
2,   0.0102,   0.0051
3,   0.0104,   0.0035
4,   0.0106,   0.0027
5,   0.0110,   0.0022
6,   0.0111,   0.0018 <- min
7,   0.0147,   0.0021
8,   0.0180,   0.0022
9,   0.0204,   0.0023
10,   0.0214,   0.0021
11,   0.0217,   0.0020
12,   0.0220,   0.0018 <- same min again
13,   0.0257,   0.0020
14,   0.0297,   0.0021
15,   0.0310,   0.0021
16,   0.0319,   0.0020



comparing just lappy turbo to no-turbo :

lappy : no turbo :
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475

lappy : with turbo :
1,   6.0,   6.0
2,   7.0,   3.5
3,   8.7,   2.9
4,   9.5,   2.4

You can see with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo
With 4 cores they are getting close to the same speed, (10.2 vs 9.5), the turbo
has almost completely clocked down


The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.

On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).

2/24/2017

Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

The larger the block = more compression, and can help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :


Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s


16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limitting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.


window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.


If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.

2/09/2017

strtod

I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.

The marginal rounding value is like this in binary :


1|101010....101010|10000000000000000

^ ^              ^ ^- bits below mantissa are exactly at rounding threshold
| |              |
| |              +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
|
+- implicit 1 bit at top of double

in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :


hi = a u64 with first 64 bits
top bit of "hi" is on

(hi>>11) is the 52 bits of mantissa + implicit top bit

hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit

hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits

At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on

At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :


lo = current u64 from first 19 digits

hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)

final value is in [lo,hi]

so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :


34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this is right on a rounding threshold; because the banker bit is off it should round down

the correct closest double is 
34791611969279739000000.000000

but if you bias that up beyond the digits that MSVC reads :

34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this should round up; the closest double is
34791611969279743000000.000000

but old MSVC gets it wrong

-----------

Another example :


2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000

is the double threshold

2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110

is definitely above the threshold, should round up

but if you only read the first 19 base-10 digits :

2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000

you see a value that is below the threshold and would round down

You have to look at the interval of uncertainty -

2022951805990391198000000
2022951805990391199000000

and see that you straddle and boundary and must get more digits.

There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can interatively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.

BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where your coding interval looks like, in binary :


hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
     ^               ^               ^- active coding zone, new symbols add at this fixed point position
     |               |
     |               +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
     |
     +-- hi and lo the same in top bits, these bits have been sent & renormalized out

anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding more, control word, precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE). The best solution to all that is to just not use floats, use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particulary true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingess of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

"dtoa" (which contains a strtod) by David M Gay :

dtoa.c from http://www.netlib.org/fp/

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some apalling spaghetti. This is actual code from dtoa :


    if (k &= 0x1f) {
        k1 = 32 - k;
        z = 0;
        do {
            *x1++ = *x << k | z;
            z = *x++ >> k1;
            }
            while(x < xe);
        if ((*x1 = z))
            ++n1;
        }
#else
    if (k &= 0xf) {
        k1 = 16 - k;
        z = 0;
        do {
            *x1++ = *x << k  & 0xffff | z;
            z = *x++ >> k1;
            }

holy order-of-operations reliance batman! Jeebus.

To build dtoa I use :

#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions (always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

Here's my timing on random large (20-300 decimal digits) strings :


strtod dtoa.c
ticks = 353

strtod mine
ticks = 267

MSVC 2005 CRT atof
ticks = 1921

MUSL floatscan
ticks = 11445

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.

The MUSL floatscan slowness is mmrmm. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.

Tips for using Oodle as Efficiently as Possible

Tips for using Oodle as efficiently and robustly as possible. (wrst your game runtime loading + decoding only).

1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from being great performance, they have the most robust and well-tested fuzz safety.

2. Pass FuzzSafe_Yes to the OodleLZ_Decompress option. The KMS decoders are always internally fuzz safe. What the FuzzSafe_Yes option is add some checks to the initial header decode which will cause the decode to fail if the header byte has been tampered with to change it to a non-KMS compressor.

3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no init, and will do no allocations to decode (if you pass in the memory).

4. Disable Oodle's callbacks :


    OodlePlugins_SetAllocators(NULL,NULL);
    OodlePlugins_SetAssertion(NULL);
    OodlePlugins_SetPrintf(NULL);

(note that disabling the allocators only works because you are doing decoding only (no encoding) and because you do #5 below - pass in scratch memory for the decoder)

5. Because you disabled Oodle's access to an allocator, instead pass in the memory needed. You can ask the data for the compressor if you don't always know it.


    compressor = OodleLZ_GetChunkCompressor(in_comp,NULL);
                                    
    decoderMemorySize = OodleLZDecoder_MemorySizeNeeded(compressor,in_raw_length);
    decoderMemory = malloc( decoderMemorySize );

6. If possible keep the Decoder memory around or pull from some shared scratch space rather than allocating & freeing it every time.

7. If you're on a platform where Oodle is in a DLL or .so , sign it with a mechanism like authentisign to ensure it isn't tampered with.

8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are maximally using the IO bus and the CPU at the same time. If loading and decoding large buffers, consider running the decoders "ThreadPhased".

9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes. The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously, rather than trying ti multi-thread a single encode/decode.

10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors only when they get big size wins, and perhaps tune the compression to the game's latency needs - random access data that needs to get loaded as fast as possible, use faster decoders, background data that streams in slowly can use slower decoders. Hydra is the best way to do this, it will automatically measure decode speed and tune to maximize the space-speed tradeoff.

11. Don't load tiny buffers. Combine them into larger paging units.

12. Consider in-memory compressed data as a paging cache.

As always, feel free to contact me and ask questions.

old rants