4/18/2018

Race in EOF on Windows Pipes

There appears to be a race in the EOF check on Windows pipes. I haven't seen this written about elsewhere, so here it is.

TLDR : don't call eof on pipes in Windows! Use read returning zero instead to detect eof.

We observed that semi-randomly pipes would report EOF too soon and we'd get a truncated stream. (this is not the ctrl-Z issue ; of course you have to make sure your pipe is set to binary mode).

To reproduce the issue you may use this simple piper :


// piper :

#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <io.h>

int main(int argc,char *argv[])
{
    int f_in  = _fileno(stdin);
    int f_out = _fileno(stdout);
    // f_in & out are always 0 and 1

    _setmode(f_in,  _O_BINARY);
    _setmode(f_out, _O_BINARY);

    char buf[4096];
    
    // NO : sees early eof!
    //  don't ask about eof until _read returns 0
    while ( ! _eof(f_in) )
    // YES : just for loop here
    //for(;;)
    {
        int n = _read(f_in,buf,sizeof(buf));
        if ( n > 0 )
        {
            _write(f_out,buf,n);
        }
        else if ( n == 0 )
        {
            // I think eof is always true here :
            if ( _eof(f_in) )
                break;
        }
        else
        {
            int err = errno;
            
            if ( err == EAGAIN )
                continue;
                
            // respond to other errors?
            
            if ( _eof(f_in) )
                break;
        }
    }

    return 0;
}

the behavior of piper is this :

vc80.pdb                 1,044,480

redirect input & output :

piper.exe < vc80.pdb > r:\out

r:\out                      1,044,480

consistently copies whole stream, no problems.

Now using a pipe :

piper.exe < vc80.pdb | piper > r:\out

r:\out                         16,384
r:\out                         28,672
r:\out                         12,288

semi-random output sizes due to hitting eof early

If the eof check marked with "NO" in the code is removed, and the for loop is used instead, piping works fine.

I can only guess at what's happening, but here's a shot :

If the pipe reader asks about EOF and there is nothing currently pending to read in the pipe, eof() returns true.

Windows anonymous pipes are unbuffered. That is, they are lock-step between the reader & writer. When the reader calls read() it blocks until the writer puts something, and vice-versa. The bytes are copied directly from writer to reader without going through an internal OS buffer.

In this context, what this means is if the reader drains out everything the writer had to put, and then races ahead and calls eof() before the writer puts anything, it sees eof true. If the writer puts something first, it sees eof false. It's just a straight up race.


time    reader                          writer
-----------------------------------------------------------
0                                       _write blocks waiting for reader
1       _eof returns false
2       _read consumes from writer
3                                       _write blocks waiting for reader
4       _eof returns false
5       _read consumes from writer
6       _eof returns true
7                                       _write blocks waiting for reader

at times 1 and 4, the eof check returns false because the writer had gotten to the pipe first. At time 6 the reader runs faster and checks eof before the writer can put anything, and now it sees eof as true.

As a check of this hypothesis : if you add a Sleep(1) call immediately before the _eof check (in the loop), there is no early eof observed, because the writer always gets a chance to put data in the pipe before the eof check.

Having this behavior be a race is pretty nasty. To avoid the problem, never ask about eof on pipes in Windows, instead use the return value of read(). The difference is that read() blocks on the writer process, waiting for it to either put some bytes or terminate. I believe this is a bug. Pipes should never be returning EOF as long as the process writing to them is alive.
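To be concrete, the fix is just the "YES" variant marked in the code above : a plain for(;;) loop where the only end-of-stream signal is _read() returning zero (same f_in / f_out / buf as in piper) :


for(;;)
{
    int n = _read(f_in,buf,sizeof(buf));
    if ( n > 0 )
    {
        _write(f_out,buf,n);
    }
    else if ( n == 0 )
    {
        break; // writer closed its end of the pipe : true EOF
    }
    else
    {
        if ( errno == EAGAIN )
            continue;

        break; // some other error
    }
}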

The Kraft Number, Binary Arithmetic, and Incremental Package Merge

The Kraft number that we compute for length limited Huffman codelen construction can be thought of as a set of bit flags that tell you which codelens to change.

Recall


K = Sum{ 2^-L_i }

K <= 1 is prefix codeable

you can think of K as the sum of effective coding probabilities (P = 2^-L), so K > 1 means a total probability over 100%.

when we initially make Huffman codelens and then apply a limit, we use too much code space. That corresponds to K > 1.

If you write K as a binary decimal, it will be something like :


K = 1.00101100

Those 1 bits that are below K = 1.0 are exactly the excess codelens that we need to correct.

That is, if you have an extra symbol of length L, then K is too high by 2^-L ; that's a 1 bit at position L to the right of the binary point.


codelen set of {1,2,2,3}

a : len 1
b : len 2
c : len 2
d : len 3

K = 2^-1 + 2^-2 + 2^-2 + 2^-3

K = 0.100 +
    0.010 +
    0.010 +
    0.001

K = 1.001

take (K-1) , the part below the decimal :

K-1 = .001

we have an excess K of 2^-3 ; that's one len 3 too many.

To fix that we can change a code of len 2 to len 3

that does 

-= 0.010
+= 0.001

same as

-= 0.001

That is, when (K-1) has 2 leading zeros, you correct it by promoting a code of len 2 to len 3

so if we compute K in fixed point (integer), we can just take K - one and do a count leading zeros on it, and it tells you which code len to correct. The bits that are on in K tell us exactly what's wrong with our code.
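Here's a little sketch of that (my illustration, not from any particular codebase) : compute K in fixed point with one = 1<<limit (each symbol of len L contributes one>>L), then the highest set bit of (K - one) -- equivalently a count leading zeros on the fractional bits -- gives the codelen to correct :


#include <stdint.h>

// assumes K > one ; returns L such that the top excess term in K is 2^-L ,
// so a fix is to promote some code from len (L-1) to len L
static int kraft_excess_codelen(uint32_t K, int limit)
{
    uint32_t one = (uint32_t)1 << limit;
    uint32_t excess = K - one;
    int hb = 0;                             // index of the highest set bit of excess
    while ( (excess >> (hb+1)) != 0 ) hb++;
    return limit - hb;                      // bit hb of the fixed point is worth 2^-(limit-hb)
}

With the {1,2,2,3} example and limit = 3 : one = 8, K = 4+2+2+1 = 9, excess = 1, highest bit is bit 0, so L = 3 ; promote a len 2 code to len 3, exactly as above.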

Now, similarly, we can think of the compound operations in the same way.

Any time we need to do a correction of K by 2^-L we could do one code from (L-1) -> L , or we could do two codes from L -> (L+1), or ... that can be seen as just an expression of the mathematics of how you can add bits together to make the desired delta.

That is :


You want 0.001 (increment codelen 2 to 3)

that can be :

0.001 = 0.0001
       +0.0001

(instead increment two codes at the next longer len, 3 to 4)

or :

0.001 = 0.01
       -0.001

increment a lower codelen then decrement one at your level

Now, thinking about it this way we can try to enumerate all the possible moves.

To reduce the space of all possible moves, we need a few assumptions :

1. No non-len-group changing moves are profitable. That is, the set of symbols for the current len groups are the best possible set. eg. it's not profitable to do something like { symbol a from len 2 to 3 and symbol b from len 3 to 2 } . If there are any profitable moves like that, do them separately. What this means is the counts are sorted; eg. if a symbol is at a higher codelen, its count is less than or equal to the count of any symbol at a lower codelen.

2. I only need to enumerate the moves that can be the cheapest (in terms of total code len).

In that case I think that you can enumerate all the moves thusly :


a change of 2^-L can be accomplished via

inc(L-1)   (that means increment a codelen at L-1)

inc(L) can be substituted with { inc(L+1), inc(L+1) }

And each of those inc(L+1) can also be substituted with a pair at L+2.
You take either a codelen at L or two at the next higher len, 
whichever has a smaller effect on CL (total codelen).

a change of 2^-L can also be accomplished via :

inc(L-2) and dec(L-1)

OR

inc(L-3) and dec(L-2) and dec(L-1)

again these can be understood as binary decimals :

0.0001 = 0.0010 - 0.0001
0.0001 = 0.0100 - 0.0011
0.0001 = 0.1000 - 0.0111

and finally the decs are also a tree of pairs :

dec(L) can instead be { dec(L+1), dec(L+1) }

this seems like a lot, but compared to all possible ways to make the number X from adds & subtractions of any power of two, it's quite small. The reason we can consider such a reduced set of moves is because we only need the one best way to toggle a bit of K, not all possible ways.

Really we just do :


enumerate the position of the lowest codelen to inc
between 1 and (L-1)

decrement at all codelens below the one you incremented
down to (L-1)

this produces the desired change in K of 2^-L

each "inc" & "dec" can either be at that code len, or a pair of next codelen

(somebody smarter than me : prove that these are in fact all the necessary moves)

Let's look at how dynamic programming reduces the amount of work we have to do.


Say we need to do an inc() at L = 3

(inc means increment a codelen, that decreases K by 2^-4)

We can either increment a single symbol at L = 3
or a pair from L = 4

(this is just the same kind of operation you do in normal Huffman tree building)

The best symbol at L = 3 is just the lowest count symbol (if any exists)

Ask for the best two nodes at L = 4

Those can also be a single symbol, or a pair from L = 5

When you ask for the first node at L = 4, it gets the best two at L = 5

but then imagine the single symbol at L = 4 was lower count and is taken

Now you ask for the second node at L = 4, it again needs the best two at L = 5

we already have them, no more work is needed.

Any time you choose a symbol rather than a pair of higher len, the 2^n tree stops growing.

[3] -> { [4a], [4b] }

[4a] -> sym(4) or { [5a], [5b] }

[4a] takes sym(4)

[4b] -> sym2(4) or { [5a], [5b] }

[4b] doesn't need a new evaluation at level 5

Another issue I need to mention is that as you increment and decrement codelens, they move between the lists, so the cached dynamic programming lists cannot be retained, or can they? (for example, you want to keep the symbols at each len sorted by count)

In fact the accounting for symbols moving is simple and doesn't need to invalidate the cached lists.


When you do an inc(L) , that symbol moves to L+1 and is now available for a further inc(L+1)

(this does not occur with dec(L) since it moves in the opposite direction)

Say you wanted an inc(3) ; you consider doing a pair of { inc(4), inc(4) }

One of the inc(4)'s can be a pair of inc(5)'s , and one of those len 5 symbols can be the one you did inc 4 on.

That is, say you have 'A' at len 4 and 'B' at len 5

inc(3) <- { inc( 'A' @ 4 ) , { inc( 'A' @ 5 ) , inc( 'B' @ 5 ) } }

This is a legal move and something you have to consider.

But the rule for it is quite simple - if a symbol occurs earlier in the list of chosen increments, it is
available at the next level.

If you're familiar with the way Package Merge makes its lists, this is quite similar. It just means when you choose the lowest count symbol at the current level, you can also draw from the previous increments in your list if they have lower count.

These queues we are building are exactly the same thing you would need to do the full package merge algorithm. The difference is, in traditional Package Merge you would start with all the symbols at codelen = max (K is too low), and then incrementally apply the best decrements to increase K. Here we are starting with K pretty close to 1 , with K greater than 1. The result is that in many cases we can do far fewer package merge steps. I call this Incremental Package Merge. It allows you to start from a nearly-Kraft set of codelens and get to the same optimal solution as if you did full package merge.

Let's look at a concrete example or two :


codelen : symbol+count

len 3 : a+7
len 4 : b+3 , c+3
len 5 : d+1 , e+1

you need an inc(3) to get K -= 2^-4

you can :

inc(a) ; cost 7
inc(b + c) ; cost 6
inc(b + { d + e } ) ; cost 5

The best choice is the pair of inc(4)'s , realized as {b, {d,e}} , total cost 5

Another example :

len 3 : a+7
len 4 : b+2
len 5 : c+1

again say you want an inc(3)

the best is

inc( b + { b + c } ) ; cost 5

here the best option is to inc b twice

And finally let's think again about how Package Merge is related to the "numismatic" or "coin collector problem".

If you play with this you will see what we're really doing is working on a two-cost accountancy. Each symbol has a cost in K which is determined only by its current codelen (2^-L). It also has a cost in total codelen CL which is determined only by its symbol count. We are trying to pay off a debt in K (or spend a credit in K) and maximize the value we get in CL.

4/04/2018

Engel Coding and Length Limited Huffman Heuristic Revisited

Joern Engel has written about a technique he described to me for forming prefix code lengths here on his blog.

Engel Coding is a fast/approximate way of forming length limited prefix code lengths.

I'm going to back up first and remind us what a prefix code is and the use of the Kraft inequality.

We want to entropy code some alphabet using integer code lengths. We want those codes to be decodeable without side information, eg. by only seeing the bits themselves, not transmitting the length in bits.

0  : a
10 : b
11 : c
is a prefix code. If you see the sequence "11100" it can only decode to "11=c,10=b,0=a".
0  : a
1  : b
11 : c
is not a prefix code. Any occurrence of "11" can either be two b's or one c. They can't be resolved without knowing the length of the code. There is no bit sequence you can possibly assign to c that works. The fact that this is impossible is a function of the code lengths.

You can construct a prefix code from code lengths if and only if the lengths satisfy the Kraft inequality :


Sum{ 2^-L_i } <= 1

It's pretty easy to understand this intuitively if you think like an arithmetic coder. 2^-L is the effective probability for a code length, so this is just saying the probabilities must sum to <= 100%

That is, think of the binary code as dividing the range [0,1] like an arithmetic coder does. The first bit divides it in half, so a single bit code would take half the range. A two bit code takes half of that, so a quarter of the original range, eg. 2^-L.

The two numbers that we care about are the Kraft code space used by the code lengths, and the total code length of the alphabet under this encoding :


Kraft code space :

K = Sum{ 2^-L_i }

Total code length :

CL = Sum{ C_i * L_i }

L_i = code length in bits of symbol i
C_i = count of symbol i


Minimize CL subject to K <= 1 (the "Kraft inequality")

We want the minimum total code length subject to the prefix code constraint.

The well known solution to this problem is Huffman's algorithm. There are of course lots of other ways to make prefix code lengths which do not minimize CL. A famous historical one is Shannon-Fano coding, but there have been many others, particularly in the early days of data compression before Huffman's discovery.

Now for a length-limited code we add the extra constraint :


max{ L_i } <= limit

now Huffman's standard algorithm can't be used. Again the exact solution is known; to minimize CL under the two constraints of the Kraft inequality and the maximum codelength limit, the algorithm is "Package Merge".

In Oodle we (uniquely) actually use Package Merge at the higher compression levels, but it is too slow and complicated to use when you want fast encoding, so at the lower compression levels we use a heuristic.

The goal of the heuristics is to find a set of code lengths that satisfy the constraints and get CL reasonably close to the minimum (what Package Merge would find).

The Oodle heuristic works by first finding the true Huffman code lengths, then if any are over the limit, they are changed to equal the limit. This now violates the Kraft inequality (they are not prefix decodeable), so we apply corrections to get them to K = 1. ZStd uses a similar method (and I imagine lots of other people have in the past; this is pretty much how length-limited near-Huffman is done). My previous post on the heuristic length limited code is below, with some other Huffman background :

cbloom rants: 07-03-10 - Length-Limitted Huffman Codes Heuristic
cbloom rants 07-02-10 - Length-Limitted Huffman Codes
cbloom rants 05-22-09 - A little more Huffman
cbloom rants 08-12-10 - The Lost Huffman Paper
cbloom rants Huffman Performance
cbloom rants Huffman Correction

(Engel correctly points out that most of the places where I say "Huffman coding" I should really be saying "prefix coding". The decoding methods and canonical code assignment and so on can be done with any prefix code. A Huffman code is only the special case of a prefix code with optimal lengths. That is, Huffman's algorithm is only the part about code length assignment; the rest is just prefix coding.)

So Engel's idea is : if we're going to limit the code lengths and muck them up with some heuristic anyway, don't bother with first finding the optimal non-length-limited Huffman code lengths. Just start with heuristic code lengths.

His heuristic is (conceptually) :


L_i = round( log2( P_i ) )

which is intuitively a reasonable place to start. If your code lengths didn't need to be an integer number of bits, then you would want them to be as close to log(P) as possible.

Then apply the limit and fix the lengths to satisfy the Kraft inequality. Note that in this case the tweaking of lens to satisfy Kraft is not just caused by lens that exceed the limit. After the heuristic codelens are made, even if they are short, they might not be Kraft. eg. you can get code lengths like { 2,3,3,3,3,3,4,4,4 } which are not prefix (K = 2^-2 + 5*2^-3 + 3*2^-4 = 1.0625 > 1 ; one of the 3's needs to be changed to a 4). The idea is that unlike Huffman or Shannon-Fano which explicitly work by creating a prefix code by construction, Engel coding instead makes code lengths which could be non-prefix and relies on a fix up phase.

When Joern told me about this it reminded me of "Polar Coding" (Andrew Polar's, not the more common use of the term for error correction). Andrew Polar's code is similar in the sense that it tries to roughly assign log2(P) codelens to symbols, and then uses a fix-up phase to make them prefix. The details of the heuristic are not the same. (I suspect that there are lots of these heuristic entropy coders that have been invented over the years and usually not written down).

Obviously you don't actually want to do a floating log2; for the details of Engel's heuristic see his blog.

But actually the details of the initial codelen guess are not very important to Engel coding. His codelen adjustment phase is what actually determines the codelens. You can start the codelens all at len 1 and let the adjustment phase do all the work to set them, and in fact that gives the same final codelens!

I tried four methods of initial codelen assignment and they all produced the exact same final codelens. The only difference is how many steps of the iterative refinement were needed to get them to Kraft equality.


all initial codelens = 1    : num_adjustment_iterations = 2350943

codelens = floor( log2(P) ) : num_adjustment_iterations = 136925

codelens = round( log2(P) ) : num_adjustment_iterations = 25823

Engel Coding heuristic      : num_adjustment_iterations = 28419

The crucial thing is how the refinement is done.

To get to the refinement, let's go over some basics. I'll first describe the way we were actually doing the length limit heuristic in Oodle (which is not the same as what I described in the old blog post above).

In the Oodle heuristic, we start with Huffman, then clamp the lens to the limit. At this point, the Kraft K is too big. That is, we are using more code space than we are allowed to. We need to raise some codelens somewhere to free up more code space. But raising codelens increases the total codelen (CL). So the goal is to bump up some codelens to get K = 1, with a minimum increase to CL.


If you do L_i ++

K -= 2^(-L_i)
K += 2^(-(L_i+1))

for a net change of :

K -= 2^(-(L_i+1))

(shorter codelen symbols make a bigger change to K)

and CL does :

CL -= C_i * L_i
CL += C_i * (L_i + 1)

net :

CL += C_i

(lower count symbols hurt total codelen less)

K_delta_i = 2^(-(L_i+1))
CL_delta_i = C_i

To get under the K budget, we want to find the lowest CL_delta with the maximum K_delta. That is, code space (K) is the resource you want to buy, and code len (CL) is the currency you use to pay for that code space. You want the best price :

price = CL_delta / K_delta

price = C_i * 2^(L_i+1)

What I was doing in Oodle was taking the step with the best "price" that didn't overshoot the target of K = 1.

If your symbols are sorted by count (which they usually are for Huffman codelen algorithms), then you don't need to compute "price" for all your symbols. The minimum price will always occur at the lowest count (first in the sorted list) at each codelen. So rather than making a full heap of up to 256 symbols (or whatever your alphabet size is), you only need a heap of the 16 (or whatever codelen limit is) lowest count symbols at each codelen.
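For concreteness, here is a sketch of that monotonic correction loop (my illustration only : it does a brute force scan over all symbols each step rather than the per-codelen heap just described, and the array names are made up) :


#include <stdint.h>

// K and one are in fixed point : one = 1<<limit
// sym_count[i] = count of symbol i , sym_len[i] = its current codelen
static void kraft_fixup_monotonic(uint32_t * pK, int limit, int num_syms,
                                  const uint32_t * sym_count, int * sym_len)
{
    uint32_t one = (uint32_t)1 << limit;
    while ( *pK > one )
    {
        int best = -1;
        uint64_t best_price = (uint64_t)-1;
        for (int i = 0; i < num_syms; i++)
        {
            if ( sym_len[i] >= limit ) continue;         // can't lengthen past the limit
            uint32_t K_delta = one >> (sym_len[i] + 1);  // 2^-(L_i+1) in fixed point
            if ( *pK - K_delta < one ) continue;         // would overshoot K = 1 ; skip
            // price = CL_delta / K_delta = C_i * 2^(L_i+1)
            uint64_t price = (uint64_t)sym_count[i] << (sym_len[i] + 1);
            if ( price < best_price ) { best_price = price; best = i; }
        }
        if ( best < 0 ) break;       // no non-overshooting step available ; bail out
        sym_len[best]++;             // lengthen the chosen code by one
        *pK -= one >> sym_len[best];
    }
}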

The big improvement in Engel's refinement heuristic is that it allows overshooting K if the absolute value of the distance to K decreases.

Consider K in fixed point with a 12 bit codelen limit. Then "one" is at K = 4096. Say you had K = 4099. It's 3 too big. My heuristic could only consider K steps of -=2 and -=1 (only power of two steps are possible). Engel can also take a step of -= 4 , changing K to 4095. It's now 1 too small (codeable but wasteful) and rather than increasing codelens to fit in the code space, we can decrease a symbol codelen somewhere to gain some total codelen.

Engel converges to K = one by (possibly) taking successively smaller overshooting steps, so K wiggles around the target, delta going positive & negative. This does not always converge, so a bail out to a simpler approach is needed. This overshooting lets it get to K by doing a combination of positive and negative steps (eg. 3 = 4 - 1 , not just 3 = 1 + 2), which is a little bit of a step towards Package Merge (the difference being that package merge finds the cheapest path to get the desired sum, while Engel's heuristic is greedy, taking the single cheapest step each time).
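The acceptance test for an overshooting step is tiny (a sketch, same fixed point convention : one = 1<<limit) :


// accept a step that lowers K by 'step' only if it moves K strictly closer to one
// eg. K = 4099, one = 4096 : a step of 4 is accepted because |4095-4096| < |4099-4096|
static int accept_step(int K, int one, int step)
{
    int before = K - one;          if ( before < 0 ) before = -before;
    int after  = (K - step) - one; if ( after  < 0 ) after  = -after;
    return after < before;
}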

In practice this turns out to be much better than only taking non-overshooting steps.

Time to look at the results on some real data :


"seven" test set, cut into 64k chunks, order 0 entropy coded

comparing code len to package merge (ideal)

The length in excess (percent) is :

excess percent = (codelen(heuristic) - codelen(packagemerge))*100/codelen(packagemerge)

Huffman then limit monotonic      mean : 0.027% max : 2.512%
Huffman then limit overshooting   mean : 0.002% max : 0.229%
Engel coding                      mean : 0.016% max : 10.712%

(codelen limit of 12, 256 symbol byte alphabet)

The heuristic (overshooting) limit is really very good, extremely close to package merge and even the maximum excess len is small. Engel coding (non-Huffman initial code lengths) works fine on average but does have (very rare) bad cases. This is not surprising; there's a reason we use Huffman's algorithm to get the code lengths right.

In that bad 10.71% excess case, the package merge average code len is 1.604 but Engel coding produces an average code length of 1.776

Note that many of the blocks in this test did not hit the codelen limit; in that case "Huffman then limit" produces the best possible codelens, but Engel coding might not.

For most applications, it's probably best to make the true Huffman code and then limit the lengths with a heuristic. The time saved from the approximate initial code lengths is pretty small compared to other operations needed to do entropy coding (histogramming for example is annoyingly slow). Nevertheless I found this technique to be an interesting reminder to keep an open mind about approximations and understand where our algorithms came from and why we use the ones we do.


Another thing I find interesting is how Engel Coding points back to Package Merge again.

First there's the fact that you can start Engel Coding with just all the codelens set at 1 , and let the Kraft fixup make the codelens. That's how Package Merge works. It starts all codelens at 1, and then increments the cheapest ones until the Kraft constraint is satisfied. The log2ish starting guess for Engel Coding is just a way of jump-starting the codelens closer to the final answer to avoid lots of steps.

Engel Coding's overshooting heuristic improves on the monotonic heuristic by allowing you to take some +- steps. That is, increment one codelen and decrement another. In particular it can do things like : rather than increment a len 3 codelen , instead increment a len 2 codelen and decrement a len 3 codelen. This is the kind of move that you need to make to get to real optimal code lens.

The key missing element is considering the costs of all possible steps and finding a path to the desired K. That is, Engel coding takes greedy (locally cheapest price) steps, which may not give the optimal path overall. The way to turn this greedy algorithm into an optimal one is dynamic programming. Lo and behold, that's what Package Merge is.

3/26/2018

Compression of Android Game APK Packages with Oodle

This is a look at compression of Android Game APK Packages. In this study I'm mainly looking at the issue of transmission of the APK to the client, not storage on the client and decompression at runtime. The key difference between the two is that for transmission over the network, you want to compress the package as a "tar", that is without division into files, so that the compressor can use cross-file correlation. For storage on disk and runtime loading, you might want to store files individually (or perhaps combined and/or split into paging units), and you might want some files uncompressed.

The Android APK package is just a zip (thanks to them for just using zip and not changing the header so that it can be easily manipulated with standard tools).

I chose the list of games from this article : Google's instant app tech now lets you try games before you buy which is : Clash Royale, Words With Friends 2, Bubble Witch 3 Saga, Final Fantasy XV: A New Empire, Mighty Battles and -- of course -- Solitaire

I discovered that "Mighty Battles" internally contains a large pre-compressed pak file. (it's named "content.mp3" but is not really an mp3, it's some sort of compressed archive. They use the mp3 extension to get the APK package system to store it without further zlib compression.) Because of that I excluded Mighty Battles from the test; it would be about the same size with every compressor, and is not reflective of how it should be compressed (their internal package should be uncompressed if we're testing how well the outer compressor does). Later I also saw that "Clash Royale" is also mostly pre-compressed content. Clash Royale has its content in ".sc" files that are opaque compressed data. I left it in the test set, but it should also have those files uncompressed for real use with an outer compressor. I wasn't sure which Solitaire to test; I chose the one by Zynga.

The "tar" is made by unpacking the APK zip and concatenating all the files together. I also applied PNGz0 to turn off zlib compression on any PNGs. I then tested various compressors on the game tars.

                        original            tar           zlib      Leviathan
BubbleWitch3          78,032,875    304,736,621     67,311,666     54,443,823
ClashRoyale          101,702,690    124,031,098     98,386,824     93,026,161
FinalFantasyXV        58,933,554    144,668,500     57,104,802     41,093,459
Solitaire             14,814,888    139,177,140     14,071,999      8,337,863
WordsWithFriends2     78,992,339    570,621,614     78,784,623     53,413,494
total                332,476,346  1,283,234,973    315,659,914    250,314,800

original  = size of the source APK (per-file zip with some files stored uncompressed)
tar       = unzipped files, with PNGz0, concatenated together
zlib      = zip -9 applied to the tar ; slightly smaller than original
Leviathan = Oodle Leviathan level 8 (Optimal4) applied to the tar
You can see that Clash Royale doesn't change much because it contains large amounts of pre-compressed data internally. The other games all get much smaller with Leviathan on a tar (relative to the original APK, or zlib on the tar). eg. BubbleWitch3 was 78 MB, Leviathan can send it in 54.4 MB ; Solitaire can be sent in almost half the size.

Leviathan is very fast to decode on ARM. Despite getting much more compression than zlib, it is faster to decode. More modern competitors (ZStd, brotli, LZMA) are also slower to decode than Leviathan on ARM, and get less compression.

For reference, here is the performance on this test set of a few compressors (speeds on Windows x64 Core i7-3770) :

Note that some of the wins here are not accessible to game developers. When a mobile game developer uses Oodle on Android, they can apply Oodle to their own content and get the size and time savings there. But they can't apply Oodle to their executable or Java files. The more they shrink their own content, the larger the proportion of the APK that is made up of files they can't compress. To compress all the content in the APK (both system and client files, as well as cross-file tar compression) requires support from the OS or transport layer.

I'll also take this chance to remind clients that when using Oodle, you should always try to turn off any previous compression on your data. For example, here we didn't just try the compressors directly on the APK files (which are zip archives and have previous zlib compression), we first unpacked them. We then further took the zlib compression off the PNG's so that the outer compressors in the test could have a chance to compress that data better. The internal compressors used on Clash Royale and Mighty Battles should also have been ideally turned off to maximize compression. On the other hand, turning off previous compression does not apply to data-specific lossy compressors such as audio, image, and video codecs. That type of data should be passed through with no further compression.

3/25/2018

Improving the compression of block-compressed textures Revisited

The Oodle LZ compressors are especially good on binary data, such as the block compressed textures used in games.

(by "block compressed textures" I mean BCn, ETC1, ETC2, etc. textures in fixed size blocks for use with GPU's. I do *not* mean already compressed textures such as JPEG, PNG, or BCn that has already been compressed with crunch. You should not be applying Oodle or any other generic compressor on top of already compressed textures of that type. If you have a lot of PNG data consider PNG without ZLib or look for the upcoming Oodle Lossless Image codec.)

See the Appendix at the bottom for a comparison of modern LZ compressors on BCn data. Oodle LZ gets more compression and/or much faster decode speeds on BCn data.

So you can certainly just create your texture as usual (at maximum quality) and compress it with Oodle. That's fine and gives you the best visual quality.

If you need your texture data to be smaller for some reason, you can use a data-specific lossy compressor like crunch (or Basis), or you could use RDO texture creation followed by Oodle LZ compression.

(I've written about this before, here : Improving the compression of block-compressed textures , but I'm trying to do a rather cleaner more thorough job this time).

RDO texture creation is a modification of the step that creates the block compressed texture (BCn or whatever) from the original (RGBA32 or whatever). Instead of simply choosing the compressed texture blocks that minimize error, blocks are chosen to minimize rate + distortion. That is, sometimes larger error is intentionally chosen when it improves rate. In this case, we want to minimize the rate *after* a following LZ compressor. The block compressed textures always have the same size, but some choices are more compressible than others. The basic idea is to choose blocks that have some relation to preceding blocks, thereby making them more compressible. Common examples are trying to reuse selector bits, or to choose endpoints that match neighbors.
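In pseudo-C the per-block decision is just a rate-distortion score (purely illustrative : block_error and estimate_lz_bits are hypothetical stand-ins for whatever error metric and rate model the real tool uses) :


// sketch : choose the candidate encoding of one block that minimizes D + lambda*R

typedef struct { unsigned char bytes[8]; } Bc1Block;   // one 4x4 BC1 block

// hypothetical helpers :
double block_error(const Bc1Block * b, const unsigned char * src_rgba_4x4);
double estimate_lz_bits(const Bc1Block * b, const Bc1Block * prev_blocks, int num_prev);

int choose_rdo_block(const Bc1Block * candidates, int num_candidates,
                     const unsigned char * src_rgba_4x4,
                     const Bc1Block * prev_blocks, int num_prev,
                     double lambda)
{
    int best = 0;
    double best_score = 1e300;
    for (int i = 0; i < num_candidates; i++)
    {
        double D = block_error(&candidates[i], src_rgba_4x4);
        double R = estimate_lz_bits(&candidates[i], prev_blocks, num_prev);
        double score = D + lambda * R;   // larger lambda favors smaller (post-LZ) rate
        if ( score < best_score ) { best_score = score; best = i; }
    }
    return best;
}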

RDO encoding of block compressed textures should always be done from the original non-compressed version of the texture, *not* from a previous block compressed encoding. eg. don't take something already in BC1 and try to run RDO to shrink it further. Doing that would cause the errors to add up, a bit like taking a JPEG and lowering its "quality" setting to make it smaller - that should always be done from original data.

Now, block compressed textures are already lossy. BC1 is quite bad; BC7 and ASTC less so. So adding more error may not be acceptable at all. On the other hand, if large amounts of error are acceptable in a texture, it may be that its largest mip levels are never even visible. Sending mip levels that are too large and never visible is a *far* larger waste of size than anything we do here, so it's important to have a process in your game to find those textures and shrink them.

The best tool I know of at the moment to do RDO texture creation is crunch by Rich Geldreich / Binomial. I'm told that their newer Basis product has an improved RDO-for-LZ but I don't have a copy to test. What I actually run is Unity's improvement to crunch. The way you use it is something like :

crunch_x64_unity.exe -DXT1 -fileformat dds -file input.png -maxmips 1 -quality 200 -out output.dds
That is, tell it to make fileformat DDS, it will do normal block texture compression, but with rate-favoring decisions.

NOTE : we're talking about lossy compression here, which is always a little subtle to investigate because there are two axes of performance : both size and quality. Furthermore "quality" is hard to measure well, and there is no substitute for human eyes examining the images to decide what level of loss is acceptable. Here I am reporting "imdiff" scores with my "combo" metric. The "imdiff" scores are not like an RMSE; they're roughly on a scale of 0-100 where 0 = no difference and 100 = complete garbage, like a percent difference (though not really).

Some results :


act3cracked_colour
1024x1024

non-RDO fast BC1 : 524,416 bytes
then Leviathan   : -> 416,981
imdiff     : 33.26

crunch RDO quality 200 , then Leviathan : -> 354,203
imdiff     : 36.15

file size 85% of non-RDO
error 109% of non-RDO

adventurer_colour 
1024x1024

non-RDO fast BC1 : 524,416 bytes
then Leviathan   : -> 409,874
imdiff     : 32.96

crunch RDO quality 200 , then Leviathan : -> 334,342
imdiff     : 33.48

file size 81% of non-RDO
error 102% of non-RDO

Personally I like crunch's RDO DDS at these high quality levels, 200 or above. It introduces relatively little error and the file size savings are still significant.

At lower quality levels use of crunch can be problematic in practice. Unfortunately it's hard to control how much error it introduces. You either have to manually inspect textures for damage, or run an external process to measure quality and feed that back into the crunch settings. Another problem is that crunch's quality setting doesn't scale with texture size; smaller size textures get less error and larger size textures get more error at the same "quality" setting, which means you need to choose a quality setting per texture size. (I think the issue is that crunch's codebook size doesn't scale with texture size, which makes it particularly bad for textures at 2048x2048 or above, or for large texture atlases).

Your other option besides doing RDO texture creation followed by LZ is to just use crunch's native "crn" format for textures.

Let's compare RDO+LZ vs crn for size. I will do this by dialing the quality setting until they get the same imdiff "combo" score, so we are comparing a line of equal distortion (under one metric).


act3cracked_colour
1024x1024

crunch crn 255 : -> 211,465
imdiff     : 42.33

crunch rdo dds 95 : -> 264,206
imdiff     : 42.36


adventurer_colour
1024x1024

crunch crn 255 : -> 197,644
imdiff     : 38.48

crunch rdo dds 101 : -> 244,402
imdiff     : 38.67

The native "crn" format is about 20% smaller than RDO + LZ on both of these textures. It is to be expected that custom compressors, well designed for one type of data, should beat general purpose compressors. Note that comparing "crn" sizes to just doing BCn + LZ (without RDO) is not a valid comparison, since they are at different error levels.

If you look at the quality settings, the "crn" mode at maximum quality is still introducing a lot of error. That "quality" setting is not on the same scale for crn mode and dds mode. Maximum quality (255) in crn mode is roughly equivalent to quality = 100 in dds mode. Unfortunately there seems to be no way to get higher quality in the crn mode.

This has been an attempt to provide some facts to help you make a good choice. There are three options : BCn (without RDO) + LZ , RDO BCn + LZ, or a custom compressed texture format like CRN. They have different pros and cons and the right choice depends on your app and pipeline.

Now we haven't looked at decode speed in this comparison. I've never measured crunch's decode speed (of CRN format), but I suspect that Oodle's LZ decoders are significantly faster. Another possible speed advantage for Oodle LZ is that you can store your BCn data pre-swizzled for the hardware, which may let you avoid more CPU work. I should also note that you should never LZ decompress directly into uncached graphics memory. You either need to copy it over after decoding (which is very fast and recommended) or start the memory as cached for LZ decoding and then change it to uncached GPU memory after the decode is done.


Appendix : Performance of LZ compressors on BCn data

Comparison of Oodle to some other compressors on samples of game texture data.

Repeating the "Game BCn" test from Oodle 2.6.0 : Leviathan detailed performance report : A mix of BCn textures from a game (mostly BC1, BC4, and BC7) :

"Game BCn" :

lzma 16.04 -9           : 3.692 to 1 :   64.85 MB/s
brotli24 2017-12-12 -11 : 3.380 to 1 :  237.78 MB/s
zlib 1.2.11 -9          : 2.720 to 1 :  282.78 MB/s
zstd 1.3.3 -22          : 3.170 to 1 :  485.97 MB/s

Kraken8                 : 3.673 to 1 :  880.99 MB/s
Leviathan8              : 3.844 to 1 :  661.93 MB/s

A different set : "test_data\image\dds" is mostly BC1 with some BC3 and some RGBA32

test_data\image\dds :

lzma 16.04 -9           : 2.354 to 1 :   39.53 MB/s
brotli24 2017-12-12 -11 : 2.161 to 1 :  161.40 MB/s
zlib 1.2.11 -9          : 1.894 to 1 :  222.70 MB/s
zstd 1.3.3 -22          : 2.066 to 1 :  443.96 MB/s

Kraken8                 : 2.320 to 1 :  779.84 MB/s
Leviathan8              : 2.386 to 1 :  540.90 MB/s
(note this is lzma with default settings; lzma with settings tweaked for BCn can sometimes get more compression than Leviathan)

While brotli and ZStd are competitive with Kraken's compression ratio on text (and text-like) files, they lag behind on many types of binary data, such as these block compressed textures.

3/19/2018

Oodle Data Compression Integration for Unreal Engine 4

We have created an integration of Oodle Data Compression for Unreal Engine 4. You can now use Oodle's amazing Kraken, Leviathan, and other codecs to compress your Unreal game data smaller and load it faster.

Once installed in your source tree, the Oodle Data Compression integration is transparent. Simply create compressed pak files the way you usually do, and instead of compressing with zlib they will be compressed with Oodle. At runtime, the engine automatically decodes with Oodle or zlib as specified in the pak.

Oodle can compress game data much smaller than zlib. Oodle also decodes faster than zlib. With less data to load from disk and faster decompression, you speed up data loading in two ways.

For example, on the ShooterGame sample game, the pak file sizes are :


ShooterGame-WindowsNoEditor.pak

uncompressed :       1,131,939,544 bytes

Unreal default zlib :  417,951,648   2.708 : 1

Oodle Kraken 4 :       372,845,225   3.036 : 1

Oodle Leviathan 8 :    330,963,205   3.420 : 1

Oodle Leviathan makes the ShooterGame pak file 87 MB smaller than the Unreal default zlib compression!

NOTE this is new and separate from the Oodle Network integration which has been in Unreal for some time. Oodle Network provides compression of network packets to reduce bandwidth in networked games.

The Oodle Data Compression integration is available for Unreal 4.18 and 4.19.

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.

3/06/2018

Oodle 2.6.0 : Some more perf comparisons

Here are some more perf comparisons with Oodle 2.6.0, this time with more non-Oodle compressors for reference.

As before, this is Windows x64 on a Core i7-3770, and we're looking at compression ratio vs. decode speed, not considering encode speed, and running all compressors in their highest compression level.

On the "seven" testset, compressing each file independently, then summing decode time and size for each file :

The loglog chart shows log compression ratio on the Y axis and log decode speed on the X axis (the numeric labels show the pre-log values). The upper right is the Pareto frontier.

The raw numbers are :

total      : lzma 16.04 -9           : 3.186 to 1 :   52.84 MB/s
total      : lzham 1.0 -d26 -4       : 2.932 to 1 :  149.57 MB/s
total      : brotli24 2017-12-12 -11 : 2.958 to 1 :  203.91 MB/s
total      : zlib 1.2.11 -9          : 2.336 to 1 :  271.61 MB/s
total      : zstd 1.3.3 -22          : 2.750 to 1 :  474.47 MB/s
total      : lz4hc 1.8.0 -12         : 2.006 to 1 : 2786.78 MB/s
total      : Leviathan8              : 3.251 to 1 :  675.92 MB/s
total      : Kraken8                 : 3.097 to 1 :  983.74 MB/s
total      : Mermaid8                : 2.846 to 1 : 1713.53 MB/s
total      : Selkie8                 : 2.193 to 1 : 3682.88 MB/s


Isolating just Kraken, Leviathan, and ZStd on the same test (ZStd is the closest non-Oodle codec), we can look at file-by-file performance :

The loglog shows each file with a different symbol, colored by the compressor.

The speed advantage of Oodle is pretty uniform, but the compression advantage varies by file type. Some files simply have more air that Oodle can squeeze out beyond what other LZ's find. For example if you only looked at enwik7 (xml/text), then all of the modern LZ's considered here (ZStd,Oodle,Brotli,LZHAM) would get almost exactly the same compression; there's just not a lot of room for them to differentiate themselves on text.


Runs on a couple other standard files :

On the Silesia/Mozilla file :

mozilla    : lzma 16.04 -9           : 3.832 to 1 :   63.79 MB/s
mozilla    : lzham 1.0 -d26 -4       : 3.570 to 1 :  168.96 MB/s
mozilla    : brotli24 2017-12-12 -11 : 3.601 to 1 :  246.68 MB/s
mozilla    : zlib 1.2.11 -9          : 2.690 to 1 :  275.40 MB/s
mozilla    : zstd 1.3.3 -22          : 3.365 to 1 :  503.44 MB/s
mozilla    : lz4hc 1.8.0 -12         : 2.327 to 1 : 2509.82 MB/s
mozilla    : Leviathan8              : 3.831 to 1 :  691.37 MB/s
mozilla    : Kraken8                 : 3.740 to 1 :  985.35 MB/s
mozilla    : Mermaid8                : 3.335 to 1 : 1834.49 MB/s
mozilla    : Selkie8                 : 2.793 to 1 : 3145.63 MB/s

On win81 :

win81      : lzma 16.04 -9           : 2.922 to 1 :   51.87 MB/s
win81      : lzham 1.0 -d26 -4       : 2.774 to 1 :  156.44 MB/s
win81      : brotli24 2017-12-12 -11 : 2.815 to 1 :  214.02 MB/s
win81      : zlib 1.2.11 -9          : 2.207 to 1 :  253.68 MB/s
win81      : zstd 1.3.3 -22          : 2.702 to 1 :  472.39 MB/s
win81      : lz4hc 1.8.0 -12         : 1.923 to 1 : 2408.91 MB/s
win81      : Leviathan8              : 2.959 to 1 :  757.36 MB/s
win81      : Kraken8                 : 2.860 to 1 :  949.07 MB/s
win81      : Mermaid8                : 2.618 to 1 : 1847.77 MB/s
win81      : Selkie8                 : 2.142 to 1 : 3467.36 MB/s


And again on the "seven" testset, but this time with the files cut into 32 kB chunks :

total      : lzma 16.04 -9           : 2.656 to 1 :   43.25 MB/s
total      : lzham 1.0 -d26 -4       : 2.435 to 1 :   76.36 MB/s
total      : brotli24 2017-12-12 -11 : 2.581 to 1 :  178.25 MB/s
total      : zlib 1.2.11 -9          : 2.259 to 1 :  255.18 MB/s
total      : zstd 1.3.3 -22          : 2.363 to 1 :  442.42 MB/s
total      : lz4hc 1.8.0 -12         : 1.849 to 1 : 2717.30 MB/s
total      : Leviathan8              : 2.731 to 1 :  650.23 MB/s
total      : Kraken8                 : 2.615 to 1 :  975.69 MB/s
total      : Mermaid8                : 2.455 to 1 : 1625.69 MB/s
total      : Selkie8                 : 1.910 to 1 : 4097.27 MB/s

By cutting into 32 kB chunks, we remove the window size disadvantage suffered by zlib and LZ4. Now all the codecs have the same match window, and the compression difference only comes from what additional features they provide. The small chunk also stresses startup overhead time and adaptation speed.

The Oodle codecs generally do even better (vs the competition) on small chunks than they do on large files. For example LZMA and LZHAM both have large models that really need a lot of data to get up to speed. All the non-Oodle codecs slow down more on small chunks than the Oodle codecs do.


Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.

Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK

Oodle 2.6.0 : Hydra and space-speed flexibility

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


One of the unique things about Oodle is the fact that its compressors are optimizing for a space-speed goal (not just for size), and the user has control over how that combined score is priced.

This is a crucial aspect of Leviathan. Oodle Leviathan considers many options for the encoded data, it rates those options for space-speed and chooses the bit stream that optimizes that goal. This means that Leviathan can consider slower-to-decode bit stream options, and only use them when they provide enough of a benefit that they are worth it.

That is, Leviathan decompression speed varies a bit depending on the file, as all codecs do. However, other codecs will sometimes get slower for no particular benefit. They may choose a slower mode, or simply take a lot more slow encoding options (short matches, or frequent literal-match switches), even if it isn't a big benefit to file size. Leviathan only chooses slower encoding modes when they provide a benefit to file size that meets the goal the user has set for space-speed tradeoff.

Each of the new Oodle codecs (Leviathan, Kraken, Mermaid & Selkie) has a different default space-speed goal, which we set to be near their "natural lambda", the place that their fundamental structure works best. Clients can dial this up or down to bias their decisions more for size or speed.
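Conceptually the score is just a Lagrangian combination of size and decode time (illustrative only, not Oodle's internal scoring ; "lambda" here is just some conversion of decode time into equivalent compressed bytes) :


// sketch : the encoder keeps whichever candidate encoding has the lowest score
static double space_speed_score(double compressed_bytes,
                                double decode_seconds,
                                double lambda_bytes_per_second)
{
    return compressed_bytes + decode_seconds * lambda_bytes_per_second;
}

Dialing the tradeoff toward speed corresponds to a larger lambda, so a slower-to-decode bit stream option has to buy proportionally more size reduction before it gets chosen.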

The flexibility of these Oodle codecs allows them to cover a nearly continuous range of compression ratio vs decode speed points.

Here's an example of the Oodle codecs targeting a range of space-speed goals on the file "TC_Foreground_Arc.clb" (Windows x64):

Now you may notice that at the highest compression setting of Kraken (-zs64) it is strictly worse than the fastest setting of Leviathan (-zs1024). If you want that speed-compression tradeoff point, then using Kraken is strictly wrong - you should switch to Leviathan there.

That's what Oodle Hydra does for you. Hydra (the many headed beast) is a meta compressor which chooses between the other Oodle compressors to make the best space-speed encoding. Hydra does not just choose per file, but per block, which means it can do finer grain switching to find superior encodings.

Oodle Hydra on the file "TC_Foreground_Arc.clb" (Windows x64):

When using Hydra you don't explicitly choose the encoder at all. You set your space-speed goal and you trust in Oodle to make the choice for you. It may use Leviathan, Kraken, or Mermaid, so you may get faster or slower decoding on any given chunk, but you do know that when it chooses a slower decoder it was worth it. Hydra also sometimes gets free wins; if you wanted high compression so you would've gone with Leviathan, there are cases where Kraken compresses nearly the same (or even does better), and switching down to Kraken is just a decode speed win for free (no compression ratio sacrificed).

Of course the disadvantage of Hydra is slow encoding, because it has to consider even more encoding options than Oodle already does. It is ideal for distribution scenarios where you encode once and decode many times.


Another way we can demonstrate Oodle's space-speed encoding is to disable it.

We can run Oodle in "max compression only" mode by setting the price of time to zero. That is, when it scores a decision for space-speed, we consider only size. (I never actually set the price of time to exactly zero; it's better to just make it very small so that ties in size are still broken in favor of speed; specifically we will set spaceSpeedTradeoffBytes = 1).

Here's Oodle Leviathan on the Silesia standard test corpus :

Leviathan default space-speed score (spaceSpeedTradeoffBytes = 256) :

total   : 211,938,580 ->48,735,197 =  1.840 bpb =  4.349 to 1
decode  : 264.501 millis, 4.25 c/B, rate= 801.28 MB/s

Leviathan max compress (spaceSpeedTradeoffBytes = 1) :

total   : 211,938,580 ->48,592,540 =  1.834 bpb =  4.362 to 1
decode  : 295.671 millis, 4.75 c/B, rate= 716.81 MB/s
By ignoring speed in decisions, we've gained 140 kB , but have lost 30 milliseconds in decode time.

Perhaps a better way to look at it is the other way around - by making good space-speed decisions, the default Leviathan setting saves 0.50 cycles per byte in decode time, at a cost of only 0.006 bits per byte of compressed size.


Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK

2/27/2018

Oodle 2.6.0 : Leviathan Rising

We're pleased to announce the release of Oodle 2.6.0 featuring the new Leviathan codec. Leviathan achieves high compression ratios (comparable to 7z/LZMA) with unparalleled fast decompression.

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


Almost two years ago, we released Oodle Kraken. Kraken roared onto the scene with high compression ratios and crazy fast decompression (over 1 GB/s on a Core i7 3770). The performance of Kraken immediately made lots of codecs obsolete. We could also see that something was wrong in the higher compression domain.

Kraken gets high compression (much more than zlib; usually more than things like RAR, ZStd and Brotli), but it gets a little less than 7z/LZMA. But to get that small step up in compression ratio, you had to accept a 20X decode speed penalty.

For example, on the "seven" testset (a private test set with a variety of data types) :

LZMA-9     :  3.18:1 ,    3.0 enc MB/s ,   53.8 dec MB/s
Kraken-7   :  3.09:1 ,    1.3 enc MB/s ,  948.5 dec MB/s
zlib-9     :  2.33:1 ,    7.7 enc MB/s ,  309.0 dec MB/s
LZMA gets 3% more compression than Kraken, but decodes almost 20X slower. That's not the right price to pay for that compression gain. There had to be a better way to take Kraken's great space-speed performance and extend it into higher compression without giving up so much speed.

It's easy to see that there was a big gap in a plot of decode speed vs compression ratio :

(this is a log-log plot on the seven test set, on a Core i7 3770. We're not looking at encode speed here at all, we're running the compressors in their max compression mode and we care about the tradeoff of size vs decode speed.)

We've spent the last year searching in that mystery zone and we have found the great beast that lives there and fills the gap : Leviathan.

Leviathan gets to high compression ratios with the correct amount of speed decrease from Kraken (about 33%). This means that even with better than LZMA ratios on the seven test set, it is still over 2X faster to decode than zlib :

Leviathan-7:  3.23:1 ,    0.9 enc MB/s ,  642.4 dec MB/s

LZMA-9     :  3.18:1 ,    3.0 enc MB/s ,   53.8 dec MB/s
Kraken-7   :  3.09:1 ,    1.3 enc MB/s ,  948.5 dec MB/s
zlib-9     :  2.33:1 ,    7.7 enc MB/s ,  309.0 dec MB/s

Leviathan doesn't always beat LZMA's compression ratio (they have slightly different strengths) but they are comparable. Leviathan is 7-20X faster to decompress than LZMA (usually around 10X).

Leviathan is ideal for distribution use cases, in which you will compress once and decompress many times. Leviathan allows you to serve highly compressed data to clients without slow decompression. We do try to keep Leviathan encode's time reasonable, but it is not a realtime encoder and not the right answer where fast encoding is needed.

Leviathan is a game changer. It makes high ratio decompression possible where it wasn't before. It can be used on video game consoles and mobile devices where other decoders took far too long. It can be used for in-game loading where CPU use needs to be minimized. It can be used to keep data compressed on disk with no install, because its decompression is fast enough to do on every load.

Leviathan now makes up a part of the Oodle Kraken-Mermaid-Selkie family. These codecs now provide excellent solutions over a wider range of compression needs.


Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK

Oodle 2.6.0 : Everything new and tasty

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


A quick run down of all the exciting new stuff in Oodle 2.6.0 :

1. Leviathan!
2. The Kraken, Mermaid & Selkie fast-level encoders are now much faster.
3. Kraken & Mermaid's optimal level encoders now get more compression.
4. Kraken & Mermaid have new bit stream options which allow them to reach even higher compression.
5. Kraken and Mermaid are now more tuneable to different compression ratios and decode speeds.


1. Leviathan!

Read the Leviathan detailed performance report.


2. The Kraken, Mermaid & Selkie fast-level encoders are now much faster.

The non-optimal-parsing encoder levels that are intended for realtime use in Oodle are levels 1-4 aka SuperFast, VeryFast, Fast & Normal.

Their decode speed was always best in class, but previously their encode speed was slightly off the Pareto frontier (the best possible trade off of encode speed vs compression ratio). No longer.

"win81" test file (Core i7-3770 Windows x64)

Oodle 255 :

Kraken1   :  2.33:1 ,   83.3 enc MB/s ,  911.8 dec MB/s
Kraken2   :  2.39:1 ,   68.5 enc MB/s ,  938.0 dec MB/s
Kraken3   :  2.51:1 ,   24.4 enc MB/s , 1005.5 dec MB/s
Kraken4   :  2.57:1 ,   17.3 enc MB/s ,  997.3 dec MB/s

Oodle 260 :

Kraken1   :  2.33:1 ,  135.1 enc MB/s ,  928.1 dec MB/s
Kraken2   :  2.41:1 ,   94.0 enc MB/s ,  937.4 dec MB/s
Kraken3   :  2.52:1 ,   38.3 enc MB/s , 1020.6 dec MB/s
Kraken4   :  2.60:1 ,   23.0 enc MB/s , 1022.9 dec MB/s

Oodle 255 :

Mermaid1  :  2.10:1 ,  106.4 enc MB/s , 2079.4 dec MB/s
Mermaid2  :  2.15:1 ,   78.9 enc MB/s , 2161.1 dec MB/s
Mermaid3  :  2.24:1 ,   26.4 enc MB/s , 2294.0 dec MB/s
Mermaid4  :  2.29:1 ,   26.6 enc MB/s , 2341.1 dec MB/s

Oodle 260 :

Mermaid1  :  2.14:1 ,  161.1 enc MB/s , 2012.1 dec MB/s
Mermaid2  :  2.22:1 ,  104.1 enc MB/s , 2007.7 dec MB/s
Mermaid3  :  2.29:1 ,   39.0 enc MB/s , 2181.7 dec MB/s
Mermaid4  :  2.32:1 ,   32.1 enc MB/s , 2294.0 dec MB/s

Oodle 255 :

Selkie1   :  1.76:1 ,  146.8 enc MB/s , 3645.0 dec MB/s
Selkie2   :  1.85:1 ,  100.5 enc MB/s , 3565.0 dec MB/s
Selkie3   :  1.98:1 ,   28.2 enc MB/s , 3533.2 dec MB/s
Selkie4   :  2.04:1 ,   28.9 enc MB/s , 3675.9 dec MB/s

Oodle 260 :

Selkie1   :  1.78:1 ,  181.7 enc MB/s , 3716.0 dec MB/s
Selkie2   :  1.87:1 ,  114.7 enc MB/s , 3595.7 dec MB/s
Selkie3   :  1.98:1 ,   42.1 enc MB/s , 3653.1 dec MB/s
Selkie4   :  2.02:1 ,   34.9 enc MB/s , 3818.3 dec MB/s
The speed of the fastest encoder (level 1 = "SuperFast") is up by about 60% in Kraken & Mermaid.

Kraken's encode speed vs ratio tradeoff is now competitive with ZStd, which has long set the standard in that domain. For example, matching Kraken1 against the closest comparable ZStd levels on the same machine :

at similar encode speed, you can compare the compression ratios :

Kraken1       :  2.33:1 ,  135.1 enc MB/s ,  928.1 dec MB/s
zstd 1.3.3 -4 :  2.22:1 ,  136   enc MB/s ,  595   dec MB/s

or you can look at equal file sizes and compare the encode speed :

Kraken1       :  2.33:1 ,  135.1 enc MB/s ,  928.1 dec MB/s
zstd 1.3.3 -6 :  2.33:1 ,   62   enc MB/s ,  595   dec MB/s
Of course ZStd does have faster encode levels (1-3); Oodle does not provide anything in that domain.


3. Kraken & Mermaid's optimal level encoders now get more compression. (even with 2.5-compatible bit streams)

We improved the ability of the optimal parse encoders to make good decisions and find smaller encodings. The numbers below are at level 8 (Optimal4), our maximum compression level (slow encoding).

PD3D : (public domain 3D game data test set)

Kraken8 255    :  3.67:1 ,    2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5:  3.72:1 ,    1.2 enc MB/s , 1079.9 dec MB/s

GTS : (private game data test set)

Kraken8 255    :  2.60:1 ,    2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5:  2.63:1 ,    1.2 enc MB/s , 1343.8 dec MB/s

Silesia : (standard Silesia compression corpus)

Kraken8 255    :  4.12:1 ,    1.4 enc MB/s ,  982.0 dec MB/s
Kraken8 260 -v5:  4.18:1 ,    0.6 enc MB/s , 1018.7 dec MB/s

(speeds on Core i7-3770 Windows x64)
(-v5 means encode in v5 (2.5.x) backwards compatibility mode)
Compression ratio improvements around 1% might not sound like much, but when you're already on the Pareto frontier, finding another 1% without sacrificing any decode speed or changing the bit stream is quite significant.


4. Kraken & Mermaid have new bit stream options which allow them to reach even higher compression.

PD3D :

Kraken8 255    :  3.67:1 ,    2.8 enc MB/s , 1091.5 dec MB/s
Kraken8 260 -v5:  3.72:1 ,    1.2 enc MB/s , 1079.9 dec MB/s
Kraken8 260    :  4.00:1 ,    1.0 enc MB/s , 1034.7 dec MB/s

GTS :

Kraken8 255    :  2.60:1 ,    2.5 enc MB/s , 1335.8 dec MB/s
Kraken8 260 -v5:  2.63:1 ,    1.2 enc MB/s , 1343.8 dec MB/s
Kraken8 260    :  2.67:1 ,    1.0 enc MB/s , 1282.3 dec MB/s

Silesia :

Kraken8 255    :  4.12:1 ,    1.4 enc MB/s ,  982.0 dec MB/s
Kraken8 260 -v5:  4.18:1 ,    0.6 enc MB/s , 1018.7 dec MB/s
Kraken8 260    :  4.24:1 ,    0.6 enc MB/s ,  985.4 dec MB/s
Kraken in Oodle 2.6.0 now gets Silesia to 50,006,565 bytes at the default space-speed tradeoff target. At its max-compression space-speed setting, Kraken gets Silesia to 49,571,429 bytes (and is still far faster to decode than anything close).

If we look back at where Kraken started in April of 2016, it was getting 4.05 to 1 on Silesia; now it gets 4.24 to 1.

Kraken now usually gets more compression than anything remotely close to its decode speed.

Looking back at the old Performance of Oodle Kraken post, Kraken only got 2.70:1 on win81. On some files, Kraken has always out-compressed the competition, but win81 was one where it lagged. It does better now :

"win81" test file (Core i7-3770 Windows x64)
in order of decreasing decode speed :

Kraken8       :  2.86:1 ,    0.6 enc MB/s ,  950.9 dec MB/s
old Kraken 215:  2.70:1 ,    1.0 enc MB/s ,  877.0 dec MB/s
Leviathan8    :  2.96:1 ,    0.4 enc MB/s ,  758.1 dec MB/s

zstd 1.3.3 -22          :  2.70:1 ,   3.35 enc MB/s ,  473 dec MB/s  (38,804,423 bytes)
zlib 1.2.11 -9          :  2.21:1 ,   8.59 enc MB/s ,  254 dec MB/s  (47,516,720 bytes)
brotli24 2017-12-12 -11 :  2.82:1 ,   0.39 enc MB/s ,  214 dec MB/s  (37,246,857 bytes)
lzham 1.0 -d26 -4       :  2.78:1 ,   1.50 enc MB/s ,  158 dec MB/s  (37,794,766 bytes)
lzma 16.04 -9           :  2.92:1 ,   2.75 enc MB/s ,   51 dec MB/s  (35,890,919 bytes)
At the time of Kraken's release, it was a huge decode speed win vs comparable compressors, but it sometimes lagged a bit in compression ratio. No longer.

NOTE : Oodle 2.6.0 by default makes bit streams that are decodable by version >= 2.6.0 only. If you need bit streams that can be read by earlier versions, you must set the backward compatible version number that you need. See the Oodle FAQ on backward compatibility.


5. Kraken and Mermaid are now more tuneable to different compression ratios and decode speeds.

The new v6 bit stream has more options, which allows them to smoothly trade off compression ratio for decode speed. The user can set this goal with a space-speed tradeoff parameter.

All the Oodle codecs have a compression level setting (similar to the familiar zip 1-9 levels) that trades encode time for compression ratio. Unlike many other codecs, Oodle's compressors do not lose *decode* speed at higher encode effort levels; we are not finding more compact encodings by making the decoder slower. Instead you can dial decode speed vs ratio with a separate parameter that changes how the encoder scores decisions.
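
One way to picture that scoring is as a Lagrangian cost : each candidate encoding is charged for its size plus its estimated decode time, weighted by a tradeoff parameter. The sketch below shows only the shape of that idea; it is not Oodle's actual cost function, units, or parameter values.


// space_speed_score : conceptual J = size + lambda * decode_time scoring (not Oodle's actual model)

#include <stdio.h>
#include <stddef.h>

typedef struct
{
    size_t comp_bytes;      // compressed size of this candidate encoding
    double decode_seconds;  // estimated time to decode it
} Candidate;

// lambda says how many bytes of size one second of decode time is "worth" :
// a larger lambda biases choices toward faster-to-decode (larger) encodings,
// a smaller lambda biases toward smaller (slower-to-decode) encodings.
static double score(const Candidate *c, double lambda)
{
    return (double)c->comp_bytes + lambda * c->decode_seconds;
}

int main(void)
{
    Candidate small_slow = { 900000, 0.0012 };  // made-up example candidates
    Candidate big_fast   = { 950000, 0.0009 };

    double lambdas[] = { 1e6, 1e9 };
    for (int i = 0; i < 2; i++)
    {
        const Candidate *pick =
            score(&small_slow, lambdas[i]) <= score(&big_fast, lambdas[i]) ? &small_slow : &big_fast;
        printf("lambda = %g : pick %s\n", lambdas[i], pick == &small_slow ? "small_slow" : "big_fast");
    }
    return 0;
}

At a low lambda the smaller encoding wins; at a high lambda the faster one does. That is the sense in which a single parameter can steer the encoder toward ratio or toward decode speed without changing the decoder at all.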

More about this in : Oodle Hydra and space-speed flexibility.


Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK

Oodle 2.6.0 : Leviathan performance on PS4, Xbox One and Switch

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


Oodle's speed on the Sony PS4 (and Microsoft Xbox One) and Nintendo Switch is superb. With the slower processors in these consoles (compared to a modern PC), the speed advantage of Oodle makes a big difference in total load time or CPU use.

These are run on the private test file "lzt99". I'm mainly looking at the speed numbers here, not the compression ratio (compression-wise we do so well on lzt99 that it's a bit silly, and also not entirely fair to the competition).

On the Nintendo Switch (clang ARM-A57 AArch64 1.02 GHz) :


Oodle 2.6.0 -z8 :

Leviathan   : 2.780 to 1 : 205.50 MB/s
Kraken      : 2.655 to 1 : 263.54 MB/s
Mermaid     : 2.437 to 1 : 499.72 MB/s
Selkie      : 1.904 to 1 : 957.60 MB/s

zlib from nn_deflate

zlib        : 1.883 to 1 :  74.75 MB/s

And on the Sony PS4 (clang x64 AMD Jaguar 1.6 GHz) :

Oodle 2.6.0 -z8 :

Leviathan   : 2.780 to 1 : 271.53 MB/s
Kraken      : 2.655 to 1 : 342.49 MB/s
Mermaid     : 2.437 to 1 : 669.34 MB/s
Selkie      : 1.904 to 1 :1229.26 MB/s

non-Oodle reference (2016) :

brotli-11   : 2.512 to 1 :  77.84 MB/s
miniz       : 1.883 to 1 :  85.65 MB/s
brotli-9    : 2.358 to 1 :  95.36 MB/s
zlib-ng     : 1.877 to 1 : 109.30 MB/s
zstd        : 2.374 to 1 : 133.50 MB/s
lz4hc-safe  : 1.669 to 1 : 673.62 MB/s
LZSSE8      : 1.626 to 1 : 767.11 MB/s

The Microsoft Xbox One has similar performance to the PS4. Mermaid & Selkie can decode faster than the hardware DMA compression engine in the PS4 and Xbox One, and usually compress more when they aren't limited to the small chunks the hardware engine requires.

Note that the PS4 non-Oodle reference data is from my earlier runs back in 2016 : Oodle Mermaid and Selkie on PS4 and PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle . They should be considered only rough reference points; some of those codecs have no doubt changed since then, but even a 10% or 20% or 50% improvement wouldn't change the picture here. (Also note that there's no true zlib reference in that PS4 set; miniz is close but a little different, and zlib-ng is faster than standard zlib.)

Leviathan is in a different compression class than any of the other options, and is still 2-3X faster than zlib.


Something I spotted while gathering the old numbers that I think is worth talking about:

If you look at the old Kraken PS4 numbers from Oodle 2.3.0 : Kraken Improvement, you would see :


PS4 lzt99

old :

Oodle 2.3.0 -z6 : 2.477 to 1 : 389.28 MB/s
Oodle 2.3.0 -z7 : 2.537 to 1 : 363.70 MB/s

vs new :

Oodle 2.6.0 -z8 : 2.655 to 1 : 342.49 MB/s

(-z8 encode level didn't exist back then)

Oh no! Oodle's gotten slower to decode!

Well no, it hasn't. But this is a good example of how looking at just space or speed on their own can be misleading.

Oodle's encoders are always optimizing for a space-speed goal. There is a range of solutions to that problem with nearly the same space-speed score but different sizes and speeds.

So part of what's happened here is that Oodle 2.6.0 is just hitting a slightly different spot in the space-speed solution space than Oodle 2.3.0 is. It's finding a bit stream that is smaller, and trades off some decode speed for that. With its space-speed cost model, it measures that tradeoff as being a good value. (the user can set the relative value of time & bytes that Oodle uses in its scoring via the spaceSpeedTradeoffBytes parameter).

But something else has also happened - Oodle 2.6.0 has just gotten much better. It hasn't just stepped along the Pareto curve to a different but equally good solution - it has stepped perpendicularly to the old Pareto curve and is finding better solutions.

At RAD we measure that using the "correct" Weissman score, which collapses a space-speed point into a single number that can be used to tell whether you have made a real Pareto improvement or just a tangential step.

The easiest way to see that you have definitely made an improvement is to run Oodle 2.6.0 with a different spaceSpeedTradeoffBytes price, so that the comparison becomes unambiguous :


PS4 lzt99

new, with spaceSpeedTradeoffBytes = 1024

Oodle 2.6.0 -z8 : 2.495 to 1 : 445.56 MB/s

vs old :

Oodle 2.3.0 -z6 : 2.477 to 1 : 389.28 MB/s

Now we have higher compression and higher speed, so there's no question of whether we lost anything.

In general, the Oodle 2.6.0 Kraken & Mermaid encoders make decisions that are biased slightly more toward higher compression (and slower decode, though the decode speed is often very close) than before 2.6.0. If you find you've lost a little decode speed and want it back, increase spaceSpeedTradeoffBytes (try 400).
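
If you'd rather pick that value systematically than by feel, one simple approach is to encode a representative asset at a few candidate settings, time the decodes, and keep the smallest output that still meets your decode-speed budget. The sketch below shows only that selection loop, over made-up measurements; in a real pipeline the table would be filled in by actually encoding with Oodle at each setting and timing the decode.


// tradeoff_sweep : choose a spaceSpeedTradeoffBytes setting against a decode-speed budget.
// The measurements below are fake placeholder numbers; a real harness would produce them
// by encoding at each candidate setting and timing the decode.

#include <stdio.h>
#include <stddef.h>

typedef struct { int tradeoff; size_t comp_bytes; double dec_mbps; } Measured;

int main(void)
{
    Measured m[] = {
        {   64, 24500000,  870.0 },
        {  256, 24900000,  960.0 },
        {  400, 25100000, 1010.0 },
        { 1024, 25600000, 1090.0 },
    };
    double min_dec_mbps = 950.0;   // our decode speed budget

    // keep the smallest output that still decodes fast enough
    int best = -1;
    for (int i = 0; i < (int)(sizeof(m)/sizeof(m[0])); i++)
    {
        if (m[i].dec_mbps >= min_dec_mbps &&
            (best < 0 || m[i].comp_bytes < m[best].comp_bytes))
            best = i;
    }

    if (best >= 0)
        printf("use spaceSpeedTradeoffBytes = %d\n", m[best].tradeoff);
    else
        printf("no candidate met the decode speed budget\n");
    return 0;
}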


Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK

Oodle 2.6.0 : Leviathan detailed performance report

Oodle is an SDK for high performance lossless data compression. For more about Oodle, or licensing inquiries, visit the RAD Game Tools web site. This is my personal blog where I post supplemental material about Oodle.


Let's have a look at a few concrete tests of Leviathan's performance.

All of the tests in this post are run on Windows, x64, on a Core i7-3770. For performance on some other game platforms, see : Leviathan performance on PS4, Xbox One, and Switch .

All the compressors in this test are run at their slowest encoding level. We're looking for high compression and fast decompression; we're not looking for fast compression here. See @@ for a look at encode times.

Tests are on groups of files. The results show the total compressed size & total decode time over the group of files, but each file is encoded independently. The maximum window size is used for all compressors.
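
That is, the group ratio works out to total raw bytes over total compressed bytes, and the group decode speed to total raw bytes over total decode time, not an average of per-file numbers. A small sketch with made-up per-file figures :


// group_totals : how per-group ratio and decode speed fall out of totals (made-up file data)

#include <stdio.h>

int main(void)
{
    // per-file raw size (bytes), compressed size (bytes), decode time (seconds) - example data :
    double raw[]  = { 10.0e6, 40.0e6, 2.5e6 };
    double comp[] = {  3.1e6, 11.8e6, 1.9e6 };
    double secs[] = {  0.010,  0.041, 0.003 };

    double total_raw = 0, total_comp = 0, total_secs = 0;
    for (int i = 0; i < 3; i++)
    {
        total_raw  += raw[i];
        total_comp += comp[i];
        total_secs += secs[i];
    }

    printf("group ratio     : %.2f:1\n", total_raw / total_comp);
    printf("group dec speed : %.1f MB/s\n", (total_raw / 1.0e6) / total_secs);
    return 0;
}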

The compressors tested here are :

Kraken8          : Oodle Kraken at level 8 (Optimal4)
Leviathan8       : Oodle Leviathan at level 8 (Optimal4)

lzma_def9        : LZMA -mx9 with default settings (lc3lp0pb2fb32) +d29
lzmalp3pb3fb1289 : LZMA -mx9 with settings that are better on binary and game data (lc0lp3pb3fb128) +d29
zlib9            : zlib 1.2.8 at level 9
LZMA and zlib here were built with MSVC; I also checked their speed in a GCC build and confirmed it is nearly identical.

Two settings for LZMA are tested to try to give it the best chance of competing with Leviathan. LZMA's default settings tend to be good on text but not great on binary (often even Kraken can beat "lzma_def" on binary structured data). I've also increased LZMA's fast bytes to 128 in the non-default options; the default value of 32 is a bit of a detriment to compression ratio, and most of the competition (ZStd, Brotli, LZHAM) use a value more like 128. I want to give LZMA a great chance to compete; we don't need to play any games with selective testing to make Leviathan look good.
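
For reference, here's how those option strings map onto encoder properties, sketched against the LZMA SDK's C interface (LzmaEnc.h / CLzmaEncProps). The field names are the ones I recall from that SDK; double-check them against the SDK version you actually build with.


// lzma_props : the "lc0 lp3 pb3 fb128, -mx9, +d29" settings as LZMA SDK encoder properties
// (sketch; assumes the LZMA SDK's LzmaEnc.h - verify field names against your SDK version)

#include "LzmaEnc.h"

static void set_game_data_props(CLzmaEncProps *props)
{
    LzmaEncProps_Init(props);        // start from the SDK defaults

    props->level    = 9;             // -mx9
    props->dictSize = 1u << 29;      // +d29 : 512 MB dictionary (max window)
    props->lc       = 0;             // literal context bits     (default 3)
    props->lp       = 3;             // literal position bits    (default 0)
    props->pb       = 3;             // position bits            (default 2)
    props->fb       = 128;           // "fast bytes" lookahead   (default 32)
}

The lzma_def9 run uses the same level and dictionary size but leaves lc/lp/pb/fb at their defaults (lc3 lp0 pb2 fb32).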

Let the tests begin!

PD3D :

Public Domain 3d game data test set :

Kraken8         :  4.02:1 ,    1.0 enc MB/s , 1013.3 dec MB/s
Leviathan8      :  4.21:1 ,    0.5 enc MB/s ,  732.8 dec MB/s
zlib9           :  2.89:1 ,    5.2 enc MB/s ,  374.5 dec MB/s
lzmalp3pb3fb1289:  4.08:1 ,    3.4 enc MB/s ,   65.4 dec MB/s
lzma_def9       :  3.97:1 ,    4.3 enc MB/s ,   65.4 dec MB/s

Game BCn :

A mix of BCn textures from a game (mostly BC1, BC4, and BC7) :

Kraken8         :  3.67:1 ,    0.7 enc MB/s ,  881.0 dec MB/s
Leviathan8      :  3.84:1 ,    0.4 enc MB/s ,  661.9 dec MB/s
zlib9           :  2.72:1 ,    6.2 enc MB/s ,  321.5 dec MB/s
lzmalp3pb3fb1289:  3.67:1 ,    2.0 enc MB/s ,   66.9 dec MB/s
lzma_def9       :  3.68:1 ,    2.3 enc MB/s ,   67.0 dec MB/s

ASTC testset :

A private corpus of game ASTC texture data :

Kraken8         :  1.15:1 ,    1.7 enc MB/s , 1289.6 dec MB/s
Leviathan8      :  1.17:1 ,    1.0 enc MB/s ,  675.8 dec MB/s
zlib9           :  1.10:1 ,   18.9 enc MB/s ,  233.8 dec MB/s
lzma_def9       :  1.17:1 ,    6.2 enc MB/s ,   21.4 dec MB/s
lzmalp3pb3fb1289:  1.19:1 ,    6.1 enc MB/s ,   21.4 dec MB/s

gametestset :

A private corpus of game test data :

Kraken8         :  2.68:1 ,    1.0 enc MB/s , 1256.2 dec MB/s
Leviathan8      :  2.83:1 ,    0.6 enc MB/s ,  915.0 dec MB/s
zlib9           :  1.99:1 ,    8.3 enc MB/s ,  332.4 dec MB/s
lzmalp3pb3fb1289:  2.76:1 ,    3.4 enc MB/s ,   46.5 dec MB/s
lzma_def9       :  2.72:1 ,    4.2 enc MB/s ,   46.3 dec MB/s

Silesia :

Silesia standard public compression test corpus :

Kraken8         :  4.25:1 ,    0.6 enc MB/s ,  996.0 dec MB/s
Leviathan8      :  4.35:1 ,    0.4 enc MB/s ,  804.5 dec MB/s
zlib9           :  3.13:1 ,    8.4 enc MB/s ,  351.3 dec MB/s
lzmalp3pb3fb1289:  4.37:1 ,    1.9 enc MB/s ,   80.2 dec MB/s
lzma_def9       :  4.27:1 ,    2.6 enc MB/s ,   79.0 dec MB/s

brotli24 -11    :  4.21:1 ,                    310.9 dec MB/s
zstd 1.3.3 -22  :  4.01:1 ,                    598.2 dec MB/s
(zstd and brotli run in lzbench)  (brotli is 2017-12-12)


Read more about Leviathan and Oodle 2.6.0 in these other posts on my blog :

Leviathan Rising
Everything new and tasty in Oodle 2.6.0
Leviathan performance on PS4, Xbox One, and Switch
Leviathan detailed performance report
Oodle Hydra and space-speed flexibility

or visit RAD for more information about the Oodle SDK
