7/25/2016

Oodle Selkie

Selkie is the faster cousin of Mermaid, distant relative of Kraken.

Selkie is all about decode speed, it aims to be the fastest mainstream decompressor in the world, and still gets more compression than anything in the high-speed domain.

Selkie does not currently have a super fast encoder. It's got good optimal parse encoders that produce carefully tuned encoded file which offer excellent space-speed tradeoff.

The closest compressors to Selkie are the fast byte-wise small-window coders like LZ4 and LZSSE (and Oodle's LZB16). These are all just obsolete now (in terms of ratio vs decode speed), Selkie gets a lot more compression (sometimes close to Zlib compression levels!) and is also much faster.

Selkie will not compress tiny buffers, or files that only compress a little bit. For example if you give Selkie something like an mp3, it might be able to compress it to 95% of its original size, saving a few bytes. Selkie will refuse to do that and just give you the original uncompressed file. If you wanted that compression, that means you wanted to save only a few bytes at a large time cost, which means you don't actually want a fast compressor like Selkie. You in fact wanted a compressor that was more willing to trade time for bytes, such as Mermaid or Kraken. Selkie will not abide logical inconsistency.

Selkie generally beats LZ4 compression even on small files (under 64k) but really gets ahead on files larger than 64k where the unbounded match distances can find big wins.

As usual, I'm not picking on LZ4 here because it's bad; I'm comparing to it because it's the best of the rest, and it's widely known. Both decompressors are run fuzz-safe.


Tests on Win64 (Core i7-3770 3.4 GHz) : 
(for reference, this machine runs memcpy at roughly 8 GB/s)
(total of time & size on each test set)

gametestset : ooSelkie    : 143,579,361 ->70,716,380 =  3.940 bpb =  2.030 to 1 
gametestset : decode      : 29.239 millis, 0.69 c/b, rate= 4910.61 mb/s

gametestset : lz4hc       : 143,579,361 ->80,835,018 =  4.504 bpb =  1.776 to 1 
gametestset : decode      : 44.495 millis, 1.05 c/b, rate= 3226.89 mb/s

pd3d : ooSelkie    : 31,941,800 ->13,428,298 =  3.363 bpb =  2.379 to 1 
pd3d : decode      : 8.381 millis, 0.89 c/b, rate= 3811.29 mb/s

pd3d : lz4hc       : 31,941,800 ->14,273,195 =  3.575 bpb =  2.238 to 1 
pd3d : decode      : 13.479 millis, 1.44 c/b, rate= 2369.67 mb/s

seven : ooSelkie    : 80,000,000 ->36,460,084 =  3.646 bpb =  2.194 to 1 
seven : decode      : 21.458 millis, 0.91 c/b, rate= 3728.26 mb/s

seven : lz4hc       : 80,000,000 ->39,990,656 =  3.999 bpb =  2.000 to 1 
seven : decode      : 31.730 millis, 1.35 c/b, rate= 2521.30 mb/s

silesia : ooSelkie    : 211,938,580 ->69,430,966 =  2.621 bpb =  3.053 to 1 
silesia : decode      : 72.340 millis, 1.16 c/b, rate= 2929.77 mb/s

silesia : lz4hc       : 211,938,580 ->77,841,566 =  2.938 bpb =  2.723 to 1 
silesia : decode      : 93.488 millis, 1.50 c/b, rate= 2267.02 mb/s

The edge that Selkie has over LZ4 is even greater on more difficult platforms like the PS4.

To get a better idea of the magic of Selkie it's useful to look at the other Oodle compressors that are similar to Selkie.

LZB16 is Oodle's LZ4 variant; it gets slightly more compression and slightly more decode speed, but they're roughly equal. It's included here for comparison to LZBLW.

Oodle's LZBLW is perhaps the most similar compressor to Selkie. It's like LZB16 (LZ4) but adds large-window matches. That ability to do long-distance matches hurts speed a tiny bit (2873 mb/s -> 2596 mb/s), but helps compression a lot.

Oodle's LZNIB is nibble-wise, with unbounded offsets and a rep match. It gets good compression, generally better than Zlib, with speed much higher than any LZ-Huff. LZNIB is in a pretty unique space speed tradeoff zone without much competition outside of Oodle.


lz4hc     : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1 
decode    : 9.481 millis, 1.31 c/b, rate= 2605.37 mb/s

ooLZB16   : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1
decode    : 8.597 millis, 1.18 c/b, rate= 2873.17 mb/s

ooLZNIB   : 24,700,820 ->12,014,368 =  3.891 bpb =  2.056 to 1
decode    : 17.420 millis, 2.40 c/b, rate= 1417.93 mb/s

ooLZBLW   : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode    : 9.512 millis, 1.31 c/b, rate= 2596.80 mb/s

ooSelkie  : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1 
decode    : 6.410 millis, 0.88 c/b, rate= 3853.57 mb/s

LZNIB and LZBLW were both pretty cool before Selkie, but now they're just obsolete.

LZBLW gets a nice compression gain over LZB16, but Selkie gets even more, and is way faster!

LZNIB beats Selkie compression, but is way slower, around 3X slower, in fact it's slower than Mermaid (2283.28 mb/s and compresses to 10,838,455 = 3.510 bpb = 2.279 to 1).

You can see from the curves that Selkie just completely covers the curves of LZB16,LZBLW, and LZ4. When a curve is completely covered like that, it means that it was beaten for both space and speed, so there is no domain where that compressor is ever better. LZNIB just peeks out of the Selkie curve because it gets higher compression (albeit at lower speed), so there is a domain where it is the better choice - but in that domain Mermaid just completely dominates LZNIB, so it too is obsolete.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 ARM Report

I prepared a detailed report of Oodle's performance on ARM mobile devices (Android and iOS).

The full report is here :

oodle_arm_report on cbloom.com

It's a thorough test on many devices and several corpora. See the full details there.

Cliff notes is : Oodle's great on ARM.

For example on the iPadAir2 64-bit , on Silesia :

We found that the iOS devices are generally very good and easy to program for. They're more like desktop Intel chips; they don't have any terrible performance cliffs. The Android ARM devices we tested on were rather more difficult. For one thing they have horrible thermal saturation problems that makes testing on them very difficult. They also have some odd performance quirks.

I'm sure we could get a lot more speed on ARM, but it's rather nasty to optimize for. For one thing the thermal problems mean that iterating and getting good numbers is a huge pain. It's hard to tell if a change helped or not. For another, there's a wide variety of devices and it's hard to tell which to optimize for, and they have different performance shortfalls. So there's definitely a lot left on the table here.

Mermaid & Selkie are quite special on ARM. Many of these devices have small caches (as small as 512k L2) and very slow main memory (slow wrst latency; they often have huge bandwidth, but latency is what I need). Mermaid & Selkie are able to use unbounded windows for LZ without suffering a huge speed hit, due to the unique way they are structured. Kraken doesn't have the same magic trick so it benefits from a limited window, as demonstrated in the report.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 All Test Sets

Putting the total performance on various testsets together in one place. Tests on Win64 Core i7-3770 3.4 GHz as usual.

Showing speed & ratio here, higher is better.

As usual the total on a test set is total size of all individually compressed files, and total time.

I think the scatter plot most clearly shows the way Kraken, Mermaid & Selkie are just on a whole new Pareto Frontier than the older compressors. You can connect the dots of K-M-S performance for each test set and they form a very consistent space-speed tradeoff curve that's way above the previous best.

The raw numbers :

gametestset          : Kraken     : 2.566 to 1 : 1363.283 MB/s
gametestset          : Mermaid    : 2.284 to 1 : 2711.458 MB/s
gametestset          : Selkie     : 2.030 to 1 : 4870.413 MB/s
gametestset          : lz4hc      : 1.776 to 1 : 3223.279 MB/s
gametestset          : zlib9      : 1.992 to 1 : 338.063 MB/s
gametestset          : lzma       : 2.756 to 1 : 43.782 MB/s

pd3d                 : Kraken     : 3.647 to 1 : 1072.833 MB/s
pd3d                 : Mermaid    : 2.875 to 1 : 2299.860 MB/s
pd3d                 : Selkie     : 2.379 to 1 : 3784.850 MB/s
pd3d                 : lz4hc      : 2.238 to 1 : 2370.193 MB/s
pd3d                 : zlib9      : 2.886 to 1 : 382.226 MB/s
pd3d                 : lzma       : 4.044 to 1 : 63.878 MB/s

seven                : Kraken     : 2.914 to 1 : 1053.961 MB/s
seven                : Mermaid    : 2.462 to 1 : 2374.796 MB/s
seven                : Selkie     : 2.194 to 1 : 3717.074 MB/s
seven                : lz4hc      : 2.000 to 1 : 2522.824 MB/s
seven                : zlib9      : 2.329 to 1 : 315.344 MB/s
seven                : lzma       : 3.186 to 1 : 52.660 MB/s

silesia              : Kraken     : 4.082 to 1 : 1004.014 MB/s
silesia              : Mermaid    : 3.571 to 1 : 2002.079 MB/s
silesia              : Selkie     : 3.053 to 1 : 2889.536 MB/s
silesia              : lz4hc      : 2.723 to 1 : 2269.788 MB/s
silesia              : zlib9      : 3.128 to 1 : 358.593 MB/s
silesia              : lzma       : 4.369 to 1 : 78.655 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

7/24/2016

Oodle 2.3.0 : Kraken Improvement

Oodle 2.3.0 includes some pretty solid improvements to Kraken. The result is around a 10% gain in decode speed.

There were two major factors in the gains. One was just some more time optimizing some inner loops (including some new super-tight pathways from Fabian).

The other was more rigorous analysis of the space-speed tradeoff decisions inside Kraken. One of the fundamental things that makes Kraken work is the fact that it consider space-speed when making its internal decisions, but before 230 those decisions were made in a rather ad-hoc way. Making those decisions better means that even with the same decoder, the new encoder is able to create files that are the same size but decode faster.

The tradeoff point (technically, the lagrange lambda, or the exchange rate from time to bytes) that's used by Oodle to make space-speed decisions is exposed to the client in the OodleLZ_CompressOptions so you can adjust it to bias for compression or decode speed. Each compressor sets what I believe to be a reasonable default for its usage domain, so adjustments to this value should typically be small (you can't massively change behavior with it; Kraken won't start arithmetic coding things if you set the tradeoff really small, for example, there's a small window where the compressor works well and you can just bias sightly within that window).

Some dry numbers for reference :

On PS4 :


Oodle 230 Kraken -zl4 : 24,700,820 ->10,377,556 =  3.361 bpb =  2.380 to 1
decode only      : 65.547 millis, 4.23 c/b, rate= 376.84 mb/s

Oodle 230 Kraken -zl6 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.453 millis, 4.09 c/b, rate= 389.28 mb/s

Oodle 230 Kraken -zl7 : 24,700,820 -> 9,734,771 =  3.153 bpb =  2.537 to 1
decode           : 67.915 millis, 4.38 c/b, rate= 363.70 mb/s

Oodle 220 Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle 220 Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle 220 Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

On Win64 (Core i7-3770 3.4 GHz) :


Oodle 2.3.0 :

Silesia Kraken -z6

total   : 211,938,580 ->51,918,269 =  1.960 bpb =  4.082 to 1
decode           : 210.685 millis, 3.38 c/b, rate= 1005.95 mb/s
Weissman 1-256 : [8.575]

mozilla : 51,220,480 ->14,410,181 =  2.251 bpb =  3.554 to 1
decode only      : 51.280 millis, 3.41 c/b, rate= 998.83 mb/s

lzt99 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode only      : 20.943 millis, 2.89 c/b, rate= 1179.44 mb/s

win81 : 104,857,600 ->38,222,311 =  2.916 bpb =  2.743 to 1
decode only      : 108.344 millis, 3.52 c/b, rate= 967.82 mb/s

Oodle 2.2.0 :

Silesia Kraken -z6

total : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1
decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s
Weissman 1-256 : [8.431]

"silesia_mozilla"

Kraken 230  :  3.55:1 ,   998.8 dec mb/s
Kraken 220  :  3.60:1 ,   896.5 dec mb/s
Kraken 215  :  3.51:1 ,   928.0 dec mb/s

"lzt99"

Kraken 230  :  2.48:1 ,   998.8 dec mb/s
Kraken 220  :  2.53:1 ,   912.0 dec mb/s
Kraken 215  :  2.46:1 ,   957.1 dec mb/s

"win81"

Kraken 230  :  2.74:1 ,   967.8 dec mb/s
Kraken 220  :  2.77:1 ,   818.0 dec mb/s
Kraken 215  :  2.70:1 ,   877.0 dec mb/s

NOTE : Oodle 2.3.0 Kraken data cannot be read by Oodle 2.2.0 or earlier. Oodle 230 can load all old Oodle data (new versions of Oodle can always load all data created by older versions). If you need to make data that be loaded with an older version using Oodle 230, then you can set the minimum decoder version to something lower (by default it's the current version). Contact Oodle support for details.

Some of the biggest gains were found on ARM, which I'll post about more in the future.

6/09/2016

Fundamentals of Modern LZ : Two-Step Parse


06-09-16 | Fundamentals of Modern LZ : Two-Step Parse

For some reason I feel like writing a note on this today.

A two-step parse is an enhancement to a forward-arrivals parse.

(background : forward-arrivals parse stores the minimum cost from head at each position, along with information on the path taken to get there. At each pos P, it takes the best incoming arrival and considers all ways to go further into the parse (literal/match/rep/etc.). At each destination point it stores arrival_cost[P] + step cost. In simple cases (no carried state, no entropy coding, like LZSS) the forward-arrivals parse is a perfect parse just like the backward dynamic-programming parse. In modern LZ with carried state such as a rep set or markov state, the forward parse is preferable.)

A two-step parse extends the standard forward-arrivals parse by being able to store an arrival from a single coding step, or from two coding steps. The standard usage (as in LZMA/7zip) is to be able to store a two-step arrival from the sequence { normal match, some literals, rep match }. This multi-step arrival is stored with the cost of the whole sequence at the end point of the sequence.

If you stored *all* arrivals (not just the cheapest), you would not need two-step parse. You could just store the first step, and then when your parse origin point advanced to the end of the first step, it would find the second step and be able to choose it as an option.

But obviously you don't store all arrivals at each position, since the number would massively explode, even with reduction by symmetries. (see, eg. previous articles on A* parse)

The problem arises when you have a cheap multi-step sequence, but the first step is expensive. Then the first step might be replaced (or never filled in the first place) and the parse will not be able to find the second step cheap option.

Let's consider a concrete example for clarity.


Parser is at pos P consider all ways to continue

At pos P there's a length 4 normal match available at offset O

It stores an arrival at [P+4] that's rather expensive

(because it has to send offset O).

At pos P+1 the parser finds a length 3 rep match

The exit from (P+1) length 3 also lands at [P+4]

This is a cheaper way to arrive at [P+4] , so the previous arrival from P via O is replaced

When the parser reaches P+4 it sees the incoming arrival as
begin a rep match match from P+1


But we missed something !

At pos P+5 (one step after the arrival) there are 2 bytes that match at offset O

if we had chosen the normal match to arrive at P+4 , we could now code a rep match

but we lost it, so we don't see the rep as an option.


Two-step to the rescue!

Back at pos P , we consider the one-step arrival :

{match len 4, offset O} to arrive at P+4

We also look after the end of that for cheap rep matches and find one.

So we store a two-step arrival :

{match len4, offset O, 1 literal, rep len 2} to arrive at P+7

Now at pos P+1 the arrival at P+4 is stomped

but the arrival at P+7 remains!  So we are able to find that in the future.


The options look like :

   P   P+4
   V   V
1. MMMMLRR
2. LRRRLLL

Option 2 is cheaper at P+4
but Option 1 is cheaper at P+7

This is the primary application of two-step parse.

It's a (very limited) way of finding non-local minima in the parse search space.

The other option is "multi-parse" that stores multiple arrivals at each position (something like 4 is typical). Multi-parse and two-step provide diminishing returns when used together, so they usually aren't. Two-step is generally a faster way and provides more win per CPU time, multi-parse is able to find longer-range non-local-minimum moves and so provides more compression.

All good modern LZ's need some kind of non-local-minimum parse, because to get into a good state for the future (typically by getting the right offset into the rep offset cache) you may need to make a more expensive initial step.

5/27/2016

PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

(see charts at the bottom)

Everything run at max compression options, level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar , x64.

I'm going to omit encode speeds on the per-file results for simplicity, these are pretty representative :


aow3_skin_giants.clb :
zlib-ng encode   : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode     : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode      : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9  encode    : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode    : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode    : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s

Results :

PS4 clang-3.5.0

-------------

lzt99 :

MiniZ : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s

zlib-ng : 24,700,820 ->13,158,385 =  4.262 bpb =  1.877 to 1
miniz_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s

ZStd : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s

Brotli-9 : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s

Brotli-10 : 24,700,820 -> 9,949,740 =  3.222 bpb =  2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s

Brotli-11 : 24,700,820 -> 9,833,023 =  3.185 bpb =  2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s

Oodle Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
encode only      : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 =  2.937 bpb =  2.724 to 1
decode           : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s

-------------

normals.bc1 :

miniz :   524,316 ->   291,697 =  4.451 bpb =  1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s

zlib-ng :   524,316 ->   292,541 =  4.464 bpb =  1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s

zstd :   524,316 ->   273,642 =  4.175 bpb =  1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s

brotli-9 :   524,316 ->   289,778 =  4.421 bpb =  1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s

brotli-10 :   524,316 ->   259,772 =  3.964 bpb =  2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s

brotli-11 :   524,316 ->   253,625 =  3.870 bpb =  2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s

Oodle Kraken -zl6 :    524,316 ->   247,217 =  3.772 bpb =  2.121 to 1
decode           : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s

Oodle Kraken -zl7 :    524,316 ->   238,844 =  3.644 bpb =  2.195 to 1
decode           : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s

Oodle BitKnit :    524,316 ->   225,884 =  3.447 bpb =  2.321 to 1
decode only      : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s

-------------

lightmap.bc3 :

miniz :  4,194,332 ->   590,448 =  1.126 bpb =  7.104 to 1 
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s

zlib-ng : 4,194,332 ->   584,107 =  1.114 bpb =  7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s

zstd :  4,194,332 ->   417,672 =  0.797 bpb = 10.042 to 1 
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s

brotli-9 : 4,194,332 ->   499,120 =  0.952 bpb =  8.403 to 1 
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s

brotli-10 : 4,194,332 ->   409,907 =  0.782 bpb = 10.232 to 1 
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s

brotli-11 : 4,194,332 ->   391,576 =  0.747 bpb = 10.711 to 1 
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s

Oodle Kraken -zl6 :   4,194,332 ->   428,737 =  0.818 bpb =  9.783 to 1 
decode           : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s

Oodle BitKnit :   4,194,332 ->   416,208 =  0.794 bpb = 10.077 to 1
decode only      : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s

Oodle LZNA :  4,194,332 ->   356,313 =  0.680 bpb = 11.771 to 1 
decode           : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s

----------------

aow3_skin_giants.clb

Miniz : 7,105,158 -> 3,231,469 =  3.638 bpb =  2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s

zlib-ng : 7,105,158 -> 3,220,291 =  3.626 bpb =  2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s

Zstd : 7,105,158 -> 2,700,034 =  3.040 bpb =  2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s

brotli-9 :  7,105,158 -> 2,671,237 =  3.008 bpb =  2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s

brotli-10 : 7,105,158 -> 2,518,315 =  2.835 bpb =  2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s

brotli-11 : 7,105,158 -> 2,482,511 =  2.795 bpb =  2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s

Oodle Kraken -zl6 : aow3_skin_giants.clb :  7,105,158 -> 2,638,490 =  2.971 bpb =  2.693 to 1
decode           : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s

Oodle BitKnit : 7,105,158 -> 2,623,466 =  2.954 bpb =  2.708 to 1
decode only      : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s

Oodle LZNA : aow3_skin_giants.clb :  7,105,158 -> 2,394,871 =  2.696 bpb =  2.967 to 1
decode           : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s

--------------------

silesia_mozilla

MiniZ : 51,220,480 ->19,141,389 =  2.990 bpb =  2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s

zlib-ng : 51,220,480 ->19,242,520 =  3.005 bpb =  2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s

zstd : malloc failed

brotli-9 : 51,220,480 ->15,829,463 =  2.472 bpb =  3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s

brotli-10 : 51,220,480 ->14,434,253 =  2.254 bpb =  3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s

brotli-11 : 51,220,480 ->14,225,511 =  2.222 bpb =  3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s

Oodle Kraken -zl6 : 51,220,480 ->14,330,298 =  2.238 bpb =  3.574 to 1
decode           : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s

Oodle Kraken -zl7 : 51,220,480 ->14,222,802 =  2.221 bpb =  3.601 to 1
decode           : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s

Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 =  2.076 bpb =  3.853 to 1
decode           : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s

I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.


Here are charts of the above data :

(silesia_mozilla omitted due to lack of zstd results)

(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and percent out of output size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)

log-log speed & ratio :

Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important - (eg. 30 mb/s vs 40 mb/s) because those mean a big difference in time.

I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.

5/17/2016

The Weissman Score

Wikipedia suggests the Weissman score should be

which ignoring constants is just W = r/logT

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :


W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )

or

W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evalue performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :


Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.

The Weissman (decode Core i7-3770 3.4 GHz; 1-256 range) scores on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !


ADD : this post was a not-sure-if-joking. But I actually think it's useful. I find it useful anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell if that tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the pareto frontier and take those. But when iterating an optimization of a parameter you want just a simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for, size-dominant slower compressors should use Weissman 1-256 , for balance of space and super speed use Weissman 1-inf (or 40-800), for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed paramaters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider like latency, storage usage, etc. that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example high speed decoders should always set the upper speed to infinity, which corresponds to the use case that the compressed data might be already resident in RAM so it has zero time to load).

5/15/2016

PS4 Battle : LZ4 vs LZSSE vs Oodle

PS4 is the arena. Three compressors enter. (revised 05-16)

Advantages of the PS4 : consistent well-defined hardware for reproducible testing. Slow platform that game developers care about being fast on. Builds with clang which is the target of choice for some compression libraries that don't build so easily on MSVC.

After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable. To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.

LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe
LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz safing. LZ4's code is much simpler, it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical, it's just so simple, it's way easier to be sure that it's correct)

Conclusion :

Comparing the two Safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_Safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 (PS4 is actually a pretty bad platform for LZ4; there're a variety of issues; see GDC "Taming the Jaguar" slides ; but particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed, sometimes LZ4 was faster.

I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (for example, that's not the case on Windows/x64, it's much closer there). I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.

The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).

On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets worse ratio, sometimes much worse speed. If I was a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.

Full data :

REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!


PS4 clang-3.6.1

------------------

lzt99 :

lz4hc : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s

LZSSE8  : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s

Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s

Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode           : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s

------------------

texture.bc1

lz4hc : 2,188,524 -> 2,068,268 =  7.560 bpb =  1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s

LZSSE8 : 2,188,524 -> 2,111,182 =  7.717 bpb =  1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s

Oodle LZB16 : texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s

------------------

lightmap.bc3

lz4hc : 4,194,332 ->   632,974 =  1.207 bpb =  6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s

LZSSE8 encode    :  4,194,332 ->   684,062 =  1.305 bpb =  6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s

Oodle LZB16 : lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s

------------------

silesia_mozilla

lz4hc : 51,220,480 ->22,062,995 =  3.446 bpb =  2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s

LZSSE8 : 51,220,480 ->22,148,366 =  3.459 bpb =  2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s

Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s

Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 =  3.261 bpb =  2.453 to 1
decode           : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s

------------------

breton.dds

lz4hc :   589,952 ->   116,447 =  1.579 bpb =  5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s

LZSSE8 encode    :  589,952 ->   119,659 =  1.623 bpb =  4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s

Oodle LZB16 : breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s

Oodle LZBLW : breton.dds :    589,952 ->   132,934 =  1.803 bpb =  4.438 to 1
decode           : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s

------------------

d.dds

lz4hc encode     :  1,048,704 ->   656,706 =  5.010 bpb =  1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s

LZSSE8 encode    :  1,048,704 ->   695,583 =  5.306 bpb =  1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s

Oodle LZB16 : d.dds :  d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s

------------------

all_dds

lz4hc : 79,993,099 ->47,848,680 =  4.785 bpb =  1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s

LZSSE8 : 79,993,099 ->47,807,041 =  4.781 bpb =  1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s

Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s

----------

baby_robot_shell.gr2

lz4hc : 58,788,904 ->32,998,567 =  4.490 bpb =  1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s

LZSSE8 : 58,788,904 ->33,201,406 =  4.518 bpb =  1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s

Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 =  4.111 bpb =  1.946 to 1
decode           : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s


After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.


LZB16, PS4 clang-3.6.1

post-fuzz-safing :

    lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
    decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
    
    texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
    decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
    
    lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
    decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
    
    silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
    decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
    
    breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
    decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
    
    d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
    decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
    
    all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
    decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
    
    baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
    decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

pre-fuzz-safing reference :

    lzt99 912.92 mb/s
    texture.bc1 598.61 mb/s
    lightmap.bc3 876.19 mb/s
    silezia_mozilla 794.72 mb/s
    breton.dds 1078.52 mb/s
    d.dds 888.73 mb/s
    all_dds 701.45 mb/s
    baby_robot_shell.gr2 877.81 mb/s

Mild speed penalty on most files.

5/12/2016

Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)


win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :


Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.


Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb

-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio si improved. (most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit). Of course even at -z7 Kraken is way faster than anything else comparable!

old rants