8/20/2016

PNG without ZLib

If you need to send PNG images in a compressed archive, here's a tip.

PNG's are internally compressed with Zlib. When you run another compressor (such as Oodle) on an already-compressed file like PNG, it won't be able to do much with it. It might get a few bytes out of the headers, but typically the space-speed tradeoff decision in Oodle will not think that gain is worth bothering with, so the PNG will just be sent uncompressed.

There are a few reasons why you might want to use an Oodle compressor rather than the Zlib inside PNG. One is to reduce size; some of the Oodle compressors can make the files smaller than Zlib can. Another is for speed, if you use Kraken or Mermaid the decoder is much faster than the Zlib decompression in PNG.

Now obviously if you want the smallest possible lossless image, you should use an image-specific codec like webp-ll , but we will assume here that that isn't an option.

You could of course just decode the PNG to BMP or TGA or some kind of simple sample format, but that is not desirable. For one thing it changes the format, and your end usage loader might be expecting PNG. Your PNG's might be using PNG-specific features like borders or transparency or whatever that is hard to translate to other formats.

But another is that we want the PNG to keep doing its filtering. Filtered image samples from PNG will usually be more compressible by the back-end compressor than the raw samples in a BMP.

The easy way to do this all is just to take an existing PNG and set its ZLib compression level to 0 (just store). You keep all the PNG headers, and you still get the pixel filtering. But the samples are now uncompressed, so the back-end compressor (Oodle or whatever) gets to work on them instead of passing through already-ZLibbed data.


pngcp

pngcp is a utility from the official libpng distribution. It reads & writes a png and can change some options.

Usage for what we want is :


pngcp --level=0 --text-level=0 from.png to.png

I have made a Win32 build with static libs of pngcp for your convenience :

pngcp.zip

I also added a --help option ; run "pngcp --help". The official pngcp seems to have no help or readme at all that explains usage.

I *think* that pngcp preserves headers & options & pixel formats BUT I'M NOT SURE, it's not my code, YMMV, don't go fuck up your pngs without testing it. If it doesn't work - hey you can get pngcp from the official distro and fix it.

I used libpng 1624. The vc7.1 project in libpng worked fine for me. pngcp needed a little bit of de-unixification to build in VC but it was straightforward. You need zlib ; I used 1.2.8 and it worked fine; you need to make a dir named "zlib" at the same level as libpng. I did "mklink /j zlib zlib-1.2.8".

* CAVEAT : this isn't really the way I'd like to do this. pngcp loads the PNG and then saves it out again, which introduces the possibility of losing metadata that was stuffed in the file or just screwing it up somehow. I'd much rather do this conversion without ever actually loading it as an image. That is, take the PNG file as just a binary blob, find the zlib streams and unpack them, store them with a level 0 header, and pass through the PNG headers totally untouched. That would be a much more robust way to ensure you don't lose anything.


cbpngz0

cbpngz0 usage :


cbpngz0 from to

cbpngz0 uses the cblib loaders, so it can load bmp,tga,png,jpeg and so on. It writes a PNG at zlib level 0. Unlike pngcp, cbpngz0 does NOT support lots of weird formats; it only writes 8-bit gray, 24-bit RGB, and 32-bit RGBA. This is not a general purpose PNG zlib level changer!! Nevertheless I find it useful because of the wider range of formats it can load.

cbpngz0.zip

cbpngz0 is an x64 exe and uses the DLLs included.


Some sample results.

I take an original PNG, then try compressing it with Oodle two ways. First, convert it to a BMP and compress the BMP. Second, convert to a Zlib level 0 PNG (the "_z0.png") and then compress with Oodle. The differene between the two is that the _z0.png gets the PNG filters, and of course stays a PNG if that's what your loader expects. If you give the original PNG to Oodle, it passes it through uncompressed.


porsche640.png             529,821

porsche640.bmp             921,654

porsche640.bmp.ooz         711,273

porsche640_z0.png.ooz      508,091

-------------

blinds.png                 328,754

blinds.bmp               1,028,826

blinds.bmp.ooz             193,130

blinds_z0.png.ooz          195,558

-------------

xxx.png                    420,149

xxx.bmp                    915,054

xxx.bmp.ooz                521,861

xxx_z0.png.ooz             409,311

The ooz files are made with Oodle LZNA -z6 (level Optimal2).

You can see there are some big gains possible with replacing Zlib (on "blinds"). On normal photographic continuous tone images Zlib does okay so the gains are small. On those images, compressing the BMP without filters is very bad.


Another small note : if your end usage PNG loader supports the optional MNG format LOCO color transform, that usually helps compression.

ADD : Chris Maiwald points out that he gets better PNG filter choice by using "Z_FIXED" (which is the zlib option for fixed huffman tables instead of per-file huffman). A bit weird, but perhaps it biases the filter choice to be more consistent?

I wonder if choosing a single PNG filter for the whole image would be better than letting PNG do its per-row thing? (to try to make the post-filter residuals more consistent for the back end modeling stage). For max compression you would use something like a png optimizer that tried various filter strategies, but instead of rating them using zlib, rate with the back-end of your choice.

7/29/2016

Scatter Plots

I tweaked my scatter plot generation per Fabian. They're still log-log but now labeled with the non-log axis values so it's easier to read off the actual ratio and speed without doing a pow2 in your head.

Silesia :

Game Test Set :

Seven, total :

Seven, all files :

Slow slow compressors

lzma is really too slow to decode.

total                : Kraken     : 2.914 to 1 : 1053.961 MB/s
total                : lzma       : 3.186 to 1 : 52.660 MB/s

(Win64 Core i7-3770 3.4 GHz)

Kraken is around 20X faster than lzma, but lzma compresses better (about 9%). That's already a tip that something is horribly wrong; you have a 2000% speed difference and a 9% size difference.

If we look at the total time to load compressed from disk + decompress, we can make these speedup factor curves :

At very low disk speeds, the higher compression of lzma provides a speedup over Kraken. But how slow does the disk have to be? You can see the intersection of the curves is between 0 and 1 on the log scale, that's 1-2 MB/s !!

For any disk faster than 2 MB/s , load+decomp is *way* faster with Kraken. At a disk speed of 16 MB/s or so (log scale 4) the full load+decomp for Kraken is around 2X faster than with lzma. And that's still a very slow disk (around optical speed).

Now, this is a speedup factor for load *then* decomp. If you are fully overlapping overlapping IO with decompression, then some of the decode time is hidden.

*But* that also assumes that you have a whole core to give to decompression. And it assumes you have no other CPU to work to do after loading.

The idea that you can hide decompressor time in IO time only works if you have enough independent loads so that there's lots to overlap (because if you don't, then the first IO and last decompress will never overlap anything), and it assumes you have no other CPU work to do.

In theory I absolutely love the idea that you just load pre-baked data which is all ready to go, and you just point at it, so there's no CPU work in loading other than decompression, but in practice that is almost never the case. eg. for loading compressed web pages, there's tons of CPU work that needs to be ton to parse the HTML or JS or whatever, so the idea that you can hide the decompressor time in the network latency is a lie - the decompressor time adds on to the later processing time and adds directly onto total load latency.

The other factor that people often ignore is the fact that loading these days is heterogeneous.

What you actually encounter is something like this :


Download from internet ~ 1 MB/s
Load from optimal disc ~ 20 MB/s
Load from slow HDD ~ 80 MB/s
Fast SSD ~ 500 MB/s
NVMe drive on PCIe ~ 1-2 GB/s
Load from cache in RAM ~ 8 GB/s

We have very heterogeneous loading - even for a single file loaded by the same application.

The first time you load it, maybe you download from the internet, and in that case a slow decompressor like lzma might be best. But the next time you load it's from the cache in RAM. And the time after that it's from HDD. In those cases, using lzma is a disaster (in the sense that the loading is now nearly instant, but you spend seconds decoding; or in the sense that just loading uncompressed would have been way faster).

One issue that I think is not considered is that making the right choice in the slow-disk zone is not that big of a deal. On a 1 MB/s disk, the difference in "speedup" between lzma and Kraken is maybe 2% in favor of lzma. But on a 100 MB/s it's something like 400% in favor of Kraken.

Now in theory maybe it would be nice to have different compressors for download & disk storage; like you use something like lzma for downloadable, and then decode and re-encode to ZStd for HDD loading. In practice nobody does that and the advantage over just using ZStd all the time is very marginal.

Also in theory it would be nice if the OS cache would cache the decompressed data rather than caching the compressed data.


TODO : time lzma on PS4. Usually PS4 is 2-4X slower than my PC, so that puts lzma somewhere in the 10-25 mb/s range, which is very borderline for keeping up with the optical disc.


DVD 16x is ~ 20 MB/s (max)
PS4 Blu-Ray is 6x ~ 27 MB/s (max)

PS4 transparently caches Blu-Ray to HDD

Of course because of the transparent caching to HDD, if you actually keep files in lzma on the disc, and they are cached to HDD, loading them from HDD is a huge mismatch and makes lzma the bottleneck.

That is, in practice on the PS4 when you load files from disc, they are sometimes actually coming from the HDD transparent cache, so you sometimes get 20 MB/s speeds, and sometimes 100 MB/s.


Now of course we'd love to have a higher-ratio compressor than Kraken which isn't so slow. Right now, we just don't have it. We have Kraken at 1000 MB/s , LZNA at 120 MB/s , lzma at 50 MB/s - it's too big of a step down in speed, even for LZNA.

In order for the size gain of lzma/LZNA to be worth it, it needs to run a *lot* faster, more like 400 mb/s. There needs to be a new algorithmic step in the high compress domain to get there.

At the moment the only reason to use the slower decoders than Kraken is if you simply must get smaller files and damn the speed; like if you have a downloadable app size hard limit and just have to fit in 100 MB, or if you are running out of room on an optical disc or whatever.

7/27/2016

Improving the compression of block-compressed textures

I'm often asked how to get more compression on block-compressed textures. (and was asked again today, hence writing this)

If you're working from existing BCn data (not RGB originals), there's not a lot you can do. One small thing you can do is to de-interleave the end points and indexes.

In BC1/DXT1 for example each block is 4 bytes of end points then 4 bytes of index data. You can take each alternating 4 bytes out and instead put all the end points together, then all the index data together. Sometimes this improves compression, sometimes it doesn't, it depends on the compressor and the data. When it does help it's in the 5-10% range.

If you're using mips, then you can convert the BCn endpoint colors into deltas from the parent mip. (that will only be a good idea if your BCn encoder is one that is *not* aggressive about finding endpoints outside of the original image colors)

If you have original RGB data, you can make new BCn data that's designed to be more compressible. This opens up a whole new world of powerful possibilities : R/D decisions in the BCn encoding.

There are some obvious basic ideas, like - instead of aggressive end point searches, only choose end points that occur or are near the colors in the block; try to reuse previous index dwords rather than make new ones; try to use completely flat color blocks with the whole index dword = 0, etc.

See for example Jon Olick's articles on doing this .

But unless you really need to do it manually for some reason you should probably just use Rich Geldreich's cool tool crunch .

Crunch can make its own "crn" compressed texture format, which you would need to load with the crn-lib. crn-lib would decode the crn file back to BC1 at load time. That may be an awesome thing to do, I really can't comment because I haven't looked into it in detail.

Let's assume for the moment that you don't want to use the "crn" custom compressed format. You just want DDS or raw BCn data that you will run through your normal compression pipeline (Oodle or whatever). Crunch can also make BCn data that's more compressible by reducing its complexity, choosing encodings that are lower quality but less complex.

You tell it to make BCn and output DDS and you can specify a "quality". Then when you run it through your back-end compressor you get smaller files :


lena 512x512 RGB (absolutely terrible as a representative game texture)

DXT1 DDS quality 255

 rmse : 7.6352 psnr : 30.5086

Oodle LZNA   :   131,200 ->   102,888 =  6.274 bpb =  1.275 to 1
Oodle Kraken :   131,200 ->   107,960 =  6.583 bpb =  1.215 to 1

DXT1 DDS quality 200

 rmse : 8.2322 psnr : 29.8547

Oodle LZNA   :   131,200 ->    80,264 =  4.894 bpb =  1.635 to 1
Oodle Kraken :   131,200 ->    85,268 =  5.199 bpb =  1.539 to 1

CRN quality 255

 rmse : 8.2699 psnr : 29.8150
   crunched.crn                74,294

DXT1 DDS quality 128

 rmse : 9.0698 psnr : 29.0131

Oodle LZNA   :   131,200 ->    62,960 =  3.839 bpb =  2.084 to 1
Oodle Kraken :   131,200 ->    66,628 =  4.063 bpb =  1.969 to 1

CRN quality 160

rmse : 9.0277 psnr : 29.0535

crunched.crn                53,574

DXT1 DDS quality 80

 rmse : 10.2521 psnr : 27.9488

Oodle LZNA   :   131,200 ->    50,983 =  3.109 bpb =  2.573 to 1
Oodle Kraken :   131,200 ->    54,096 =  3.299 bpb =  2.425 to 1

CRN quality 100

 rmse : 10.1770 psnr : 28.0126

 crunched.crn               41,533

So going from rmse 7.64 to 10.26 we reduced the Oodle-compressed DDS files to about half their size! Pretty cool.

The CRN format files are even smaller at equal error. (unfortunately the -quality setting is not a direct comparison, you have to hunt around to find qualities that give equal rmse to do a comparison).

For my reference :


@echo test.bat [file] [quality]
@echo quality in 1-255 optional 128 default
set file=%1
if "%file%"=="" end.bat 
set q=%2
if "%q%"=="" set q=128
crunch_x64.exe -DXT1 -fileformat dds -file %file% -maxmips 1 -quality %q% -out crunched.dds
@REM -forceprimaryencoding ??

@REM verified same :
@REM radbitmaptest64 copy crunched.dds crunched_dds_un.bmp
crunch_x64.exe -R8G8B8 -file crunched.dds -out crunched_dds_un.bmp -fileformat bmp

crunch_x64.exe -DXT1 -file %file% -maxmips 1 -quality %q% -out crunched.crn

crunch_x64.exe -R8G8B8 -file crunched.crn -out crunched_crn_un.bmp -fileformat bmp

call bmp mse %file% crunched_dds_un.bmp
@echo --------------------------

call ooz crunched.dds -zc7 -zl8
call ooz crunched.dds -zc8 -zl8

call bmp mse %file% crunched_crn_un.bmp
@echo --------------------------

call d crunched.crn

7/26/2016

Oodle 2.3.0 Thread-Phased Performance

Thread-Phased (two threads) decoding can be used with Mermaid just as it was with Kraken.

It's not as big a benefit on Mermaid as it is for Kraken. (Kraken gets around 1.5X, Mermaid much less).

There is a little neat thing about thread-phased Mermaid though, and that's the ability to run Mermaid+ at almost the exact same speed as Mermaid.

Mermaid+ is a hybrid option that's hidden in Oodle 2.3.0 and will be exposed in the next release. It gets compression between Mermaid & Kraken, with a small speed hit vs Mermaid.


Seven :

ooSelkie    :  2.19:1 ,    3.0 enc MB/s , 3668.0 dec MB/s
ootp2Mermaid:  2.46:1 ,    2.3 enc MB/s , 3046.1 dec MB/s
lz4hc       :  2.00:1 ,   12.8 enc MB/s , 2532.4 dec MB/s
ooMermaid   :  2.46:1 ,    2.3 enc MB/s , 2364.4 dec MB/s
ootp2Kraken :  2.91:1 ,    2.6 enc MB/s , 1660.9 dec MB/s
ooKraken    :  2.91:1 ,    2.6 enc MB/s , 1049.6 dec MB/s
zlib9       :  2.33:1 ,    7.9 enc MB/s ,  315.1 dec MB/s

ootp2Merm+  :  2.64:1 ,    2.3 enc MB/s , 3042.4 dec MB/s
ooMermaid+  :  2.64:1 ,    2.3 enc MB/s , 2044.5 dec MB/s

Silesia :

ooSelkie    :  3.05:1 ,    1.3 enc MB/s , 2878.4 dec MB/s
ootp2Mermaid:  3.57:1 ,    1.1 enc MB/s , 2600.8 dec MB/s
lz4hc       :  2.72:1 ,   13.6 enc MB/s , 2273.5 dec MB/s
ooMermaid   :  3.57:1 ,    1.1 enc MB/s , 1994.2 dec MB/s
ootp2Kraken :  4.08:1 ,    1.2 enc MB/s , 1434.4 dec MB/s
ooKraken    :  4.08:1 ,    1.2 enc MB/s , 1000.5 dec MB/s
zlib9       :  3.13:1 ,    8.3 enc MB/s ,  358.4 dec MB/s

ootp2Merm+  :  3.58:1 ,    1.1 enc MB/s , 2583.9 dec MB/s
ooMermaid+  :  3.58:1 ,    1.1 enc MB/s , 1986.8 dec MB/s

Game Test Set :

ooSelkie    :  2.03:1 ,    3.3 enc MB/s , 4548.0 dec MB/s
lz4hc       :  1.78:1 ,   14.0 enc MB/s , 3171.1 dec MB/s
ootp2Mermaid:  2.28:1 ,    2.7 enc MB/s , 3099.4 dec MB/s
ooMermaid   :  2.28:1 ,    2.7 enc MB/s , 2622.8 dec MB/s
ootp2Kraken :  2.57:1 ,    2.9 enc MB/s , 1812.7 dec MB/s
ooKraken    :  2.57:1 ,    2.9 enc MB/s , 1335.9 dec MB/s
zlib9       :  1.99:1 ,    8.3 enc MB/s ,  337.2 dec MB/s

ootp2Merm+  :  2.35:1 ,    2.7 enc MB/s , 3034.9 dec MB/s
ooMermaid+  :  2.35:1 ,    2.7 enc MB/s , 2409.0 dec MB/s

Pulling out just the relevant numbers on Seven, you can see Mermaid+ is between Mermaid and Kraken, but thread-phased it runs at full Mermaid speed :

Seven :

ooMermaid   :  2.46:1 ,    2.3 enc MB/s , 2364.4 dec MB/s
ooMermaid+  :  2.64:1 ,    2.3 enc MB/s , 2044.5 dec MB/s
ooKraken    :  2.91:1 ,    2.6 enc MB/s , 1049.6 dec MB/s

ootp2Mermaid:  2.46:1 ,    2.3 enc MB/s , 3046.1 dec MB/s
ootp2Merm+  :  2.64:1 ,    2.3 enc MB/s , 3042.4 dec MB/s
ootp2Kraken :  2.91:1 ,    2.6 enc MB/s , 1660.9 dec MB/s

7/25/2016

Introducing Oodle Mermaid and Selkie

I'm pleased to announce the release of two new compressors in Oodle 2.3.0 : Mermaid and Selkie.

Mermaid and Selkie are the super-fast-to-decode distant relatives of Kraken. They use some of the same ideas and technology as Kraken, but are independent compressors targetted at even higher speed and lower compression. Mermaid & Selkie make huge strides in what's possible in compression in the high-speed domain, the same way that Kraken did in the high-compression domain.

Mermaid is about twice as fast as Kraken, but with compression around Zlib levels.

Selkie is one of the fastest decompressors in the world, and also gets much more compression than other very-high-speed compressors.

( Oodle is my data compression library that we sell at RAD Game Tools , read more about it there )

Kraken, Mermaid, and Selkie all use an architecture that makes space-speed decisions in the encoder to give the best tradeoff of compressed size vs decoding speed. The three compressors have different performance targets and make decisions suited for each one's usage domain (Kraken favors more compression and will give up some speed, Selkie strongly favors speed, Mermaid is in between).

For detailed information about the new Mermaid and Selkie I've written a series of posts :

cbloom rants Introducing Oodle Mermaid and Selkie
cbloom rants Oodle 2.3.0 All Test Sets
cbloom rants Oodle 2.3.0 ARM Report
cbloom rants Oodle Mermaid and Selkie on PS4
cbloom rants Oodle Mermaid
cbloom rants Oodle Selkie
RAD Game Tools - Oodle Network and Data Compression

Here are some representative numbers on the Silesia test set : (sum of time and size on individual files)


Oodle 2.3.0 Silesia -z6

Kraken     : 4.082 to 1 : 999.389 MB/s
Mermaid    : 3.571 to 1 : 2022.038 MB/s
Selkie     : 3.053 to 1 : 2929.770 MB/s

zstdmax    : 4.013 to 1 : 468.497 MB/s
zlib9      : 3.128 to 1 : 358.681 MB/s
lz4hc      : 2.723 to 1 : 2267.021 MB/s

on Win64 (Core i7-3770 3.4 GHz)

On Silesia, Mermaid is 5.65X faster to decode than zlib, and gets 14% more compression. Selkie is 1.3X faster to decode than LZ4 and gets 12% more compression.

Charts on Silesia total : (charts show time and size - lower is better!)

And the speedup chart on Silesia, which demonstrates the space-speed efficiency of a compressor in different usage domains.

Kraken was a huge step in the Pareto frontier that pushed the achievable speedup factor way up beyond what other compressers were doing. There's a pre-Kraken curve where we thought the best possible tradeoff existed, that most other compressors in the world roughly lie on (or under). Kraken set a new frontier way up on its own with nobody to join it; Mermaid & Selkie are the partners on that new curve that have their peaks at higher speeds than Kraken.

You can also see this big jump of the new family very easily in scatter plots, which we'll see in later posts .

Oodle Mermaid and Selkie on PS4

The PS4 is a lovely platform to benchmark on because it's standard. It's also very easy to run tests on. Performance of these compressors on the Xbox One is extremely similar, small variatation due to clock rate (1.75 vs 1.6 GHz) and compiler (MSVC vs clang/llvm).

Everything is slow on the PS4 in absolute terms (it's a slow chip and difficult to optimize for). The Oodle compressors do very well, even better in relative terms on PS4 than on typical PC's.

Kraken is usually around ~2X faster than ZStd on PC's, but is 3X faster on PS4. Mermaid is usually just slightly slower than LZ4 on PC's, but is solidly faster than LZ4 on PS4.


lzt99 :

Kraken     : 2.477 to 1 : 390.582 MB/s
Mermaid    : 2.279 to 1 : 749.896 MB/s
Selkie     : 1.937 to 1 : 1159.064 MB/s

zstd       : 2.374 to 1 : 133.498 MB/s
miniz      : 1.883 to 1 : 85.654 MB/s
lz4hc-safe : 1.669 to 1 : 673.616 MB/s
LZSSE8     : 1.626 to 1 : 767.106 MB/s

Mermaid is faster than LZ4 on PS4 !! Wow! And the compression level is in a totally different domain than other super-fast decompressors like LZ4 or LZSSE.

lzt99 is a good case for Selkie & Mermaid. Selkie beats zlib compression ratio while being 75% faster than LZ4.

All compressors here are fuzz-safe, and run in safe mode if they have optional safe/unsafe modes.

Charts : (showing time and size - lower is better!)

lzt99 :

the raw data :


PS4 : Oodle 230 : (-z6)

inName : lzt:/lzt99

reference :

miniz      : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 288.379 millis, 18.61 c/b, rate= 85.65 mb/s

zstd       : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 185.028 millis, 11.94 c/b, rate= 133.50 mb/s

lz4hc      : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 36.669 millis, 2.37 c/b, rate= 673.62 mb/s

LZSSE8     : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
decode_time      : 32.200 millis, 2.08 c/b, rate= 767.11 mb/s

Oodle :

Kraken     : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.241 millis, 4.08 c/b, rate= 390.58 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

Selkie : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1
decode           : 21.311 millis, 1.38 c/b, rate= 1159.06 mb/s

BTW for reference, the previous best compressor in Mermaid's domain was LZNIB. Before these new compressors, LZNIB was quite unique in that it got good decode speeds, much faster than the LZ-Huffs of the time (eg. 3X faster than ZStd) but with compression usually better than ZLib. Well, LZNIB is still quite good compared to other competition, but it's just clobbered by the new Oceanic Bestiary compressors. The new compressor in this domain is Mermaid and it creams LZNIB for both size and speed :


LZNIB -z6  : 24,700,820 ->12,015,591 =  3.892 bpb =  2.056 to 1 
decode           : 58.710 millis, 3.79 c/b, rate= 420.73 mb/s

Mermaid    : 24,700,820 ->10,838,455 =  3.510 bpb =  2.279 to 1
decode           : 32.939 millis, 2.13 c/b, rate= 749.90 mb/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle Mermaid

Mermaid is a new compressor with a unique balance of space and speed. Mermaid is very close to LZ4 decode speeds, while usually beating Zlib compression ratios.

There's really nothing even close. It's way beyond what was previously thought possible.

Mermaid supports unbounded distance match references. This is part of how it gets such high compression. It does so in a new way which reduces the speed penalty normally incurred by large-window LZ's.

Mermaid almost always compresses better than ZLib. The only exception is on small files, less than 32k or so. The whole Oceanic Bestiary family is best suited to files over 64k. They work fine on smaller files, but they lose their huge advantage. It's always best to combine small files into larger units for compression, particularly so with these compressors.

There's not really any single compressor to compare Mermaid to. What we can do is compare vs. Zlib's compression ratio and LZ4's speed. A kind of mythological hybrid like a Chimera, the head of a Zlib and the legs of an LZ4.

Tests on Win64 (Core i7-3770 3.4 GHz) :

Silesia :

On Silesia, Mermaid is just slightly slower than LZ4 but compresses much more than Zlib !!

PD3D :

On PD3D, Mermaid gets almost exactly the compression level of ZLib but the decode speed of LZ4. Magic! It turns out you *can* have your cake and eat it too.

Game Test Set :

lzt99 :

Mermaid really compresses well on lzt99 ; not only does it kill Zlib, it gets close to high compression LZ-Huffs like RAR. (RAR gets 10826108 , Mermaid 10838455 bytes).

Seven :

Because of the space-speed optimizing nature of Mermaid, it will make decisions to be slower than LZ4 when it can find big compression gains. For example if you look at the individual files of the "Seven" test set below - Mermaid is typically right around the same speed as LZ4 or even faster (baby7,dds7,exe7,game7,wad7 - all same speed or faster than LZ4). On a few files it decides to take an encoding slower to decode than LZ4 - model7,enwik7, and records7. The biggest differences are enwik7 and records7, but if you look at the compression ratios - those are all the files where it found huge size differences over LZ4. It has an internal exchange rate for time vs. bytes that it must meet in order to take that encoding, trying to optimize for its space-speed target usage.

Seven files :


Silesia              : Mermaid    : 3.571 to 1 : 2022.038 MB/s
Silesia              : lz4hc      : 2.723 to 1 : 2267.021 MB/s
Silesia              : zlib9      : 3.128 to 1 : 358.681 MB/s

GameTestSet          : Mermaid    : 2.284 to 1 : 2718.095 MB/s
GameTestSet          : lz4hc      : 1.776 to 1 : 3226.887 MB/s
GameTestSet          : zlib9      : 1.992 to 1 : 337.986 MB/s

lzt99                : Mermaid    : 2.279 to 1 : 2283.278 MB/s
lzt99                : lz4hc      : 1.669 to 1 : 2605.366 MB/s
lzt99                : zlib9      : 1.883 to 1 : 309.304 MB/s

PD3D                 : Mermaid    : 2.875 to 1 : 2308.830 MB/s
PD3D                 : lz4hc      : 2.238 to 1 : 2369.666 MB/s
PD3D                 : zlib9      : 2.886 to 1 : 382.349 MB/s

Seven                : Mermaid    : 2.462 to 1 : 2374.212 MB/s
Seven                : lz4hc      : 2.000 to 1 : 2521.296 MB/s
Seven                : zlib9      : 2.329 to 1 : 315.370 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle Selkie

Selkie is the faster cousin of Mermaid, distant relative of Kraken.

Selkie is all about decode speed, it aims to be the fastest mainstream decompressor in the world, and still gets more compression than anything in the high-speed domain.

Selkie does not currently have a super fast encoder. It's got good optimal parse encoders that produce carefully tuned encoded file which offer excellent space-speed tradeoff.

The closest compressors to Selkie are the fast byte-wise small-window coders like LZ4 and LZSSE (and Oodle's LZB16). These are all just obsolete now (in terms of ratio vs decode speed), Selkie gets a lot more compression (sometimes close to Zlib compression levels!) and is also much faster.

Selkie will not compress tiny buffers, or files that only compress a little bit. For example if you give Selkie something like an mp3, it might be able to compress it to 95% of its original size, saving a few bytes. Selkie will refuse to do that and just give you the original uncompressed file. If you wanted that compression, that means you wanted to save only a few bytes at a large time cost, which means you don't actually want a fast compressor like Selkie. You in fact wanted a compressor that was more willing to trade time for bytes, such as Mermaid or Kraken. Selkie will not abide logical inconsistency.

Selkie generally beats LZ4 compression even on small files (under 64k) but really gets ahead on files larger than 64k where the unbounded match distances can find big wins.

As usual, I'm not picking on LZ4 here because it's bad; I'm comparing to it because it's the best of the rest, and it's widely known. Both decompressors are run fuzz-safe.


Tests on Win64 (Core i7-3770 3.4 GHz) : 
(for reference, this machine runs memcpy at roughly 8 GB/s)
(total of time & size on each test set)

gametestset : ooSelkie    : 143,579,361 ->70,716,380 =  3.940 bpb =  2.030 to 1 
gametestset : decode      : 29.239 millis, 0.69 c/b, rate= 4910.61 mb/s

gametestset : lz4hc       : 143,579,361 ->80,835,018 =  4.504 bpb =  1.776 to 1 
gametestset : decode      : 44.495 millis, 1.05 c/b, rate= 3226.89 mb/s

pd3d : ooSelkie    : 31,941,800 ->13,428,298 =  3.363 bpb =  2.379 to 1 
pd3d : decode      : 8.381 millis, 0.89 c/b, rate= 3811.29 mb/s

pd3d : lz4hc       : 31,941,800 ->14,273,195 =  3.575 bpb =  2.238 to 1 
pd3d : decode      : 13.479 millis, 1.44 c/b, rate= 2369.67 mb/s

seven : ooSelkie    : 80,000,000 ->36,460,084 =  3.646 bpb =  2.194 to 1 
seven : decode      : 21.458 millis, 0.91 c/b, rate= 3728.26 mb/s

seven : lz4hc       : 80,000,000 ->39,990,656 =  3.999 bpb =  2.000 to 1 
seven : decode      : 31.730 millis, 1.35 c/b, rate= 2521.30 mb/s

silesia : ooSelkie    : 211,938,580 ->69,430,966 =  2.621 bpb =  3.053 to 1 
silesia : decode      : 72.340 millis, 1.16 c/b, rate= 2929.77 mb/s

silesia : lz4hc       : 211,938,580 ->77,841,566 =  2.938 bpb =  2.723 to 1 
silesia : decode      : 93.488 millis, 1.50 c/b, rate= 2267.02 mb/s

The edge that Selkie has over LZ4 is even greater on more difficult platforms like the PS4.

To get a better idea of the magic of Selkie it's useful to look at the other Oodle compressors that are similar to Selkie.

LZB16 is Oodle's LZ4 variant; it gets slightly more compression and slightly more decode speed, but they're roughly equal. It's included here for comparison to LZBLW.

Oodle's LZBLW is perhaps the most similar compressor to Selkie. It's like LZB16 (LZ4) but adds large-window matches. That ability to do long-distance matches hurts speed a tiny bit (2873 mb/s -> 2596 mb/s), but helps compression a lot.

Oodle's LZNIB is nibble-wise, with unbounded offsets and a rep match. It gets good compression, generally better than Zlib, with speed much higher than any LZ-Huff. LZNIB is in a pretty unique space speed tradeoff zone without much competition outside of Oodle.


lz4hc     : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1 
decode    : 9.481 millis, 1.31 c/b, rate= 2605.37 mb/s

ooLZB16   : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1
decode    : 8.597 millis, 1.18 c/b, rate= 2873.17 mb/s

ooLZNIB   : 24,700,820 ->12,014,368 =  3.891 bpb =  2.056 to 1
decode    : 17.420 millis, 2.40 c/b, rate= 1417.93 mb/s

ooLZBLW   : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode    : 9.512 millis, 1.31 c/b, rate= 2596.80 mb/s

ooSelkie  : 24,700,820 ->12,752,506 =  4.130 bpb =  1.937 to 1 
decode    : 6.410 millis, 0.88 c/b, rate= 3853.57 mb/s

LZNIB and LZBLW were both pretty cool before Selkie, but now they're just obsolete.

LZBLW gets a nice compression gain over LZB16, but Selkie gets even more, and is way faster!

LZNIB beats Selkie compression, but is way slower, around 3X slower, in fact it's slower than Mermaid (2283.28 mb/s and compresses to 10,838,455 = 3.510 bpb = 2.279 to 1).

You can see from the curves that Selkie just completely covers the curves of LZB16,LZBLW, and LZ4. When a curve is completely covered like that, it means that it was beaten for both space and speed, so there is no domain where that compressor is ever better. LZNIB just peeks out of the Selkie curve because it gets higher compression (albeit at lower speed), so there is a domain where it is the better choice - but in that domain Mermaid just completely dominates LZNIB, so it too is obsolete.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 ARM Report

I prepared a detailed report of Oodle's performance on ARM mobile devices (Android and iOS).

The full report is here :

oodle_arm_report on cbloom.com

It's a thorough test on many devices and several corpora. See the full details there.

Cliff notes is : Oodle's great on ARM.

For example on the iPadAir2 64-bit , on Silesia :

We found that the iOS devices are generally very good and easy to program for. They're more like desktop Intel chips; they don't have any terrible performance cliffs. The Android ARM devices we tested on were rather more difficult. For one thing they have horrible thermal saturation problems that makes testing on them very difficult. They also have some odd performance quirks.

I'm sure we could get a lot more speed on ARM, but it's rather nasty to optimize for. For one thing the thermal problems mean that iterating and getting good numbers is a huge pain. It's hard to tell if a change helped or not. For another, there's a wide variety of devices and it's hard to tell which to optimize for, and they have different performance shortfalls. So there's definitely a lot left on the table here.

Mermaid & Selkie are quite special on ARM. Many of these devices have small caches (as small as 512k L2) and very slow main memory (slow wrst latency; they often have huge bandwidth, but latency is what I need). Mermaid & Selkie are able to use unbounded windows for LZ without suffering a huge speed hit, due to the unique way they are structured. Kraken doesn't have the same magic trick so it benefits from a limited window, as demonstrated in the report.

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

Oodle 2.3.0 All Test Sets

Putting the total performance on various testsets together in one place. Tests on Win64 Core i7-3770 3.4 GHz as usual.

Showing speed & ratio here, higher is better.

As usual the total on a test set is total size of all individually compressed files, and total time.

I think the scatter plot most clearly shows the way Kraken, Mermaid & Selkie are just on a whole new Pareto Frontier than the older compressors. You can connect the dots of K-M-S performance for each test set and they form a very consistent space-speed tradeoff curve that's way above the previous best.

The raw numbers :

gametestset          : Kraken     : 2.566 to 1 : 1363.283 MB/s
gametestset          : Mermaid    : 2.284 to 1 : 2711.458 MB/s
gametestset          : Selkie     : 2.030 to 1 : 4870.413 MB/s
gametestset          : lz4hc      : 1.776 to 1 : 3223.279 MB/s
gametestset          : zlib9      : 1.992 to 1 : 338.063 MB/s
gametestset          : lzma       : 2.756 to 1 : 43.782 MB/s

pd3d                 : Kraken     : 3.647 to 1 : 1072.833 MB/s
pd3d                 : Mermaid    : 2.875 to 1 : 2299.860 MB/s
pd3d                 : Selkie     : 2.379 to 1 : 3784.850 MB/s
pd3d                 : lz4hc      : 2.238 to 1 : 2370.193 MB/s
pd3d                 : zlib9      : 2.886 to 1 : 382.226 MB/s
pd3d                 : lzma       : 4.044 to 1 : 63.878 MB/s

seven                : Kraken     : 2.914 to 1 : 1053.961 MB/s
seven                : Mermaid    : 2.462 to 1 : 2374.796 MB/s
seven                : Selkie     : 2.194 to 1 : 3717.074 MB/s
seven                : lz4hc      : 2.000 to 1 : 2522.824 MB/s
seven                : zlib9      : 2.329 to 1 : 315.344 MB/s
seven                : lzma       : 3.186 to 1 : 52.660 MB/s

silesia              : Kraken     : 4.082 to 1 : 1004.014 MB/s
silesia              : Mermaid    : 3.571 to 1 : 2002.079 MB/s
silesia              : Selkie     : 3.053 to 1 : 2889.536 MB/s
silesia              : lz4hc      : 2.723 to 1 : 2269.788 MB/s
silesia              : zlib9      : 3.128 to 1 : 358.593 MB/s
silesia              : lzma       : 4.369 to 1 : 78.655 MB/s

See the index of this series of posts for more information : Introducing Oodle Mermaid and Selkie .
For more about Oodle visit RAD Game Tools

7/24/2016

Oodle 2.3.0 : Kraken Improvement

Oodle 2.3.0 includes some pretty solid improvements to Kraken. The result is around a 10% gain in decode speed.

There were two major factors in the gains. One was just some more time optimizing some inner loops (including some new super-tight pathways from Fabian).

The other was more rigorous analysis of the space-speed tradeoff decisions inside Kraken. One of the fundamental things that makes Kraken work is the fact that it consider space-speed when making its internal decisions, but before 230 those decisions were made in a rather ad-hoc way. Making those decisions better means that even with the same decoder, the new encoder is able to create files that are the same size but decode faster.

The tradeoff point (technically, the lagrange lambda, or the exchange rate from time to bytes) that's used by Oodle to make space-speed decisions is exposed to the client in the OodleLZ_CompressOptions so you can adjust it to bias for compression or decode speed. Each compressor sets what I believe to be a reasonable default for its usage domain, so adjustments to this value should typically be small (you can't massively change behavior with it; Kraken won't start arithmetic coding things if you set the tradeoff really small, for example, there's a small window where the compressor works well and you can just bias sightly within that window).

Some dry numbers for reference :

On PS4 :


Oodle 230 Kraken -zl4 : 24,700,820 ->10,377,556 =  3.361 bpb =  2.380 to 1
decode only      : 65.547 millis, 4.23 c/b, rate= 376.84 mb/s

Oodle 230 Kraken -zl6 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode           : 63.453 millis, 4.09 c/b, rate= 389.28 mb/s

Oodle 230 Kraken -zl7 : 24,700,820 -> 9,734,771 =  3.153 bpb =  2.537 to 1
decode           : 67.915 millis, 4.38 c/b, rate= 363.70 mb/s

Oodle 220 Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle 220 Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle 220 Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

On Win64 (Core i7-3770 3.4 GHz) :


Oodle 2.3.0 :

Silesia Kraken -z6

total   : 211,938,580 ->51,918,269 =  1.960 bpb =  4.082 to 1
decode           : 210.685 millis, 3.38 c/b, rate= 1005.95 mb/s
Weissman 1-256 : [8.575]

mozilla : 51,220,480 ->14,410,181 =  2.251 bpb =  3.554 to 1
decode only      : 51.280 millis, 3.41 c/b, rate= 998.83 mb/s

lzt99 : 24,700,820 -> 9,970,882 =  3.229 bpb =  2.477 to 1
decode only      : 20.943 millis, 2.89 c/b, rate= 1179.44 mb/s

win81 : 104,857,600 ->38,222,311 =  2.916 bpb =  2.743 to 1
decode only      : 108.344 millis, 3.52 c/b, rate= 967.82 mb/s

Oodle 2.2.0 :

Silesia Kraken -z6

total : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1
decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s
Weissman 1-256 : [8.431]

"silesia_mozilla"

Kraken 230  :  3.55:1 ,   998.8 dec mb/s
Kraken 220  :  3.60:1 ,   896.5 dec mb/s
Kraken 215  :  3.51:1 ,   928.0 dec mb/s

"lzt99"

Kraken 230  :  2.48:1 ,   998.8 dec mb/s
Kraken 220  :  2.53:1 ,   912.0 dec mb/s
Kraken 215  :  2.46:1 ,   957.1 dec mb/s

"win81"

Kraken 230  :  2.74:1 ,   967.8 dec mb/s
Kraken 220  :  2.77:1 ,   818.0 dec mb/s
Kraken 215  :  2.70:1 ,   877.0 dec mb/s

NOTE : Oodle 2.3.0 Kraken data cannot be read by Oodle 2.2.0 or earlier. Oodle 230 can load all old Oodle data (new versions of Oodle can always load all data created by older versions). If you need to make data that be loaded with an older version using Oodle 230, then you can set the minimum decoder version to something lower (by default it's the current version). Contact Oodle support for details.

Some of the biggest gains were found on ARM, which I'll post about more in the future.

6/09/2016

Fundamentals of Modern LZ : Two-Step Parse


06-09-16 | Fundamentals of Modern LZ : Two-Step Parse

For some reason I feel like writing a note on this today.

A two-step parse is an enhancement to a forward-arrivals parse.

(background : forward-arrivals parse stores the minimum cost from head at each position, along with information on the path taken to get there. At each pos P, it takes the best incoming arrival and considers all ways to go further into the parse (literal/match/rep/etc.). At each destination point it stores arrival_cost[P] + step cost. In simple cases (no carried state, no entropy coding, like LZSS) the forward-arrivals parse is a perfect parse just like the backward dynamic-programming parse. In modern LZ with carried state such as a rep set or markov state, the forward parse is preferable.)

A two-step parse extends the standard forward-arrivals parse by being able to store an arrival from a single coding step, or from two coding steps. The standard usage (as in LZMA/7zip) is to be able to store a two-step arrival from the sequence { normal match, some literals, rep match }. This multi-step arrival is stored with the cost of the whole sequence at the end point of the sequence.

If you stored *all* arrivals (not just the cheapest), you would not need two-step parse. You could just store the first step, and then when your parse origin point advanced to the end of the first step, it would find the second step and be able to choose it as an option.

But obviously you don't store all arrivals at each position, since the number would massively explode, even with reduction by symmetries. (see, eg. previous articles on A* parse)

The problem arises when you have a cheap multi-step sequence, but the first step is expensive. Then the first step might be replaced (or never filled in the first place) and the parse will not be able to find the second step cheap option.

Let's consider a concrete example for clarity.


Parser is at pos P consider all ways to continue

At pos P there's a length 4 normal match available at offset O

It stores an arrival at [P+4] that's rather expensive

(because it has to send offset O).

At pos P+1 the parser finds a length 3 rep match

The exit from (P+1) length 3 also lands at [P+4]

This is a cheaper way to arrive at [P+4] , so the previous arrival from P via O is replaced

When the parser reaches P+4 it sees the incoming arrival as
begin a rep match match from P+1


But we missed something !

At pos P+5 (one step after the arrival) there are 2 bytes that match at offset O

if we had chosen the normal match to arrive at P+4 , we could now code a rep match

but we lost it, so we don't see the rep as an option.


Two-step to the rescue!

Back at pos P , we consider the one-step arrival :

{match len 4, offset O} to arrive at P+4

We also look after the end of that for cheap rep matches and find one.

So we store a two-step arrival :

{match len4, offset O, 1 literal, rep len 2} to arrive at P+7

Now at pos P+1 the arrival at P+4 is stomped

but the arrival at P+7 remains!  So we are able to find that in the future.


The options look like :

   P   P+4
   V   V
1. MMMMLRR
2. LRRRLLL

Option 2 is cheaper at P+4
but Option 1 is cheaper at P+7

This is the primary application of two-step parse.

It's a (very limited) way of finding non-local minima in the parse search space.

The other option is "multi-parse" that stores multiple arrivals at each position (something like 4 is typical). Multi-parse and two-step provide diminishing returns when used together, so they usually aren't. Two-step is generally a faster way and provides more win per CPU time, multi-parse is able to find longer-range non-local-minimum moves and so provides more compression.

All good modern LZ's need some kind of non-local-minimum parse, because to get into a good state for the future (typically by getting the right offset into the rep offset cache) you may need to make a more expensive initial step.

5/27/2016

PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

(see charts at the bottom)

Everything run at max compression options, level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar , x64.

I'm going to omit encode speeds on the per-file results for simplicity, these are pretty representative :


aow3_skin_giants.clb :
zlib-ng encode   : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode     : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode      : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9  encode    : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode    : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode    : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s

Results :

PS4 clang-3.5.0

-------------

lzt99 :

MiniZ : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s

zlib-ng : 24,700,820 ->13,158,385 =  4.262 bpb =  1.877 to 1
miniz_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s

ZStd : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s

Brotli-9 : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s

Brotli-10 : 24,700,820 -> 9,949,740 =  3.222 bpb =  2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s

Brotli-11 : 24,700,820 -> 9,833,023 =  3.185 bpb =  2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s

Oodle Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
encode only      : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 =  2.937 bpb =  2.724 to 1
decode           : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s

-------------

normals.bc1 :

miniz :   524,316 ->   291,697 =  4.451 bpb =  1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s

zlib-ng :   524,316 ->   292,541 =  4.464 bpb =  1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s

zstd :   524,316 ->   273,642 =  4.175 bpb =  1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s

brotli-9 :   524,316 ->   289,778 =  4.421 bpb =  1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s

brotli-10 :   524,316 ->   259,772 =  3.964 bpb =  2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s

brotli-11 :   524,316 ->   253,625 =  3.870 bpb =  2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s

Oodle Kraken -zl6 :    524,316 ->   247,217 =  3.772 bpb =  2.121 to 1
decode           : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s

Oodle Kraken -zl7 :    524,316 ->   238,844 =  3.644 bpb =  2.195 to 1
decode           : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s

Oodle BitKnit :    524,316 ->   225,884 =  3.447 bpb =  2.321 to 1
decode only      : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s

-------------

lightmap.bc3 :

miniz :  4,194,332 ->   590,448 =  1.126 bpb =  7.104 to 1 
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s

zlib-ng : 4,194,332 ->   584,107 =  1.114 bpb =  7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s

zstd :  4,194,332 ->   417,672 =  0.797 bpb = 10.042 to 1 
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s

brotli-9 : 4,194,332 ->   499,120 =  0.952 bpb =  8.403 to 1 
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s

brotli-10 : 4,194,332 ->   409,907 =  0.782 bpb = 10.232 to 1 
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s

brotli-11 : 4,194,332 ->   391,576 =  0.747 bpb = 10.711 to 1 
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s

Oodle Kraken -zl6 :   4,194,332 ->   428,737 =  0.818 bpb =  9.783 to 1 
decode           : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s

Oodle BitKnit :   4,194,332 ->   416,208 =  0.794 bpb = 10.077 to 1
decode only      : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s

Oodle LZNA :  4,194,332 ->   356,313 =  0.680 bpb = 11.771 to 1 
decode           : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s

----------------

aow3_skin_giants.clb

Miniz : 7,105,158 -> 3,231,469 =  3.638 bpb =  2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s

zlib-ng : 7,105,158 -> 3,220,291 =  3.626 bpb =  2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s

Zstd : 7,105,158 -> 2,700,034 =  3.040 bpb =  2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s

brotli-9 :  7,105,158 -> 2,671,237 =  3.008 bpb =  2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s

brotli-10 : 7,105,158 -> 2,518,315 =  2.835 bpb =  2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s

brotli-11 : 7,105,158 -> 2,482,511 =  2.795 bpb =  2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s

Oodle Kraken -zl6 : aow3_skin_giants.clb :  7,105,158 -> 2,638,490 =  2.971 bpb =  2.693 to 1
decode           : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s

Oodle BitKnit : 7,105,158 -> 2,623,466 =  2.954 bpb =  2.708 to 1
decode only      : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s

Oodle LZNA : aow3_skin_giants.clb :  7,105,158 -> 2,394,871 =  2.696 bpb =  2.967 to 1
decode           : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s

--------------------

silesia_mozilla

MiniZ : 51,220,480 ->19,141,389 =  2.990 bpb =  2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s

zlib-ng : 51,220,480 ->19,242,520 =  3.005 bpb =  2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s

zstd : malloc failed

brotli-9 : 51,220,480 ->15,829,463 =  2.472 bpb =  3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s

brotli-10 : 51,220,480 ->14,434,253 =  2.254 bpb =  3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s

brotli-11 : 51,220,480 ->14,225,511 =  2.222 bpb =  3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s

Oodle Kraken -zl6 : 51,220,480 ->14,330,298 =  2.238 bpb =  3.574 to 1
decode           : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s

Oodle Kraken -zl7 : 51,220,480 ->14,222,802 =  2.221 bpb =  3.601 to 1
decode           : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s

Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 =  2.076 bpb =  3.853 to 1
decode           : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s

I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.


Here are charts of the above data :

(silesia_mozilla omitted due to lack of zstd results)

(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and percent out of output size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)

log-log speed & ratio :

Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important - (eg. 30 mb/s vs 40 mb/s) because those mean a big difference in time.

I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.

5/17/2016

The Weissman Score

Wikipedia suggests the Weissman score should be

which ignoring constants is just W = r/logT

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :


W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )

or

W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evalue performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :


Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.

The Weissman (decode Core i7-3770 3.4 GHz; 1-256 range) scores on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !


ADD : this post was a not-sure-if-joking. But I actually think it's useful. I find it useful anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell if that tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the pareto frontier and take those. But when iterating an optimization of a parameter you want just a simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for, size-dominant slower compressors should use Weissman 1-256 , for balance of space and super speed use Weissman 1-inf (or 40-800), for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed paramaters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider like latency, storage usage, etc. that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example high speed decoders should always set the upper speed to infinity, which corresponds to the use case that the compressed data might be already resident in RAM so it has zero time to load).

5/15/2016

PS4 Battle : LZ4 vs LZSSE vs Oodle

PS4 is the arena. Three compressors enter. (revised 05-16)

Advantages of the PS4 : consistent well-defined hardware for reproducible testing. Slow platform that game developers care about being fast on. Builds with clang which is the target of choice for some compression libraries that don't build so easily on MSVC.

After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable. To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.

LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe
LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz safing. LZ4's code is much simpler, it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical, it's just so simple, it's way easier to be sure that it's correct)

Conclusion :

Comparing the two Safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_Safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 (PS4 is actually a pretty bad platform for LZ4; there're a variety of issues; see GDC "Taming the Jaguar" slides ; but particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed, sometimes LZ4 was faster.

I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (for example, that's not the case on Windows/x64, it's much closer there). I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.

The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).

On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets worse ratio, sometimes much worse speed. If I was a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.

Full data :

REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!


PS4 clang-3.6.1

------------------

lzt99 :

lz4hc : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s

LZSSE8  : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s

Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s

Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode           : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s

------------------

texture.bc1

lz4hc : 2,188,524 -> 2,068,268 =  7.560 bpb =  1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s

LZSSE8 : 2,188,524 -> 2,111,182 =  7.717 bpb =  1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s

Oodle LZB16 : texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s

------------------

lightmap.bc3

lz4hc : 4,194,332 ->   632,974 =  1.207 bpb =  6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s

LZSSE8 encode    :  4,194,332 ->   684,062 =  1.305 bpb =  6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s

Oodle LZB16 : lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s

------------------

silesia_mozilla

lz4hc : 51,220,480 ->22,062,995 =  3.446 bpb =  2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s

LZSSE8 : 51,220,480 ->22,148,366 =  3.459 bpb =  2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s

Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s

Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 =  3.261 bpb =  2.453 to 1
decode           : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s

------------------

breton.dds

lz4hc :   589,952 ->   116,447 =  1.579 bpb =  5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s

LZSSE8 encode    :  589,952 ->   119,659 =  1.623 bpb =  4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s

Oodle LZB16 : breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s

Oodle LZBLW : breton.dds :    589,952 ->   132,934 =  1.803 bpb =  4.438 to 1
decode           : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s

------------------

d.dds

lz4hc encode     :  1,048,704 ->   656,706 =  5.010 bpb =  1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s

LZSSE8 encode    :  1,048,704 ->   695,583 =  5.306 bpb =  1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s

Oodle LZB16 : d.dds :  d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s

------------------

all_dds

lz4hc : 79,993,099 ->47,848,680 =  4.785 bpb =  1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s

LZSSE8 : 79,993,099 ->47,807,041 =  4.781 bpb =  1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s

Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s

----------

baby_robot_shell.gr2

lz4hc : 58,788,904 ->32,998,567 =  4.490 bpb =  1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s

LZSSE8 : 58,788,904 ->33,201,406 =  4.518 bpb =  1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s

Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 =  4.111 bpb =  1.946 to 1
decode           : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s


After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.


LZB16, PS4 clang-3.6.1

post-fuzz-safing :

    lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
    decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
    
    texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
    decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
    
    lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
    decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
    
    silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
    decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
    
    breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
    decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
    
    d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
    decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
    
    all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
    decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
    
    baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
    decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

pre-fuzz-safing reference :

    lzt99 912.92 mb/s
    texture.bc1 598.61 mb/s
    lightmap.bc3 876.19 mb/s
    silezia_mozilla 794.72 mb/s
    breton.dds 1078.52 mb/s
    d.dds 888.73 mb/s
    all_dds 701.45 mb/s
    baby_robot_shell.gr2 877.81 mb/s

Mild speed penalty on most files.

5/12/2016

Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)


win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :


Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.


Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb

-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio si improved. (most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit). Of course even at -z7 Kraken is way faster than anything else comparable!

old rants