4/09/2019

Oodle 2.8.0 Release

Oodle 2.8.0 is now out. This release continues to improve the Kraken, Mermaid, Selkie, Leviathan family of compressors. There are no new compressors or major changes in this release. We continue to move towards retiring the pre-Kraken codecs, but they are still available in Oodle 2.8.

The major new features in Oodle 2.8 are :

  • "Job" threading of the encoders in Oodle Core. This allows you to get multi-threaded encoding of medium size buffers using Oodle Core on your own thread system (or optionally use the one provided in OodleX). See a whole section on this below.

  • Faster encoders, particularly the optimals, and particularly Optimal1. They run at 1 - 2X the speed of 2.7.5 even without Jobify threading.

  • Better limits on memory use, particularly for the encoders. You can now query the memory sizes needed and allocate all the memory yourself before calling Oodle, and Oodle will then do no allocations. See OodleLZ_GetCompressScratchMemBound, example_lz_noallocs, and the FAQ.

An example of the encoder speed improvement on the "seven" test set, measured with ect on a Core i7 3770, Kraken levels 5 (Optimal1) and 7 (Optimal3) :


Oodle 2.7.5 :
ooKraken5       :  3.02:1 ,    3.3 enc MB/s , 1089.2 dec MB/s
ooKraken7       :  3.09:1 ,    1.5 enc MB/s , 1038.1 dec MB/s

Oodle 2.8.0 : (without Jobify)
ooKraken5       :  3.01:1 ,    4.6 enc MB/s , 1093.6 dec MB/s
ooKraken7       :  3.08:1 ,    2.3 enc MB/s , 1027.6 dec MB/s

Oodle 2.8.0 : (with Jobify)
ooKraken5       :  3.01:1 ,    7.2 enc MB/s , 1088.3 dec MB/s
ooKraken7       :  3.08:1 ,    2.9 enc MB/s , 1024.6 dec MB/s

See the full change log for more.
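
To make the table above easier to read as speedup factors, here is a small sketch that divides the 2.8.0 encode speeds by the 2.7.5 ones. The struct and function names are mine, purely for illustration; the numbers are copied from the table.

```c
#include <stdio.h>

/* Encode speeds in MB/s, copied from the table above. */
typedef struct {
    const char *name;
    double mbps_275;         /* Oodle 2.7.5 */
    double mbps_280;         /* Oodle 2.8.0 without Jobify */
    double mbps_280_jobify;  /* Oodle 2.8.0 with Jobify */
} EncodeSpeeds;

static const EncodeSpeeds kEncodeSpeeds[2] = {
    { "ooKraken5", 3.3, 4.6, 7.2 },
    { "ooKraken7", 1.5, 2.3, 2.9 },
};

/* Speedup factor of a new speed relative to an old one. */
double speedup(double old_mbps, double new_mbps)
{
    return new_mbps / old_mbps;
}

/* Print the speedup of 2.8.0 over 2.7.5 for each row of the table. */
void print_speedups(void)
{
    for (int i = 0; i < 2; i++) {
        const EncodeSpeeds *s = &kEncodeSpeeds[i];
        printf("%s : %.2fX without Jobify, %.2fX with Jobify\n",
               s->name,
               speedup(s->mbps_275, s->mbps_280),
               speedup(s->mbps_275, s->mbps_280_jobify));
    }
}
```

On this one test set that works out to about 1.4X (Optimal1) and 1.5X (Optimal3) without Jobify, and about 2.2X and 1.9X with it.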


About the new Jobify threading of Oodle Core :

Oodle Core is a pure code lib (as much as possible) that just does memory to memory compression and decompression. It does not have IO, threading, or other system dependencies (those are provided by Oodle Ext). The system functions that Oodle Core needs are accessed through function pointers that the user can provide, such as for allocations and logging. We have extended this so you can now plug in a Job threading system which Oodle Core can optionally use to multi-thread operations.

Currently the only thing we will multi-thread is OodleLZ_Compress encoding of the new LZ compressors (Kraken, Mermaid, Selkie, Leviathan) at the Optimal levels, on buffers larger than one BLOCK (256 KB). In the future we may multi-thread other things.

Previously if you wanted multi-threaded encoding you had to split your buffers into chunks and multi-thread at the chunk level (with or without overlap), or by encoding multiple files simultaneously. You still can and should do that. Oodle Ext for example provides functions to multi-thread at this granularity. Oodle Core does not do this for you. I refer to this as "macro" parallelism.

The new Jobify threading in Oodle Core provides more "micro" parallelism that uses multiple cores even on individual buffers. It parallelizes at the BLOCK level, hence it will not get any parallelism on buffers <= one BLOCK (256 KB).
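
The BLOCK arithmetic can be sketched as below. This is only the size calculation implied by the text (256 KB blocks, no parallelism at one block or fewer); the function names are hypothetical and it says nothing about how Oodle actually schedules the work.

```c
#include <stddef.h>

/* One BLOCK is 256 KB, per the text above. */
#define SKETCH_BLOCK_SIZE ((size_t)256 * 1024)

/* Number of BLOCKs a buffer spans, rounding a partial last block up. */
size_t count_blocks(size_t buffer_size)
{
    return (buffer_size + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
}

/* Block-level parallelism needs more than one block to work on. */
int can_parallelize(size_t buffer_size)
{
    return count_blocks(buffer_size) > 1;
}
```

So a 1 MB buffer spans 4 blocks and can be split up, while a buffer of exactly 256 KB is a single block and cannot.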

Threading of OodleLZ_Compress is controlled by the OodleLZ_CompressOptions:jobify setting. If you don't touch it, the default value (Jobify_Default) is to use threads if a thread system is plugged in to Oodle Core, and to not use threads if no thread system is plugged in. You may change that option to explicitly control which calls try to use threads and which don't.

OodleX_Init plugs the Oodle Ext thread system in to Oodle Core. So if you use OodleX and don't touch anything, you will now have Jobify threading of OodleLZ_Compress automatically enabled. You can specify Threads_No in OodleX_Init if you don't want the OodleX thread system. If you use OodleX you should NOT plug in your own thread system or allocators into Oodle Core - you must let OodleX provide all the plugins. The Oodle Core plugins allow people who are not using OodleX to provide the systems from their own engine.

WHO WILL SEE AN ENCODE PERF WIN :

If you are encoding buffers larger than 1 BLOCK (256 KB).

If you are encoding at level Optimal1 (5) or higher.

If you use the new LZ codecs (Kraken, Mermaid, Selkie, Leviathan).

If you plug in a job system, either with OodleX or your own.
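
The four conditions above can be written as a single check. This is a sketch that assumes all four must hold together, which is how I read the list; the enum, struct and function are hypothetical and are not part of the Oodle API.

```c
#include <stddef.h>

/* Hypothetical codec tag for this sketch only. */
typedef enum {
    SKETCH_KRAKEN,
    SKETCH_MERMAID,
    SKETCH_SELKIE,
    SKETCH_LEVIATHAN,
    SKETCH_PRE_KRAKEN   /* any of the older codecs */
} SketchCodec;

/* Hypothetical description of one encode call. */
typedef struct {
    size_t      buffer_size;            /* bytes to encode */
    int         level;                  /* 5 == Optimal1 */
    SketchCodec codec;
    int         job_system_plugged_in;  /* OodleX or your own */
} SketchEncodeCall;

/* Returns 1 if every condition in the list above is met, else 0. */
int jobify_may_help(const SketchEncodeCall *c)
{
    const size_t block = (size_t)256 * 1024;
    return c->buffer_size > block
        && c->level >= 5
        && c->codec != SKETCH_PRE_KRAKEN
        && c->job_system_plugged_in != 0;
}
```

A check like this is only useful for deciding where to expect a win; with the default jobify setting you do not need to make the decision yourself.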

CAVEAT :

If you are already heavily macro-threading, eg. encoding lots of files multi-threaded, using all your cores, then Jobify probably won't help much. It also won't hurt, and might help ensure full utilization of all your cores. YMMV.

If you are encoding small chunks (say 64 KB or 256 KB), then you should be macro-threading, encoding those chunks simultaneously on many threads and Jobify does not apply to you. Note when encoding lots of small chunks you should be passing pre-allocated memory to Oodle and reusing that memory for all your compress calls (but not sharing it across threads - one scratch memory buffer per thread!). Allocation time overhead can be very significant on small chunks.
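
The scratch reuse pattern for small chunks looks roughly like this. It is a sketch only: SketchWorker and the sketch_* functions are hypothetical, and the "encode" is a plain copy standing in for a compress call that is handed caller-owned scratch memory. In real code the scratch size would come from OodleLZ_GetCompressScratchMemBound.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread context: each thread owns exactly one of these. */
typedef struct {
    void  *scratch;       /* allocated once, reused for every chunk */
    size_t scratch_size;
    int    alloc_count;   /* how many times we hit the allocator */
} SketchWorker;

int sketch_worker_init(SketchWorker *w, size_t scratch_size)
{
    w->scratch = malloc(scratch_size);
    w->scratch_size = scratch_size;
    w->alloc_count = (w->scratch != NULL) ? 1 : 0;
    return w->scratch != NULL;
}

void sketch_worker_free(SketchWorker *w)
{
    free(w->scratch);
    w->scratch = NULL;
    w->scratch_size = 0;
}

/* Stand-in for a compress call: works in the worker's scratch, no allocation. */
size_t sketch_encode_chunk(SketchWorker *w, const void *src, size_t len, void *dst)
{
    if (len > w->scratch_size) return 0;
    memcpy(w->scratch, src, len);   /* pretend to do work in scratch */
    memcpy(dst, w->scratch, len);   /* a real encoder would write compressed bytes */
    return len;
}

/* Encode a buffer chunk by chunk on one worker, reusing its scratch each time. */
size_t sketch_encode_all(SketchWorker *w, const unsigned char *src, size_t total,
                         size_t chunk, unsigned char *dst)
{
    size_t done = 0;
    if (chunk == 0) return 0;
    while (done < total) {
        size_t len = (total - done < chunk) ? (total - done) : chunk;
        if (sketch_encode_chunk(w, src + done, len, dst + done) != len) break;
        done += len;
    }
    return done;
}
```

With several threads you would create one SketchWorker per thread and never pass a worker between threads, so the allocator is hit once per thread rather than once per chunk.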

If you are encoding huge files, you should be macro-threading at the chunk level, possibly with dictionary backup for overlap. Contact RAD support for the "oozi" example that demonstrates multi-threaded encoding of huge files with async IO.

NOTE : All the perf numbers we post about and show graphs for are for single threaded speed. I will try to continue to stick to that.


A few APIs have changed & the CompressOptions struct has changed.

This is why the middle version number (8) was incremented. When the middle ("major") version of Oodle is the same, the Oodle lib is binary link compatible. That means you can just drop in a new DLL without recompiling. When the major version changes you must recompile.

A few APIs have small signature changes :

 OodleLZ_GetDecodeBufferSize, OodleLZ_GetCompressedBufferSizeNeeded and OodleLZ_GetInPlaceDecodeBufferSize :
    now take a compressor argument so they can return smaller padding for the new codecs.
 OodleLZ_GetChunkCompressor :
    now takes a compressed size argument to ensure it doesn't read past the end.

These should give compile errors and be easy to fix.

The CompressOptions struct has new fields. Those fields may be zero initialized to get default values. So if you were initializing the struct thusly :

struct OodleLZ_CompressOptions my_options = { 1, 2, 3 };
The new fields on the end will be implicitly zeroed by C, and that is fine.
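
The C rule being relied on here is that members left out of a brace initializer are zero initialized. A tiny sketch with a made-up struct (not the real OodleLZ_CompressOptions layout) shows it:

```c
/* Hypothetical options struct: three original fields plus two added later. */
typedef struct {
    int a;
    int b;
    int c;
    int new_field_1;  /* added in a later version */
    int new_field_2;  /* added in a later version */
} SketchOptions;

/* Returns 1 if the fields omitted from the initializer came out as zero. */
int sketch_trailing_fields_are_zero(void)
{
    SketchOptions o = { 1, 2, 3 };  /* only the first three are given */
    return o.new_field_1 == 0 && o.new_field_2 == 0;
}
```

This only holds when there is an initializer at all; a struct declared on the stack with no initializer has indeterminate contents.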

NOTE : I do NOT recommend that style of initializing CompressOptions. The recommended pattern is to GetDefaults and then modify the fields you want to change :

struct OodleLZ_CompressOptions my_options = OodleLZ_CompressOptions_GetDefault();
my_options.seekChunkReset = true;
my_options.dictionarySize = 256*1024;
Then, after you set up options, you should Validate :
OodleLZ_CompressOptions_Validate(&my_options);
Validate will clamp values into valid ranges and make sure that any constraints are met. Note that if Validate changes your options you should really look at why; you shouldn't be shipping code where you rely on Validate to clamp your options.
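
One way to catch the case where Validate silently changes your options is to keep a copy and compare. The sketch below uses a made-up two-field struct and a made-up validate with invented limits, standing in for the real OodleLZ_CompressOptions and OodleLZ_CompressOptions_Validate; only the copy-and-compare pattern is the point.

```c
/* Hypothetical stand-in for the real options struct. */
typedef struct {
    int seekChunkReset;
    int dictionarySize;
} SketchLZOptions;

/* Invented limit for this sketch; the real valid ranges are Oodle's business. */
#define SKETCH_MAX_DICT (1 << 30)

SketchLZOptions sketch_get_default(void)
{
    SketchLZOptions o = { 0, 0 };
    return o;
}

/* Stand-in validate: clamps fields into the invented ranges above. */
void sketch_validate(SketchLZOptions *o)
{
    if (o->dictionarySize < 0) o->dictionarySize = 0;
    if (o->dictionarySize > SKETCH_MAX_DICT) o->dictionarySize = SKETCH_MAX_DICT;
    o->seekChunkReset = o->seekChunkReset ? 1 : 0;
}

/* Validate in place; return 1 if validation had to change anything. */
int sketch_validate_and_report(SketchLZOptions *o)
{
    SketchLZOptions before = *o;
    sketch_validate(o);
    return before.seekChunkReset != o->seekChunkReset
        || before.dictionarySize != o->dictionarySize;
}
```

In a development build you might assert that the return value is 0, so that an out-of-range option gets noticed and fixed at the call site.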


WARNINGS :

example_lz before 2.8.0 had a bug that caused it to stomp the user-provided input file, if one was provided.

YES IT WOULD STOMP YOUR FILE!

That bug is not in the Oodle library, it's in the example code, so we did not issue a critical bug fix for it, but please beware running the old example_lz with a file argument. If you update to the 280 SDK please make sure you update the *whole* SDK including the examples, not just the lib!

On Windows it is very important to not link both Oodle Core and Ext. The Oodle Ext DLL includes a whole copy of Oodle Core - if you use OodleX you should ONLY link to the Ext DLL, *not* both.

Unfortunately because of the C linkage model, if you link to both Core and Ext, the Oodle Core symbols will be multiply defined and just silently link without a warning or anything. That is not benign. (It's almost never benign and C needs to get its act together and fix linkage in general). It's specifically not benign here, because Oodle Ext will be calling its own copy of Core, but you might be calling to the other copy of Core, so the static variables will not be shared.

Benchmarking Oodle with ozip -b

The ozip utility is designed to act mostly like gzip. A compiled executable of ozip is provided with Oodle for easy testing, or you may download the ozip source on GitHub.

ozip now has a benchmarking option (ozip -b) which is an easy way to test Oodle.

ozip -b runs encode & decode many times to provide accurate timing. It does not include IO. It was designed to be similar to zstd -b so that they are directly comparable.
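
The general shape of that kind of benchmark is: get the data into memory first, then time only the work, several times over. Here is a generic sketch of such a loop. It is not ozip's code; the names are hypothetical, keeping the fastest run is my assumption (the text does not say how ozip -b aggregates its runs), and clock() measures CPU time rather than wall time.

```c
#include <stddef.h>
#include <time.h>

/* The work to be timed; it must only touch memory, no IO. */
typedef void (*SketchWorkFn)(void *ctx);

/* Run the work reps times and return the fastest run in seconds (-1.0 if reps <= 0). */
double best_of_n_seconds(SketchWorkFn work, void *ctx, int reps)
{
    double best = -1.0;
    for (int i = 0; i < reps; i++) {
        clock_t t0 = clock();
        work(ctx);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}

/* Example workload: sum a buffer that is already in memory. */
typedef struct {
    const unsigned char *data;
    size_t len;
    unsigned long sum;
    int calls;
} SketchSumCtx;

void sketch_sum_work(void *ctx)
{
    SketchSumCtx *c = (SketchSumCtx *)ctx;
    unsigned long sum = 0;
    for (size_t i = 0; i < c->len; i++) sum += c->data[i];
    c->sum = sum;
    c->calls++;
}
```

For a compressor you would replace the summing workload with an encode or decode call on a buffer that was read from disk before the loop starts.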

ozip -b can take a file or a dir (plus wildcard), in which case it will test all the files in the dir. You can set up the specific compressor and options you want to test to see how they affect performance and compression ratio.

So for example you can test the effect of spaceSpeedTradeoffBytes on Kraken level Optimal1 :

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os512
K 5 mozilla          :  51220480 ->  14288373 (3.585),   3.5 MB/s, 1080.4 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla
K 5 mozilla          :  51220480 ->  14216948 (3.603),   3.5 MB/s, 1048.6 MB/s

r:\>ozip -b -c8 -z5 r:\testsets\silesia\mozilla -os128
K 5 mozilla          :  51220480 ->  14164777 (3.616),   3.5 MB/s, 1004.6 MB/s

Or to test Kraken HyperFast3 on all the files in Silesia :

r:\>ozip -b -c8 -ol-3 r:\testsets\silesia\*
K-3 12 files         : 211938580 ->  81913142 (2.587), 339.0 MB/s, 1087.6 MB/s
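
The number in parentheses in each output line is the raw size divided by the compressed size. The sketch below reproduces it and also compares two of the runs above; the helper names are mine, and the inputs are copied from the output lines.

```c
/* Compression ratio as printed in parentheses: raw size / compressed size. */
double compression_ratio(unsigned long long raw_bytes, unsigned long long comp_bytes)
{
    return comp_bytes ? (double)raw_bytes / (double)comp_bytes : 0.0;
}

/* Relative change from old_value to new_value, in percent. */
double percent_change(double old_value, double new_value)
{
    return (new_value - old_value) / old_value * 100.0;
}
```

For example compression_ratio(51220480, 14288373) is about 3.585, matching the first line. Comparing the -os512 and -os128 runs on mozilla, percent_change(14288373, 14164777) is about -0.9 (smaller output) and percent_change(1080.4, 1004.6) is about -7.0 (slower decode). That is one run on one file, so treat it as an illustration of the tradeoff rather than a measurement.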


Another option for easy testing with Oodle is example_lz_chart, which is also provided as a pre-compiled exe and also as source code.

example_lz_chart runs on a single file you provide and prints a report of the compression ratio and speed of various Oodle compressors and encode levels.

This gives you an overview of the different performance points you can hit with Oodle.


WARNING :

Do not try to profile Oodle by process timing ozip.

The normal ozip (not -b mode) uses stdio and is not designed to be as efficient as possible. It's designed for simplicity and to replicate gzip behavior when used for streaming pipes on UNIX.

In general it is not recommended to benchmark by timing with things like IO included because it's very difficult to do that right and can give misleading results.

See also :

Tips for benchmarking a compressor
The Perils of Holistic Profiling
Tips for using Oodle as Efficiently as Possible


NOTE :

ozip does not have any threading. ozip -b is benchmarking single threaded performance.

This is true even for the new Jobify threading because ozip initializes OodleX without threads :

    OodleX_Init_Default(OODLE_HEADER_VERSION,OodleX_Init_GetDefaults_DebugSystems_No,OodleX_Init_GetDefaults_Threads_No);

I believe that zstd -b is also single threaded so they are apples to apples. However some compressors use threads by default (LZMA, LZHAM, etc.) so if they are being compared they should be set to not use threads OR you should use Oodle with threads. Measuring multi-threaded performance is context dependent (for example are you encoding many small chunks simultaneously?) and I generally don't recommend it; it's much easier to compare fairly with single threaded performance.

For high performance on large files, ask for the "oozi" example.