Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

The larger the block = more compression, and can help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :

Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limitting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.

window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.

If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.



I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.

The marginal rounding value is like this in binary :


^ ^              ^ ^- bits below mantissa are exactly at rounding threshold
| |              |
| |              +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
+- implicit 1 bit at top of double

in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :

hi = a u64 with first 64 bits
top bit of "hi" is on

(hi>>11) is the 52 bits of mantissa + implicit top bit

hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit

hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits

At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on

At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :

lo = current u64 from first 19 digits

hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)

final value is in [lo,hi]

so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :


this is right on a rounding threshold; because the banker bit is off it should round down

the correct closest double is 

but if you bias that up beyond the digits that MSVC reads :


this should round up; the closest double is

but old MSVC gets it wrong


Another example :


is the double threshold


is definitely above the threshold, should round up

but if you only read the first 19 base-10 digits :


you see a value that is below the threshold and would round down

You have to look at the interval of uncertainty -


and see that you straddle and boundary and must get more digits.

There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can interatively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.

BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where your coding interval looks like, in binary :

hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
     ^               ^               ^- active coding zone, new symbols add at this fixed point position
     |               |
     |               +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
     +-- hi and lo the same in top bits, these bits have been sent & renormalized out

anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding more, control word, precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE). The best solution to all that is to just not use floats, use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particulary true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingess of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

"dtoa" (which contains a strtod) by David M Gay :

dtoa.c from http://www.netlib.org/fp/

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some apalling spaghetti. This is actual code from dtoa :

    if (k &= 0x1f) {
        k1 = 32 - k;
        z = 0;
        do {
            *x1++ = *x << k | z;
            z = *x++ >> k1;
            while(x < xe);
        if ((*x1 = z))
    if (k &= 0xf) {
        k1 = 16 - k;
        z = 0;
        do {
            *x1++ = *x << k  & 0xffff | z;
            z = *x++ >> k1;

holy order-of-operations reliance batman! Jeebus.

To build dtoa I use :

#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions (always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

Here's my timing on random large (20-300 decimal digits) strings :

strtod dtoa.c
ticks = 353

strtod mine
ticks = 267

MSVC 2005 CRT atof
ticks = 1921

MUSL floatscan
ticks = 11445

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.

The MUSL floatscan slowness is mmrmm. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.

Tips for using Oodle as Efficiently as Possible

Tips for using Oodle as efficiently and robustly as possible. (wrst your game runtime loading + decoding only).

1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from being great performance, they have the most robust and well-tested fuzz safety.

2. Pass FuzzSafe_Yes to the OodleLZ_Decompress option. The KMS decoders are always internally fuzz safe. What the FuzzSafe_Yes option is add some checks to the initial header decode which will cause the decode to fail if the header byte has been tampered with to change it to a non-KMS compressor.

3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no init, and will do no allocations to decode (if you pass in the memory).

4. Disable Oodle's callbacks :


(note that disabling the allocators only works because you are doing decoding only (no encoding) and because you do #5 below - pass in scratch memory for the decoder)

5. Because you disabled Oodle's access to an allocator, instead pass in the memory needed. You can ask the data for the compressor if you don't always know it.

    compressor = OodleLZ_GetChunkCompressor(in_comp,NULL);
    decoderMemorySize = OodleLZDecoder_MemorySizeNeeded(compressor,in_raw_length);
    decoderMemory = malloc( decoderMemorySize );

6. If possible keep the Decoder memory around or pull from some shared scratch space rather than allocating & freeing it every time.

7. If you're on a platform where Oodle is in a DLL or .so , sign it with a mechanism like authentisign to ensure it isn't tampered with.

8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are maximally using the IO bus and the CPU at the same time. If loading and decoding large buffers, consider running the decoders "ThreadPhased".

9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes. The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously, rather than trying ti multi-thread a single encode/decode.

10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors only when they get big size wins, and perhaps tune the compression to the game's latency needs - random access data that needs to get loaded as fast as possible, use faster decoders, background data that streams in slowly can use slower decoders. Hydra is the best way to do this, it will automatically measure decode speed and tune to maximize the space-speed tradeoff.

11. Don't load tiny buffers. Combine them into larger paging units.

12. Consider in-memory compressed data as a paging cache.

As always, feel free to contact me and ask questions.


Oodle Hydra

02-01-17 | Oodle Hydra

Oodle Hydra - the many headed beast.

Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to do a lagrangian space-speed optimization decision about which compressor is maximizing the desired lagrange cost (size + lambda*time).

It turns out to be quite interesting.

(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)

Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, that corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" , Oodle makes the right decision for you that optimizes the goal.

Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between. That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.

I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good, but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.

It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference. (I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they are slower it's because the time was worth it.

And now some reports. We're going to look at 3 copora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D, something magic happens ...

(Oodle 2.4.2 , level 7, Core i7-3770 x64)

Silesia :

total                : Kraken     : 4.106 to 1 : 994.036 MB/s
total                : Mermaid    : 3.581 to 1 : 1995.919 MB/s
total                : Hydra200   : 4.096 to 1 : 1007.692 MB/s
total                : Hydra288   : 4.040 to 1 : 1082.211 MB/s
total                : Hydra416   : 3.827 to 1 : 1474.452 MB/s
total                : Hydra601   : 3.685 to 1 : 1780.476 MB/s
total                : Hydra866   : 3.631 to 1 : 1906.823 MB/s
total                : Hydra1250  : 3.572 to 1 : 2002.683 MB/s

gametestset :

total                : Kraken     : 2.593 to 1 : 1309.865 MB/s
total                : Mermaid    : 2.347 to 1 : 2459.442 MB/s
total                : Hydra200   : 2.593 to 1 : 1338.429 MB/s
total                : Hydra288   : 2.581 to 1 : 1397.465 MB/s
total                : Hydra416   : 2.542 to 1 : 1581.959 MB/s
total                : Hydra601   : 2.484 to 1 : 1836.988 MB/s
total                : Hydra866   : 2.431 to 1 : 2078.516 MB/s
total                : Hydra1250  : 2.366 to 1 : 2376.828 MB/s

PD3D :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s
total                : Mermaid    : 3.403 to 1 : 1814.660 MB/s
total                : Hydra200   : 3.755 to 1 : 1218.745 MB/s
total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s
total                : Hydra601   : 3.574 to 1 : 1518.151 MB/s
total                : Hydra866   : 3.487 to 1 : 1666.634 MB/s
total                : Hydra1250  : 3.279 to 1 : 1965.039 MB/s

Whoah! Magic!

On PD3D, Hydra finds big free wins - it not only compresses more than Kraken, it decodes significantly faster, repeating the above to point it out :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s

total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
 Kraken compression ratio is in between here, around 1300 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s

You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that line (more compression) and to the right (faster).

What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.

Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).

Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically don't expect to be good in general on all data, but that might be amazing once in a while on certain data.

(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize a switching choice with compressors that carry state requires a trellis-quantization like lattice dynamic programming optimization and is rather more complex to do quickly)

old rants