3/08/2017

Kraken Perf with Simultaneous Threaded Decodes

I had a report from a customer of poor Kraken decode performance on PS4 when using 10 simultaneous threads for decoding, and it occurred to me I had never tested that thoroughly. (Their issue turned out to be something else; see end of post).

There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the many cores share caches, perhaps share memory busses or TLBs. That means while you have N* the compute performance, you may have cache conflicts, and you could wind up bottlenecking on some of the memory subsystem. (generally we don't run into bandwidth bottlenecks, but there are lots of other limitted resources, like queue sizes, etc.)

Anyhoo, onto the testing -

I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Wiped caches before runs. I then wait on all N decodes being done and time that.

The graphs show total time for all N decodes, and time per decode (total/N).

If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.

Once you hit the limits of the system, the "time per" (blue) should be constant, and total then should go up linearly. (actually not quite, because when you are off the core # modulo, the threads don't all complete at the same time so you get wasted idle time; see the jump on lappy from 4-6 cores then how flat it is from 6-8, same on PS4 from 6-8 cores then flat from 9-12). If you have the threads to spare, then you can maximize throughput by minimizing "time per".


Conclusion :

No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.

Kraken on PS4 has near perfect threading up to 6 threads (total time goes from 0.0099 - 0.0111) ; on lappy it's not as good but still provides benefit to the time per decode up to 4 threads (total time from 0.0060 - 0.0095).

It's a surprise to me that the PS4 scales so well despite sharing cache & memory bus for the first 4 cores. It's also a surprise that lappy scales less well, I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.


Charts :

Kraken on PS4 (6 cores; 4 cores per 2MB L2) :

lzt24 :

lzt99:

Almost perfect threading from 1-6 cores (total time constant) even with large binary file.

webster:

webster is a large text file that uses a lot of long distance matches (offset > 1M). Text files have very different character than binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2, there's falloff in perf from 1-4 cores.

webster Selkie :

Selkie is much faster than Kraken (2.75X faster on webster PS4) so all else being equal it should be affected a lot more by thread contention hurting memory latency. But, Selkie has some unique cleverness that makes it immune to this drawback. Threading even on webster from 1-6 cores is near perfect.


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) no turbo :

lzt24 :

lzt99 :

webster :

Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kb L2 , 8 MB L3) (+4 hypercores) WITH TURBO :

lzt24 :

lzt99 :

I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. I think it's interesting actually to look at how the perf falloff is different with turbo.

Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, as adding more cores causes the laptop to reduce its clock rate. Despite that there's still a solid gain to throughput (the blue "time per" is going down despite the clock rate also going down).


raw data : (lzt24)


lappy : no turbo : (*1000)
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475
5,  14.6867,   2.9373
6,  16.6759,   2.7793
7,  19.1105,   2.7301
8,  20.1687,   2.5211
9,  23.6391,   2.6266
10,  25.9279,   2.5928
11,  27.7395,   2.5218
12,  27.6459,   2.3038
13,  30.7935,   2.3687
14,  31.8541,   2.2753
15,  33.7883,   2.2526
16,  34.8252,   2.1766

lappy : with turbo :
1,   0.0060,   0.0060
2,   0.0070,   0.0035
3,   0.0087,   0.0029
4,   0.0095,   0.0024 <- 4
5,   0.0133,   0.0027
6,   0.0170,   0.0028
7,   0.0175,   0.0025
8,   0.0193,   0.0024 <- 8
9,   0.0228,   0.0025
10,   0.0252,   0.0025
11,   0.0262,   0.0024
12,   0.0278,   0.0023 <- 12
13,   0.0318,   0.0024
14,   0.0310,   0.0022
15,   0.0325,   0.0022
16,   0.0346,   0.0022 <- 16

PS4 :
1,   0.0099,   0.0099
2,   0.0102,   0.0051
3,   0.0104,   0.0035
4,   0.0106,   0.0027
5,   0.0110,   0.0022
6,   0.0111,   0.0018 <- min
7,   0.0147,   0.0021
8,   0.0180,   0.0022
9,   0.0204,   0.0023
10,   0.0214,   0.0021
11,   0.0217,   0.0020
12,   0.0220,   0.0018 <- same min again
13,   0.0257,   0.0020
14,   0.0297,   0.0021
15,   0.0310,   0.0021
16,   0.0319,   0.0020



comparing just lappy turbo to no-turbo :

lappy : no turbo :
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475

lappy : with turbo :
1,   6.0,   6.0
2,   7.0,   3.5
3,   8.7,   2.9
4,   9.5,   2.4

You can see with only 1 core, turbo is 1.5X faster (9.13/6.0) than no turbo
With 4 cores they are getting close to the same speed, (10.2 vs 9.5), the turbo
has almost completely clocked down


The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.

On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush; several of these changes could be combined together, eg. for level loading you only need to do it once at the end of all the resource decoding, don't do it per resource).

2/24/2017

Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

The larger the block = more compression, and can help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :


Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s


16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limitting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.


window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.


If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.

2/09/2017

strtod

I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.

The marginal rounding value is like this in binary :


1|101010....101010|10000000000000000

^ ^              ^ ^- bits below mantissa are exactly at rounding threshold
| |              |
| |              +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
|
+- implicit 1 bit at top of double

in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :


hi = a u64 with first 64 bits
top bit of "hi" is on

(hi>>11) is the 52 bits of mantissa + implicit top bit

hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit

hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits

At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on

At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :


lo = current u64 from first 19 digits

hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)

final value is in [lo,hi]

so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :


34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this is right on a rounding threshold; because the banker bit is off it should round down

the correct closest double is 
34791611969279739000000.000000

but if you bias that up beyond the digits that MSVC reads :

34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this should round up; the closest double is
34791611969279743000000.000000

but old MSVC gets it wrong

-----------

Another example :


2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000

is the double threshold

2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110

is definitely above the threshold, should round up

but if you only read the first 19 base-10 digits :

2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000

you see a value that is below the threshold and would round down

You have to look at the interval of uncertainty -

2022951805990391198000000
2022951805990391199000000

and see that you straddle and boundary and must get more digits.

There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can interatively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.

BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where your coding interval looks like, in binary :


hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
     ^               ^               ^- active coding zone, new symbols add at this fixed point position
     |               |
     |               +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
     |
     +-- hi and lo the same in top bits, these bits have been sent & renormalized out

anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding more, control word, precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE). The best solution to all that is to just not use floats, use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particulary true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingess of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

"dtoa" (which contains a strtod) by David M Gay :

dtoa.c from http://www.netlib.org/fp/

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some apalling spaghetti. This is actual code from dtoa :


    if (k &= 0x1f) {
        k1 = 32 - k;
        z = 0;
        do {
            *x1++ = *x << k | z;
            z = *x++ >> k1;
            }
            while(x < xe);
        if ((*x1 = z))
            ++n1;
        }
#else
    if (k &= 0xf) {
        k1 = 16 - k;
        z = 0;
        do {
            *x1++ = *x << k  & 0xffff | z;
            z = *x++ >> k1;
            }

holy order-of-operations reliance batman! Jeebus.

To build dtoa I use :

#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions (always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

Here's my timing on random large (20-300 decimal digits) strings :


strtod dtoa.c
ticks = 353

strtod mine
ticks = 267

MSVC 2005 CRT atof
ticks = 1921

MUSL floatscan
ticks = 11445

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.

The MUSL floatscan slowness is mmrmm. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.

Tips for using Oodle as Efficiently as Possible

Tips for using Oodle as efficiently and robustly as possible. (wrst your game runtime loading + decoding only).

1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from being great performance, they have the most robust and well-tested fuzz safety.

2. Pass FuzzSafe_Yes to the OodleLZ_Decompress option. The KMS decoders are always internally fuzz safe. What the FuzzSafe_Yes option is add some checks to the initial header decode which will cause the decode to fail if the header byte has been tampered with to change it to a non-KMS compressor.

3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no init, and will do no allocations to decode (if you pass in the memory).

4. Disable Oodle's callbacks :


    OodlePlugins_SetAllocators(NULL,NULL);
    OodlePlugins_SetAssertion(NULL);
    OodlePlugins_SetPrintf(NULL);

(note that disabling the allocators only works because you are doing decoding only (no encoding) and because you do #5 below - pass in scratch memory for the decoder)

5. Because you disabled Oodle's access to an allocator, instead pass in the memory needed. You can ask the data for the compressor if you don't always know it.


    compressor = OodleLZ_GetChunkCompressor(in_comp,NULL);
                                    
    decoderMemorySize = OodleLZDecoder_MemorySizeNeeded(compressor,in_raw_length);
    decoderMemory = malloc( decoderMemorySize );

6. If possible keep the Decoder memory around or pull from some shared scratch space rather than allocating & freeing it every time.

7. If you're on a platform where Oodle is in a DLL or .so , sign it with a mechanism like authentisign to ensure it isn't tampered with.

8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are maximally using the IO bus and the CPU at the same time. If loading and decoding large buffers, consider running the decoders "ThreadPhased".

9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes. The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously, rather than trying ti multi-thread a single encode/decode.

10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors only when they get big size wins, and perhaps tune the compression to the game's latency needs - random access data that needs to get loaded as fast as possible, use faster decoders, background data that streams in slowly can use slower decoders. Hydra is the best way to do this, it will automatically measure decode speed and tune to maximize the space-speed tradeoff.

11. Don't load tiny buffers. Combine them into larger paging units.

12. Consider in-memory compressed data as a paging cache.

As always, feel free to contact me and ask questions.

2/01/2017

Oodle Hydra


02-01-17 | Oodle Hydra

Oodle Hydra - the many headed beast.

Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to do a lagrangian space-speed optimization decision about which compressor is maximizing the desired lagrange cost (size + lambda*time).

It turns out to be quite interesting.

(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)

Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, that corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" , Oodle makes the right decision for you that optimizes the goal.

Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between. That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.

I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good, but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.

It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference. (I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they are slower it's because the time was worth it.

And now some reports. We're going to look at 3 copora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D, something magic happens ...

(Oodle 2.4.2 , level 7, Core i7-3770 x64)

Silesia :

total                : Kraken     : 4.106 to 1 : 994.036 MB/s
total                : Mermaid    : 3.581 to 1 : 1995.919 MB/s
total                : Hydra200   : 4.096 to 1 : 1007.692 MB/s
total                : Hydra288   : 4.040 to 1 : 1082.211 MB/s
total                : Hydra416   : 3.827 to 1 : 1474.452 MB/s
total                : Hydra601   : 3.685 to 1 : 1780.476 MB/s
total                : Hydra866   : 3.631 to 1 : 1906.823 MB/s
total                : Hydra1250  : 3.572 to 1 : 2002.683 MB/s

gametestset :

total                : Kraken     : 2.593 to 1 : 1309.865 MB/s
total                : Mermaid    : 2.347 to 1 : 2459.442 MB/s
total                : Hydra200   : 2.593 to 1 : 1338.429 MB/s
total                : Hydra288   : 2.581 to 1 : 1397.465 MB/s
total                : Hydra416   : 2.542 to 1 : 1581.959 MB/s
total                : Hydra601   : 2.484 to 1 : 1836.988 MB/s
total                : Hydra866   : 2.431 to 1 : 2078.516 MB/s
total                : Hydra1250  : 2.366 to 1 : 2376.828 MB/s

PD3D :

total                : Kraken     : 3.678 to 1 : 1054.380 MB/s
total                : Mermaid    : 3.403 to 1 : 1814.660 MB/s
total                : Hydra200   : 3.755 to 1 : 1218.745 MB/s
total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s
total                : Hydra601   : 3.574 to 1 : 1518.151 MB/s
total                : Hydra866   : 3.487 to 1 : 1666.634 MB/s
total                : Hydra1250  : 3.279 to 1 : 1965.039 MB/s

Whoah! Magic!

On PD3D, Hydra finds big free wins - it not only compresses more than Kraken, it decodes significantly faster, repeating the above to point it out :


total                : Kraken     : 3.678 to 1 : 1054.380 MB/s

total                : Hydra288   : 3.738 to 1 : 1249.838 MB/s
 Kraken compression ratio is in between here, around 1300 MB/s
total                : Hydra416   : 3.649 to 1 : 1381.570 MB/s

You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that line (more compression) and to the right (faster).

What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.

Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).

Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically don't expect to be good in general on all data, but that might be amazing once in a while on certain data.

(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize a switching choice with compressors that carry state requires a trellis-quantization like lattice dynamic programming optimization and is rather more complex to do quickly)

1/24/2017

Order 0 estimators for data compression

I thought this might be an interesting simple survey topic to illustrate an introduction to data compression.

Assume that we are writing a compressor with only order-0 modeling, and that we are working on a binary alphabet, so we are just modeling the count of 0's and 1's. Maybe we have some binary data that we believe only has order-0 correlation in it, or maybe this is the back-end of some other stage of a compressor.

If the data is in fact stationary (the probabilites don't change over time) and truly order-0, then the best we can do is to count the # of 0's and 1's in the whole sequence to get the best possible estimate of the true probability of 0's and 1's in the source.

The first option is a static coder : (using the nomenclature of "static huffman" vs "adaptive huffman" ; eg. static means non-streaming, probabilities or counts transmitted at the start of the buffer)


Encoder counts n0 and n1 in the whole sequence
Encoder transmits n0 and n1 exactly

Encoder & Decoder both make
p0 = n0 / (n0+n1)
p1 = n1 / (n0+n1)

Lots of little notes here already. We didn't have to do any +1's to ensure we had non-zero probabilities as you often see, because we have the exact count of the whole stream. eg. if n0 is 0, that's okay because we won't have any 0's to code so it's okay that they're impossible to code.

Now, how do you do your entropy coding? You could feed p0&p1 to arithmetic coding, ANS, to an enumerative coder (since we know n0 and n1, we are just selecting one of the arrangements of those bits, of which there are (n0+n1)!/n0!n1! , and those are all equally likely, so just send an integer that selects one of those), you could group up bits and use huffman. For now we don't care how the back end works, we're just trying to model the probability to feed to the back end.

If n0 and n1 are large, they are probably specifying more precision than the coder can use, which is wasting bits. So maybe you want to send an approximation of just p0 in 14 bits or whatever your back-end can use.

If you do send n0 and n1 exactly, then obviously you don't need to send the file length (it's n0+n1), and furthermore you can gain some efficiency by decrementing n0 or n1 as you go, so that the last symbol you see is known exactly.

Okay, so moving on to adaptive estimators. Instead of transmitting p0 up front, we will start with no a-priori knowledge of the stream (hence p0 = 50%), and as we encounter symbols, we will update p0 to make it the best estimate based on what we've seen so far. The standard solution is :


Encoder & Decoder start with n0 and n1 = 0

Encoder & Decoder form a probability from the n0 and n1 seen so far

p0 = (n0 + B)/(n0+n1 + 2B)

symbols are coded with the current estimate of p0
after which n0 or n1 is incremented and a new p0 is formed

B is a constant bias factor
if B = 1/2 this is a KT estimator (optimal in a specific synthetic case, irrelevant)
if B = 1 this is a laplace estimator

Note that the bias B must be > 0 so that we can encode a novel symbol, eg. coding the first 0 bit when n0 = 0.

There's stuff in the literature about "optimal estimators" but it's all a bit silly, because the optimal estimator depends on the source and what the distribution of possible sources is.

That is, say you actually are getting bits from a stationary source that has a true (unknown) probability of 0 bits, T0. You could see a wide variety of sources with different values of T0, which occur with probability P(T0). After you see some bits, n0 and n1, you wish to compute a p0 which minimizes the expected codelen of the next symbol you see. To do that, you can compute the relative probability of seeing n0 and n1 events from a source of probability T0. But to form the correct final estimate you must have the information about the a-priori likelihood of each source P(T0) which in practice you never have.


So we have these estimators for stationary sources, but in the real world you almost never have a stationary source. So let's start looking at estimators we might actually want to use in the real world.

(it may actually be a pretty stationary source, but it could be stationary only under a more complex model, and any time you are not fully modeling the data, stationary sources appear to be dynamic. This is like flatlanders in 2d watching a 3d object move through their plane - it may actually be a rigid body in a higher dimension, but it looks dynamic when you have an incomplete view of it. For example data that has Order-1 correlation (probability depends on the previous symbol) will appear to have dynamic statitics under only an order-0 model (the probabilities will seem to change after each symbol is coded))

Let's start with the "static" case, transmitting p0 or n0/n1. We can improve it by just breaking the source into chunks and transmitting a model on each chunk, rather than a single count for the whole buffer. These chunks could be fixed size, but there are sometimes large wins by finding the ideal places to put the chunk boundaries. This is an unsolved problem in general, I don't know of any algorithm to do it optimally (other than brute force, which is O(N!) or something evil like that), we use hacky heuristics. Obviously chunks have a cost in that you must spend bits to indicate where the chunk boundaries are, and what the probabilities are in each chunk, so you must count the cost to send the chunk information vs. the bits saved by coding with different probabilities.

(the most extreme case is a buffer that has n0=n1, which would take n0+n1 bits to send as a single chunk, but if in fact all the 0's are at the start, and all the 1's are the end, then you can cut it into two chunks, in the first chunk p0=100% so the bits are sent in zero bits, in the second chunk p0=0% , so the total size is only the overhead of specifying the chunks and probabilities)

A slightly more sophisticated version of this scheme is to have several probability groups and to be able to switch between them from chunk to chunk, that is :


send the # of models, M
send the models
  in the binary case, p0 or n0/n1 for each model

send the # of chunks
for each chunk :
  send its length
  send a model selection m in [0,M)
  send the data in that chunk using model m

In a binary coder this is a bit silly, but in a general alphabet coder, the model might be very large (100 bytes or so), so sending the model selection m is much cheaper than sending the whole model. This method allows you to switch rapidly between models at a lower cost. eg. if your data is like 000000111111111100000001111111000000 - the runs of different-character data are best coded by switching between models. (we're still assuming we can only use order-0 coding). (this is what Brotli does)

Now moving on to adaptive estimators.

The basic approach is that instead of forming an estimate of future probabilities by counting all n0 and n1 events we have seen in the past, we will count based on what we've seen in the recent past, or weight more recent events higher than old ones.

This is rarely done in practice, but you can simply count the # of each symbol in a finite window and update it incrementally :


at position p
code bit[p]

p0 = (n0 + B)/(n0+n1 + 2B)

after coding, increment n0 or n1

if (n0+n1) = T , desired maximum total
  remove the bit b[p - T]
  by subtracting one from n0 or n1

this has the advantage of keeping the sum constant (once the sum reaches T), which you could use to make the sum power of 2. But it requires you actually have the previous T bits, which you usually don't if you are using the adaptive coder as part of a larger model.

This does illustrate a problem we will face with many of these adaptive estimators. There's an initial run-up phase. They start empty with no symbols seen, then count normally up to T, at which point they reach steady state.

A common old-fashioned approach is to renormalize the total to T/2 once it reaches T. This was originally done as a way of limitting the sum T inside the range allowed by the entropy coder (eg. it must fit in 14 bits in old arithmetic coders so that the multiplies fit in 32 bits). It was found that applying limits like this didn't hurt compression, they in fact help in practice, because they make the statistics more adaptive to local changes.


after coding increment n0 or n1

if (n0+n1) = T
    n0 /= 2 , n1 /= 2;

This is actually the same as a piecewise-linear approximation of geometric falloff of counts. A true geometric update is like this :

once steady state is reached :
n0+n1 == T always

after coding
n0 or n1 += 1 
n0+n1 == T+1 now

n0 *= T/(T+1)
n1 *= T/(T+1)

now n0+n1 == T again

this is equivalent to doing :

n0 or n1 += inc
inc *= (T+1)/T

let G = (T+1)/T is the geometric growth factor

events contribute with weights :

1,G,G^2,G^3,etc..

now, nobody does a geometric update quite like this because it requires high precision counts (though you can do piecewise linear approximations of this and fixed point versions, which can be interesting). There is a way to do a geometric update in finite precision that's extremely common :

p0 probability is fixed point (12-14 bits is common)

at steady state

after coding a 1 : p0 -= p0 >> updshift
after coding a 0 : p0 += (one - p0) >> updshift

this is equivalent to the "renorm every step to keep n0+n1 = T" with T = 1<<updshift

This gives an efficient way to do a very recency-biased (geometric) estimator. For most of the estimators I'm talking about, the non-binary alphabet extension is obvious, and I'm just doing binary here for simplicitly, but in this case the non-binary alphabet version is non trivial. Fabian works it out here : Mixing discrete probability distributions , and Models for adaptive arithmetic coding .

For people familiar with filtering, it should be obvious that what we're really doing here is running filters over the previous events. The "window" estimator is a simple FIR filter with a box response. The geometric estimator is the simplest IIR filter.


In all our (adaptive) estimators, we have ensured that P0 and P1 are never zero - we need to be able to code either bit even if we've never seen one before.

To do this, we often add on a count to n0 and n1 (+B above), or ensure it's non-zero.

In the binary updshift case, the minimum of p0 is where (p0 >> updshift) is zero, that's


p0min = (1 << updshift) - 1

which in practice is actually quite a large minimum probability of the novel symbol. That turns out to be desirable in very local fast-adaptive estimators. What you want is if the last 4 events were all 1 bits, you want the probability P1 to go very high very fast - but you don't want to be over-confident about that local model matching future bits, so you want P0 to stay at some floor.

Essentially what we are doing here is blending in the unknown or "flat" model (50/50 probability of 0 or 1 bit) with some desired weight. So you might have a very jerky strongly adapting local model, but then you also blend in the flat model as a hedge.


The geometric update be extended to "two speed" :


track two running estimators, p0_a and p0_b

make p0 = (p0_a + p0_b)/2
use p0 for coding

after the event is see update each with different speeds :

after coding a 1 : p0_a -= p0_a >> updshift_a
after coding a 0 : p0_a += (one - p0_a) >> updshift_a

and p0_b with updshift_b

eg. you might use
updshift_a = 4 (a very fast model)
updshift_b = 8 (a slower model)
(with one = 1<<14)

Naively this looks like an interesting blend of two models. Actually since it's all just linear, it's in fact still just an IIR filter. It's simply a slightly more general IIR filter; the previous one was a one-tap filter (previous total and new event), this one is a two-tap filter (two previous totals and new event).

But this leads us to somethat that is interesting, which is more general blending.

You could have something like 3 models : flat (all symbols likely), a very fast estimator that strongly adapts to local statistics, and a slow estimator (perhaps n0/n1 counts for the whole file) that is more accurate if the file is in fact stationary.

Then blend the 3 models based on local performance. The blend weight for a simple log-loss system is simply the multiple of probabilities of that model on the preceding symbols.


Now, a common problem with these IIR type filters is that they assume steady state. You may recall previously we talked about the renormalization-based adaptive coder that has two phases :


track n0,n1

ramp-up phase , while (n0+n1) < T
  initialize n0=n1=0
  n0 or n1 += 1

stready-state :
  when n0+n1 = T , renorm total to T/2
  n0 or n1 += 1

If you're doing whole-file entropy coding (eg. lots of events) then maybe the ramp-up phase is not important to you and you can just ignore it, but if you're doing context modeling (lots of probability estimators in each node of the tree, which might not see very many events), then the ramp-up phase is crucial and can't be ignored.

If you want something efficient (like the updshift geometric model), but that accounts for ramp-up vs steady state, the answer is table lookups. (the key difference in the ramp-up phase is that adaptation early on is much faster than once you reach steady state)

This actually goes back to the ancient days of arithmetic codec, in the work of people like Howard & Vitter, and things like the Q-coder from IBM.

The idea is that you have a compact state variable which is your table index. It starts at an index for no events (n0=0,n1=0), and counts up through the ramp-up phase. Then once you reach steady state the index ticks up and down on a line like the p0 in updshift. Each index has a state transition for "after a 0" and "after a 1" to adapt. Something like :


ramp-up :

0: {0,0} -> 1 or 2
1: {1,0} -> 3 or 4
2: {0,1} -> 3 or 5
3: {1,1} -> 6 or 7
4: {2,0} -> 
5: {0,2} -> 

etc.

then say T = 16 is steady state, you have

{0,16} {1,15} {2,14} ... {16,0}

that just transitions up and down

And obviously you don't need to actually store {n0,n1}, you just store p0 in fixed point so you can do divide-free arithmetic coding. So there's like a tree of states for the ramp-up phase, then just a line back and forth at steady state.

And those states are not actually what you want at steady state. Actually finding the ideal probabilities for steady state is complex and in the end can only be solved by iteration. I won't go into the details but just quickly touch on the issues.

You might start with a mid point at p0=0.5 , at simulated T=16 that corresponds to {8,8} , so you consider stepping to {8,9} after seeing a 1 and renormalize to T=16, that gives p0=8/17 = 0.47059 ; that corresponds to a geometric update with scaling factor G = 17/16. If you keep seeing 1's, then p0 keeps going down like that. But if you saw a 0, then p0 -> p0 + (1 - p0) * (1 - 1/G) , so 0.47059 -> 0.50173 , which is not back to where you were.

This should be intuitive because with geometric recency, if you see a 1 bit then a 0 bit, the 0 you just saw counts a bit more than the 1 before, so you don't get back to the midpoint. With geometry recency the p0 estimated for seeing bits 01 is not the same as after seeing 10 - the order matters. This is also good intuition why simple counting estimators like KT are not very useful in data compression - the probability of 0 after seeing "11110000" is most likely the not the same after seeing "00001111" . Now you might argue that we're asking our order-0 estimator to do non-order-0 things, we aren't giving it a memoryless bit, we should have used some higher order statistics or a transform or something first, but in practice that's not helpful.

The short answer is you just need lots of states on the steady-state line, and you have to numerically optimize what the probability in each state is by simulating what the desired probability is when you arive there in various ways and averaging them; a kind of k-means quantization type of thing.

Another issue is how you do the state transition graph on the steady-state line. When you are out at the ends, say very low p0 so a 1 bit is highly predicted - if you see another 1 bit, then p0 does not change very much, but if you see a 0 bit (unexpected), then p0 should change a lot. This is actually information theory in a microcosm - when you see the expected events, they don't jar your model very much, because they are what you expected, they contain little new information, when you see events that had very low probability, that's huge information and jars p0 a lot.

(there's some ancient code for a coder like this and a table in crblib ; "rungae.c" / "ladder.c")

You could store the # of steps to take up or down after seeing a 0 or 1 bit. One of them could be implicit. For example when you see a more probable symbol, always take 1 step, when you see a less probable symbol, take many steps (maybe 3). Another clever way to do it is used in the Q-coder (and QM and MQ). They have a steady state line of states, but only change state when the arithmetic coder outputs a bit. This means you have to see 1/log2(P) events before you change states, which is exactly what you want - when P is very high, log2(P) is tiny and you won't step until you see several. This method cannot be used in modern arithmetic coders that output byte by byte, it requires bitwise renormalization. It's neat because it lets you use a very tiny table (53 states) and you can put the density where you need it (mostly around p0=0.5) but still have states way out at the extreme probabilities to code them efficiently.


The next step in the evolution is secondary statistics.

If you have this {n0,n1} state transition table in the last section, that's a state index. The straightforward way to do it is that each state has a p0 precomputed that corresponds to n0,n1 and you use that for coding.

With secondary statistics, instead of using the p0 that you *expected* to observe for given past counts, you use the p0 that you actually *observed* in that same state in the past.


Say you're in a given state S after seeing bits 0100
(n0 =3, n1 =1 , but order matters too)

You could compute the p0 that should be seen after that sequence with some standard estimator
(geometric or KT or whatever)

Or, screw them.  Instead use S as a lookup to a secondary model.

SecondaryStatistics[S]

contains the n0 and n1 actually coded from the state S
previous times that you were in state S

This was the SEE idea from PPMZ (then Shkarin's PPMD (different from Teahan's PPMD) and Mahoney's PAQ (sometimes called APM there)). In the real world there are weird nonlinearities in the actual probabilities of states that can't be expressed well with simple estimators. Furthermore, those change from file to file, so you can't just tabulate them, you need to observe them.

A common hacky thing to do is to use a different estimator if n0=0 or n1=0 ; eg. if one of the possible symbols has never been seen at all, special case it and don't use something like a standard KT estimator that gives it a bias to non-zero probability. This is done because in practice it's been observed that deterministic contexts have very different statistics. Really this is just a special case version of something more general like secondary statistics.

The other big step you could take is mixing. But that's rather going beyond simple order-0 estimators so I think it's time to stop.

1/20/2017

Oodle on the Nintendo Switch

EDIT : See newer post : Oodle 2.7.3 on the Nintendo Switch

Original post follows :


Oodle is coming soon (in 2.4.2) to the Nintendo Switch (NX), an ARM A57 device.

Quick performance test vs. the software zlib (1.2.8) provided in the Nintendo SDK :

ADD : Update with new numbers from Oodle 2.6.0 pre-release (11-20-2017) :


file  : compressor  :  ratio      : decode speed

lzt99 : nn_deflate  :  1.883 to 1 : 74.750 MB/s

lzt99 : Kraken  -z8 :  2.615 to 1 : 275.75 mb/s  (threadphased 470.13 mb/s)
lzt99 : Kraken  -z6 :  2.527 to 1 : 289.06 mb/s
lzt99 : Hydra 300 z6:  2.571 to 1 : 335.68 mb/s
lzt99 : Hydra 800 z6:  2.441 to 1 : 458.66 mb/s
lzt99 : Mermaid -z6 :  2.363 to 1 : 556.85 mb/s
lzt99 : Selkie  -z6 :  1.939 to 1 : 988.04 mb/s

Kraken (z6) is 3.86X faster to decode than zlib, with way more compression (35% more).
Selkie gets a little more compression than zlib and is 13.35 X faster to decode.

All tests single threaded, 64-bit. (except "threadphased" which uses 2 threads to decode)

I've included Hydra at a space-speed tradeoff value between Kraken & Mermaid (sstb=300). It's a bit subtle, perhaps you can see it best in the loglog chart (below), but Hydra here is not just interpolating between Kraken & Mermaid performance, it's actually beating both of them in a Pareto frontier sense.


OLD :

This post was originally done with a pre-release version of Oodle 2.4.2 when we had just gotten Oodle running on the NX. There was still a lot of work to be done to get it running really properly.

lzt99                : nn_deflate : 1.883 to 1 : 74.750 MB/s
lzt99                : LZNA       : 2.723 to 1 : 24.886 MB/s
lzt99                : Kraken     : 2.549 to 1 : 238.881 MB/s
lzt99                : Hydra 300  : 2.519 to 1 : 274.433 MB/s
lzt99                : Mermaid    : 2.393 to 1 : 328.930 MB/s
lzt99                : Selkie     : 1.992 to 1 : 660.859 MB/s

8/20/2016

PNG without ZLib

If you need to send PNG images in a compressed archive, here's a tip.

PNG's are internally compressed with Zlib. When you run another compressor (such as Oodle) on an already-compressed file like PNG, it won't be able to do much with it. It might get a few bytes out of the headers, but typically the space-speed tradeoff decision in Oodle will not think that gain is worth bothering with, so the PNG will just be sent uncompressed.

There are a few reasons why you might want to use an Oodle compressor rather than the Zlib inside PNG. One is to reduce size; some of the Oodle compressors can make the files smaller than Zlib can. Another is for speed, if you use Kraken or Mermaid the decoder is much faster than the Zlib decompression in PNG.

Now obviously if you want the smallest possible lossless image, you should use an image-specific codec like webp-ll , but we will assume here that that isn't an option.

You could of course just decode the PNG to BMP or TGA or some kind of simple sample format, but that is not desirable. For one thing it changes the format, and your end usage loader might be expecting PNG. Your PNG's might be using PNG-specific features like borders or transparency or whatever that is hard to translate to other formats.

But another is that we want the PNG to keep doing its filtering. Filtered image samples from PNG will usually be more compressible by the back-end compressor than the raw samples in a BMP.

The easy way to do this all is just to take an existing PNG and set its ZLib compression level to 0 (just store). You keep all the PNG headers, and you still get the pixel filtering. But the samples are now uncompressed, so the back-end compressor (Oodle or whatever) gets to work on them instead of passing through already-ZLibbed data.


pngcp

pngcp is a utility from the official libpng distribution. It reads & writes a png and can change some options.

Usage for what we want is :


pngcp --level=0 --text-level=0 from.png to.png

I have made a Win32 build with static libs of pngcp for your convenience :

pngcp.zip

I also added a --help option ; run "pngcp --help". The official pngcp seems to have no help or readme at all that explains usage.

I *think* that pngcp preserves headers & options & pixel formats BUT I'M NOT SURE, it's not my code, YMMV, don't go fuck up your pngs without testing it. If it doesn't work - hey you can get pngcp from the official distro and fix it.

I used libpng 1624. The vc7.1 project in libpng worked fine for me. pngcp needed a little bit of de-unixification to build in VC but it was straightforward. You need zlib ; I used 1.2.8 and it worked fine; you need to make a dir named "zlib" at the same level as libpng. I did "mklink /j zlib zlib-1.2.8".

* CAVEAT : this isn't really the way I'd like to do this. pngcp loads the PNG and then saves it out again, which introduces the possibility of losing metadata that was stuffed in the file or just screwing it up somehow. I'd much rather do this conversion without ever actually loading it as an image. That is, take the PNG file as just a binary blob, find the zlib streams and unpack them, store them with a level 0 header, and pass through the PNG headers totally untouched. That would be a much more robust way to ensure you don't lose anything.


cbpngz0

cbpngz0 usage :


cbpngz0 from to

cbpngz0 uses the cblib loaders, so it can load bmp,tga,png,jpeg and so on. It writes a PNG at zlib level 0. Unlike pngcp, cbpngz0 does NOT support lots of weird formats; it only writes 8-bit gray, 24-bit RGB, and 32-bit RGBA. This is not a general purpose PNG zlib level changer!! Nevertheless I find it useful because of the wider range of formats it can load.

cbpngz0.zip

cbpngz0 is an x64 exe and uses the DLLs included.


Some sample results.

I take an original PNG, then try compressing it with Oodle two ways. First, convert it to a BMP and compress the BMP. Second, convert to a Zlib level 0 PNG (the "_z0.png") and then compress with Oodle. The differene between the two is that the _z0.png gets the PNG filters, and of course stays a PNG if that's what your loader expects. If you give the original PNG to Oodle, it passes it through uncompressed.


porsche640.png             529,821

porsche640.bmp             921,654

porsche640.bmp.ooz         711,273

porsche640_z0.png.ooz      508,091

-------------

blinds.png                 328,754

blinds.bmp               1,028,826

blinds.bmp.ooz             193,130

blinds_z0.png.ooz          195,558

-------------

xxx.png                    420,149

xxx.bmp                    915,054

xxx.bmp.ooz                521,861

xxx_z0.png.ooz             409,311

The ooz files are made with Oodle LZNA -z6 (level Optimal2).

You can see there are some big gains possible with replacing Zlib (on "blinds"). On normal photographic continuous tone images Zlib does okay so the gains are small. On those images, compressing the BMP without filters is very bad.


Another small note : if your end usage PNG loader supports the optional MNG format LOCO color transform, that usually helps compression.

ADD : Chris Maiwald points out that he gets better PNG filter choice by using "Z_FIXED" (which is the zlib option for fixed huffman tables instead of per-file huffman). A bit weird, but perhaps it biases the filter choice to be more consistent?

I wonder if choosing a single PNG filter for the whole image would be better than letting PNG do its per-row thing? (to try to make the post-filter residuals more consistent for the back end modeling stage). For max compression you would use something like a png optimizer that tried various filter strategies, but instead of rating them using zlib, rate with the back-end of your choice.

7/29/2016

Scatter Plots

I tweaked my scatter plot generation per Fabian. They're still log-log but now labeled with the non-log axis values so it's easier to read off the actual ratio and speed without doing a pow2 in your head.

Silesia :

Game Test Set :

Seven, total :

Seven, all files :

Slow slow compressors

lzma is really too slow to decode.

total                : Kraken     : 2.914 to 1 : 1053.961 MB/s
total                : lzma       : 3.186 to 1 : 52.660 MB/s

(Win64 Core i7-3770 3.4 GHz)

Kraken is around 20X faster than lzma, but lzma compresses better (about 9%). That's already a tip that something is horribly wrong; you have a 2000% speed difference and a 9% size difference.

If we look at the total time to load compressed from disk + decompress, we can make these speedup factor curves :

At very low disk speeds, the higher compression of lzma provides a speedup over Kraken. But how slow does the disk have to be? You can see the intersection of the curves is between 0 and 1 on the log scale, that's 1-2 MB/s !!

For any disk faster than 2 MB/s , load+decomp is *way* faster with Kraken. At a disk speed of 16 MB/s or so (log scale 4) the full load+decomp for Kraken is around 2X faster than with lzma. And that's still a very slow disk (around optical speed).

Now, this is a speedup factor for load *then* decomp. If you are fully overlapping overlapping IO with decompression, then some of the decode time is hidden.

*But* that also assumes that you have a whole core to give to decompression. And it assumes you have no other CPU to work to do after loading.

The idea that you can hide decompressor time in IO time only works if you have enough independent loads so that there's lots to overlap (because if you don't, then the first IO and last decompress will never overlap anything), and it assumes you have no other CPU work to do.

In theory I absolutely love the idea that you just load pre-baked data which is all ready to go, and you just point at it, so there's no CPU work in loading other than decompression, but in practice that is almost never the case. eg. for loading compressed web pages, there's tons of CPU work that needs to be ton to parse the HTML or JS or whatever, so the idea that you can hide the decompressor time in the network latency is a lie - the decompressor time adds on to the later processing time and adds directly onto total load latency.

The other factor that people often ignore is the fact that loading these days is heterogeneous.

What you actually encounter is something like this :


Download from internet ~ 1 MB/s
Load from optimal disc ~ 20 MB/s
Load from slow HDD ~ 80 MB/s
Fast SSD ~ 500 MB/s
NVMe drive on PCIe ~ 1-2 GB/s
Load from cache in RAM ~ 8 GB/s

We have very heterogeneous loading - even for a single file loaded by the same application.

The first time you load it, maybe you download from the internet, and in that case a slow decompressor like lzma might be best. But the next time you load it's from the cache in RAM. And the time after that it's from HDD. In those cases, using lzma is a disaster (in the sense that the loading is now nearly instant, but you spend seconds decoding; or in the sense that just loading uncompressed would have been way faster).

One issue that I think is not considered is that making the right choice in the slow-disk zone is not that big of a deal. On a 1 MB/s disk, the difference in "speedup" between lzma and Kraken is maybe 2% in favor of lzma. But on a 100 MB/s it's something like 400% in favor of Kraken.

Now in theory maybe it would be nice to have different compressors for download & disk storage; like you use something like lzma for downloadable, and then decode and re-encode to ZStd for HDD loading. In practice nobody does that and the advantage over just using ZStd all the time is very marginal.

Also in theory it would be nice if the OS cache would cache the decompressed data rather than caching the compressed data.


TODO : time lzma on PS4. Usually PS4 is 2-4X slower than my PC, so that puts lzma somewhere in the 10-25 mb/s range, which is very borderline for keeping up with the optical disc.


DVD 16x is ~ 20 MB/s (max)
PS4 Blu-Ray is 6x ~ 27 MB/s (max)

PS4 transparently caches Blu-Ray to HDD

Of course because of the transparent caching to HDD, if you actually keep files in lzma on the disc, and they are cached to HDD, loading them from HDD is a huge mismatch and makes lzma the bottleneck.

That is, in practice on the PS4 when you load files from disc, they are sometimes actually coming from the HDD transparent cache, so you sometimes get 20 MB/s speeds, and sometimes 100 MB/s.


Now of course we'd love to have a higher-ratio compressor than Kraken which isn't so slow. Right now, we just don't have it. We have Kraken at 1000 MB/s , LZNA at 120 MB/s , lzma at 50 MB/s - it's too big of a step down in speed, even for LZNA.

In order for the size gain of lzma/LZNA to be worth it, it needs to run a *lot* faster, more like 400 mb/s. There needs to be a new algorithmic step in the high compress domain to get there.

At the moment the only reason to use the slower decoders than Kraken is if you simply must get smaller files and damn the speed; like if you have a downloadable app size hard limit and just have to fit in 100 MB, or if you are running out of room on an optical disc or whatever.

old rants