Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)

win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :

Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

No comments:

old rants