5/17/2016

The Weissman Score

Wikipedia suggests the Weissman score should be

which ignoring constants is just W = r/logT

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :


W = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi *comp_ratio) )

or

W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evalue performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :


Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s ; which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made . What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s,2 MB/s,4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.

The Weissman (decode Core i7-3770 3.4 GHz; 1-256 range) scores on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !


ADD : this post was a not-sure-if-joking. But I actually think it's useful. I find it useful anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell if that tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the pareto frontier and take those. But when iterating an optimization of a parameter you want just a simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for, size-dominant slower compressors should use Weissman 1-256 , for balance of space and super speed use Weissman 1-inf (or 40-800), for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed paramaters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider like latency, storage usage, etc. that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example high speed decoders should always set the upper speed to infinity, which corresponds to the use case that the compressed data might be already resident in RAM so it has zero time to load).

5/15/2016

PS4 Battle : LZ4 vs LZSSE vs Oodle

PS4 is the arena. Three compressors enter. (revised 05-16)

Advantages of the PS4 : consistent well-defined hardware for reproducible testing. Slow platform that game developers care about being fast on. Builds with clang which is the target of choice for some compression libraries that don't build so easily on MSVC.

After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable. To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.

LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe
LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz safing. LZ4's code is much simpler, it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical, it's just so simple, it's way easier to be sure that it's correct)

Conclusion :

Comparing the two Safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_Safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 (PS4 is actually a pretty bad platform for LZ4; there're a variety of issues; see GDC "Taming the Jaguar" slides ; but particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed, sometimes LZ4 was faster.

I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (for example, that's not the case on Windows/x64, it's much closer there). I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.

The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).

On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets worse ratio, sometimes much worse speed. If I was a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.

Full data :

REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!


PS4 clang-3.6.1

------------------

lzt99 :

lz4hc : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s

LZSSE8  : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s

Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s

Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode           : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s

------------------

texture.bc1

lz4hc : 2,188,524 -> 2,068,268 =  7.560 bpb =  1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s

LZSSE8 : 2,188,524 -> 2,111,182 =  7.717 bpb =  1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s

Oodle LZB16 : texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s

------------------

lightmap.bc3

lz4hc : 4,194,332 ->   632,974 =  1.207 bpb =  6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s

LZSSE8 encode    :  4,194,332 ->   684,062 =  1.305 bpb =  6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s

Oodle LZB16 : lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s

------------------

silesia_mozilla

lz4hc : 51,220,480 ->22,062,995 =  3.446 bpb =  2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s

LZSSE8 : 51,220,480 ->22,148,366 =  3.459 bpb =  2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s

Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s

Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 =  3.261 bpb =  2.453 to 1
decode           : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s

------------------

breton.dds

lz4hc :   589,952 ->   116,447 =  1.579 bpb =  5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s

LZSSE8 encode    :  589,952 ->   119,659 =  1.623 bpb =  4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s

Oodle LZB16 : breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s

Oodle LZBLW : breton.dds :    589,952 ->   132,934 =  1.803 bpb =  4.438 to 1
decode           : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s

------------------

d.dds

lz4hc encode     :  1,048,704 ->   656,706 =  5.010 bpb =  1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s

LZSSE8 encode    :  1,048,704 ->   695,583 =  5.306 bpb =  1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s

Oodle LZB16 : d.dds :  d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s

------------------

all_dds

lz4hc : 79,993,099 ->47,848,680 =  4.785 bpb =  1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s

LZSSE8 : 79,993,099 ->47,807,041 =  4.781 bpb =  1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s

Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s

----------

baby_robot_shell.gr2

lz4hc : 58,788,904 ->32,998,567 =  4.490 bpb =  1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s

LZSSE8 : 58,788,904 ->33,201,406 =  4.518 bpb =  1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s

Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 =  4.111 bpb =  1.946 to 1
decode           : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s


After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.


LZB16, PS4 clang-3.6.1

post-fuzz-safing :

    lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
    decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
    
    texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
    decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
    
    lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
    decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
    
    silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
    decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
    
    breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
    decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
    
    d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
    decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
    
    all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
    decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
    
    baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
    decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

pre-fuzz-safing reference :

    lzt99 912.92 mb/s
    texture.bc1 598.61 mb/s
    lightmap.bc3 876.19 mb/s
    silezia_mozilla 794.72 mb/s
    breton.dds 1078.52 mb/s
    d.dds 888.73 mb/s
    all_dds 701.45 mb/s
    baby_robot_shell.gr2 877.81 mb/s

Mild speed penalty on most files.

5/12/2016

Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)


win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :


Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers that means it is a *latency* reduction for decoding individual blocks, not just a *throughput* reduction. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.


Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb

-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7 , and compression ratio si improved. (most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit). Of course even at -z7 Kraken is way faster than anything else comparable!

Tips for benchmarking a compressor

You're about to evaluate Oodle (thanks for having a look!) or some other compressor. Before you start, consider these tips :

1. Time only the compressor.

Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.

2. Time what you actually care about.

If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.

3. Choose the right options.

Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.

4. Run apples-to-apples (threads-to-threads) comparisons.

It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.

Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)

5. Take the MIN of N run times.

To get reliable timing, you need to run the loop many times, and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking-down for speedstep, etc. I usually do 30 *per core* but you can probably get a way with a bit less.

6. Wipe the cache.

Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.

7. Don't pack a bunch of files together in a tar if that's not how you load.

It may seem like a good way to test to grab your bunch of test files and pack them together in a tar (or zip -0 or similar package) and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load whole file at a time, test the compressors on whole file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.

8. Choose your test set.

If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (though actually, even that is a bit subtle, since some buffers are more latency sensitive than others, so for example you might care more about the first few things you load to get into a running application as quickly as possible). That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (though consider *not* running them your compressed-file loading path, in which case you should exclude them from testing). It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.

9. Look at the spectrum of results, not the sum.

After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is there are some large incompressible files, they can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate and just have a look at individual cases to see what's going on, how the results differ, and try not to simplify to just looking at the sum.

10. If you do sum, sum *time* not speed, sum *size* not ratio.

Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size, and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio" , because those are inverted meaures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make "overall speed" and "overall ratio" from those total.

11. Try not to malloc in the timing loop.

Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).

12. Consider excluding almost-incompressible files.

This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.

13. Tiny files should either be excluded or packed together.

There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application.

5/08/2016

Order-1 Huffman

This is a simple idea that's rarely written down, so I thought I'd do a quick summary.

To my knowledge I was the first person to write about it (in "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) ) but it's one of those simple ideas that probably a lot of people had and didn't write about (like Deferred Summation). It's also one of those ideas that keeps being forgotten and rediscovered over the years.

(I don't know much about the details of how Brotli does this, it may differ. I'll be talking about how I did it).

(also by "Huffman" I pretty much always mean "static Huffman" where you measure the histogram of a block and transmit the code lengths, not "adaptive Huffman" (modifying codelens per symbol (bleck)) or "deferred summation Huffman" (codelens computed from histogram of previous data with no explicit codelen transmission))

Let's start with just the case of order-1 8 bit literals. So you're coding a current 8-bit symbol with an 8-bit previous symbol as context. You can do this naively by just have 256 arrays, one for each 8-bit context. The decoder looks like this :


256 times :
read codelens from file
build huffman decode table

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , huff_table[o1] );

and on a very large file (*) that might be okay.

(* actually it's only okay on a very large file with completely stable statistics, which never happens in practice. In the real world "very large files" don't usually exist; instead they act like a sequence of small/medium files tacked together. That is, you want a decoder that works well on small files, and you want to be able to reset it periodically (re-transmit huffman codelens in this case) so that it can adapt to local statistics).

On small files, it's disastrous. You're sending 256 sets of codelens, which can be a lot of wasted data. Worst of all it's a huge decode time overhead to parse out the codelens and build the decode tables if you're only going to get a few symbols in that context.

So you want to reduce the count of huffman tables. A rough guideline is to make the number of tables proportional to the number of bytes. Maybe 1 table per 1024 bytes is tolerable to you, it depends.

(another goal for reduction might be to get all the huff tables to fit in L2 cache)

So we want to merge the Huffman tables. You want to find pairs of contexts that have the most similar statistics and merge those. If you don't mind the poor encoder-time speed, a good solution is a best-first merge :


for each pair {i,j} (i<j)
merge_cost(i,j) = Huffman_Cost( symbols_i + symbols_j ) - Huffman_Cost( symbols_i ) - Huffman_Cost( symbols_j )

Huffman_Cost( symbols ) = bits to send codelens + bits to encode symbols using those codelens


while # of contexts > target , and/or merge cost < target
pop lowest merge_cost
merge context j onto i
delete all merge costs involving j
recompute all merge costs involving i

if the cost was just entropy (H) instead of Huffman_Cost , then a merge_cost would always be strictly >= 0 (separate statistics are always cheaper than combining). But since the Huffman codelen transmission is not free, the first merges will actually reduce encoded size. So you should always do merges that are free or beneficial, even if huffman table count is low enough.

So contexts with similar statistics will get merged together, since coding them with a combined set of codelens either doesn't hurt or hurts only a little (or helps, with the cost of codelen transmission counted). In this way contexts where it wasn't really helping to differentiate them will get reduced.

Once this is done, the decoder becomes :


get n = number of huffman tables

n times :
read codelens from file
build huffman decode table

256 times :
read tableindex from file
merged_huff_table_ptr[i] = huff_table[ tableindex ]

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[o1] );

So merged_huff_table_ptr[] is still a [256] array, but it points at only [n] unique Huffman tables.

That's order-1 Huffman!

In the modern world, we know that o1 = the previous literal is not usually the best use of an 8-bit context. You might do something like top 3 bits of ptr[-1], top 2 bits of [ptr-2], 2 bits of position, to make a 7-bit context.

One of the cool things order-1 Huffman can do is to adaptively figure out the context for you.

For example with LZMA you have the option of the # of literal context bits (lc) and literal pos bits (lp). You want them to be as low as possible for better statistics, and there's no good way to choose them per file. (usually lc=2 or lp=2 , usually just one or the other, not both)

With order-1 Huffman, you just make a context with 3 bits of lc and 3 bits of lp, so you have a [64] 6-bit context. Then you let the merger throw away states that don't help. For example if it's a file where pos-bits are irrelevant (like text), they will just get merged out, all the lc contexts that have different lp values will merge together.


05-03-16 | Brotli signed int mode

Brotli signed int context mode. Looks like a good idea. My guess is this is what's helping Brotli10 on the binary files I wrote about in the previous post (horse.vipm and so on)

Signed int takes the previous two bytes and forms a 6-bit context from them thusly :


  Context = (Lut2[b2]<<3) | Lut2[b1];

      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

So, it's roughly categorizing the two values into ranges, which means it can act as a kind of linear predictor (if that fits the data), eg. two preceding values in group 4 = my value probably is too, or if b2 is a "3" and b1 is a "4" then I'm likely a "5". Or not, if the data isn't linear like that. Or maybe there's only correlation to b1 and b2 gets ignored, which the order-1-huff can also model.

One thing I like is holding out values 00 and FF and special cases that get a unique bucket. This lets you detect the special cases of last two bytes = 0000,FFFF,FF00,00FF , which can be pretty important on binary.

I think that for the type of data we get in games that often has floats, it might be worth it to single out 7F and 80 as well, something like :

0,1111....
1..
22...
2........2,3
4,5555...
5...
66.....
6........6,7
but who knows, would have to test to see.

5/03/2016

Underappreciated Compressors

1. LZX. LZX is crazy good. It was such a huge step at the time, and nobody really recognized it. (I sure didn't) (I guess someone at MS did because they bought it).

It's a shame it never got a good mainstream implementation. It could/should have been the LZ we all used for the past 10 years.

One of the little mistakes in LZX was the 21 bit offset limit. This must have seemed enormous back on the Amiga in 1995, but very quickly became a major handicap against LZs with unlimited windows.

LZX with unlimited window (eg. on files less than 2 MB) is competitive with any modern LZ, especially on binary structured data where it really shines. In hindsight, LZX is the clear ancestor to LZMA and it was way ahead of its time. We're only clearly beating it in the past year or two (!!).

2. RAR. The primary LZ in RAR is a pretty straightforward LZ-Huff (I believe). It's fine, it's nothing bad or special.

What makes RAR special is the filters. It still has the best filters of any compressor I know.

RAR+filters often *beats* LZMA and other very slow high ratio compressors.

The special thing about the RAR filters is that they aren't like most of the "precomp" solutions that just try to recognize WAV headers and things like that - RAR may do some of that (I have no idea) - but it also definitely finds filters that work on headerless data. Like, you can take a BMP or WAV and strip off the header and RAR will still figure out that there's data to filter in there; it must have some analysis heuristics, and they're better than anything else I've seen.

As an example of when RAR filters do magic, here's a 24-bit RGB BMP with the first 100k stripped, so it's headerless and not easily recognized by file-type-detection filters :

PDI_1200_bmp_no_header.zl8.LZNA,1241369
PDI_1200_bmp_no_header.nz,1264747
PDI_1200_bmp_no_header.BitKnit,1268120  // <- wow BitKnit !
PDI_1200_bmp_no_header.LZNA,1306670
PDI_1200_bmp_no_header.rar,1312621  // <- RAR filters!
PDI_1200_bmp_no_header.7z,1377603
PDI_1200_bmp_no_header.brotli10,1425996
PDI_1200_bmp_no_header_lp2.7z,1469079
PDI_1200_bmp_no_header.Kraken,1506915
PDI_1200_bmp_no_header.lzx21,1580593
PDI_1200_bmp_no_header.zstd060,1619920
PDI_1200_bmp_no_header.mc-.rar,1631419 // <- RAR unfiltered
PDI_1200_bmp_no_header.brotli9,1635105
PDI_1200_bmp_no_header.z9.zip,1708589
PDI_1200_bmp_no_header.lz4xc4,1835955
PDI_1200_bmp_no_header.raw,2500000

That said, it does make mistakes. Sometimes filters can make things way worse if they make a wrong decision. They don't have a "filters must help" safety check. This is easy to prevent, you just run also with no filter and make sure it helped, but they seem to not do that (presumably to save the encode time) and the results can be disastrous :

lightmap.bc3.LZNA,361185
lightmap.bc3.7z,373909
lightmap.bc3_lp2.7z,387590
lightmap.bc3.brotli10,391602
lightmap.bc3.BitKnit,416208
lightmap.bc3.zstd060,417956
lightmap.bc3.Kraken,431476
lightmap.bc3.lzx21,441669
lightmap.bc3.mc-.rar,457893  // <- RAR with disabled filters
lightmap.bc3.brotli9,498802
lightmap.bc3.z9.zip,583178
lightmap.bc3.rar,778363  // <- RAR with filters huge fuckup !!
lightmap.bc3.raw,4194332
RAR filters fucking up on DXTC (BCn) is pretty consistent :
c.dds.nz,371610
c.dds.7z,371749
c.dds.zl8.LZNA,371783
c.dds.LZNA,373245
c.dds_lp2.7z,375384
c.dds.lzx21,395866
c.dds.brotli10,399674
c.dds.BitKnit,400528
c.dds.Kraken,405563
c.dds.mc-.rar,408363 // <- unfiltered is okay
c.dds.zstd060,411515
c.dds.brotli9,426948
c.dds.z9.zip,430952
c.dds.rar,438070     // <- oops!
c.dds.raw,524416
Sometimes it does magic :
horse.vipm_lp2.7z,925996
horse.vipm.LZNA,942950
horse.vipm.7z,945707
horse.vipm.brotli10,955363 // <- brotli10 big step
horse.vipm.rar,971716 // <- RAR with filters does magic
horse.vipm.BitKnit,1017740
horse.vipm.lzx21,1029541
horse.vipm.mc-.rar,1066205 // <- RAR with disabled filters
horse.vipm.zstd060,1100219
horse.vipm.Kraken,1106081
horse.vipm.brotli9,1108858
horse.vipm.z9.zip,1155056
horse.vipm.raw,1573070
Here's an XRGB dds where the RAR filters do magic :
d.dds.zl8.LZNA,352232
d.dds.nz,356649
d.dds.BitKnit,360220  // (at zl6 BitKnit beats LZNA ! crushes 7z! wow)
d.dds.LZNA,381250
d.dds.rar,382282  // <- RAR filter crushes 7z
d.dds_lp2.7z,427395
d.dds.7z,452898
d.dds.brotli10,471413
d.dds.Kraken,480257
d.dds.lzx21,520632
d.dds.mc-.rar,534913 // <- RAR unfiltered is poor
d.dds.brotli9,542792
d.dds.zstd060,545583
d.dds.z9.zip,560708
d.dds.raw,1048704
happy.zl8.LZNA,949709
happy.LZNA,955700
happy_lp2.7z,974550
happy.BitKnit,979832
happy.7z,1004359
happy.cOO.nz,1015048
happy.co.nz,1028196
happy.Kraken,1109748
happy.brotli10,1135252
happy.lzx21,1168220
happy.mc-.rar,1177426 // <- RAR unfiltered is okay
happy.zstd060,1199064
happy.brotli9,1219174
happy.rar,1354649 // <- RAR filters fucks up
happy.z9.zip,1658789
happy.lz4xc4,2211700
happy.raw,4155083
Not about RAR, but for historical comparison, lzt24 is another mesh (the "struct72" file here )
lzt24.zl8.LZNA,1164216
lzt24.LZNA,1177160
lzt24.nz,1206662
lzt24_lp2.7z,1221821
lzt24.BitKnit,1224524
lzt24.7z,1262013
lzt24.Kraken,1307691
lzt24.brotli10,1323486
lzt24.brotli9,1359566
lzt24.lzx21,1475776
lzt24.zstd060,1498401
lzt24.mc-.rar,1612286
lzt24.rar,1612286
lzt24.z9.zip,2290695
lzt24.raw,3471552
Found another weird one where RAR filters do magic; lzt25 is super-structured 13-byte structs :
lzt25.rar,40024  // <- WOW RAR filters!
lzt25.nz,45397
lzt25.7z,51942
lzt25_lp2.7z,52579
lzt25.LZNA,58903
lzt25.zl8.LZNA,61582  // <- zl8 LZNA worse than zl6 - weird file
lzt25.lzx21,63198
lzt25.zstd060,64550  // <- ZStd does surprisingly well here, I thought you needed more reps on this file
lzt25.brotli9,67856
lzt25.Kraken,67986
lzt25.brotli10,68472 // <- brotli10 worse than brotli9 !
lzt25.BitKnit,92940  // <- BitKnit oddly struggling
lzt25.mc-.rar,106423 // <- unfiltered RAR is the worst of the LZ's
lzt25.z9.zip,209811
lzt25.lz4xc4,324125
lzt25.raw,1029744

A lot of interesting things to pick out in those reports. (just saying, I'm not gonna address them all)

One just general thing is that the performance of these LZ's is in no way consistent. You can't just say that "X LZ is 5% better than Y", there's no really consistent pattern, they have wildly variable relative performance.

There's a family of sort of normal LZ's - LZX, Brotli9, ZStd, & unfiltered RAR. Then there's the family of the high-compress LZ's, LZNA, 7z, nz. Those are pretty consistently together, and form two end-points.

But then there are the floaters. BitKnit, Kraken, Filtered RAR, and Brotli10 can jump around between the "normal LZ" and "high-compress LZ" region. BitKnit and Brotli10 are the most variable - they both can jump right up to the high-compress LZ's like 7z, but on other files they drop right down into the pack of normal LZ's (LZX, etc.).

I have a guess about what's happening with Brotli. I haven't looked at the code at all, but my guess is that between level 9 and 10 the order-1 context optimization is turned on. In particular, there's this "signed int" context mode which I believe is what does the magic for brotli on things like horse.vipm (for example it has contexts for the case of last two bytes = 0x0000 , or last two bytes = 0xFFFF , which are pretty common on horse). My guess is that this mode is just not even tried at all at level 9, and at level 10 it turns on the code to pick the best context mode, and finds the signed int mode which is great on these files. Not sure.

4/30/2016

Oodle Kraken Pareto Frontier

The whole idea of Kraken is to excel in the size-decodespeed tradeoff. From the beginning of its design, every decision was considered in terms of the size vs decodespeed.

So how does it do?

To visualize the size-decodespeed Pareto frontier, I like to use an imaginary loading problem. You want to load compressed data and then decompress it. One option is the null compressor (or "memcpy") that just loads uncompressed data and then memcpys it. Or you can use compressors that give smaller file sizes, thus quicker loads, but take more time to decompress. By dialing the virtual disk speed, you can see which compressor is best at different tradeoffs of space vs. speed.

Of course you may not actually be just trying to optimize load time, but you can still use this imaginary loading problem as a way to study the space-speed tradeoff. If you care more about size, you look at lower imaginary disk speeds, if you core more about CPU use, you look at higher imaginary disk speeds.

I like to show "speedup" which is the factor of increase in speed using a certain compressor to do load+decomp vs. the baseline (memcpy). So the left hand y-intercepts (disk speed -> 0) show the compression ratio, and the right hand y-intercepts side show the decompression speed (disk speed -> inf), and in between shows the tradeoff. (By "show" I mean "is linearly proportional to, so you can actually look at the ratio of Y-intercepts in the Pareto curve and it tells you compression ratio on the left and decompression speed on the right).

03-02-15 - Oodle LZ Pareto Frontier and 05-13-15 - Skewed Pareto Chart

The chart showing "millis" shows time, so obviously lower is better. I show a few ways to combine load & decomp. Sum is the time to load then decomp sequentially. Max is the larger of the two, which is the time if they were perfectly overlapped (not usually possible). Personally I think "l2 sum" is most useful. This is the sqrt of sum of squares, which is a kind of sum that biases towards the larger; it's kind of like a mix of "sum" and "max", it means you want the sum to be small, but you also want them to similar times.

Kraken tops the Pareto curve for decodespeed vs size for a huge range of virtual disk speed; from around 2 mb/s up to around 300 mb/s.

Of course the Pareto curve doesn't tell the whole picture. For one thing I don't have encode speed in the equation here at all (and sometimes you need to look at the memory use tradeoff too, so it's really a size-decodespeed-encodespeed-memoryuse surface). For another you sometimes have strict requirements, like I must hit a certain file size, and then you just pick the fastest decoder that meets that requirement. One thing the curves do clearly tell you is when a curve just completely covers another (that is, is greater in Y for all values of X), then that compressor is just never needed (in space & decodespeed).


Silesia :


Game Test Set :


Just for comparison to earlier posts on my Blog about the other Oodle compressors, here's a run of lzt99 with the full Oodle compressor set.

I've included ZStd for reference here just to show that Kraken jumping way up is not because the previous Oodle compressors were bad - far from it, they were already Pareto optimal. (I like to compare against ZStd not because it's bad, but because it's excellent; it's the only other compressor in the world I know of that's even close to Oodle in terms of good compression ratio and high speed (ZStd also has nice super-fast encode speeds, and it's targetted slightly differently, it's better on text and Oodle is better on binary, it's not a direct apples comparison)).

You can see that LZHLW is just completely covered by Kraken, and so is now deprecated in Oodle. Even LZNIB and BitKnit are barely peeking out of the curve, so the range where they are the right answer is greatly reduced to more specialized needs. (for example BitKnit still romps on strongly structured data, and is useful if you just need a size smaller than Kraken)

4/28/2016

Performance of Oodle Kraken

A continuing series reporting on the performance of recently released Oodle Kraken

First some general notes about Oodle before the big dump of numbers. (skip to the bottom for charts)

Oodle is not intended to solve every problem in data compression. Oodle Data Compression is mainly designed for compress-once load-many usage patterns, where the compressed size and decode speed is important, but encode speed is not as important. Things like distribution, packaging, when you bake content into a compressed archive and then serve it to many people. I do consider it a requirement to keep the encoders faster than 1 MB/s on my ancient laptop, because less than that is just too slow even for content baking.

Like most data compressors, Oodle has various compression level options that trade off encoder speed for compressed size. So there are faster and slower encoders; some of the fast modes are good tradeoffs for space-speed. Kraken turns out to be pretty good at getting high compression even in the fast modes. I'll do a post that goes into this in the future.

Oodle is mainly intended for packing binary data. Because Oodle is built for games, we focus on the types of data that games ship, such as textures, models, animations, levels, and other binary structured data that often contains various types of packed numeric data. Oodle is good on text but is not really built for text. In general if you are focusing on a specific type of data (eg. text/html/xml) you will get the best results with domain-specific preprocessors, such as xwrt or wbpe. Oodle+wbpe is an excellent text compressor.

Oodle Kraken is intended for high compression use cases, where you care about size. There's a whole family of compressors now that live in the "just below lzma (7zip)" compression ratio domain, such LZHAM, Brotli, Oodle BitKnit, and ZStd. In general the goal of these is to get as close to lzma compression ratio as possible while providing much better speed. This is where Kraken achieves far more than anyone has before, far more than we thought possible.

Today I will be focusing on high compression modes, looking at decode speed vs. compression ratio. All compressors will generally be run in their max compression mode, without going into the "ridiculously slow" below 1 MB/s range. (for Oodle this means running -zl6 Optimal2 but not -zl7 or higher).

Okay, on to lots of numbers.


Rather than post an average on a test set, which can be misleading due to the selection of the test set or the way the results are averaged (total compressed size or average ratio?), I'll post results on a selection of individual files :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps

by encode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps

by decode speed:
lz4hc       :  2.32:1 ,   36.4 enc mbps , 2351.6 dec mbps
Oodle Kraken:  3.51:1 ,    1.2 enc mbps ,  928.0 dec mbps
zstdmax     :  3.24:1 ,    2.8 enc mbps ,  401.0 dec mbps
zlib9       :  2.51:1 ,   12.4 enc mbps ,  291.5 dec mbps
lzham       :  3.56:1 ,    1.5 enc mbps ,  186.4 dec mbps
lzma        :  3.88:1 ,    2.0 enc mbps ,   63.7 dec mbps
-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   30.3 enc mbps , 2737.4 dec mbps
Oodle Kraken:  2.46:1 ,    2.3 enc mbps ,  957.1 dec mbps
zstdmax     :  2.27:1 ,    3.8 enc mbps ,  482.3 dec mbps
zlib9       :  1.77:1 ,   13.3 enc mbps ,  286.2 dec mbps
lzham       :  2.44:1 ,    1.9 enc mbps ,  166.0 dec mbps
lzma        :  2.65:1 ,    3.1 enc mbps ,   42.3 dec mbps
-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps

by encode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps

by decode speed:
lz4hc       :  1.67:1 ,   20.4 enc mbps , 2226.9 dec mbps
Oodle Kraken:  2.18:1 ,    1.0 enc mbps ,  684.6 dec mbps
zstdmax     :  2.02:1 ,    3.3 enc mbps ,  289.4 dec mbps
zlib9       :  1.83:1 ,   13.3 enc mbps ,  242.9 dec mbps
lzham       :  2.17:1 ,    1.3 enc mbps ,  127.7 dec mbps
lzma        :  2.37:1 ,    2.1 enc mbps ,   40.8 dec mbps
-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps

by encode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps

by decode speed:
lz4hc       :  1.78:1 ,   40.1 enc mbps , 2364.4 dec mbps
Oodle Kraken:  3.77:1 ,    1.5 enc mbps ,  878.3 dec mbps
zstdmax     :  2.77:1 ,    5.7 enc mbps ,  405.7 dec mbps
zlib9       :  2.19:1 ,   13.9 enc mbps ,  332.9 dec mbps
lzham       :  3.77:1 ,    1.6 enc mbps ,  162.5 dec mbps
lzma        :  4.35:1 ,    3.1 enc mbps ,   59.3 dec mbps
-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps

by encode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps

by decode speed:
lz4hc       :  1.91:1 ,   28.8 enc mbps , 2297.6 dec mbps
Oodle Kraken:  2.70:1 ,    1.0 enc mbps ,  877.0 dec mbps
zstdmax     :  2.64:1 ,    3.5 enc mbps ,  417.8 dec mbps
zlib9       :  2.07:1 ,   16.8 enc mbps ,  269.6 dec mbps
lzham       :  2.77:1 ,    1.6 enc mbps ,  177.6 dec mbps
lzma        :  2.95:1 ,    2.5 enc mbps ,   51.9 dec mbps
-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps

by encode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps

by decode speed:
lz4hc       :  2.35:1 ,   27.5 enc mbps , 2059.6 dec mbps
Oodle Kraken:  3.49:1 ,    1.5 enc mbps ,  789.7 dec mbps
zstdmax     :  3.56:1 ,    2.2 enc mbps ,  394.6 dec mbps
zlib9       :  2.38:1 ,   22.2 enc mbps ,  234.3 dec mbps
lzham       :  3.60:1 ,    1.4 enc mbps ,  196.5 dec mbps
lzma        :  3.64:1 ,    1.8 enc mbps ,   79.5 dec mbps
-------------------------------------------------------

In chart form :

(lz4 decode speed is off the top of the chart)

I'm not including the other Oodle compressors here just to keep things as simple as possible. If you do want more compression than Kraken, and care about decode speed, then Oodle LZNA or BitKnit are much faster to decode than lzma (7zip) at comparable or better compression ratios.

Visit radgametools.com to learn more about Kraken and Oodle.

4/27/2016

Release the Kraken!

Announcing Oodle Kraken! - a new compression algorithm in the Oodle Data Compression library.

Oodle Kraken offers high compression with incredible decode speed, the likes of which has never been seen before.

Oodle Kraken decodes 3X faster than ZLib, and gets way more compression. There's really nothing even close to Kraken, its performance is revolutionary - it's never been possible to get such high compression and blazing fast decode speed before.

On the "Silesia" test set, the performance is :

Kraken :  4.05 to 1 compression ratio,   919.8 decode mb/s
zlib9  :  2.74 to 1 compression ratio,   306.9 decode mb/s
lzma   :  4.37 to 1 compression ratio,    78.8 decode mb/s

(speed measured on a Core i7-3770 3.4 Ghz , x64, single threaded)

Oodle Kraken's high compression makes it great for downloadables and distribution. Kraken's fast decode speed makes it amazing for in-game loading, paging and streaming, with less CPU load taking time away from your game. There's no need to decompress Kraken data when you install - just load compressed data directly. Kraken is so fast that loading compressed data and then decompressing is faster than just loading uncompressed data off disk!

Visit radgametools.com to learn more about Kraken and Oodle.

I'll be writing more about Oodle Kraken in the coming weeks, such as performance analysis on other data sets and vs. other compressors.

The overall performance on Silesia :

And a chart of the performance on each file in Silesia. This is a loglog scatter plot, with log2 of decode speed on the x axis and log2 of compression ratio on the y axis. (so for example from LZMA to Kraken on "sao" is 4.68 steps on the x axis, that means 25.76X faster)

Oodle Kraken is superb on the AMD Jaguar chip that's in the PS4 and XBox One. Most decompressors really struggle to run fast on those platforms, but not Kraken.

It's hard to write about Oodle Kraken without sounding ridiculous. It's really a huge step in data compression, where huge steps almost never happen.

4/25/2016

Huffman Correction

A reader pointed out an error in my blog post - 08-12-10 - The Lost Huffman Paper

I had :


if ( bits >= huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it should have been :

if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

it's corrected now.

In my convention, branchCodeLeftAligned is the left-aligned bitbuff value that means you must go to a higher codelen.

I thought for clarity I'd go ahead and post the example I did with him :


You have this alphabet :

symbol_id, codeLen, code:
0 ; 2 ; 00
1 ; 3 ; 010
2 ; 3 ; 011
3 ; 3 ; 100
4 ; 4 ; 1010
5 ; 4 ; 1011
6 ; 4 ; 1100
7 ; 4 ; 1101
8 ; 4 ; 1110
9 ; 5 ; 11110
10; 5 ; 11111


baseCode[n] = first code of len n - # of codes of lower len


baseCode,
    [2]     0
    [3]     1       = 010 - 1
    [4]     6       = 1010 - 4
    [5]     21      = 11110 - 9

huff_branchCodeLeftAligned
    [2]   0x4000000000000000      010000...
    [3]   0xa000000000000000      101000...
    [4]   0xf000000000000000      111100...
    [5]   0xffffffffffffffff      111111...


My decode loop is :

for(int codeLen=1;;codeLen++) // actually unrolled, not a loop
{

if ( bitbuff < huff_branchCodeLeftAligned[codeLen] ) return symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

}

or

int codeLen = minCodeLen;
while ( bitbuff >= huff_branchCodeLeftAligned[codeLen] ) codeLen++;
sym = symbolUnsort[ getbits(codeLen) - baseCode[codeLen] ];

so if bitbuff is

11010000...

codeLen starts at 2
we check

if ( 11010000.. < 0x4000... ) - false
if ( 11010000.. < 0xa000... ) - false
if ( 11010000.. < 0xf000... ) - true
  return ( 1101 - baseCode[4] ); = 13 - 6 = 7

And a full table-accelerated decoder for this code might be :


// 3-bit acceleration table :
#define TABLE_N_BITS 3
if ( bits < huff_branchCodeLeftAligned[TABLE_N_BITS] )
{
    U32 peek = bits >> (WORD_SIZE - TABLE_N_BITS);
    Consume( table[peek].codeLen );
    return table[peek].symbol;
}

if ( bitbuff < huff_branchCodeLeftAligned[4] ) return symbolUnsort[ getbits(4) - baseCode[4] ];

// 5 is max codelen
// this compare is not always true (because of the bitbuff=~0 problem), but we take it anyway
//if ( bitbuff < huff_branchCodeLeftAligned[5] )

return symbolUnsort[ getbits(5) - baseCode[5] ];

And there you go. MSB-first Huffman that supports long code lengths that exceed the acceleration table size.

Data Compression History : Finnish

Finnish was perhaps the fastest (non-nop) compressor in the world around 1995 (? not sure on the exact year. Definitely before P-Pro and CMOV and branch penalties and such; this is a pre-Pentium-era optimized compressor; it definitely existed before LZP1 (1996) since it was one of the things I benchmarked against). (heck it might be a 286-era compressor, seeing as it's all 16-bit!)

Finnish was by some guys that I assume were from Finland. If anybody knows the correct attribution please let me know.

I was thinking about it the other day because we talked about the old segment register trick that we used to do, and I always thought this was such a neat little bit of code. It also uses the byte-regs as part of word-reg tricks.

Finnish :


; es = CharTable
; bx = hash index
; dl = control bits
; ds[si] = input
; ds[di] = output
; ax/al/ah = input char
; bp = control ptr

ProcessByte     macro SourceReg,BitVal
                        cmp     SourceReg, es:[bx]
                        je      ProcessByte_done
                        or      dl, BitVal
                        mov     es:[bx], SourceReg
                        mov     ds:[di], SourceReg
                        inc     di
ProcessByte_done:       mov     bh, bl
                        mov     bl, SourceReg
                        endm


ProcessBlockLoop:
                        mov bp, di           ; ControlPtr = CompPtr++;
                        inc di
                        xor dl, dl           ; Control = 0;

                        lodsw                ; AX = ds[si] , si += 2
                        ProcessByte al, 80h
                        ProcessByte ah, 40h
                        lodsw
                        ProcessByte al, 20h
                        ProcessByte ah, 10h
                        lodsw
                        ProcessByte al, 08h
                        ProcessByte ah, 04h
                        lodsw
                        ProcessByte al, 02h
                        ProcessByte ah, 01h

                        mov     ds:[bp], dl  ; *ControlPtr = Control

3/14/2016

XRGB Bitmap Test

This is obvious and I think it's been done before, but hey.

I was remembering how modern LZ's like LZMA (BitKnit, etc.) that (can) do pos&3 for literals might like bitmaps in XRGB rather than 24-bit RGB.

In XRGB, each color channel gets its own entropy coding. Also offset bottom bits works if the offsets are whole pixel steps (the off&3 will be zero). In 24-bit RGB that stuff is all mod-3 which we don't do.

(in general LZMA-class compressors fall apart a bit if the structure is not the typical 4/8/pow2)

In compressors it's generally terrible to stick extra bytes in and give the compressor more work to do. In this case we're injecting a 0 in every 4th byte, and the compressor has to figure out those are all redundant just to get back to its original size.

Anyway, this is an old idea, but I don't think I ever actually tried it. So :


PDI_1200.bmp

LZNA :

24-bit RGB : LZNA : 2,760,054 -> 1,376,781
32-bit XRGB: LZNA : 3,676,818 -> 1,311,502

24-bit  RGB with DPCM filter : LZNA : 2,760,054 -> 1,022,066
32-bit XRGB with DPCM filter : LZNA : 3,676,818 -> 1,015,379  (MML8 : 1,012,988)

webpll : 961,356
paq8o8 : 1,096,342

moses.bmp

24-bit RGB : LZNA : 6,580,854 -> 3,274,757
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320

24-bit  RGB with DPCM filter : LZNA : 6,580,854 -> 2,433,246
32-bit XRGB with DPCM filter : LZNA : 8,769,618 -> 2,372,921

webpll : 2,204,444
gralic111d : 1,822,108

other compressors :

32-bit XRGB with DPCM filter : LZA  : 8,769,618 -> 2,365,661 (MML8 : 2,354,434)

24-bit  RGB no filter : BitKnit : 6,580,854 -> 3,462,455
32-bit XRGB no filter : BitKnit : 8,769,618 -> 3,070,141
32-bit XRGB with DPCM filter : BitKnit : 8,769,618 -> 2,601,463

32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
32-bit XRGB: LZA  : 8,769,618 -> 3,009,417

24-bit  RGB: LZMA : 6,580,854 -> 3,488,546 (LZMA lc=0,lp=2,pb=2)
32-bit XRGB: LZMA : 8,769,618 -> 3,141,455 (LZMA lc=0,lp=2,pb=2)

repro:

bmp copy moses.bmp moses.tga 32
V:\devel\projects\oodle\radbitmap\radbitmaptest
radbitmaptest64 rrz -z0 r:\moses.tga moses.tga.rrz -f8 -l1

Key observations :

1. On "moses" unfiltered : padding to XRGB does help a solid amount (3,274,757 to 3,022,320 for LZNA) , despite the source being 4/3 bigger. I think that proves the concept. (BitKnit & LZMA even bigger difference)

2. On filtered data, padding to XRGB still helps, but much (much) less. Presumably this is because post-filter data is just a bunch of low values, so the 24-bit RGB data is not so multiple-of-three structured (it's a lot of 0's, +1's, and -1's, less coherent, less difference between the color channels, etc.)

3. On un-filtered data, "sub" literals might be helping BitKnit (it beats LZMA on 32-bit unfiltered, and hangs with LZNA). On filtered data, the sub-literals don't help (might even hurt) and BK falls behind. We like the way sub literals sometimes act as an automatic structure stride and delta filter, but they can't compete with a real image-specific DPCM.


Now, XRGB padding is an ugly way to do this. You'd much rather stick with 24-bit RGB and have an LZ that works inherently on 3-byte items.

The first step is :


LZ that works on "items"

(eg. item = a pixel)

LZ matches (offsets and lens) are in whole items

(the more analogous to bottom-bits style would be to allow whole-items and "remainders";
that's /item and %item, and let the entropy coder handle it if remainder==0 always;
but probably best to just force remainders=0)

When you don't match (literal item)
each byte in the item gets it own entropy stats
(eg. color channels of pixels)

which maybe is useful on things other than just images.

The other step is something like :


Offset is an x,y delta instead of linear
(this replaces offset bottom bits)

could be generically useful in any kind of row/column structured data

Filtering for values with x-y neighbors

(do you do the LZ on un-filtered data, and only filter the literals?)
(or do you filter everything and do the LZ on filter residuals?)

and a lot of this is just webp-ll

3/11/2016

Seven Test

I made a new test set called "sevens", taking the lead from enwik7, the size of each file is 10 MB (10^7).

The goal here is not to show the total or who does best overall (that relies on how you weight each type of file and whether you think this selection is representative of the occurance ratios in your data), rather to show how each compressor does on different types of data, to highlight their different strengths.

Showing compression factor (eg. N:1 , higher is better) :

run details :


ZStd is 0.5.1 at level 21 (optimal)
LZMA is 7z -mx9 -m0=lzma:d24
Brotli is bro.exe by Sportman --quality 9 --window 24 (*)
Oodle is v2.13 at -z6 (Optimal2)

All competitors run via their provided exe

Some takeaways :

Binary structured data is really where the other compressors leave a lot of room to beat them. ("granny" and "records"). The difference in sizes on all the other files is pretty meh.

BitKnit does its special thang on granny - close to LZNA but 2X faster to decode (and ~ 6X faster than LZMA). Really super space-speed. BitKnit drops down to more like LZHLW levels on the non-record files (LZNA/LZMA has a small edge on them).

I was really surprised by ZStd vs Brotli. I actually went back and double checked by CSV to make sure I hadn't switched the columns by accident. In particular - Brotli does poorly on enwik7 (huh!?) but it does pretty well on "granny", and surprisingly ZStd does quite poorly on "granny" & "records". Not what I expected at all. Brotli is surprising poor on text/web and surprisingly good on binary record data.

LZHLW is still an excellent choice after all these years.

(* = Brotli quality 10 takes an order of magnitude longer than any of the others. I got fed up with waiting for it. Oodle also has "super" modes at -z8 that aren't used here. (**))

(for concreteness : Brotli 11 does pretty well on granny7 ; (6.148:1 vs 4.634:1 at q9) but it runs at 68 kb/s (!!) (and still not LZMA-level compression))

(** = I used to show results in benchmarks that required really slow encoders (for example the old LZNIB optimal "super parse" was hella slow); that can result in very small sizes and great decode speed, but it's a form of cheating. Encoders slower than 1 mb/s just won't be used, they're too slow, so it's reporting a result that real users won't actually see, and that's BS. I'm trying to be more legit about this now for my own stuff. Slow encoders are still interesting for research purposes because they show what should be possible, so you can try to get that result back in a faster way. (this in fact happened with LZNIB and is a Big Deal))

Seven Test Space-Speeds

Showing decompress time space-speed tradeoff on the different files of "seven test" :

records7

granny7

game7

exe7

enwik7

dds7

audio7

Note on the test :

This is running the non-Oodle compressors via my build of their lib (*). Brotli not included because it's too hard to build in MSVC (before 2010). "oohc" here is "Optimal2" level (originally posted with Optimal1 level, changed to Optimal2 for consistency with previous post).

The sorting of the labels on the right is by compressed size.

Report on total of all files :

-------------------------------------------------------
by ratio:
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s

by encode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s

by decode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
-------------------------------------------------------

How to for my reference :


type test_slowies_seven.bat
@REM test each one individially :
spawnm -n external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven\* -cr:\seven_csvs\@f.csv
@REM test as a set :
external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven

dele r:\compressorspeeds.*
@REM testproj compressorspeedchart
spawnm c:\src\testproj\x64\debug\TestProj.exe r:\seven_csvs\*.csv
ed r:\compressorspeeds.*

(* = I use code or libs to test speeds, never exes; I always measure speed memory->memory, single threaded, with cold caches)

old rants