5/27/2016

PS4 Battle : MiniZ vs Zlib-NG vs ZStd vs Brotli vs Oodle

(see charts at the bottom)

Everything was run at max compression options: level 99, max dict size. All libs are the latest on github, downloaded today. Zlib-NG has the arch/x86 stuff enabled. PS4 is AMD Jaguar, x64.

I'm going to omit encode speeds from the per-file results for simplicity; these are pretty representative :


aow3_skin_giants.clb :
zlib-ng encode   : 2.699 seconds, 1.65 b/kc, rate= 2.63 mb/s
miniz encode     : 2.950 seconds, 1.51 b/kc, rate= 2.41 mb/s
zstd encode      : 5.464 seconds, 0.82 b/kc, rate= 1.30 mb/s
brotli-9  encode    : 23.110 seconds, 0.19 b/kc, rate= 307.44 kb/s
brotli-10 encode    : 68.072 seconds, 0.07 b/kc, rate= 104.38 kb/s
brotli-11 encode    : 79.844 seconds, 0.06 b/kc, rate= 88.99 kb/s

Results :

PS4 clang-3.5.0

-------------

lzt99 :

MiniZ : 24,700,820 ->13,120,668 =  4.249 bpb =  1.883 to 1
miniz_decompress_time : 0.292 seconds, 53.15 b/kc, rate= 84.71 mb/s

zlib-ng : 24,700,820 ->13,158,385 =  4.262 bpb =  1.877 to 1
zlib_ng_decompress_time : 0.226 seconds, 68.58 b/kc, rate= 109.30 mb/s

ZStd : 24,700,820 ->10,403,228 =  3.369 bpb =  2.374 to 1
zstd_decompress_time : 0.184 seconds, 84.12 b/kc, rate= 134.07 mb/s

Brotli-9 : 24,700,820 ->10,473,560 =  3.392 bpb =  2.358 to 1
brotli_decompress_time : 0.259 seconds, 59.83 b/kc, rate= 95.36 mb/s

Brotli-10 : 24,700,820 -> 9,949,740 =  3.222 bpb =  2.483 to 1
brotli_decompress_time : 0.319 seconds, 48.54 b/kc, rate= 77.36 mb/s

Brotli-11 : 24,700,820 -> 9,833,023 =  3.185 bpb =  2.512 to 1
brotli_decompress_time : 0.317 seconds, 48.84 b/kc, rate= 77.84 mb/s

Oodle Kraken -zl4 : 24,700,820 ->10,326,584 =  3.345 bpb =  2.392 to 1
encode only      : 4.139 seconds, 3.74 b/kc, rate= 5.97 mb/s
decode only      : 0.073 seconds, 211.30 b/kc, rate= 336.76 mb/s

Oodle Kraken -zl6 : 24,700,820 ->10,011,486 =  3.242 bpb =  2.467 to 1
decode           : 0.074 seconds, 208.83 b/kc, rate= 332.82 mb/s

Oodle Kraken -zl7 : 24,700,820 -> 9,773,112 =  3.165 bpb =  2.527 to 1
decode           : 0.079 seconds, 196.70 b/kc, rate= 313.49 mb/s

Oodle LZNA : lzt99 : 24,700,820 -> 9,068,880 =  2.937 bpb =  2.724 to 1
decode           : 0.643 seconds, 24.12 b/kc, rate= 38.44 mb/s

-------------

normals.bc1 :

miniz :   524,316 ->   291,697 =  4.451 bpb =  1.797 to 1
miniz_decompress_time : 0.008 seconds, 39.86 b/kc, rate= 63.53 mb/s

zlib-ng :   524,316 ->   292,541 =  4.464 bpb =  1.792 to 1
zlib_ng_decompress_time : 0.007 seconds, 47.32 b/kc, rate= 75.41 mb/s

zstd :   524,316 ->   273,642 =  4.175 bpb =  1.916 to 1
zstd_decompress_time : 0.007 seconds, 49.64 b/kc, rate= 79.13 mb/s

brotli-9 :   524,316 ->   289,778 =  4.421 bpb =  1.809 to 1
brotli_decompress_time : 0.010 seconds, 31.70 b/kc, rate= 50.52 mb/s

brotli-10 :   524,316 ->   259,772 =  3.964 bpb =  2.018 to 1
brotli_decompress_time : 0.011 seconds, 28.65 b/kc, rate= 45.66 mb/s

brotli-11 :   524,316 ->   253,625 =  3.870 bpb =  2.067 to 1
brotli_decompress_time : 0.011 seconds, 29.74 b/kc, rate= 47.41 mb/s

Oodle Kraken -zl6 :    524,316 ->   247,217 =  3.772 bpb =  2.121 to 1
decode           : 0.002 seconds, 135.52 b/kc, rate= 215.95 mb/s

Oodle Kraken -zl7 :    524,316 ->   238,844 =  3.644 bpb =  2.195 to 1
decode           : 0.003 seconds, 123.96 b/kc, rate= 197.56 mb/s

Oodle BitKnit :    524,316 ->   225,884 =  3.447 bpb =  2.321 to 1
decode only      : 0.010 seconds, 31.67 b/kc, rate= 50.47 mb/s

-------------

lightmap.bc3 :

miniz :  4,194,332 ->   590,448 =  1.126 bpb =  7.104 to 1 
miniz_decompress_time : 0.025 seconds, 105.14 b/kc, rate= 167.57 mb/s

zlib-ng : 4,194,332 ->   584,107 =  1.114 bpb =  7.181 to 1
zlib_ng_decompress_time : 0.019 seconds, 137.77 b/kc, rate= 219.56 mb/s

zstd :  4,194,332 ->   417,672 =  0.797 bpb = 10.042 to 1 
zstd_decompress_time : 0.014 seconds, 182.53 b/kc, rate= 290.91 mb/s

brotli-9 : 4,194,332 ->   499,120 =  0.952 bpb =  8.403 to 1 
brotli_decompress_time : 0.022 seconds, 118.64 b/kc, rate= 189.09 mb/s

brotli-10 : 4,194,332 ->   409,907 =  0.782 bpb = 10.232 to 1 
brotli_decompress_time : 0.021 seconds, 125.20 b/kc, rate= 199.54 mb/s

brotli-11 : 4,194,332 ->   391,576 =  0.747 bpb = 10.711 to 1 
brotli_decompress_time : 0.021 seconds, 127.12 b/kc, rate= 202.61 mb/s

Oodle Kraken -zl6 :   4,194,332 ->   428,737 =  0.818 bpb =  9.783 to 1 
decode           : 0.009 seconds, 308.45 b/kc, rate= 491.60 mb/s

Oodle BitKnit :   4,194,332 ->   416,208 =  0.794 bpb = 10.077 to 1
decode only      : 0.021 seconds, 122.59 b/kc, rate= 195.39 mb/s

Oodle LZNA :  4,194,332 ->   356,313 =  0.680 bpb = 11.771 to 1 
decode           : 0.033 seconds, 79.51 b/kc, rate= 126.71 mb/s

----------------

aow3_skin_giants.clb

Miniz : 7,105,158 -> 3,231,469 =  3.638 bpb =  2.199 to 1
miniz_decompress_time : 0.070 seconds, 63.80 b/kc, rate= 101.69 mb/s

zlib-ng : 7,105,158 -> 3,220,291 =  3.626 bpb =  2.206 to 1
zlib_ng_decompress_time : 0.056 seconds, 80.14 b/kc, rate= 127.71 mb/s

Zstd : 7,105,158 -> 2,700,034 =  3.040 bpb =  2.632 to 1
zstd_decompress_time : 0.050 seconds, 88.69 b/kc, rate= 141.35 mb/s

brotli-9 :  7,105,158 -> 2,671,237 =  3.008 bpb =  2.660 to 1
brotli_decompress_time : 0.080 seconds, 55.84 b/kc, rate= 89.00 mb/s

brotli-10 : 7,105,158 -> 2,518,315 =  2.835 bpb =  2.821 to 1
brotli_decompress_time : 0.098 seconds, 45.54 b/kc, rate= 72.58 mb/s

brotli-11 : 7,105,158 -> 2,482,511 =  2.795 bpb =  2.862 to 1
brotli_decompress_time : 0.097 seconds, 45.84 b/kc, rate= 73.05 mb/s

Oodle Kraken -zl6 : aow3_skin_giants.clb :  7,105,158 -> 2,638,490 =  2.971 bpb =  2.693 to 1
decode           : 0.023 seconds, 195.25 b/kc, rate= 311.19 mb/s

Oodle BitKnit : 7,105,158 -> 2,623,466 =  2.954 bpb =  2.708 to 1
decode only      : 0.095 seconds, 47.11 b/kc, rate= 75.08 mb/s

Oodle LZNA : aow3_skin_giants.clb :  7,105,158 -> 2,394,871 =  2.696 bpb =  2.967 to 1
decode           : 0.170 seconds, 26.26 b/kc, rate= 41.85 mb/s

--------------------

silesia_mozilla

MiniZ : 51,220,480 ->19,141,389 =  2.990 bpb =  2.676 to 1
miniz_decompress_time : 0.571 seconds, 56.24 b/kc, rate= 89.63 mb/s

zlib-ng : 51,220,480 ->19,242,520 =  3.005 bpb =  2.662 to 1
zlib_ng_decompress_time : 0.457 seconds, 70.31 b/kc, rate= 112.05 mb/s

zstd : malloc failed

brotli-9 : 51,220,480 ->15,829,463 =  2.472 bpb =  3.236 to 1
brotli_decompress_time : 0.516 seconds, 62.27 b/kc, rate= 99.24 mb/s

brotli-10 : 51,220,480 ->14,434,253 =  2.254 bpb =  3.549 to 1
brotli_decompress_time : 0.618 seconds, 52.00 b/kc, rate= 82.88 mb/s

brotli-11 : 51,220,480 ->14,225,511 =  2.222 bpb =  3.601 to 1
brotli_decompress_time : 0.610 seconds, 52.72 b/kc, rate= 84.02 mb/s

Oodle Kraken -zl6 : 51,220,480 ->14,330,298 =  2.238 bpb =  3.574 to 1
decode           : 0.200 seconds, 160.51 b/kc, rate= 255.82 mb/s

Oodle Kraken -zl7 : 51,220,480 ->14,222,802 =  2.221 bpb =  3.601 to 1
decode           : 0.201 seconds, 160.04 b/kc, rate= 255.07 mb/s

Oodle LZNA : silesia_mozilla : 51,220,480 ->13,294,622 =  2.076 bpb =  3.853 to 1
decode           : 1.022 seconds, 31.44 b/kc, rate= 50.11 mb/s

I tossed in tests of BitKnit & LZNA in some cases after I realized that the Brotli decode speeds are more comparable to BitKnit than Kraken, and even LZNA isn't that far off (usually less than a factor of 2). eg. you could do half your files in LZNA and half in Kraken and that would be about the same total time as doing them all in Brotli.


Here are charts of the above data :

(silesia_mozilla omitted due to lack of zstd results)

(I'm trying an experiment and showing inverted scales, which are more proportional to what you care about. I'm showing seconds per gigabyte, and output size as a percent of input size, which are proportional to *time* not speed, and *size* not ratio. So, lower is better.)

log-log speed & ratio :

Time and size are just way better scales. Looking at "speed" and "ratio" can be very misleading, because big differences in speed at the high end (eg. 2000 mb/s vs 2200 mb/s) don't translate into a very big time difference, and *time* is what you care about. On the other hand, small differences in speed at the low end *are* important (eg. 30 mb/s vs 40 mb/s), because those mean a big difference in time.

I've been doing mostly "speed" and "ratio" because it reads better to the novice (higher is better! I want the one with the biggest bar!), but it's so misleading that I think going to time & size is worth it.

5/17/2016

The Weissman Score

Wikipedia suggests the Weissman score should be

W = alpha * ( r / r_bar ) * ( log(T_bar) / log(T) )

(r = compression ratio, T = time to compress, and the barred values are those of a standard reference compressor), which ignoring constants is just W = r/log(T)

That's just wrong. You don't take a logarithm of something with units. But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a logarithm of time should be involved. Just not like that.

I present a formula which I call the correct Weissman Score :


W = comp_ratio * log10( 1 + speed/(disk_speed_lo * comp_ratio) )  -
    comp_ratio * log10( 1 + speed/(disk_speed_hi * comp_ratio) )

or

W = comp_ratio * log10( ( comp_ratio + speed/disk_speed_lo ) / ( comp_ratio + speed/disk_speed_hi ) )

You can have a Weissman score for encode speed or decode speed. It's a measure of space-speed tradeoff success.

I suggest the range should be 1-256. disk_speed_lo = 1 MB/s (to evaluate performance on very slow channels, favoring small files), disk_speed_hi = 256 MB/s (to evaluate performance on very fast disks, favoring speed). And because 1 and 256 are amongst programmers' favorite numbers.

You could also just let the hi range go to infinity. Then you don't need a hi disk speed parameter and you get :


Weissman-infinity = comp_ratio * log10( 1 + speed/(disk_speed_lo *comp_ratio) )

with disk_speed_lo = 1 MB/s, which is neater, though this favors fast compressors more than you might like. While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses the score more on the area that most people care about.
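
Computing these is trivial; here's a quick sketch (my own throwaway code, straight from the formulas above, with made-up example numbers just to show the calls) :

    // corrected Weissman score and Weissman-infinity
    // comp_ratio = raw_size / comp_size ; speed and the disk speeds are in the same units (MB/s)
    #include <cmath>
    #include <cstdio>

    double weissman(double comp_ratio, double speed, double disk_speed_lo, double disk_speed_hi)
    {
        return comp_ratio * std::log10( ( comp_ratio + speed/disk_speed_lo ) /
                                        ( comp_ratio + speed/disk_speed_hi ) );
    }

    double weissman_infinity(double comp_ratio, double speed, double disk_speed_lo)
    {
        return comp_ratio * std::log10( 1.0 + speed/(disk_speed_lo*comp_ratio) );
    }

    int main()
    {
        // made-up numbers, not any real compressor :
        printf("W(1-256) = %f\n", weissman(2.5, 900.0, 1.0, 256.0));
        printf("W(1-inf) = %f\n", weissman_infinity(2.5, 900.0, 1.0));
        return 0;
    }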

I came up with this formula because I started thinking about summarizing a score from the Pareto charts I've made. What if you took the speedup value at several (log-scale) disk speeds; like you could take the speedup at 1 MB/s, 2 MB/s, 4 MB/s, and just average them? speedup is a good way to measure a compressor even if you don't actually care about speed. Well, rather than just average a bunch of points, what if I average *all* points? eg. integrate to get the area under the curve? Note that we're integrating in log-scale of disk speed.

Turns out you can just do that integral :

    speedup = (time to load uncompressed) / (time to load compressed + decompress)
    speedup = (raw_size/disk_speed) / (comp_size/disk_speed + raw_size/ decompress_speed)
    speedup = (1/disk_speed) / (1/(disk_speed*compression_ratio) + 1 / decompress_speed)
    speedup = 1 / (1/compression_ratio + disk_speed / decompress_speed)
    speedup = 1 / (1/compression_ratio + exp( log_disk_speed ) / decompress_speed)
    speedup = compression_ratio / (1 + exp( log_disk_speed ) * compression_ratio/decompress_speed)
    speedup = compression_ratio * 1 / (1 + exp( log_disk_speed + log(compression_ratio/decompress_speed)))

speedup is a sigmoid :

    y = 1 / (1 + e^-x ) 
    
    Integral{y} = ln( 1 + e^x )

    x = - ( log_disk_speed + log(compression_ratio/decompress_speed) )

so substitute some variables and you get the above formula for the Weissman score.

In the final formula, I changed from natural log to log-base-10, which is just a constant scaling factor.
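
If you want to sanity-check the derivation, you can numerically integrate the speedup over log10 of disk speed and compare it to the closed form; a little throwaway sketch (again with made-up example numbers) :

    #include <cmath>
    #include <cstdio>

    // speedup as defined above
    static double speedup(double ratio, double decode_speed, double disk_speed)
    {
        return 1.0 / ( 1.0/ratio + disk_speed/decode_speed );
    }

    int main()
    {
        double R = 2.5, S = 900.0, lo = 1.0, hi = 256.0; // made-up compressor

        // numerically integrate speedup over log10(disk_speed) on [lo,hi] :
        const int N = 100000;
        double dlog = ( std::log10(hi) - std::log10(lo) ) / N;
        double sum = 0;
        for (int i = 0; i < N; i++)
        {
            double d = std::pow( 10.0, std::log10(lo) + (i + 0.5)*dlog );
            sum += speedup(R, S, d) * dlog;
        }

        // closed-form score from above :
        double W = R * std::log10( ( R + S/lo ) / ( R + S/hi ) );

        printf("integral = %f , closed form = %f\n", sum, W);
        return 0;
    }

The two numbers agree, which is the whole point : the score is just the area under the speedup curve plotted against log disk speed.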

The Weissman scores (decode speed on a Core i7-3770 3.4 GHz; 1-256 range) on Silesia are :

lz4hc    : 6.243931
zstdmax  : 7.520236
lzham    : 6.924379
lzma     : 5.460073
zlib9    : 5.198510
Kraken   : 8.431461

Weissman-infinity scores are :

lz4hc    : 7.983104
zstdmax  : 8.168544
lzham    : 7.277707
lzma     : 5.589155
zlib9    : 5.630476
Kraken   : 9.551152

Goal : beat 10.0 !


ADD : this post was half-joking. But I actually think it's useful. I find it useful, anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can be hard to tell whether the tradeoff was good. You can do things like plot all your options on a space-speed graph and try to guess the Pareto frontier and take those. But when iterating an optimization of a parameter, you want just a simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for: size-dominant slower compressors should use Weissman 1-256; for a balance of space and super speed use Weissman 1-inf (or 40-800); for the fast domain (LZ4-ish) use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed parameters) such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted. The trick is choosing what that lo-hi range is (it doesn't necessarily directly correspond to actual disk or channel speeds; there are other factors to consider, like latency, storage usage, etc., that might cause you to bias the lo-hi away from the actual channel speeds in some way; for example, high speed decoders should always set the upper speed to infinity, which corresponds to the use case where the compressed data might already be resident in RAM, so it has zero time to load).

5/15/2016

PS4 Battle : LZ4 vs LZSSE vs Oodle

PS4 is the arena. Three compressors enter. (revised 05-16)

Advantages of the PS4 : consistent, well-defined hardware for reproducible testing; a slow platform that game developers care about being fast on; and it builds with clang, which is the target of choice for some compression libraries that don't build so easily on MSVC.

After the initial version of this post, I went and fuzz-safed LZB16, so they're all directly comparable. To compare apples, look at LZ4_decompress_safe. I also include LZ4_decompress_fast for reference.

LZ4_decompress_safe : fuzz safe (*)
LZ4_decompress_fast : NOT fuzz safe
LZSSE8_Decode : fuzz safe
Oodle LZB16 : fuzz safe

LZSSE and Oodle both use multiple copies of the core loop to minimize the pain of fuzz-safing. LZ4's code is much simpler; it doesn't do a ton of macro or .inl nastiness. (* = this is the only one that I would actually trust to put in a satellite or a bank or something critical; it's just so simple that it's way easier to be sure it's correct.)

Conclusion :

Comparing the two safe open-source options, LZ4_safe vs. LZSSE8 : LZSSE8 is pretty consistently faster than LZ4_safe on PS4 (though the difference is small). PS4 is a better platform for LZSSE than x64 is (PS4 is actually a pretty bad platform for LZ4; there are a variety of issues; see the GDC "Taming the Jaguar" slides; the particular issues for LZ4 are the front-end bottleneck and cache latency). When I tested on x64 before, it was much more mixed; sometimes LZ4 was faster.

I was surprised to find that Oodle LZB16 is quite a lot faster than LZ4 on PS4. (For example, that's not the case on Windows/x64; it's much closer there.) I've never run third party codecs on PS4 before. I suppose this reflects the per-platform tweaking that we spend so much time on, and I'm sure LZ4 would catch up with some time spent fiddling with the PS4 codegen.

The compression ratios are mostly close enough to not care, though LZSSE8 does a bit worse on some of the DXTC/BCn files (lightmap.bc3 and d.dds).

On some files I include Oodle LZBLW numbers (LZB-bytewise-large-window). Sometimes Oodle LZBLW is a pretty big free compression win at about the same speed. Sometimes it gets a worse ratio, sometimes much worse speed. If I were a client using this, I might try LZBLW and drop down to LZB16 any time it's not a good tradeoff.

Full data :

REMINDER : LZ4_decompress_fast is not directly comparable to the others, it's not fuzz safe, the others are!


PS4 clang-3.6.1

------------------

lzt99 :

lz4hc : 24,700,820 ->14,801,510 =  4.794 bpb =  1.669 to 1
LZ4_decompress_safe_time : 0.035 seconds, 440.51 b/kc, rate= 702.09 mb/s
LZ4_decompress_fast_time : 0.032 seconds, 483.55 b/kc, rate= 770.67 mb/s

LZSSE8  : 24,700,820 ->15,190,395 =  4.920 bpb =  1.626 to 1
LZSSE8_Decode_Time : 0.033 seconds, 467.32 b/kc, rate= 744.81 mb/s

Oodle LZB16 : lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s

Oodle LZBLW : lzt99 : 24,700,820 ->13,349,800 =  4.324 bpb =  1.850 to 1
decode           : 0.033 seconds, 470.39 b/kc, rate= 749.74 mb/s

------------------

texture.bc1

lz4hc : 2,188,524 -> 2,068,268 =  7.560 bpb =  1.058 to 1
LZ4_decompress_safe_time : 0.004 seconds, 322.97 b/kc, rate= 514.95 mb/s
LZ4_decompress_fast_time : 0.004 seconds, 353.08 b/kc, rate= 562.89 mb/s

LZSSE8 : 2,188,524 -> 2,111,182 =  7.717 bpb =  1.037 to 1
LZSSE8_Decode_Time : 0.004 seconds, 360.21 b/kc, rate= 574.42 mb/s

Oodle LZB16 : texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s

------------------

lightmap.bc3

lz4hc : 4,194,332 ->   632,974 =  1.207 bpb =  6.626 to 1
LZ4_decompress_safe_time : 0.005 seconds, 521.54 b/kc, rate= 831.38 mb/s
LZ4_decompress_fast_time : 0.005 seconds, 564.63 b/kc, rate= 900.46 mb/s

LZSSE8 encode    :  4,194,332 ->   684,062 =  1.305 bpb =  6.132 to 1
LZSSE8_Decode_Time : 0.005 seconds, 551.85 b/kc, rate= 879.87 mb/s

Oodle LZB16 : lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s

------------------

silesia_mozilla

lz4hc : 51,220,480 ->22,062,995 =  3.446 bpb =  2.322 to 1
LZ4_decompress_safe_time : 0.083 seconds, 385.47 b/kc, rate= 614.35 mb/s
LZ4_decompress_fast_time : 0.075 seconds, 427.14 b/kc, rate= 680.75 mb/s

LZSSE8 : 51,220,480 ->22,148,366 =  3.459 bpb =  2.313 to 1
LZSSE8_Decode_Time : 0.070 seconds, 461.53 b/kc, rate= 735.59 mb/s

Oodle LZB16 : silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s

Oodle LZBLW : silesia_mozilla : 51,220,480 ->20,881,772 =  3.261 bpb =  2.453 to 1
decode           : 0.112 seconds, 285.68 b/kc, rate= 455.30 mb/s

------------------

breton.dds

lz4hc :   589,952 ->   116,447 =  1.579 bpb =  5.066 to 1
LZ4_decompress_safe_time : 0.001 seconds, 568.65 b/kc, rate= 906.22 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 624.81 b/kc, rate= 996.54 mb/s

LZSSE8 encode    :  589,952 ->   119,659 =  1.623 bpb =  4.930 to 1
LZSSE8_Decode_Time : 0.001 seconds, 604.14 b/kc, rate= 962.40 mb/s

Oodle LZB16 : breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s

Oodle LZBLW : breton.dds :    589,952 ->   132,934 =  1.803 bpb =  4.438 to 1
decode           : 0.001 seconds, 396.04 b/kc, rate= 630.96 mb/s

------------------

d.dds

lz4hc encode     :  1,048,704 ->   656,706 =  5.010 bpb =  1.597 to 1
LZ4_decompress_safe_time : 0.001 seconds, 554.69 b/kc, rate= 884.24 mb/s
LZ4_decompress_fast_time : 0.001 seconds, 587.20 b/kc, rate= 936.34 mb/s

LZSSE8 encode    :  1,048,704 ->   695,583 =  5.306 bpb =  1.508 to 1
LZSSE8_Decode_Time : 0.001 seconds, 551.13 b/kc, rate= 879.05 mb/s

Oodle LZB16 : d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s

------------------

all_dds

lz4hc : 79,993,099 ->47,848,680 =  4.785 bpb =  1.672 to 1
LZ4_decompress_safe_time : 0.158 seconds, 316.67 b/kc, rate= 504.70 mb/s
LZ4_decompress_fast_time : 0.143 seconds, 350.66 b/kc, rate= 558.87 mb/s

LZSSE8 : 79,993,099 ->47,807,041 =  4.781 bpb =  1.673 to 1
LZSSE8_Decode_Time : 0.140 seconds, 358.61 b/kc, rate= 571.54 mb/s

Oodle LZB16 : all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s

----------

baby_robot_shell.gr2

lz4hc : 58,788,904 ->32,998,567 =  4.490 bpb =  1.782 to 1
LZ4_decompress_safe_time : 0.090 seconds, 412.04 b/kc, rate= 656.71 mb/s
LZ4_decompress_fast_time : 0.080 seconds, 460.55 b/kc, rate= 734.01 mb/s

LZSSE8 : 58,788,904 ->33,201,406 =  4.518 bpb =  1.771 to 1
LZSSE8_Decode_Time : 0.076 seconds, 485.14 b/kc, rate= 773.20 mb/s

Oodle LZB16 : baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

Oodle LZBLW : baby_robot_shell.gr2 : 58,788,904 ->30,207,635 =  4.111 bpb =  1.946 to 1
decode           : 0.090 seconds, 409.88 b/kc, rate= 653.26 mb/s


After posting the original version with non-fuzz-safe LZB16, I decided to just go and do the fuzz-safing for LZB16.


LZB16, PS4 clang-3.6.1

post-fuzz-safing :

    lzt99 : 24,700,820 ->14,754,643 =  4.779 bpb =  1.674 to 1 
    decode           : 0.027 seconds, 564.72 b/kc, rate= 900.08 mb/s
    
    texture.bc1 :  2,188,524 -> 2,068,823 =  7.562 bpb =  1.058 to 1 
    decode           : 0.004 seconds, 368.67 b/kc, rate= 587.84 mb/s
    
    lightmap.bc3 :  4,194,332 ->   630,794 =  1.203 bpb =  6.649 to 1 
    decode           : 0.005 seconds, 525.10 b/kc, rate= 837.19 mb/s
    
    silesia_mozilla : 51,220,480 ->22,022,002 =  3.440 bpb =  2.326 to 1 
    decode           : 0.065 seconds, 492.03 b/kc, rate= 784.19 mb/s
    
    breton.dds :    589,952 ->   113,578 =  1.540 bpb =  5.194 to 1 
    decode           : 0.001 seconds, 627.56 b/kc, rate= 1001.62 mb/s
    
    d.dds :  1,048,704 ->   654,014 =  4.989 bpb =  1.603 to 1 
    decode           : 0.001 seconds, 537.78 b/kc, rate= 857.48 mb/s
    
    all_dds : 79,993,099 ->47,683,003 =  4.769 bpb =  1.678 to 1 
    decode           : 0.113 seconds, 444.38 b/kc, rate= 708.24 mb/s
    
    baby_robot_shell.gr2 : 58,788,904 ->32,862,033 =  4.472 bpb =  1.789 to 1
    decode           : 0.070 seconds, 530.45 b/kc, rate= 845.42 mb/s

pre-fuzz-safing reference :

    lzt99 912.92 mb/s
    texture.bc1 598.61 mb/s
    lightmap.bc3 876.19 mb/s
    silesia_mozilla 794.72 mb/s
    breton.dds 1078.52 mb/s
    d.dds 888.73 mb/s
    all_dds 701.45 mb/s
    baby_robot_shell.gr2 877.81 mb/s

Mild speed penalty on most files.

5/12/2016

Oodle Kraken Thread-Phased Decoding

Oodle Kraken is already by far the fastest-to-decode high-compression compressor in the world (that's a mouthful!). But it's got a magic trick that makes it even faster :

Oodle Kraken can decode its normal compressed data on multiple threads.

This is different than what a lot of compressors do (and what Oodle has done in the past), which is to split the data into independent chunks so that each chunk can be decompressed on its own thread. All compressors can do that in theory, Oodle makes it easy in practice with the "seek chunk" decodes. But that requires special encoding that does the chunking, and it hurts compression ratio by breaking up where matches can be found.

The Oodle Kraken threaded decode is different. To distinguish it I call it "Thread-Phased" decode. It runs on normal compressed data - no special encoding flags are needed. It has no compressed size penalty because it's the same normal single-thread compressed data.

The Oodle Kraken Thread-Phased decode gets most of its benefit with just 2 threads (if you like, the calling thread + 1 more). The exact speedup varies by file, usually in the 1.4X - 1.9X range. The results here are all for 2-thread decode.

For example on win81, 2-thread Oodle Kraken is 1.7X faster than 1-thread : (with some other compressors for reference)


win81 :

Kraken 2-thread  : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.075 seconds, 410.98 b/kc, rate= 1398.55 M/s

Kraken           : 104,857,600 ->37,898,868 =  2.891 bpb =  2.767 to 1 
decode           : 0.127 seconds, 243.06 b/kc, rate= 827.13 M/s

zstdmax          : 104,857,600 ->39,768,086 =  3.034 bpb =  2.637 to 1 
decode           : 0.251 seconds, 122.80 b/kc, rate= 417.88 M/s

lzham            : 104,857,600 ->37,856,839 =  2.888 bpb =  2.770 to 1 
decode           : 0.595 seconds, 51.80 b/kc, rate= 176.27 M/s

lzma             : 104,857,600 ->35,556,039 =  2.713 bpb =  2.949 to 1 
decode           : 2.026 seconds, 15.21 b/kc, rate= 51.76 M/s

Charts on a few files :

Oodle 2.2.0 includes helper functions that will just run a Thread-Phased decode for you on Oodle's own thread system, as well as example code that runs the entire Thread-Phased decode client-side so you can do it on your own threads however you like.

Performance on the Silesia set for reference :


Silesia total :

Oodle Kraken -z6 : 211,938,580 ->51,857,427 =  1.957 bpb =  4.087 to 1

single threaded decode   : 0.232 seconds, 268.43 b/kc, rate= 913.46 M/s

two threaded decode      : 0.158 seconds, 394.55 b/kc, rate= 1342.64 M/s

Note that because the Kraken Thread-Phased decode is a true threaded decode of individual compressed buffers, it is a *latency* reduction for decoding individual blocks, not just a *throughput* improvement. For example, if you were really decoding the whole Silesia set, you might just run the decompression of each file on its own thread. That is a good thing to do, and it would give you a near 2X speedup (with two threads). But that's a different kind of threading - that gives you a throughput improvement of 2X, but the latency to decode any individual file is not improved at all. Kraken Thread-Phased decode reduces the latency of each independent decode, and of course it can also be used with chunking or multiple-file decoding to get further speedups.

Oodle 2.2.0 Kraken Optimal Parse improvements

Oodle 2.2.0 is about to ship, with some improvements to the Kraken optimal parse compression ratios. Compressed size is improved by around 1%. Speed is approximately the same at -z6 (previous max level for Kraken) but there's a new -z7 mode that's slightly slower and even higher compression.

I think we'll continue to find improvements in the optimal parsers over the coming months (optimal parsing is hard!) which should lead to some more tiny gains in the compression ratio in the slow encoder modes.


Silesia , sum of all files

uncompressed : 211,938,580

Kraken 2.1.5 -z6 : 52,366,897
Kraken 2.2.0 -z6 : 51,857,427
Kraken 2.2.0 -z7 : 51,625,488

Oodle Kraken 2.1.5 topped out at -z6 (Optimal2). There's a new -z7 (Optimal3) mode which gets a bit more compression at the cost of a bit of speed, which is why it's on a separate option instead of just part of -z6.

Results on some individual files (Kraken 220 is -z7) :

-------------------------------------------------------
"silesia_mozilla"

by ratio:
lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s
zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s

-------------------------------------------------------
"lzt99"

by ratio:
lzma        :  2.65:1 ,    3.1 enc mb/s ,   42.3 dec mb/s
Kraken 220  :  2.53:1 ,    2.0 enc mb/s ,  912.0 dec mb/s
Kraken 215  :  2.46:1 ,    2.3 enc mb/s ,  957.1 dec mb/s
lzham       :  2.44:1 ,    1.9 enc mb/s ,  166.0 dec mb/s
zstdmax     :  2.27:1 ,    3.8 enc mb/s ,  482.3 dec mb/s
zlib9       :  1.77:1 ,   13.3 enc mb/s ,  286.2 dec mb/s
lz4hc       :  1.67:1 ,   30.3 enc mb/s , 2737.4 dec mb/s

-------------------------------------------------------
"all_dds"

by ratio:
lzma        :  2.37:1 ,    2.1 enc mb/s ,   40.8 dec mb/s
Kraken 220  :  2.23:1 ,    1.0 enc mb/s ,  650.6 dec mb/s
Kraken 215  :  2.18:1 ,    1.0 enc mb/s ,  684.6 dec mb/s
lzham       :  2.17:1 ,    1.3 enc mb/s ,  127.7 dec mb/s
zstdmax     :  2.02:1 ,    3.3 enc mb/s ,  289.4 dec mb/s
zlib9       :  1.83:1 ,   13.3 enc mb/s ,  242.9 dec mb/s
lz4hc       :  1.67:1 ,   20.4 enc mb/s , 2226.9 dec mb/s

-------------------------------------------------------
"baby_robot_shell.gr2"

by ratio:
lzma        :  4.35:1 ,    3.1 enc mb/s ,   59.3 dec mb/s
Kraken 220  :  3.82:1 ,    1.4 enc mb/s ,  837.2 dec mb/s
Kraken 215  :  3.77:1 ,    1.5 enc mb/s ,  878.3 dec mb/s
lzham       :  3.77:1 ,    1.6 enc mb/s ,  162.5 dec mb/s
zstdmax     :  2.77:1 ,    5.7 enc mb/s ,  405.7 dec mb/s
zlib9       :  2.19:1 ,   13.9 enc mb/s ,  332.9 dec mb/s
lz4hc       :  1.78:1 ,   40.1 enc mb/s , 2364.4 dec mb

-------------------------------------------------------
"win81"

by ratio:
lzma        :  2.95:1 ,    2.5 enc mb/s ,   51.9 dec mb/s
lzham       :  2.77:1 ,    1.6 enc mb/s ,  177.6 dec mb/s
Kraken 220  :  2.77:1 ,    1.0 enc mb/s ,  818.0 dec mb/s
Kraken 215  :  2.70:1 ,    1.0 enc mb/s ,  877.0 dec mb/s
zstdmax     :  2.64:1 ,    3.5 enc mb/s ,  417.8 dec mb/s
zlib9       :  2.07:1 ,   16.8 enc mb/s ,  269.6 dec mb/s
lz4hc       :  1.91:1 ,   28.8 enc mb/s , 2297.6 dec mb/s

-------------------------------------------------------
"enwik7"

by ratio:
lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s

-------------------------------------------------------
You can see that encode & decode speed is slightly worse at level -z7, and compression ratio is improved. (Most of the other compression levels have roughly the same decode speed; -z7 enables some special options that can hurt decode speed a bit.) Of course even at -z7 Kraken is way faster than anything else comparable!

Tips for benchmarking a compressor

You're about to evaluate Oodle (thanks for having a look!) or some other compressor. Before you start, consider these tips :

1. Time only the compressor.

Place your time measurements only around the compressor. Not IO, not your parsing, not mallocs, just the compress or decompress calls. I understand that in the end what you care about is total time to load, but there can be a lot of issues there that need fixing, and they can cloud the comparison of just the compression part. eg. if your parsing is really slow, that will dominate the CPU time and hide the differences between the compressors.

2. Time what you actually care about.

If you care about decode time, time the decompression. If you care about encode time, time compression. If you care about round-trip time, add the two times. Compressors are not just "fast" or "slow" at both ends, you can't time encoding and decide that it's a fast or slow compressor if what you care about is decoding.

3. Choose the right options.

Most compressors have the ability to target slightly different use cases. The most common option is the ability to trade off encode time vs. compression ratio. So, if what you care about is smallest size, then run the compressor at its highest encode effort level. It can be tricky to get the options right in most compression libraries; we are woefully non-standardized and not well documented. Aside from the simple "level" parameter, there may be other options that are relevant to your goals, perhaps trading off decompressor memory usage, or decompression speed. With Oodle the best option is always to email us and ask what options will best suit your goals.

4. Run apples-to-apples (threads-to-threads) comparisons.

It can be tricky to compare compressors fairly. As much as possible they should be run in the same way, and they should be run in the way that you will actually use them in your final application. Don't profile them with threads if you will not use them threaded in your shipping application.

Threads are a common problem. Compressors should either be tested all threaded (if you will use threads in your final application), or all non-threaded. Unfortunately the defaults are not the same. "lzma" (7z) and LZHAM create threads by default. You have to change their options to tell them to *not* create threads. The normal Oodle_Compress calls will not use threads by default, you have to specifically call one of the _Async threaded routines. (my personal preference is to benchmark everything without threads to compare single-threaded performance, and you can always add threads for production use)

5. Take the MIN of N run times.

To get reliable timing, you need to run the loop many times and take the MIN of all times. The min will give you the time it takes when the OS isn't interrupting you with task switches, the CPU isn't clocking down for speedstep, etc. I usually do 30 runs *per core*, but you can probably get away with a bit less.

6. Wipe the cache.

Assuming you are now doing N loops, you need to invalidate the cache between iterations. If you don't, you will be running the compressor in a "hot cache" scenario, with some buffers already in cache.
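
Here's roughly what tips 5 and 6 look like together (generic illustration code, not Oodle's actual benchmark harness; the decompress call is just a stand-in for whatever you're timing) :

    #include <algorithm>
    #include <chrono>
    #include <cstring>
    #include <vector>

    // stand-in for whatever call you're actually timing :
    static void decompress_call(const void * comp, size_t comp_size, void * raw)
    {
        std::memcpy(raw, comp, comp_size);
    }

    double time_decompress(const void * comp, size_t comp_size, void * raw, int num_runs)
    {
        // a buffer big enough to push everything out of cache between runs :
        std::vector<char> cache_wiper(64*1024*1024);

        double best = 1e30;
        for (int run = 0; run < num_runs; run++)
        {
            // wipe the cache so every run is a cold-cache run :
            std::memset(cache_wiper.data(), run, cache_wiper.size());

            auto t0 = std::chrono::high_resolution_clock::now();
            decompress_call(comp, comp_size, raw);
            auto t1 = std::chrono::high_resolution_clock::now();

            best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
        }
        // MIN over runs = the time with the least interference from the OS, speedstep, etc.
        return best;
    }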

7. Don't pack a bunch of files together in a tar if that's not how you load.

It may seem like a good way to test: grab your bunch of test files, pack them together in a tar (or zip -0 or similar package), and run the compression tests on that tar. That's a fine option if that's really how you load data in your final application - as one big contiguous chunk that must be loaded in one big blob. But most people don't. You need to test the compressors in the same way they will be used in the final application. If you load a whole file at a time, test the compressors on whole-file units. Many people do loading on some kind of paging unit, like perhaps 1 MB chunks. If you do that, then test the compressor on the same thing.

8. Choose your test set.

If you could test on the entire set of buffers that your final application will load, that would be an accurate test. (Though actually, even that is a bit subtle, since some buffers are more latency sensitive than others; for example, you might care more about the first few things you load, to get into a running application as quickly as possible.) That's probably not practical, so you want to choose a set that is representative of what you will actually load. Don't exclude things like already-compressed files (JPEGs and so on) *if* you will be running them through the compressor. (Though consider *not* running them through your compressed-file loading path at all, in which case you should exclude them from testing.) It's pretty hard to get an accurate representative sample, so it's generally best to just get a variety of files and look at individual per-file results.

9. Look at the spectrum of results, not the sum.

After you run on your test set, don't just add up the compressed sizes and times to make a "total" result. Sums can be misleading. One issue is that a few large incompressible files can hide the differences on the more compressible files. But a bigger and more subtle trap is the way that sums weight the combination of results. A sum is a weighting by the size of each file in the test set. That's fine if your test set is all of your data, or is a perfectly proportionally representative sampling of all of your data (a subset which acts like the whole). But most likely it's not. It's best to keep the results per file separate, have a look at individual cases to see what's going on and how the results differ, and try not to simplify to just looking at the sum.

10. If you do sum, sum *time* not speed, sum *size* not ratio.

Speed (like mb/s) and ratio (raw size/comp size) are inverted measures and shouldn't be summed. What you actually care about is total compressed size and total time to decode. So if you run over a set of files, don't look at "average speed" or "average ratio", because those are inverted measures that will oddly weight the accumulation. Instead accumulate total time to decode, total raw size, and total compressed size, and then if you like you can make an "overall speed" and "overall ratio" from those totals.
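
In code terms the difference is trivial but easy to get wrong; something like this (the per-file fields here are hypothetical) :

    #include <vector>

    struct FileResult { double raw_size, comp_size, decode_seconds; }; // hypothetical

    void overall(const std::vector<FileResult> & files, double * overall_ratio, double * overall_mbps)
    {
        double total_raw = 0, total_comp = 0, total_seconds = 0;
        for (const FileResult & f : files)
        {
            total_raw     += f.raw_size;
            total_comp    += f.comp_size;
            total_seconds += f.decode_seconds;
        }
        // ratio & speed formed from the totals - NOT averages of per-file ratios or speeds :
        *overall_ratio = total_raw / total_comp;
        *overall_mbps  = (total_raw / 1000000.0) / total_seconds;
    }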

11. Try not to malloc in the timing loop.

Your malloc might be fast, it might be slow, it's best to not have that as a variable in the timing. In general try to allocate the memory for the compressor or decompressor outside of the timing loop. (In Oodle this is done by passing in your own pointer for the "decoderMemory" argument of OodleLZ_Decompress). That would be an unfair test if you didn't also do that in the final application - so do it in the final application too! (similarly, make sure there's no logging inside the timing loop).

12. Consider excluding almost-incompressible files.

This is something you should consider for final shipping application, and if you do it in your shipping application, then you should do it for the benchmark too. The most common case is already-compressed files like JPEG images and MP3 audio. These files can usually be compressed slightly, maybe saving 1% of their size, but the time to decode them is not worth it overall - you can get more total size savings by running a more powerful compressor on other files. So it's most efficient to just send them uncompressed.

13. Tiny files should either be excluded or packed together.

There's almost never a use case where you really want to compress tiny files (< 16k bytes or so) as independent units. There's too much per-unit overhead in the compressor, and more importantly there's too much per-unit overhead in IO - you don't want to eat a disk seek to just to get one tiny file. So in a real application tiny files should always be grouped into paging units that are 256k or more, a size where loading them won't just be a total waste of disk seek time. So, when benchmarking compressors you also shouldn't run them on tiny independent files, because you will never do that in a shipping application.

5/08/2016

Order-1 Huffman

This is a simple idea that's rarely written down, so I thought I'd do a quick summary.

To my knowledge I was the first person to write about it (in "New Techniques in Context Modeling and Arithmetic Encoding" (PDF) ) but it's one of those simple ideas that probably a lot of people had and didn't write about (like Deferred Summation). It's also one of those ideas that keeps being forgotten and rediscovered over the years.

(I don't know much about the details of how Brotli does this, it may differ. I'll be talking about how I did it).

(also by "Huffman" I pretty much always mean "static Huffman" where you measure the histogram of a block and transmit the code lengths, not "adaptive Huffman" (modifying codelens per symbol (bleck)) or "deferred summation Huffman" (codelens computed from histogram of previous data with no explicit codelen transmission))

Let's start with just the case of order-1 8-bit literals. So you're coding a current 8-bit symbol with an 8-bit previous symbol as context. You can do this naively by just having 256 tables, one for each 8-bit context. The decoder looks like this :


256 times :
read codelens from file
build huffman decode table

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , huff_table[o1] );

and on a very large file (*) that might be okay.

(* actually it's only okay on a very large file with completely stable statistics, which never happens in practice. In the real world "very large files" don't usually exist; instead they act like a sequence of small/medium files tacked together. That is, you want a decoder that works well on small files, and you want to be able to reset it periodically (re-transmit huffman codelens in this case) so that it can adapt to local statistics).

On small files, it's disastrous. You're sending 256 sets of codelens, which can be a lot of wasted data. Worst of all it's a huge decode time overhead to parse out the codelens and build the decode tables if you're only going to get a few symbols in that context.

So you want to reduce the count of huffman tables. A rough guideline is to make the number of tables proportional to the number of bytes. Maybe 1 table per 1024 bytes is tolerable to you, it depends.

(another goal for reduction might be to get all the huff tables to fit in L2 cache)

So we want to merge the Huffman tables. You want to find pairs of contexts that have the most similar statistics and merge those. If you don't mind the poor encoder-time speed, a good solution is a best-first merge :


for each pair {i,j} (i<j)
merge_cost(i,j) = Huffman_Cost( symbols_i + symbols_j ) - Huffman_Cost( symbols_i ) - Huffman_Cost( symbols_j )

Huffman_Cost( symbols ) = bits to send codelens + bits to encode symbols using those codelens


while # of contexts > target , and/or merge cost < target
pop lowest merge_cost
merge context j onto i
delete all merge costs involving j
recompute all merge costs involving i

If the cost were just entropy (H) instead of Huffman_Cost, then merge_cost would always be >= 0 (coding with separate statistics is never more expensive than coding with combined statistics). But since the Huffman codelen transmission is not free, the first merges will actually reduce the encoded size. So you should always do merges that are free or beneficial, even if the huffman table count is already low enough.

So contexts with similar statistics will get merged together, since coding them with a combined set of codelens either doesn't hurt or hurts only a little (or helps, with the cost of codelen transmission counted). In this way contexts where it wasn't really helping to differentiate them will get reduced.
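
Here's a rough sketch of that greedy merge (not actual shipping code; I'm using Shannon entropy plus a made-up flat header cost as a stand-in for the real Huffman_Cost, which you'd want to compute from actual codelens, and I recompute all pair costs each iteration instead of maintaining a heap) :

    #include <cmath>
    #include <cstdint>
    #include <limits>
    #include <vector>

    static double entropy_bits(const std::vector<uint32_t> & histo)
    {
        double total = 0, bits = 0;
        for (uint32_t c : histo) total += c;
        if (total == 0) return 0;
        for (uint32_t c : histo) if (c) bits -= c * std::log2( c / total );
        return bits;
    }

    // stand-in for Huffman_Cost : entropy + a flat guess at the codelen-transmission cost
    static double table_cost(const std::vector<uint32_t> & histo)
    {
        const double header_bits = 256.0 * 4.0; // made-up header cost
        return entropy_bits(histo) + header_bits;
    }

    // greedy best-first merge of context histograms, as described above
    static void merge_contexts(std::vector< std::vector<uint32_t> > & histos, size_t target_count)
    {
        while (histos.size() > 1)
        {
            double best_delta = std::numeric_limits<double>::max();
            size_t best_i = 0, best_j = 0;

            for (size_t i = 0; i < histos.size(); i++)
            for (size_t j = i+1; j < histos.size(); j++)
            {
                std::vector<uint32_t> merged = histos[i];
                for (size_t s = 0; s < merged.size(); s++) merged[s] += histos[j][s];

                double delta = table_cost(merged) - table_cost(histos[i]) - table_cost(histos[j]);
                if (delta < best_delta) { best_delta = delta; best_i = i; best_j = j; }
            }

            // stop when no merge is free/beneficial and we're already under the table budget :
            if (best_delta >= 0 && histos.size() <= target_count) break;

            // merge context best_j onto best_i :
            for (size_t s = 0; s < histos[best_i].size(); s++) histos[best_i][s] += histos[best_j][s];
            histos.erase(histos.begin() + best_j);
        }
    }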

Once this is done, the decoder becomes :


get n = number of huffman tables

n times :
read codelens from file
build huffman decode table

256 times :
read tableindex from file
merged_huff_table_ptr[i] = huff_table[ tableindex ]

per symbol :

o1 = ptr[-1];
ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[o1] );

So merged_huff_table_ptr[] is still a [256] array, but it points at only [n] unique Huffman tables.

That's order-1 Huffman!

In the modern world, we know that o1 = the previous literal is not usually the best use of an 8-bit context. You might do something like the top 3 bits of ptr[-1], the top 2 bits of ptr[-2], and 2 bits of position, to make a 7-bit context.
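
For example, one such packing might look like this (just an illustration; the bit layout here is arbitrary, not any particular codec's) :

    #include <cstddef>
    #include <cstdint>

    // 3 bits from the previous byte, 2 from the one before that, 2 bits of position = 7-bit context
    static inline int make_context(const uint8_t * ptr, size_t pos)
    {
        return ((ptr[-1] >> 5) << 4)   // top 3 bits of ptr[-1]
             | ((ptr[-2] >> 6) << 2)   // top 2 bits of ptr[-2]
             | (int)(pos & 3);         // low 2 bits of position
    }

    // then : ptr[0] = huff_decode( bitstream , merged_huff_table_ptr[ make_context(ptr,pos) ] );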

One of the cool things order-1 Huffman can do is to adaptively figure out the context for you.

For example with LZMA you have the option of the # of literal context bits (lc) and literal pos bits (lp). You want them to be as low as possible for better statistics, and there's no good way to choose them per file. (usually lc=2 or lp=2 , usually just one or the other, not both)

With order-1 Huffman, you just make a context with 3 bits of lc and 3 bits of lp, so you have a [64] 6-bit context. Then you let the merger throw away states that don't help. For example if it's a file where pos-bits are irrelevant (like text), they will just get merged out, all the lc contexts that have different lp values will merge together.


5/03/2016

Brotli signed int mode

Brotli has a signed int context mode. It looks like a good idea. My guess is that this is what's helping Brotli10 on the binary files I wrote about in the previous post (horse.vipm and so on).

Signed int takes the previous two bytes and forms a 6-bit context from them thusly :


  Context = (Lut2[b2]<<3) | Lut2[b1];

      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

So, it's roughly categorizing the two values into ranges, which means it can act as a kind of linear predictor (if that fits the data); eg. if the two preceding values are in group 4, my value probably is too, or if b2 is a "3" and b1 is a "4" then I'm likely a "5". Or not, if the data isn't linear like that. Or maybe there's only correlation to b1 and b2 gets ignored, which the order-1-huff can also model.
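
Written out as ranges instead of a table, that LUT is just (same mapping as the table above, sketched as code) :

    #include <cstdint>

    static int Lut2_func(uint8_t b)
    {
        if (b == 0)   return 0;   // exactly 00 gets its own bucket
        if (b < 16)   return 1;
        if (b < 64)   return 2;
        if (b < 128)  return 3;
        if (b < 192)  return 4;
        if (b < 240)  return 5;
        if (b < 255)  return 6;
        return 7;                 // exactly FF gets its own bucket
    }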

One thing I like is holding out values 00 and FF as special cases that get a unique bucket. This lets you detect the special cases of last two bytes = 0000, FFFF, FF00, 00FF, which can be pretty important on binary.

I think that for the type of data we get in games that often has floats, it might be worth it to single out 7F and 80 as well, something like :

0,1111....
1..
22...
2........2,3
4,5555...
5...
66.....
6........6,7

but who knows, would have to test to see.

5/03/2016

Underappreciated Compressors

1. LZX. LZX is crazy good. It was such a huge step at the time, and nobody really recognized it. (I sure didn't) (I guess someone at MS did because they bought it).

It's a shame it never got a good mainstream implementation. It could/should have been the LZ we all used for the past 10 years.

One of the little mistakes in LZX was the 21 bit offset limit. This must have seemed enormous back on the Amiga in 1995, but very quickly became a major handicap against LZs with unlimited windows.

LZX with unlimited window (eg. on files less than 2 MB) is competitive with any modern LZ, especially on binary structured data where it really shines. In hindsight, LZX is the clear ancestor to LZMA and it was way ahead of its time. We're only clearly beating it in the past year or two (!!).

2. RAR. The primary LZ in RAR is a pretty straightforward LZ-Huff (I believe). It's fine, it's nothing bad or special.

What makes RAR special is the filters. It still has the best filters of any compressor I know.

RAR+filters often *beats* LZMA and other very slow high ratio compressors.

The special thing about the RAR filters is that they aren't like most of the "precomp" solutions that just try to recognize WAV headers and things like that - RAR may do some of that (I have no idea) - but it also definitely finds filters that work on headerless data. Like, you can take a BMP or WAV and strip off the header and RAR will still figure out that there's data to filter in there; it must have some analysis heuristics, and they're better than anything else I've seen.

As an example of when RAR filters do magic, here's a 24-bit RGB BMP with the first 100k stripped, so it's headerless and not easily recognized by file-type-detection filters :

PDI_1200_bmp_no_header.zl8.LZNA,1241369
PDI_1200_bmp_no_header.nz,1264747
PDI_1200_bmp_no_header.BitKnit,1268120  // <- wow BitKnit !
PDI_1200_bmp_no_header.LZNA,1306670
PDI_1200_bmp_no_header.rar,1312621  // <- RAR filters!
PDI_1200_bmp_no_header.7z,1377603
PDI_1200_bmp_no_header.brotli10,1425996
PDI_1200_bmp_no_header_lp2.7z,1469079
PDI_1200_bmp_no_header.Kraken,1506915
PDI_1200_bmp_no_header.lzx21,1580593
PDI_1200_bmp_no_header.zstd060,1619920
PDI_1200_bmp_no_header.mc-.rar,1631419 // <- RAR unfiltered
PDI_1200_bmp_no_header.brotli9,1635105
PDI_1200_bmp_no_header.z9.zip,1708589
PDI_1200_bmp_no_header.lz4xc4,1835955
PDI_1200_bmp_no_header.raw,2500000

That said, it does make mistakes. Sometimes filters can make things way worse if they make a wrong decision. They don't have a "filters must help" safety check. This is easy to prevent: you just also run with no filter and make sure the filter helped; but they seem to not do that (presumably to save encode time), and the results can be disastrous :

lightmap.bc3.LZNA,361185
lightmap.bc3.7z,373909
lightmap.bc3_lp2.7z,387590
lightmap.bc3.brotli10,391602
lightmap.bc3.BitKnit,416208
lightmap.bc3.zstd060,417956
lightmap.bc3.Kraken,431476
lightmap.bc3.lzx21,441669
lightmap.bc3.mc-.rar,457893  // <- RAR with disabled filters
lightmap.bc3.brotli9,498802
lightmap.bc3.z9.zip,583178
lightmap.bc3.rar,778363  // <- RAR with filters huge fuckup !!
lightmap.bc3.raw,4194332

RAR filters fucking up on DXTC (BCn) is pretty consistent :

c.dds.nz,371610
c.dds.7z,371749
c.dds.zl8.LZNA,371783
c.dds.LZNA,373245
c.dds_lp2.7z,375384
c.dds.lzx21,395866
c.dds.brotli10,399674
c.dds.BitKnit,400528
c.dds.Kraken,405563
c.dds.mc-.rar,408363 // <- unfiltered is okay
c.dds.zstd060,411515
c.dds.brotli9,426948
c.dds.z9.zip,430952
c.dds.rar,438070     // <- oops!
c.dds.raw,524416

Sometimes it does magic :

horse.vipm_lp2.7z,925996
horse.vipm.LZNA,942950
horse.vipm.7z,945707
horse.vipm.brotli10,955363 // <- brotli10 big step
horse.vipm.rar,971716 // <- RAR with filters does magic
horse.vipm.BitKnit,1017740
horse.vipm.lzx21,1029541
horse.vipm.mc-.rar,1066205 // <- RAR with disabled filters
horse.vipm.zstd060,1100219
horse.vipm.Kraken,1106081
horse.vipm.brotli9,1108858
horse.vipm.z9.zip,1155056
horse.vipm.raw,1573070

Here's an XRGB dds where the RAR filters do magic :

d.dds.zl8.LZNA,352232
d.dds.nz,356649
d.dds.BitKnit,360220  // (at zl6 BitKnit beats LZNA ! crushes 7z! wow)
d.dds.LZNA,381250
d.dds.rar,382282  // <- RAR filter crushes 7z
d.dds_lp2.7z,427395
d.dds.7z,452898
d.dds.brotli10,471413
d.dds.Kraken,480257
d.dds.lzx21,520632
d.dds.mc-.rar,534913 // <- RAR unfiltered is poor
d.dds.brotli9,542792
d.dds.zstd060,545583
d.dds.z9.zip,560708
d.dds.raw,1048704

happy.zl8.LZNA,949709
happy.LZNA,955700
happy_lp2.7z,974550
happy.BitKnit,979832
happy.7z,1004359
happy.cOO.nz,1015048
happy.co.nz,1028196
happy.Kraken,1109748
happy.brotli10,1135252
happy.lzx21,1168220
happy.mc-.rar,1177426 // <- RAR unfiltered is okay
happy.zstd060,1199064
happy.brotli9,1219174
happy.rar,1354649 // <- RAR filters fucks up
happy.z9.zip,1658789
happy.lz4xc4,2211700
happy.raw,4155083

Not about RAR, but for historical comparison, lzt24 is another mesh (the "struct72" file here) :

lzt24.zl8.LZNA,1164216
lzt24.LZNA,1177160
lzt24.nz,1206662
lzt24_lp2.7z,1221821
lzt24.BitKnit,1224524
lzt24.7z,1262013
lzt24.Kraken,1307691
lzt24.brotli10,1323486
lzt24.brotli9,1359566
lzt24.lzx21,1475776
lzt24.zstd060,1498401
lzt24.mc-.rar,1612286
lzt24.rar,1612286
lzt24.z9.zip,2290695
lzt24.raw,3471552

Found another weird one where RAR filters do magic; lzt25 is super-structured 13-byte structs :

lzt25.rar,40024  // <- WOW RAR filters!
lzt25.nz,45397
lzt25.7z,51942
lzt25_lp2.7z,52579
lzt25.LZNA,58903
lzt25.zl8.LZNA,61582  // <- zl8 LZNA worse than zl6 - weird file
lzt25.lzx21,63198
lzt25.zstd060,64550  // <- ZStd does surprisingly well here, I thought you needed more reps on this file
lzt25.brotli9,67856
lzt25.Kraken,67986
lzt25.brotli10,68472 // <- brotli10 worse than brotli9 !
lzt25.BitKnit,92940  // <- BitKnit oddly struggling
lzt25.mc-.rar,106423 // <- unfiltered RAR is the worst of the LZ's
lzt25.z9.zip,209811
lzt25.lz4xc4,324125
lzt25.raw,1029744

A lot of interesting things to pick out in those reports. (just saying, I'm not gonna address them all)

One general thing is that the performance of these LZ's is in no way consistent. You can't just say that "X LZ is 5% better than Y"; there's no really consistent pattern, and they have wildly variable relative performance.

There's a family of sort of normal LZ's - LZX, Brotli9, ZStd, & unfiltered RAR. Then there's the family of the high-compress LZ's, LZNA, 7z, nz. Those are pretty consistently together, and form two end-points.

But then there are the floaters. BitKnit, Kraken, Filtered RAR, and Brotli10 can jump around between the "normal LZ" and "high-compress LZ" region. BitKnit and Brotli10 are the most variable - they both can jump right up to the high-compress LZ's like 7z, but on other files they drop right down into the pack of normal LZ's (LZX, etc.).

I have a guess about what's happening with Brotli. I haven't looked at the code at all, but my guess is that between level 9 and 10 the order-1 context optimization is turned on. In particular, there's this "signed int" context mode which I believe is what does the magic for brotli on things like horse.vipm (for example it has contexts for the case of last two bytes = 0x0000 , or last two bytes = 0xFFFF , which are pretty common on horse). My guess is that this mode is just not even tried at all at level 9, and at level 10 it turns on the code to pick the best context mode, and finds the signed int mode which is great on these files. Not sure.
