03-12-12 - Comparing Compressors

It's always hard to compare compressors fairly in a way that's easily understood by the layman. I think the basic LZH compressors in Oodle are very good, but do they compress as much as LZMA ? No. Are they as fast as LZO? No. So if I really make a fair comparison chart that lists lots of other compressors, I will be neither the fastest nor the highest ratio.

(The only truly fair way to test, as always, is to test in the actual target application, with the actual target data. Other than that, it's easiest to compare "trumps", eg. if compressor A has the same speed as B, but more compaction on all files, then it is just 100% better and we can remove B from consideration)

I wrote before about the total time method of comparing compressors. Basically you assume the disk has some given speed D. Then you see what is the total time to load the compressed file (eg. compressed size/D) and the time to do the decompression.

"Total time" is not really the right metric for various reasons; it assumes that one CPU is fully available to compression and not needed for anything else. It assumes single threaded operation only. But the nice thing about it is you can vary D and see how the best compressor changes with D.

In particular, for two compressors you can solve for the disk speed at which their total time is equal :

D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

disk speed where two compressors are equal :

D = C1 * C2 * (R1 - R2) / (C1 - C2)

at lower disk speeds, the one with more compression is preferred, and at higher disk speeds the faster one with lower compression is preferred.

The other thing you can do is show "effective speed" instead of total time. If you imagine the client just gets back the raw file at the end, they don't know if you just loaded the raw file or if you loaded the compressed file and then decompressed it. Your effective speed is :

D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)

S = 1 / ( R/D + 1/C )

So for example, if your compressor is "none", then R = 1.0 and C = infinity, so S = D - your speed is the disk speed.

If we have two compressors that have a different ratio/speed tradeoff, we can compare them in this way. I was going to compare my stuff to Zip/Zlib, but I can't. On the PC I'm both faster than Zip and get more compression, so there is no tradeoff. (*1) (*2)

(*1 = this is not anything big to brag about, Zip is ancient and any good modern compressor should be able to beat it on both speed and ratio. Also Zlib is not very well optimized; my code is also not particularly optimized for the PC, I optimize for the consoles because they are so much slower. It's kind of ironic that some of the most pervasive code in the world is not particularly great).

(*2 = actually there are more dimensions to the "Pareto space"; we usually show it as a curve in 2d, but there's also memory use, and Zip is quite low in memory use (which is why it's so easy to beat - all you have to do is increase the window size and you gain both ratio and speed (you gain speed because you get more long matches)); a full tradeoff analysis would be a manifold in 3d with axes of ratio,speed, and size)

Anyhoo, on my x64 laptop running single threaded and using the timing technique here I get :

zlib9 : 24,700,820 ->13,115,250 =  1.883 to 1, rate= 231.44 M/s

lzhlw : 24,700,820 ->10,171,779 =  2.428 to 1, rate= 256.23 M/s

rrlzh : 24,700,820 ->11,648,928 =  2.120 to 1, rate =273.00 M/s

so we can at least compare rrlzh (the faster/simpler of my LZH's) with lzhlw (my LZH with Large Window).

The nice thing to do is to compute the effective speed S for various possible disk speeds D, and make a chart :

On the left is effective speed vs. disk speed, on the right is a log-log plot of the same thing. The blue 45 degree line is the "none" compressor, eg. just read the uncompressed file at disk speed. The axis is MB/sec, and here (as is most common for me) I use M = millions, not megas (1024*1024) (but the numbers I was showing at GDC were megas, which makes everything seem a little slower).

We see that on the PC, lzhlw is the better choice at any reasonable disk speed. They are equal somewhere around D = 280 MB/sec, but it's kind of irrelevant because at that point they are worse than just loading uncompressed.

The gap between lines in a log-log plot is the *ratio* of the original numbers; eg. the speedup multipler for LZH over RAW is maximum at the lowest speeds (1 MB/sec, = 0 on the log-log chart) and decreases as the disk gets faster.


extrememoderate said...

"It's kind of ironic that some of the most pervasive code in the world is not particularly great"

The real problem is that is has no competition.
What other open source algorithms in the same field?
First is Tornado. Quite good actually...but only for big chunks of data. And undocumented. And with interface that's clearly designed for file to file or memory to file compression. And x86 only. And being version 0.5, not used by anything but FreeArc, many people won't trust that it's reliable. An option, but clearly not something that could replace zlib.
The second option would be....well...Deflate64 from 7-zip. A nice, modern codec that's oh so different from Deflate....
The 3rd option?
I don't see any.
Well, maybe alternative Deflate implementations, but zlib is very portable, very reliable and very widely known and they can't offer that much better efficiency.

So until you open source your rrlzh and lzhlw, zlib will keep being the best option for many uses.

cbloom said...

Yeah, I know, I'm as much to blame as anyone. But even aside from the issue of the format, it's weird that there isn't a fast zlib decoder.

The problem is that the people who develop compression algorithms can't be bothered to make a clean open source public domain version.

I always hoped that some group of us could come together and make a new standard, and then someone else would actually do all the work of making it work on GCC and big endian and so on.

But it turns out that when people write open source code, they prefer to invent their own stuff, even though they aren't particularly expert in the topic, so you wind up with yet another open source format that's not particularly good (eg. snappy).

Others like FreeArc and 7zip and LZO are just way too big and complex; what you really want is an STB-style single header, at least for the decoder.

Steve Worley said...

I totally agree with the rough state of compression libraries. There's tons of theory, lots and lots of (interesting!) experiments (especially at encode.ru) but simple, API friendly, clean implementations that are superior enough to zlib to make them worthwhile? Not many.

I found you from LZP research, since I was looking for different compromises between speed and compression (for my own encoding experiments).

One interesting point: even Oodle, a commercial library, isn't advertised or even listed by RAD. I had to search a lot to find even a reference to it on their site, in a Granny3D changelog! If it's really a general purpose library, they (er, probably meaning you...) should post a whitepaper overview/analysis of it for their site.

ryg said...

RAD isn't advertising Oodle yet because it's not actually released yet, except for the aforementioned bit in Granny :)

Yann Collet said...

I plan to open-source Zhuff (an LZH derivative) at some point, under a source code format & interface pretty close to LZ4. The real issue is ... time.

Even a simple codec such as LZ4 already requires quite some support. That being said, it's a huge and useful experience, that will greatly benefit to future developments.

extrememoderate said...

miniLZO is neither big nor hard to use, yet it offers what 99% of LZO users need. It takes 4 files, but that's hardly a show stopper.
With stronger codecs things are worse....though I'm not damning people for this. Being disappointed with the state of FOSS LZH algos I contemplated writing my own (well, I still do). But it would too be designed to fit my needs, no more, no less. Yet another one-use compressor w/out ambition to be better than zlib for an average compression user.
I'm very interested. Could you share 1 more detail though? What license do you intend to use?
BTW is it time for a support forum already?

Yann Collet said...

BSD license is the target, although the codec may start its life as L-GPL.

I have not thought of a support forum yet.
It seems much too soon to talk about the open source version, but there is already a page to post comment on the current win32 binary version.

It's linked from Zhuff Homepage : http://fastcompression.blogspot.com/p/zhuff.html

extrememoderate said...

I meant support forum for LZ4 and ZHuff at later time.

And BSD is a great news, thanks.

Yann Collet said...

LZ4 support forum is this : http://groups.google.com/group/lz4c

extrememoderate said...

Oh...google...after all it's not surprising.

Thanks for the info, but it's not for me.

old rants