It's always hard to compare compressors fairly in a way that's easily understood by the layman. I think the
basic LZH compressors in Oodle are very good, but do they compress as much as LZMA? No. Are they as fast as
LZO? No. So if I really make a fair comparison chart that lists lots of other compressors, I will be neither
the fastest nor the highest ratio.
(The only truly fair way to test, as always, is to test in the actual target application, with the actual
target data. Other than that, it's easiest to compare "trumps", eg. if compressor A has the same speed as B,
but more compaction on all files, then it is just 100% better and we can remove B from consideration.)
I wrote before about the
total time method of comparing
compressors. Basically you assume the disk has some given speed D. Then the total time is the time to
load the compressed file (eg. compressed size / D) plus the time to do the decompression.
"Total time" is not really the right metric for various reasons; it assumes that one CPU is fully available
to compression and not needed for anything else. It assumes single threaded operation only. But the nice
thing about it is you can vary D and see how the best compressor changes with D.
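To make that concrete, here's a minimal sketch in Python (the function name and units are mine, not anything
from Oodle; I assume decompressor speed is measured against the raw output size, consistent with the formulas
below) :

def total_time(raw_size, compressed_size, disk_speed, decompress_speed):
    # time to read the compressed bytes off disk, then decompress them;
    # disk_speed is in bytes/sec, decompress_speed is raw (output) bytes/sec
    load_time = compressed_size / disk_speed
    decompress_time = raw_size / decompress_speed
    return load_time + decompress_time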
In particular, for two compressors you can solve for the disk speed at which their total time is equal :
D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)
disk speed where two compressors are equal :
D = C1 * C2 * (R1 - R2) / (C1 - C2)
at lower disk speeds, the one with more compression is preferred, and at higher disk speeds the faster one
with lower compression is preferred.
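As a sketch, that crossover is one line of code (derived by setting R1/D + 1/C1 equal to R2/D + 1/C2 and
solving for D; the helper name is hypothetical) :

def crossover_disk_speed(c1, r1, c2, r2):
    # disk speed D at which two compressors have equal total time;
    # c = decompressor speed (raw bytes/sec), r = compressed/raw ratio
    return c1 * c2 * (r1 - r2) / (c1 - c2)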
The other thing you can do is show "effective speed" instead of total time. If you imagine the client just
gets back the raw file at the end, they don't know if you just loaded the raw file or if you loaded the
compressed file and then decompressed it. Your effective speed is :
D = disk speed
C = decompressor speed
R = compression ratio (compressed size / raw size) (eg. unitless and less than 1)
S = 1 / ( R/D + 1/C )
So for example, if your compressor is "none", then R = 1.0 and C = infinity, so S = D - your speed is the
disk speed.
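In code, with the same caveats as before :

def effective_speed(disk_speed, ratio, decompress_speed):
    # S = 1 / (R/D + 1/C) : the apparent rate at which raw bytes arrive
    return 1.0 / (ratio / disk_speed + 1.0 / decompress_speed)

# sanity check : the "none" compressor (R = 1, C = infinity) gives S = D
assert abs(effective_speed(100.0, 1.0, float("inf")) - 100.0) < 1e-9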
If we have two compressors that have a different ratio/speed tradeoff, we can compare them in this way.
I was going to compare my stuff to Zip/Zlib, but I can't. On the PC I'm both faster than Zip and get more
compression, so there is no tradeoff. (*1) (*2)
(*1 = this is not anything big to brag about, Zip is ancient and any good modern compressor should be able
to beat it on both speed and ratio. Also, Zlib is not very well optimized; my code is also not particularly
optimized for the PC, since I optimize for the consoles because they are so much slower. It's kind of ironic that
some of the most pervasive code in the world is not particularly great).
(*2 = actually there are more dimensions to the "Pareto space"; we usually show it as a curve in 2d, but
there's also memory use, and Zip is quite low in memory use (which is why it's so easy to beat - all you have
to do is increase the window size and you gain both ratio and speed (you gain speed because you get more
long matches)); a full tradeoff analysis would be a manifold in 3d with axes of ratio, speed, and memory use)
Anyhoo, on my x64 laptop running single threaded and using the
timing technique here
I get :
zlib9 : 24,700,820 -> 13,115,250 = 1.883 to 1, rate = 231.44 M/s
lzhlw : 24,700,820 -> 10,171,779 = 2.428 to 1, rate = 256.23 M/s
rrlzh : 24,700,820 -> 11,648,928 = 2.120 to 1, rate = 273.00 M/s
so we can at least compare rrlzh (the faster/simpler of my LZH's) with lzhlw (my LZH with Large Window).
The nice thing to do is to compute the effective speed S for various possible disk speeds D, and make a chart :
On the left is effective speed vs. disk speed, on the right is a log-log plot of the same thing.
The blue 45 degree line is the "none" compressor, eg. just read the uncompressed file at disk speed.
The axes are in MB/sec, and here (as is most common for me) I use M = millions, not megas (1024*1024) (but the
numbers I was showing at GDC were megas, which makes everything seem a little slower).
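If you want to reproduce the curve data, here's a hypothetical snippet using the measured numbers above and
the effective_speed helper sketched earlier (everything in MB/s, M = millions) :

raw = 24700820
curves = {
    "lzhlw": (10171779 / raw, 256.23),  # (ratio R, decode speed C in MB/s)
    "rrlzh": (11648928 / raw, 273.00),
}
for d in [1, 2, 4, 8, 16, 32, 64, 128, 256]:  # disk speeds D in MB/s
    cols = "  ".join(f"{name}: {effective_speed(d, r, c):7.2f}"
                     for name, (r, c) in curves.items())
    print(f"D = {d:3d} MB/s  ->  {cols}")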
We see that on the PC, lzhlw is the better choice at any reasonable disk speed. They are equal somewhere
around D = 250 MB/sec (which you can confirm from the crossover formula and the measured rates above), but
it's kind of irrelevant because at that point they are both worse than just loading uncompressed.
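Plugging the measured numbers into the crossover sketch from above :

r_lzhlw = 10171779 / 24700820   # ~0.412
r_rrlzh = 11648928 / 24700820   # ~0.472
d = crossover_disk_speed(256.23, r_lzhlw, 273.00, r_rrlzh)
print(f"equal total time at D = {d:.0f} MB/s")   # ~249 MB/s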
The gap between lines in a log-log plot is the *ratio* of the original numbers; eg. the speedup multiplier
for LZH over RAW is maximum at the lowest speeds (1 MB/sec, = 0 on the log-log chart) and decreases as
the disk gets faster. That's because at very low D the load term R/D dominates, so the speedup approaches
1/R (the full compression ratio), while at high D the 1/C decompression term takes over.
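A quick check of that limit with the helpers above : at D = 1 MB/s the speedup for lzhlw is nearly its full
compression ratio, and it melts away as the disk gets faster :

for d in [1.0, 10.0, 100.0]:
    s = effective_speed(d, 10171779 / 24700820, 256.23)
    print(f"D = {d:5.1f} MB/s : speedup = {s / d:.2f}x")
# prints roughly 2.41x, 2.22x, 1.25x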