(the compressors are the ones I can compile into my testbed and run from code, eg. not command line apps; they are all run memory to memory; I tried to run all compressors in max-compression-max-decode-speed mode, eg. turning on heavy options on the encode side. Decode times are generated by running each decompressor 10 times locked to each core of my box (80 total runs) and taking the min time; the cache is wiped before each decode. Load times are simulated by dividing the compressed file size by the disk speed parameter. All decoders were run single-threaded.)
They are sorted by decompressor speed, fastest first. (the "raw" uncompressed file takes zero time to decompress).
"sum" = the sum of decomp + load times. This is the latency if you load the entire compressed file and then decompress in series.
"max" = the max of decomp & load times. This is the latency if the decomp and load were perfectly parallelizable, and neither ever stalled on the other.
The criterion you actually want to use is something between "sum" and "max", so the idea is you look at them both and kind of interpolate in your mind. (obviously you can replace "disk" with ram or network or "channel")
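In code, the two numbers are just this (a minimal C sketch; the sizes and speeds are made-up placeholders, not values from the actual benchmark):

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical example numbers, not taken from the benchmark */
        double rawLen   = 100.0;  /* uncompressed size, millions of bytes */
        double compLen  =  40.0;  /* compressed size, millions of bytes   */
        double diskMBPS = 100.0;  /* simulated disk speed, mb per second  */
        double decMBPS  = 200.0;  /* decode speed memory-to-memory        */

        double loadTime   = compLen / diskMBPS; /* time to read the compressed file */
        double decodeTime = rawLen  / decMBPS;  /* time to decompress it            */

        double sum = loadTime + decodeTime;                         /* fully serial     */
        double max = loadTime > decodeTime ? loadTime : decodeTime; /* fully overlapped */

        printf("load %.3f  decode %.3f  sum %.3f  max %.3f\n",
               loadTime, decodeTime, sum, max);
        return 0;
    }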
Discussion :
The compressors are ordered from left to right by speed. If you look at the chart of compressed file sizes, the sizes should be decreasing monotonically from left to right. Any time the size pops up as you go to the right (eg. at snappy, minilzo, miniz, zlib), that is just a bad compressor; it has no hope of being a good choice in terms of space/speed tradeoff. The only ones that are on the "Pareto frontier" are raw, LZ4, OodleLZH, and LZMA.
Basically what you should see is that on a fast disk (100 mbps (and mb = millions of bytes, not 1024*1024)), a very slow decoder like LZMA does not make a lot of sense, you spend way too much time in decomp. On very slow data channels (like perhaps over the net) it starts to make sense, but you have to get to 5 mbps or slower before it becomes a clear win. (of course there may be other reasons that you want your data very small other than minimizing time to load; eg. if you are exceeding DVD capacity).
On a fast disk, the fast decompressors like LZ4 are appealing. (though even at 100 mbps, OodleLZH has a lower "max"; LZ4 has the best "sum").
Of the fast decoders, LZ4 is just the best. (in fact LZ4-largewindow would be even stronger). Zip is pretty poor; the small window is surely hurting it: it doesn't find enough matches, which not only hurts compression, it hurts decode speed. Part of the problem is that neither miniz nor zlib has a super great decoder with all the tricks.
It's kind of ridiculous that we don't have a single decent mainstream free compression library. Even just zip-largewindow would be at least decent. (miniz could easily be extended to large windows; that would make it a much more competitive compressor for people who don't care about zlib compatibility)
If you are fully utilizing your CPUs, you may need a low-CPU decoder even if it's not the best choice in a vacuum. In fact, because of that you should avoid CPU-hungry decoders even if they are the best by some simple measure like time to load. eg. even in cases where LZMA does seem like the right choice, if it's close you should avoid it, because you could use that CPU time for something else. You could say that any compressor that can decode faster than it can load compressed data is "free"; that is, the decode time can be totally hidden by parallelizing with the IO, and you can saturate the disk loading compressed data. While that is true, it assumes no other CPU work is being done, so it does not apply to games. (it does sort of apply to archivers or installers, assuming you aren't using all the cores).
As a rough rule of thumb, compressors that are in the "sweet spot" take time that is roughly on par with the disk time to load their compressed data. That is, maybe half the time, or double the time, but not 1/10th the time of the disk (then they are going too fast, compressing too little, leaving too much on the table), and also not 10X the time of the disk (then they are just going way too slow and you'd be better off with less compression and a faster compressor).
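As a sketch, here is one way to write down the "free" criterion from a couple paragraphs up and this sweet-spot rule of thumb (the 0.5-2.0 band is just my reading of the rough rule, not a hard threshold):

    /* sketch: classify a compressor by its decode time vs. its load time */
    /* "free"  : decode can hide entirely behind the IO (ignoring other CPU work) */
    /* "sweet" : decode time within roughly 2x of load time either way            */
    int is_free (double loadTime, double decodeTime) { return decodeTime <= loadTime; }
    int is_sweet(double loadTime, double decodeTime)
    {
        double r = decodeTime / loadTime;
        return r >= 0.5 && r <= 2.0; /* rough rule-of-thumb band, not a hard rule */
    }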
The other thing we can do is draw the curves and see who's on the Pareto frontier.
Here I make the Y axis the "effective mbps" to load and then decompress (sequentially). Note that "raw" is an x=y line, because the effective speed equals the disk speed.
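Concretely, the "effective mbps" is just the raw size over the sequential ("sum") time; a sketch:

    /* effective speed to load + decompress sequentially, in mb per second */
    double effective_mbps(double rawLen, double compLen,
                          double diskMBPS, double decMBPS)
    {
        double loadTime   = compLen / diskMBPS;
        double decodeTime = rawLen  / decMBPS;
        return rawLen / (loadTime + decodeTime);
    }
    /* for "raw": compLen == rawLen and decodeTime == 0, so this returns diskMBPS */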
Let me emphasize that these charts should be evaluated as information that goes into a decision. You do not just go "hey my disk is 80 mbps let me see which compressor is on top at that speed" and go use that. That's very wrong.
and the log-log (log base 2):
You can see way down there at the bottom of the log-log, where the disk speed is 1.0 mbps, LZMA finally becomes best. Also note that a log2 disk speed of 10 is 1024 mbps, about a gigabyte per second, almost the speed of memory.
Some intuition about log-log compressor plots :
Over on the right hand side, all the curves will flatten out and become horizontal. This is the region where the decompress time dominates and the disk speed becomes almost irrelevant (load time is very tiny compared to decompress time). You see LZMA flattens out at relatively low disk speed (at 16 mbps (log2 = 4) it's already pretty flat). The speed over at the far right approaches the speed of just the decompressor running memory to memory.
On the left all the curves become straight lines with a slope of 1 (Y = X + B). In this area the total time is dominated by the loading time, so the effective speed is just a constant multiple of the disk speed. In a log-log plot this constant multiple becomes a constant addition - the Y intercept of each curve is equal to log2(rawLen/compLen); eg. someone with 2:1 compression will hit the Y axis at log2(2) = 1.0. You can see them stacked up hitting the Y axis in order of who gets the most compression.
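To spell out where that intercept comes from: on the far left the decode time is negligible, so

    effectiveSpeed ~= rawLen / (compLen/diskSpeed) = diskSpeed * (rawLen/compLen)
    log2(effectiveSpeed) ~= log2(diskSpeed) + log2(rawLen/compLen)

which is Y = X + B with B = log2(rawLen/compLen).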
Another plot we can do is the L2 mean of load time and decode time (sqrt of squares). What the L2 mean does is penalize compressors where the load time and decode time are very different (it favors ones where they are nearly equal). That is, it sort of weights the average towards the max. I think this is actually a better way to rate a compressor for most usages, but it's a little hand-wavey so take it with some skepticism.
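A sketch of that metric, taking "sqrt of squares" literally (a true quadratic mean would divide by sqrt(2), which only rescales everything and doesn't change the ranking):

    #include <math.h>

    /* "sqrt of squares" of load time and decode time; always lies between
       max(load,decode) and load+decode */
    double l2_metric(double loadTime, double decodeTime)
    {
        return sqrt(loadTime*loadTime + decodeTime*decodeTime);
    }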
Comments :

Re: big window LZ77
I believe Adobe uses a zlib variant called "Flate" (which is of course impossible to search for) which takes a few unused match encodings to act as larger offsets. I think PDF and other formats use it.
Are there any good multithreaded decompressors around, or is the only option to compress the file as independent chunks and then decompress all the chunks at once? A pretty slow decompressor running on 4-6 cores could outperform an optimized one that only uses a fraction of the CPU cycles available to it. Even my phone has 4 cores.
@Chris - I'm not aware of any inherently multithreaded compression algorithms.
There are lots that support chunking for parallelism (Oodle does it that way).
(de)compression is very inherently serial because each byte output needs its predecessor to be already done.
(the compression side is much easier to parallelize; generally you keep the code output still serial, but the code creation can use parallel match-finders or parallel block-sorters or whatever)
You could imagine doing something like context mixing decompression in fine-grained parallel; eg. run all the models on many cores and then merge them on one core, but that would be killed by overhead I suspect. (it would be a lot easier in what I call a "kernel-like" threading environment; eg. a console where you are in full control of thread switching, so you could lock threads to cores and know they were all running). (there would even be advantages if the cores had independent L2 caches; each model would get its own L2)
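To make the chunked approach concrete, here is a rough sketch using zlib and pthreads as stand-ins; the chunk size, the one-thread-per-chunk scheduling, and the synthetic test data are arbitrary illustrative choices, not how Oodle actually does it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pthread.h>
    #include <zlib.h>

    #define NUM_CHUNKS 4
    #define CHUNK_RAW  (256*1024)

    typedef struct {
        const unsigned char *comp;   /* independently compressed chunk */
        uLong compLen;
        unsigned char *raw;          /* where the decoded bytes go     */
        uLong rawLen;                /* expected decoded size          */
    } Chunk;

    static void *decode_chunk(void *arg)
    {
        Chunk *c = (Chunk *)arg;
        uLongf outLen = c->rawLen;
        /* no cross-chunk state is needed because each chunk was compressed independently */
        if (uncompress(c->raw, &outLen, c->comp, c->compLen) != Z_OK || outLen != c->rawLen)
            fprintf(stderr, "chunk decode failed\n");
        return NULL;
    }

    int main(void)
    {
        unsigned char *raw = malloc(NUM_CHUNKS * CHUNK_RAW);
        unsigned char *out = malloc(NUM_CHUNKS * CHUNK_RAW);
        Chunk chunks[NUM_CHUNKS];
        pthread_t threads[NUM_CHUNKS];
        int i;

        /* make some compressible test data */
        for (i = 0; i < NUM_CHUNKS * CHUNK_RAW; i++)
            raw[i] = (unsigned char)(i & 63);

        /* compress each chunk independently; this is where you pay in ratio,
           since matches cannot cross chunk boundaries */
        for (i = 0; i < NUM_CHUNKS; i++) {
            uLongf clen = compressBound(CHUNK_RAW);
            unsigned char *cbuf = malloc(clen);
            compress2(cbuf, &clen, raw + (size_t)i * CHUNK_RAW, CHUNK_RAW, Z_BEST_COMPRESSION);
            chunks[i].comp = cbuf;
            chunks[i].compLen = clen;
            chunks[i].raw = out + (size_t)i * CHUNK_RAW;
            chunks[i].rawLen = CHUNK_RAW;
        }

        /* decode all chunks at once, one thread per chunk */
        for (i = 0; i < NUM_CHUNKS; i++)
            pthread_create(&threads[i], NULL, decode_chunk, &chunks[i]);
        for (i = 0; i < NUM_CHUNKS; i++)
            pthread_join(threads[i], NULL);

        printf("round trip %s\n", memcmp(raw, out, NUM_CHUNKS * CHUNK_RAW) ? "FAILED" : "ok");

        for (i = 0; i < NUM_CHUNKS; i++)
            free((void *)chunks[i].comp);
        free(raw); free(out);
        return 0;
    }

(build with something like cc chunks.c -lz -lpthread)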
Isn't BWT, or at least suffix sorting, parallelizable?
Yeah, but only on the compress side, not the decompress side.
Though you could do a BWT where the entropy coding was done in chunks (so that decode would be parallelized) and then the un-sort was done as a single serial pass (so that you get decorrelation across chunks).
Most modern BWT decoders spend much more time in the entropy/MTF than the un-sort.
(it's also not perfect parallelization even on the compress side; the general approach is like merge sort, but the merging is rather expensive; there is some new work on better ways to parallelize the suffix sort though)
LZHAM has a parallel matchfinder.
And IIRC there have been topics on parallel reverse BWT on encode.ru.
Also, a question:
oodlelzh is a proprietary codec, right?
"LZHAM has a parallel matchfinder."
Yeah, lots of people do; that doesn't help decode speed, which is all I'm measuring at the moment. Also it's not really a "parallel compression algorithm", it's just using some parallelism to accelerate a sequential parse.
LZHAM's parallel match finder is somewhat more aggressively parallel than others, and there's source code which is pretty clean, so it's definitely worth a look.
>Yeah, lots of people do
ReplyDeleteCan you pose a few examples?
The only ones that I'm aware of are LZMA and LZHAM.
Well I was thinking of LZMA, LZHAM, and me ;) I assume there are more (?) but maybe not; I see FreeArc just uses LZMA's, for example.
There are parallel suffix sorters, and a suffix array is a pretty good way to do LZ match finding.