09-14-12 - Things Most Compressors Leave On the Table

It's very appealing to write a "pure" algorithmic compressor which just implements PPM or LZP or whatever in a very data-agnostic way, and to keep it quite minimal. But if you do that, you are generally leaving a lot on the table.

There are lots of "extra" things you can do on top of the base pure compressor. It makes it very hard to compare compressors when one of them is doing some of the extra things and another isn't.

I used to only write pure compressors and considered the extra things "cheating", but of course in practical reality they can sometimes provide very good bang-for-the-buck, so you have to do them. (and archivers these days are doing more and more of these things, so you will look bad in comparison if you don't do them).

Trying to dump out a list of things :

Parameter Optimization . Most compressors have some hard-coded parameters; sometimes it's an obvious one, like in LZMA you can set the number of position bits used in the context. Getting that number right for the particular file can be a big win. Other compressors have hard-coded tweaks that are not so obvious; for example almost all modern PPM or CM compressors use some kind of secondary-statistics table; the hash index made for that table is usually some heuristic, and tweaking it per file can be a big win.
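As a rough sketch of the obvious case: Python's standard lzma module exposes the LZMA literal-context, literal-position and position bits as the "lc", "lp" and "pb" filter options, so you can just brute-force them per file and keep the smallest result. The function name and the `preset` argument here are my own for illustration; a real archiver would search more cleverly than trying every combination.

```python
import lzma

def best_lzma_params(data: bytes, preset: int = 6):
    """Try every legal lc/lp/pb combination; return (smallest_size, params)."""
    best = None
    for lc in range(0, 5):
        for lp in range(0, 5 - lc):          # liblzma requires lc + lp <= 4
            for pb in range(0, 5):
                filters = [{"id": lzma.FILTER_LZMA2, "preset": preset,
                            "lc": lc, "lp": lp, "pb": pb}]
                size = len(lzma.compress(data, format=lzma.FORMAT_RAW,
                                         filters=filters))
                if best is None or size < best[0]:
                    best = (size, {"lc": lc, "lp": lp, "pb": pb})
    return best
```

The default setting (lc=3, lp=0, pb=2) is one of the candidates, so the search can never do worse than the default; how much better it does depends entirely on the file.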

Model Preconditioning . Any time you have a compressor that learns (eg. adaptive statistical coders, or the LZP cache table, or the LZ77 dictionary) - a "pure" compressor starts from an empty memory and then learns the file as it goes. But that's rarely optimal. You can usually get some win by starting from some pre-loaded state; rather than starting from empty and learning the file, you start from "default file" and learn towards the current file. (eg. every binary arithmetic probability should not start at 50% but rather at the expected final probability). And of course you can take this a step further by having a few different preconditions for different file types and selecting one.
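The LZ77-dictionary version of this is easy to try with zlib, which lets you preload the window through its `zdict` argument. This is only a minimal sketch; the helper names are made up, and `prior` stands in for whatever "default file" you think is typical of your data. The decoder has to be given the same prior.

```python
import zlib

def compress_primed(data: bytes, prior: bytes) -> bytes:
    """Deflate `data` with the LZ77 window preloaded from `prior`."""
    c = zlib.compressobj(level=9, zdict=prior)
    return c.compress(data) + c.flush()

def decompress_primed(blob: bytes, prior: bytes) -> bytes:
    """Inverse of compress_primed; needs the identical `prior` bytes."""
    d = zlib.decompressobj(zdict=prior)
    return d.decompress(blob) + d.flush()
```

On small buffers that resemble the prior this tends to matter a lot, since otherwise most of the buffer is spent just learning the format.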

Prefilters . BCJ (exe jump transform), deltas, image deltas, table record deltas (Bulat's DELTA), record transposes, various deswizzles, etc. etc. There are lots of prefilters possible, and they can provide very big wins for the amount of time they take. If you don't implement all the prefilters you are at a disadvantage to compressors that do. (for example, RAR has a pretty strong set of prefilters that are enabled by default, which means that RAR actually beats 7zip on lots of files, even though as a pure compressor it's much worse).
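The simplest of these, a byte delta with a stride, looks something like the sketch below. This is a generic illustration, not Bulat's DELTA or any particular archiver's filter; the stride would be the record or sample size (eg. 2 for 16-bit samples), and zlib is only there as a stand-in back end for measuring the effect.

```python
import zlib

def delta_encode(data: bytes, stride: int = 1) -> bytes:
    """Replace each byte with its difference from the byte `stride` back."""
    out = bytearray(data)
    for i in range(stride, len(data)):
        out[i] = (data[i] - data[i - stride]) & 0xFF
    return bytes(out)

def delta_decode(data: bytes, stride: int = 1) -> bytes:
    """Undo delta_encode by running a prefix sum at the same stride."""
    out = bytearray(data)
    for i in range(stride, len(out)):
        out[i] = (out[i] + out[i - stride]) & 0xFF
    return bytes(out)

def packed_size(data: bytes) -> int:
    """Size after the back-end compressor (zlib here, as a stand-in)."""
    return len(zlib.compress(data, 9))
```

On something like a slowly changing table of 16-bit values, comparing `packed_size(data)` against `packed_size(delta_encode(data, 2))` shows the kind of win you get for almost no CPU time.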

Header compression . Anything you send like buffer sizes or compressor parameters can generally be made smaller with more advanced modeling. Typically this is just a few bytes total so it's not important, but it can become important if you transmit incremental headers, or something like static Huffman codes. eg. in something like Zip that can adapt by resending Huffmans, it's actually important to get them as small as possible, and it's usually something that's neglected because it's outside of the pure compressor.

Data Size Specialization . Most compressors either work well on large buffers or small buffers, not both; eg. if you do an LZSS, you might pick 3 byte offsets for large buffers, but on tiny buffers that's a huge waste; in fact you should use 1 byte offsets at first, and then switch to 2, and then 3. People rarely go to the trouble to have separately tuned algorithms for various buffer sizes.
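The offset-width trick works because a match offset can never reach back further than the current position, and the encoder and decoder both know the position, so nothing extra needs to be signalled. A minimal sketch, with hypothetical helper names and assuming offsets are stored as offset-1 in little-endian bytes with a window capped at 2^24:

```python
def offset_width(pos: int) -> int:
    """Bytes needed for any offset in 1..pos, stored as (offset - 1)."""
    if pos <= 1 << 8:
        return 1
    if pos <= 1 << 16:
        return 2
    return 3

def put_offset(out: bytearray, pos: int, offset: int) -> None:
    """Append `offset` using only as many bytes as position `pos` requires."""
    assert 1 <= offset <= min(pos, 1 << 24)
    out += (offset - 1).to_bytes(offset_width(pos), "little")

def get_offset(buf: bytes, at: int, pos: int):
    """Read an offset written by put_offset; returns (offset, next_index)."""
    w = offset_width(pos)
    return int.from_bytes(buf[at:at + w], "little") + 1, at + w
```

A buffer under 256 bytes then never pays for more than 1 byte per offset, while a big buffer only pays the full 3 bytes once it is actually past 64k.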

Data specialization . Compressing text, structured records, images, etc. is actually all very different. You can get a major win by special-casing for the major types of data (eg. text has weird features like the top bits tell you the type of character; word-replacing transforms are a big win, as are de-punctuators, etc. etc.).

Decompression . One of the new favorite tricks is decompressing data to compress it. If someone hands you a JPEG or a Zip or whatever and tells you to compress it as well as possible, of course the first thing you have to do is decompress it to undo the bad compressor so you can get back to the original bits.
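The catch is that the archiver has to hand back the original file bit for bit, so you need a way to regenerate the bad compressor's exact output. One way to do it, sketched here for a zlib stream, is to guess the settings that produced it: decompress, try recompressing at each level, and if one reproduces the stream exactly, store the raw data with a stronger compressor plus that level. The function names and the tuple layout are invented for this sketch, it only tries the level and no other zlib settings, and it assumes the unpacker has a zlib build that produces the same bytes.

```python
import lzma
import zlib

def repack_zlib(stream: bytes):
    """Swap a zlib stream for (level, lzma of the raw data) when that is lossless."""
    raw = zlib.decompress(stream)
    for level in range(1, 10):
        if zlib.compress(raw, level) == stream:   # this level recreates it exactly
            return ("repacked", level, lzma.compress(raw))
    return ("stored", None, stream)               # can't recreate it; keep as-is

def unpack(kind: str, level, payload: bytes) -> bytes:
    """Rebuild the original zlib stream byte for byte."""
    if kind == "stored":
        return payload
    return zlib.compress(lzma.decompress(payload), level)
```

When no level matches, the stream is just stored untouched, so the worst case costs you only the time spent trying.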

This is almost all stuff I haven't done yet, so I have some big wins in my back pocket if I ever get around to it.

In the compression community, I'm happy to see packages like FreeArc that are gathering together the prefilters so that they can be used with a variety of back-end "pure" compressors.
