cbloom rants: 10-16-12 - Two more small notes on LZNib

10/16/2012

10-16-12 - Two more small notes on LZNib

1. Because cost ties are common, and ties are not actually ties (due to "last offset"), just changing the order that you visit matches can change your compression. eg. if you walk matches from long to short or short to long or low offset to high offset, etc.

Another important way to break ties is for speed. Basically prefer long matches and long literal runs vs. a series of shorter ones that make the same output length. Because the code cost is integer bytes, you can do this pretty easily by just adding a small bias to the cost (one thousandth of a byte or whatever) each time you start a new match or LRL.

(more generally in an ideal world every compressor should have a lagrange parameter for space-speed tradeoff, but that's the kind of thing nobody ever gets around to)

2. Traditional LZ coders did not output matches unless they were cheaper than literals. That is, say you send a match len in 4 bits and an offset in 12 bits, so a match is 2 bytes - you would think that the minimum match length should be 3 - not 2 - because sending a 2 byte match is pointless (it's cheaper or the same cost to send those 2 bytes as literals (cheaper as literals if you are in a literal run-len already)). By using a larger MML, you can send higher match lengths in your 4 bits, so it should be a win.

This is not true if you have "last offset". With LO in your coder, it is often beneficial to send matches which are not a win (vs literals) on their own. eg. in the above example, minimum match length should be 2 in an LO coder.

This is one of those cases where text and binary data differ drastically. If you never tested on structured data you would not see this. Really the nature of LZ compression on text and binary is so different that it's worth considering two totally independent compressors (or at least some different tweaked config vals). Text match offsets fall off very steadily in a perfect curve, and "last offsets" are only used for interrupted matches, not for re-using an offset (and generally don't help that much). Binary match offsets have very sparse histograms with lots of strong peaks at the record sizes in the file, and "last offset" is used often just as a way of cheaply encoding the common record distance.

On text, it is in fact best to use an MML which makes matches strictly smaller than literals.

If I keep at this work in the future I'm sure I'll get around to doing an LZ specifically designed for structured data; it's sort of hopeless trying to find a compromise that works great on both; I see a lot more win possible.

cbloom rants

10/16/2012

10-16-12 - Two more small notes on LZNib

No comments:

old rants