10/01/2008

10-01-08 - 2

So I looked into LZMA / 7-Zip a bit. So far as I know there's no real description of the algorithm available, but there is source code here . This is from just browsing the source code for a bit, so take it with a bit of skepticism.

LZMA is an arithmetic coded LZ + context coder. It has a couple of clever ideas that make it perform very well. In fact it compresses better than Quantum and is also faster (at compression - Quantum is faster at decompression, in fact the only disadvantage of LZMA is that its decomp is not super fast).

The match finder is very fast. I can't figure out what it's doing exactly. It makes strong hashes using the CRC. The docs claim it then goes to a patricia trie, but my god I can't see that in the code.

It does lots of binary arithcoding. In fact I don't think it has a multisymbol arithcoder at all, it just breaks larger alphabets into a series of binary codes. He does a pretty good job of breaking up the binary tree so that the common cases of short lengths and low offsets get written in very few binary coding ops (that is, the coding choices of 0 or 1 are close to equal probability).

There are some clever models. In general it's well tweaked to use few contexts but capture the salient information in those contexts. For example, the order-1 literal model only uses the top N bits of the previous char as context, since for most data the top bits are the good predictors and the bottom bits are semi-random. One clever model is using the bottom 2 bits of position as part of the literal context. This is great for capturing the very common pattern on binary data of a 0 every fourth byte, or any every-2 or every-4 bytes kind of patterns (such as when you write a bunch of 32 bit floats and the exponent part is the same on all of them).

It also does a semi-optimal parse. From what I can tell it looks like it finds the match possibilities N steps ahead, and then does a kind of semi-backward cost evaluation from the end of that sequence to pick the best choice at the current coding spot. I believe N is only 4, so this is sort of like the old "lazy matching" but it looks ahead 4, and rather than a rough heuristic like most people use in lazy matching he actually evaluates a coding cost like a true optimal parse.

On text it's very good, but on binary it really kicks ass, I'm guessing the biggest win is because of the pos-conditioned contexts which are really good for binary.

No comments:

old rants