
10-01-08 : First Look at LZMA

So I looked into LZMA / 7-Zip a bit. So far as I know there's no real description of the algorithm available, but there is source code here. This is from just browsing the source for a while, so take it with some skepticism.

LZMA is an arithmetic-coded LZ + context coder. It has a couple of clever ideas that make it perform very well. In fact it compresses better than Quantum and is also faster at compression (Quantum is faster at decompression; really the only disadvantage of LZMA is that its decompression is not super fast).

The match finder is very fast. I can't figure out what it's doing exactly. It makes strong hashes using the CRC. The docs claim it then goes to a patricia trie, but my god I can't see that in the code.
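For what it's worth, the hashing front end looks something like this generic sketch; the CRC table is the standard one, but the exact mixing and table sizes are my guesses from skimming, not the actual code:

    #include <cstdint>

    // Standard CRC-32 table, built once at startup; the SDK builds a
    // table like this and uses it to strengthen its hashes.
    static uint32_t crcTable[256];

    static void initCrcTable() {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t r = i;
            for (int j = 0; j < 8; j++)
                r = (r >> 1) ^ (0xEDB88320u & (0u - (r & 1)));
            crcTable[i] = r;
        }
    }

    // Hash the 4 bytes at p down to an index into a table of previous
    // positions; a chain or tree search would extend matches from there.
    // The mixing here is my guess at the flavor, not the SDK's code.
    static uint32_t hash4(const uint8_t* p, uint32_t hashMask) {
        uint32_t h = crcTable[p[0]] ^ p[1];  // CRC spreads the first byte's bits
        h ^= (uint32_t)p[2] << 8;
        h ^= crcTable[p[3]] << 5;
        return h & hashMask;
    }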

It does lots of binary arithcoding. In fact I don't think it has a multisymbol arithcoder at all; it just breaks larger alphabets into a series of binary codes. He does a pretty good job of laying out the binary trees so that the common cases of short lengths and low offsets get written in very few binary coding ops (that is, so the 0-or-1 coding choices are close to equal probability).
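To make that concrete, here's a little sketch of that kind of bit-tree decomposition. The structure and names are mine, but the 11-bit probabilities and the shift-by-5 adaptation match the constants in the SDK (kNumBitModelTotalBits = 11, kNumMoveBits = 5):

    #include <cstdint>
    #include <vector>

    // Adaptive binary model, LZMA-style: an 11-bit probability of the
    // bit being 0, nudged toward each observed bit by 1/32 of the gap.
    struct BitModel {
        uint16_t prob = 1024; // half of 2048, i.e. p(0) = 0.5

        void update(int bit) {
            if (bit == 0) prob += (2048 - prob) >> 5;
            else          prob -= prob >> 5;
        }
    };

    // A multisymbol alphabet coded as a walk down a binary tree, one
    // adaptive BitModel per node. An n-bit symbol costs n binary coding
    // ops, and each node adapts to its own spot in the tree.
    struct BitTree {
        std::vector<BitModel> models;
        int numBits;

        explicit BitTree(int bits) : models(size_t(1) << bits), numBits(bits) {}

        // encodeBit stands in for the range coder's encode-one-bit call.
        template <class EncodeBit>
        void encode(uint32_t symbol, EncodeBit encodeBit) {
            uint32_t node = 1; // root of the implicit tree
            for (int i = numBits - 1; i >= 0; i--) {
                int bit = (symbol >> i) & 1;
                encodeBit(models[node], bit);
                models[node].update(bit);
                node = (node << 1) | bit; // descend left or right
            }
        }
    };

Because each node's probability can get very skewed, the common symbols end up costing well under one bit per level, which is how binary-only coding stays competitive with a real multisymbol coder.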

There are some clever models. In general it's well tweaked to use few contexts but capture the salient information in those contexts. For example, the order-1 literal model only uses the top N bits of the previous char as context, since for most data the top bits are the good predictors and the bottom bits are semi-random. Another clever model uses the bottom 2 bits of position as part of the literal context. That's great for capturing the very common pattern in binary data of a 0 every fourth byte, or any every-2 or every-4-bytes kind of pattern (such as when you write a bunch of 32-bit floats and the exponent part is the same on all of them).
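That literal context selection boils down to a few shifts and masks. This sketch uses LZMA's lc/lp parameter names (lc = high bits of the previous byte, lp = low bits of position); lp = 2 here to get the every-4-bytes behavior, and both are configurable in the SDK:

    #include <cstdint>

    const int lc = 3; // use the top 3 bits of the previous byte
    const int lp = 2; // use the bottom 2 bits of the position

    // Selects one of 2^(lc+lp) order-1 literal models.
    uint32_t literalContext(uint32_t pos, uint8_t prevByte) {
        uint32_t posBits  = pos & ((1u << lp) - 1); // catches every-2/every-4 patterns
        uint32_t charBits = prevByte >> (8 - lc);   // top bits are the good predictors
        return (posBits << lc) | charBits;
    }

With lc=3 and lp=2 that's only 32 contexts total, which is tiny but captures exactly the structure described above.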

It also does a semi-optimal parse. From what I can tell, it finds the match possibilities N steps ahead and then does a kind of semi-backward cost evaluation from the end of that sequence to pick the best choice at the current coding spot. I believe N is only 4, so this is sort of like the old "lazy matching", except it looks ahead 4, and rather than the rough heuristic most people use in lazy matching, he actually evaluates a coding cost like a true optimal parse.
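My reading of that lookahead parse, as a sketch: collect the match options over the next N positions, run a cheapest-path pass over that little window in estimated bits, and commit only the first step. findMatches and the two cost functions here are hypothetical stand-ins (stubbed so it compiles), not the actual SDK interfaces:

    #include <algorithm>
    #include <vector>

    struct Match { int length; int distance; };

    // Hypothetical hooks; a real coder would wire these to the match
    // finder and the entropy models' bit prices.
    static std::vector<Match> findMatches(int /*pos*/) { return {}; }   // stub
    static float literalCostBits(int /*pos*/)          { return 8.0f; } // stub
    static float matchCostBits(const Match& /*m*/)     { return 16.0f; } // stub

    const int N = 4; // lookahead depth; my guess at what the parser uses

    // Returns 0 to code a literal at 'pos', or (match index + 1) to code
    // that match: cheapest path over the next N positions in estimated
    // bits, then walk back to recover the first decision.
    int pickStep(int pos) {
        const float INF = 1e30f;
        std::vector<float> cost(N + 1, INF); // cost[i]: bits to reach pos+i
        std::vector<int> from(N + 1, -1);    // predecessor offset on best path
        std::vector<int> choice(N + 1, 0);   // decision taken to get here
        cost[0] = 0.0f;

        for (int i = 0; i < N; i++) {
            if (cost[i] >= INF) continue;
            // Option 1: code a literal, advance one position.
            float lcost = cost[i] + literalCostBits(pos + i);
            if (lcost < cost[i + 1]) {
                cost[i + 1] = lcost; from[i + 1] = i; choice[i + 1] = 0;
            }
            // Option 2: code a match, advance by its length (clamped).
            std::vector<Match> ms = findMatches(pos + i);
            for (size_t m = 0; m < ms.size(); m++) {
                int j = std::min(i + ms[m].length, N);
                float mcost = cost[i] + matchCostBits(ms[m]);
                if (mcost < cost[j]) {
                    cost[j] = mcost; from[j] = i; choice[j] = (int)m + 1;
                }
            }
        }
        // Back up from the end of the window to the first step taken.
        int i = N;
        while (from[i] > 0) i = from[i];
        return choice[i];
    }

Commit that first step, code it, slide the window forward one decision, and repeat.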

On text it's very good, but on binary it really kicks ass. I'm guessing the biggest win there is the pos-conditioned contexts, which are really good for binary.

1 comment:

hari said...

Dear,

I am trying to analyse the LZMA algorithm using version 18.06 of the LZMA SDK, and I have some doubts about it. I am requesting your help in understanding these things, as the documentation is not very explanatory.

1. In the LZMA page on Wikipedia, table 4 gives information about state transitions on the occurrence of LIT, MATCH, LONGREP and SHORTREP. Could you please explain on what basis these 12 states are derived?

2. Why "BitModelTotalBits" is having value of 11?
Can I consider it as base value and considering this as a base value, every calculation is done. Also we can take any value as a base value of Range Encoding calculations..??
Also, "kTopValue" is set to 24. What is reason behind selecting this specific value..??

3. In range decoding, we use two types of coding, based on probabilities:
i. binary symbols with fixed and equal probabilities (direct bits)
ii. binary symbols with predicted probabilities
On what basis do we decide which type to use?
Also, in the second type, the probability increment and decrement factors are not identical. Why is this so?

4. Where is the Markov chain concept used in the LZMA decoder code "LzmaSpec.cpp"?

Waiting for your response.

Regards,
Hariom
