I've been trying to figure out LZMA for a while, if anyone can help please chime in.
LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular,
it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse,
which there were many of before LZMA, and yet it trounces them all.
So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses
a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are
coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a
full spec).
LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It
also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.
Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial
context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom).
This is pretty standard.
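To make the bit-shifting concrete, here's a minimal sketch of that literal coder, assuming LZMA's usual 11-bit probabilities and move-bits of 5; the range coder itself and the previous-character/position context selection are omitted, and all the names are mine, not Igor Pavlov's :

```cpp
#include <cassert>

// One adaptive binary probability, 11-bit fixed point as in LZMA
// (total = 2048, adaptation shift = 5).
struct Bit {
    unsigned p = 1024; // probability of a 0 bit, out of 2048
    void update(int bit) {
        if (bit) p -= p >> 5;            // saw a 1 : shift p toward 0
        else     p += (2048 - p) >> 5;   // saw a 0 : shift p toward 2048
    }
};

// A literal is coded as 8 bits, top to bottom. The model index starts at 1
// and each coded bit is shifted in, so every bit is conditioned on the bits
// already sent. (The real coder selects this whole table by previous
// character + position bits, and feeds each bit to a range coder.)
struct LiteralCoder {
    Bit probs[0x100]; // one adaptive model per partial-bit context
    unsigned code(unsigned char lit) {
        unsigned ctx = 1;
        for (int i = 7; i >= 0; i--) {
            int bit = (lit >> i) & 1;
            probs[ctx].update(bit); // adapt; a range coder would code here
            ctx = (ctx << 1) | bit; // shift the coded bit into the context
        }
        return ctx; // always ends up as 0x100 | lit
    }
};
```

The final context value is just the literal with a leading marker bit, which is the same bit-tree trick LZMA uses for lengths and distance slots.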
Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword
(or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around
for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data.
The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long"
flag. Other match-related coding events don't use it.
In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do
that and just always uses N bits of position (I think N=2 is a typical good value).
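The position context is literally just the bottom bits (the "pb" parameter in LZMA terms); a one-liner, shown here to emphasize that a struct stride like 72 is only captured mod 1<<N :

```cpp
// pos_state : low N bits of the output position, used to split statistics.
// With pb_bits = 2, positions 71 and 72 land in different buckets, but a
// 72-byte stride is only tracked modulo 4, not modulo 72.
unsigned pos_state(unsigned pos, unsigned pb_bits) {
    return pos & ((1u << pb_bits) - 1);
}
```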
Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets
are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the
length of the match as context; this captures length-offset correlation. Then, the bottom 4 bits of the offset are
sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like
a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are
sent without compression.
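Here's a rough sketch of that decomposition - MSB position, raw middle bits, bottom 4 "align" bits. This is the shape of the scheme as described above, not LZMA's exact dist-slot layout, and the names are my own :

```cpp
#include <cassert>

struct OffsetParts {
    int msb;          // position of the top set bit (context coded)
    unsigned middle;  // bits between the MSB and the bottom 4 (sent raw)
    unsigned align;   // bottom 4 bits (context coded, bottom bit first)
};

// Split a nonzero offset the way the text describes. For tiny offsets the
// middle part just comes out empty.
OffsetParts split_offset(unsigned offset) {
    int msb = 31;
    while (msb > 0 && !((offset >> msb) & 1)) msb--;
    unsigned below  = offset & ((1u << msb) - 1); // bits below the MSB
    unsigned align  = offset & 0xF;
    unsigned middle = below >> 4;
    return { msb, middle, align };
}
```

Because the align bits get their own adaptive models, offsets that are all multiples of 4 or 8 (bottom bits always zero) cost almost nothing extra.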
The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :
LZMA state machine :

Literal :
  state < 7  : normal literal
  state >= 7 : delta literal
  next state :
    state in [0-3]   -> state = 0
    state in [4-9]   -> state -= 3  (-> [1-6])
    state in [10-11] -> state -= 6  (-> [4-5])

Match :
  rep0, len 1   : state -> state < 7 ? 9 : 11
  rep0, len > 1 : state -> state < 7 ? 8 : 11
  rep1/2/3      : state -> state < 7 ? 8 : 11
  normal match  : state -> state < 7 ? 7 : 10
// or from Igor Pavlov's code :
static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5};
static const int kMatchNextStates[kNumStates] = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates] = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};
Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send
matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and
after any match encoding it is >= 7. Then above and below that it tells you something about how many literals or matches you've
recently encoded. For example :
initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0
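The tables from the source, wrapped in a tiny stepper so that trace can be checked mechanically (the enum and function are my own wrapper, the tables are Igor Pavlov's) :

```cpp
#include <cassert>

enum Op { LIT, MATCH, REP, SHORTREP };

static const int kNumStates = 12;
static const int kLiteralNextStates[kNumStates]  = {0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5};
static const int kMatchNextStates[kNumStates]    = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]      = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates] = {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

// One step of the FSM : pick the next state from the table for the op.
int next_state(int state, Op op) {
    switch (op) {
        case LIT:      return kLiteralNextStates[state];
        case MATCH:    return kMatchNextStates[state];
        case REP:      return kRepNextStates[state];
        default:       return kShortRepNextStates[state]; // SHORTREP
    }
}
```

Running the example trace above through next_state reproduces 5 -> 7 -> 11 -> 5 -> 2 -> 0.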
Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to
be an advantage, but other coders
have used the previous match flags as context for the match flag coding (eg. whether the last thing was a match is one bit; take the last
three of those and you get 8 states of previous context), which seems to me to have about the same effect.
There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match,
the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is
a delta literal ?
Delta literals are coded as :
char literal = *ptr;
char * lastPosPtr = ptr - lastOffset;
char delta = literal ^ *lastPosPtr;
that is, the character is xor'ed with the character lastOffset bytes back from the current pointer (using the last coded *offset*, not
the last coded *position* - that distinction is important for structured data).
When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's
not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another,
these characters are in fact guaranteed to *NOT* match. If they did match, then the preceding match would have just been one longer.
And that's what's cool about this.
Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file
(where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to
match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to
match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.
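The transform itself is one line; the toy buffer below just demonstrates the guarantee on a concrete example (buffer and positions are mine) :

```cpp
// Delta literal : xor the current byte with the byte lastOffset back.
// At the byte right after a rep0 match ends, this can never be zero -
// a zero delta would mean the match should have been one byte longer.
unsigned char delta_literal(const unsigned char * ptr, int lastOffset) {
    return *ptr ^ ptr[-lastOffset];
}
```

For example in "abcabd", after matching "ab" at offset 3, the next byte 'd' fails to match 'c', and the coded delta is 'd' ^ 'c', which is guaranteed nonzero.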
Okay, I think that's it for how the coding works. A few more tidbits :
The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurrence of the
first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches.
The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it
walks back. That is, it starts at the lowest offset occurrence of the substring and finds the match length for that, then it steps to
the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets
are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).
One thing I can't figure out is how the binary tree is maintained with the sliding window.
ADDENDUM : I just found it described in the "Handbook of Data Compression" by David Salomon, Giovanni Motta, and David Bryant.
My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of
the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for
a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree,
you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is
the tree can be arbitrarily unbalanced, but that's not a catastrophe, it's never worse than just a straight linked list, which
is the alternative.
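A brute-force sketch of the candidate set this produces : walk occurrences from smallest offset to largest and keep one only if it beats the best length so far. The real binary tree gets this ordering for free from its insertion order; this just illustrates the filtering (the function and names are mine) :

```cpp
#include <vector>

struct Cand { int offset; int len; };

// Return match candidates at buf[pos], nearest occurrence first, keeping a
// larger offset only if it gives a strictly longer match.
std::vector<Cand> find_matches(const unsigned char * buf, int pos, int size) {
    std::vector<Cand> out;
    int best = 0;
    for (int p = pos - 1; p >= 0; p--) { // smallest offset first
        int len = 0;
        while (pos + len < size && buf[p + len] == buf[pos + len]) len++;
        if (len > best) { out.push_back({ pos - p, len }); best = len; }
    }
    return out;
}
```

So in "abcXabcYabc" at the final "abc", only the offset-4 match survives; the equal-length match at offset 8 is never reported.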
The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limited branch space (similar
to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree,
it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each
position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match),
but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries.
I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper
than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps.
If anyone has figured it out, please help me out.
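For reference, the bare-bones forward optimal parse that this is presumably an elaboration of looks like the following. This is NOT LZMA's actual cost model or heuristics - the prices are made-up flat constants and there's no FSM state or rep-offset tracking - it's just the left-to-right dynamic program :

```cpp
#include <vector>
#include <algorithm>
#include <utility>
#include <climits>

// matches[i] lists (length, offset) pairs available at byte i.
// cost[i] = cheapest total price to code the first i bytes, filled forward.
int optimal_parse_cost(const std::vector<std::vector<std::pair<int,int>>> & matches,
                       int lit_price, int match_price) {
    int n = (int)matches.size();
    std::vector<int> cost(n + 1, INT_MAX);
    cost[0] = 0;
    for (int i = 0; i < n; i++) {
        if (cost[i] == INT_MAX) continue;
        // take a literal at i
        cost[i + 1] = std::min(cost[i + 1], cost[i] + lit_price);
        // take each available match at i
        for (auto & m : matches[i]) {
            int end = std::min(n, i + m.first);
            cost[end] = std::min(cost[end], cost[i] + match_price);
        }
    }
    return cost[n];
}
```

The hard part LZMA adds on top is that prices depend on the FSM state and the rep-offset MTF, which is exactly the state it carries along in the parse without updating the full adaptive models.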
ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are
72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte
which is often very correlated, and some bytes that are near random. So we might have something like :
[00 00 40 00] [7F 00 3F 71] ... 72-8 bytes ... [00 00 40 00] [7E 00 4C 2F]
... history ... * start here
we will encode :
00 00 40 00 :
4 byte match at offset 72
(offset 72 is probably offset0 so this is a rep0 match)
7E :
delta literal
encode 7E ^ 7F = 1
00 :
one byte match to offset 72 (rep0)
4C :
delta literal
encode 4C ^ 3F = 0x73
2F :
regular literal
Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will
be captured very well.
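Replaying the example bytes with lastOffset = 72 : equal bytes extend a rep0 match, differing bytes become delta literals coded as (cur ^ prev). This little classifier (mine, for illustration) ignores the state machine, which is why the real coder sends the final 2F as a regular literal rather than a delta literal :

```cpp
// -1 means "covered by the rep0 match", otherwise the xor'ed delta value.
int classify(unsigned char cur, unsigned char prev) {
    return cur == prev ? -1 : (cur ^ prev);
}
```

On the struct above : the first four bytes all classify as rep0 match, 7E vs 7F gives delta 0x01, 00 vs 00 is a one-byte rep0 match again, and 4C vs 3F gives delta 0x73.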
Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data
and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the struct
is a float, and you could easily confuse it : if your struct was 73 bytes long, for example, the bottom bits of position would no longer
cycle with the struct, and the position contexts would stop helping).