8/20/2010

08-20-10 - Deobfuscating LZMA

I've been trying to figure out LZMA for a while, if anyone can help please chime in.

LZMA is very good, and also very obscure. While the code is free and published, it's completely opaque and the algorithm is not actually described anywhere. In particular, it's very good on structured data - even better than PPM. And, superficially, it looks very much like any other LZ+arithmetic+optimal parse, which there were many of before LZMA, and yet it trounces them all.

So, what's going on in LZMA? First, a description of the basic coder. Most of LZMA is very similar to LZX - LZX uses a forward optimal parse, log2-ish encoded lengths and offsets, and the most recent 3 offsets in an MTF/LRU which are coded as special "repeat match" codes. (LZX was made public when Microsoft made it part of CAB and published a full spec ).

LZMA can code a literal, a match, or a recent offset match - one of the three most recent offsets (like LZX). This is pretty standard. It also has two coding modes that are unusual : "Delta Literal" coding, and the 0th most recent offset match can code a single character match.

Everything it codes is context-coded binary arithmetic coded. Literals are coded as their bits; the initial context is the previous character and a few bits of position, and as literal bits are coded they are shifted into the context for future bits (top to bottom). This is pretty standard.

Using a few bits of position as part of the context lets it have different statistics at each byte position in a dword (or whatever). This is very useful for coding structured data such as arrays of floats. This idea has been around for a long time, but older coders don't do it and it certainly is part of the advantage on array/structured data. The bottom bits of position are also used as part of the context for the match flag, and also the "is last match 0 long" flag. Other match-related coding events don't use it.

In theory you should figure out what the local repetition period is and use that; LZMA doesn't make any effort to do that and just always uses N bits of position (I think N=2 is a typical good value).

Lengths and Offsets are coded in what seems to be mostly a pretty standard log2-ish type coding (like Zip and others). Offsets are coded as basically the position of their MSB and then the remaining bits. The MSB is context-coded with the length of the match as context; this capture length-offset correlation. Then, the bottom 4 bits of the offset are sent, binary arithmetic coded on each other in reverse order (bottom bit first). This lets you capture things like a fixed structure in offsets (eg. all offsets are multiples of 4 or 8). The bits between the MSB and the bottom 4 are sent without compression.

The binary match flags are context coded using the "state" , which is the position in an internal finite state machine. It is :


LZMA state machine :

Literal :

  state < 7 :
    normal literal
  state >= 7 :
    delta literal

  state [0-3] -> state = 0
  state [4-9] -> state -= 3 ([1-6])
  else state -= 6 [10-11] -> ([4-5])


Match :

  rep0
   len 1 : 
     state ->   < 7 ? 9 : 11
   len > 1 :
     state ->   < 7 ? 8 : 11

  rep12
     state ->   < 7 ? 8 : 11

  normal  
     state ->   < 7 ? 7 : 10


// or from Igor Pavlov's code :

static const int kLiteralNextStates[kNumStates] = {0, 0, 0, 0, 1, 2, 3, 4,  5,  6,   4, 5};
static const int kMatchNextStates[kNumStates]   = {7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10};
static const int kRepNextStates[kNumStates]     = {8, 8, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11};
static const int kShortRepNextStates[kNumStates]= {9, 9, 9, 9, 9, 9, 9, 11, 11, 11, 11, 11};

Okay, this is the first funny unique thing. State basically tells you what the last few coding operations have been. As you send matches, state gets larger, as you send literals, state gets smaller. In particular, after any literal encoding state is < 7, and after any match encoding it is > 7. Then above and below that it tells you something about how many literals or matches you've recently encoded. For example :


initial state = 5
code a normal match -> 7
code a rep match -> 11
code a literal -> 5
code a literal -> 2
code a literal -> 0

Now it's unclear to me whether this funny state machine thing is really a huge win as a context; presumably it is tweaked out to be an advantage, but other coders have used the previous match flags as context for the match flag coding (eg. was the last thing a match is one bit, take the last three that gives you 8 states of previous context), which seems to me to have about the same effect.

There is one funny and clever thing here though, and that's the "delta literal". Any time you code a literal immediately after a match, the state will be >= 7 so you will code a delta literal. After that state will fall below 7 so you will code normal literals. What is a delta literal ?

Delta literals are coded as :


    char literal = *ptr;
    char lastPosPtr = ptr - lastOffset;
    char delta = literal ^ *lastPosPtr;

that is, the character is xor'ed with the character at the last coded match offset away from current pointer (not at the last coded pos, the last offset, that's important for structured data).

When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better. For another, these character are in fact gauranteed to *NOT* match. If they did match, then the preceeding match would have just been one longer. And that's what's cool about this.

Immediately after a match, you are in a funny position : you have a long preceding context which matches some other long context in the file (where the match was). From PPM* and LZP we know that long contexts are very strong predictors - but we also know that we have failed to match that character! If we just use the normal literal coder, we expect the most likely character to be the one that we just failed to match, so that would be a big waste of code space. So instead, we use this delta literal coder which will let us statistically exclude the zero.

Okay, I think that's it for how the coding works. A few more tidbits :

The match finder in 7zip appears to be a pretty standard hash-to-binary-tree. It uses a hash to find the most recent occurance of the first few chars of the current string, that points to a node in the binary tree, and then it walks the binary tree to find all matches. The details of this are a little bit opaque to me, but I believe it walks backwards in order, and it only finds longer matches as it walks back. That is, it starts at the lowest offset occurance of the substring and finds the match length for that, then it steps to the next later one along the binary tree, and finds a match *if longer*. So it doesn't find all offsets, it presumes that larger offsets are only interesting if their matches are longer. (I'm a little unclear on this so this could be wrong).

One thing I can't figure out is how the binary tree is maintained with the sliding window.

ADDENDUM : I just found it described in "Handbook of Data Compression By David Salomon, Giovanni Motta, David (CON) Bryant". My description above of the binary tree was basically right. It is built in the "greedy" way : new nodes are added at the top of the tree, which means that when you are searching down the tree, you will always see the lowest offset possible for a given match length first, so you only need to consider longer lengths. Also since older nodes are always deeper in the tree, you can slide the window by just killing nodes and don't have to worry about fixing the tree. Of course the disadvantage is the tree can be arbitrarily unbalanced, but that's not a castrophe, it's never worse than just a straight linked list, which is the alternative.

The big piece I'm missing is how the optimal parse works. It's a forward optimal parse which explores a limitted branch space (similar to previous work that was done in Quantum and LZX). When it saves state in the optimal parse tree, it only updates the FSM "state" variable and the last 3 offsets, it doesn't update the whole context-arithmetic state. At each position it appears to consider the cost of either a literal, a match, or a "lazy" match (that's a literal and then the following match), but I haven't figured out the details yet. It seems to optimal parse in 4k chunks, maybe it updates the arithmetic state on those boundaries. I also see there are lots of heuristics to speed up the optimal parse, assumptions about certain coding decisions being cheaper than others without really testing them, hard-coded things like (if offset > (1 << 14) && length < 7) which surely helps. If anyone has figured it out, please help me out.


ADDENDUM : here's an illustration of how the special LZMA modes help on structured data. Say you have a file of structs; the structs are 72 bytes long. Within each struct are a bunch of uint32, floats, stuff like that. Within something like a float, you will have a byte which is often very correlated, and some bytes that are near random. So we might have something like :


[00,00,40,00] [7F 00 3F 71] ... 72-8 bytes ... [00,00,40,00] [7E 00 4C 2F]
... history ...                                * start here

we will encode :

00,00,40,00 : 
  4 byte match at offset 72
  (offset 72 is probably offset0 so this is a rep0 match)

7E :
  delta literal
  encode 7E ^ 7F = 1

00 :
  one byte match to offset 72 (rep0)

4C :
  delta literal
  encode 4C ^ 3F = 0x73

2F :
  regular literal

Also because of the position and state-based coding, if certain literals occur often in the same spot in the pattern, that will be captured very well.

Note that this is not really the "holy grail" of compression which is a compressor that figures out the state-structure of the data and uses that, but it is much closer than anything in the past. (eg. it doesn't actually figure out that the first dword of the structure is a float, and you could easily confuse it, if your struct was 73 bytes long for example, the positions would no longer work in simple bottom-bits cycles).

3 comments:

ryg said...

On the delta literal stuff: "When I first saw this I thought "oh it's predicting that the char at the last offset is similar, so the xor makes equal values zero" , but that's not the case at all. For one thing, xor is not a great way to handle correlated values, subtract mod 256 would be better."
There's a crucial bit that you're missing here. In structured binary data, even on a non-match, the top bits of the byte that failed to match are usually a good predictor for the top bits of the byte you're trying to code (I'll give an example in a minute). With xor, matching top bits are guaranteed to be zero; with sub they could be either all-0 or all-1 depending on whether a carry occurs or not. Assuming that top bits do tend to match, the sub has a near-random MSB which will take roughly 1 bit to code, while xor results in a MSB that's most likely zero, same as the other bits.

Why top bits would match: A lot of values in binary files are either small integers (clustering near zero) or offsets (which exhibit high locality). Say you have a 200k app binary. All called addresses are gonna be within a 200k window of the code segment base address. That's a bit less than 18 bits worth of payload, with the rest (=the top bits) always being equal.

To exploit this even more, the preprocessing filters used for things like x86 code compression usually turn call offsets (which are relative to make the code relocatable) into absolute addresses and byte-swap them into big endian. Why does this help?

Absolute addresses help because calls to the same function are now identical, instead of just having similar offsets (kinda similar doesn't cut it for LZ or context coders!). Big endian helps because the bytes that are likely to be equal (the high bits of the address) are now grouped together with the call instruction. And the xor trick means that the bytes that don't match perfectly are still likely to agree in the top few bits.

cbloom said...

" with sub they could be either all-0 or all-1 depending on whether a carry occurs or not. Assuming that top bits do tend to match, the sub has a near-random MSB which will take roughly 1 bit to code, while xor results in a MSB that's most likely zero, same as the other bits."

I didn't mean that you would actually sub. I meant you would either use (sub + 128) or sub folded. That's what compression people mean when they say "sub" ;)

Xor does weird things. If you actually thought the values were predicted to be close, xor would be bad. For example 0x40 and 0x3F are 1 apart, but xor'd they are 0x7F.

ryg said...

"I didn't mean that you would actually sub. I meant you would either use (sub + 128) or sub folded. That's what compression people mean when they say "sub" ;)"
sub + 128 == sub ^ 128 so the MSB is just as random. Folding would work though.

"Xor does weird things. If you actually thought the values were predicted to be close, xor would be bad. For example 0x40 and 0x3F are 1 apart, but xor'd they are 0x7F."
Depends on how you're coding and what your expectations of closeness are. Xor corresponds to a notion of closeness in terms of hamming distance, not absolute value of the difference. The tree-structured binary contexts that LZMA uses are well-suited to exploit that kind of structure where it exists - much better than a standard "counting-based" model for a multisymbol arith coder.

old rants