10-24-11 - LZ Optimal Parse with A Star Part 1

First two notes that aren't about the A-Star parse :

1. All good LZ encoders these days that aren't optimal parsers use complicated heuristics to try to bias the parse towards a cheaper one. The LZ parse is massively redundant (many parses can produce the same original data) and the cost is not the same. But in the forward normal parse you can't say for sure which decision is cheapest, so they just make some guesses.

For example the two key crucial heuristics in LZMA are :

// LZMA lastOffset heuristic :
  if (repLen >= 2 && (
        (repLen + 1 >= mainLen) ||
        (repLen + 2 >= mainLen && mainDist >= (1 << 9)) ||
        (repLen + 3 >= mainLen && mainDist >= (1 << 15))))
    // take repLen instead of mainLen

// LZMA lazy heuristic :

        // Truncate current match if match at next position will be better (LZMA's algorithm)
        if (nextlen >= prevlen && nextdist < prevdist/4 ||
            nextlen == prevlen + 1 && !ChangePair(prevdist, nextdist) ||
            nextlen > prevlen + 1 ||
            nextlen + 1 >= prevlen && prevlen >= MINLEN && ChangePair(nextdist, prevdist))
             return MINLEN-1;
        } else {
             return prevlen;

One is choosing a "repeat match" over a normal match, and the next is choosing a lazy match (literal then match) over greedy (immediate match).

My non-optimal LZ parser has to make similar decisions; what I did was set up a matrix and train it. I made four categories for a match based on what the offset is : {repeat, low, medium, high } , so to decide between two matches I do :

  cat1 = category of match 1 offset
  cat2 = category of match 2 offset
  diff = len1 - len2

  return diff >= c_threshold[cat1][cat2]
so c_threshold is a 4x4 matrix of integers. The values of threshold are in the range [0,3] so it's not too many to just enumerate all possibilities of threshold and see what's best.

Anyway, the thing about these heuristics is that they bias the parse in a certain way. They assume a certain cost tradeoff between literals vs. repeat matches vs. normal matches, or whatever. When the heuristic doesn't match the file they do badly. Otherwise they do amazingly well. One solution might be having several heuristics trained on different files and choose the one that creates the smallest output.

Also I should note - it's not trivial to tell when you have the heuristic wrong for the file. The problem is that there's a feedback loop between the parse heuristic and the statistical coder. That causes you to get into a local minimum, and you might not see that there's a better global minimum which is only available if you make a big jump in parsing style.

Repeating that more explicitly : the statistical coder will adapt to your parse heuristic; say you have a heuristic that prefers low offsets (like the LZMA heuristic above), that will cause your parse to select more low offsets, that will cause your statistical backend to code low offsets smaller. That's good, that's what you want, that's the whole point of the heuristic, it skews the parse intentionally to get the statistics going in the right direction. The issue then is that if you try to evaluate an alternative parse using the statistics that you have, it will look bad, because your statistics are trained for the parse you have.

2. It's crazy that LZ compression can do so well with so much redundancy in the code stream.

Think about it this way. Enumerate all the compressed bit streams that are created by encoding all raw bit streams up to length N.

Start with the shortest compressed bit stream. See what raw bits that decodes to. Now there may be several more (longer) compressed bit streams that decode to that same raw bit stream. Mark those all as unused.

Move to the next shortest compressed bit stream. First if there is a shorter unused one, move it into that slot. Then mark all other output streams that make the same raw bits as unused, and repeat.

For example a file like "aaaaaaa" has one encoding that's {a, offset -1 length 6} ; but there are *tons* of other encodings, such as {a,a,[-1,5]} or {a,[-1,3],[-1,3]} or {a,[-1,2],[-2,4]} etc. etc.

All the encodings except the shortest are useless. We don't want them. But even the ones that are not the best are quite short - they are small encodings, and small encodings take up lots of code space (see Kraft Inequality for example - one bit shorter means you take up twice the code space). So these useless but short encodings are real waste. In particular, there are other raw strings that don't have such a short encoding that would like to have that output length.

Anyhoo, surprisingly (just like with video coding) it seems to be good to add even *more* code stream redundancy by using things like the "repeat matches". I don't think I've ever seen an analysis of just how much wasted code space there is in the LZ output, I'm curious how much there is.

Hmm we didn't actually get to the A Star. Next time.


jfb said...

Charles, I'm not particularly knowledgeable about compression, but I'm wondering if you could tell me:

About how small can a better-than-RLE decompressor get? liblzf without bounds checking is 70 bytes + memcpy on ARM Thumb 2, for instance. Can one realistically get something simpler/smaller?

I've got a few cases where I have 4KB of code space and have been thinking of compressing the .data section (*very* compressible), but with .data being only say 100-150 bytes uncompressed I haven't found any method that pays off significantly once the decompressor is included. Wondering if you might know of any.



Yann said...

Matt Mahoney just posted an interesting entry about Crinkler, a compressor which tries to do just that : compress little demos (about ~4K) as much as possible, with a decoder code as small as possible.


But even Crinkler is not targeting anything at 100-150 bytes ranges, since it is itself ~210 bytes large. One has to wonder : why trying so much to compress something that little ? What's the expected gain, and aren't there any other budget which might be better target for "savings" ?

cbloom said...

Interesting; Crinkler assumes that it has tons of memory available at decompress time and takes advantage of that.

For the case of tiny data packets, the key will definitely be heavy optimization at encode time.

One interesting option if you really want the max possible theoretical win on tiny data packets is you could actually generate the executable code that reproduces that data. Normally that search is impossibly large, but on such tiny data it might be possible.

Will said...

http://www.retroprogramming.com/2009/06/stuff-programmers-love-to-play-with-1.html puts LZ at <50 bytes for x86

cbloom said...

liblzf looks huge and over-complicated for this application.

If memcpy is not included, an LZ decoder should be a lot smaller than 50 bytes.

A "Predictor" (aka Finnish) decoder can be done in about 10 instructions.

jfb said...

Charles, I looked up predictor decompressor and ran into


Is this the kind you were talking about? I just tried compressing some test data and it works shockingly well considering how understandable and simple the algorithm is. I'll give it a try on my device this weekend.

jfb said...

Max possible theoretical would be great, but I'm not sure how one would start such a search: Even finding 'codes that will produce the given data' is a bit of a problem, as there could be looping code, self-modifying code, some combination of the two, etc. all of which wouldn't just be one-instruction permutations from another...

So you couldn't just do a 'toggle this and judge the quality', as the vast majority won't work at all. One could alter instructions to equivalent instructions in some fashion but that wouldn't likely generate a clever decompressor...

How would you go about it and/or has someone tried? It might be fun weekend reading, though I have a sneaking suspicion it'd be exponential time ;) Then again, reminded of the title, pathfinding can be too but people do it every day..

cbloom said...

Yeah, that's Predictor. Of course the decompressor is smaller and faster if you do it branchless, something like :

// flag is 0 or 1
*out++ = flag ? *in : table[hash];
in += flag;

cbloom said...

God damn you blogger make comments editable.

extrememoderate said...

Let me remind you that you still didn't get to A Star. ;)

Unknown said...

Hi Charles,
Very interesting blog post. Can't wait for part 2, this is really useful stuff.

I played around with A* a bit while implementing lzham, but I got caught up on how to implement a decent admissible heuristic so I moved on to lower hanging fruit.


old rants