09-11-12 - LZ MinMatchLen and Parse Strategies

I've been thinking about this and want to get my thoughts down while they're (somewhat) clear.

We've seen in the past that changing the LZ "min match len" can give big compression wins on some files. eg. say you have some LZ coder which normally has a minmatchlen of 3, you instead set minmatchlen to 4 so that no length 3 matches are emitted, and you find that the compressed size is smaller. ( see, for example, post on png-alike )

(for concreteness, assume a Huffman coded LZ, so that if you don't emit any matches of length 3, then that symbol takes up no code space; no prefix code is reserved for it if it doesn't occur)

Now, perhaps this is not a surprise in a normal parse, because a normal parse has some heuristic about whether a match is better than a literal (or a lazy match is better, etc.), and if that heuristic doesn't match the file then the normal parse will not find the optimum.

But this is also true for (typical) optimal parsers. You would think that if the matches of length 3 hurt compression, the optimal parser would not choose them. So what's going on?

Well, first of all, the problem is that the matches of length 3 do *help* compression, in a greedy local optimization sense. That is, if you have some assumption about the Huffman code lengths so that you can measure the cost of each choice, and you just ask "what's smaller to code, these 3 literals or this length-3 match?" the answer is the length-3 match. That's what makes this a tricky case; if it was more expensive it wouldn't get chosen.

But you can see what caused the problem - in order to do the optimal parse we had to make an assumption about the initial Huffman code lengths to compute code costs. This biases us towards a certain strategy of parse. The "optimal parse" is then just a local optimization relative to that seed; it can't find radically different parses.

In particular, what's happening with these files is that *if* you generate length-3 matches, they are slightly cheaper than 3 literals, but only barely so. However, if you don't generate any length-3 matches, that makes your other prefix codes shorter; in particular the literals get shorter and the length-4 matches get shorter. The result is a smaller file overall.

With an optimal parser, you could find the better parse by using a different seed Huffman cost, rather than using the hard rule of changing the minMatchLen. What you'd have to do is parse a bunch of files, pick a set of representative Huffman code lengths that are quite different from each other, then you can run your optimal parse seeded from each one and take the best. This is just giving you a bunch of different initial positions for your local search in an attempt to find the global minimum.

In heuristic lazy parsers (like LZMA ( see this blog post )) there are some tweaks about "should I prefer this match or that one" or "should I prefer a literal or match". You have the same problem there, that the parse strategy affects the statistical coder, and the statistical coder affects the parse strategy. The heuristics are tweaked to guide the parse in a way that we expect to be good, but its not the best way on all files.

For a while I've had this idea that I call "multi parse". The short description is to run many parses at once and take the best once. This is a bit different from a normal "optimal parse" in the sense that our specific goal is to avoid doing a single local minimization.

For example with an LZMA style heuristic parser, you could run several parses with different constants in the normal match vs repeat match and match vs lazy match decisions, as well as min match len.

The key thing is that you can run a "multi parse" as an inner loop, rather than an outer loop. That is, you run all the parses at once as you step over the file, rather than tweaking parameters on the outside and running the entire compression several times :

tweak outer params :

for each parse strategy S

  for each byte B
    use S on B

"multi-parse" :

for each byte B

  for each parse strategy S

    use S on B

this is a big difference, because a lot of work can be shared. In particular, the string matcher only needs to be run once for all the parse strategies. Also in many cases their work is redundant; eg. if you don't find any matches then all strategies must output a literal.

But the real big win comes once you realize that parse strategy doesn't have to be selected for the whole file. It can vary throughout the file, and different strategies may be best in different regions. If you have the timeline of all the strategies layed out, then you can try to find a shortest-path serpentine walk across the strategies.


Yann Collet said...

This almost looks like a CM strategy

nothings said...

It can vary throughout the file, and different strategies may be best in different regions.

Sweet. A real way to leverage blocks.

cbloom said...

@Yann - yeah, it's the same kind of philosophy. We don't know a-priori which model will be best, so we just run them all and then try to pick once we see their performance on the file. In this case we have to select, not mix, so it's more like the old PPM LOE or "switching" schemes.

old rants