10/10/2008

10-10-08 - 7

On LZ Optimal Parsing

So the LZ-Huffman I've done for RAD/Oodle does some new stuff with optimal parsing. I want to write up what I'm doing so I remember it, and I'm also going to try to write about the semi-optimal parsing that a lot of people are doing now.

First let's review a bit. In LZ coding you have a lot of choices about how you code things. Greedily choosing the longest match at each point dOpoes not produce the best code stream. There is a very good ancient paper by Storer and Szymanski that introducing optimal parsing, and their method is both fast and 100% optimal, but it is only "optimal" in the sense of coding the fewest number of literals possible. See my old notes on Storer-Szymaski optimal parsing . Now, for the old LZSS coder that used fixed bit coding (8 bit literals, 24 bit matches, or whatever), that optimality is also the same as code length optimality. But in reality what we care about is code length optimality - we want to produce the shortest possible file - which is not necessarilly the same as coding the fewest possible literals.

There are sort of two interesting domains of modern LZ coders, and optimal parsing matters a whole lot to both of them. One domain is the super fast, super simple LZ coder like my LZH. Optimal parsing is important to LZH because the matches are so important to compression, you really want to get the best matches you can. The other domain is something like LZMA or BALZ (an ROLZ) that does a lot of context coding of literals. In those cases, the context coder for literals is so good that you really want to only code good matches, and often the literals will be so cheap you want to send them as literals; also the cost of literals and matches is highly variable for those coders.

Optimal parsing is a path search in a graph. For game developers you can think about A* as I talk about this. You start yourself at the head of the file, and your goal is to walk to the end of the file in the shortest possible path. From each node in the graph, your exit paths are the possible coding operations at that point in the file, eg. write a literal, write match X of length Y. Each coding operation has a certain cost which is equal to the number of bits output if you do that coding operation, and then you step along that link to the next point in the file as specified by the match length.

Note that with the original LZSS we only needed to consider either matching or not matching at each position in the file, since any time you wrote a match the best thing to do was to write the maximum match length. With true code-optimal parsing, you have to consider all possible match offsets and all possible match lengths at each location (!!). This is a huge number of coding choices, but we can rule out almost all of them. (more on this later).

So, the brute force retarded optimal parser should be obvious at this point. Start with an empty code stream and push position 0 on the stack. Repeat forever : pop the stack. Try all coding operations at the position specified in the stack. Push all of those coding operations on the stack with the code stream they made, and the position of the end of the match. This works fine and produces a 100% optimal parse, but is insanely slow. If you have N coding choices at each position it makes N^L operations for a length L file.

With adaptive coding (eg. you're using an adaptive entropy coder for literals matches and offsets), you cannot get a 100% optimal parse unless you do this full search, because it does not have the optimal subgraph property. That is if you have the optimal path from A to C, and that path goes through B, that optimal path is not the same as taking he optimal path from A to B and the optimal path from B to C. To speed things up we are going to start making approximations, and we are going to generally assume a looser version of the optimal subgraph property. What this means in terms of LZ coding is that if you are at a certain position in the file, the best way to code the whole file does not necessarily include the best way to code from {begin} to the current position.

Let's revisit the original LZSS parse now. LZSS does no entropy coding, so it doesn't matter how you get to a given byte of the stream, you will code the remainder in the same way. This is obviously a "Dynamic Programming" problem. You have shared sub-portions of the problem that can be collapsed by storing their result in a table. The optimal path from point P to the end is the same regardless of how you get to P. You could now proceed by doing the brute force approach and search depth-first, and each time you find a path, you store all the sub-paths that you visited, and then you reuse them instead of recomputing them. Or you can do it a more elegant way which is identical, which is to walk backwards through the file and at each position P store the optimal path from the P to the end. When you find the optimal path at P obviously all the future subgraphs that you might need are already done because you're going backwards. So, we now think of the LZSS optimal parse as a way of solving the graph search using dynamic programming.

The case of a static entropy coder is simpler so let me focus on that first. Offsets and lengths and literals are written in some # of bits that is known in advance and we wish to make the shortest possible code stream. We can do a sort of LZSS backwards parse to do this quickly. It's not as simple as LZSS because we cannot assume that the best match choice at each step is the longest one.

Optimal Parser for Static Statistics :

We're going to walk backwards like LZSS. At each position already considered (future positions in the file), we have already found the optimal choice at that spot. We have various coding choices at the current position, either literal or various {offset,length} possibilities. The total cost of each of those coices is : {Cost to code the current choice} + {Table cost at current pos + match length} , that is the already computed cost to finish the walk after you step match length.

It should be obvious that we can solve this just by considering every coding operation possible at each position and walk backwards. That gives us a total cost of O(N*L) , which is a lot better than O(N^L) , but it's still slow because N is very large. So let's start making approximations.

What are all these coding choices that we have? We found various offsets that we matched at least min match length (usually 3). This gives us a bunch of possible offsets and the maximum match length at each offset : {Off1,Len1} , {Off2,Len2}. In general we also have to consider all lengths >= 3 at each offset.

The first approximation is the assumption that for any given match length, the lowest code cost comes from the smallest possible offset that matches with that length. That is, we assume that if Off1 < Off2 , then cost(Off1) <= cost(Off2). This is certainly true for all non-pathological files in the real world. This kind of assumption is also self-satisfying, that is by making this assumption, the optimal parser will prefer shorter offsets, which makes them more likely, which makes their code length shorter. It's a positive feedback loop; I'll talk more about feedback later.

With this approximation, we no longer need to consider all offsets and all match lengths. Now we only need to consider all match lengths, and the offset for any length is chosen for us - it's the lowest offset that codes that length. Our code choices now look like :


you find a match of length 7 at Off1, length 11 at Off2, length 22 at Off3
with Off1 < Off2 < Off3

choose from :

Length 3-7 at Off1
Length 8-11 at Off2
Lebgth 12-22 at Off3

Okay, so we could do this, but it's still too slow. We need to reduce our choices further.

One option is to only consider the longest possible length at each offset. That is, assume that if you code a length at a given offset, the best choice is the longest length at that offset. So in our example you would only choose from {7 at Off1, 11 at Off2, 22 at Off3}. That is sort of okay, but it actually gives up a lot of compression. It misses too many good choices. Apparently there are some good coding paths that involve taking a shorter than maximal match.

To get to the really cool optimization, we need to stop and think about this intuitively. Why would you not write a maximal match? It should almost always give us a shorter total code len, because writing a longer match is very cheap and lets us not code symbols after the match. The cost of a given match is


Cost(offset) + Cost(match length) + CostTable( at pos + match length )

Let's assume for now that the cost of the offset and the match length are seperable (they aren't used to predict each other). The cost of each match length is an entropy code that generally codes shorter lengths in fewer bits. It's sub-linear in match length, but it is generally increasing. However, it increases less than the average cost of future symbols. Note that for long match lengths, the cost of len and (len+1) may be exactly equal depending on how you code your lengths. For example of you use a semi-log2 code, then the cost of many large match lengths is identical. That means coding longer matches is free in terms of the current code cost.

The CostTable generally increases backwards linearly proportional to the average entropy. That is :


CostTable[ {end} ] = 0
CostTable[ end - 1 ] = cost to code form (end-1) to end
CostTable[ end - N ] = N * H

on average ; where H is the average entropy of one symbol

However, even though CostTable is linearly increasing backwards on average, it is not always. Consider for example what CostTable looks like in the neighborhood of a very good very long match. Say at some position P this long match is available with length L. At position (P+1) the best coding choice will be that same match with length (L-1), same at (P+2) etc. The cost table in this area will look like :

CostTable[ P-1 ] = Cost to code literal at (P-1) + Cost(L) + CostTable[ P + L ];
CostTable[ P   ] = Cost(L  ) + CostTable[ P + L ];
CostTable[ P+1 ] = Cost(L-1) + CostTable[ P + L ];
CostTable[ P+2 ] = Cost(L-2) + CostTable[ P + L ];
CostTable[ P+3 ] = Cost(L-3) + CostTable[ P + L ];

If L is large, like say 100 or more, the Cost(L) and the Cost(L-1) will be very very similar. In this region, CostTable[] is increasing as you step backwards, but it's only increasing very slowly, far more slowly than the average slope of H. We see that CostTable goes backwards like N*H in a line, but it deviates up and down across the line. Some areas you code worse than average entropy, some areas you code much better than average entropy. The areas where you code better than average entropy are candidates for choosing a shorter than maximal match length!

What are we really doing as we step back through the file and choose the best match length? We have this constant table Cost(L) that tells us the cost of coding each match length. This table is increasing, it's smallest at Cost(3) and largest at Cost(infinity). We find the maximum match length at a given offset and the minimum at a given offset (the minimum is the smallest length that can't be coded from a lower offset). We take the Cost(L) table in that range and add it onto CostTable() in that range, then pick the lowest value after the sum. Generally CostTable increases backwards faster than Cost(L) increases forwards, but not always! As we saw in the example above with a long match, CostTable can increase backwards slower than Cost(L) increases forwards.

So, how do we find this spots quickly? Well, when we are looking at coding at spot P, we have already visited all the CostTable spots > P. What we'd like to know is, Is CostTable ahead of us increasing sublinearly? And if so, where is it MOST sublinear? That is, Where is CostTable[P] as far below (end - P)*H ? That will tell us the most likely spot to code a less than maximal match.

We could build a binary search tree for these spots as we go backwards, which would be logN and all, but I used a simpler solution. At each spot P, I store the best sublinear spot preceding that spot (preceding in file order, actually occuring later in the backwards walk). That is :


I chose a coding at pos P and updated CostTable[P]

BestSpotPreceding[P] = P;
int step = 1;
while( CostTable[P] + step * TinyCost < CostTable[P + step] )
{
	BestSpotPreceding[P + step] = P;
	step++;
}

Normally when CostTable is increasing a lot as you step back, this does not get triggered at all. That is, CostTable[P+1] is usually a H smaller than CostTable[P]. But in weird cases, the cost of coding this symbol was almost free, so BestSpotPreceding points back to us. BestSpotPreceding gets updated in waves that terminate at cheap code spots. Normally you keep stepping back, then you hit a really cheap code spot and a wave percolates forward, filling out BestSpotPreceding, then you step some more normally, then you hit a cheap code spot and it waves through again. Sort of carries in an arithmetic coder. Anyway, this is technically still O(N*L), but it's a very cheap operation and it's way way less than N*L in practice.

So, how do we use this? Instead of considering only the maximum length at each offset ( {7 at Off1, 11 at Off2, 22 at Off3} in our previous example ), we also consider matches at BestSpotPreceding[] in that offset. That is :

Look in 3-7 at Off1
Consider a 7 at Off1
Also consider BestSpotPreceding[Pos + 7] at Off1 (if it's >= 3 )

Look in 8-11 at Off2
Consider an 11 at Off2
Also consider BestSpotPreceding[Pos + 11] at Off2 (if it's >= 8)

Look in 12-22 at Off3
Consider a 22 at Off3
Also consider BestSpotPreceding[Pos + 22] at Off3 (if it's >= 12)

And those are all the matches we have to consider. I did this and found it hurt less than 0.001 bpb compared to the full optimal parse that considers all possible coding operations at each position.

ADDENDUM #1 : I should mention there's another cool way to get this optimization if you don't want to do the BestSpotPreceding thing. We again want to ramp up our intuition for the problem. The cases where we need to make a non-trivial decision about match lengths are cases when we are writing two matches in a row, both greater than minimum length, and they overlap. That means there's a big range of ways we can code the same block. For example :

 

At pos P we find a match of length 7
At pos P+4 we find a match of length 10

The area from pos P to [P+14] can be coded with two matches, the ways are :

{4,10}
{5,9}
{6,8}
{7,7}

The area of overlap of the matches all codes to the same offset, it's the same offset that was found at P+4, but if you match past that at P, then you just match less of it. The amount of overlap is the number of different coding choices we have.

We're coding the same two offsets and skipping the same number of symbols, so the only difference in code cost is the cost to code the match lengths. The code costs of the choices are :


Cost(len 4) + Cost(len 10)
Cost(len 5) + Cost(len 9)
Cost(len 6) + Cost(len 8)
Cost(len 7) + Cost(len 7)

But we don't even need to compute these. The cost of coding a match len is almost always increasing in len, and increasing more slowly for larger lens, that is :

(len1 > len2) implies codecost( len1 ) >= codecost( len2 )

(len1 > len2) implies ( codecost( len1 +1 ) - codecost(len1) ) <= ( codecost( len2 + 1 ) - codecost( len2 ) )

That is, code costs go up for larger lens, but they go up more slowly. It's like a sqrt(x) function - sharp at the beginning and then levels out. That means the lowest total code cost is the one that's most skewed. You want to extend the longer match as much as possible because that's very cheap, and you want to shorten the shorter match as much as possible because that saves you a lot. Intuitively you can think the difference between coding a len of 3 and 4 is very big, but the difference between 511 and 512 is tiny tiny.

In our example that means the best choice is :

Cost(len 4) + Cost(len 10)

The simple heuristic to do this in general is :


Consider coding option at pos P
Find max match len possible at each offset
Consider coding with max match len

len = max match len
while ( matchoffset[ P + len - 1] == matchoffset[ P + len ] )
	len--;
if ( len != max match len )
	consider coding at len

Note the similarity with the BestSpotPreceding heuristic. We try coding with our longest match len. Then we also try coding with the *shortest* match len that still reaches the same match after we jump forward. We're trying the two most skew possibilities, our match as long as possible and our match as short as possible, constrained to be getting the same match following. (matchoffset is already computed because we're going backwards).

ADDENDUM #2 : More early outs. LowestCostPreceding. @@ todo : write this.

There are some more minor issues we need to consider.

The first one is that if we are doing a Huffman (or static arithmetic) coder to entropy code the length, offset & literal, then we can't actually know the code length in advance. And the way we choose to do the coding will affect the statistics which will affect the code lengths, which affects how we choose to do the coding, ad infinitum. It's a feedback loop.

As usual in these kind of optimization searches, we make the semi-static assumption. We make a good guess for the code lengths, find the optimal parse for those code lengths, use that parse to guess new code lengths, optimal parse again, etc. Hopefully this smoothly converges. Well, it doesn't. The problem is that the coding alphabet is actually pretty sparse. On medium size files there are not that many literals and there are plenty of offsets and match lengths you might not code at all in one particular encoding choice. For example, in one pass you might not code any matches of length {31} at all. Now when you do the next pass you need something reasonable in there. If you use the actual code lengths, you can strangely skew how you do the coding, and it can actually ping pong a lot in the search and not settle down.

To improve this, I suggest smoothing the counts in the early passes. That is, the chance of a length 31 match should be somewhere between the chance of a length 30 and length 32 match, regardless of the coincidence that we happened not to code with it in the last pass. To do this, I fit a laplacian probability model to lengths and offsets, and in the early passes I blend that model in to smooth out the regularize the distribution. As I do more passes I decrease the weight of that blended simple model and use the raw counts more.

In practice, you can avoid doing multiple passes of the actual optimal parser without hurting compression too much. You can do a first pass with just a normal non optimal parser and build statistics from that. Smooth those statistics with a regularizing simple model. Then do the optimal parse with those statistics, and you're done. Doing more passes will improve the coding decisions, though.

The other issue is that it's still a bit slow. Most of the slowness comes from the areas where very long matches are possible. If there's a long match of length L at pos P, we also consider (L-1) at (P+1), and (L-2) at (P+2), etc., and for each of those we generally find many closer offsets to also consider, roughly O(L) of them, which makes us O(L^2) over that region. On super degerate files like "pic" in the Calgary Corpus, this makes us very slow. However, it's easy to fix. In this area with the long match, it really doesn't matter much what we do, because all of them are good choices. This is a property that in areas of super high compression, the importance to the whole file is much less. You want to maximize your wins on the hard to compress spots, because in terms of output bytes they are more important. So, in these big long match areas, we just cut out the whole area inside the long match, and we always code the long match at the beginning. (actually I allow coding at P and P+1 up to P+K for a small K ; K is like 4 and L is like 100).

Optimal Parser for Adaptive Coders :

Now, with real modern coders you aren't doing static statistics, they're adapting as you go. You could just make a guess at the overall statistics and use the semi-static optimal parser that I just described, and then code the stream with your adaptive coders. That's better than not optimal parsing at all, but it's giving up a lot. Adaptive coders get a big win by changing their statistics as they scan through the file, and an optimal parser can make much better decisions if it uses the actual local statistics. For example, in one part of the file, literals might code very well, and the optimal parser should favor literals more, in another part of the file there might be lots of very short match lengths, and the optimal parser should code for that, etc.

One way to solve this is to use a different kind of semi-static approximation and optimal parse forward in blocks. For each block, you pretend your coder are static and use some fixed code lengths. Now, run the semi-static optimal parser that I described above to find the optimal coding for the next 1000 bytes or so. Now step up through that coding and output codes with your adaptive entropy coder. Only output 900 bytes or so, don't go all the way to the end of the block so that you avoid edge effects. Now hold use the current statistics to find the optimal parse for the next 1000 bytes. This in fact works very well, and I believe was first used by RKive many years ago. He called it "dynamic programming" which is sort of right in the sense that all backward optimal parses are a type of dynamic programming.

There's another form of adaptive optimal parse that a lot of people do now using forward walking. It's called "flexible parsing" by the encode.ru guy, and I believe a form of this is used in LZMA.

The simplest form of "flexible parse" is just the old "lazy evaluation" of Zip. Lazy matching walks through the file and considers either writing {match} or {literal + match at next pos}. It's a slightly non-greedy parser. Flexible parsing is basically the same thing, but we consider a few different choices.

To do this, we run through the whole file first and find the best match at each position. (you don't actually have to do this in practice, you could just look ahead some amount N from the current position to avoid multiple passes). In any cases, the match search at each pos is only done once, and we have them ready far enough in the future to do the flexible parse.

So, we consider a few coding choices at the current location. We choose the one that gives the lowest code cost to the end of the file. To make good decisions, it needs to also consider the coding after you take the current operation. That is, it's like a recursive function :


CodeCostRecurive( P )
{
	consider n in N choices
		cost of choice n = Cost(code n) + CodeCostRecurive( P + len(n) )
	return best of N
}

now of course if we actually did this full recursion we would be doing the whole O(N^L) tree like we didn't want to do. But we don't need to do that because we're only using this to make the decision at the current byte. That is, we can use the "short horizon" assumption - coding choices far in the future don't matter that much to our current coding choice. It's like only considering a few moves ahead in a Chess program (which is a bad thing to do there, but a good thing here). In fact, you only need to look 2 moves ahead here to make good coding choices.

Also, you don't need to consider very many N choices. N = 4 or 8 is plenty. Certainly consider coding a literal and coding the longest possible match, and then heuristically pick some other likely good choices in between (such as the longest matches available with some shorter offsets, or the longest matches possible from previous-match-references if you have those).

How do we terminate the recursion? After 2 steps (considering one op and then one more op after that) we just need an estimate of the cost of coding the rest of the file. That's easy, it's just :


CodeCostEstimate( P ) = ( {end} - P ) * H

That is, just assume the remainder is coded at the entropy. The entropy you use here should be a conservative overestimate of the entropy; for example just use the entropy of what you've done so far and add a fudge factor. This ensures that you actually prefer writing matches and don't do something strange if you start the file in a very low entropy region.

This flexible parse sounds kind of expensive, but in fact with N = 4, it's only 16 table lookups to compute all the code lengths, then we pick the best one. This of course is not a true optimal parse, but it's pretty damn good, and the fact that it's using the true current adaptive statistics makes up for the fact that it's not doing the best possible parse.

ADDENDUM : I should note that this kind of forward parse is also better at making good decisions with PMR's (Previous Match References). PMR's are cheap to code offsets, so they should always be considered in the optimal parse when they are possible matches. Going backwards you can't make good decisions about trying to use PMR's because you haven't done the earlier parts of the file yet.

No comments:

old rants