cbloom rants: 12-17-11 - LZ Optimal Parse with A Star Part 2

Okay, optimal parsing with A star. (BTW "optimal" parsing here is a misnomer that goes back to the LZSS backwards parse, where it really was optimal. With a non-trivial coder you can't really do an optimal parse; what we mean is "more optimal" than greedy/lazy type parses.)

Part 1 was just a warmup, but may get you in the mood.

The reason for using A Star is to handle LZ parsing when you have adaptive state. The state changes as you step through the parse forward, so it's hard to deal with this in an LZSS style backwards parse. See some previous notes on backwards parsing and LZ here : 1 , 2 , 3

So, the "state" of the coder is something like maybe an adaptive statistical model, maybe the LZMA "markov chain" state machine variable, maybe an LZX style recent offset cache (also used in LZMA). I will assume that the state can be packed into a not too huge size, maybe 32 bytes or so, but that the count of states is too large to just try them all (eg. more than 256 states). (*1)

(*1 - in the case that you can collapse the entire state of the coder into a reasonably small number of states (256 or so) then different approaches can be used; perhaps more on this some day; but basically any adaptive statistical state or recent offset makes the state space too large for this).

Trying all parses is impossible even for the tiniest of files. At each position you have something like 1-16 options (actually sometimes more than 16, but you can limit the choices without much penalty (*2)). You always have the choice of a literal; when you have a match there are typically several offsets, and several lengths per offset, to consider. If the state of the coder is changed by the parse choice, then you have to consider different offsets even if they code to the same number of bits in the current decision, because they affect the state in the future.
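To get a feel for how fast the number of parses blows up, here is a small counting sketch in Python. It assumes a toy model that is not from the notes above: every position offers a literal plus matches of each length in [2, max_len], with a fixed number of offsets per length. The name count_parses and its parameters are made up for illustration.

```python
def count_parses(n, max_len=4, offsets_per_len=1):
    """Count distinct parses of n bytes under a toy model (an assumption):
    each position offers one literal, plus offsets_per_len matches for
    every length in [2, max_len] that still fits before the end."""
    ways = [0] * (n + 1)
    ways[n] = 1  # exactly one way to parse an empty tail
    for pos in range(n - 1, -1, -1):
        total = ways[pos + 1]  # take a literal
        for length in range(2, max_len + 1):
            if pos + length <= n:
                total += offsets_per_len * ways[pos + length]
        ways[pos] = total
    return ways[0]


if __name__ == "__main__":
    # Even with only "literal or length-2 match" the count grows like Fibonacci.
    for n in (8, 16, 32, 64):
        print(n, count_parses(n, max_len=2))
```

Even the most restricted case (a literal or a single length-2 match at each step) grows exponentially with the file size, which is why some kind of dynamic programming or pruned search is needed.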

(*2 - the details of this depend on the back end of the coder. For example, if your offset coder is very simple, something like just Golomb type (NOSB) coding, then you know that only the shortest offset for a given length needs to be considered. Another simplification, used in LZMA, is that only the longest length for a given offset is considered; in some coders it helps to consider shorter length choices as well. In general for a match of length L you need to consider all lengths in [2,L], but in practice you can reduce that large set by picking a few "inflection points" (perhaps more on this some day).)

Okay, a few more generalities. Let's revisit the LZSS backwards optimal parser. It came from a forward style parser, which we can implement with "dynamic programming" ; like this :


At pos P , consider the set of possible coding choices {C}

For each choice (ci), find the cost of the choice, plus the cost after that choice :
{

  Cost to end [ci] = Current cost of choice C [ci] + Best cost to end [ P + C[ci].len ]

}

choose best ci = the ci with the lowest Cost to end
Best cost to end [ P ] = Cost to end [ best ci ]

You may note that if you do this walking forward, then the "Best cost to end" at the next position may not be computed yet. If so, then you suspend the current computation and step ahead to do that, then eventually come back and finish the current decision.

Of course with LZSS the simpler way to do it is just to parse backwards from the end, because that ensures the future costs are already done when you need them. But let's stick with the forward parse because we need to introduce adaptive state.
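For reference, the stateless backward version of the recurrence might look something like the Python sketch below. This is only an illustration under assumptions: choices_at is a hypothetical callback returning (length, cost) pairs for a position, costs are fixed numbers that don't depend on any coder state, positions are 0-based, and a literal (length 1) is always available.

```python
def lzss_backward_parse(n, choices_at):
    """Stateless backward parse over n bytes.
    choices_at(pos) -> iterable of (length, cost) pairs (hypothetical callback);
    it is assumed to always include a literal of length 1."""
    best_cost = [0] * (n + 1)      # best_cost[n] = 0: nothing left to code
    best_choice = [None] * n
    for pos in range(n - 1, -1, -1):
        best = None
        for length, cost in choices_at(pos):
            if pos + length > n:
                continue           # choice runs past the end of the buffer
            total = cost + best_cost[pos + length]
            if best is None or total < best:
                best = total
                best_choice[pos] = (length, cost)
        best_cost[pos] = best
    # Walk forward through the stored choices to read out the parse.
    parse, pos = [], 0
    while pos < n:
        parse.append(best_choice[pos])
        pos += best_choice[pos][0]
    return best_cost[0], parse


if __name__ == "__main__":
    # Toy costs (made up): literal = 9 bits, a length-3 match at pos 0 = 20 bits,
    # a length-5 match at pos 1 = 24 bits.
    matches = {0: [(3, 20)], 1: [(5, 24)]}
    toy_choices = lambda pos: [(1, 9)] + matches.get(pos, [])
    print(lzss_backward_parse(6, toy_choices))
```

In the toy input a greedy parse would take the match at position 0 and then pay for three literals; the backward parse instead finds that a literal followed by the longer match is cheaper.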

The forward parse LZSS (with no state) is still O(N) just like the backward parse; it just requires more bookkeeping. (This time cost assumes the string matching is free or previously done, and that you consider a fixed number of match choices, not proportional to the number of matches or length of matches, which would ruin the O(N) property.)

In full detail a forward LZSS looks like this :


Set "best cost to end" for all positions to "uncomputed"

Push Pos 1 on stack of needed positions.

While stack is not empty :

pop stack; gives you a pos P

If any of the positions that I need ( P + C.len ) are not done :
{
  push self (P) back on stack
  push all positions ( P + C.len ) on stack
    in order from lowest to highest pos
}
else
{
  make a choice as above and fill "best cost to end" at pos P
}

If you could not make a choice the first time you visit pos P, then because of the order that we push things on the stack, when you come back and pop P the second time it's guaranteed that everything needed is done. Therefore each position is visited at most twice. Therefore it's still O(N).

We push from lowest to highest len, so that the pops are highest pos first. This makes us do later positions first; that way earlier positions are more likely to have everything they need already done.
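The stack-driven forward parse described above could be sketched in Python roughly as follows. Same assumptions as a plain stateless LZSS: choices_at is a hypothetical callback returning (length, cost) pairs, costs don't depend on coder state, a literal is always available, and positions are 0-based (so the start is pos 0 rather than Pos 1). A position that is popped after it has already been filled in is simply skipped; that detail is an addition of this sketch.

```python
def lzss_forward_parse(n, choices_at):
    """Forward LZSS parse driven by an explicit stack of needed positions.
    choices_at(pos) -> iterable of (length, cost) pairs (hypothetical callback)."""
    best_cost = [None] * (n + 1)   # None means "uncomputed"
    best_cost[n] = 0               # nothing left to code at the end
    best_choice = [None] * n
    stack = [0] if n > 0 else []
    while stack:
        pos = stack.pop()
        if best_cost[pos] is not None:
            continue               # pushed more than once and already done
        options = [(l, c) for l, c in choices_at(pos) if pos + l <= n]
        needed = sorted({pos + l for l, _ in options
                         if best_cost[pos + l] is None})
        if needed:
            stack.append(pos)      # come back to pos once the futures are done
            stack.extend(needed)   # lowest to highest, so highest pos pops first
        else:
            length, cost = min(options,
                               key=lambda lc: lc[1] + best_cost[pos + lc[0]])
            best_cost[pos] = cost + best_cost[pos + length]
            best_choice[pos] = (length, cost)
    # Walk forward through the stored choices to read out the parse.
    parse, pos = [], 0
    while pos < n:
        parse.append(best_choice[pos])
        pos += best_choice[pos][0]
    return best_cost[0], parse


if __name__ == "__main__":
    # Toy costs (made up): literal = 9 bits, plus two matches.
    matches = {0: [(3, 20)], 1: [(5, 24)]}
    toy_choices = lambda pos: [(1, 9)] + matches.get(pos, [])
    print(lzss_forward_parse(6, toy_choices))
```

With no adaptive state this should produce the same costs as a backward parse over the same choices; the point of writing it forward is only that the traversal order matches the direction in which coder state would evolve.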

Of course with LZSS this is silly, you should just go backwards, but we'll use it to inspire the next step.

To be continued...
