12-17-11 - LZ Optimal Parse with A Star Part 3

Continuing ...
Part 1
Part 2

At the end of Part 2 we looked at how to do a forward LZSS optimal parse. Now we're going to add adaptive "state" to the mix.

Each node in the walk of parses represents a certain {Pos,State} pair. There are now far too many possible nodes to store them all, so we can't just use an array over every possible {Pos,State}. Hopefully we won't actually visit them all, so instead we store just the nodes we do visit, in a hash table.

We are parsing forward, so for any node we visit (a {Pos,State} pair will be called a "node") we know how we got there. There can be many ways of reaching the same node, but we only care about the cheapest one, so we only need to store one entering link into each node, along with the total cost from the beginning of the path to that node.

If you think about the flow of how the forward LZSS parse completes, it's sort of like an ice tendril reaching out which then suddenly crystallizes. You start at the beginning and you always follow the longest length choice first - that is, you take big steps into the parse towards the end without filling in all the gaps. Once you get to the end with that first long path (which is actually the greedy parse - the parse made by taking the longest match available at each step), it starts popping backwards and filling in all the gaps. It then does all the dense work, filling backwards towards the beginning.

So it's like the parse goes in two directions - reaching from the beginning to get to the end (with nodes that don't have enough information), and then densely bubbling back from the end (and making final decisions). (if I were less lazy I would make a video of this).

Anyhoo, we'll make that structure more explicit. The hash table, for each node, stores the cost to get to the end from that node, and the coding choice that gives that cost.

The forward parse uses entry links, which I will henceforth call "arrivals". This is a destination node (a {pos,state}), and the cost from the beginning. (you don't need to store how you got here from the beginning since that can be reproduced at the end by rewalking from the beginning).


Full cost of parse through this node =

arrival.cost_from_head + hash_node.cost_to_tail
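
In C++-ish terms, a rough sketch of those structures might look like the following. All the names (ParseState, NodeKey, Arrival, and so on) are just made up for illustration - a real coder's state is obviously much fatter than a single id - but the shape is the point :

#include <cstdint>
#include <cstddef>
#include <functional>
#include <unordered_map>

// stand-in for the adaptive coder state (statistics, recent offsets, etc.)
struct ParseState { uint32_t id; };

// a "node" is a {Pos,State} pair ; it is the key into the hash table
struct NodeKey
{
    int        pos;
    ParseState state;

    bool operator==(const NodeKey & o) const { return pos == o.pos && state.id == o.state.id; }
};

struct NodeKeyHash
{
    std::size_t operator()(const NodeKey & k) const
    { return std::hash<uint64_t>()( ((uint64_t)k.pos << 32) ^ k.state.id ); }
};

// what the hash table stores per done node : the cost from that node to the end,
// and which coding choice at that node achieves it
struct NodeToTail
{
    int costToTail;     // bits from this node to the end of the parse
    int choiceIndex;    // index of the coding choice that gives that cost
};

// an "arrival" : a destination node plus the cost from the beginning of the parse
struct Arrival
{
    NodeKey node;
    int     costFromHead;
};

typedef std::unordered_map<NodeKey, NodeToTail, NodeKeyHash> DoneNodes;

// full cost of a parse through a node that has both an arrival and a hash entry :
//   arrival.costFromHead + doneNodes[arrival.node].costToTail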

Once a node has a cost in the hash table, it is done, because it had all the information it needed at that node. But more arrivals into it can come later as we fill in the gaps, so the full cost from the beginning of the parse to the end is not yet settled.

Okay, so let's start looking at the parse, based on our simple LZSS pseudo-code from last time :


hash table of node-to-end costs starts empty
stack of arrivals from head starts empty

Push {Pos 1,state initial} on stack of arrivals

While stack is not empty :

pop stack; gives you an arrival to node {P,state}

see if node {P,state} is already in the hash
if so
{
  total cost is arrival.cost_from_head + hash_node.cost_to_tail
  done with this arrival
  continue (back to stack popping);
}

For each coding choice {C} at the current pos
{
  find next_state = state transition from cur state after coding choice C
  next_pos = P + C.len
  next_node = {next_pos,next_state}

  if next_node is in the hash table :
  {
    compute cost to end from code cost of {C} plus next_node.cost_to_tail
  }
  else
  {
    push next_node to the arrivals stack (*1)
  }
}

if no pushes were done
{
  then processing of current node is done
  choose the best cost to end from the choices above
  create a node {P,state} in the hash with that cost
}

(*1 = if any pushes are done, then the current node is also repushed first (before other pushes). The pushes should be done in order from lowest pos to highest pos, just as with LZSS, so that the deep walk is done first).
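
Here is a rough C++ sketch of that loop, repeating the structs from the earlier sketch compactly so this block stands alone, plus a made-up getChoices hook that stands in for however your coder enumerates the literal/match choices at a node (with their lengths, bit costs and state transitions). It's deliberately naive - it re-enumerates the choices every time a node is re-popped, and it assumes at least a literal choice always exists - but it follows the pseudo-code above step for step :

#include <cstdint>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>
#include <algorithm>
#include <limits>

struct ParseState { uint32_t id; };
struct NodeKey
{
    int pos; ParseState state;
    bool operator==(const NodeKey & o) const { return pos == o.pos && state.id == o.state.id; }
};
struct NodeKeyHash
{
    std::size_t operator()(const NodeKey & k) const
    { return std::hash<uint64_t>()( ((uint64_t)k.pos << 32) ^ k.state.id ); }
};
struct NodeToTail { int costToTail; int choiceIndex; };
struct Arrival    { NodeKey node; int costFromHead; };

// one coding choice (literal or match), already evaluated by the coder :
// how far it steps, what it costs in bits, and which state it transitions to
struct Choice { int len; int costBits; ParseState nextState; };

int ParseForward(int rawLen, ParseState initialState,
                 const std::function<std::vector<Choice>(const NodeKey &)> & getChoices)
{
    std::unordered_map<NodeKey, NodeToTail, NodeKeyHash> doneNodes;
    std::vector<Arrival> stack;
    int bestTotalCost = std::numeric_limits<int>::max();

    stack.push_back( { { 0, initialState }, 0 } );

    while ( ! stack.empty() )
    {
        Arrival cur = stack.back(); stack.pop_back();

        // arrival to an already-done node : we know a full head-to-tail cost through it
        auto it = doneNodes.find(cur.node);
        if ( it != doneNodes.end() )
        {
            bestTotalCost = std::min(bestTotalCost, cur.costFromHead + it->second.costToTail);
            continue;
        }

        // base case : reaching the end of the buffer costs nothing more
        if ( cur.node.pos >= rawLen )
        {
            doneNodes[cur.node] = { 0, -1 };
            bestTotalCost = std::min(bestTotalCost, cur.costFromHead);
            continue;
        }

        std::vector<Choice> choices = getChoices(cur.node);

        std::vector<Arrival> pending;   // successor nodes not yet in the hash
        int bestCostToTail = std::numeric_limits<int>::max();
        int bestChoice     = -1;

        for ( int i = 0; i < (int)choices.size(); i++ )
        {
            const Choice & c = choices[i];
            NodeKey next = { cur.node.pos + c.len, c.nextState };

            auto nit = doneNodes.find(next);
            if ( nit != doneNodes.end() || next.pos >= rawLen )
            {
                int tail = ( nit != doneNodes.end() ) ? nit->second.costToTail : 0;
                if ( c.costBits + tail < bestCostToTail )
                {
                    bestCostToTail = c.costBits + tail;
                    bestChoice     = i;
                }
            }
            else
            {
                pending.push_back( { next, cur.costFromHead + c.costBits } );
            }
        }

        if ( ! pending.empty() )
        {
            // (*1) : repush myself first, then the missing successors from lowest pos
            // to highest pos, so that the deepest (longest length) one is popped first
            stack.push_back(cur);
            std::sort(pending.begin(), pending.end(),
                [](const Arrival & a, const Arrival & b) { return a.node.pos < b.node.pos; });
            for ( const Arrival & a : pending )
                stack.push_back(a);
        }
        else
        {
            // all successors are done : this node is now done too
            doneNodes[cur.node] = { bestCostToTail, bestChoice };
            bestTotalCost = std::min(bestTotalCost, cur.costFromHead + bestCostToTail);
        }
    }

    return bestTotalCost;   // the chosen coding path is recovered by walking doneNodes from the head
}

Note that when a node is popped whose successors are only partially done, the costs it computed for the done ones are simply thrown away and recomputed when it comes back off the stack - that's the "dense work filling backwards" phase described earlier.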

So, we have a parse, but it's walking every node, which is way too many. Currently this is a full graph walk. What we need are some early outs to avoid walking the whole thing.

The key is to use our intuition about LZ parsing a bit. Because we step deep first, we quickly get one parse for the whole segment (the greedy parse). Then we start stepping back and considering variations on that parse.

The parse doesn't collapse the way it did with LZSS, because of the presence of state. That is, say I parsed to the end and now I'm bubbling back and I get back to some pos P. I already walked the long length, so now I'm going to consider a shorter one. With LZSS, when I walked to the shorter one, the nodes I needed would already be done. But now the nodes aren't done - importantly, though, the positions have been visited. That is -


At pos P, state S
many future node positions are already done
 (I already walked the longest match length forward)

eg. maybe {P+3, S1} and {P+5, S2} and {P+7, S3} have been done

I take a shorter length now; eg. to {P+2,S4}

from there I consider {P+5, S5}

the node is not done, but a different state at P+5 was done.

If the state didn't matter, we would be able to reuse that node and collapse back to O(N) like LZSS.

Now of course state does matter, but crucially it doesn't matter *that much*. In particular, there is sort of a limit on how much it can help.

Consider for example if "state" is some semi-adaptive statistics. Those statistics are adaptive, so if you go far enough into the future, the state will adapt to the coding parse, and the initial state won't have helped that much. So maybe the initial state helps a lot for the next 8 coding steps. And maybe it helps at most 4 bits each time. Then having a better initial state can help at most 32 bits.
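
In code, that back-of-the-envelope bound is nothing more than this (the numbers are purely illustrative, not from any real coder) :

// illustrative only : if the state stops mattering after ~8 coding steps and saves
// at most ~4 bits per step, a better starting state is worth at most 32 bits
const int kStateBenefitSteps   = 8;
const int kMaxBitsSavedPerStep = 4;
const int kEarlyOutThreshold   = kStateBenefitSteps * kMaxBitsSavedPerStep;   // = 32 bits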

When you see that some other parse has been through this same position P, albeit with a different state at that position, and that parse has completed and has a cost to the end, then we know it is the optimal cost through that node - not just the greedy parse or whatever. That is, whenever a hash node has a cost_to_tail, it is the optimal parse cost to tail. If there is a good parse later on in the file, the optimal parse is going to find it, even when it starts from a non-ideal state.

This is the form of our early outs :


When you pop an arrival to node {P,S}, look at the best cost seen to arrive at pos P for any state :

if arrival.cost_from_head - best_cost_from_head[P] > threshold
  -> early out

if arrival.cost_from_head + best_cost_to_tail[P] > best_cost_total + threshold
  -> early out

where we've introduced two arrays that track the best seen cost to head & tail at each pos, regardless of state. We also keep a best total cost, which starts at infinity and is updated any time we complete a whole walk with a lower cost.
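
A rough C++ sketch of those early outs, with made-up names as before; "threshold" is the assumed bound on how much the state can matter (e.g. the ~32 bits from the example above), and the arrays are updated as arrivals are popped and as nodes get their cost_to_tail :

#include <vector>
#include <limits>

struct EarlyOuts
{
    std::vector<int> bestCostFromHead;   // best arrival cost_from_head seen at each pos, any state
    std::vector<int> bestCostToTail;     // best cost_to_tail seen at each pos, any state
    int              bestTotalCost;      // best whole head-to-tail parse cost seen so far

    explicit EarlyOuts(int rawLen)
      : bestCostFromHead(rawLen + 1, std::numeric_limits<int>::max()),
        bestCostToTail  (rawLen + 1, std::numeric_limits<int>::max()),
        bestTotalCost   (std::numeric_limits<int>::max())
    { }
};

// returns true if a popped arrival to {P,S} can be dropped without being expanded
bool ShouldEarlyOut(const EarlyOuts & eo, int pos, int costFromHead, int threshold)
{
    const int inf = std::numeric_limits<int>::max();

    // someone already reached this pos (in some state) more cheaply than us by
    // more than the state could possibly be worth :
    if ( eo.bestCostFromHead[pos] != inf &&
         costFromHead - eo.bestCostFromHead[pos] > threshold )
        return true;

    // even taking the best cost-to-tail ever seen from this pos, and crediting our
    // state with the maximum possible benefit, we still can't beat the best total :
    if ( eo.bestCostToTail[pos] != inf && eo.bestTotalCost != inf &&
         costFromHead + eo.bestCostToTail[pos] > eo.bestTotalCost + threshold )
        return true;

    return false;
}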

This is just A star. From each node we are trying to find a lower bound for the cost to get to the end. What we use is previous encodings from that position to the end, and we assume that starting from a different state can't help by more than some threshold - so best_cost_to_tail[P] minus that threshold is our lower bound on the cost to tail from {P,S}.

Next time, some subtleties.

2 comments:

Zed said...

please look at the end of http://encode.ru/threads/2094-Optimal-Preprocessing-Parsing-for-LZ?p=41883&viewfull=1#post41883

cbloom said...

Thanks for the pointer - I posted some notes there.
