Previous parts : Part 1 , Part 2 , Part 3
So we have our A star parse from last time.
First of all, when we "early out" we still actually fill out that hash_node. That is, you pop a certain "arrival", then you evaluate the early out conditions and decide this arrival is not worth pursuing. You still need to make a hash_node and mark it as a dead end, so that when you later pop other arrivals whose coding choices walk to this node, they won't try to visit it again.
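For concreteness, here's a minimal sketch of what such a node record might look like ; the field names are my own placeholders, not anything from a specific coder :

    // minimal sketch of a parse-graph node ; field names are hypothetical
    struct hash_node
    {
        int    pos;          // position in the file this node is at
        // ... plus whatever else identifies the node (relevant coder state, etc.)
        float  cost_to_end;  // best known cost from this node to the end
        bool   is_dead_end;  // set when the arrival here was "early outed"
    };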
One option would be to use a separate hash of just bools that mark dead ends. This could be a super-efficient smaller hash table of bit flags or bloom filters or something, which would save memory and perhaps speed.
I didn't do this, because you can get some win from considering parses that have been "early outed". What you do is : when you decide to "early out" an arrival, you will not walk to any future nodes that don't exist yet, but you *will* consider paths that go to nodes that are already there. In pseudo-code :
    pop an arrival at the current pos
    check the arrival early outs and just set a flag (don't abort yet)
    for all coding choices at current pos
    {
        find next_node for this choice
        if next_node exists
        {
            // node already made : compute the cost to end through it as usual
            compute cost to end
        }
        else if ( ! early out flag )
        {
            // only arrivals that did not early out get to create new nodes
            push next_node on arrivals stack
        }
    }
So the early out stops you from creating any new nodes in the graph walk (nodes you probably wouldn't have wanted to visit anyway), but you can still find new connections through the graph that's already there. What this lets you do in practice is drive the early out thresholds tighter.
The other subtlety is that it helps a lot to actually have two (or more) stages of early out. Rather than just stopping consideration of all the exit coding choices once you don't like your arrival, you have a couple of levels. If your arrival looks sort of bad but not terrible, then you still consider some of the coding choices. Instead of considering 8 or 16 coding choices, you reduce it to 2 or 4 which you believe are likely advantageous.
The exact details depend on the structure of your back end coder, but some examples of "likely advantageous" coding choices that you would consider in the intermediate early out case : if you have a "repeat recent offset" structure like LZX/LZMA, then those are obvious things to include in the "likely advantageous" set. Another one might be RLE or continue-previous type of match codes. Another would be the literal, if it codes below a certain number of bits with the current statistics. Also the longest match, if it's longer than a certain amount.
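As a concrete illustration (not from any particular coder ; the enum, struct fields and thresholds are all made up), the intermediate level might filter the coding choices something like this :

    // hypothetical sketch of a two-level early out filter on coding choices ;
    // the names and thresholds here are placeholders
    enum EarlyOutLevel { EO_NONE, EO_PARTIAL, EO_FULL };

    struct CodingChoice
    {
        bool is_repeat_offset;    // LZX/LZMA style "repeat recent offset"
        bool is_rle_or_continue;  // RLE or continue-previous type of match
        bool is_literal;
        bool is_longest_match;
    };

    bool should_consider( EarlyOutLevel level, const CodingChoice & c,
                          float literal_cost_bits, int longest_match_len )
    {
        if ( level == EO_NONE ) return true;   // arrival looks good : consider everything
        if ( level == EO_FULL ) return false;  // arrival looks terrible : consider nothing new

        // EO_PARTIAL : only the "likely advantageous" choices
        if ( c.is_repeat_offset )   return true;
        if ( c.is_rle_or_continue ) return true;
        if ( c.is_literal && literal_cost_bits <= 6.0f )      return true; // cheap under current stats
        if ( c.is_longest_match && longest_match_len >= 32 )  return true;
        return false;
    }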
Okay, so our A star is working now, but we have a problem. We're still just not getting enough early outs, and if you run this on a big file it will (sometimes) take forever.
The solution is to use another aspect we expect from our LZ back end, which is "semi-locality". Locality means that a decision we make now will not have a huge effect way in the future. Yes, it has some effect, because it may change the state and that affects the future, but over time the state changes so many times and adapts to future coding that the decision 4000 bytes ago doesn't matter all that much.
Another key point is that the bad (slow) case occurs when there are lots of parses that cost approximately the same. Because of our early out structure, if there is a really good cheap parse we will generally converge towards it, and then the other choices will be more expensive and they will early out and we won't consider too many paths. We only get into bad degeneracy if there are lots of parses with similar cost. And the thing is, in that case we really don't care which one we pick. So when we find an area of the file that has a huge branching factor that's hard to make a decision about, we are imperfect but it doesn't cost us much overall.
The result is that we can cut up the file to make the parse space tractable. What I do is work in "quanta". You take the current chunk of the file as your quantum and parse it as if it was its own little file. The parse at the beginning of the quantum will be mostly unaffected by the quantum cut, but the parse at the end will be highly affected by the false EOF, so you just throw it out. That is, advance through the first 50% or 75% of the parse, and then start the next quantum there.
There is one special case for the quantum cutting, which is long matches that extend past the end of the quantum. What you would see is that when outputting the first 50% of the parse, the last code will be a match that goes all the way to the end of the quantum. Instead of cutting it at the quantum boundary, I just output the full length of the match. This is not ideal but the loss is negligible.
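Putting the quantum cutting together, the outer loop might look roughly like this ; parse_quantum() and output_code() are hypothetical hooks into your parser and coder, and the 50% fraction is just one reasonable choice :

    #include <vector>
    #include <algorithm>

    struct Code  { int pos; int len; /* plus offset, literal flag, etc. */ };
    struct Parse { std::vector<Code> codes; };

    // hypothetical hooks : parse [start,end) as if "end" were EOF ; emit one code
    Parse parse_quantum( int start, int end );
    void  output_code( const Code & c );

    void parse_file( int file_len, int quantum_len /* eg. 4096 */ )
    {
        int pos = 0;
        while ( pos < file_len )
        {
            int q_end = std::min( pos + quantum_len, file_len );
            Parse parse = parse_quantum( pos, q_end );

            // only keep the first ~50% of the parse ; the tail is skewed by the false EOF
            int keep_end = ( q_end == file_len ) ? q_end : pos + ( q_end - pos ) / 2;

            for ( const Code & c : parse.codes )
            {
                if ( c.pos >= keep_end ) break;
                // a match that starts before keep_end but runs past it is
                // output at its full length (the long-match special case)
                output_code( c );
                pos = c.pos + c.len;
            }
            // (assumes each quantum produces at least one code, so pos always advances)
        }
    }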
For speed you can go even further and use adaptive quantum lengths. On highly degenerate parts of the file, there may be a huge node space to parse that doesn't get early-out'ed. When you detect one of these, you can just reduce the quantum length for that part of the file. eg. you start with a quantum length of 4096 ; if as you are parsing that quantum you find that the hash table occupancy is beyond some threshold (1 million nodes, for example), you decide the branching factor is too great, reduce the quantum length to 2048, and resume parsing on just the beginning of that chunk. You might hit 1 million nodes again, then you reduce to 1024, etc.
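One way to express that retry, again just as a sketch with made-up numbers (the 1 million node budget, the halving) and reusing pos / file_len from the driver above ; parse_quantum_limited() is a hypothetical variant that bails out when the node count crosses the threshold :

    // hypothetical : returns false if the hash occupancy exceeded max_nodes mid-parse
    bool parse_quantum_limited( int start, int end, int max_nodes );

    int q_len = 4096;
    for (;;)
    {
        int q_end = std::min( pos + q_len, file_len );
        if ( parse_quantum_limited( pos, q_end, 1000000 ) )
            break;          // parse fit under the node budget : use it
        q_len /= 2;         // branching factor too great : shrink and reparse
        // (in practice you'd also enforce some minimum quantum length)
    }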
That's it! There will probably be a followup post with some results numbers and maybe some more notes about subtle issues. I could also do several long posts about ideas I tried that didn't work, which I think are sort of interesting.