10-04-12 - Hash-Link match finder tricks

Some notes on the standard old Hash->Links match finder for LZ. (See previous posts on StringMatchTest Hash1b and Amortized Hashing or index post here )

Some additional tricks which are becoming more or less standard these days :

1. Carry-forward "follows" matches. Previously discussed, see Hash1b post. (also in the Hash1b post : checking for improvement first).

2. "Good enough length". Once you find a match of length >= GOOD_ENOUGH (256 or 1024 or so), you stop the search. This helps in super-degenerate areas; eg. you are at a big run of zeros and that has occured many times before in your file, you can get into a very bad O(N^2) thing if you aren't careful, so once you find a long match, just take it. Hurts compression very little. (note this is not just a max match length; that does hurt compression a bit more (on super-compressable files))

3. Extra steps when not finding matches. The first place I saw this was in LZ4 and Snappy, dunno where it was done first. The idea is when you fail to find a match, instead of stepping ahead by 1 you step ahead by some variable amount. As you continue to fail to find matches, that variable amount increases. Something like :

ptr += 1 + (numSearchesWithNoMatchFound>>5);

instead of just ptr++. The idea is that on incompressible files (or incompressible portions of files) you stop bothering with all the work to find matches that you won't find anyway. Once you get back to a compressible part, the step resets.

4. Variable "amortize" (truncated hash search). A variant of #3 is to use a variable limit for the amortized hash search. Instead of just stepping over literals and doing no match search at all, you could do a match search but with a very short truncated limit. Alternatively, if you are spending too much time in the match finder, you could reduce the limit (eg. in degenerate cases not helped by the "good enough len"). The amortize limit might vary between 64 and 4096.

The goal of all this is to even out the speed of the LZ encoder.

The ideal scenario for an LZ encoder (greedy parsing) is that it finds a very long match (and thus can step over many bytes without doing any lookup at all), and it finds it in a hash bucket which has very few other entries, or if there are other entries they are very easily rejected (eg. they mismatch on the first byte).

The worst scenario for an LZ encoder (without our tricks) is either : 1. there are tons of long matches, so we go and visit tons of bytes before picking one, or 2. there are no matches (or only a very short match) but we had to look at tons of pointers in our hash bucket to find it, and we will have to do hash lookups many times in the file because we are not finding long matches.

No comments:

old rants