cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
From fast to slow :
All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).
With cache tables you have a few primary parameters :
hash bits
hash ways
2nd hash
very fastest :
0 :
hash ways =1
2nd hash = off
1 :
hash ways = 2
2nd hash = off
2 :
hash ways = 2
2nd hash = on
...
hash ways = 16
2nd hash = on
The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all
your possible matches are right there in one cache line. (there's an option of whether you store the
first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning
of the match).
Cache tables are superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.
Hash-link :
Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).
(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of- redundant-matches case).
The good thing about hash-link is that you are strictly walking matches in increasing offset order.
This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice
(though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is
better or not. You're always doing a simple compare :
previous best match vs.
new match with
higher offset
longer length
which is a lot simpler than something like the cache table case where you see your matches in random order.
Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurance of that match length.
I'm not so high on Suffix Tries any more.
*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.
But in practice on a modern LZ they have problems.
The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurance of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurance of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.
(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )
You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurance of short matches. But bleh.
And my negativity about Suffix Tries also comes from another point :
Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.
It's way way way better to run an optimal parse with a crap match finder (even cache table with low ways) than to run a heuristic parse with great match finder. The parse is just way more important, and per CPU cost gives you way more win.
And there's another issue :
With a forward optimal parse, you can actually avoid finding matches at every position.
There are a variety of ways to get to skip ahead in a forward optimal parse :
Any time you find a very long match -
just take it and skip ahead
(eg. fast bytes in LZMA)
this can reduce the N^2 penalty of a bad match finder
When you are not finding any matches -
start taking multi-literal steps
using something like (failedMatches>>6) heuristic
When you find a long enough rep match -
just take it
and this "long enough" can be much less than "fast bytes"
eg. fb=128 for skipping normal matches
but you can just take a rep match at length >= 8
which occurs much more often
the net result is lots of opportunity for more of a "greedy" type of match finding in your optimal
parser, where you don't need every match.
This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).