cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 11-02-11 - StringMatchTest Release
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
From fast to slow :
All my fast matchers now use "cache tables". In fact I now use cache tables all the way up to my "Normal" level (default level; something like zip -7).
With cache tables you have a few primary parameters :
hash bits
hash ways
2nd hash
very fastest :
0 :
hash ways = 1
2nd hash = off
1 :
hash ways = 2
2nd hash = off
2 :
hash ways = 2
2nd hash = on
...
hash ways = 16
2nd hash = on
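To make that concrete, here is one way the knobs might be laid out in code. This is just a sketch; the hash-bits values and names are mine, not from any real codebase :

#include <stdint.h>

struct CacheTableConfig
{
    int  hashBits;   // log2 of the number of hash rows
    int  hashWays;   // entries per row
    bool secondHash; // also insert with a 2nd hash into the same table
};

// hypothetical level -> config mapping, following the list above :
static const CacheTableConfig c_levels[] =
{
    { 17,  1, false }, // very fastest (level 0)
    { 18,  2, false }, // level 1
    { 18,  2, true  }, // level 2
    // ...
    { 20, 16, true  }, // around "Normal" ; past 16 ways, hash-link starts to win
};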
The good thing about cache tables is the cpu cache coherency. You look up by the hash, and then all
your possible matches are right there in one cache line. (there's an option of whether you store the
first U32 of each match right there in the cache table to avoid a pointer chase to check the beginning
of the match).
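A minimal sketch of that lookup, assuming 8-byte entries so 8 ways fit a 64-byte cache line, and storing the first U32 of each match. Insertion and empty-slot handling are omitted and the names are mine :

#include <stdint.h>
#include <string.h>

typedef uint32_t U32;

struct Entry { U32 pos; U32 firstFour; }; // 8 bytes -> 8 ways per 64B cache line

enum { HASH_BITS = 16, WAYS = 8 };
static Entry s_table[1 << HASH_BITS][WAYS]; // one row per hash = one cache line

static U32 Hash4(const uint8_t * p)
{
    U32 x; memcpy(&x, p, 4);
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

// caller guarantees at least 4 readable bytes at ptr :
static int FindMatch(const uint8_t * window, const uint8_t * ptr,
                     const uint8_t * end, U32 * pBestPos)
{
    U32 cur; memcpy(&cur, ptr, 4);
    const Entry * row = s_table[Hash4(ptr)];
    int bestLen = 0;
    for (int w = 0; w < WAYS; w++)
    {
        // the stored prefix rejects hash collisions without a pointer chase ;
        // if it passes, we already know we have at least a len-4 match
        if (row[w].firstFour != cur) continue;
        const uint8_t * m = window + row[w].pos;
        int len = 4;
        while (ptr + len < end && m[len] == ptr[len]) len++;
        if (len > bestLen) { bestLen = len; *pBestPos = row[w].pos; }
    }
    return bestLen;
}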
Cache tables are a superb space-speed tradeoff up until ways hits around 16, and then they start to lose to hash-link.
Hash-link :
Hash link is still good, but can be annoying to make fast (*) and does have bad degenerate case behavior (when you have a bad hash collision, the links on that chain get overloaded with crap).
(* = you have to do dynamic amortization and shite like that which is not a big deal, but ugly; this is to handle the incompressible-but-lots-of-hash-collisions case, and to handle the super-compressible-lots-of-redundant-matches case).
The good thing about hash-link is that you are strictly walking matches in increasing offset order.
This means you only need to consider longer lengths, which helps break the O(N^2) problem in practice
(though not in theory). It also gives you a very easy way to use a heuristic to decide if a match is
better or not. You're always doing a simple compare :
previous best match vs.
new match with
higher offset
longer length
which is a lot simpler than something like the cache table case where you see your matches in random order.
Being rather redundant : the nice thing about hash-link is that any time you find a match length, you know absolutely that you have the lowest offset occurrence of that match length.
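A minimal hash-link sketch showing both properties : the increasing-offset walk, plus a crude walk limit standing in for real amortization. Names are mine :

#include <stdint.h>

typedef uint32_t U32;

enum { HASH_BITS = 16, WINDOW_SIZE = 1 << 20, NO_POS = 0xFFFFFFFFu };
static U32 s_head[1 << HASH_BITS]; // hash -> most recent position (init to NO_POS)
static U32 s_link[WINDOW_SIZE];    // position -> previous position with same hash

static void Insert(U32 hash, U32 pos)
{
    s_link[pos] = s_head[hash];
    s_head[hash] = pos;
}

// walk from most recent (lowest offset) to oldest ; because offsets only
// increase, a candidate is only interesting if it is strictly longer,
// and the first candidate found at any given length is the lowest-offset one
static int FindMatchHashLink(const uint8_t * window, U32 pos, U32 windowEnd,
                             U32 hash, U32 * pBestPos, int walkLimit)
{
    int bestLen = 0;
    for (U32 m = s_head[hash]; m != NO_POS && walkLimit-- > 0; m = s_link[m])
    {
        if (pos + (U32)bestLen >= windowEnd) break; // can't get longer
        // cheap reject : the byte that would extend the current best must match
        if (window[m + bestLen] != window[pos + bestLen]) continue;
        int len = 0;
        while (pos + len < windowEnd && window[m + len] == window[pos + len]) len++;
        if (len > bestLen) { bestLen = len; *pBestPos = m; }
    }
    return bestLen;
}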
I'm not so high on Suffix Tries any more.
*if* your LZ just needs the longest length at each position, they're superb. If you actually need the best match at every position (traditional optimal parse), they're superb. eg. if you were doing LZSS with large fixed-size offsets, you just want to find the longest match all the time - boom Suffix Trie is the answer. They have no bad degenerate case, that's great.
But in practice on a modern LZ they have problems.
The main problem is that a Suffix Trie is actually quite bad at finding the lowest offset occurrence of a short match. And that's a very important thing to be good at for LZ. The problem is that a proper ST with follows is doing its updates way out deep in the leaves of the tree, but the short matches are up at the root, and they are pointing at the *first* occurrence of that substring. If you update all parents to the most recent pointer, you lose your O(N) completely.
(I wrote about this problem before : cbloom rants 08-22-13 - Sketch of Suffix Trie for Last Occurance )
You can do something ugly like use a suffix trie to find long matches and a hash->link with a low walk limit to find the most recent occurance of short matches. But bleh.
And my negativity about Suffix Tries also comes from another point :
Match finding is not that important. Well, there are a lot of caveats on that. On structured data (not text), with a pos-state-lastoffset coder like LZMA - match finding is not that important. Or rather, parsing is more important. Or rather, parsing is a better space-speed tradeoff.
It's way way way better to run an optimal parse with a crap match finder (even a cache table with low ways) than to run a heuristic parse with a great match finder. The parse is just way more important, and per unit of CPU cost it gives you way more win.
And there's another issue :
With a forward optimal parse, you can actually avoid finding matches at every position.
There are a variety of ways to skip ahead in a forward optimal parse (sketched in code after this list) :
Any time you find a very long match -
just take it and skip ahead
(eg. fast bytes in LZMA)
this can reduce the N^2 penalty of a bad match finder
When you are not finding any matches -
start taking multi-literal steps
using something like (failedMatches>>6) heuristic
When you find a long enough rep match -
just take it
and this "long enough" can be much less than "fast bytes"
eg. fb=128 for skipping normal matches
but you can just take a rep match at length >= 8
which occurs much more often
the net result is lots of opportunity for more of a "greedy" type of match finding in your optimal
parser, where you don't need every match.
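Schematically, something like this; all the helpers here are hypothetical stand-ins, and the thresholds are just the examples from above :

#include <stdint.h>

typedef uint32_t U32;

// hypothetical helpers, stand-ins for the real match finder / coder :
extern int  FindRepMatch(U32 pos);  // rep-offset match length at pos (0 = none)
extern int  FindBestMatch(U32 pos); // best normal match length at pos (0 = none)
extern void EmitRep(U32 pos, int len);
extern void EmitMatch(U32 pos, int len);
extern void ConsiderInOptimalParse(U32 pos, int len);

void ForwardParseWithSkips(U32 endPos)
{
    const int fastBytes = 128; // "fb" : very long normal match -> just take it
    const int repTake   = 8;   // much lower threshold for rep matches
    int failedMatches   = 0;   // consecutive positions with no match

    for (U32 pos = 0; pos < endPos; )
    {
        int repLen = FindRepMatch(pos);
        if (repLen >= repTake) // long enough rep match : just take it
        {
            EmitRep(pos, repLen); pos += repLen; continue;
        }
        int len = FindBestMatch(pos);
        if (len >= fastBytes)  // very long match : take it and skip ahead
        {
            EmitMatch(pos, len); pos += len; continue;
        }
        if (len == 0)          // dry stretch : take multi-literal steps
        {
            failedMatches++;
            pos += 1 + (failedMatches >> 6);
            continue;
        }
        failedMatches = 0;
        ConsiderInOptimalParse(pos, len); // normal optimal-parse work
        pos++;
    }
}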
This means that good greedy-parse match finders like hash-link and Yann's MMC ( my MMC note ) become interesting again (even for optimal parsing).
For multi-way cache tables, do you ever do SIMD parallel probing?
There's no big win from SIMD in this case since the values in the cache table are pointers, not the actual characters you're matching against.
As Charles mentions, you can store the first few characters at each offset in the cache table (along with the offset itself), which avoids the cost of potentially missing the cache accessing the window only to find out you had a hash collision. In that case, you might be able to profitably SIMD it.
You can also store a few high hash bits (assuming low bits are used as bucket index), which will filter out a lot of non-matches very quickly. U32 prefix seems like a waste of space, but I guess it depends on how long your matches tend to be.
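For what it's worth, if the stored prefixes sit contiguously in the row, the probe could be SIMD'd something like this SSE2 sketch (purely illustrative) :

#include <emmintrin.h> // SSE2
#include <stdint.h>
#include <string.h>

typedef uint32_t U32;

// compare 4 stored 4-byte prefixes against the current bytes at once ;
// returns a bit mask of which ways are worth a full length check
static int ProbeMask4(const U32 * rowPrefixes, const uint8_t * ptr)
{
    U32 cur; memcpy(&cur, ptr, 4);
    __m128i p  = _mm_loadu_si128((const __m128i *)rowPrefixes);
    __m128i c  = _mm_set1_epi32((int)cur);
    __m128i eq = _mm_cmpeq_epi32(p, c);
    return _mm_movemask_ps(_mm_castsi128_ps(eq)); // bit i set = way i matched
}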
The nice thing about the U32 prefix is you can completely resolve a len-3 match without chasing a pointer at all.
It gives you sort of proportional speed - any time you eat a cache miss, you know you have at least a len-4 match, which means hey at least you're getting a 4 byte step from that time penalty.
But yeah, it just has to be tested. You're right that only a few high hash bits (eg. 8) would resolve most collisions.
I did test the U32 prefix and it was faster than chasing the pointer on my tests (8-way/16-way), but really not by much (something like 1%). That was testing with somewhat weird data though, so no particular conclusion.
Doing direct len-3 matches from your hash table means you can't be sloppy with cleaning out stale offsets (>=1 wrap-around ago) from the hash table, or you can generate invalid encodings based on aliased match offsets.
Right, but you potentially eat more cache misses while probing large buckets. You could squeeze high_hash_bits + offset into a U32, and maybe double the number of probes per cache miss compared to interning. You'd also get more associative ways and maybe can find better matches.
Associativity effects aside, you could probably predict the # cache misses per match find based on the LZ dumps. Also, IIRC, short matches also tend to be recent/frequent, so might even not incur a cache miss even though you have to chase the offset.
BTW, I tend to call the "hash high bits" the "syndrome" on loan from CRC terminology.
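Packing the syndrome with the offset might look something like this (the bit split here is arbitrary) :

#include <stdint.h>

typedef uint32_t U32;

// low hash bits pick the row ; a few high bits (the "syndrome") ride along
// with the offset and filter out collisions without touching the window
enum { SYNDROME_BITS = 8, OFFSET_BITS = 24 };

static U32 PackEntry(U32 hash32, U32 offset)
{
    U32 syndrome = hash32 >> (32 - SYNDROME_BITS);
    return (syndrome << OFFSET_BITS) | (offset & ((1u << OFFSET_BITS) - 1));
}

static int SyndromeMatches(U32 entry, U32 hash32)
{
    return (entry >> OFFSET_BITS) == (hash32 >> (32 - SYNDROME_BITS));
}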
Quick question, to be sure I understand the words used in this post :
hash-ways : related to the number of pointers or indexes on a hash row
2nd hash : prefix hash-bits into cell ?
hash-ways : yes, # of entries in a hash row. So you look up by the hash and then iterate over the # of hash ways.
2nd-hash : this is using a different hash function, but writing in the same hash table. So it just stomps on the primary hash.
It's most effective if the 2nd hash is on a different length substring. eg. if the primary hash is on length 3, the 2nd hash is on length 4 or 5.
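A tiny sketch of that, with the primary hash on a length-3 substring and the 2nd hash on length 5, both writing into the same table. Names and hash functions are mine :

#include <stdint.h>
#include <string.h>

typedef uint32_t U32;

enum { HASH_BITS = 17 };
static U32 s_table[1 << HASH_BITS]; // one table shared by both hash functions

static U32 Hash3(const uint8_t * p) // primary hash : first 3 bytes
{
    U32 x = p[0] | (p[1] << 8) | (p[2] << 16);
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

static U32 Hash5(const uint8_t * p) // 2nd hash : first 5 bytes
{
    U32 x; memcpy(&x, p, 4);
    x ^= (U32)p[4] << 7;
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

// both hashes share the one table, so an insert from either can stomp
// an entry written by the other ; lookup probes both rows
static void InsertBoth(const uint8_t * window, U32 pos)
{
    s_table[Hash3(window + pos)] = pos;
    s_table[Hash5(window + pos)] = pos;
}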