09-30-12 - Long Range Matcher Notes

Some notes on the LRM mentioned last time.

Any time you do a string search based on hashes you will have a degeneracy problem. We saw this with the standard "Hash1b" (Hash->links) string matcher. In short, the problem is that if you have many occurrences of the same hash, exact string matching becomes very slow. The standard solution is to truncate your search at some number of maximum steps (aka "amortized hashing"), but that has a potentially unbounded cost in missed matches (though it's typically low).
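For reference, a minimal sketch of that kind of truncated ("amortized") hash-chain search; head/chain, maxSteps, and the rest of the names are illustrative, not the actual Hash1b code :

#include <cstdint>

// head[h]  = most recent position whose hash is h (-1 if none)
// chain[p] = previous position with the same hash as position p (-1 if none)
// Returns the best match length found within maxSteps, fills *pMatchPos.
int FindMatchTruncated(const uint8_t * buf, int bufSize, int pos, uint32_t hash,
                       const int * head, const int * chain,
                       int maxSteps, int * pMatchPos)
{
    int bestLen = 0;
    for (int p = head[hash]; p >= 0 && maxSteps-- > 0; p = chain[p])
    {
        int len = 0;
        while (pos + len < bufSize && buf[p + len] == buf[pos + len])
            len++;
        if (len > bestLen) { bestLen = len; *pMatchPos = p; }
    }
    // if the chain for this hash value is very long (the degenerate case),
    // everything past maxSteps is simply never looked at
    return bestLen;
}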

We have this problem with LRM and I brushed it under the rug last time. When you are doing "separate scan" (e.g. not incrementally adding to the hash table), then there's no need to have a truncated search; instead you can just have a truncated insert. That is, if you're limiting your search to 10 steps, don't add 1000 entries with the same hash and only ever search 10 of them, just add 10. In fact on my test files it's not terrible to limit the LRM search to just 1 (!).
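As a concrete example, a minimal sketch of a separate-scan build with a truncated insert; the table type, maxEntriesPerHash, and the precomputed per-position hashes are all assumptions for illustration, not the actual LRM code :

#include <cstdint>
#include <unordered_map>
#include <vector>

typedef std::unordered_map< uint32_t, std::vector<int> > LRMTable;

// hashes[pos] = hash of the hash-length window starting at pos
void BuildLRMTable(LRMTable & table, const std::vector<uint32_t> & hashes,
                   int maxEntriesPerHash)
{
    for (int pos = 0; pos < (int)hashes.size(); pos++)
    {
        std::vector<int> & entries = table[ hashes[pos] ];
        if ( (int)entries.size() < maxEntriesPerHash )
            entries.push_back(pos);
        // else : drop it; the search was going to be truncated at this depth anyway,
        // so there's no point paying memory for entries that would never be visited
    }
}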

But I'm not happy with that as a general solution because there is a potential for huge inefficiency. The really bad degenerate case looks something like this :


LRM hash length is 32 or whatever
Lots of strings in the file of length 32 have the same hash value
You only add 100 or so to the hash
One of the ones you didn't add would have provided a really good match

Typically, missing that match is not a disaster, because at the next byte you will roll to a new hash and look that up, and so on, so if you miss a 128k-long match, you will usually find a (128k - 256)-long match 256 bytes later. But it is possible to miss it for a long time if you are unlucky, and I like my inefficiency to be bounded. The more common bad case is that you get matches just a bit shorter than possible, and that happens many times, and it adds up to lost compression. e.g. say the hash length is 16 and 24-byte matches are possible, but due to the reduced search you only find 16- or 18-byte matches.

But most importantly, I don't like to throw away compression for no good reason; I want to know that the speed gained by doing it this approximate way is worth it vs. a more exact matcher.

There are a few obvious solutions with LRM :

1. Push matches backwards :

If you find a match at pos P of length L, that match might also have worked at pos (P-1) for length (L+1), but a match wasn't found there, either because of the approximate search or because hashes are only put in the dictionary every N bytes.

In practice you want to be scanning matches forward (so that you can roll the hash forward, and also so you can carry forward "last match follows" in degenerate cases), so to implement this you probably want a circular window of the next 256 or so positions with their matches in it.

This is almost free (in terms of speed and memory use), so it should really be in any LRM.
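Here's a minimal sketch of that circular window, assuming the dictionary and the current data live in one contiguous buffer and that the consumer retires positions in order; WINDOW_SIZE, LRMMatch, and the field names are all made up for illustration :

#include <cstdint>

enum { WINDOW_SIZE = 256 }; // must be a power of two

struct LRMMatch { int len; int srcPos; }; // match of len bytes against srcPos

struct MatchWindow
{
    LRMMatch slots[WINDOW_SIZE];

    MatchWindow() { for (int i = 0; i < WINDOW_SIZE; i++) { slots[i].len = 0; slots[i].srcPos = 0; } }

    LRMMatch & At(int pos) { return slots[pos & (WINDOW_SIZE-1)]; }

    // The consumer should clear At(pos).len as it retires each position,
    // so the slot can be reused when the window wraps around.

    // Record a match of length len against srcPos found at position pos,
    // then push it backwards byte by byte into earlier slots that are still
    // pending (>= minPos).
    void AddAndPushBack(const uint8_t * buf, int pos, int srcPos, int len, int minPos)
    {
        if ( len > At(pos).len ) { At(pos).len = len; At(pos).srcPos = srcPos; }

        int steps = 0;
        while ( pos > minPos && srcPos > 0 && steps < WINDOW_SIZE-1 &&
                buf[pos-1] == buf[srcPos-1] )
        {
            pos--; srcPos--; len++; steps++;
            if ( len > At(pos).len ) { At(pos).len = len; At(pos).srcPos = srcPos; }
        }
    }
};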

2. Multiple Hashes :

The simplest form of this is to do two hashes, like one of length 16 and one of length 64 (or whatever). The shorter hash is the main one you use to find most matches, the longer hash is there to make sure you can find the big matches.

That is, this is trying to reduce the chance that you miss out on a very long match due to truncating the search on the short hash. More generally, to really be scale-invariant, you should have a bunch of levels: lengths 16, 64, 256, 1024, etc. Unfortunately, implementing this the naive way (by simply having several independent hashes and tables) hurts speed by a linear factor.
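A naive sketch of the two-length version, just to show where the linear factor comes from: two independent tables, both inserted into and both searched. The hash function, MatchLen, and the table types are illustrative assumptions :

#include <cstdint>
#include <unordered_map>
#include <vector>

// a simple multiplicative hash of 'len' bytes at p, just for illustration
static uint32_t HashBytes(const uint8_t * p, int len)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < len; i++) h = (h ^ p[i]) * 16777619u;
    return h;
}

// length of the match between positions a and b, limited by 'end'
static int MatchLen(const uint8_t * buf, int a, int b, int end)
{
    int len = 0;
    while (b + len < end && buf[a + len] == buf[b + len]) len++;
    return len;
}

struct TwoLevelLRM
{
    static const int hashLens[2];                               // = { 16, 64 }
    std::unordered_map< uint32_t, std::vector<int> > tables[2]; // one table per hash length

    void Insert(const uint8_t * buf, int pos)
    {
        for (int t = 0; t < 2; t++)
            tables[t][ HashBytes(buf + pos, hashLens[t]) ].push_back(pos);
    }

    // search both tables (each truncated to maxSteps) and keep the best match;
    // the 64-byte table is what rescues long matches lost to truncation on the 16-byte table
    int FindMatch(const uint8_t * buf, int pos, int end, int maxSteps, int * pMatchPos) const
    {
        int bestLen = 0;
        for (int t = 0; t < 2; t++)
        {
            std::unordered_map< uint32_t, std::vector<int> >::const_iterator it =
                tables[t].find( HashBytes(buf + pos, hashLens[t]) );
            if ( it == tables[t].end() ) continue;

            for (int i = 0; i < (int)it->second.size() && i < maxSteps; i++)
            {
                int cand = it->second[i];
                int len = MatchLen(buf, cand, pos, end);
                if ( len > bestLen ) { bestLen = len; *pMatchPos = cand; }
            }
        }
        return bestLen;
    }
};

const int TwoLevelLRM::hashLens[2] = { 16, 64 };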

3. Multiple Non-Redundant Hashes :

The previous scheme has some obvious inefficiencies: why are we doing completely independent hash lookups when in fact you can't match a 64-long hash if you don't also match a 16-long hash?

So you can imagine that we would first do a 16-long hash, in a lookup where the hashes have been unique'd (each hash value only occurs once); then for each 16-long hash there is another table of the 64-long hashes that occurred for that 16-long hash. So then we look up in that next table; if one is found there, we look in the 256-long table, etc.

An alternative way to imagine this is as a sorted array. For each entry you store a hash of 16,64,256,etc. You compare first on the 16-long hash, then for entries where that is equal you compare on the 64-long hash, etc. So to lookup you first use the 16-long hash and do a sorted array lookup; then in each range of equal hashes you do another sorted array lookup on the 64-long hash, etc.
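A minimal sketch of the sorted-array view, assuming three levels (16, 64, 256) and std::equal_range to narrow the range one level at a time; the entry layout and names are illustrative :

#include <cstdint>
#include <vector>
#include <algorithm>
#include <utility>

struct MultiHashEntry
{
    uint32_t h[3]; // hashes of the 16-, 64- and 256-byte windows at pos
    int      pos;  // position in the dictionary
};

// full sort order : compare level 0 first, then level 1, then level 2
inline bool FullLess(const MultiHashEntry & a, const MultiHashEntry & b)
{
    for (int i = 0; i < 3; i++)
        if ( a.h[i] != b.h[i] ) return a.h[i] < b.h[i];
    return false;
}

// comparator on just one level, for narrowing an already-sorted range
struct CompareLevel
{
    int level;
    bool operator()(const MultiHashEntry & a, const MultiHashEntry & b) const
        { return a.h[level] < b.h[level]; }
};

// build step : std::sort(entries.begin(), entries.end(), FullLess);
// lookup : narrow to the entries equal to query.h[0], then query.h[1], etc.,
// stopping at the deepest level that still has candidates
std::pair<const MultiHashEntry *, const MultiHashEntry *>
Lookup(const std::vector<MultiHashEntry> & sorted, const MultiHashEntry & query)
{
    const MultiHashEntry * lo = sorted.data();
    const MultiHashEntry * hi = lo + sorted.size();
    for (int level = 0; level < 3; level++)
    {
        CompareLevel cmp = { level };
        std::pair<const MultiHashEntry *, const MultiHashEntry *> r =
            std::equal_range(lo, hi, query, cmp);
        if ( r.first == r.second )
        {
            if ( level == 0 ) return r; // not even the 16-long hash matched
            break; // deeper hash failed; keep the range from the previous level
        }
        lo = r.first; hi = r.second;
    }
    return std::make_pair(lo, hi); // all entries here share the deepest hash that matched
}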

These methods are okay, but the storage requirements are too high in the naive representation. You can in fact store them compactly, but it all gets a bit complicated.

4. Hash-suffix sort :

Of course it should occur to us that what we're doing in #3 is really just a form of coarse suffix sort! Why not just actually use a suffix sort?

One way is like this : for each (non-overlapping) 16-byte block of the file, replace it with a 4-byte U32 hash value, so the array shrinks by 4X. Now suffix-sort this array of hash values, but use a U32 alphabet instead of a U8 alphabet; that is, suffix strings only start on every 4th byte.
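A minimal sketch of that build step, under the assumptions above (non-overlapping 16-byte blocks, one U32 hash per block); the plain comparison sort is just to show the ordering, a real implementation would use a proper suffix-sort algorithm :

#include <cstdint>
#include <vector>
#include <algorithm>

static const int BLOCK_LEN = 16; // bytes hashed into each U32 symbol

// hash of one 16-byte block, just for illustration (FNV-1a style)
static uint32_t HashBlock(const uint8_t * p)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < BLOCK_LEN; i++) h = (h ^ p[i]) * 16777619u;
    return h;
}

struct HashSuffixSort
{
    std::vector<uint32_t> hashes;       // one U32 per 16-byte block of the file
    std::vector<int>      sortedSuffix; // block indices, sorted by their hash-sequence suffix

    void Build(const uint8_t * buf, int size)
    {
        int numBlocks = size / BLOCK_LEN;
        hashes.resize(numBlocks);
        for (int b = 0; b < numBlocks; b++)
            hashes[b] = HashBlock(buf + b * BLOCK_LEN);

        sortedSuffix.resize(numBlocks);
        for (int b = 0; b < numBlocks; b++)
            sortedSuffix[b] = b;

        // suffix sort over the U32 alphabet : compare symbol by symbol
        std::sort(sortedSuffix.begin(), sortedSuffix.end(),
            [this](int a, int b)
            {
                int n = (int)hashes.size();
                while (a < n && b < n)
                {
                    if ( hashes[a] != hashes[b] ) return hashes[a] < hashes[b];
                    a++; b++;
                }
                return a > b; // the exhausted (shorter) suffix sorts first
            });
    }
};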

To look up, you can use normal sorted-array lookup strategies (binary search, interpolation search, jump-ins + binary or jump-ins + interpolation, etc). So you start with a 16-byte hash to get into the suffix sort, then if you match you use the next 16-byte hash to step further, etc.
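And a sketch of the lookup side, continuing from the build sketch above; only plain binary search is shown, and since these are hashes any match found this way still has to be verified against the raw bytes :

#include <cstdint>
#include <vector>
#include <algorithm>

// hashes       : one U32 per 16-byte block of the dictionary (as built above)
// sortedSuffix : block indices sorted by their hash-sequence suffix
// query        : block hashes starting at the current position
// Returns the number of matching block hashes and fills *pMatchBlock.
int LookupHashSuffix(const std::vector<uint32_t> & hashes,
                     const std::vector<int>      & sortedSuffix,
                     const std::vector<uint32_t> & query,
                     int * pMatchBlock)
{
    // lexicographic "suffix < query" over the U32 alphabet; an exhausted suffix sorts first
    auto suffixLess = [&](int suffix, const std::vector<uint32_t> & q)
    {
        for (int i = 0; i < (int)q.size(); i++)
        {
            if ( suffix + i >= (int)hashes.size() ) return true;
            if ( hashes[suffix + i] != q[i] ) return hashes[suffix + i] < q[i];
        }
        return false;
    };

    std::vector<int>::const_iterator it =
        std::lower_bound(sortedSuffix.begin(), sortedSuffix.end(), query, suffixLess);

    // the longest shared hash prefix is at the lower bound or at the entry just before it
    int best = 0;
    *pMatchBlock = -1;
    for (int k = 0; k < 2; k++)
    {
        std::vector<int>::const_iterator cand =
            (k == 0) ? it : (it == sortedSuffix.begin() ? sortedSuffix.end() : it - 1);
        if ( cand == sortedSuffix.end() ) continue;

        int suffix = *cand, len = 0;
        while (len < (int)query.size() && suffix + len < (int)hashes.size() &&
               hashes[suffix + len] == query[len])
            len++;
        if ( len > best ) { best = len; *pMatchBlock = suffix; }
    }
    // best is in whole 16-byte blocks; hashes can collide, so verify against the raw bytes
    return best;
}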
