
09-23-11 - Morphing Matching Chain

"MMC" is a lazy-update suffix tree.

mmc - Morphing Match Chain - Google Project Hosting
Fast Data Compression MMC - Morphing Match Chain
Fast Data Compression BST Binary Search Tree
encode.ru : A new match searching structure
Ultra-fast LZ

(I'm playing a bit loose with the term "suffix tree", as most people do; strictly, a suffix tree is a very particular construction that uses the all-suffixes property and internal pointers to get O(N) construction time. What I'm really talking about is a radix string tree or Patricia-type tree; I suppose these trees are also tries.)

Some background first. You want to match strings for LZ compression. Say you decide to use a suffix tree. At each level of the tree, you have already matched L characters of the search string; you just look up your next character and descend into the part of the tree that has that character as a prefix. eg. to look up string str, if you've already descended to level L, you find the child for character str[L] (if it exists) and descend into that part of the tree. One way to implement this is to use a linked list of all the characters that have been seen at a given level (each of which points to a child at level L+1).

So your nodes have two links :


child = subtree that matches at least L+1 characters
sibling = more nodes at current level (match L characters)

the tree for "bar", "band", "bang" looks like :

b
|  (child links are vertical)
a
|
r-n  (sibling links are horizontal)
| |
* d-g
  | |
  * *

where * means leaf or end of string (and is omitted in practice).
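
To make the descent concrete, here's a minimal sketch of that lookup over a child/sibling trie. The node layout and names are mine, just for illustration, not from any particular implementation:

struct TrieNode
{
    unsigned char c;        // character this node adds at its level
    TrieNode * sibling;     // other characters seen at this level (match L chars)
    TrieNode * child;       // subtree matching at least L+1 characters
};

// returns how many characters of str were matched by walking the tree
int trie_match_len(const TrieNode * root, const unsigned char * str, int maxLen)
{
    const TrieNode * node = root;   // the level-0 sibling list
    int L = 0;
    while ( L < maxLen && node )
    {
        if ( node->c == str[L] )
        {
            L++;                    // matched one more character
            node = node->child;     // descend a level
        }
        else
        {
            node = node->sibling;   // keep scanning the current level
        }
    }
    return L;
}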

Okay, pretty simple. This structure is not used much in data compression because we generally want sliding windows, and removal of strings as they fall out of the sliding window is difficult.

(Larsson and others have shown that it is possible to do a true sliding suffix tree, but the complexity has prevented use in practice; this would be a nice project if someone wants to make an actual fast implementation of the sliding suffix trie)

Now let's look at the standard way you do a hash table for string matching in the LZ sliding window case.

The standard thing is to use a fixed-size hash table that points to a linked list of all strings that share that hash. The linked list can just be an array of positions where that hash value last occurred. So :


pos = hashTable[h] contains the position where h last occurred
chain[pos] contains the last position before pos where that same hash h occurred

The nice thing about this is that chain[] can just be an array the size of the sliding window, and you index it modulo the window size. In particular :

//search :
h = hash of desired string
next = hashTable[h];
while ( next > cur - window_size )
{
  // check match len of next vs cur
  next = chain[next & (window_size-1) ];
}

note that the links can point outside the sliding window (eg. either hashTable[] or chain[] may contain values that go outside the window), but we detect those and know our walk is done. (the key aspect here is that the links are sorted by position, so that when a link goes out of the window we are done with the walk; this means that you can't do anything like MTF on the list because it ruins the position sort order). Also note that there's no check for null needed, because we can just initialize the hash table with a negative value so that null is just a position outside the window.
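
Fleshing out the "check match len of next vs cur" step, a full search might look something like the following sketch. The sizes, the toy hash function, and MAX_MATCH are arbitrary placeholders of mine, not anything from a real codec:

#define WINDOW_SIZE  (1<<16)    // must be a power of two
#define HASH_SIZE    (1<<15)
#define MAX_MATCH    273

static int hashTable[HASH_SIZE];   // last position where each hash occurred
static int chain[WINDOW_SIZE];     // previous position with the same hash, indexed by (pos & (WINDOW_SIZE-1))

static int hash_of(const unsigned char * p)   // toy 3-byte hash, just for the sketch
{
    return ((p[0] << 12) ^ (p[1] << 6) ^ p[2]) & (HASH_SIZE-1);
}

static void init_tables(void)
{
    // "null" is just a position far outside any window :
    for (int i = 0; i < HASH_SIZE; i++)   hashTable[i] = -WINDOW_SIZE;
    for (int i = 0; i < WINDOW_SIZE; i++) chain[i]     = -WINDOW_SIZE;
}

// returns the length of the best match for buf[cur..], and its position in *pMatchPos
static int hash_chain_search(const unsigned char * buf, int bufSize, int cur, int * pMatchPos)
{
    int h = hash_of(buf + cur);
    int bestLen = 0;
    int limit = (bufSize - cur < MAX_MATCH) ? (bufSize - cur) : MAX_MATCH;

    int next = hashTable[h];
    while ( next > cur - WINDOW_SIZE )      // a link outside the window ends the walk
    {
        // check match len of next vs cur :
        int len = 0;
        while ( len < limit && buf[next + len] == buf[cur + len] )
            len++;
        if ( len > bestLen ) { bestLen = len; *pMatchPos = next; }

        next = chain[ next & (WINDOW_SIZE-1) ];
    }
    return bestLen;
}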

To add to the hash table when we slide the window we just tack onto the list :


// add string :
chain[ cur & (window_size-1) ] = hashTable[h];
hashTable[h] = cur;

and there's the sort of magic bit - we also removed a node right there. We actually popped the node off the back of the sliding window. That was okay because it must have been the last node on its list, so we didn't corrupt any of our lists.

That's it for hash-chain review. It's really nice how simple the add/remove is, particularly for "Greedy" type LZ parsers where you do Insert much more often than you do Find. (there are two general classes of LZ parsers - "Optimal", which generally do a Find & Insert at every position, and "Greedy", which when they find a match, step ahead by the match len and only do Inserts).
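
For concreteness, here's a skeleton of that "Greedy" pattern, continuing the hash-chain sketch above (same hashTable / chain / hash_of / WINDOW_SIZE). emit_literal and emit_match are hypothetical stand-ins for whatever the encoder does with its output, and the minimum match length of 3 is an arbitrary choice:

extern void emit_literal(unsigned char c);          // hypothetical output functions
extern void emit_match(int distance, int length);

static void hash_chain_insert(const unsigned char * buf, int cur)
{
    int h = hash_of(buf + cur);
    chain[ cur & (WINDOW_SIZE-1) ] = hashTable[h];
    hashTable[h] = cur;
}

static void greedy_parse(const unsigned char * buf, int bufSize)
{
    int pos = 0;
    while ( pos < bufSize )
    {
        if ( pos + 3 > bufSize )        // not enough bytes left to hash
        {
            emit_literal(buf[pos]); pos++; continue;
        }

        int matchPos = 0;
        int matchLen = hash_chain_search(buf, bufSize, pos, &matchPos);  // Find
        hash_chain_insert(buf, pos);                                     // Insert

        if ( matchLen >= 3 )
        {
            emit_match(pos - matchPos, matchLen);
            // inside the match we only Insert, we never Find :
            for (int i = 1; i < matchLen && pos + i + 3 <= bufSize; i++)
                hash_chain_insert(buf, pos + i);
            pos += matchLen;
        }
        else
        {
            emit_literal(buf[pos]);
            pos++;
        }
    }
}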

So, can we get the advantages of hash chains and suffix trees?

Well, we need another idea, and that is "lazy updates". The idea is that we let our tree get out of sorts a bit, and then fix it the next time we visit it. This is a very general idea and can be applied to almost any tree type. I think the first time I encountered it was in the very cool old SurRender Umbra product, where they used lazy updates of their spatial tree structures. When objects moved or spawned they got put on a list on a node. When you descend the tree later on looking for things, if a node has child nodes you would take the list of objects on the node and push them to the children - but then you only descend to the child that you care about. This can save a lot of work under certain usage patterns; for example if objects are spawning off in some part of the tree that you don't visit, they just get put in a high up node and never pushed down to the leaves.

Anyhoo, so our suffix tree requires a node with two links. Like the hash table we will implement our links just as positions :

struct SuffixNode { int sibling; int child; };

Like the hash table, our siblings are in order of occurrence, so when we see a position that's out of the window we know we are done walking.

Now, instead of maintaining the suffix tree when we add a node, we're just going to tack the new node on the front of the list. We will then percolate in an update the next time we visit that part of the tree. So when you search the tree, you can first encounter some unmaintained nodes before you get to the maintained section.
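
So the add is exactly as cheap as the hash chain's. Here's a sketch of that lazy add, using the SuffixNode above; the names and the "-1 means no child" convention are mine:

#define WINDOW_SIZE  (1<<16)

struct SuffixNode { int sibling; int child; };

static SuffixNode nodes[WINDOW_SIZE];    // one record per position, indexed by (pos & (WINDOW_SIZE-1))

// tack the string at position cur onto the head of the list it belongs to;
// returns the new head (= cur).  Reusing slot (cur & (WINDOW_SIZE-1)) silently
// recycles the record of the position that just fell out of the window.
static int lazy_add(int * pHead, int cur)
{
    SuffixNode * n = &nodes[ cur & (WINDOW_SIZE-1) ];
    n->sibling = *pHead;   // old head becomes our sibling
    n->child   = -1;       // unmaintained : no child link yet
    *pHead = cur;
    return cur;
}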

For example, say we had "bar" and "band" in our tree and we add "bang" at level 2: we just stick it on the head and don't descend the tree to put it in the right place :


b
|  (child links are vertical)
a
|
NG-r-n  (sibling links are horizontal)
     |
     d

(caps indicates unmaintained portion)

Now, the next time we visit the "ba" part of the tree in a retrieval, we also do some maintenance. We remember the first time we see each character (using a [256] array), and if we see that same character again, we know it's because part of the tree was not maintained.

Say we come in looking for "bank". If we see a node with an "n" (that's a maintained n), we know we are done and we go to the child link - there can't be any more n's behind that node. If we see an "N" (no child link), we remember it but have to keep walking siblings. We might see more "N"s, and we are done when we see an "n". Then we update the links. We remove the "n" (of band) from the sibling list and connect it to the "N" instead :


b
|  (child links are vertical)
a
|
n-r
|   
g---d

And this is the essence of MMC (lazy update suffix trie = LUST).
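
Here's a sketch of that single-level fix-up, following the example above literally. This is just my reading of it: it handles only the simple one-duplicate case, uses my own names throughout, and ignores the multi-level complications described below (real MMC does this maintenance during the match search itself, not as a separate pass):

#define WINDOW_SIZE  (1<<16)

struct SuffixNode { int sibling; int child; };
static SuffixNode nodes[WINDOW_SIZE];        // one record per position
static const unsigned char * window;         // the data; positions index into it

// walk the sibling list at level L (all nodes here match L characters) :
// the first node seen for each character stays in the list; if the maintained
// node for that character turns up later, unlink it from the sibling list and
// connect it to that first sighting, which becomes the new maintained node.
static void lazy_fixup_level(int head, int L, int cur)
{
    int firstSeen[256];
    for (int i = 0; i < 256; i++) firstSeen[i] = -1;

    int prev = -1;
    int node = head;
    while ( node > cur - WINDOW_SIZE )           // out-of-window link ends the walk
    {
        SuffixNode * n = &nodes[ node & (WINDOW_SIZE-1) ];
        int sib = n->sibling;
        int c = window[node + L];                // character this node adds at level L

        if ( n->child < 0 )                      // unmaintained ("N") : remember the first one
        {
            if ( firstSeen[c] < 0 ) firstSeen[c] = node;
            prev = node;
        }
        else if ( firstSeen[c] >= 0 )            // maintained ("n") behind an earlier "N"
        {
            // remove it from the sibling list ... (prev is valid : we saw the "N" first)
            nodes[ prev & (WINDOW_SIZE-1) ].sibling = sib;
            // ... and connect it to the "N", which is now the maintained node :
            nodes[ firstSeen[c] & (WINDOW_SIZE-1) ].child = node;
        }
        else                                     // maintained and unique at this level : leave it
        {
            prev = node;
        }

        node = sib;
    }
}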

A few more details are significant. Like the simple hash chain, we always add nodes to the front of the list. The lazy update also always adds nodes to the head - that is, the branch that points to more children is always at the most recent occurrence of that substring. eg. if you see "danger" then "dank" then "danish", you know that the "dan" node is either unmaintained, or points at the most recent occurrence of "dan" (the one in "danish"). What this means is that the simple node removal method of the hash chain works - when the window slides, we just let nodes fall out of the range that we consider valid and they drop off the end of the tree. We don't have to worry about removing a node that is internal to the tree, because the node falling out of the window is always the last one on its list.

In practice the MMC incremental update becomes complex because you may be updating multiple levels of the tree at once as you scan. When you first see the "NG" you haven't seen the "n" yet, and you don't want to scan ahead in the list right away - you want to process it when you see it; so you initially promote NG to a maintained node, but link it to a temporary invalid link that points back to the previous level. Then you keep walking the list, and when you see the "n" you fix up that link to complete the maintenance.

It does appear that MMC is a novel and interesting way of doing a suffix trie for a sliding window.

3 comments:

Anonymous said...

Yeah, this was the one I was telling you about that I'd thought at first I'd gotten from your blog.

cbloom said...

Yeah it was mentioned here some time ago (it's by the guy who did LZ4, Yann Collet) but I didn't actually understand it until today.

cbloom said...

Early testing indicates that MMC is pretty solid. I'm testing some other ideas as well, though. Will post a followup soon.
