9/24/2011

09-24-11 - Suffix Tries 1

To make terminology clear, I'm going to use "trie" to mean a tree in which, as you descend, the length of the character match always gets longer, and "suffix trie" to indicate the special case where a trie is made from all suffixes *and* there are "follow" pointers (more on this later).

Just building a trie for LZ string searching is pretty easy. Using the linked-list method (which certainly has disadvantages), internal nodes only need a child & sibling pointer, and some bit of data. If you always descend one char at a time that data is just one char. If you want to do "path compression" (multi-char steps in a single link) you need some kind of pointer + length.
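
As a concrete sketch of those node layouts (my own field names and types, not code from this post), something like:

    #include <cstdint>

    // single char per step : roughly the "10 bytes or so" internal node
    struct TrieNode1
    {
        TrieNode1 * child;    // first child (head of the linked list of children)
        TrieNode1 * sibling;  // next sibling in the parent's child list
        uint8_t     c;        // the one character labeling the edge into this node
    };

    // path compressed : the edge into this node covers several characters,
    // referenced as a (pointer, length) into the input buffer
    struct TrieNodePC
    {
        TrieNodePC *    child;
        TrieNodePC *    sibling;
        const uint8_t * edgePtr;  // start of the edge label in the input
        int             edgeLen;  // number of characters on this edge
    };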

(it's actually much easier to write the code with path compression, since when you add a new string you only have to find the deepest match in the tree then add one node; with single char steps you may have to add many nodes).
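
A minimal sketch of that insert (my own code; for brevity it assumes the new suffix diverges at a node boundary, so the edge-split case is only noted in a comment):

    #include <cstdint>

    struct Node
    {
        Node *          child   = nullptr;
        Node *          sibling = nullptr;
        const uint8_t * edgePtr = nullptr;  // edge label = edgePtr[0..edgeLen)
        int             edgeLen = 0;
    };

    // Insert the suffix buf[pos..bufLen) into a path-compressed trie :
    // descend as far as the tree matches, then hang one new leaf there.
    static void insert_suffix(Node * root, const uint8_t * buf, int pos, int bufLen)
    {
        Node * cur = root;
        int    i   = pos;
        while (i < bufLen)
        {
            // find the child whose edge starts with the next character
            Node * ch = cur->child;
            while (ch && ch->edgePtr[0] != buf[i])
                ch = ch->sibling;
            if (!ch)
                break;  // nothing matches : the new suffix diverges here
            // walk the edge as far as it matches
            int m = 0;
            while (m < ch->edgeLen && i + m < bufLen && ch->edgePtr[m] == buf[i + m])
                m++;
            if (m < ch->edgeLen)
                return; // mismatch mid-edge : real code splits this edge; not handled in this sketch
            cur = ch;
            i  += m;    // whole edge matched : keep descending
        }
        if (i < bufLen)
        {
            // add exactly one new leaf covering the remainder of the suffix
            Node * leaf   = new Node;
            leaf->edgePtr = buf + i;
            leaf->edgeLen = bufLen - i;
            leaf->sibling = cur->child;
            cur->child    = leaf;
        }
    }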

So for a file of length N, internal nodes are something like 10 bytes, and you need at most N nodes. Leaves can be smaller or even implicit.

With just a normal trie, you have a nice advantage for optimal parsing, which is that when you find the longest match, you also automatically walk past all the shorter matches. At each node you could store the most recent position at which that substring was seen, so you can find the lowest offset for each length of match for free. (this requires more storage in the nodes plus a lot more memory writes, but I think those memory writes are basically free since they are to nodes that are in cache anyway).
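
A sketch of that descent (my own layout, single-char steps, with a lastPos field in each node; the insert half of the operation is omitted):

    #include <cstdint>
    #include <vector>

    struct Node
    {
        Node *  child   = nullptr;
        Node *  sibling = nullptr;
        uint8_t c       = 0;
        int     lastPos = -1;   // most recent position this substring was seen at
    };

    struct MatchAtLen { int len; int offset; };

    // Descend matching buf[pos..) ; at each depth the node's lastPos is the most
    // recent previous occurrence of that prefix, hence the lowest offset for
    // that match length.  Also stamps lastPos so later searches see "pos".
    static void find_matches(Node * root, const uint8_t * buf, int pos, int bufLen,
                             std::vector<MatchAtLen> & out)
    {
        Node * cur = root;
        int    len = 0;
        while (pos + len < bufLen)
        {
            Node * ch = cur->child;
            while (ch && ch->c != buf[pos + len])
                ch = ch->sibling;
            if (!ch)
                break;
            len++;
            if (ch->lastPos >= 0)
                out.push_back({ len, pos - ch->lastPos });  // one match per length
            ch->lastPos = pos;  // the extra writes ; the node is already in cache
            cur = ch;
        }
    }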

The Find and Insert operations are nearly identical so they of course should be done together.

A trie could be given a "lazy update". What you do is, on Insert, you just tack the nodes on somewhere low down in the tree. Then on Find, when you encounter nodes that have not been fully inserted, you pick them up and carry them with you as you descend. Whenever you take a path that your baggage can't take, you leave that baggage behind. This could have advantages under certain usage patterns, but I haven't actually tried it.

But it's only when you get the "follow" pointers that a suffix trie really makes a huge difference.

A follow pointer is a pointer in the tree from any node (substring) to the location in the tree of the substring without the first character. That is, if you are at "banana" in the tree, the follow pointer should point at the location of "anana" in the tree.

When you're doing LZ compression and you find a match at pos P of length L, you know that at pos P+1 there must be a match of at least length L-1 , simply by using the same offset and matching one less character. (there could be a longer match, though). So, if you know the suffix node that was used to find the match of length L at pos P, then you can jump in directly to match of length L-1 at the next position.
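
That carry-over guarantee is easy to check in isolation; here is a tiny standalone check of it (plain string compares, no trie; the buffer and positions are just made-up values):

    #include <cassert>

    // If buf[P..P+L) matches buf[P-D..P-D+L), then dropping the first character
    // gives a match of length L-1 at P+1 with the same offset D.
    static int match_len(const char * a, const char * b, int maxLen)
    {
        int l = 0;
        while (l < maxLen && a[l] == b[l])
            l++;
        return l;
    }

    int main()
    {
        const char buf[] = "abcabcabx";
        const int  N = (int)sizeof(buf) - 1;  // 9 chars, ignore the terminator
        const int  P = 3, D = 3;              // match at position 3 against position 0
        int L  = match_len(buf + P,     buf + P - D,     N - P);
        int L2 = match_len(buf + P + 1, buf + P + 1 - D, N - (P + 1));
        assert(L2 >= L - 1);                  // the carried-over match of length L-1
        return 0;
    }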

This is huge. Consider for example the fully degenerate case, a file of length N of all the same character. (yes obviously there are special case solutions to the fully degenerate case, but that doesn't fix the problem, it just makes it more complex to create the problem). A naive string matcher is actually O(N^3) !!

For each position in the file (*N)
    Consider all potential matches (*N)
        Compare all the characters in that potential match (*N)
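
Spelled out as code (a throwaway sketch, not from the post), those three factors of N are just three nested loops: two inside the matcher, plus calling it at every position.

    #include <cstddef>

    // Naive matcher for one position : try every earlier start and compare
    // characters one by one.  Calling this at every position in the file is
    // the third factor of N.
    static size_t naive_best_match(const unsigned char * buf, size_t pos, size_t n,
                                   size_t * bestOff)
    {
        size_t bestLen = 0;
        for (size_t start = 0; start < pos; start++)          // all potential matches (*N)
        {
            size_t len = 0;
            while (pos + len < n && buf[start + len] == buf[pos + len])  // compare chars (*N)
                len++;
            if (len > bestLen)
            {
                bestLen  = len;
                *bestOff = pos - start;
            }
        }
        return bestLen;
    }
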
A normal trie makes this O(N^2), because comparing the characters in the string is combined with finding all potential matches, so the tree descent + string compares combined are just O(N) per position.

But a true suffix trie with follow pointers is only O(N) for the whole parse. Somewhere early on it would find a match of length O(N) and then each subsequent position just finds a match of L-1 in O(1) time using the follow pointer. (the O(N) whole parse only works if you are just finding the longest length at each position; if you are doing the optimal parse where you find the lowest offset for each length it's O(N^2))

Unfortunately, it seems that when you introduce the follow pointer this is when the code for the suffix trie gets rather tricky. It goes from 50 lines of code to 500 lines of code, and it's hard to do without introducing parent pointers and lots more tree maintenance. It also makes it way harder to do a sliding window.

7 comments:

Cyan said...

Afaiu, the "follow pointer" provides a great speed boost.

It seems logical. However, I guess that for this to be correct the suffix trie can no longer be timely ordered (meaning that sibling and child pointers always go to older positions)?

Here is an example where the follow pointer seems to fail to find the best match length :

Imagine there is a match at
BAND; (4 characters)
then the "follow pointer" should point to
AND; (3 characters)
but maybe, in between, there was a position with :
AND THIS ONE IS BETTER; (22 characters)

cbloom said...

No, a correct implementation of follow pointers will point to the *node* for "AND", not the leaf for "AND;", so if there are longer matches possible it will find them.

If a follow pointer is originally made pointing at a leaf, it must be updated to point to a node when that leaf is later split.

For example say "bands..." is originally put in the tree as a leaf, the follow pointer will point to "ands.." also a leaf. Later on "andy.." is seen and the "and" prefix is split into a node that branches to 's' and 'y'. Now the follow pointer for "band" needs to be updated to point to the node for "and".

This requires some maintenance structures but doesn't change the O().
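
A sketch of that maintenance (my own structures; in particular the single followedBy back-pointer is an assumption, real code might need a list of followers or a deferred fixup pass):

    #include <cstdint>

    struct Node
    {
        Node *          child      = nullptr;
        Node *          sibling    = nullptr;
        Node *          follow     = nullptr;  // follow/suffix link out of this node
        Node *          followedBy = nullptr;  // assumption : the (single) node following us
        const uint8_t * edgePtr    = nullptr;
        int             edgeLen    = 0;
    };

    // Split "leaf" (a child of "parent") so that the first "keep" characters of
    // its edge move onto a new internal node; e.g. splitting "ands.." at keep=3
    // makes the node for "and", and whoever followed the old leaf (say "band")
    // is redirected to that node.  Returns the new internal node.
    static Node * split_leaf(Node * parent, Node * leaf, int keep)
    {
        Node * mid   = new Node;
        mid->edgePtr = leaf->edgePtr;
        mid->edgeLen = keep;

        // leaf keeps the remainder of its edge and hangs under mid
        leaf->edgePtr += keep;
        leaf->edgeLen -= keep;
        mid->child     = leaf;

        // splice mid into parent's child list in place of leaf
        mid->sibling  = leaf->sibling;
        leaf->sibling = nullptr;
        if (parent->child == leaf)
            parent->child = mid;
        else
        {
            Node * s = parent->child;
            while (s->sibling != leaf)
                s = s->sibling;
            s->sibling = mid;
        }

        // the maintenance step : whoever followed the old leaf now follows mid
        if (leaf->followedBy)
        {
            leaf->followedBy->follow = mid;
            mid->followedBy          = leaf->followedBy;
            leaf->followedBy         = nullptr;
        }
        return mid;
    }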

Cyan said...

OK, so a node does not point towards a position in the input stream, only a leaf does.

Anonymous said...

Is there a way to create the follow pointer efficiently other than looking up the following chars?

If you were doing M queries and N inserts and M >> N, then pushing the work into the insert would make sense. But in LZ compression, you have N suffixes and N positions at which you try to find the best match, so I would think you're looking at N inserts and N queries.

If the follow pointer makes the N queries O(1) at the cost of making the creation of the follow pointer take as much time as the query would have taken, you're not saving anything.

But maybe there's some trick I'm missing.

cbloom said...

Yeah, so :

1. In suffix tries the insert & find are really the same operation so the only thing that makes sense is to make a combined "find_and_insert".

2. Follow pointer maintenance and such really only work great if you are doing "find_and_insert" on every position.

3. So, here's how it works :

at position N I want to do a "find_and_insert". I start in the tree by using the follow pointer from the match at position N-1.

As I do my match at position N I make some new nodes. Those are missing follow pointers, and I just set them aside (I don't immediately fill them out). (with path compression this is only ever one new node)

I now move on to position N+1 and use the follow pointer from position N to start my search. I'm now in the right part of tree to fill out the follow pointer that was missing which I set aside in the previous search.

I actually understand it now as a funny way of sequentially walking through the tree. At every position N you are filling out the previous follow pointer, using the current follow pointer, and trying to find the next follow pointer. You are almost never walking the tree from the root to the leaves, instead you're jumping around on these links.
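
Here is that walk as a loop skeleton (a sketch of mine; all of the actual descent/insert work is hidden in a placeholder step() callback, which is assumed to jump in via the previous match's follow pointer, re-walk the guaranteed length, then extend and insert):

    #include <functional>

    struct Node
    {
        Node * follow = nullptr;   // follow / suffix link ; filled in lazily
        // ... children, edge data, etc. ...
    };

    struct StepResult
    {
        Node * atPrevDepth;  // node reached after re-walking the guaranteed length
        Node * matchNode;    // deepest node reached for this position's match
        int    matchLen;     // match length found at this position
        Node * newNode;      // node created by this insert ; its follow is still unknown
    };

    // placeholder (an assumption of this sketch) for the real find_and_insert
    using StepFn = std::function<StepResult(Node * prevMatch, int prevLen, int pos)>;

    static void parse_all(int bufLen, Node * root, const StepFn & step)
    {
        Node * prevMatch = root;
        int    prevLen   = 0;
        Node * pending   = nullptr;   // node whose follow pointer was set aside last time

        for (int pos = 0; pos < bufLen; pos++)
        {
            StepResult r = step(prevMatch, prevLen, pos);

            // fill out the follow pointer that was set aside at the previous
            // position ; we are now standing in the right part of the tree for it
            if (pending)
                pending->follow = r.atPrevDepth;

            pending   = r.newNode;
            prevMatch = r.matchNode;
            prevLen   = r.matchLen;
        }
    }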

I actually think it's closely related to the BWT and the tree-structured BWT coders by Maniscalco and others, but I haven't quite grokked that yet.

cbloom said...

It should be reasonably easy to see that you can walk an entire file in O(N).

Any time you fail to grow the match past what the follow gives you (that is, you had lastml at the previous spot and you now have lastml-1), that position is done in O(1).

If you do find a longer match, say it's longer by L, that is O(L). But you carry that L forward to the next spot. So if you add up all the match grow lengths, the sum of all of them is O(N).

Therefore the whole walk is O(N).
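
Writing that argument out (my notation, not from the post), with ml_i the match length at position i, ml_0 = 0, and g_i >= 0 the amount the match grows past the guaranteed carry-over:

    ml_i = ml_{i-1} - 1 + g_i , \qquad g_i \ge 0
    \sum_{i=1}^{N} g_i = ml_N - ml_0 + N \le 2N
    \text{total work} = \sum_{i=1}^{N} O(1 + g_i) = O\Big( N + \sum_{i=1}^{N} g_i \Big) = O(N)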

Anonymous said...

Oh wait, I see. It's contingent on this idea that we're always tracking two places at once, the current insertion and the best match. Then we can use the follow from the best match to guide the insertion.

(Well, even if we didn't want to track a best match, you could still track the best match just to run this algorithm faster. Weird.)
