cbloom rants: 08-22-13 - Sketch of Suffix Trie for Last Occurrence

I don't usually like to write about algorithms that I haven't actually implemented yet, but it seems in my old age that I will not actually get around to doing lots of things that I think about, so here goes one.

I use a Suffix Trie for string matching for my LZ compressors when optimal parsing.

reminder: Suffix Tries are really the super-awesome solution, but only for the case that you are optimal parsing not greedy parsing, so you are visiting every byte, and for large windows (sliding window Suffix Tries are not awesome). see : LZ String Matcher Decision Tree (w/ links to Suffix Trie posts)

Something has always bothered me about it. Almost the entire algorithm is this sweet gem of computer science perfection with no hackiness and a perfect O(N) running time. But there's one problem.

A Suffix Trie really just gives you the longest matching substring in the window. It's not really about the *location* of that substring. In particular, the standard construction using pointers to the string that was inserted will give you the *first* occurrence of each substring. For LZ compression what you want is the *last* occurrence of each substring.

(I'm assuming throughout that you use path compression and your nodes have pointers into the original window. This means that each step along the original window adds one node, and that node has the pointer to the insertion location.)

In order to get the right answer, whenever you do a suffix query and find the deepest node that you match, you should then visit all children (meaning the whole subtree below that node) and see if any of them have a more recent pointer. Say you're at depth D, all children at depth > D are also substring matches of the same first D bytes, so those pointers are equally valid string matches, and for LZ you want the latest one.
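As a rough illustration of that query-side fix, here is a toy sketch in Python. It assumes a plain uncompressed trie (no path compression, unlike the real structure), and the names `Node`, `insert_suffix`, `deepest_match` and `latest_in_subtree` are made up for this example. Each node keeps the pointer of the insertion that created it, so it holds the first occurrence; the query then scans the subtree for the largest pointer.

```python
class Node:
    def __init__(self, ptr):
        self.ptr = ptr        # window position of the suffix that created this node
        self.children = {}    # next byte -> Node


def insert_suffix(root, data, pos):
    """Insert the suffix starting at pos; existing nodes keep their old pointer."""
    node = root
    for i in range(pos, len(data)):
        child = node.children.get(data[i])
        if child is None:
            child = Node(pos)
            node.children[data[i]] = child
        node = child


def deepest_match(root, data, pos):
    """Walk down as far as the string at pos matches; return (node, depth)."""
    node, depth = root, 0
    for i in range(pos, len(data)):
        child = node.children.get(data[i])
        if child is None:
            break
        node, depth = child, depth + 1
    return node, depth


def latest_in_subtree(node):
    """Largest pointer in the subtree below node (including node itself)."""
    best = node.ptr
    stack = list(node.children.values())
    while stack:
        n = stack.pop()
        best = max(best, n.ptr)
        stack.extend(n.children.values())
    return best


data = b"abcabdab"
root = Node(-1)
for p in range(6):                 # positions 0..5 are already in the window
    insert_suffix(root, data, p)

node, depth = deepest_match(root, data, 6)   # query "ab" at position 6
print(depth, node.ptr, latest_in_subtree(node))
```

Here the node for "ab" still points at position 0, while the subtree scan finds the later occurrence at position 3. The scan over the whole subtree is the cost that makes this impractical on a real window.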

An equivalent alternative is instead of searching all children on query, you update all parents on insertion. Any time you insert a new node, go back to all your parents and change their pointers to your current pointer, because your pointer must match them all up to their depth, and it's a larger value.
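A minimal sketch of the insertion-side variant, again on a toy uncompressed trie and with hypothetical names (`Node`, `insert_suffix`, `deepest_match`). The only change from the standard construction is that nodes carry a parent link, and every insertion walks back up and overwrites the pointer in each ancestor.

```python
class Node:
    def __init__(self, ptr, parent=None):
        self.ptr = ptr
        self.parent = parent
        self.children = {}    # next byte -> Node


def insert_suffix(root, data, pos):
    """Insert the suffix at pos, then push pos up to every ancestor."""
    node = root
    for i in range(pos, len(data)):
        child = node.children.get(data[i])
        if child is None:
            child = Node(pos, parent=node)
            node.children[data[i]] = child
        node = child
    # Every ancestor spells a prefix of this suffix, and pos is the most
    # recent position seen so far, so it replaces whatever was stored there.
    while node is not root:
        node.ptr = pos
        node = node.parent


def deepest_match(root, data, pos):
    """Walk down as far as the string at pos matches; return (node, depth)."""
    node, depth = root, 0
    for i in range(pos, len(data)):
        child = node.children.get(data[i])
        if child is None:
            break
        node, depth = child, depth + 1
    return node, depth


data = b"abcabdab"
root = Node(-1)
for p in range(6):
    insert_suffix(root, data, p)

node, depth = deepest_match(root, data, 6)   # query "ab" at position 6
print(depth, node.ptr)
```

With the full parent walk, the matched node already holds the latest occurrence (position 3 here), so the query needs no subtree scan. The price is the O(depth) walk on every insertion.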

Of course this ruins the speed of the suffix trie so you can't do that.

In Oodle I use limited parent updates to address this issue. Every time I do a query/insert (they always go together in an optimal parse, and the insert is always directly under the deepest match found), I take the current pointer and update N steps up the parent links. I tested various values of N against doing full updates and found that N=32 gave me indistinguishable compression ratios and very little speed hit.
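The limited update can be sketched as below. This is not the Oodle code, just an illustration under assumptions: `Node` and `push_up_limited` are hypothetical names, the tree is reduced to a single chain of parent links, and the count of N steps is taken to include the starting node.

```python
class Node:
    def __init__(self, ptr, parent=None):
        self.ptr = ptr
        self.parent = parent


def push_up_limited(node, ptr, max_steps=32):
    """Overwrite the pointer in at most max_steps nodes, walking toward the root."""
    steps = 0
    while node is not None and steps < max_steps:
        node.ptr = ptr
        node = node.parent
        steps += 1
    return steps


# A chain of depth 100: chain[0] is the root, chain[100] is the deepest node.
chain = [Node(0)]
for _ in range(100):
    chain.append(Node(0, parent=chain[-1]))

touched = push_up_limited(chain[100], ptr=500)
print(touched, chain[69].ptr, chain[68].ptr)
```

Nodes within 32 steps of the insertion point get the new pointer; anything closer to the root keeps a stale (older) pointer, which is the inexactness the heuristic accepts.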

(any fixed value of N preserves the O(N) of the suffix trie, it's just a constant multiplier). (you need to walk up to parents anyway if you want to find shorter matches at lower offsets; the normal suffix lookup just gives you the single longest match).

So anyway, that heuristic seems to work okay, but it just bothers me because everything else about the Suffix Trie is so pure with no tweak constants in it, and then there's this one hack. So, can we solve this problem exactly?

I believe so, but I don't quite see the details yet. The idea goes like this :

I want to use the "push pointer up to parents method". But I don't actually want to update all parents for each insertion. The key to being fast is that many of the nodes of the suffix trie will never be touched again, so we want to kind of virtually mark those nodes as dirty, and they can update themselves if they are ever visited, but we don't do any work if they aren't visited. (BTW by "fast" here I mean the entire parse should still be O(N) or O(NlogN) but not fall to O(N^2) which is what you get if you do full updates).

In particular in the degenerate match cases, you spend all your time way out at the leaves of the suffix trie chasing the "follows" pointer, you never go back to the root, and many of the updates overwrite each other in a trivial way. That is, you might do substring "xxaaaa" at "ptr", and then "xxaaaaa" at "ptr+1" ; the update of "ptr" back up the tree will be entirely overwritten by the update from "ptr+1" (since it also serves as an "xxaa" match and is later), so if you just delay the update it doesn't need to be done at all.

(in the end this whole problem boils down to a very simple tree question : how do you mark a walk from a leaf back to the root with some value, such that any query along that walk will get the value, but without actually doing O(depth) work if those nodes are not touched? Though it's not really that question in general, because in order to be fast you need to use the special properties of the Suffix Trie traversal.)

My idea is to use "sentries" (this is a bit like skip-lists or something). In addition to the "parent" pointer, each node has a pointer to the preceding "sentry". Sentry steps take you >= 1 step toward the root, and the step distance increases. So stepping up the sentry links might take you 1,1,2,4,... steps towards the root, i.e. you reach the root in about log(depth) steps.
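One possible sentry layout with those properties, as a sketch only: assume the sentry of a node at depth d sits at the depth you get by clearing the lowest set bit of d. That is an assumption made for this example (it gives steps like 1,4,16 rather than exactly 1,1,2,4), and `sentry_depth` and `sentry_chain` are hypothetical names. It works purely on depths and does not show how to find the sentry node in an actual tree.

```python
def sentry_depth(depth):
    """Depth of the preceding sentry: clear the lowest set bit of depth."""
    return depth & (depth - 1)


def sentry_chain(depth):
    """Depths visited when following sentry links from depth up to the root (0)."""
    chain = []
    while depth > 0:
        depth = sentry_depth(depth)
        chain.append(depth)
    return chain


def sentry_steps(depth):
    """Sizes of the individual sentry steps taken from depth to the root."""
    stops = [depth] + sentry_chain(depth)
    return [hi - lo for hi, lo in zip(stops, stops[1:])]


print(sentry_chain(21))   # depths of the sentries above a node at depth 21
print(sentry_steps(21))   # how far each sentry step moves toward the root
```

Under this layout every step is at least 1, each step is larger than the one before, and the number of steps equals the number of set bits in the depth, so it is at most about log2(depth) + 1.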

When you insert a new node, instead of walking all parents and changing them to your pointer, you walk all sentries and store your pointer as a pending update.

When you query a node, you walk to all sentries and see if any of them has a more recent pointer (that is, a lower offset). This effectively finds if any of your children did an update that you need to know about.

The pointer that you place in the sentry is really a "pending update" marker. It means that update needs to be applied from that node up the tree to the next sentry (ADD: I think you also need to store the range that it applies to, since a large-step range can get broken down to smaller ranges by updates). You know what branch of the tree it applies to because the pointer is the string and the string tells you what branch of the tree to follow.

The tricky bit happens when you set the pointer in the sentry node, there may be another pointer there from a previous insertion that is still pending update. You need to apply the previous pending update before you store your new pointer in the pending update slot.

Say a node contains a pending update with the pointer "a", and you come in and want to mark it with "b". You need to push the "a" update into the range that it applies to, so that you can set that node to be pending with a "b".

The key to speed is that you only need to push the "a" update where it diverges from "b". For example if the substring of "a" and "b" is the same up to a deeper sentry that contains "b" then you can just throw away the "a" pending update, the "b" update completely replaces it for that range.

Saying it all again :

You have one pointer update "a" that goes down a branch of the tree. You don't want to actually touch all those nodes, so you store it as applying to the whole range. You do a later pointer update "b" that goes down a branch that partially overlaps with the "a" branch. The part that is "a" only you want to leave as a whole range marking, and you do a range-marking for "b". You have to find the intersection of the two branches, and then the area where they overlap is again range-marked with "b" because it's newer and replaces "a". The key to speed is that you're marking big ranges of nodes, not individual nodes. My proposal for marking the ranges quickly is to use power-of-2 sentries, to mark a range of length 21 you would mark spans of length 16+4+1 kind of a thing.
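A small sketch of the bookkeeping this implies, under heavy simplification. It only computes which depth ranges each pointer ends up owning and how each range length breaks into power-of-2 spans; it does not touch any tree nodes, and it does not model where the sentries actually sit. The names `common_prefix_len`, `pow2_spans` and `split_pending` are hypothetical.

```python
def common_prefix_len(data, a, b, limit):
    """Leading bytes shared by the strings at a and b (with a < b), capped at limit."""
    n = 0
    while n < limit and b + n < len(data) and data[a + n] == data[b + n]:
        n += 1
    return n


def pow2_spans(length):
    """Break a range length into power-of-2 spans, largest first."""
    spans = []
    bit = 1 << max(length.bit_length() - 1, 0)
    while bit:
        if length & bit:
            spans.append(bit)
        bit >>= 1
    return spans


def split_pending(data, a, depth_a, b, depth_b):
    """Split an older update a and a newer update b into range markings."""
    shared = common_prefix_len(data, a, b, min(depth_a, depth_b))
    return {
        "b_replaces_a": pow2_spans(shared),      # overlap near the root: newer b wins
        "a_only": pow2_spans(depth_a - shared),  # below the divergence a stays marked
        "b_only": pow2_spans(depth_b - shared),  # b's own branch below the divergence
    }


data = b"abcdefg-abcdxyz"
result = split_pending(data, a=0, depth_a=7, b=8, depth_b=7)
print(result)
print(pow2_spans(21))
```

In this example the two strings share "abcd", so the first four levels are re-marked with "b", and each pointer keeps a three-level range of its own below the divergence.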

Maybe some drawings are clearer. Here we insert pointer "a", and then later do a query with pointer "b" that shares some prefix with "a", and then insert "b".

The "b" update to the first sentry has to push the "a" update that was there up until the substrings diverge. The update back to the root sees that "a" and "b" are the same substring for that entire span and so simply replaces the pending update of "a" with a pending update of "b".

Let's see, finishing up.

One thing that is maybe not clear is that within the larger sentry steps the smaller steps are also there. That is, if you're at a deep leaf you walk back to the root with steps that go 1,1,2,4,8,16,32. But in that last big 32 step, that doesn't mean that's one region of 32 nodes with no other sentries. Within there are still 1,2,4 type steps. If you have to disambiguate an update within that range, it doesn't mean you have to push up all 32 nodes one by one. You look and see hey I have a divergence in this 32-long gap, so can I just step up 16 with "a" and "b" being the same? etc.

I have no idea what the actual O() of this scheme is. It feels like O(NlogN) but I certainly don't claim that it is without doing the analysis.

I haven't actually implemented this so there may be some major error in it, or it might be no speed win at all vs. always doing full updates.

Maybe there's a better way to mark tree branches lazily? Some kind of hashing of the node id? Probabilistic methods?
