9/24/2011

09-24-11 - Suffix Tries 2

Say you have a suffix trie with path compression.

So, for example if you had "abxyz" , "abymn" and "abxyq" then you would have :


[ab]   (vertical link is a child)
|
[xy]-[ymn]  (horizontal link is a sibling)
|
z-q

only the first character is used for selecting between siblings, but then you may need to step multiple characters to get to the next branch point.

(BTW I just thought of an interesting alternative way to do suffix tries in a b-tree/judy kind of way. Make your node always have 256 slots. Instead of always matching the first character to find your child, match N. That way for sparse parts of the tree N will be large and you will have many levels of the tree in one 256-slot chunk. In dense parts of the tree N becomes small, down to 1, in which case you get a radix array). Anyhoo..

So there are substrings that don't correspond to any specific node. For example "abx" is between "ab" and "abxy" which have definite spots in the tree. If you want to add "abxr" you have to first break the "xy" and then add the new node.

Okay, this is all trivial and just tree management, but there's something interesting about it :

If you have a "follow" pointer and the length you want does not correspond to a specific node (ie it's one of those between lengths), then there can be no longer match possible.

So, you had a previous match of length "lastml". You step to the next position, you know the best match is at least >= lastml-1. You use a follow pointer to jump into the tree and find the node for the following suffix. You see that the node does not have length "lastml-1", but some other length. You are done! No more tree walking is needed, you know the best match length is simply lastml-1.

Why is this? Consider if there was a longer match possible. Let's say our string was "sabcdt..." at the last position we matched 5 ("sabcd"). So we now have "abcdt..." and know match is >= 4. We look up the follow node for "abcd" and find there is no length=4 node in the tree. That means that the only path in the tree had "dt" in it - there has been no character other than "t" after "d" or there would be a branching node there. But I know that I cannot match "t" because if I did then the previous match would have been longer. Therefore there is no longer match possible.

This turns out to be very common. I'm sure if I actually spent a month or so on suffix tries I would learn lots of useful properties (there are lots of papers on this topic).

No comments:

old rants