09-10-14 - Suffix Trie EOF handling

I realized something.

In a Suffix Trie (or suffix sort, or anything similar), handling the end-of-buffer is a mess.

The typical way it's described in the literature is to treat the end-of-buffer (henceforth EOB) as if it is a special character with value out of range (such as -1 or 256). That way you can just compare strings that go up to EOB with other strings, and the EOB will always mismatch a normal character and cause those strings to sort in a predictable way.

eg. on "banana" when you insert the final "na-EOB" you compare against "nana" and wind up comparing EOB vs. 'n' to find the sort order for that suffix.

The problem is that this is just a mess in the code. All over my suffix trie code, everywhere that I do a string lookup, I had to do extra special case checking for "is it EOB" and handle that case.

In addition, when I find a mismatch of "na-EOB" and "nan", the normal path of the code would be to change the prefix "na" into a branch and add children for the two different paths - EOB and "n". But I can't actually do that in my code because the child is selected by an unsigned char (uint8), and EOB is out of bounds for that variable type. So in all the node branch construction paths I have to special case "is it a mismatch just because of EOB, then don't create a branch". Blah blah.

Anyway, I realized that can all go away.

The key point is this :

Once you add a suffix that hits EOB (eg. the first mismatch against the existing suffixes is at EOB), then all future suffixes will also hit EOB, and that will be the deepest match for all of them.

Furthermore, all future suffix nodes can be found immediately using "follows"

eg. in the "banana" case, construction of the trie goes like :

"ban" is in the suffix trie
  (eg. "ban..","an..","n.." are all in the trie)

add "ana" :

we find the existing "an.."

the strings we compare are "anana-EOB" vs "ana-EOB"

so the mismatch hits EOB

That means all future suffixes will also hit EOB, and their placement in the tree can be found
just by using "follows" from the current string.

"ana-EOB" inserts at "anana"
"na-EOB" inserts at "nana"
"a-EOB" inserts at "ana"

That is, at the end of every trie construction when you start hitting EOB you just jump out to this special case of very simple follows addition.

So all the special EOB handling can be pulled out of the normal Trie code and set off to the side, which is lovely for the code clarity.

In practice it's quite common for files to end with a run of "000000000" which now gets swallowed up neatly by this special case.

ADD : if you don't care about adding every occurance of each suffix, then it gets even simpler -

If you hit EOB when adding a string - just don't add it. A full match of that suffix already exists, and your suffix that goes to EOB can't match any lookup better than what's in there.

(note that when I say "hit EOB" I mean you are at a branch node, and your current character doesn't match any of the branches because it is EOB. You will still add leaves that go to EOB, but you will never actually walk down those leaves to reach EOB.)

No comments:

old rants