9/24/2014

09-24-14 - Smart Phone Advice

Errmmm... I think I might finally get a smart phone.

I don't really want to waste any time researching about this because I fucking hate them and want as little as possible to do with this entire industry.

Definitely not anything Apple. I abolutely don't want to deal with any headaches of doing funny OS flashes or anything nonstandard that will complicate my life.

I'm thinking Google Nexus 5 because I understand it's the most minimal pure android ; I hate dealing with bloatware. I've already spent more time on this than I would like to. I actually prefer something smaller and lighter since I will only use it in emergencies. Galaxy S5 Mini? Not actually significantly smaller. Jesus christ.

It looks like "Straight Talk" is probably the right plan option for someone like me who will rarely use it. (?). I can't use anything T-Mobile because their coverage sucks around Seattle. Too bad because they seem to have the best pay-go plans.

One thing I have no idea about is how much bandwidth I need and whether paying per byte is okay or if I'll be fucked. If I accidentally browse to some web page that spews data at me, will that cost me a fortune? I don't like the idea of having to worry about that.

9/10/2014

09-10-14 - Suffix Trie EOF handling

I realized something.

In a Suffix Trie (or suffix sort, or anything similar), handling the end-of-buffer is a mess.

The typical way it's described in the literature is to treat the end-of-buffer (henceforth EOB) as if it is a special character with value out of range (such as -1 or 256). That way you can just compare strings that go up to EOB with other strings, and the EOB will always mismatch a normal character and cause those strings to sort in a predictable way.

eg. on "banana" when you insert the final "na-EOB" you compare against "nana" and wind up comparing EOB vs. 'n' to find the sort order for that suffix.

The problem is that this is just a mess in the code. All over my suffix trie code, everywhere that I do a string lookup, I had to do extra special case checking for "is it EOB" and handle that case.

In addition, when I find a mismatch of "na-EOB" and "nan", the normal path of the code would be to change the prefix "na" into a branch and add children for the two different paths - EOB and "n". But I can't actually do that in my code because the child is selected by an unsigned char (uint8), and EOB is out of bounds for that variable type. So in all the node branch construction paths I have to special case "is it a mismatch just because of EOB, then don't create a branch". Blah blah.

Anyway, I realized that can all go away.

The key point is this :

Once you add a suffix that hits EOB (eg. the first mismatch against the existing suffixes is at EOB), then all future suffixes will also hit EOB, and that will be the deepest match for all of them.

Furthermore, all future suffix nodes can be found immediately using "follows"

eg. in the "banana" case, construction of the trie goes like :


"ban" is in the suffix trie
  (eg. "ban..","an..","n.." are all in the trie)

add "ana" :

we find the existing "an.."

the strings we compare are "anana-EOB" vs "ana-EOB"

so the mismatch hits EOB

That means all future suffixes will also hit EOB, and their placement in the tree can be found
just by using "follows" from the current string.

"ana-EOB" inserts at "anana"
"na-EOB" inserts at "nana"
"a-EOB" inserts at "ana"

That is, at the end of every trie construction when you start hitting EOB you just jump out to this special case of very simple follows addition.

So all the special EOB handling can be pulled out of the normal Trie code and set off to the side, which is lovely for the code clarity.

In practice it's quite common for files to end with a run of "000000000" which now gets swallowed up neatly by this special case.


ADD : if you don't care about adding every occurance of each suffix, then it gets even simpler -

If you hit EOB when adding a string - just don't add it. A full match of that suffix already exists, and your suffix that goes to EOB can't match any lookup better than what's in there.

(note that when I say "hit EOB" I mean you are at a branch node, and your current character doesn't match any of the branches because it is EOB. You will still add leaves that go to EOB, but you will never actually walk down those leaves to reach EOB.)

old rants