tag:blogger.com,1999:blog-5246987755651065286.post8745956179122500074..comments2021-10-04T07:14:49.947-07:00Comments on cbloom rants: 09-24-11 - Suffix Tries 1cbloomhttp://www.blogger.com/profile/10714564834899413045noreply@blogger.comBlogger7125tag:blogger.com,1999:blog-5246987755651065286.post-26894923636175393992011-09-26T20:15:24.495-07:002011-09-26T20:15:24.495-07:00Oh wait, I see. It's contingent on this idea t...Oh wait, I see. It's contingent on this idea that we're always tracking two places at once, the current insertion and the best match. Then we can use the follow from the best match to guide the insertion.<br /><br />(Well, and even if we weren't desiring to track a best match, you could still track the best match just to run this algorithm faster. Weird.)Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-1579312903563031612011-09-26T18:05:32.610-07:002011-09-26T18:05:32.610-07:00It should be reasonably easy to see that you can w...It should be reasonably easy to see that you can walk an entire file in O(N).<br /><br />Any time you fail to grow a follow (that is, you had lastml at the previous spot and you now have lastml-1) that is done in O(1).<br /><br />If you do find a longer match, say it's longer by L, that is O(L). But you carry that L forward to the next spot. So if you add up all the match grow lengths, the sum of all of them is O(N).<br /><br />Therefore the whole walk is O(N).cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-8804918318292860032011-09-26T17:47:58.383-07:002011-09-26T17:47:58.383-07:00Yeah, so :
1. In suffix tries the insert & fi...Yeah, so :<br /><br />1. In suffix tries the insert & find are really the same operation so the only thing that makes sense is to make a combined "find_and_insert".<br /><br />2. Follow pointer maintenance and such really only work great if you are doing "find_and_insert" on every position.<br /><br />3. So, here's how it works :<br /><br />at position N I want to do a "find_and_insert". I start in the tree by using the follow pointer from the match at position N-1.<br /><br />As I do my match at position N I make some new nodes. Those are missing follow pointers, and I just set them aside (I don't immediately fill them out). (with path compression this is only ever one new node)<br /><br />I now move on to position N+1 and use the follow pointer from position N to start my search. I'm now in the right part of tree to fill out the follow pointer that was missing which I set aside in the previous search.<br /><br />I actually understand it now as a funny way of sequentially walking through the tree. At every position N you are filling out the previous follow pointer, using the current follow pointer, and trying to find the next follow pointer. You are almost never walking the tree from the root to the leaves, instead you're jumping around on these links.<br /><br />I actually think it's closely related to the BWT and the tree-structured BWT coders by Maniscalco and others but I haven't quite grocked that yet.cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-40925110815342163932011-09-26T17:26:30.606-07:002011-09-26T17:26:30.606-07:00Is there a way to create the follow pointer effici...Is there a way to create the follow pointer efficiently other than looking up the following chars?<br /><br />If you were doing M queries and N inserts and M >> N, then pushing the work into the insert would make sense. But in LZ compression, you have N suffixes and N positions at which you try to find the best match, so I would think you're looking at N inserts and N queries.<br /><br />If the follow makes the N queries O(1) time at the cost of making creating the follow take as much time as the query would have been, you're not saving anything.<br /><br />But maybe there's some trick I'm missing.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-8938802680590132652011-09-26T09:05:18.209-07:002011-09-26T09:05:18.209-07:00OK, so a node does not point towards a position in...OK, so a node does not point towards a position in the input stream, only a leaf does.Cyanhttps://www.blogger.com/profile/02905407922640810117noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-49312196744183122382011-09-26T08:46:23.859-07:002011-09-26T08:46:23.859-07:00No, a correct implementation of follow pointers wi...No, a correct implementation of follow pointers will point to the *node* for "AND" not the leaf for "AND;" so if there are longer matches possible it will find them.<br /><br />If a follow pointer is original made pointing at a leaf, it must be updated to point to a node when that leaf is later split.<br /><br />For example say "bands..." is originally put in the tree as a leaf, the follow pointer will point to "ands.." also a leaf. Later on "andy.." is seen and the "and" prefix is split into a node that branches to 's' and 'y'. Now the follow pointer for "band" needs to be updated to point to the node for "and".<br /><br />This requires some maintenance structures but doesn't change the O().cbloomhttps://www.blogger.com/profile/10714564834899413045noreply@blogger.comtag:blogger.com,1999:blog-5246987755651065286.post-29161656168875534302011-09-26T05:54:20.294-07:002011-09-26T05:54:20.294-07:00Afaiu, the "follow pointer" provides a g...Afaiu, the "follow pointer" provides a great speed boost.<br /><br />It seems logical. However, i guess for this to be correct that the suffix trie is no longer timely ordered (which means pointer to sibling and child always go to older positions) ? <br /><br />Here is an example where the follow pointer seems to fail finding the best match length :<br /><br />Imagine there is a match at <br />BAND; (4 characters)<br />then the "follow pointer" should point to <br />AND; (3 characters)<br />but maybe, in between, there was a position with :<br />AND THIS ONE IS BETTER; (22 characters)Cyanhttps://www.blogger.com/profile/02905407922640810117noreply@blogger.com