SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as 1 = stress_many_matches 2 = stress_search_limit 3 = stress_sliding_follow 4 = stress_suffix_forward 5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat 1 = bigship1.x 2 = BOOK1 3 = combined.gr2_sec_3.dat 4 = GrannyRocks_wot.gr2 5 = Gryphon_stripped.gr2 6 = hetero50k 7 = lzt20 8 = lzt21 9 = lzt22 10 = lzt23 11 = lzt24 12 = lzt25 13 = lzt27 14 = lzt28 15 = lzt29 16 = movie_headers.bin 17 = orange_purple.BMP 18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :

totals: Test_MMC2 : DNF Test_LzFind2 : 506.954418 , 1355.992865 Test_SuffixArray3 : 506.954418 , 514.931740 Test_MMC1 : 13.095260 , 1298.507490 Test_LzFind1 : 12.674177 , 226.796123 Test_Hash1 : 503.319301 , 1094.570022

Finally greedy parse, 17 bit window :

totals: Test_MMC2 : 0.663028 , 110.373098 Test_LzFind2 : DNF Test_SuffixArray3 : 0.663036 , 236.896551 Test_MMC1 : 0.663028 , 222.626069 Test_LzFind1 : 0.662929 , 216.912409 Test_Hash1 : 0.662718 , 62.385071

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals: Test_SuffixArray3 : 0.630772 , 239.280605 Test_LzFind1 : 0.630688 , 181.430093 Test_MMC2 : 0.630765 , 88.413339 Test_Hash1 : 0.630246 , 51.980073

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :

```
First a reminder of how the normal suffix array searcher works :
We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors
The problem with windowed matching is our neighbors may be way out of the window
So what we ant is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.
So really all we are doing is adding another search structure on top of the suffix array. It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1]
To do this I just build a tree. It's indexed by sort position, and its content is file positions.
Level L of the tree contains nodes that cover an interval of size 1<
````<`

L
That is, level 0 has intervals of size 1, that's just
Tree_0[S] = sortIndexInverse[S]
(eg. it's just the file position of that sort pos)
(in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)
Tree_1[i] covers steps of size 2, that is :
file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
I store the min and max of that
Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]
Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.

## No comments:

Post a Comment