SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.
The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N
The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.
(ADDENDUM : this is super unclear, see more at end)
First check that it satisfies the O(N) goal on the stress tests :
0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks
Yep. Then check optimal parse, large window vs. the good options :
0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin
The good Suffix Trie is clearly the best, but we're in the ballpark.
Now optimal parse, 17 bit (128k) window :
totals:
Test_MMC2 : DNF
Test_LzFind2 : 506.954418 , 1355.992865
Test_SuffixArray3 : 506.954418 , 514.931740
Test_MMC1 : 13.095260 , 1298.507490
Test_LzFind1 : 12.674177 , 226.796123
Test_Hash1 : 503.319301 , 1094.570022
Finally greedy parse, 17 bit window :
totals:
Test_MMC2 : 0.663028 , 110.373098
Test_LzFind2 : DNF
Test_SuffixArray3 : 0.663036 , 236.896551
Test_MMC1 : 0.663028 , 222.626069
Test_LzFind1 : 0.662929 , 216.912409
Test_Hash1 : 0.662718 , 62.385071
average match length :
And once more for clarity :
Greedy parse, 16 bit window , just the good candidates :
totals:
Test_SuffixArray3 : 0.630772 , 239.280605
Test_LzFind1 : 0.630688 , 181.430093
Test_MMC2 : 0.630765 , 88.413339
Test_Hash1 : 0.630246 , 51.980073
It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.
ADDENDUM : an email I wrote trying to explain this better :
First a reminder of how the normal suffix array searcher works :
We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors
The problem with windowed matching is our neighbors may be way out of the window
So what we ant is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.
So really all we are doing is adding another search structure on top of the suffix array. It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1]
To do this I just build a tree. It's indexed by sort position, and its content is file positions.
Level L of the tree contains nodes that cover an interval of size 1<
<
L
That is, level 0 has intervals of size 1, that's just
Tree_0[S] = sortIndexInverse[S]
(eg. it's just the file position of that sort pos)
(in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)
Tree_1[i] covers steps of size 2, that is :
file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
I store the min and max of that
Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]
Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.
Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.
No comments:
Post a Comment