cbloom rants: 10-01-11 - String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

(ADDENDUM : this is super unclear, see more at end)

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

ADDENDUM : an email I wrote trying to explain this better :


First a reminder of how the normal suffix array searcher works :

We're at some file position F
We look up sortIndex[F] to find our sort position S
we know our longest matches must be at sort positions S-1 or S+1
we step to our neighbors

The problem with windowed matching is our neighbors may be way out of the window

So what we ant is a way to step more than 1 away to find the closest neighbor in the sort order that is within our desired window.

So really all we are doing is adding another search structure on top of the suffix array.  It's a search structure to go in either direction from S and find the closest spot to S that has file position in the range [F-window,F-1] 

To do this I just build a tree.  It's indexed by sort position, and its content is file positions.

Level L of the tree contains nodes that cover an interval of size 1<<L 

That is, level 0 has intervals of size 1, that's just

Tree_0[S] = sortIndexInverse[S] 
  (eg. it's just the file position of that sort pos)
  (in fact we don't make this level of the tree, we just use sortIndexInverse which we already have)

Tree_1[i] covers steps of size 2, that is :
  file positions sortIndexInverse[i*2] and sortIndexInverse[i*2+1]
  I store the min and max of that

Tree_2[i] covers steps of size 4, that is min and max of Tree_1[i*2] and Tree_1[i*2+1]

Now once you have this binary interval tree you can do the normal kind of binary tree walk to find the closest neighbor that's in the range you want.

Also see the following blog on how I walk binary interval trees and the released source code for StringMatchTest contains the full code for this matcher.

cbloom rants

10/01/2011

10-01-11 - String Match Results Part 6

No comments:

old rants