10/01/2011

10-01-11 - String Match Results Part 6

You knew that couldn't be the end.

SuffixArray3 : suffix array string matcher which uses a min/max tree to find allowed offsets.

The min/max tree is a binary hierarchy ; at level L there are (size>>L) entries, and each entry covers a range of size (1 << L). Construction is O(N) because N/2+N/4+N/8 ... = N

The min/max tree method is generally slightly slower than the elegant "chain of fences" approach used for SuffixArray2, but it's close. The big advantage is the min/max tree can also be used for windowed matching, which is not easy to integrate in SA2.

First check that it satisfies the O(N) goal on the stress tests :

0 = stress_all_as
1 = stress_many_matches
2 = stress_search_limit
3 = stress_sliding_follow
4 = stress_suffix_forward
5 = twobooks

Yep. Then check optimal parse, large window vs. the good options :

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

The good Suffix Trie is clearly the best, but we're in the ballpark.

Now optimal parse, 17 bit (128k) window :


totals:
Test_MMC2 : DNF 
Test_LzFind2 : 506.954418 , 1355.992865 
Test_SuffixArray3 : 506.954418 , 514.931740 
Test_MMC1 : 13.095260 , 1298.507490 
Test_LzFind1 : 12.674177 , 226.796123 
Test_Hash1 : 503.319301 , 1094.570022 

Finally greedy parse, 17 bit window :


totals:
Test_MMC2 : 0.663028 , 110.373098 
Test_LzFind2 : DNF 
Test_SuffixArray3 : 0.663036 , 236.896551 
Test_MMC1 : 0.663028 , 222.626069 
Test_LzFind1 : 0.662929 , 216.912409 
Test_Hash1 : 0.662718 , 62.385071 

average match length :

And once more for clarity :

Greedy parse, 16 bit window , just the good candidates :

totals:
Test_SuffixArray3 : 0.630772 , 239.280605 
Test_LzFind1 : 0.630688 , 181.430093 
Test_MMC2 : 0.630765 , 88.413339 
Test_Hash1 : 0.630246 , 51.980073 

It should be noted that LzFind1 is approximate, and Hash1 is even more approximate. Though looking at the match length chart you certainly can't see it.

0 comments:

old rants