As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a
constant number of hash probes per match attempt.
totals : match len : clocks
Test_MMC1 : 0.663028 , 231.254818
Test_Hash1 : 0.662718 , 64.888003
Test_Tornado_3 : 0.630377 , 19.658834
Test_Tornado_4 : 0.593174 , 28.456055
Test_Tornado_5 : 0.586540 , 40.546146
Test_Tornado_6 : 0.580042 , 56.841156
Test_Tornado_7 : 0.596584 , 141.432393
There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno.
The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to
quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand
why it's happening).
0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin
Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.
MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also a GPL at the moment which is a killer.
LzFind doesn't have anything going for it really.
For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.
For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.
No comments:
Post a Comment