As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a constant number of hash probes per match attempt.
totals : match len : clocks Test_MMC1 : 0.663028 , 231.254818 Test_Hash1 : 0.662718 , 64.888003 Test_Tornado_3 : 0.630377 , 19.658834 Test_Tornado_4 : 0.593174 , 28.456055 Test_Tornado_5 : 0.586540 , 40.546146 Test_Tornado_6 : 0.580042 , 56.841156 Test_Tornado_7 : 0.596584 , 141.432393There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno. The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand why it's happening).
0 = ares_merged.gr2_sec_5.dat 1 = bigship1.x 2 = BOOK1 3 = combined.gr2_sec_3.dat 4 = GrannyRocks_wot.gr2 5 = Gryphon_stripped.gr2 6 = hetero50k 7 = lzt20 8 = lzt21 9 = lzt22 10 = lzt23 11 = lzt24 12 = lzt25 13 = lzt27 14 = lzt28 15 = lzt29 16 = movie_headers.bin 17 = orange_purple.BMP 18 = predsave.bin
Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.
MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also a GPL at the moment which is a killer.
LzFind doesn't have anything going for it really.
For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.
For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.
No comments:
Post a Comment