cbloom rants: 09-30-11 - String Match Results Part 5 + Conclusion

9/30/2011

09-30-11 - String Match Results Part 5 + Conclusion

Finally for completeness, some of the matchers from Tornado in FreeArc. These are basically all standard "cache table" style matchers, originally due to LZRW, made popular by LZP and LZO. The various Tornado settings select different amounts of hash rows and ways.

As they should, they have very constant time operation that goes up pretty steadily from Tornado -3 to -7, because there's a constant number of hash probes per match attempt.


totals : match len : clocks
Test_MMC1 : 0.663028 , 231.254818 
Test_Hash1 : 0.662718 , 64.888003 
Test_Tornado_3 : 0.630377 , 19.658834 
Test_Tornado_4 : 0.593174 , 28.456055 
Test_Tornado_5 : 0.586540 , 40.546146 
Test_Tornado_6 : 0.580042 , 56.841156 
Test_Tornado_7 : 0.596584 , 141.432393

There may be something wrong with my Tornado wrapper as the -3 matcher actually finds the longest total length. I dunno. The speeds look reasonable. I don't really care much about these approximate matchers because the loss is hard to quantify, so there you go (normally when I see an anomaly like that I would investigate it to make sure I understand why it's happening).

0 = ares_merged.gr2_sec_5.dat
1 = bigship1.x
2 = BOOK1
3 = combined.gr2_sec_3.dat
4 = GrannyRocks_wot.gr2
5 = Gryphon_stripped.gr2
6 = hetero50k
7 = lzt20
8 = lzt21
9 = lzt22
10 = lzt23
11 = lzt24
12 = lzt25
13 = lzt27
14 = lzt28
15 = lzt29
16 = movie_headers.bin
17 = orange_purple.BMP
18 = predsave.bin

Conclusion : I've got to get off string matching so this is probably the end of posts on this topic.

MMC looks promising but has some flaws. There are some cases that trigger a slowness spike in it. Also it has some bad O(N^2) with unbounded match length ("MMC2") so I have to run it with a limit ("MMC1") which removes some of its advantage over LzFind and Hash1 and other approximate matchers. (without the limit it has the advantage of being exact). It's also a GPL at the moment which is a killer.

LzFind doesn't have anything going for it really.

For approximate/small-window matching I don't see any reason to not use the classic Zip hash chain method. I tried a few variants of this, like doing a hash chain to match the first 4 bytes and then link listing off that, and all the variants were worse than the classic way.

For large window / exact matching / optimal parsing, a correct O(N) matcher is the way to go. The suffix-array based matcher is by far the easiest for your kids to implement at home.

cbloom rants

9/30/2011

09-30-11 - String Match Results Part 5 + Conclusion

No comments:

old rants