Optimal parsing, large window.
The candidates are :
SuffixArray1 : naive matcher built on divsufsort
SuffixArray2 : using the forward-offset elimination from this post
Suffix2 : Suffix Trie with follows and path compression but missing the bits in this post
Suffix3 : Suffix Trie without follow
Suffix5 : fully working Suffix Trie
MMC1 : MMC with max match length of 256
MMC2 : MMC with no limit
LzFind1 : LzFind (LZMA HC4 - binary tree) with max ML of 256 and step limit of 32
LzFind2 : LzFind with no max ML or step limit
Note : LzFind was modified from LZMA to not record all matches, just the longest, to make it more like the competitors. MMC was modified to make window size a variable.
In all cases I show :
Test_Matcher : average match length per byte , average clocks per byteAnd with no further ado :
got arg : window_bits = 24 working on : m:\test_data\lz_stress_tests loading file : m:\test_data\lz_stress_tests\stress_many_matches Test_SuffixArray1 : 32.757760 , 164.087948 Test_SuffixArray2 : 32.756953 , 199.878476 Test_Suffix2 : 32.757760 , 115.846130 Test_Suffix3 : 31.628279 , 499.722569 Test_Suffix5 : 32.757760 , 184.172167 Test_MMC2 : 32.757760 , 1507.818166 Test_LzFind2 : 32.757760 , 576.154370 loading file : m:\test_data\lz_stress_tests\stress_search_limit Test_SuffixArray1 : 823.341331 , 182.789064 Test_SuffixArray2 : 823.341331 , 243.492241 Test_Suffix2 : 823.341331 , 393.930504 Test_Suffix3 : 807.648294 , 2082.447274 Test_Suffix5 : 823.341331 , 91.699276 Test_MMC2 : 823.341331 , 6346.400206 Test_LzFind2 : 823.341331 , 1807.516994 loading file : m:\test_data\lz_stress_tests\stress_sliding_follow Test_SuffixArray1 : 199.576550 , 189.029462 Test_SuffixArray2 : 199.573198 , 220.316868 Test_Suffix2 : 199.576550 , 95.225780 Test_Suffix3 : 198.967622 , 2110.521111 Test_Suffix5 : 199.576550 , 106.019526 Test_MMC2 : 199.576550 , 36571.382020 Test_LzFind2 : 199.576550 , 1249.184412 loading file : m:\test_data\lz_stress_tests\stress_suffix_forward Test_SuffixArray1 : 5199.164464 , 6138.802402 Test_SuffixArray2 : 5199.164401 , 213.675569 Test_Suffix2 : 5199.164464 , 12901.429712 Test_Suffix3 : 5199.075953 , 32152.812339 Test_Suffix5 : 5199.164464 , 145.684678 Test_MMC2 : 5199.016562 , 6652.666440 Test_LzFind2 : 5199.164464 , 11739.369336 loading file : m:\test_data\lz_stress_tests\stress_all_as Test_SuffixArray1 : 21119.499148 , 40938.612689 Test_SuffixArray2 : 21119.499148 , 127.520147 Test_Suffix2 : 21119.499148 , 88178.094886 Test_Suffix3 : 21119.499148 , 104833.677202 Test_Suffix5 : 21119.499148 , 119.676823 Test_MMC2 : 21119.499148 , 25951.480871 Test_LzFind2 : 21119.499148 , 38581.431558 loading file : m:\test_data\lz_stress_tests\twobooks Test_SuffixArray1 : 192196.571348 , 412.356092 Test_SuffixArray2 : 192196.571348 , 420.437773 Test_Suffix2 : 192196.571348 , 268.524287 Test_Suffix3 : DNF Test_Suffix5 : 192196.571348 , 292.777726 Test_MMC2 : DNF Test_LzFind2 : DNF(DNF = Did Not Finish = over 100k clocks per byte).
Conclusion : SuffixArray2 and Suffix5 both actually work and are correct with no blowup cases.
SuffixArray1 looks good on the majority of files (and is slightly faster than SuffixArray2 on those files), but "stress_suffix_forward" clearly calls it out and shows the break down case.
Suffix2 almost works except on the degenerate tests due to failure to get some details of follows quite right ( see here ).
Suffix3 just shows that a Suffix Trie without follows is some foolishness.
We won't show SuffixArray1 or Suffix2 or Suffix3 again.
MMC2 and LZFind2 both have bad failure cases. Both are simply not usable if you want to find the longest match at every byte. We will revisit them later in other usages though and see that they are good for what they're designed for.
I've not included any of the hash chain type matchers in this test because they all obviously crap their pants in this scenario.
No comments:
Post a Comment