Optimal parsing, large window.
The candidates are :
SuffixArray1 : naive matcher built on divsufsort
SuffixArray2 : using the forward-offset elimination from this post
Suffix2 : Suffix Trie with follows and path compression but missing the bits in this post
Suffix3 : Suffix Trie without follow
Suffix5 : fully working Suffix Trie
MMC1 : MMC with max match length of 256
MMC2 : MMC with no limit
LzFind1 : LzFind (LZMA HC4 - binary tree) with max ML of 256 and step limit of 32
LzFind2 : LzFind with no max ML or step limit
Note : LzFind was modified from LZMA to not record all matches, just the longest, to make it more like the competitors. MMC was modified to make window size a variable.
In all cases I show :
Test_Matcher : average match length per byte , average clocks per byte
And with no further ado :
got arg : window_bits = 24
working on : m:\test_data\lz_stress_tests
loading file : m:\test_data\lz_stress_tests\stress_many_matches
Test_SuffixArray1 : 32.757760 , 164.087948
Test_SuffixArray2 : 32.756953 , 199.878476
Test_Suffix2 : 32.757760 , 115.846130
Test_Suffix3 : 31.628279 , 499.722569
Test_Suffix5 : 32.757760 , 184.172167
Test_MMC2 : 32.757760 , 1507.818166
Test_LzFind2 : 32.757760 , 576.154370
loading file : m:\test_data\lz_stress_tests\stress_search_limit
Test_SuffixArray1 : 823.341331 , 182.789064
Test_SuffixArray2 : 823.341331 , 243.492241
Test_Suffix2 : 823.341331 , 393.930504
Test_Suffix3 : 807.648294 , 2082.447274
Test_Suffix5 : 823.341331 , 91.699276
Test_MMC2 : 823.341331 , 6346.400206
Test_LzFind2 : 823.341331 , 1807.516994
loading file : m:\test_data\lz_stress_tests\stress_sliding_follow
Test_SuffixArray1 : 199.576550 , 189.029462
Test_SuffixArray2 : 199.573198 , 220.316868
Test_Suffix2 : 199.576550 , 95.225780
Test_Suffix3 : 198.967622 , 2110.521111
Test_Suffix5 : 199.576550 , 106.019526
Test_MMC2 : 199.576550 , 36571.382020
Test_LzFind2 : 199.576550 , 1249.184412
loading file : m:\test_data\lz_stress_tests\stress_suffix_forward
Test_SuffixArray1 : 5199.164464 , 6138.802402
Test_SuffixArray2 : 5199.164401 , 213.675569
Test_Suffix2 : 5199.164464 , 12901.429712
Test_Suffix3 : 5199.075953 , 32152.812339
Test_Suffix5 : 5199.164464 , 145.684678
Test_MMC2 : 5199.016562 , 6652.666440
Test_LzFind2 : 5199.164464 , 11739.369336
loading file : m:\test_data\lz_stress_tests\stress_all_as
Test_SuffixArray1 : 21119.499148 , 40938.612689
Test_SuffixArray2 : 21119.499148 , 127.520147
Test_Suffix2 : 21119.499148 , 88178.094886
Test_Suffix3 : 21119.499148 , 104833.677202
Test_Suffix5 : 21119.499148 , 119.676823
Test_MMC2 : 21119.499148 , 25951.480871
Test_LzFind2 : 21119.499148 , 38581.431558
loading file : m:\test_data\lz_stress_tests\twobooks
Test_SuffixArray1 : 192196.571348 , 412.356092
Test_SuffixArray2 : 192196.571348 , 420.437773
Test_Suffix2 : 192196.571348 , 268.524287
Test_Suffix3 : DNF
Test_Suffix5 : 192196.571348 , 292.777726
Test_MMC2 : DNF
Test_LzFind2 : DNF
(DNF = Did Not Finish = over 100k clocks per byte).
Conclusion : SuffixArray2 and Suffix5 both actually work and are correct with no blowup cases.
SuffixArray1 looks good on the majority of files (and is slightly faster than SuffixArray2 on those files), but "stress_suffix_forward" clearly calls it out and shows the break down case.
Suffix2 almost works except on the degenerate tests due to failure to get some details of follows quite right ( see here ).
Suffix3 just shows that a Suffix Trie without follows is some foolishness.
We won't show SuffixArray1 or Suffix2 or Suffix3 again.
MMC2 and LZFind2 both have bad failure cases. Both are simply not usable if you want to find the longest match at every byte. We will revisit them later in other usages though and see that they are good for what they're designed for.
I've not included any of the hash chain type matchers in this test because they all obviously crap their pants in this scenario.
No comments:
Post a Comment