cbloom rants: String Match Stress Test Files

2/11/2016

String Match Stress Test Files

A gift. My string match stress test set :

string_match_stress_tests.7z (60,832 bytes)

Consists of :

 paper1_twice
 stress_all_as
 stress_many_matches
 stress_search_limit
 stress_sliding_follow
 stress_suffix_forward

An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average match lengths : (min match length of 4, and no matches searched for in the last 8 bytes of each file)


paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464

total ml : 2896554306
total bytes : 483870

Previous post on the same test set : 09-27-11 - String Match Stress Test

And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".

These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.

cbloom rants

2/11/2016

String Match Stress Test Files

No comments:

old rants