copy /b book1 + book1 twobooks

then test on "twobooks".
There are three general classes of string matchers, based on how they respond to a case like "twobooks" :
1. No problemo. Time per byte is roughly constant no matter what you throw at it (for both greedy and non-greedy parsing). This class is basically only made up of matchers that have a correct "follows" implementation.
2. Okay with greedy parsing. Matchers in this class crap out in some kind of O(N^2) way if you ask them to match at every position, but if you let them do greedy parsing they are okay. This class does not have a correct "follows" implementation, but does otherwise avoid O(N^2) behavior. For example MMC seems to fall into this class, as does a suffix tree without "follows".
Any matcher with a small constant cap on the number of compares can also fall into this performance class, but at the cost of an unknown loss of match quality.
3. Craps out even with greedy parsing. This class fails to avoid the O(N^2) trap that happens when you have a long match and also many ways to make it. For example simple hash chains without an "amortize" limit fall in this class (with non-greedy parsing they are O(N^3) on degenerate cases like a file that's all the same char); there's a hash-chain sketch after this list to make it concrete.
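To make classes 2 and 3 concrete, here's a minimal sketch of hash chains with an optional cap on chain walks standing in for the "amortize" limit. It's my own illustration, not code from any matcher mentioned here, and all the names (HashChainMatcher, maxSteps, and so on) are made up. With maxSteps unlimited this is the class-3 matcher: on "twobooks" every position in the second copy of book1 walks a chain full of long equal-prefix entries, and total work goes O(N^2). With a small cap it drops into class 2's performance, but can miss the long match.

    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct HashChainMatcher {
        enum { HASH_BITS = 16, MIN_MATCH = 4 };

        std::vector<int> head;  // hash -> most recent position with that hash, or -1
        std::vector<int> next;  // position -> previous position with the same hash
        const uint8_t * buf;
        int size;

        HashChainMatcher(const uint8_t * b, int n)
            : head(1 << HASH_BITS, -1), next(n, -1), buf(b), size(n) { }

        static uint32_t hash4(const uint8_t * p) {
            uint32_t x; memcpy(&x, p, 4);
            return (x * 2654435761u) >> (32 - HASH_BITS);
        }

        // add position pos to its chain (caller ensures pos+4 <= size)
        void insert(int pos) {
            uint32_t h = hash4(buf + pos);
            next[pos] = head[h];
            head[h] = pos;
        }

        // longest match at pos, walking at most maxSteps chain links;
        // returns the length (0 = no match) and fills *matchPos
        int find(int pos, int maxSteps, int * matchPos) {
            int bestLen = 0;
            int limit = size - pos;
            for (int p = head[hash4(buf + pos)]; p >= 0 && maxSteps-- > 0; p = next[p]) {
                int len = 0;
                while (len < limit && buf[p + len] == buf[pos + len]) len++;
                if (len > bestLen) { bestLen = len; *matchPos = p; }
            }
            return bestLen >= MIN_MATCH ? bestLen : 0;
        }
    };

In a greedy LZ loop you would call find() at the current position and then insert() every position the match skips over; the cap bounds each find() at maxSteps match attempts instead of the full chain length.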
Two other interesting stress tests I'm using are :
Inspired by ryg, "stress_suffix_forward" :
  4k of aaaa...
  then 64k of aaaa...
obviously when you first reach the second run of "aaaa..." the match you want is back at the beginning of the file,
but a naive suffix sort will have to look through 64k of following a's before it finds it, because the nearest
neighbors in suffix order are all suffixes of the same run at later file positions, which are useless as LZ matches.
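To see why, here's a sketch of the naive way to pull an LZ match candidate out of a suffix sort: from your slot in suffix order, scan outward until you hit a suffix that starts earlier in the file. The names (sa, rank, naiveFindPreceding) are mine, sa and rank are assumed to be built by whatever suffix sorter you use, and a real matcher would also track LCPs to get the match length; this only shows the scan that blows up.

    #include <vector>

    // sa = suffix array (suffix order -> file position),
    // rank = its inverse (file position -> suffix order), built elsewhere.
    // Returns the file position of the nearest suffix (in suffix order)
    // that starts before pos, or -1 if there is none.
    int naiveFindPreceding(const std::vector<int> & sa,
                           const std::vector<int> & rank,
                           int pos)
    {
        int n = (int) sa.size();
        int r = rank[pos];
        for (int d = 1; r - d >= 0 || r + d < n; d++) {
            if (r - d >= 0 && sa[r - d] < pos) return sa[r - d];
            if (r + d < n && sa[r + d] < pos) return sa[r + d];
        }
        return -1;
    }

On stress_suffix_forward, at the start of the 64k run this loop steps through on the order of 64k entries that are all following a's before the first preceding position shows up. One fix is to precompute, for each suffix-array slot, the nearest neighbor at an earlier file position, so the matcher never does this linear scan.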
Another useful one to check on the "amortize" behavior is "stress_search_limit" :
  book1
  then, 1000 times :
    128 random bytes
    the first 128 bytes of book1
  then book1 again
obviously when you encounter all of book1 for the second time, you should match the whole of book1 at the
head of the file, but matchers which use some kind of simple search limit (e.g. amortized hashing)
will see the recent 128-byte matches first and may never get back to the really long one.
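Here's a minimal sketch of a generator for this file, taking the recipe above literally; the leading and trailing full copies of book1 are the parts the description depends on. The file handling and the use of rand() for the random bytes are my own choices, and it assumes "book1" (which is well over 128 bytes) sits in the working directory.

    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        FILE * in = fopen("book1", "rb");
        if (!in) return 1;
        fseek(in, 0, SEEK_END);
        long n = ftell(in);
        fseek(in, 0, SEEK_SET);
        std::vector<unsigned char> book1(n);
        if ((long) fread(book1.data(), 1, n, in) != n) return 1;
        fclose(in);

        FILE * out = fopen("stress_search_limit", "wb");
        if (!out) return 1;
        fwrite(book1.data(), 1, n, out);            // book1 at the head
        for (int i = 0; i < 1000; i++) {
            unsigned char rnd[128];
            for (int j = 0; j < 128; j++) rnd[j] = (unsigned char) rand();
            fwrite(rnd, 1, 128, out);               // 128 random bytes
            fwrite(book1.data(), 1, 128, out);      // the first 128 bytes of book1
        }
        fwrite(book1.data(), 1, n, out);            // then book1 again
        fclose(out);
        return 0;
    }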