Comments on cbloom rants: 09-27-11 - String Match Stress Test

Yeah, true, "twobooks" is actually a ver...

2011-09-27T19:36:07.038-07:00

Yeah, true, "twobooks" is actually a very easy to fix degenerate case of this problem. You can patch a solution to it onto any underlying string matcher just by tracking {lastoffset,lastml} and checking them first.

But that doesn't really solve the underlying problem. It's sort of like patching on handling of the degenerate RLE case, that doesn't fix the ababab case, etc.

You can make a trickier "follows" stress test by repeating a long common prefix which is then followed by a variety of things. And also suffixes of the long prefix, such that the simple {lastml,lastoffset} no longer gives you the best match.

Oh, right, I see, ok. Ugh. Of course a match of l...

2011-09-27T19:30:03.783-07:00

Oh, right, I see, ok. Ugh.

Of course a match of length K at position N always implies a match of K-1 at position N+1, so you could probably avoid this specific case pretty easily (and optimizing that is useful in general for long matches, not just in this specific case).

" I understand how the later tests are tes...

2011-09-27T19:10:46.859-07:00

" I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff ... "

Yeah, there is. (this is for non-greedy parsing only). That's why twobooks is interesting, it isolates one specific O(N^2) case that is rarely handled.

Any string matcher that does not have a "follows" implementation will come to the second occurance of book1 and do something like this :

find the longest match, which is N characters, which takes N character compares in the strcmp

go to the next pos, find the longest match, that takes N-1 character compares

etc. N+(N-1)+(N-2) ... = O(N^2)

You can fail to see these traps if you think of the string compare as being O(1).

"(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)"

You could take the first 128k of book1 and do the same test.

But really it applies to any long match scenario in non-greedy parsing.

Obviously most people (such as myself in the exist rrLZH code) limit their exposure to these kind of traps by doing things like bailing out of optimal parsing when they see a match longer than 256 or whatever.

I understand how the later tests are testing for O...

2011-09-27T18:37:10.720-07:00

I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff--isn't it just testing whether you can find long matches at all? There shouldn't be anything O(N^2) about it, should there?

(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)