cbloom rants: 09-28-11 - String Matching with Suffix Arrays

9/28/2011

09-28-11 - String Matching with Suffix Arrays

A suffix sorter (such as the excellent divsufsort by Yuta Mori) provides a list of the suffix positions in an array in sorted order. Eg. sortedSuffixes[i] is the ith suffix in order.

You can easily invert this table to make sortLookup such that sortLookup[ sortedSuffix[i] ] == i . eg. sortLookup[i] is the sort order for position i.

Now at this point, for each suffix sort position i, you know that the longest match with another suffix is either at i-1 or i+1.

Next we need the neighboring pair match lengths for the suffix sort. This can be done in O(N) as previously described here . So we now have a sortSameLen[] array such that sortSameLen[i] tells you the match length between (sorted order) elements i and i+1.

Using just these you can find all the match lengths for any suffix in the array thusly :


For a suffix start at index pos
Find its sort order : sortIndex = sortLookup[pos]
In each direction (+1 and -1)
current_match_len = infinite
step to next sort index
current_match_len = MIN(current_match_len,sortSameLen[sort index])

Okay. This is all old news. But it has a problem that has been discussed previously .

When matching strings for LZ and such, we don't want the longest match in the array, we want the longest match that occurs earlier. Handled naively this ruins the great O() performance of suffix array string matching. But you can do better.

Run Algorithm Next Index with Lower Value on the sortedSuffix[] array. This provides an array nextSuffixPreceding[]. This is exactly what you need - it provides the next closest suffix with a preceding index.

Now instead of the longest match being at +1 and -1, the longest match is at nextSuffixPreceding[i] and priorSuffixPreceding[i].

There's one last problem - if my current suffix is at position pos, and I look up si = sortIndex[pos] and from that nextSuffixPreceding[si] - I need to walk up to that position one by one doing MIN() on the adjacent pair match lengths (sortSameLen). That ruins my O() win.

But there's a solution - simply build the match length as well when you run "next index with lower value". This can be done easily by tracking the match length back to the preceding "fence". This adds no complexity to the algorithm.

The total sequence of operations is :


sort suffixes : O(NlogN) or so

build sort lookup : O(N)

build sort pair same len : O(N)

build next/prior pos preceding with match lengths : O(N)

now to find a match length :

at position "pos"
si = sortLookup[pos]
for each direction (following and preceding)
  matchpos = nextSuffixPreceding[si]
  matchlen = nextSuffixPreceding_Len[si]

that is, the match length lookup is a very simple O(1) per position (or O(N) for all positions).

One minor annoyance remains, which is that the suffix array string searcher does not provide the lowest offset for a given length of match. It gives you the closest in suffix order, which is not what you want.

1 comment:

cbloom said...: Challenge to the reader : can this be extended for windowed matching?

You will need to construct an array of "next pos preceding but in window".; September 30, 2011 at 5:00 PM

cbloom rants

9/28/2011

09-28-11 - String Matching with Suffix Arrays

1 comment:

old rants