cbloom rants: 06-17-10 - Suffix Array Neighboring Pair Match Lens

6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

I suffix sort strings for my LZ coder. To find match lengths, I first construct the array of neighboring pair match lengths. You can then find the match length between any two indexes (i,j) as the MIN of all pair match lengths between them. I wrote about this before when I wrote about the LZ Optimal parse strategy, but in order to find all matches against any given suffix, you find the position of that suffix in the sort order, then walk to neighbors and keep doing MIN() with the pair match lengths.

Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :


from :

Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications
Toru Kasai 1, Gunho Lee2, Hiroki Arimura1;3, Setsuo Arikawa1, and Kunsoo Park2

Algorithm GetHeight
input: A text A and its suffix array Pos
1 for i:=1 to n do
2 Rank[Pos[i]] := i
3od
4 h:=0
5 for i:=1 to n do
6 if Rank[i] > 1 then
7 k := Pos[Rank[i]-1]
8 while A[i+h] = A[j+h] do
9 h := h+1
10 od
11 Height[Rank[i]] := h
12 if h>0 then h := h-1 fi
13 fi
14 od

The idea is obvious once you see it. You walk in order of the original character array index, (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1] then, you must be able to match to a length of at least the same length-1 by matching to the same neighbor, because by stepping forward one char you've just removed one from that suffix match. For example :


On "mississippi" :

At index 1 :

ississippi

matches :

issippi

with length 4.

Now step forward one.

At index 2 :

ssissippi

I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one)
so I can start my string compare at 3

And here's the C code :


static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2]
        const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen)
{
    int n = sortWindowLen;
    const U8 * A = sortWindowBase;

    int h = 0;
    for(int i=0; i< n ;i ++)
    {
        int sort_i = sortIndexInverse[i];
        if ( sort_i > 0 )
        {
            int j = sortIndex[sort_i-1];
            int h_max = sortWindowLen - RR_MAX(i,j);
            while ( h < h_max && A[i+h] == A[j+h] )
            {
                h++;
            }
            sortSameLen[sort_i-1] = h;

            if ( h > 0 ) h--;
        }
    }
}

cbloom rants

6/17/2010

06-17-10 - Suffix Array Neighboring Pair Match Lens

No comments:

old rants