Constructing the pair match lengths the naive way is O(N^2) on degenerate cases (eg. where the whole file is 0). I've had a note for a long time in my code that an O(N) solution should be possible, but I never got around to figuring it out. Anyway, I stumbled upon this :
from : Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications Toru Kasai 1, Gunho Lee2, Hiroki Arimura1;3, Setsuo Arikawa1, and Kunsoo Park2 Algorithm GetHeight input: A text A and its suffix array Pos 1 for i:=1 to n do 2 Rank[Pos[i]] := i 3od 4 h:=0 5 for i:=1 to n do 6 if Rank[i] > 1 then 7 k := Pos[Rank[i]-1] 8 while A[i+h] = A[j+h] do 9 h := h+1 10 od 11 Height[Rank[i]] := h 12 if h>0 then h := h-1 fi 13 fi 14 od
The idea is obvious once you see it. You walk in order of the original character array index, (not the sorted order). At index [i] you find a match of length L to its suffix-sorted neighbor. At index [i+1] then, you must be able to match to a length of at least the same length-1 by matching to the same neighbor, because by stepping forward one char you've just removed one from that suffix match. For example :
On "mississippi" : At index 1 : ississippi matches : issippi with length 4. Now step forward one. At index 2 : ssissippi I know I can get a match of length 3 by matching the previous suffix "issippi" (stepped one) so I can start my string compare at 3
And here's the C code :
static void MakeSortSameLen(int * sortSameLen, // fills out sortSameLen[0 .. sortWindowLen-2] const int * sortIndex,const int * sortIndexInverse,const U8 * sortWindowBase,int sortWindowLen) { int n = sortWindowLen; const U8 * A = sortWindowBase; int h = 0; for(int i=0; i< n ;i ++) { int sort_i = sortIndexInverse[i]; if ( sort_i > 0 ) { int j = sortIndex[sort_i-1]; int h_max = sortWindowLen - RR_MAX(i,j); while ( h < h_max && A[i+h] == A[j+h] ) { h++; } sortSameLen[sort_i-1] = h; if ( h > 0 ) h--; } } }
No comments:
Post a Comment