cbloom rants: 10-18-11 - StringMatchTest

For reference :

The good way to do the standard Zip-style Hash -> Linked List for LZ77 parsing.

There are two tables : the hash entry point table, which gives you the head of the linked list, and the link table, which is a circular buffer of ints which contain the next position where that hash occured.

That is :


  hashTable[ hash ]  contains the last (most recent preceding) position that hash occured
  chainTable[ pos & (window_size-1) ]  contains the last position of the hash at pos before pos

To walk the table you do :


  i = hashTable[ hash ];
  while ( i in window )
    i = chainTable[ i & (window_size-1) ]

To update the table you do :


  head = hashTable[ hash ];
  hashTable[hash] = pos;
  chainTable[ pos & (window_size-1) ] = head;

And now for some minor details that are a bit subtle. We're going to go through "Hash1" from StringMatchTest which I know I still haven't posted.


int64 Test_Hash1(MATCH_TEST_ARGS)
{
    uint8 * buf = (uint8 *)charbuf;
    const int window_mask = window_size-1;
        
    vector<int> chain; // circular buffer on window_size
    chain.resize(window_size);
    int * pchain = chain.data();
    
    const int hash_size = MIN(window_size,1<<20);
    const int hash_mask = hash_size-1;

for small files or small windows, you can only get good per-byte speed if you make hash size proportional with that MIN. (what you can't see is outside the Test_ I also made window_size be no bigger than the smallest power of 2 that encloses file size).


    vector<int> hashTable; // points to pos of most recent occurance of this hash
    hashTable.resize(hash_size);
    int * phashTable = hashTable.data();
    
    memset(phashTable,0,hash_size*sizeof(int));

As noted previously, for large hashes you can get a big win by using a malloc that gives you zeros. I don't do it here for fairness to the other tests. I do make sure that my initialization value is zero so you can switch to VirtualAlloc/calloc.


    int64 total_ml = 0;
    
    // fiddle the pointers so that position 0 counts as being out of the window
    int pos = window_size+1;
    buf -= pos;
    ASSERT( (char *)&buf[pos] == charbuf );
    size += pos;

I don't want to do two checks in the inner loop for whether a position is a null link vs. out of the window. So I make the "null" value of the linked list (zero) be out of the window.


    for(;pos<(size-TAIL_SPACE_NEEDED);)
    {
        MATCH_CHECK_TIME_LIMIT();

        // grab 4 bytes (@ could slide here)
        uint32 cur4 = *((uint32 *)&buf[pos]);
        //ASSERT( cur4 == *((uint32 *)&buf[pos]) );

On PC's it's fastest just to grab the unaligned dword. On PowerPC it's faster to slide bytes through the dword. Note that endian-ness changes the value, so you may need to tweak the hash function differently for the two endian cases.


        // hash them 
        uint32 hash = hashfour(cur4) & hash_mask;
        int hashHead =  phashTable[hash];
        int nextHashIndex = hashHead;
        int bestml = 0;
        int windowStart = pos-window_size;
        ASSERT( windowStart >= 0 );
        
        #ifdef MAX_STEPS
        int steps = 0;
        #endif

Start the walk. Not interesting.


        while( nextHashIndex >= windowStart )
        {
            uint32 vs4 = *((uint32 *)&buf[nextHashIndex]);
            int hi = nextHashIndex&window_mask;
            if ( vs4 == cur4 )
            {
                int ml = matchlenbetter(&buf[pos],&buf[nextHashIndex],bestml,&buf[size]);
                    
                if ( ml != 0 )
                {
                    ASSERT( ml > bestml );
                    bestml = ml;
                    
                    // meh bleh extra check cuz matchlenbetter can actually go past end
                    if ( pos+bestml >= size )
                    {
                        bestml = size - pos;
                        break;
                    }
                }
            }

            #ifdef MAX_STEPS
            if ( ++steps > MAX_STEPS )
                break;
            #endif
                                
            nextHashIndex = pchain[hi];
        }

This is the simple walk of the chain of links. Min match len is 4 here which is particularly fast, but other lens can be done similarly.

"MAX_STEPS" is the optimal "amortize" (walk limit) which hurts compression in an unbounded way but is necessary for speed.

"matchlenbetter" is a little trick ; it's best to check the character at "bestml" first because it is the most likely to differ. If that char doesn't match, we know we can't beat the previous match, so we can stop immediately. After that I check the chars in [4,bestml) to ensure that we really do match >= bestml (the first 4 are already checked) and lastly the characters after, to extend the match.

The remainder just updates the hash and is not interesting :


        ASSERT( bestml == 0 || bestml >= 4 );
        total_ml += bestml;
        
        // add self :
        //  (this also implicitly removes the node at the end of the sliding window)
        phashTable[hash] = pos;
        int ci = pos & window_mask;
        pchain[ci] = hashHead;
                
        if ( greedy && bestml > 0 )
        {
            int end = pos+bestml;
            pos++;
            ASSERT( end <= size );
            for(;pos<end;pos++)
            {               
                uint32 cur4 = *((uint32 *)&buf[pos]);
                
                // hash them 
                uint32 hash = hashfour(cur4) & hash_mask;
                int hashHead =  phashTable[hash];
                phashTable[hash] = pos;
                int ci = pos & window_mask;
                pchain[ci] = hashHead;      
            }
            pos = end;
        }
        else
        {
            pos++;
        }
    }
    
    return total_ml;
}

Note that for non-greedy parsing you can help the O(N^2) in some cases by setting bestml to lastml-1. This helps enormously in practice because of the heuristic of matchlenbetter but does not eliminate the O(N^2) in theory. (the reason it helps in practice is because typically when you find a very long match, then the next byte will not have any match longer than lastml-1).

(but hash1 is really just not the right choice for non-greedy parsing; SuffixArray or SuffixTrie are vastly superior)

cbloom rants

10/18/2011

10-18-11 - StringMatchTest - Hash 1b

1 comment: