Say you have an MML of 2 and encode a literal "a". That means all digrams in the dictionary that start with "a" are exclusions. The following character cannot be any of those, and since those are the most likely characters, the exclusions are worth a lot.
Discussed previously here :
cbloom rants 09-03-10 - LZ and Exclusions
Even when you send a match, you have a powerful exclude from the characters that cannot extend that match. In a normal LZ77 like LZMA this is done for the *single* match offset that we sent with a literal-after-match exclude. In LZSA we can easily exclude *all* the characters that could have extended the match.
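To make that concrete, here is a minimal sketch (my own, not from the post) of literal coding with exclusion : excluded characters are given zero probability and everything else keeps its frequency. arithmetic_encode( cumfreq, freq, total ) is the assumed coder interface, the same one used in the pseudocode below.

    // sketch : code one character from an order-0 frequency table with exclusion
    void arithmetic_encode( int cumfreq, int freq, int total ); // assumed coder interface

    void encode_literal_with_exclusion( int sym, const int freq[256], const bool excluded[256] )
    {
        int cum = 0, total = 0;
        for ( int s = 0; s < 256; s++ )
        {
            if ( excluded[s] ) continue;    // excluded characters contribute no probability
            if ( s < sym ) cum += freq[s];  // cumulative frequency of kept characters below sym
            total += freq[s];
        }
        arithmetic_encode( cum, freq[sym], total ); // sym itself must not be excluded
    }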
LZSA-HC = LZSA - High Compression
For LZSA-HC we take MML down to 1, so literals and matches are unified and we are always writing "matches". However, we write each match by first writing the first character of the match and then writing the rest of the match. I assume that you have ensured the dictionary contains every character at least once, so there's no issue of impossible encodings.
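Ensuring that precondition can be as trivial as appending all 256 byte values to the dictionary; a hypothetical sketch :

    #include <vector>

    // guarantee the dictionary contains every byte at least once,
    // so any first character of a match can always be coded
    void append_all_bytes( std::vector<unsigned char> & dictionary )
    {
        for ( int c = 0; c < 256; c++ )
            dictionary.push_back( (unsigned char) c );
    }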
Algorithm LZSA-HC :
Data structures required are the same as LZSA-Basic. When the match lookup structure (typically a suffix trie) returns a match, it should also provide the set of characters which were found in the dictionary but did not match the next character in the search string.

To encode :

    initialize exclude_set = { empty }
    for all of ptr
    {
        encode current character (*ptr) using literal coder, doing exclusion from current exclude_set
        look up the current string (ptr) in the match lookup structure
        set exclude_set = characters found that followed the match but != ptr[match_length]
        transmit match length ( >= 1 )
        if ( match_length > 1 )
        {
            send the suffix substring matched : simply one arithmetic encode call
            we only need to send it within the range of suffixes of the already-sent first character
            char_suffix_low[ *ptr ] is a table lookup
            char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];
            arithmetic_encode( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
        }
        ptr += match_length;
    }

A few notes on this.
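The char_suffix_low[] table is just the boundaries of the first-character buckets in the suffix sort. A hypothetical construction (names mine, assuming suffix_sort[i] points at the i-th dictionary suffix in sorted order, so all suffixes starting with the same byte form one contiguous interval) :

    // char_suffix_low[c]   = index of the first suffix starting with byte c
    // char_suffix_low[c+1] - char_suffix_low[c] = count of suffixes starting with c
    void build_char_suffix_low( const unsigned char * const * suffix_sort, int num_suffixes,
                                int char_suffix_low[257] )
    {
        int i = 0;
        for ( int c = 0; c <= 256; c++ )
        {
            // advance past all suffixes whose first byte is < c
            while ( i < num_suffixes && suffix_sort[i][0] < c )
                i++;
            char_suffix_low[c] = i;
        }
    }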
We only send matches, length 1,2,3... Then exclude all characters that could not come after the match. This exclusion is exactly the same as exclusion in PPM when you escape down orders. In LZ you escape from order-ML down to order-0.
Coding the first character of the match separately is just so that I can use a different coder there. I use order-1 plus some bits of match history as context. For purity, I could have left that out and just always coded a suffix index for the match. In that case "exclusion" would consist of subtracting off the char_suffix_count[] number of suffixes from each excluded character.
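Here's a hedged sketch of what that pure variant could look like : the matched suffix interval is coded over the whole suffix sort, with the intervals of excluded first characters removed from both the total and the cumulative position. char_suffix_low and arithmetic_encode are the same assumed interfaces as in the pseudocode above.

    void arithmetic_encode( int cumfreq, int freq, int total ); // assumed coder interface

    void encode_match_suffix_excluded( int suffix_sort_low_index, int suffix_count,
                                       unsigned char first_char,
                                       const int char_suffix_low[257], const bool excluded[256],
                                       int dictionary_size )
    {
        int low   = suffix_sort_low_index;  // interval of suffixes matching our string
        int total = dictionary_size;        // total suffixes before exclusion
        for ( int c = 0; c < 256; c++ )
        {
            if ( !excluded[c] ) continue;
            int count = char_suffix_low[c + 1] - char_suffix_low[c];
            total -= count;                   // excluded first characters contribute nothing
            if ( c < first_char ) low -= count; // their intervals lie wholly below ours
        }
        arithmetic_encode( low, suffix_count, total );
    }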
Because match_length is sent after the first character, I use the first character as context for coding the match length.
The "if ( match_length > 1 )" is actually optional. You could go ahead and run that block for match_length = 1 as well. It will arithmetic_encode a symbol whose probability is 1; that is a cumprob which is equal to the full range. This should be a NOP on your arithmetic encoder. Whether that's a good idea or not depends on implementation.
In practice the exclude_set is 256 bit flags = 32 bytes.
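One possible layout for those 32 bytes (my naming, a sketch) :

    struct ExcludeSet
    {
        unsigned long long bits[4];   // 4 * 64 = 256 flags

        void clear()                 { bits[0] = bits[1] = bits[2] = bits[3] = 0; }
        void add( unsigned char c )  { bits[c >> 6] |= 1ULL << (c & 63); }
        bool excludes( unsigned char c ) const { return ( bits[c >> 6] >> (c & 63) ) & 1; }
    };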
LZSA-HC must use greedy parsing (always send the longest possible match) because of full exclusion : the decoder excludes every character that could have extended the match, so sending a shorter-than-maximal match would make the true next character an impossible symbol. There's no such thing as lazy/flexible/optimal parses.
To decode : we now can't really use the faster backward tree, because we need a lookup structure that will provide the matched suffix range (low index and count) and the exclude set.

    initialize exclude_set = { empty }
    for all of ptr
    {
        decode current character (*ptr) using literal coder, doing exclusion from current exclude_set
        decode match_length
        if ( match_length == 1 )
        {
            // just precompute a table of exclude sets for the after-1-character case :
            exclude_set = o1_exclude_set[ *ptr ]
        }
        else
        {
            decode the suffix index within the range of suffixes of the already-sent first character
            char_suffix_low[ *ptr ] is a table lookup
            char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];
            int suffix_index = arithmetic_fetch( char_suffix_count ) + char_suffix_low[ *ptr ];

            at this point {suffix_index, match_length} is our match string
            unsigned char * match_string = suffix_sort[ suffix_index ];
            copy_match( out_ptr , match_string, match_length );
            out_ptr += match_length;

            we also need the suffix low index & count to remove the arithmetic interval :
            suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );
            here get_suffix_count must be a real match lookup and should also fill exclude_set
            arithmetic_remove( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
        }
        ptr += match_length;
    }

And that is LZSA-HC.
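The o1_exclude_set[] table used in the match_length == 1 case can be precomputed once from the dictionary : for each byte c, it holds every byte that follows c somewhere in the dictionary (exactly the digram exclusions from the opening of this post); if the true next byte had followed c in the dictionary, the match would have been length >= 2. A hypothetical sketch using the ExcludeSet layout above :

    void build_o1_exclude_sets( const unsigned char * dictionary, int dictionary_size,
                                ExcludeSet o1_exclude_set[256] )
    {
        for ( int c = 0; c < 256; c++ )
            o1_exclude_set[c].clear();

        // every digram (dictionary[i], dictionary[i+1]) makes dictionary[i+1]
        // an exclusion after a length-1 match of dictionary[i]
        for ( int i = 0; i + 1 < dictionary_size; i++ )
            o1_exclude_set[ dictionary[i] ].add( dictionary[i + 1] );
    }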