Say you have an MML of 2 and encode a literal "a". That means all digrams in the dictionary that start with "a" are exclusions. The following character cannot be any of those, and since those are the most likely characters, the exclusions are worth a lot.
Discussed previously here :
cbloom rants 09-03-10 - LZ and Exclusions
Even when you send a match, you have a powerful exclude from the characters that cannot extend that match. In a normal LZ77 like LZMA this is done for the *single* match offset that we sent with a literal-after-match exclude. In LZSA we can easily exclude *all* the characters that could have extended the match.
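To make that concrete, here is a minimal sketch (my own, not from the post) of literal coding with exclusion : excluded characters are given zero probability and everything else keeps its frequency. arithmetic_encode( cumfreq, freq, total ) is the assumed coder interface, the same one used in the pseudocode below.

    // sketch : code one character from an order-0 frequency table with exclusion
    void arithmetic_encode( int cumfreq, int freq, int total ); // assumed coder interface

    void encode_literal_with_exclusion( int sym, const int freq[256], const bool excluded[256] )
    {
        int cum = 0, total = 0;
        for ( int s = 0; s < 256; s++ )
        {
            if ( excluded[s] ) continue;    // excluded characters contribute no probability
            if ( s < sym ) cum += freq[s];  // cumulative frequency of kept characters below sym
            total += freq[s];
        }
        arithmetic_encode( cum, freq[sym], total ); // sym itself must not be excluded
    }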
LZSA-HC = LZSA - High Compression
For LZSA-HC we take MML down to 1, so literals and matches are unified and we are always writing "matches". However, we write each match by first writing the first character of the match and then writing the rest of the match. I assume that you have ensured the dictionary contains every character at least once, so there's no issue of impossible encodings.
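Ensuring that precondition can be as trivial as appending all 256 byte values to the dictionary; a hypothetical sketch :

    #include <vector>

    // guarantee the dictionary contains every byte at least once,
    // so any first character of a match can always be coded
    void append_all_bytes( std::vector<unsigned char> & dictionary )
    {
        for ( int c = 0; c < 256; c++ )
            dictionary.push_back( (unsigned char) c );
    }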
Algorithm LZSA-HC :
Data structures required are the same as LZSA-Basic. When the match lookup structure (typically a suffix trie) returns a match, it should also provide the set of characters which were found in the dictionary but did not match the next character in the search string.

To encode :

    initialize exclude_set = { empty }
    for all of ptr
    {
        encode current character (*ptr) using literal coder, doing exclusion from current exclude_set
        look up the current string (ptr) in the match lookup structure
        set exclude_set = characters found that followed the match but != ptr[match_length]
        transmit match length ( >= 1 )
        if ( match_length > 1 )
        {
            send the suffix substring matched : simply one arithmetic encode call
            we only need to send it within the range of suffixes of the already-sent first character
            char_suffix_low[ *ptr ] is a table lookup
            char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];
            arithmetic_encode( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
        }
        ptr += match_length;
    }

A few notes on this.
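The char_suffix_low[] table is just the boundaries of the first-character buckets in the suffix sort. A hypothetical construction (names mine, assuming suffix_sort[i] points at the i-th dictionary suffix in sorted order, so all suffixes starting with the same byte form one contiguous interval) :

    // char_suffix_low[c]   = index of the first suffix starting with byte c
    // char_suffix_low[c+1] - char_suffix_low[c] = count of suffixes starting with c
    void build_char_suffix_low( const unsigned char * const * suffix_sort, int num_suffixes,
                                int char_suffix_low[257] )
    {
        int i = 0;
        for ( int c = 0; c <= 256; c++ )
        {
            // advance past all suffixes whose first byte is < c
            while ( i < num_suffixes && suffix_sort[i][0] < c )
                i++;
            char_suffix_low[c] = i;
        }
    }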
We only send matches, length 1,2,3... Then exclude all characters that could not come after the match. This exclusion is exactly the same as exclusion in PPM when you escape down orders. In LZ you escape from order-ML down to order-0.
Coding the first character of the match separately is just so that I can use a different coder there. I use order-1 plus some bits of match history as context. For purity, I could have left that out and just always coded a suffix index for the match. In that case "exclusion" would consist of subtracting off the char_suffix_count[] number of suffixes from each excluded character.
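Here's a hedged sketch of what that pure variant could look like : the matched suffix interval is coded over the whole suffix sort, with the intervals of excluded first characters removed from both the total and the cumulative position. char_suffix_low and arithmetic_encode are the same assumed interfaces as in the pseudocode above.

    void arithmetic_encode( int cumfreq, int freq, int total ); // assumed coder interface

    void encode_match_suffix_excluded( int suffix_sort_low_index, int suffix_count,
                                       unsigned char first_char,
                                       const int char_suffix_low[257], const bool excluded[256],
                                       int dictionary_size )
    {
        int low   = suffix_sort_low_index;  // interval of suffixes matching our string
        int total = dictionary_size;        // total suffixes before exclusion
        for ( int c = 0; c < 256; c++ )
        {
            if ( !excluded[c] ) continue;
            int count = char_suffix_low[c + 1] - char_suffix_low[c];
            total -= count;                   // excluded first characters contribute nothing
            if ( c < first_char ) low -= count; // their intervals lie wholly below ours
        }
        arithmetic_encode( low, suffix_count, total );
    }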
Because match_length is sent after the first character, I use the first character as context for coding the match length.
The "if ( match_length > 1 )" is actually optional. You could go ahead and run that block for match_length = 1 as well. It will arithmetic_encode a symbol whose probability is 1; that is a cumprob which is equal to the full range. This should be a NOP on your arithmetic encoder. Whether that's a good idea or not depends on implementation.
In practice the exclude_set is 256 bit flags = 32 bytes.
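One possible layout for those 32 bytes (my naming, a sketch) :

    struct ExcludeSet
    {
        unsigned long long bits[4];   // 4 * 64 = 256 flags

        void clear()                 { bits[0] = bits[1] = bits[2] = bits[3] = 0; }
        void add( unsigned char c )  { bits[c >> 6] |= 1ULL << (c & 63); }
        bool excludes( unsigned char c ) const { return ( bits[c >> 6] >> (c & 63) ) & 1; }
    };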
LZSA-HC must use greedy parsing (always send the longest possible match) because of full exclusion : the decoder excludes every character that could have extended the match, so sending a shorter-than-maximal match would make the true next character an impossible symbol. There's no such thing as lazy/flexible/optimal parses.
To decode : we now can't really use the faster backward tree, because we need a lookup structure that will provide the matched suffix range (low index and count) and the exclude set.

    initialize exclude_set = { empty }
    for all of ptr
    {
        decode current character (*ptr) using literal coder, doing exclusion from current exclude_set
        decode match_length
        if ( match_length == 1 )
        {
            // just precompute a table of exclude sets for the after-1-character case :
            exclude_set = o1_exclude_set[ *ptr ]
        }
        else
        {
            decode the suffix index within the range of suffixes of the already-sent first character
            char_suffix_low[ *ptr ] is a table lookup
            char_suffix_count = char_suffix_low[ *ptr + 1 ] - char_suffix_low[ *ptr ];
            int suffix_index = arithmetic_fetch( char_suffix_count ) + char_suffix_low[ *ptr ];

            at this point {suffix_index, match_length} is our match string
            unsigned char * match_string = suffix_sort[ suffix_index ];
            copy_match( out_ptr , match_string, match_length );
            out_ptr += match_length;

            we also need the suffix low index & count to remove the arithmetic interval :
            suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );
            here get_suffix_count must be a real match lookup and should also fill exclude_set
            arithmetic_remove( suffix_sort_low_index - char_suffix_low[ *ptr ], suffix_count , char_suffix_count );
        }
        ptr += match_length;
    }

And that is LZSA-HC.
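The o1_exclude_set[] table used in the match_length == 1 case can be precomputed once from the dictionary : for each byte c, it holds every byte that follows c somewhere in the dictionary (exactly the digram exclusions from the opening of this post); if the true next byte had followed c in the dictionary, the match would have been length >= 2. A hypothetical sketch using the ExcludeSet layout above :

    void build_o1_exclude_sets( const unsigned char * dictionary, int dictionary_size,
                                ExcludeSet o1_exclude_set[256] )
    {
        for ( int c = 0; c < 256; c++ )
            o1_exclude_set[c].clear();

        // every digram (dictionary[i], dictionary[i+1]) makes dictionary[i+1]
        // an exclusion after a length-1 match of dictionary[i]
        for ( int i = 0; i + 1 < dictionary_size; i++ )
            o1_exclude_set[ dictionary[i] ].add( dictionary[i + 1] );
    }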