cbloom rants: 02-04-15 - LZSA

Algorithm LZSA-Basic :


LZSA-Basic encoder :

given a dictionary
form the suffix sort
make a match lookup structure on the suffix sort (suffix trie for example)

when you look up a string in the match structure
you are given the lowest index of that substring in the suffix sort
and also the number of entries that match that prefix

for example in a suffix trie
each leaf corresponds to an entry in the suffix sort
each node stores the lowest leaf index under it, and the number of leaves
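
A suffix trie is not the only way to get that {low index, count} pair. As a minimal sketch (not the structure the post uses), the same answer falls out of two binary searches directly on the suffix sort. The names build_suffix_sort and lookup_range are made up for illustration, and the naive sort is only sensible for tiny dictionaries:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Naive suffix sort: start offsets of every suffix of dict, in lexicographic order.
// Worst case O(N^2 log N); good enough for a sketch only.
std::vector<int> build_suffix_sort(const std::string & dict)
{
    std::vector<int> sa(dict.size());
    for (int i = 0; i < (int)sa.size(); i++) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return dict.compare(a, std::string::npos, dict, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns {low, count}: the range of suffix sort entries that start with pat.
// count == 0 means pat does not occur in the dictionary.
std::pair<int,int> lookup_range(const std::string & dict,
                                const std::vector<int> & sa,
                                const std::string & pat)
{
    assert(!pat.empty()); // an empty pattern would trivially match every suffix
    // first suffix whose leading pat.size() chars are >= pat
    auto lo = std::lower_bound(sa.begin(), sa.end(), pat,
        [&](int s, const std::string & p) { return dict.compare(s, p.size(), p) < 0; });
    // first suffix whose leading pat.size() chars are > pat
    auto hi = std::upper_bound(lo, sa.end(), pat,
        [&](const std::string & p, int s) { return dict.compare(s, p.size(), p) > 0; });
    return { (int)(lo - sa.begin()), (int)(hi - lo) };
}
```

For the dictionary "banana" the suffix sort is a, ana, anana, banana, na, nana, so looking up "ana" gives low index 1 and count 2.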


To encode :

look up the current string in the match lookup structure
if match length < MML 
{
  flag literal & send literal
}
else
{
  flag not literal
  send match length

  send the suffix substring matched :
  simply one arithmetic encode call
  (dictionary size usually a power of 2 for more speed)

  arithmetic_encode( suffix_sort_low_index, suffix_count , dictionary_size );
}
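
To see why this is attractive, it helps to look at what that one arithmetic_encode call costs. With an ideal arithmetic coder, an interval of suffix_count out of dictionary_size costs log2( dictionary_size / suffix_count ) bits, so a match that occurs many times in the dictionary is cheaper to send than a unique one. A tiny sketch of that arithmetic (match_index_bits is a hypothetical helper, not part of the post):

```cpp
#include <cassert>
#include <cmath>

// Ideal cost in bits of sending one suffix range with an arithmetic coder.
// The coded probability is suffix_count / dictionary_size.
double match_index_bits(int suffix_count, int dictionary_size)
{
    assert(suffix_count >= 1 && suffix_count <= dictionary_size);
    return std::log2((double)dictionary_size / (double)suffix_count);
}
```

With a 1024 entry dictionary, a unique match (count 1) costs 10 bits, the same as a plain offset, while a match with 256 occurrences costs only 2 bits.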

Lazy parsing and other standard LZ things are optional.

Minimum Match Length, MML >= 2 as written. However, you could also set MML=1 and dispense with the match flag entirely. Then literals are written as a match of length 1 (and you must ensure every character occurs at least once in the dictionary). This is identical to order-0 coding of the literals, because the suffix ranges for matches of length 1 are just the order-0 counts! In practice it's better to code literals separately, because that lets you use a custom literal coder (using order-1 context, or match history context, or whatever).
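
The order-0 claim is easy to check by hand. A minimal sketch, assuming the suffix sort compares bytes as unsigned values: the length-1 range for a byte is just its histogram count, and its low index is the number of dictionary bytes that sort before it. The names Range and length1_ranges are illustrative only:

```cpp
#include <array>
#include <cassert>
#include <string>

struct Range { int low, count; };

// Suffix sort range of every possible length-1 match, built from a plain
// byte histogram without ever forming the suffix sort.
std::array<Range,256> length1_ranges(const std::string & dict)
{
    std::array<Range,256> r{};                    // zero-initialized
    for (unsigned char c : dict) r[c].count++;    // order-0 counts
    int low = 0;
    for (int c = 0; c < 256; c++)                 // cumulative counts give the low index
    {
        r[c].low = low;
        low += r[c].count;
    }
    assert(low == (int)dict.size());              // the ranges tile the whole suffix sort
    return r;
}
```

For "banana" this gives a = {0,3}, b = {3,1}, n = {4,2}, which are exactly the ranges of suffixes starting with each letter.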


LZSA-Basic decoder :

decoder requires the suffix sort
it also requires the suffix count for the given match length
(see later)


To decode :

get match flag
if not match
{
  decode literal
}
else
{
  decode match length

  get the suffix index :

  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];
  copy_match( out_ptr , match_string, match_length );
  out_ptr += match_length;

  we also need the suffix low index & count to remove the arithmetic interval :

  suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length );

  this must be the same interval that was encoded :
  (suffix_index is somewhere in that range)
  (note that all suffix_index values in that range provide the same match string over match_length)

  arithmetic_remove( suffix_sort_low_index, suffix_count , dictionary_size );
}

easy peasy, and very fast. Decoding is just as fast as normal LZ77, except for one piece : get_suffix_count.

To implement get_suffix_count we need something like the suffix trie that was used in the encoder. But we can do something a bit more compact and efficient. Rather than a forward tree, we can use a backward only tree, because we have a leaf index to jump into, and we only need to go up to parents to find the right node.


get_suffix_count :


struct backward_suffix_node
{
  int parent; // node index
  int depth;  // length of the prefix shared by all leaves under this node
  int low,count; // suffix sort range
};

unsigned char * suffix_sort[ dictionary_size ];
int suffix_leaf_parent[ dictionary_size ];
backward_suffix_node suffix_nodes[ dictionary_size ];


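suffix_sort_low_index, suffix_count = get_suffix_count( suffix_index, match_length )
{
    // start at the implicit leaf and walk up while the parent still covers match_length
    // (assumes the root has depth 0, so the walk stops there for any match_length >= 1)
    int node = -1;
    int parent = suffix_leaf_parent[ suffix_index ];
    while( match_length <= suffix_nodes[ parent ].depth )
    {
        node = parent;
        parent = suffix_nodes[ node ].parent;
    }

    if ( node == -1 )
        return suffix_index, 1; // never left the leaf : the match is unique
    else
        return suffix_nodes[ node ].low , suffix_nodes[ node ].count;
}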

the logic here is just slightly fiddly due to path compression. With path compression, match_length can be between the depth of two nodes, and when that happens you want to stay at the child node. The leaf nodes are implicit, and the number of internal nodes is always <= the number of leaves.
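
The post does not say how the backward tree gets built, so here is one possible way, as a hedged sketch: sort the suffixes, take the common prefix length of each adjacent pair, and build the internal nodes with a stack. BackwardTree, Range, common_prefix and build_backward_tree are assumed names, the sort and prefix compare are naive, and the root is stored as node 0 with depth 0:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct backward_suffix_node { int parent; int depth; int low, count; };
struct Range { int low, count; };

struct BackwardTree
{
    std::vector<int> suffix_sort;                    // sorted suffix start offsets
    std::vector<int> suffix_leaf_parent;             // node index for each leaf
    std::vector<backward_suffix_node> suffix_nodes;  // [0] is the root, depth 0
};

static int common_prefix(const std::string & d, int a, int b)
{
    int n = 0;
    while (a + n < (int)d.size() && b + n < (int)d.size() && d[a + n] == d[b + n]) n++;
    return n;
}

BackwardTree build_backward_tree(const std::string & dict)
{
    const int n = (int)dict.size();
    BackwardTree t;
    t.suffix_sort.resize(n);
    for (int i = 0; i < n; i++) t.suffix_sort[i] = i;
    std::sort(t.suffix_sort.begin(), t.suffix_sort.end(), [&](int a, int b) {
        return dict.compare(a, std::string::npos, dict, b, std::string::npos) < 0;
    });
    t.suffix_leaf_parent.assign(n, 0);
    t.suffix_nodes.push_back({ -1, 0, 0, n });       // root covers everything
    std::vector<int> stack{ 0 };                     // open nodes, deepest on top
    auto top_depth = [&]() { return t.suffix_nodes[stack.back()].depth; };
    for (int i = 1; i <= n; i++)
    {
        // l = shared prefix of leaves i-1 and i (0 past the end closes every open node)
        const int l = (i < n) ? common_prefix(dict, t.suffix_sort[i - 1], t.suffix_sort[i]) : 0;
        int low = i - 1, last = -1;
        if (top_depth() >= l) t.suffix_leaf_parent[i - 1] = stack.back();
        while (top_depth() > l)                      // close nodes deeper than l
        {
            const int v = stack.back(); stack.pop_back();
            t.suffix_nodes[v].count = i - t.suffix_nodes[v].low;
            low = t.suffix_nodes[v].low;
            if (top_depth() >= l) t.suffix_nodes[v].parent = stack.back();
            else last = v;                           // parent is the node created below
        }
        if (top_depth() < l)                         // open a new node of depth l
        {
            const int w = (int)t.suffix_nodes.size();
            t.suffix_nodes.push_back({ -1, l, low, 0 });
            if (last >= 0) t.suffix_nodes[last].parent = w;
            else t.suffix_leaf_parent[i - 1] = w;
            stack.push_back(w);
        }
    }
    return t;
}

// Same walk as the pseudocode above, against the tree built here.
Range get_suffix_count(const BackwardTree & t, int suffix_index, int match_length)
{
    assert(match_length >= 1);                       // root depth 0 then always stops the walk
    int node = -1;
    int parent = t.suffix_leaf_parent[suffix_index];
    while (match_length <= t.suffix_nodes[parent].depth)
    {
        node = parent;
        parent = t.suffix_nodes[node].parent;
    }
    if (node == -1) return { suffix_index, 1 };
    return { t.suffix_nodes[node].low, t.suffix_nodes[node].count };
}
```

On "banana", suffix index 2 is "anana". Asking for match length 3 returns the range {1,2} ("ana" occurs twice), and match length 2 returns the same {1,2}: that is the path compression case, where the length falls between the depth-1 node and the depth-3 node and you stay at the child. Match length 4 never leaves the leaf and returns {2,1}.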

You could of course also accelerate the suffix_count lookup for low match lengths (ML=3 or 4, for example) by just having a direct array lookup for that case.
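
As a rough sketch of what such a direct lookup could look like, here it is for ML=2 (chosen only to keep the table at 65536 entries; ML=3 would need 2^24). The table is indexed by the first two bytes of the match and filled from pair counts. The one wrinkle is the final one-byte suffix of the dictionary, which sorts just before every pair starting with that byte. build_len2_table and Range are hypothetical names:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Range { int low, count; };

// table[(c0 << 8) | c1] = suffix sort range of the two-byte string c0 c1.
std::vector<Range> build_len2_table(const std::string & dict)
{
    std::vector<Range> table(65536, Range{ 0, 0 });
    if (dict.empty()) return table;
    for (size_t i = 0; i + 1 < dict.size(); i++)     // count every byte pair
    {
        const int key = ((unsigned char)dict[i] << 8) | (unsigned char)dict[i + 1];
        table[key].count++;
    }
    const int last = (unsigned char)dict.back();
    int low = 0;
    for (int k = 0; k < 65536; k++)
    {
        // the one-byte final suffix occupies one slot ahead of pairs starting with that byte
        if ((k & 0xFF) == 0 && (k >> 8) == last) low++;
        table[k].low = low;
        low += table[k].count;
    }
    assert(low == (int)dict.size());                 // every suffix is accounted for
    return table;
}
```

The decoder would then read the range for a length-2 match straight from the table, keyed on the two bytes it just copied, with no tree walk at all.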

In theory walking backward like this has a bad O(N^2) possible running time (if the tree is deep, but you're only getting short matches in it). Conversely, walking forward down the trie from the root ensures that decode time is O(N), because the trie walk is proportional to match length. In practice the backward walk is always significantly faster. (A single forward trie step can involve lots of linked list steps and jumping around in memory; the backward trie is much more compact and easier to walk without conditionals; you have to have a very deep average tree depth for it to be worse.) If this were actually an issue, you could augment the backwards trie with a Fenwick/skip-list style larger parent step in a binary pattern (some halfway-to-root steps, some quarter-way-to-root steps, etc.). But it just isn't an issue.


2/04/2015

02-04-15 - LZSA - Part 2
