cbloom rants: 02-06-15 - LZSA

LZSA is not really an LZ. This is kind of what fascinates me about LZSA and why I think it's so interesting (*). Ryg called it "lz-ish" because it's not really LZ. It's actually much closer to PPM.

(* = it's definitely not interesting because it's practical. I haven't really found a use of LZSA in the real world yet. It appears to be a purely academic exercise.)

The fundamental thing is that the code space used by LZSA to encode a match is proportional to the probability of that substring :


P(abc) = P(a) * P(b|a) * P(c|ab)

where P is the probability as estimated by observed counts. That kind of code-space allocation is very fundamentally PPM-ish.
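To see the identity with real counts, here is a minimal Python sketch. It assumes a brute-force count in place of a real suffix array, and the helper names (count_occurrences, chain_probability, code_space_share) are made up for illustration. The dictionary string is the one from the ROLZ example later in this post. The product of count ratios telescopes down to count(s) / dictionary_size, which is exactly the share of the suffix index range that starts with s.

```python
from fractions import Fraction

def count_occurrences(dictionary: str, s: str) -> int:
    """Number of suffixes of `dictionary` that start with `s` (the observed count)."""
    return sum(dictionary.startswith(s, i) for i in range(len(dictionary)))

def chain_probability(dictionary: str, s: str) -> Fraction:
    """P(s[0]) * P(s[1] | s[0]) * ... with every factor a ratio of observed counts.

    Assumes `s` occurs in the dictionary (otherwise a count of zero is divided by).
    """
    p = Fraction(1)
    for k in range(1, len(s) + 1):
        p *= Fraction(count_occurrences(dictionary, s[:k]),
                      count_occurrences(dictionary, s[:k - 1]))
    return p

def code_space_share(dictionary: str, s: str) -> Fraction:
    """Fraction of the suffix index range covered by suffixes starting with `s`."""
    return Fraction(count_occurrences(dictionary, s), len(dictionary))

DICTIONARY = "abraabracadabra"

# P(a) * P(b|a) * P(r|ab) = 7/15 * 3/7 * 3/3
print(chain_probability(DICTIONARY, "abr"))  # 1/5
print(code_space_share(DICTIONARY, "abr"))   # 1/5
```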

This is in contrast to things like LZFG and LZW that also refer directly to substrings without offsets, but do not have the PPM-ish allocation of code space.

The funny thing about LZSA is that naively it *looks* very much like an LZ. The decoder of LZSA-Basic is :


LZSA-Basic decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch suffix_index in dictionary_size

match_ptr = suffix_array[suffix_index];

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove get_suffix_count( suffix_index, match_length ) , dictionary_size

}

whereas a similar LZ77 on a static dictionary with a flat offset model is :


LZ77-static-flat decoder rough sketch :

{

handle literal case
decode match_length

arithmetic_fetch offset_index in dictionary_size

match_ptr = dictionary_array + offset_index;

copy(out_ptr, match_ptr, match_length );
out_ptr += match_length;

arithmetic_remove {offset_index,offset_index+1} , dictionary_size

}

they are almost identical. The only two changes are : 1. an indirection table for the match index, and 2. the arithmetic_remove can have a range bigger than one, eg. it can remove fewer than log2(dictionary_size) bits.
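The second difference is the one that matters for code cost. A rough Python sketch, assuming an ideal arithmetic coder (cost = log2 of total range over removed range) and a brute-force suffix count; the function names are hypothetical and the dictionary string is borrowed from the ROLZ example further down:

```python
import math

def suffix_count(dictionary: str, match: str) -> int:
    """How many suffixes start with `match`; this is the range LZSA removes."""
    return sum(dictionary.startswith(match, i) for i in range(len(dictionary)))

def lz77_flat_offset_bits(dictionary: str) -> float:
    """Flat offset model: every offset removes a range of 1 out of dictionary_size."""
    return math.log2(len(dictionary))

def lzsa_index_bits(dictionary: str, match: str) -> float:
    """LZSA removes a range of suffix_count out of dictionary_size."""
    return math.log2(len(dictionary) / suffix_count(dictionary, match))

DICTIONARY = "abraabracadabra"

print(lz77_flat_offset_bits(DICTIONARY))   # log2(15), about 3.91 bits
print(lzsa_index_bits(DICTIONARY, "abr"))  # "abr" occurs 3 times: log2(5), about 2.32 bits
print(lzsa_index_bits(DICTIONARY, "rac"))  # "rac" occurs once: same cost as the flat offset
```

A substring that occurs only once costs the same as in the flat LZ77 model; the saving comes entirely from substrings that repeat in the dictionary.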

We're going to have a further look at LZSA as a PPM by examining some more variants :

LZSA-ROLZ :

ROLZ = "reduced offset LZ" uses some previous bytes of context to reduce the set of strings that can be matched.

This is trivial to do in LZSA, because we can use the same suffix array that we used for string matching to do the context lookup as well. All we have to do is start the suffix array lookup at an earlier position than the current pointer.

eg. instead of looking up matches for "ptr" and requiring ML >= 3 , we instead look up matches for "ptr-2" and require ML >= 5. (that's assuming you want to keep the same MML at 3. In fact with ROLZ you might want to decrease MML because you're sending offsets in fewer bits, so you could use an MML of 2 instead, which translates to a required suffix lookup ML of 4).

That is, say my string to encode is "abracket" and I've done "ab" so I'm at "racket". My static dictionary is "abraabracadabra". The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra

The dictionary size is 15.

With LZSA-Basic I would look up "racket", find a match of length 3, and send ML=3 and index = 14 in a range of 15.

With LZSA-ROLZ-o2 I would do :


string = "ab|racket"

look up context "ab" and find the low & high suffix indexes for that substring
 (low index = 2, count = 3)

look up "abracket" ; found "abracadabra" match of length 5
 at index = 4

send ML=3  (the lookup match length of 5, minus the 2 context bytes)

arithmetic_encode( suffix_index - context_low_index , context_count )
  (that's 2 in a range of 3)

You could precompute and tabulate the suffix ranges for the contexts, and then the complexity is identical to LZSA-Basic.
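The numbers in the example above can be reproduced with a small Python sketch. It assumes a naive sorted-suffix construction and linear scans rather than a real suffix array search, and all function names are hypothetical:

```python
def suffix_array(dictionary):
    """Start positions of all suffixes, in sorted order (naive construction)."""
    return sorted(range(len(dictionary)), key=lambda i: dictionary[i:])

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def context_range(dictionary, sa, context):
    """(low index, count) of the sorted suffixes that start with `context`."""
    hits = [k for k, pos in enumerate(sa) if dictionary.startswith(context, pos)]
    return (hits[0], len(hits)) if hits else (0, 0)

def longest_match(dictionary, sa, target, low, count):
    """Best (suffix_index, length) for `target` among sa[low : low + count]."""
    best_index, best_len = low, 0
    for k in range(low, low + count):
        n = common_prefix_len(dictionary[sa[k]:], target)
        if n > best_len:
            best_index, best_len = k, n
    return best_index, best_len

DICTIONARY = "abraabracadabra"
SA = suffix_array(DICTIONARY)

# LZSA-Basic: match "racket" against the whole dictionary.
basic_index, basic_ml = longest_match(DICTIONARY, SA, "racket", 0, len(SA))

# LZSA-ROLZ-o2: only suffixes that start with the 2-byte context "ab" are candidates.
low, count = context_range(DICTIONARY, SA, "ab")
index, lookup_ml = longest_match(DICTIONARY, SA, "ab" + "racket", low, count)
sent_ml = lookup_ml - len("ab")

print(basic_index, len(SA), basic_ml)  # 14 15 3
print(index - low, count, sent_ml)     # 2 3 3
```

Tabulating context_range for every order-2 context up front is what removes the extra lookup from the coding loop.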

In LZSA-ROLZ not every possible string can be encoded from within a context, so taking MML down to 1 like LZSA-HC is not possible. You must always be able to escape out of your context to be able to encode characters that aren't found within that context.

LZSA-Basic had the property of coding from order-0,order-1,2,3,.. ML, jumping back to order-0. In LZSA-ROLZ, instead of jumping down to order 0 after the end of a match, you jump down to order-2 (or whatever order you chose for the ROLZ). You might then have to jump again to order-0 to encode a literal. So you still have this pattern of the context order going up in waves and jumping back down, you just don't jump all the way to order-0.

LZSA-ROLZ* :

(pronounced "ROLZ star" ; named by analogy to "PPM*" (PPM star) which starts from the longest possible context)

This idea can be taken further, which turns out to be interesting.

Instead of doing ROLZ with a fixed order, do ROLZ from the highest order possible. That is, take the current context (preceding characters) and find the longest match in the dictionary. In order to do that efficiently you need another lookup structure, such as a suffix trie on the reversed dictionary (a prefix tree). The prefix tree should have pointers to the same string in the suffix tree.

eg. say you're coding "..abcd|efgh.."

You look up "dcba..." in the prefix tree (context backwards). The longest match you find is "dcbx..". So you're at order-3. You take the pointer over to the suffix tree to find "bcd" in the suffix tree. You then try to code "efgh..." from the strings that followed "bcd" in the dictionary.

You pick a match, send a match length, send an offset.
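A rough Python sketch of one such step. It assumes a brute-force search stands in for the prefix tree and the suffix tree (fine for a toy dictionary, far too slow for real use), and the names deepest_context and rolz_star_candidates are invented for illustration. The dictionary is the one from the ROLZ example above:

```python
def deepest_context(dictionary, history):
    """Longest suffix of `history` that occurs in `dictionary` (brute force)."""
    for n in range(len(history), 0, -1):
        if history[-n:] in dictionary:
            return history[-n:]
    return ""

def rolz_star_candidates(dictionary, history):
    """The deepest context, and the sorted strings that followed it in the dictionary."""
    ctx = deepest_context(dictionary, history)
    starts = [i + len(ctx) for i in range(len(dictionary))
              if dictionary.startswith(ctx, i) and i + len(ctx) < len(dictionary)]
    return ctx, sorted(dictionary[s:] for s in starts)

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

DICTIONARY = "abraabracadabra"

# Coding "..xxab|racket..": "xxab" and "xab" are not in the dictionary, "ab" is.
ctx, candidates = rolz_star_candidates(DICTIONARY, "xxab")
lengths = [common_prefix_len(c, "racket") for c in candidates]
ml = max(lengths)
index = lengths.index(ml)

print(len(ctx), candidates)         # 2 ['ra', 'raabracadabra', 'racadabra']
print(index, len(candidates), ml)   # 2 3 3
```

Here the deepest order happens to be 2, so the result is the same "2 in a range of 3" as the fixed order-2 ROLZ example.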

Say the deepest order you found is order-N. Then if you code a match, you're coding at order-N+1,order-N+2,etc.. as you go through the match length.

The funny thing is : those are the same orders you would code if you just did PPM* and only coded symbols, not string matches.

Say you're doing PPM* and you find an order-N context (say "abcd"). You successfully code a symbol (say "x"). You move to the next position. Your context now is "abcdx" - well, that context must occur in the dictionary because you successfully coded an x following abcd. Therefore you will find an order-N+1 context. Furthermore there can be no longer context or you would have found a longer one at the previous location as well. Therefore with PPM* as you successfully code symbols you will always code order-N, order-N+1,order-N+2 , just like LZSA!
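The argument can be checked on a toy case. This Python sketch assumes a static dictionary (no count updates as symbols are coded) and treats a symbol as "coded" when it follows the longest context somewhere in the dictionary; the function names are hypothetical:

```python
def longest_context_order(dictionary, history):
    """Length of the longest suffix of `history` that occurs in `dictionary`."""
    for n in range(len(history), 0, -1):
        if history[-n:] in dictionary:
            return n
    return 0

def trace_orders(dictionary, text):
    """Per position: (order of the longest context, symbol found in that context?)."""
    trace = []
    for i, symbol in enumerate(text):
        history = text[:i]
        order = longest_context_order(dictionary, history)
        context = history[len(history) - order:]
        coded = (context + symbol) in dictionary
        trace.append((order, coded))
    return trace

DICTIONARY = "abraabracadabra"
trace = trace_orders(DICTIONARY, "abracket")
print(trace)
# [(0, True), (1, True), (2, True), (3, True), (4, True), (5, False), (0, False), (0, False)]

# After every successfully coded symbol the next order is exactly one higher.
for (order, coded), (next_order, _) in zip(trace, trace[1:]):
    if coded:
        assert next_order == order + 1
```

The orders climb 0,1,2,3,4,5 through "abrac" and then fall back down once "k" fails, which is the same wave pattern as the match length in LZSA.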

If LZSA-ROLZ* can't encode a symbol at order-N it must escape down to lower orders. You want to escape down to the next lower order that has new following characters. You need to use exclusion.
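A small Python sketch of that escape walk, assuming brute-force follower counts on the static dictionary from the ROLZ example; followers and escape_chain are hypothetical helper names. Orders whose following characters were all already seen at a higher order are skipped, and symbols already seen are excluded from the lower orders:

```python
from collections import Counter

def followers(dictionary, context):
    """Counts of the characters that follow each occurrence of `context`."""
    n = len(context)
    return Counter(dictionary[i + n] for i in range(len(dictionary) - n)
                   if dictionary.startswith(context, i))

def escape_chain(dictionary, context):
    """Orders visited when escaping, each with the new symbols it offers."""
    excluded, chain = set(), []
    for order in range(len(context), -1, -1):
        ctx = context[len(context) - order:]
        new_symbols = set(followers(dictionary, ctx)) - excluded
        if new_symbols:                 # orders that add nothing new are skipped
            chain.append((order, new_symbols))
            excluded |= new_symbols
    return chain

DICTIONARY = "abraabracadabra"
chain = escape_chain(DICTIONARY, "abra")
for order, symbols in chain:
    print(order, sorted(symbols))
# 4 ['a', 'c']
# 1 ['b', 'd']
# 0 ['r']
```

Orders 3 and 2 ("bra" and "ra") are followed by the same characters as "abra", so escaping to them would waste an escape; the first useful stop is order 1.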

Remember that in LZSA the character counts are used just like PPM, because of the way the suffix ranges form a probability distribution. If the following strings are "and..,at..,boobs.." then character a is coded with a 2/3 probability.
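In code the count model is nothing more than a histogram of first characters over the candidate strings. A tiny Python sketch (the function name is made up, and the following strings are truncated to the parts given above):

```python
from collections import Counter
from fractions import Fraction

def next_char_probabilities(following_strings):
    """Each candidate string votes once for its first character."""
    counts = Counter(s[0] for s in following_strings if s)
    total = sum(counts.values())
    return {ch: Fraction(n, total) for ch, n in counts.items()}

probs = next_char_probabilities(["and", "at", "boobs"])
print(probs["a"], probs["b"])  # 2/3 1/3
```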

There are a few subtle differences :

In real PPM* they don't use just the longest context. They use the *shortest* deterministic context (if there is any). If there is any deterministic context, then the longest context is deterministic, predicting the same thing. But there may be other mismatching contexts that predict the same thing and thus affect counts. That is :


..abcdefg|x    abcdefg is the longest context, and only x follows it
.....xefg|y    context "efg" sees both x and y.  So "defg" is the shortest deterministic context
....ydefg|x    but "ydefg" also predicts x

So the longest deterministic context only sees "x" with a count of 1

But if you use the shortest deterministic context, you see "x" with a count of 2

And that affects your escape estimation.
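The three-line example above can be run directly. In this Python sketch the three fragments are glued into one dictionary with a '#' separator (an assumption, just to keep contexts from running across fragment boundaries), and the helper names are hypothetical:

```python
from collections import Counter

def followers(dictionary, context):
    """Counts of the characters that follow each occurrence of `context`."""
    n = len(context)
    return Counter(dictionary[i + n] for i in range(len(dictionary) - n)
                   if dictionary.startswith(context, i))

def deterministic_contexts(dictionary, history):
    """(context, count) for each suffix of `history` with exactly one distinct
    follower, shortest context first."""
    found = []
    for n in range(1, len(history) + 1):
        ctx = history[-n:]
        seen = followers(dictionary, ctx)
        if len(seen) == 1:
            found.append((ctx, sum(seen.values())))
    return found

# "..abcdefg|x", "..xefg|y" and "..ydefg|x", separated by '#'
DICTIONARY = "abcdefgx#xefgy#ydefgx"

found = deterministic_contexts(DICTIONARY, "abcdefg")
shortest, longest = found[0], found[-1]
print(shortest)  # ('defg', 2)
print(longest)   # ('abcdefg', 1)
```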

The big difference is in the coding of escapes vs. match lengths.

And if you code matches in LZSA-ROLZ* from orders lower than the deepest, that results in different order selection than PPM*.


2/06/2015

02-06-15 - LZSA - Part 5
