cbloom rants: 02-05-15 - LZSA

PPM.

You can of course implement a static PPM easily with a suffix array. (you would not want to do it for dynamic PPM because inserting into a sort is painful; the standard context tree used for PPM is a suffix trie)

Say my string is "abraabracadabra" (Steve Miller style). The suffix sort is :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra
bra
braabracadabra
bracadabra
cadabra
dabra
ra
raabracadabra
racadabra

I want to encode the next character starting with order-3 PPM. My current context is "bra". So I look up "bra..." in the suffix sort. I've seen "a" and "c" before, each with count 1. So I have a total count of 2 and a novel count of 2. I can do the old fashioned PPM escape estimation from that either code a symbol in that context or escape. If I escape, I go to order 2 and look up "ra..." , etc.

Now, what if you did a funny thing with your PPM.

Start at order-0 and encode a character. At order-0 we have all the strings in the suffix array, so we just count the occurance of each first character ( C(a) = 7, C(b)=3, C(c)=1, C(d)=1, C(r)=3 , Total = 15).

Say we encode an "a". So we send 7 in 15. (and a no-escape flag)

On the next character move to order-1. So we reduce to these strings :

a
aabracadabra
abra
abraabracadabra
abracadabra
acadabra
adabra

Within this context we have C(a)=1, C(b)=3, C(c)=1, C(d)=1 (to send an 'r' we would have to escape). Say we sent a "c".

Now we change to order-2. We only have this string at order-2 :

acadabra

So if our next character is "a" we can send that with just a no-escape flag. Otherwise we have to escape out of this context.

What we have now is a "deterministic context" and we can keep increasing order as long as we match from it.

This is LZSA !

(the only difference is that in LZSA the escape is only modeled based on context order, not the contents of the context, which it normally would be in PPM)

To be clear :

When LZSA encodes a reference to the string "abc" it does with an arithmetic encoder an the equivalent probability :


P = probability equivalent as used in arithmetic encoding

P = C("abc") / C_total

C("abc") = C("a") * C("ab")/C("a") * C("abc")/C("ab")

as the counts become large and match the true probabilities :

C("ab")/C("a") => P("b"|"a")   (reads count of "b" given a previous "a")


C("abc") => C("a") * P("b"|"a") * P("c"|"ab")

P encoded => P("a") * P("b"|"a") * P("c"|"ab")

That's order-0 * order-1 * order-2. (and so on for larger match lengths).

The match length can be thought of as unary. So ML=3 is "1110". We read that as "match,match,match, no-match".

In this way we see the match length as a series of escape/no-escape flags.

Another note on PPM.

In real-world modern PPM's you do LOE. (LOE = Local Order Estimation). In the olden days you just chose "we always use order-4 PPM", which meands you start at order-4, and escape down to order-3,2,1. With LOE you look at local information and decide which order to use.

Now, let's say you had some data that was binary and had a repeated pattern that was like :

1 byte is random
3 bytes predictable
1 byte is random
7 bytes predictable

repeated. What orders should you use?

order-0 for random
order-0,1,2
order-0 for random
order-0,1,2,3,4,6

these are the orders that LZ uses!

This is part of why LZ can beat PPM on binaries but not on text, because the funny way that LZ jumps down to order-0 at the start of a match can actually be a good thing on binary.

Now, what if you had skip contexts?

(skip contexts are contexts that use some non-consecutive range of previous characters; eg. something like YYN is a two-symbol context that uses the 2nd & 3rd symbols back, but not the immediately preceding 1 symbol.)

If you have random bytes and skip contexts, then what you want is :

order-0 for random
order-0, Y, YY
order-0 for random
YYYN, YYYNY, YYYNYY, etc..

and this is what a "repeat match" is in LZSA.

Say I matched "abr" in the example above, but then my string is "abrd" , so I get a match of length 3, then a mismatch. I can then continue from the "repeat match" skip-context using YYYN of "abrX" :

YYYN...
abraabracadabra
abracadabra

so there are two strings available matches for a "repeat match". If your next character is not "a" or "c" then you code an "escape" and drop the YYYN context which is the same as saying you code a flag to select a normal match and not a repeat match.

If we like, we could make LZSA more of a true PPM.

In LZSA-Basic we encoded the match length and then the suffix index to select the match.

Instead, you could code the suffix index first. The decoder can then get the match string :


  int suffix_index = arithmetic_fetch( dictionary_size );

  at this point {suffix_index, match_length} is our match string

  unsigned char * match_string = suffix_sort[suffix_index];

but we still don't know the match length.

You can then encode "match length" unary style, a sequence of binary "more length" flags (these are PPM escape flags). Because we already know match_string, we can use the characters matched so far in our match flag coding. We could use them as contexts for the match flag. Or we could do a true PPM and use them to look up a context node and get total & novel counts to do PPM-style escape estimation.

If we do that, then LZSA really becomes a true PPM. It's a PPM that does this funny order selection : order-0,order-1,... order-N, then escapes back to order-0.

It has a neat advantage over traditional PPM - we are encoding the character selection in one single arithmetic coding operation, instead of one at each context level.

cbloom rants

2/05/2015

02-05-15 - LZSA - Part 3

No comments:

old rants