You are doing LZ77-style coding (eg. matches in the prior stream or literals), but send the matches in a different way.

You have a Suffix Trie built on the prior stream. To find the longest match for a normal LZ77 you would take the current string to code and look it up by walking it down the Trie. When you reach the point of deepest match, you see what string in the prior stream made that node in the Trie, and send the position of that string as an offset.

Essentially what the offset does is encode a position in the tree.

But there are many redundancies in the normal LZ77 scheme. For example if you only encode a match of length 3, then the offsets that point to "abcd.." and "abce.." are equivalent, and shouldn't be distinguished by the encoding. The fact that they both take up space in the numerical offset is a waste of bits. You only want to distinguish offsets that actually point at something different for the current match length.

The idea in a nutshell is that instead of sending an offset, you send the descent into the trie to find that string.

At each node, first send a single bit for does the next byte in the string match any of the children. (This is equivalent to a PPM escape). If not, then you're done matching. If you like, this is like sending the match length with unary : 1 bits as long as you're in a node that has a matching child, then a 0 bit when you run out of matches. (alternatively you could send the entire match length up front with a different scheme).

When one of the children matches, you must encode which one. This is just an encoding of the next character, selected from the previously seen characters in this context. If all offsets are equally likely (they aren't) then the correct thing is just Probability(child) = Trie_Leaf_Count(child) , because the number of leaves under a node is the number of times we've seen this substring in the past.

(More generally the probability of offsets is not uniform, so you should scale the probability of each child using some modeling of the offsets. Accumulate P(child) += P(offset) for each offset under a child. Ugly. This is unfortunately very important on binary data where the 4-8-struct offset patterns are very strong.)

Ignoring that aside - the big coding gain is that we are no longer uselessly distinguishing offsets that only differ at higher match length, AND instead of just wasting those bits, we instead use them to make those offsets code smaller.

For example : say we've matched "ab" so far. The previous stream contains "abcd","abce","abcf", and "abq". Pretend that somehow those are the only strings. Normal LZ77 needs 2 bits to select from them - but if our match len is only 3 that's a big waste. This way we would say the next char in the match can either be "c" or "q" and the probabilities are 3/4 and 1/4 respectively. So if the length-3 match is a "c" we send that selection in only log2(4/3) bits = 0.415

And the astute reader will already be thinking - this is just PPM! In fact it is exactly a kind of PPM, in which you start out at low order (min match length, typically 3 or so) and your order gets deeper as you match. When you escape you junk back to order 3 coding, and if that escapes it jumps back to order 0 (literal).

There are several major problems :

1. Decoding is slow because you have to maintain the Suffix Trie for both encode and decode. You lose the simple LZ77 decoder.

2. Modern LZ's benefit a lot from modeling the numerical value of the offset in binary files. That's ugly & hard to do in this framework. This method is a lot simpler on text-like data that doesn't have numerical offset patterns.

3. It's not Pareto. If you're doing all this work you may as well just do PPM.

In any case it's theoretically interesting as an ideal of how you would encode LZ offsets if you could.

(and yes I know there have been many similar ideas in the past; LZFG of course, and Langdon's great LZ-PPM equivalence proof)

## No comments:

Post a Comment