02-06-15 - LZSA - Part 6

Some wrapping up.

I've only talked about LZSA for the case of static dictionaries. What about the more common case of scanning through a file, where the dictionary is built dynamically from the previously transmitted data?

Well, in theory you could use LZSA there. You need a streaming/online suffix trie construction. That does exist (eg. Ukkonen's algorithm), and it's O(N) just like offline suffix array construction. So in theory you could do that.

But it's really no good. The problem is that LZSA is transmitting substrings with probability based on their character counts. That's ideal for data generated by a truly static source (one that also has no finite state patterns). Real world data is not like that. Real world sources are always locally changing (*). This means that low-offset recent data is much better correlated with the current data than old data is. That is easy to model in LZ77 but very hard to model in LZSA; you would have to somehow work the offsets of the strings into their code space allocation. Data that has finite state patterns also benefits from numeric modeling of the offsets (eg. the 4-byte and 8-byte offset patterns of binary structs, and so on).

(* = there's also a kind of funny "Flatland" type thing that happens in data compression. If you're in a 2d Flatland, then 2d rigid bodies keep their shape, even under translation. However, a 3d rigid body that's translating through your slice will appear to change shape. The same thing happens with data compression models. Even if the source comes from an actual static model, if your view of that model is only a partial view (eg. your model is simpler than the real model), then it will appear to be a dynamic model to you. Say you have a simple model with a few parameters; the best values of those parameters will change over time, so the source appears dynamic to your model.)
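
To make the coding model above concrete: the suffixes prefixed by a given string form one contiguous interval of the suffix array, and the width of that interval is the string's count, which sets its code space. Here's a rough sketch of that in Python; the function names are mine, not from any real LZSA implementation, and bisect's "key" parameter needs Python 3.10+ :

    import bisect
    from math import log2

    def suffix_array(dictionary):
        # naive O(N^2 log N) sort for clarity; a real implementation
        # would use an O(N) suffix array construction
        return sorted(range(len(dictionary)), key=lambda i: dictionary[i:])

    def substring_code_length(dictionary, sa, s):
        # suffixes starting with s form one contiguous run of sa; comparing
        # against suffixes truncated to len(s) finds the run's bounds
        key = lambda i: dictionary[i:i + len(s)]
        lo = bisect.bisect_left(sa, s, key=key)
        hi = bisect.bisect_right(sa, s, key=key)
        count = hi - lo   # occurrence count of s in the dictionary
        assert count > 0, "s must occur in the dictionary"
        # probability proportional to count -> ideal code length in bits
        return -log2(count / len(sa))

    dictionary = b"abracadabra"
    sa = suffix_array(dictionary)
    # "abra" prefixes 2 of the 11 suffixes, so -log2(2/11) ~= 2.46 bits
    print(substring_code_length(dictionary, sa, b"abra"))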

Furthermore, at the point that you're doing LZSA-dynamic, you've got the complexity of PPM and you may as well just do PPM.

The whole advantage of LZSA is that it's an incredibly simple way to get a PPM-like static model. You just take your dictionary and suffix sort it and you're done. You don't have to go training a model, you don't have to find a way to compact your PPM nodes. You just have suffix sorted strings.
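
As a sketch of how little machinery that is (again with hypothetical names, and the same truncated-suffix comparison trick as above): suffix sort once, then find each match by narrowing the suffix interval one byte at a time.

    import bisect

    def longest_match(dictionary, sa, lookahead):
        # narrow the interval of suffixes that share the current prefix,
        # one byte at a time, until no suffix continues it
        lo, hi = 0, len(sa)
        match_len = 0
        while match_len < len(lookahead):
            target = lookahead[:match_len + 1]
            key = lambda i: dictionary[i:i + len(target)]
            nlo = bisect.bisect_left(sa, target, lo, hi, key=key)
            nhi = bisect.bisect_right(sa, target, lo, hi, key=key)
            if nlo == nhi:
                break   # no dictionary suffix continues this prefix
            lo, hi, match_len = nlo, nhi, match_len + 1
        # the final interval width is the count that sets the match's
        # code space, as in the previous sketch
        return match_len, (lo, hi)

    dictionary = b"abracadabra"
    sa = sorted(range(len(dictionary)), key=lambda i: dictionary[i:])
    # matches "abra" (len 4); the interval holds the 2 suffixes starting "abra"
    print(longest_match(dictionary, sa, b"abrax"))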


Some papers for further reading:

"A note on the Ziv - Lempel model for compressing individual sequences" - Langdon's probability model equivalent of LZ

"Unbounded Length Contexts for PPM" - the PPM* paper

"Dictionary Selection using Partial Matching" - LZ-PM , an early ROLZ

"PPM Performance with BWT Complexity" - just a modification of PPM* with some notes.

"Data compression with finite windows" - LZFG ; the "C2" is the Suffix Trie LZFG that I usually think of.

Mark Nelson's Suffix Tree writeup; see also Larsson, Senft, and Ukkonen.

There's obviously some relation to ACB, particularly for LZSA*, in the general idea of finding the longest backwards context and then encoding the longest forward match from there, but the details of the coding are totally unrelated.
