cbloom rants: 02-04-15 - LZSA

I'm going to introduce what I believe is a novel (*) compression algorithm. I'm calling it "LZSA" for "LZ Suffix Array" , though ryg rightly points out it's not really an LZ.

(* = if you're actually a scientist and not a cock-munching "entrepreneur" you should know that nothing is ever novel. This could be considered a simplification of ACB)

(Note to self : first emailed 07/27/2014 as "Internal Compression Blog")

I'm going to write a mini series about this.

Here are some previous posts that are related :

cbloom rants 09-27-08 - 2
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 08-27-14 - LZ Match Length Redundancy
cbloom rants 07-14-14 - Suffix-Trie Coded LZ

So let's dive in.

Part 1 : motivation and background

I was working on compression from static dictionaries. The problem with traditional LZ77 on static dictionaries is that to get good compression you want a large dictionary, but then the offsets require more bits as well. In a normal dynamic scan dictionary, you have very strong offset modeling (offsets tend to be small, and they also tend to follow binary patterns). In particular, short common strings will occur at low offset and thus not require many bits. But in a static dictionary all references take the same large number of bits, even if the match is short and the substring matched is very common. (*)

(* = obviously you could sort the substrings by frequency to try to create an optimal static dictionary that has strongly biased offsets; but then you also have to be aware of how adjacent substrings form larger strings (eg. "ab" next to "cd" also adds "abcd"), and have to make that whole grammar sorted, and that seems like a monstrously hard problem)

The problem is that common substrings occur all over the static dictionary (eg. in an english text dictionary "the" occurs in thousands of places), but in LZ77 you have to code an offset to one specific occurrence of that substring. In effect you are wasting log2(N) bits, where N is the count of that substring.
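To put a number on that waste, here is a minimal Python sketch. The tiny dictionary string and the helper name `occurrence_offsets` are made up for illustration; they are not from any real codec.

```python
import math

def occurrence_offsets(dictionary: str, substring: str) -> list[int]:
    """Return every start offset where substring occurs (overlaps allowed)."""
    offsets = []
    start = dictionary.find(substring)
    while start != -1:
        offsets.append(start)
        start = dictionary.find(substring, start + 1)
    return offsets

# Hypothetical toy "static dictionary"; a real one would be far larger.
dictionary = "the cat and the dog saw the other cat"

offsets = occurrence_offsets(dictionary, "the")

# Any of these offsets decodes to the same string, so choosing one
# specific offset carries log2(N) bits that tell the decoder nothing.
wasted_bits = math.log2(len(offsets))
```

In this toy case "the" appears at 4 offsets (one of them inside "other"), so an LZ77-style offset spends 2 bits just picking between equivalent matches.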

In fact, the solution is very easy conceptually. Just take the static dictionary and do a suffix sort on it. Now all occurrences of "the" are consecutive in the suffix sort.

Say our dictionary is "banana" , then the strings we can match are :

banana
anana
nana
ana
na
a

To code "ana" we could send index 1 or 3; they both decode as "ana" at length 3.

After suffix sort our strings are :

a
ana
anana
banana
na
nana

And to send "ana" we send index 1 or 2.
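The "banana" example can be reproduced with a short Python sketch. The naive sort and the linear scan are chosen for clarity only; a real implementation would presumably use a proper suffix sorter and a binary search. The function names are hypothetical.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    # Naive O(n^2 log n) construction; fine for a toy, not for a big dictionary.
    return sorted(range(len(text)), key=lambda i: text[i:])

def suffix_range(text: str, sa: list[int], pattern: str):
    """Inclusive (lo, hi) range of sorted-suffix indices that start with pattern.

    Returns None if the pattern does not occur. Because the suffixes are
    sorted, all suffixes sharing a prefix are contiguous.
    """
    ranks = [rank for rank, pos in enumerate(sa) if text.startswith(pattern, pos)]
    if not ranks:
        return None
    return ranks[0], ranks[-1]

text = "banana"
sa = suffix_array(text)
sorted_suffixes = [text[pos:] for pos in sa]
ana_range = suffix_range(text, sa, "ana")
```

Here `sorted_suffixes` comes out in the order listed above, and `ana_range` is the pair (1, 2), matching "index 1 or 2".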

So now we need to send an integer, and we need it to be in a range, but we don't need to specify it exactly.

That is, we want to send an integer in the range {suffix_lo,suffix_hi} but we don't care what it is exactly in that range (because they all decode to the same string), and we don't want to waste bits unnecessarily specifying what it is in that region.

That's exactly what an arithmetic encoder does! We just need the low and high index of our substring in the suffix array, and we code that range with an arithmetic encoder.

It's exactly like a cumulative frequency table. The arithmetic encoder is guaranteed to send an integer that is somewhere in the range we need. We don't know which exact integer the decoder will see; it won't be determined until we do some more arithmetic encodings and the range is reduced further.

We're just treating the # of strings in the dictionary as the cumulative probability total. Then the low & high suffix indices of the range that contains our substring act as the cumulative probability bounds that we use to encode a "symbol" in the arithmetic coder. By coding that range we have specified a substring, and we save log2(substring_count) bits of unnecessary information.
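As a rough illustration of that accounting, here is a Python sketch that uses exact fractions in place of a real arithmetic coder. It only shows the interval and the idealized bit cost; it assumes an inclusive {lo, hi} range, and the function names are made up for this example.

```python
import math
from fractions import Fraction

def coding_interval(lo: int, hi: int, total: int):
    """Sub-interval [low, high) of [0, 1) for the inclusive suffix range lo..hi."""
    return Fraction(lo, total), Fraction(hi + 1, total)

def cost_bits(lo: int, hi: int, total: int) -> float:
    """Idealized cost of coding the range: -log2 of the interval width."""
    return math.log2(total / (hi - lo + 1))

def decode_index(point: Fraction, total: int) -> int:
    """Map any point in [0, 1) back to the suffix index whose slot contains it."""
    return math.floor(point * total)

total = 6            # number of suffixes of "banana"
lo, hi = 1, 2        # sorted-suffix range for "ana"

low, high = coding_interval(lo, hi, total)   # [1/6, 1/2)
range_cost = cost_bits(lo, hi, total)        # log2(6/2) = log2(3)
exact_cost = math.log2(total)                # cost of naming one exact index
saving = exact_cost - range_cost             # log2(substring_count) = 1 bit

# Whatever point inside [low, high) the coder ends up at, the decoder
# lands on an index in lo..hi, and all of them decode to "ana".
landed = [decode_index(p, total) for p in (low, Fraction(1, 4), Fraction(2, 5))]
```

A real arithmetic coder works with finite-precision integers and renormalization instead of fractions, but the range it narrows to is the same idea.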

Next post I'll describe the algorithm more precisely and then I'll talk about it.


2/04/2015
