cbloom rants: 07/2014

Idea : Suffix-Trie Coded LZ :

You are doing LZ77-style coding (eg. matches in the prior stream or literals), but send the matches in a different way.

You have a Suffix Trie built on the prior stream. To find the longest match for a normal LZ77 you would take the current string to code and look it up by walking it down the Trie. When you reach the point of deepest match, you see what string in the prior stream made that node in the Trie, and send the position of that string as an offset.

Essentially what the offset does is encode a position in the tree.

But there are many redundancies in the normal LZ77 scheme. For example if you only encode a match of length 3, then the offsets that point to "abcd.." and "abce.." are equivalent, and shouldn't be distinguished by the encoding. The fact that they both take up space in the numerical offset is a waste of bits. You only want to distinguish offsets that actually point at something different for the current match length.

The idea in a nutshell is that instead of sending an offset, you send the descent into the trie to find that string.

At each node, first send a single bit for does the next byte in the string match any of the children. (This is equivalent to a PPM escape). If not, then you're done matching. If you like, this is like sending the match length with unary : 1 bits as long as you're in a node that has a matching child, then a 0 bit when you run out of matches. (alternatively you could send the entire match length up front with a different scheme).

When one of the children matches, you must encode which one. This is just an encoding of the next character, selected from the previously seen characters in this context. If all offsets are equally likely (they aren't) then the correct thing is just Probability(child) = Trie_Leaf_Count(child) , because the number of leaves under a node is the number of times we've seen this substring in the past.

(More generally the probability of offsets is not uniform, so you should scale the probability of each child using some modeling of the offsets. Accumulate P(child) += P(offset) for each offset under a child. Ugly. This is unfortunately very important on binary data where the 4-8-struct offset patterns are very strong.)

Ignoring that aside - the big coding gain is that we are no longer uselessly distinguishing offsets that only differ at higher match length, AND instead of just wasting those bits, we instead use them to make those offsets code smaller.

For example : say we've matched "ab" so far. The previous stream contains "abcd","abce","abcf", and "abq". Pretend that somehow those are the only strings. Normal LZ77 needs 2 bits to select from them - but if our match len is only 3 that's a big waste. This way we would say the next char in the match can either be "c" or "q" and the probabilities are 3/4 and 1/4 respectively. So if the length-3 match is a "c" we send that selection in only log2(4/3) bits = 0.415

And the astute reader will already be thinking - this is just PPM! In fact it is exactly a kind of PPM, in which you start out at low order (min match length, typically 3 or so) and your order gets deeper as you match. When you escape you junk back to order 3 coding, and if that escapes it jumps back to order 0 (literal).

There are several major problems :

1. Decoding is slow because you have to maintain the Suffix Trie for both encode and decode. You lose the simple LZ77 decoder.

2. Modern LZ's benefit a lot from modeling the numerical value of the offset in binary files. That's ugly & hard to do in this framework. This method is a lot simpler on text-like data that doesn't have numerical offset patterns.

3. It's not Pareto. If you're doing all this work you may as well just do PPM.

In any case it's theoretically interesting as an ideal of how you would encode LZ offsets if you could.

(and yes I know there have been many similar ideas in the past; LZFG of course, and Langdon's great LZ-PPM equivalence proof)

I did some work for Oodle 1.41 on speeding up compressors. Mainly the Fast/VeryFast encoders got faster. I also took a pass at trying to make sure the various options were "Pareto", that is the best possible space/speed tradeoff. I had some options that were off the curve, like much slower than they needed to be, or just worse with no benefit, so it was just a mistake to use them (LZNib Normal was particularly bad).

Oodle 1.40 got the new LZA compressor. LZA is a very high compression arithmetic-coded LZ. The goal of LZA is as much compression as possible while retaining somewhat reasonable (or at least tolerable) decode speeds. My belief is that LZA should be used for internet distribution, but not for runtime loading.

The charts :

compression ratio : (raw/comp ratio; higher is better)

compressor	VeryFast	Fast	Normal	Optimal1	Optimal2
LZA	2.362	2.508	2.541	2.645	2.698
LZHLW	2.161	2.299	2.33	2.352	2.432
LZH	1.901	1.979	2.039	2.121	2.134
LZNIB	1.727	1.884	1.853	2.079	2.079
LZBLW	1.636	1.761	1.833	1.873	1.873
LZB16	1.481	1.571	1.654	1.674	1.674

lzmamax  : 2.665 to 1
lzmafast : 2.314 to 1
zlib9 : 1.883 to 1 
zlib5 : 1.871 to 1
lz4hc : 1.667 to 1
lz4fast : 1.464 to 1

encode speed : (mb/s)

compressor	VeryFast	Fast	Normal	Optimal1	Optimal2
LZA	23.05	12.7	6.27	1.54	1.07
LZHLW	59.67	19.16	7.21	4.67	1.96
LZH	76.08	17.08	11.7	0.83	0.46
LZNIB	182.14	43.87	10.76	0.51	0.51
LZBLW	246.83	49.67	1.62	1.61	1.61
LZB16	511.36	107.11	36.98	4.02	4.02

lzmamax  : 5.55
lzmafast : 11.08
zlib9 : 4.86
zlib5 : 25.23
lz4hc : 32.32
lz4fast : 606.37

decode speed : (mb/s)

compressor	VeryFast	Fast	Normal	Optimal1	Optimal2
LZA	34.93	37.15	37.76	37.48	37.81
LZHLW	363.94	385.85	384.83	391.28	388.4
LZH	357.62	392.35	397.72	387.28	383.38
LZNIB	923.66	987.11	903.21	1195.66	1194.75
LZBLW	2545.35	2495.37	2465.99	2514.48	2515.25
LZB16	2752.65	2598.69	2687.85	2768.34	2765.92

lzmamax  : 42.17
lzmafast : 40.22
zlib9 : 308.93
zlib5 : 302.53
lz4hc : 2363.75
lz4fast : 2288.58

While working on LZA I found some encoder speed wins that I ported back to LZHLW (mainly in Fast and VeryFast). A big one is to early out for last offsets; when I get a last offset match > N long, I just take it and don't even look for non-last-offset matches. This is done in the non-Optimal modes, and surprisingly hurts compression almost not all while helping speed a lot.

Four of the compressors are now in pretty good shape (LZA,LZHLW,LZNIB, and LZB16). There are a few minor issues to fix someday (someday = never unless the need arises) :

LZA decoder should be a little faster (currently lags LZMA a tiny bit). LZA Optimal1 would be better with a semi-greedy match finder like MMC (LZMA is much faster to encode than me at the same compression level; perhaps a different optimal parse scheme is needed too). LZA Optimal2 should seed with multi-parse. LZHLW Optimal could be faster. LZNIB Normal needs much better match selection heuristics, the ones I have are really just not right. LZNIB Optimal should be faster; needs a better way to do threshold-match-finding. LZB16 Optimal should be faster; needs a better 64k-sliding-window match finder.

The LZH and LZBLW compressors are a bit neglected and you can see they still have some of the anomalies in the space/speed tradeoff curve, like the Normal encode speed for LZBLW is so bad that you may as well just use Optimal. Put aside until there's a reason to fix them.

If another game developer tells me that "zlib is a great compromise and you probably can't beat it by much" I'm going to murder them. For the record :

zlib -9 :
4.86 MB/sec to encode
308.93 MB/sec to decode
1.883 to 1 compression

LZHLW Optimal1 :
4.67 MB/sec to encode
391.28 MB/sec to decode
2.352 to 1 compression

come on! The encoder is slow, the decoder is slow, and it compresses poorly.

LZMA in very high compression settings is a good tradeoff. In its low compression fast modes, it's very poor. zlib has the same flaw - they just don't have good encoders for fast compression modes.

LZ4 I have no issues with; in its designed zone it offers excellent tradeoffs.

In most cases the encoder implementations are :


VeryFast =
cache table match finder
single hash
greedy parse

Fast = 
cache table match finder
hash with ways
second hash
lazy parse
very simple heuristic decisions

Normal =
varies a lot for the different compressors
generally something like a hash-link match finder
or a cache table with more ways
more lazy eval
more careful "is match better" heuristics

Optimal =
exact match finder (SuffixTrie or similar)
cost-based match decision, not heuristic
backward exact parse of LZB16
all others have "last offset" so require an approximate forward parse

I'm mostly ripping out my Hash->Link match finders and replacing them with N-way cache tables. While the cache table is slightly worse for compression, it's a big speed win, which makes it better on the space-speed tradeoff spectrum.

I don't have a good solution for windowed optimal parse match finding (such as LZB16-Optimal). I'm currently using overlapped suffix arrays, but that's not awesome. Sliding window SuffixTrie is an engineering nightmare but would probably be good for that. MMC is a pretty good compromise in practice, though it's not exact and does have degenerate case breakdowns.

LZB16's encode speed is very sensitive to the hash table size.


-h12
24,700,820 ->16,944,823 =  5.488 bpb =  1.458 to 1
encode           : 0.045 seconds, 161.75 b/kc, rate= 550.51 mb/s
decode           : 0.009 seconds, 849.04 b/kc, rate= 2889.66 mb/s

-h13
24,700,820 ->16,682,108 =  5.403 bpb =  1.481 to 1
encode           : 0.049 seconds, 148.08 b/kc, rate= 503.97 mb/s
decode           : 0.009 seconds, 827.85 b/kc, rate= 2817.56 mb/s

-h14
24,700,820 ->16,491,675 =  5.341 bpb =  1.498 to 1
encode           : 0.055 seconds, 133.07 b/kc, rate= 452.89 mb/s
decode           : 0.009 seconds, 812.73 b/kc, rate= 2766.10 mb/s

-h15
24,700,820 ->16,409,957 =  5.315 bpb =  1.505 to 1
encode           : 0.064 seconds, 113.23 b/kc, rate= 385.37 mb/s
decode           : 0.009 seconds, 802.46 b/kc, rate= 2731.13 mb/s

If you accidentally set it too big you get a huge drop-off in speed. (The charts above show -h13 ; -h12 is more comparable to lz4fast (which was built with HASH_LOG=12)).

I stole an idea from LZ4 that helped the encoder speed a lot. (lz4fast is very good!) Instead of doing the basic loop like :


while(!eof)
{
  if ( match )
    output match
  else
    output literal
}

instead do :


while(!eof)
{
  while( ! match )
  {
    output literal
  }

  output match
}

This lets you make a tight loop just for outputing literals. It makes it clearer to you as a programmer what's happening in that loop and you can save work and simplify things. It winds up being a lot faster. (I've been doing the same thing in my decoders forever but hadn't done in the encoder).

My LZB16 is very slightly more complex to encode than LZ4, because I do some things that let me have a faster decoder. For example my normal matches are all no-overlap, and I hide the overlap matches in the excess-match-length branch.

cbloom rants

7/14/2014

07-14-14 - Suffix-Trie Coded LZ

7/03/2014

07-03-14 - Oodle 1.41 Comparison Charts

old rants