5/09/2015

05-09-15 - Oodle LZNA

Oodle 1.45 has a new compressor called LZNA. (LZ-nibbled-ANS)

LZNA is a high compression LZ (usually a bit more than 7z/LZMA) with better decode speed. Around 2.5X faster to decode than LZMA.

Anyone who needs LZMA-level compression and higher decode speeds should consider LZNA. Currently LZNA requires SSE2 to be fast, so it only runs full speed on modern platforms with x86 chips.

LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding. 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases. The magic sauce that makes these possible is Ryg's realization about mixing cumulative probability distributions . That lets you do the bitwise-style shift update of probabilities (keeping a power of two total), but on larger alphabets.

LZNA usually beats LZMA compression on binary, slightly worse on text. LZNA is closer to LZHAM decompress speeds.


Some results :


lzt99

LZNA -z6 : 24,700,820 -> 9,154,248 =  2.965 bpb =  2.698 to 1
decode only      : 0.327 seconds, 43.75 b/kc, rate= 75.65 mb/s

LZMA : 24,700,820 -> 9,329,925 =  3.021 bpb =  2.647 to 1
decode           : 0.838 seconds, 58.67 clocks, rate= 29.47 M/s

LZHAM : 24,700,820 ->10,140,761 =  3.284 bpb =  2.435 to 1
decode           : 0.264 seconds, 18.44 clocks, rate= 93.74 M/s

(note on settings : LZHAM is run at BETTER because UBER is too slow. LZHAM BETTER is comparable to Oodle's -z6 ; UBER is similar to my -z7. LZMA is run at the best compression setting I can find; -m9 and lc=0,lp=2,pb=2 for binary data; with LZHAM I don't see a way to set the context bits. This is the new LZHAM 1.0, slightly different than my previous tests of LZHAM. All 64-bit, big dictionaries.).


baby_robot_shell

LZNA -z6 : 58,788,904 ->12,933,907 =  1.760 bpb =  4.545 to 1
decode only      : 0.677 seconds, 50.22 b/kc, rate= 86.84 mb/s

LZMA : 58,788,904 ->13,525,659 =  1.840 bpb =  4.346 to 1
decode           : 1.384 seconds, 40.70 clocks, rate= 42.49 M/s

LZHAM : 58,788,904 ->15,594,877 =  2.122 bpb =  3.769 to 1
decode           : 0.582 seconds, 17.12 clocks, rate= 100.97 M/s

I'm not showing encode speeds because they're all running different amounts of threading. It would be complicated to show fairly. LZHAM is the most aggressively threaded, and also the slowest without threading.


My "game testset" total sizes, from most compression to least :


Oodle LZNA -z8 :            57,176,229
Oodle LZNA -z5 :            58,318,469

LZMA -mx9 d26:lc0:lp2:pb3 : 58,884,562
LZMA -mx9 :                 59,987,629

LZHAM -mx9 :                62,621,098

Oodle LZHLW -z6 :           68,199,739

zip -9 :                    88,436,013

raw :                       167,495,105


Here's the new Pareto chart for Oodle. See previous post on these charts

This is load+decomp speedup relative to memcpy : (lzt99)

The left-side Y-intercept is the compression ratio. The right-side Y-intercept is the decompression speed. In between you can see the zones where each compressor is the best tradeoff.

With LZMA and LZHAM : (changed colors)

lzt99 is bad for LZHAM, perhaps because it's heterogeneous and LZHAM assumes pretty stable data. (LZHAM usually beats LZHLW for compression ratio). Here's a different example :

load+decomp speedup relative to memcpy : (baby_robot_shell)

3/25/2015

03-25-15 - My Chameleon

I did my own implementation of the Chameleon compression algorithm. (the original distribution is via the density project)

This is the core of Chameleon's encoder :

    cur = *fm32++; h = CHAMELEON_HASH(cur); flags <<= 1;
    if ( c->hash[h] == cur ) { flags ++; *to16++ = (uint16) h; }
    else { c->hash[h] = cur; *((uint32 *)to16) = cur; to16 += 2; }

This is the decoder :

    if ( (int16)flags < 0 ) { cur = c->hash[ *fm16++ ]; }
    else { cur = *((const uint32 *)fm16); fm16 += 2; c->hash[ CHAMELEON_HASH(cur) ] = cur; }
    flags <<= 1; *to32++ = cur;

I thought it deserved a super-simple STB-style header-only dashfuly-described implementation :

Chameleon.h

My Chameleon.h is not portable or safe or any of that jizzle. Maybe it will be someday. (Update : now builds on GCC & clang. Tested on PS4. Still not Endian-invariant.)


// Usage :

#define CHAMELEON_IMPL
#include "Chameleon.h"

Chameleon c;

Chameleon_Reset(&c);

size_t comp_buf_size = CHAMELEON_MAXIMUM_OUTPUT_SIZE(in_size);

void * comp_buf = malloc(comp_buf_size);

size_t comp_len = Chameleon_Encode(&c, comp_buf, in_buf, in_size );

Chameleon_Reset(&c);

Chameleon_Decode(&c, out_buf, in_size, comp_buf );

int cmp = memcmp(in_buf,out_buf,in_size);
assert( comp == 0 );


ADD : Chameleon2 SIMD prototype now posted : (NOTE : this is not good, do not use)

Chameleon2.h - experimental SIMD wide Chameleon
both Chameleons in a zip

The SIMD encoder is not fast. Even on SSE4 it only barely beats scalar Chameleon. So this is a dead end. Maybe some day when we get fast hardware scatter/gather it will be good (*).

(* = though use of hardware scatter here is always going to be treacherous, because hashes may be repeated, and the order in which collisions resolve must be consistent)

03-25-15 - Density - Chameleon

Casey pointed me at Density .

Density contains 3 algorithms, from super fast to slower : Chameleon, Cheetah, Lion.

They all attain speed primarily by working on U32 quanta of input, rather than bytes. They're sort of LZPish type things that work on U32's, which is a reasonable way to get speed in this modern world. (Cheetah and Lion are really similar to the old LZP1/LZP2 with bit flags for different predictors, or to some of the LZRW's that output forward hashes; the main difference is working on U32 quanta and no match lengths)

The compression ratio is very poor. The highest compression option (Lion) is around LZ4-fast territory, not as good as LZ4-hc. But, are they Pareto? Is it a good space-speed tradeoff?

Well, I can't build Density (I use MSVC) so I can't test their implementation for space-speed.

Compressed sizes :


lzt99 :
uncompressed       24,700,820

density :
c0 Chameleon       19,530,262
c1 Cheetah         17,482,048
c2 Lion            16,627,513

lz4 -1             16,193,125
lz4 -9             14,825,016

Oodle -1 (LZB)     16,944,829
Oodle -2 (LZB)     16,409,913

Oodle LZNIB        12,375,347

(lz4 -9 is not competitive for encode time, it's just to show the level of compression you could get at very fast decode speeds if you don't care about encode time ; LZNIB is an even more extreme case of the same thing - slow to encode, but decode time comparable to Chameleon).

To check speed I did my own implementation of Chameleon (which I believe to be faster than Density's, so it's a fair test). See the next post to get my implementation.

The results are :

comp_len = 19492042
Chameleon_Encode_Time : seconds:0.0274 ticks per: 1.919 mbps : 901.12
Chameleon_Decode_Time : seconds:0.0293 ticks per: 2.050 mbps : 843.31

round trip time = 0.05670
I get a somewhat smaller file size than Density's version for unknown reason.

Let's compare to Oodle's LZB (an LZ4ish) :


Oodle -1 :

24,700,820 ->16,944,829 =  5.488 bpb =  1.458 to 1
encode           : 0.061 seconds, 232.40 b/kc, rate= 401.85 mb/s
decode           : 0.013 seconds, 1071.15 b/kc, rate= 1852.17 mb/s

round trip time = 0.074

Oodle -2 :

24,700,820 ->16,409,913 =  5.315 bpb =  1.505 to 1 
encode           : 0.070 seconds, 203.89 b/kc, rate= 352.55 mb/s
decode           : 0.014 seconds, 1008.76 b/kc, rate= 1744.34 mb/s

round trip time = 0.084

lzt99 is a collection of typical game data files.

We can test on enwik8 (text/html) too :


Chameleon :

enwik8 :
Chameleon_Encode_Time : seconds:0.1077 ticks per: 1.862 mbps : 928.36
Chameleon_Decode_Time : seconds:0.0676 ticks per: 1.169 mbps : 1479.08
comp_len = 61524068

Oodle -1 :

enwik8 : 
100,000,000 ->57,267,299 =  4.581 bpb =  1.746 to 1 
encode           : 0.481 seconds, 120.17 b/kc, rate= 207.79 mb/s
decode           : 0.083 seconds, 697.58 b/kc, rate= 1206.19 mb/s

here Chameleon is much more compelling. It's competitive for size & decode speed, not just encode speed.

Commentary :

Any time you're storing files on disk, this is not the right algorithm. You want something more asymmetric (slow compress, fast decompress).

I'm not sure if Cheetah and Lion are Pareto for round trip time. I'd have to test speed on a wider set of sample data.

When do you actually want a compressor that's this fast and gets so little compression? I'm not sure.

3/15/2015

03-15-15 - LZ Literal Correlation Images

I made some pictures.

I'm showing literal correlation by making an image of the histogram. That is, given an 8-bit predictor, you tally of each event :


int histo[256][256]

histo[predicted][value] ++

then I scale the histo so the max is at 255 and make it into an image.

Most of the images that I show are in log scale, otherwise all the detail is too dark, dominated by a few peaks. I also sometimes remove the predicted=value line, so that the off axis detail is more visible.

Let's stop a moment and look t what we can see in these images.

This is a literal histo of "lzt99" , using predicted = lolit (last offset literal; the rep0len1 literal). This is in log scale, with the diagonal removed :

In my images y = prediction and x = current value. x=0, y=0 is in the upper left instead of the lower left where it should be because fucking bitmaps are annoying (everyone is fired, left handed coordinate systems my ass).

The order-0 probability is the vertical line sum for each x. So any vertical lines indicate just strong order-0 correlations.

Most files are a mix of different probability sources, which makes these images look a sum of different contibuting factors.

The most obvious factor here is the diagonal line at x=y. That's just a strong value=predicted generator.

The red blob is a cluster of events around x and y = 0. This indicates a probability event that's related to |x+y| being small. That is, the sum, or length, or something tends to be small.

The green shows a square of probabilities. A square indicates that for a certain range of y's, all x's are equally likely. In this case the range is 48-58. So if y is in 48-58, then any x in 48-58 is equally likely.

There are similar weaker squarish patterns all along the diagonal. Surprisingly these are *not* actually at the binary 8/16 points you might expect. They're actually in steps of 6 & 10.

The blue blobs are at x/y = 64/192. There's a funny very specific strong asymmetric pattern in these. When y = 191 , it predicts x=63,62,61,60 - but NOT 64,65,66. Then at y=192, predict x=64,65,66, but not 63.

In addition to the blue blobs, there are weak dots at all the 32 multiples. This indicates that when y= any multiple of 32, there's a generating event for x = any multiple of 32. (Note that in log scale, these dots look more important than they really are.). There are also some weak order-0 generators at x=32 and so on.

There's some just general light gray background - that's just uncompressible random data (as seen by this model anyway).


Here's a bunch of images : (click for hi res)

rawrawraw subsubsub xorxorxor
loglogNDlinND loglogNDlinND loglogNDlinND
Fez LO
Fez O1
lzt24 LO
lzt24 O1
lzt99 LO
lzt99 O1
enwik7 LO
enwik7 O1

details :

LO means y axis (predictor) is last-offset-literal , in an LZ match parse. Only the literals coded by the LZ are shown.

O1 means y axis is order1 (previous byte). I didn't generate the O1 from the LZ match parse, so it's showing *all* bytes in the file, not just the literals from the LZ parse.

"log" is just log-scale of the histo. An octave (halving of probability) is 16 pixel levels.

"logND" is log without the x=y diagonal. An octave is 32 pixel levels.

"linND" is linear, without the x=y diagonal.

"raw" means the x axis is just the value. "xor" means the x axis is value^predicted. "sub" means the x axis is (value-predicted+127).

Note that raw/xor/sub are just permutations of the values along a horizontal axis, they don't change the values.


Discussion :

The goal of a de-correlating transform is to create vertical lines. Vertical lines are order-0 probability peaks and can be coded without using the predictor as context at all.

If you use an order-0 coder, then any detail which is not in a vertical line is an opportunity for compression that you are passing up.

"Fez" is obvious pure delta data. "sub" is almost a perfect model for it.

"lzt24" has two (three?) primary probability sources. One is almost pure "sub" x is near y data.

The other sources, however, do not do very well under sub. They are pure order-0 peaks at x=64 and 192 (vertical lines in the "raw" image), and also those strange blobs of correlation at (x/y = 64 and 192). The problem is "sub" turns those vertical lines into diagonal lines, effectively smearing them all over the probability spectrum.

A compact but full model for the lzt24 literals would be like this :


is y (predictor) near 64 or 192 ?

if so -> strongly predict x = 64 or 192

else -> predict x = y or x = 64 or 192 (weaker)

lzt99, being more heterogenous, has various sources.

"xor" takes squares to squares. This works pretty well on text.

In general, the LO correlation is easier to model than O1.

The lzt99 O1 histo in particular has lots of funny stuff. There are bunch of non-diagonal lines, indicating things like x=y/4 patterns, which is odd.

old rants