3/14/2016

XRGB Bitmap Test

This is obvious and I think it's been done before, but hey.

I was remembering that modern LZ's like LZMA (BitKnit, etc.) that (can) use pos&3 as a context for literals might do better on bitmaps in XRGB than in 24-bit RGB.

In XRGB, each color channel gets its own entropy coding (the pos&3 literal context picks out the channel). Offset bottom bits also work if the offsets are whole pixel steps (off&3 will be zero). In 24-bit RGB that structure is all mod-3, which we don't do.

(in general LZMA-class compressors fall apart a bit if the structure is not the typical 4/8/pow2)

In compressors it's generally terrible to stick extra bytes in and give the compressor more work to do. In this case we're injecting a 0 in every 4th byte, and the compressor has to figure out those are all redundant just to get back to its original size.
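For concreteness, the padding step can be sketched like this. It's a minimal sketch, not the actual radbitmap code; the function names are made up for illustration, and it assumes tightly packed pixels (no BMP row padding or header handling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand packed 3-byte pixels to 4 bytes each, writing a zero pad byte
   after every pixel. dst needs room for 4*num_pixels bytes.
   Returns the number of bytes written. */
size_t pad_rgb24_to_xrgb32(const uint8_t *src, size_t num_pixels, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < num_pixels; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0; /* the injected pad byte */
    }
    return 4 * num_pixels;
}

/* Inverse: drop every 4th byte again after decompression.
   dst needs room for 3*num_pixels bytes. */
size_t strip_xrgb32_to_rgb24(const uint8_t *src, size_t num_pixels, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < num_pixels; i++) {
        dst[3 * i + 0] = src[4 * i + 0];
        dst[3 * i + 1] = src[4 * i + 1];
        dst[3 * i + 2] = src[4 * i + 2];
    }
    return 3 * num_pixels;
}
```

The compressor then sees a stride-4 structure, at the cost of having to learn that every 4th byte is always zero.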

Anyway, this is an old idea, but I don't think I ever actually tried it. So :


PDI_1200.bmp

LZNA :

24-bit RGB : LZNA : 2,760,054 -> 1,376,781
32-bit XRGB: LZNA : 3,676,818 -> 1,311,502

24-bit  RGB with DPCM filter : LZNA : 2,760,054 -> 1,022,066
32-bit XRGB with DPCM filter : LZNA : 3,676,818 -> 1,015,379  (MML8 : 1,012,988)

webpll : 961,356
paq8o8 : 1,096,342

moses.bmp

24-bit RGB : LZNA : 6,580,854 -> 3,274,757
32-bit XRGB: LZNA : 8,769,618 -> 3,022,320

24-bit  RGB with DPCM filter : LZNA : 6,580,854 -> 2,433,246
32-bit XRGB with DPCM filter : LZNA : 8,769,618 -> 2,372,921

webpll : 2,204,444
gralic111d : 1,822,108

other compressors :

32-bit XRGB with DPCM filter : LZA  : 8,769,618 -> 2,365,661 (MML8 : 2,354,434)

24-bit  RGB no filter : BitKnit : 6,580,854 -> 3,462,455
32-bit XRGB no filter : BitKnit : 8,769,618 -> 3,070,141
32-bit XRGB with DPCM filter : BitKnit : 8,769,618 -> 2,601,463

32-bit XRGB: LZNA : 8,769,618 -> 3,022,320
32-bit XRGB: LZA  : 8,769,618 -> 3,009,417

24-bit  RGB: LZMA : 6,580,854 -> 3,488,546 (LZMA lc=0,lp=2,pb=2)
32-bit XRGB: LZMA : 8,769,618 -> 3,141,455 (LZMA lc=0,lp=2,pb=2)

repro:

bmp copy moses.bmp moses.tga 32
V:\devel\projects\oodle\radbitmap\radbitmaptest
radbitmaptest64 rrz -z0 r:\moses.tga moses.tga.rrz -f8 -l1

Key observations :

1. On "moses" unfiltered : padding to XRGB does help a solid amount (3,274,757 to 3,022,320 for LZNA) , despite the source being 4/3 bigger. I think that proves the concept. (BitKnit & LZMA even bigger difference)

2. On filtered data, padding to XRGB still helps, but much (much) less. Presumably this is because post-filter data is just a bunch of low values, so the 24-bit RGB data is not so multiple-of-three structured (it's a lot of 0's, +1's, and -1's, less coherent, less difference between the color channels, etc.)

3. On un-filtered data, "sub" literals might be helping BitKnit (it beats LZMA on 32-bit unfiltered, and hangs with LZNA). On filtered data, the sub-literals don't help (might even hurt) and BK falls behind. We like the way sub literals sometimes act as an automatic structure stride and delta filter, but they can't compete with a real image-specific DPCM.
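The post doesn't say which DPCM predictor was used for the "with DPCM filter" rows. As a rough illustration of the kind of filter meant, here is a sketch that assumes the simplest possible predictor (subtract the left neighbor pixel, per channel, mod 256). The function names are hypothetical and a real image filter would likely use a better predictor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* In-place filter: replace each byte with (byte - same channel of the
   pixel to its left), wrapping mod 256. The first pixel of every row
   is left unchanged. bpp is bytes per pixel (3 for RGB, 4 for XRGB). */
void dpcm_left_filter(uint8_t *pixels, size_t width, size_t height, size_t bpp)
{
    assert(pixels != NULL && bpp > 0);
    size_t row_bytes = width * bpp;
    for (size_t y = 0; y < height; y++) {
        uint8_t *row = pixels + y * row_bytes;
        /* walk right-to-left so the left neighbor is still unfiltered */
        for (size_t i = row_bytes; i-- > bpp; )
            row[i] = (uint8_t)(row[i] - row[i - bpp]);
    }
}

/* Inverse: walk left-to-right so the left neighbor is already restored. */
void dpcm_left_unfilter(uint8_t *pixels, size_t width, size_t height, size_t bpp)
{
    assert(pixels != NULL && bpp > 0);
    size_t row_bytes = width * bpp;
    for (size_t y = 0; y < height; y++) {
        uint8_t *row = pixels + y * row_bytes;
        for (size_t i = bpp; i < row_bytes; i++)
            row[i] = (uint8_t)(row[i] + row[i - bpp]);
    }
}
```

On smooth image regions the residuals come out as lots of 0's, +1's and -1's (255 in byte form), which is the "bunch of low values" described in observation 2.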


Now, XRGB padding is an ugly way to do this. You'd much rather stick with 24-bit RGB and have an LZ that works inherently on 3-byte items.

The first step is :


LZ that works on "items"

(eg. item = a pixel)

LZ matches (offsets and lens) are in whole items

(the more analogous to bottom-bits style would be to allow whole-items and "remainders";
that's /item and %item, and let the entropy coder handle it if remainder==0 always;
but probably best to just force remainders=0)

When you don't match (literal item)
each byte in the item gets its own entropy stats
(eg. color channels of pixels)

which maybe is useful on things other than just images.
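To make the "each byte in the item gets its own stats" part concrete, here is a small sketch of gathering literal statistics per byte position within an item. Everything here (names, the MAX_ITEM_SIZE limit, plain histograms standing in for adaptive models) is an assumption for illustration, not an existing codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ITEM_SIZE 8 /* arbitrary limit chosen for this sketch */

/* One order-0 histogram per byte position inside an item,
   e.g. one each for R, G and B when item_size is 3. */
typedef struct {
    uint32_t count[MAX_ITEM_SIZE][256];
} ItemLiteralStats;

void item_stats_init(ItemLiteralStats *s)
{
    memset(s, 0, sizeof(*s));
}

/* Count a run of literal items; byte b of every item goes to model b. */
void item_stats_add_literals(ItemLiteralStats *s, const uint8_t *items,
                             size_t num_items, size_t item_size)
{
    assert(item_size >= 1 && item_size <= MAX_ITEM_SIZE);
    for (size_t i = 0; i < num_items; i++)
        for (size_t b = 0; b < item_size; b++)
            s->count[b][items[i * item_size + b]]++;
}

/* Match offsets and lengths are coded in whole items (remainder forced
   to zero); convert to bytes only when doing the actual copy. */
size_t item_units_to_bytes(size_t units, size_t item_size)
{
    return units * item_size;
}
```

With item_size = 3 this gives the per-channel separation that XRGB padding gets from pos&3, without the pad byte.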

The other step is something like :


Offset is an x,y delta instead of linear
(this replaces offset bottom bits)

could be generically useful in any kind of row/column structured data

Filtering for values with x-y neighbors

(do you do the LZ on un-filtered data, and only filter the literals?)
(or do you filter everything and do the LZ on filter residuals?)

and a lot of this is just webp-ll
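The x,y offset idea can be sketched as a mapping from a linear offset (in items) to a row delta plus a signed column delta. This is just one possible convention, chosen here for illustration: dy is rounded so that dx stays small, which makes "directly above" come out as (dx=0, dy=1). The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t dx; /* columns to the left (negative = to the right) */
    int32_t dy; /* rows up */
} XYOffset;

/* Split a linear backwards offset (in items) into an x,y delta for a
   row width of 'width' items. dx ends up in [-width/2, width - 1 - width/2]. */
XYOffset linear_to_xy(int32_t off, int32_t width)
{
    assert(off > 0 && width > 0);
    XYOffset r;
    r.dy = (off + width / 2) / width; /* round to nearest row */
    r.dx = off - r.dy * width;
    return r;
}

/* Inverse mapping back to the linear offset the copy loop needs. */
int32_t xy_to_linear(XYOffset o, int32_t width)
{
    return o.dy * width + o.dx;
}
```

For width 100, linear offset 100 becomes (0,1), offset 99 becomes (-1,1) (up and one to the right), and offset 1 becomes (1,0) (left neighbor), so the common 2d neighbors all get small, easily modeled values.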

3/11/2016

Seven Test

I made a new test set called "sevens", taking the lead from enwik7, the size of each file is 10 MB (10^7).

The goal here is not to show the total or who does best overall (that relies on how you weight each type of file and whether you think this selection is representative of the occurrence ratios in your data), rather to show how each compressor does on different types of data, to highlight their different strengths.

Showing compression factor (eg. N:1 , higher is better) :

run details :


ZStd is 0.5.1 at level 21 (optimal)
LZMA is 7z -mx9 -m0=lzma:d24
Brotli is bro.exe by Sportman --quality 9 --window 24 (*)
Oodle is v2.13 at -z6 (Optimal2)

All competitors run via their provided exe

Some takeaways :

Binary structured data is really where the other compressors leave a lot of room to beat them. ("granny" and "records"). The difference in sizes on all the other files is pretty meh.

BitKnit does its special thang on granny - close to LZNA but 2X faster to decode (and ~ 6X faster than LZMA). Really super space-speed. BitKnit drops down to more like LZHLW levels on the non-record files (LZNA/LZMA has a small edge on them).

I was really surprised by ZStd vs Brotli. I actually went back and double checked my CSV to make sure I hadn't switched the columns by accident. In particular - Brotli does poorly on enwik7 (huh!?) but it does pretty well on "granny", and surprisingly ZStd does quite poorly on "granny" & "records". Not what I expected at all. Brotli is surprisingly poor on text/web and surprisingly good on binary record data.

LZHLW is still an excellent choice after all these years.

(* = Brotli quality 10 takes an order of magnitude longer than any of the others. I got fed up with waiting for it. Oodle also has "super" modes at -z8 that aren't used here. (**))

(for concreteness : Brotli 11 does pretty well on granny7 ; (6.148:1 vs 4.634:1 at q9) but it runs at 68 kb/s (!!) (and still not LZMA-level compression))

(** = I used to show results in benchmarks that required really slow encoders (for example the old LZNIB optimal "super parse" was hella slow); that can result in very small sizes and great decode speed, but it's a form of cheating. Encoders slower than 1 mb/s just won't be used, they're too slow, so it's reporting a result that real users won't actually see, and that's BS. I'm trying to be more legit about this now for my own stuff. Slow encoders are still interesting for research purposes because they show what should be possible, so you can try to get that result back in a faster way. (this in fact happened with LZNIB and is a Big Deal))

Seven Test Space-Speeds

Showing decompress time space-speed tradeoff on the different files of "seven test" :

records7

granny7

game7

exe7

enwik7

dds7

audio7

Note on the test :

This is running the non-Oodle compressors via my build of their lib (*). Brotli not included because it's too hard to build in MSVC (before 2010). "oohc" here is "Optimal2" level (originally posted with Optimal1 level, changed to Optimal2 for consistency with previous post).

The sorting of the labels on the right is by compressed size.

Report on total of all files :

-------------------------------------------------------
by ratio:
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s

by encode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s

by decode speed:
oohcLZNIB   :  1.84:1 ,    7.2 enc mb/s , 1271.4 dec mb/s
zstdmax     :  1.99:1 ,    4.6 enc mb/s ,  457.5 dec mb/s
oohcLZHLW   :  2.16:1 ,    3.4 enc mb/s ,  431.9 dec mb/s
oohcBitKnit :  2.27:1 ,    4.9 enc mb/s ,  258.0 dec mb/s
lzham       :  2.23:1 ,    1.9 enc mb/s ,  156.0 dec mb/s
oohcLZNA    :  2.37:1 ,    2.9 enc mb/s ,  125.5 dec mb/s
lzma        :  2.35:1 ,    2.7 enc mb/s ,   37.3 dec mb/s
-------------------------------------------------------

How to for my reference :


type test_slowies_seven.bat
@REM test each one individually :
spawnm -n external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven\* -cr:\seven_csvs\@f.csv
@REM test as a set :
external_compressors_test.exe -e2 -d10 -noohc -nlzma -nlzham -nzstdmax r:\testsets\seven

dele r:\compressorspeeds.*
@REM testproj compressorspeedchart
spawnm c:\src\testproj\x64\debug\TestProj.exe r:\seven_csvs\*.csv
ed r:\compressorspeeds.*

(* = I use code or libs to test speeds, never exes; I always measure speed memory->memory, single threaded, with cold caches)

2/29/2016

LZSSE Results

Quick report of my results on LZSSE. (updated 03/06/2016)

(LZSSE Latest commit c22a696 ; fetched 03/06/2016 ; test machine Core i7-3770 3.4 GHz ; built MSVC 2012 x64 ; LZSSE2 and 8 optimal parse level 16)

Basically LZSSE is in fact great on text, faster than LZ4 and much better compression.

On binary, LZSSE2 is quite bad, but LZSSE8 is roughly on par with LZ4. It looks like LZ4 is maybe slightly better on binary than LZSSE8, but it's close.

In general, LZ4 does well on files that tend to have long LRL's and long ML's. Files with lots of short (or zero) LRL's and short ML's are bad for LZ4 (eg. text) and not bad for LZSSE.

(LZB16 is Oodle's LZ4 variant; 64k window like LZSSE; LZNIB and LZBLW have large windows)


Some results :

enwik8 LZSSE2 : 100,000,000 ->38,068,528 : 2866.17 mb/s
enwik8 LZSSE8 : 100,000,000 ->38,721,328 : 2906.29 mb/s
enwik8 LZB16  : 100,000,000 ->43,054,201 : 2115.25 mb/s

(LZSSE kills on text)

lzt99  LZSSE2 : 24,700,820 ->15,793,708  : 1751.36 mb/s
lzt99  LZSSE8 : 24,700,820 ->15,190,395  : 2971.34 mb/s
lzt99  LZB16  : 24,700,820 ->14,754,643  : 3104.96 mb/s

(LZSSE2 really slows down on heterogeneous binary file lzt99)
(LZSSE8 does okay, but slightly worse than LZ4/LZB16 in size & speed)

mozilla LZSSE2: 51,220,480 ->22,474,508 : 2424.21 mb/s
mozilla LZSSE8: 51,220,480 ->22,148,366 : 3008.33 mb/s
mozilla LZB16 : 51,220,480 ->22,337,815 : 2433.78 mb/s

(all about the same size on silesia mozilla)
(LZSSE8 definitely fastest)

lzt24  LZB16  : 3,471,552 -> 2,379,133 : 4435.98 mb/s
lzt24  LZSSE8 : 3,471,552 -> 2,444,527 : 4006.24 mb/s
lzt24  LZSSE2 : 3,471,552 -> 2,742,546 : 1605.62 mb/s
lzt24  LZNIB  : 3,471,552 -> 1,673,034 : 1540.25 mb/s

(lzt24 (a granny file) really terrible for LZSSE2; it's as slow as LZNIB)
(LZSSE8 fixes it though, almost catches LZB16, but not quite)

------------------

Some more binary files.  LZSSE2 is not good on any of these, so omitted.

win81  LZB16  : 104,857,600 ->54,459,677 : 2463.37 mb/s
win81  LZSSE8 : 104,857,600 ->54,911,633 : 3182.21 mb/s

all_dds LZB16 : 79,993,099 ->47,683,003 : 2577.24 mb/s
all_dds LZSSE8: 79,993,099 ->47,807,041 : 2607.63 mb/s

AOW3_Skin_Giants.clb
LZB16  :  7,105,158 -> 3,498,306 : 3350.06 mb/s
LZSSE8 :  7,105,158 -> 3,612,433 : 3548.39 mb/s

baby_robot_shell.gr2
LZB16  : 58,788,904 ->32,862,033 : 2968.36 mb/s
LZSSE8 : 58,788,904 ->33,201,406 : 2642.94 mb/s

LZSSE8 vs LZB16 is pretty close.

LZSSE8 is maybe more consistently fast; its decode speed has less variation than LZ4. Slowest LZSSE8 was all_dds at 2607 mb/s ; LZ4 went down to 2115 mb/s on enwik8. Even excluding text, it was down to 2433 mb/s on mozilla. LZB16/LZ4 had a slightly higher max speed (on lzt24).

Conclusion :

On binary-like data, LZ4 and LZSSE8 are pretty close. On text-like data, LZSSE8 is definitely better. So for general data, it looks like LZSSE8 is a definite win.

LZSSE Notes

There are a few things that I think are interesting in LZSSE. And really very little of it is about the SIMD-ness.

1. SIMD processing of control words.

All LZ-Bytewises do a little bit of shifts and masks to pull out fields and flags from the control word. Stuff like lrl = (control>>4) and numbytesout = lrl+ml;

This work is pretty trivial, and it's fast already in scalar. But if you can do it N at a time, why not.

A particular advantage here is that SSE instruction sets are somewhat better at branchless code than scalar, it's a bit easier to make masks from conditions and such-like, so that can be a win. Also helps if you're front-end-bound, since decoding one instruction to do an 8-wide shift is less work than 8 instructions. (it's almost impossible for a data compressor to be back-end bound on simple integer math ops, there are just so many execution units; that's rare, it's much more possible to hit instruction decode limits)

2. Using SSE in scalar code to splat out match or LRL.

LZSSE parses the control words SIMD (wide) but the actual literal or match copy is scalar, in the sense that only one is done at a time. It still uses SSE to fetch those bytes, but in a scalar way. Most LZ's can do this (many may do it already without being aware of it; eg. if you use memcpy(,16) you might be doing an SSE splat).

3. Limited LRL and ML in control word with no excess. Outer looping on control words only, no looping on LRL/ML.

To output long LRL's, you have to output a series of control words, each with short LRL. To output long ML's, you have to output a series of control words.

This I think is the biggest difference in LZSSE vs. something like LZ4. You can make an LZ4 variant that works like this, and in fact it's an interesting thing to do, and is sometimes fast. In an LZ4 that does strictly alternating LRL-ML, to do this you need to be able to send ML==0 so that long literal runs can be continued as a sequence of control words.

Traditional LZ4 decoder :


{
lrl = control>>4;
ml = (control&0xF)+4;
off = get 2 bytes;  comp += 2;

// get excess if flagged with 0xF in control :
if ( lrl == 0xF ) lrl += *comp++; // and maybe more
if ( ml == 19 ) ml += *comp++; // and maybe more

copy(out,comp,lrl); // <- may loop on lrl
out += lrl; comp += lrl;

copy(out,out-off,ml); // <- may loop on ml
out += ml;
}

non-looping LZ4 decoder : (LZSSE style)

{
lrl = control>>4;
ml = control&0xF; // <- no +4 , 0 possible
off = get 2 bytes;  comp += 2;  // <- * see below

// no excess

copy(out,comp,16); // <- unconditional 16 byte copy, no loop
out += lrl; comp += lrl;

copy(out,out-off,16); // <- unconditional 16 byte copy, no loop
out += ml;
}

(* = the big complication in LZSSE comes from trying to avoid sending the offset again when you're continuing a match; something like if previous control word ml == 0xF that means a continuation so don't get offset)

(ignoring the issue of overlapping matches for now)

This non-looping decoder is much less branchy, no branches for excess lens, no branches for looping copies. It's much faster than LZ4 *if* the data doesn't have long LRL's or ML's in it.

4. Flagged match/LRL instead of strictly alternating LRL-ML. This is probably a win on data with lots of short matches, where matches often follow matches with no LRL in between, like text.

If you have to branch for that flag, it's a pretty huge speed hit (see, eg. LZNIB). So it's only viable in a fast LZ-Bytewise if you can do it branchless like LZSSE.
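The non-looping decoder from point 3 can be written out as real C under some loud assumptions. This is a toy byte-wise format following the pseudocode above (control byte, 2-byte little-endian offset, then literals), not the actual LZSSE or LZ4 bitstream. It requires every offset to be >= 16 (so the 16-byte copies never overlap, side-stepping the overlap issue), an offset is sent even when ml == 0, and the buffers need 16 bytes of slack: readable bytes after the compressed stream, and readable/writable bytes before and after the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode num_controls control words into out_base.
   Returns the number of bytes produced.
   Assumptions of this sketch (not a real format):
   - comp has at least 16 readable bytes past the end of the stream
   - out_base has 16 valid bytes before it and 16 bytes of slack after
     the decoded data
   - every offset is >= 16 and does not reach further back than
     16 bytes before out_base */
size_t decode_nonlooping(const uint8_t *comp, size_t num_controls,
                         uint8_t *out_base)
{
    uint8_t *out = out_base;
    for (size_t i = 0; i < num_controls; i++) {
        unsigned control = *comp++;
        size_t lrl = control >> 4;
        size_t ml = control & 0xF; /* no +4, zero is allowed */
        size_t off = (size_t)comp[0] | ((size_t)comp[1] << 8);
        comp += 2;

        /* unconditional 16-byte literal copy, then step by lrl only */
        memcpy(out, comp, 16);
        out += lrl;
        comp += lrl;

        /* unconditional 16-byte match copy, then step by ml only */
        assert(off >= 16);
        memcpy(out, out - off, 16);
        out += ml;
    }
    return (size_t)(out - out_base);
}
```

The only loop left is the outer loop on control words; the bytes written past lrl or ml are garbage that the next control word overwrites.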

Bit Input Notes

1. The big win of U64 branchless bit input is having >= 56 bits (or 57) after refill. The basic refill operation itself is not faster than branchy 32-at-a-time refills, but that only has >= 32 (or 33) bits after refill. The advantage comes if you can unconditionally consume bits knowing that count. eg. if you have a 12-bit limited Huffman, you can consume 4 symbols without needing to refill.

2. The best case for bit input is when the length that you consume is not very variable. eg. in the Huffman case, 1-12 bits, has a reasonably low limit. The worst case is when it has a high max and is quite random. Then you can't avoid refill checks, and they're quite unpredictable (if you do the branchy case)

3. If your refills have a large maximum, but the average is low, branchy can be faster than branchless. Because the maximum is high (eg. maybe a max of 32 bits consumed), you can only do one decode op before checking refill. Branchless will then always refill. Branchy can skip the refill if the average is low - particularly if it's predictably low.

4. If using branchy refills, try to make it predictable. An interesting idea is to use multiple bit buffers so that each consumption spot gets its own buffer, and then can create a pattern. A very specific case is consuming a fixed number of bits. something like :


bitbuffer

if ( random )
{
  consume 4 bits from bitbuffer
  if bitbuffer out -> refill
}
else
{
  consume 6 bits from bitbuffer
  if bitbuffer out -> refill
}

these branches (for bitbuffer refill) will be very random because of the two different sites that consume different amounts. However, this :

bitbuffer1, bitbuffer2

if ( random )
{
  consume 4 bits from bitbuffer1
  if bitbuffer1 out -> refill
}
else
{
  consume 6 bits from bitbuffer2
  if bitbuffer2 out -> refill
}

these branches for refill are now perfectly predictable in a pattern (they are taken every Nth time exactly).

5. Bit buffer work is slow, but it's "mathy". On modern processors that are typically math-starved, it can be cheap *if* you have enough ILP to fully use all the execution units. The problem is a single bit buffer on its own is super serial work, so you need multiple bit buffers running simultaneously, or enough other work.

For example, it can actually be *faster* than byte-aligned input (using something like "EncodeMod") if the byte-input does a branch, and that branch is unpredictable (in the bad 25-75% randomly taken range).
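For reference, the U64 branchless refill from point 1 can be sketched as below. This is one common LSB-first formulation, not necessarily what Oodle does; the BitReader struct and function names are made up for this sketch, and it assumes the input has at least 8 readable bytes of padding past the end.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t bits;      /* unread bits; the next bit to read is bit 0 */
    unsigned count;     /* number of valid bits in 'bits' */
    const uint8_t *ptr; /* next input byte not yet counted in 'count' */
} BitReader;

/* Portable little-endian 64-bit load. */
static uint64_t read64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

void bitreader_init(BitReader *br, const uint8_t *data)
{
    br->bits = 0;
    br->count = 0;
    br->ptr = data;
}

/* Branchless refill: afterwards count is in [56,63].
   Always reads 8 bytes at ptr, so the input needs padding. */
void bitreader_refill(BitReader *br)
{
    br->bits |= read64_le(br->ptr) << br->count;
    br->ptr += (63 - br->count) >> 3; /* whole bytes that fit */
    br->count |= 56;
}

/* Consume n bits (1..32) with no refill check; the caller guarantees
   enough bits remain, e.g. at most 4 x 12 bits after one refill. */
uint32_t bitreader_get(BitReader *br, unsigned n)
{
    assert(n >= 1 && n <= 32 && n <= br->count);
    uint32_t v = (uint32_t)(br->bits & (((uint64_t)1 << n) - 1));
    br->bits >>= n;
    br->count -= n;
    return v;
}
```

With a 12-bit length limit, a decoder built on this can do one refill and then four unconditional gets, which is where the win over branchy 32-bit refills comes from.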

2/17/2016

LZSSE

An LZ Codec Designed for SSE Decompression

LZSSE code

Some good stuff.

Basically this is a nibble control word LZ (like LZNIB). The nibble has a threshold value T, < T is an LRL (literal run len), >= T is a match length. LZSSET are various threshold variants. As Conor noted, ideally T would be variable, optimized per file (or even better - per quantum) to adapt to different data better.

LZSSE has a 64k window (like LZ4/LZB16) but unlike them supports MML (minimum match length) of 3. MML 3 typically helps compression a little, but in scalar decoders it really hurts speed.

I think the main interesting idea (other than implementation details) is that by limiting the LRL and ML, with no excess/overflow support (ML overflow is handled with continue-match nibbles), it means that you can do a non-looping output of 8/16 bytes. You get long matches or LRL's by reading more control nibbles.

That is, a normal LZ actually has a nested looping structure :


loop on controls from packed stream
{
 control specifies lrl/ml

 loop on lrl/ml
 {
   output bytes
 }
}

LZSSE only has *one* outer loop on controls.

There are some implementation problems at the moment. The LZSSE2 optimal parse encoder is just broken. It's unusably slow and must have some bad N^2 degeneracy. This can be fixed, it's not a problem with the format.

Another problem is that LZSSE2 expands incompressible data too much. Real world data (particularly in games) often has incompressible data mixed with compressible. The ideal fix would be to have the generalized LZSSET and choose T per quantum. A simpler fix would be to do something like cut files into 16k or 64k quanta, and to select the best of LZSSE2/4/8 per-quantum and also support uncompressed quanta to prevent expansion.
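The simpler per-quantum fix could look something like this on the encoder side. It's only a sketch of the selection logic: it assumes you have already compressed the quantum with each candidate (LZSSE2/4/8 or whatever) and just have the resulting sizes; the function name and the -1 convention for "store raw" are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the codec with the smallest output for one quantum.
   comp_lens[i] is the compressed size produced by candidate i.
   Returns the winning index, or -1 to store the quantum uncompressed
   when no candidate is strictly smaller than the raw data. */
int choose_quantum_codec(size_t raw_len, const size_t *comp_lens,
                         int num_codecs)
{
    assert(comp_lens != NULL && num_codecs >= 0);
    int best = -1;
    size_t best_len = raw_len;
    for (int i = 0; i < num_codecs; i++) {
        if (comp_lens[i] < best_len) {
            best_len = comp_lens[i];
            best = i;
        }
    }
    return best;
}
```

The choice would then be stored in a small per-quantum header, so incompressible quanta cost only that header instead of expanding.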

I will take this moment to complain that the test sets everyone is using are really shit. Not Conor's fault, but enwiks and Silesia are grossly not at all representative of data that we see in the real world. Silesia is mostly text and weird highly-compressible data; the file I like best in there for my own testing is "mozilla" (though BTW mozilla also contains a bunch of internal zlib streams; it benefits enormously from precomp). We need a better test corpus!!!

2/11/2016

String Match Stress Test Files

A gift. My string match stress test set :

string_match_stress_tests.7z (60,832 bytes)

Consists of :

 paper1_twice
 stress_all_as
 stress_many_matches
 stress_search_limit
 stress_sliding_follow
 stress_suffix_forward

An optimal parse matcher (matching at every position in each file against all previous bytes within that file) should get these average match lengths : (min match length of 4, and no matches searched for in the last 8 bytes of each file)


paper1_twice : 13294.229727
stress_all_as : 21119.499148
stress_many_matches : 32.757760
stress_search_limit : 823.341331
stress_sliding_follow : 199.576550
stress_suffix_forward : 5199.164464

total ml : 2896554306
total bytes : 483870
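If you want a reference to check your matcher against, a brute-force longest-match search is easy to write (and is exactly the N^2 thing these files punish, so only run it on small inputs). This sketch is my reading of the rules above and makes assumptions that are not spelled out there: matches may overlap the current position, a match may extend into the last bytes of the file, and positions with no match of at least the minimum length contribute zero. I have not verified that it reproduces the exact totals listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of the longest match length at every position, searching all
   previous positions in the same buffer. No match is searched for in
   the last tail_skip bytes; matches shorter than mml count as 0. */
uint64_t total_match_len_bruteforce(const uint8_t *buf, size_t size,
                                    size_t mml, size_t tail_skip)
{
    assert(buf != NULL);
    uint64_t total = 0;
    if (size <= tail_skip)
        return 0;
    for (size_t pos = 1; pos < size - tail_skip; pos++) {
        size_t best = 0;
        for (size_t cand = 0; cand < pos; cand++) {
            size_t len = 0;
            /* cand < pos, so cand + len stays in bounds too */
            while (pos + len < size && buf[cand + len] == buf[pos + len])
                len++;
            if (len > best)
                best = len;
        }
        if (best >= mml)
            total += best;
    }
    return total;
}
```

Call it with mml = 4 and tail_skip = 8 to mirror the settings described above; an average would then be this total divided by some position count, which is another detail I'm guessing at.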

Previous post on the same test set : 09-27-11 - String Match Stress Test

And these were used in the String Match Test post series , though there I used "twobooks" instead of "paper1_twice".

These stress tests are designed to make imperfect string matchers show their flaws. Correct implementations of Suffix Array or Suffix Tree searchers should find this total match length without ever going into bad N^2 slowdowns (their speed should be roughly constant). Other matchers like hash-link, LzFind (hash-btree) and MMC will either find lower total match length (due to an "amortize" step limit) or will fall into bad N^2 (or worse!) slowdowns.

1/29/2016

Oodle Network Usage Notes

Two things I thought to write down.

1. Oodle Network speed is very cache sensitive.

Oodle Network uses a shared trained model. This is typically 4 - 8 MB. As it compresses or decompresses, it needs to access random bits of that memory.

If you compress/decompress a packet when that model is cold (not in cache), every access will be a cache miss and performance can be quite poor.

In a synthetic test, coding packets over and over, the model is as hot as possible (in caches). So performance can seem better in synthetic test loops than in the real world.

In real use, it's best to batch up all encoding/decoding operations as much as possible. Rather than do :


decode one packet
apply packet to world
do some other stuff

decode one packet
apply packet to world
do some other stuff

...

try to group all the Oodle Network encoding & decoding together :

gather up all my packets to send

receive all packets from network stack

encode all my outbound packets
decode all my inbound packets

now act on inbound packets

this puts all the usage of the shared model together as close as possible to try to maximize the amount that the model is found in cache.

2. Oodle Network should not be used on already compressed data. Oodle Network should not be used on large packets.

Most games send pre-compressed data of various forms. Some send media files such as JPEGs that are already compressed. Some send big blobs that have been packed with zlib. Some send audio data that's already been compressed.

This data should be excluded from the Oodle Network path and sent without going through the compressor. It won't get any compression and will just take CPU time. (you could send it as a packet with complen == rawlen, which is a flag for "raw data" in Oodle Network).

More importantly, these packets should NOT be included in the training set for building the model. They are essentially random bytes and will just crud up the model. It's a bit like if you're trying to memorize the digits of Pi and someone keeps yelling random numbers in your ear. (Well, actually it's not like that at all, but those kinds of totally bullshit analogies seem very popular, so there you are.)

On large packets that are not precompressed, Oodle Network will work, but it's just not the best choice. It's almost always better to use an Oodle LZ data compressor (BitKnit, LZNIB, whatever, depending on your space-speed tradeoff desired).

The vast majority of games have a kind of bipolar packet distribution :


A. normal frame update packets < 1024 bytes

B. occasional very large packets > 4096 bytes

it will work better to only use Oodle Network on the type A packets (smaller, standard updates) and to use Oodle LZ on the type B packets (rarer, large data transfers).

For example some games send the entire state of the level in the first few packets, and then afterward send only deltas from that state. In that style, the initial big level dump should be sent through Oodle LZ, and then only the smaller deltas go through Oodle Network.

Not only will Oodle LZ do better on the big packets, but by excluding them from the training set for Oodle Network, the smaller packets will be compressed better because the data will all have similar structure.
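The routing described above boils down to a tiny decision function on the sender. This sketch deliberately contains no Oodle API calls; the enum, the function name, and the idea of a single size threshold are assumptions for illustration (the text only says type A is < 1024 and type B is > 4096, so where exactly to cut in between is up to you).

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    PATH_RAW,           /* already compressed: send as-is */
    PATH_OODLE_NETWORK, /* small standard update packets */
    PATH_OODLE_LZ       /* rare large transfers */
} PacketPath;

/* Decide how to send one packet. is_precompressed is whatever flag your
   game has for JPEG/zlib/audio payloads; large_threshold is the size at
   which a packet is treated as a "type B" large packet. */
PacketPath choose_packet_path(size_t len, int is_precompressed,
                              size_t large_threshold)
{
    assert(large_threshold > 0);
    if (is_precompressed)
        return PATH_RAW;
    if (len >= large_threshold)
        return PATH_OODLE_LZ;
    return PATH_OODLE_NETWORK;
}
```

The same predicate should be used when collecting packets for training, so only PATH_OODLE_NETWORK packets end up in the training set.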

1/16/2016

Oodle 2.1.2

Oodle 2.1.2 is out. Oodle - now with more BitKnit!


Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\testsets\big\lzt99
got arg : num_repeats=5
lz test loading: r:\testsets\big\lzt99
uncompressed size : 24700820
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |1.51:517:2988|1.57:236:2971|1.62:109:2964|1.65: 37:3003|
LZBLW  |1.64:249:2732|1.74: 80:2682|1.77: 24:2679|1.85:1.6:2708|
LZNIB  |1.80:264:1627|1.92: 70:1557|1.94: 23:1504|2.04: 12:1401|
LZHLW  |2.16: 67: 424|2.30: 20: 447|2.33:7.2: 445|2.35:5.4: 445|
BitKnit|2.43: 28: 243|2.47: 20: 245|2.50: 13: 249|2.54:6.4: 249|
LZNA   |2.36: 24: 115|2.54: 18: 119|2.58: 13: 120|2.69:4.9: 120|
---------------------------------------------------------------
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    1.510    |    1.569    |    1.615    |    1.654    |
LZBLW  |    1.636    |    1.739    |    1.775    |    1.850    |
LZNIB  |    1.802    |    1.921    |    1.941    |    2.044    |
LZHLW  |    2.161    |    2.299    |    2.330    |    2.355    |
BitKnit|    2.431    |    2.471    |    2.499    |    2.536    |
LZNA   |    2.363    |    2.542    |    2.584    |    2.686    |
---------------------------------------------------------------
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |    517.317  |    236.094  |    108.555  |     36.578  |
LZBLW  |    248.537  |     80.299  |     23.663  |      1.610  |
LZNIB  |    263.950  |     69.930  |     22.617  |     11.735  |
LZHLW  |     67.154  |     20.019  |      7.200  |      5.425  |
BitKnit|     28.203  |     20.223  |     12.672  |      6.371  |
LZNA   |     24.192  |     18.423  |     12.883  |      4.907  |
---------------------------------------------------------------
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |
LZB16  |   2988.429  |   2971.339  |   2963.616  |   3003.187  |
LZBLW  |   2731.951  |   2681.796  |   2678.558  |   2707.534  |
LZNIB  |   1626.806  |   1557.309  |   1504.097  |   1400.654  |
LZHLW  |    423.936  |    446.990  |    444.832  |    445.040  |
BitKnit|    242.916  |    245.409  |    248.812  |    248.972  |
LZNA   |    114.791  |    119.369  |    119.994  |    120.362  |
---------------------------------------------------------------


Another test :


Oodle 2.1.2 example_lz_chart [file] [repeats]
got arg : input=r:\game_testset_m0.7z
got arg : num_repeats=5
lz test loading: r:\game_testset_m0.7z
uncompressed size : 79290970
---------------------------------------------------------------
chart cell contains : raw/comp ratio : encode mb/s : decode mb/s
LZB16: LZ-bytewise: super fast to encode & decode, least compression
LZNIB: LZ-nibbled : still fast, but more compression; between LZB & LZH
LZHLW: LZ-Huffman : like zip/zlib, but much more compression & faster
LZNA : LZ-nib-ANS : very high compression with faster decodes than LZMA
All compressors can be run at different encoder effort levels
---------------------------------------------------------------
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |1.4:1039:4304|1.41:438:4176|1.42:184:4202|1.44: 52:4293|1.44:4.5:4407|
LZBLW  |1.51:380:3855|1.55:124:3778|1.56: 26:3774|1.62:1.0:3862|1.62:1.0:3862|
LZNIB  |1.56:346:2406|1.59: 84:2398|1.62: 24:2054|1.67: 15:2048|1.67: 10:2053|
LZHLW  |1.67: 85: 647|1.74: 25: 679|1.75:6.5: 635|1.77:3.3: 613|1.79:1.5: 618|
BitKnit|1.83: 24: 395|1.90: 18: 409|1.90: 12: 408|1.91:7.1: 402|1.91:6.5: 401|
LZNA   |1.78: 22: 171|1.84: 18: 178|1.88: 12: 185|1.93:5.6: 167|1.93:1.5: 167|
---------------------------------------------------------------
compression ratio:
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |    1.390    |    1.408    |    1.424    |    1.436    |    1.442    |
LZBLW  |    1.509    |    1.548    |    1.558    |    1.615    |    1.615    |
LZNIB  |    1.557    |    1.593    |    1.622    |    1.669    |    1.668    |
LZHLW  |    1.669    |    1.745    |    1.754    |    1.767    |    1.790    |
BitKnit|    1.825    |    1.897    |    1.905    |    1.913    |    1.915    |
LZNA   |    1.781    |    1.838    |    1.878    |    1.927    |    1.932    |
---------------------------------------------------------------
encode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   1038.910  |    437.928  |    184.457  |     52.008  |      4.465  |
LZBLW  |    380.030  |    123.621  |     26.028  |      0.973  |      0.973  |
LZNIB  |    345.905  |     83.577  |     24.299  |     14.544  |     10.444  |
LZHLW  |     84.519  |     25.218  |      6.542  |      3.256  |      1.547  |
BitKnit|     24.116  |     17.944  |     12.476  |      7.052  |      6.464  |
LZNA   |     21.859  |     18.034  |     11.767  |      5.602  |      1.465  |
---------------------------------------------------------------
decode speed (mb/s):
       |   VeryFast  |   Fast      |   Normal    |   Optimal1  |   Optimal2  |
LZB16  |   4304.144  |   4175.854  |   4202.491  |   4292.925  |   4406.853  |
LZBLW  |   3855.255  |   3777.826  |   3774.093  |   3861.922  |   3861.582  |
LZNIB  |   2406.379  |   2397.753  |   2054.429  |   2048.329  |   2053.340  |
LZHLW  |    646.796  |    679.173  |    635.035  |    613.051  |    617.994  |
BitKnit|    394.599  |    408.539  |    408.044  |    402.239  |    401.352  |
LZNA   |    171.111  |    177.565  |    184.677  |    167.439  |    166.904  |
---------------------------------------------------------------


vs LZMA :
ratio: 1.901
enc  : 2.70 mb/s
dec  : 30.27 mb/s

On this file, BitKnit is 13X faster to decode than LZMA, and gets more compression. (or at "Normal" level, the ratio is similar and BitKnit is 4.6X faster to encode).
