9/30/2012

09-30-12 - Long Range Matcher Notes

Some notes on the LRM mentioned last time.

Any time you do a string search based on hashes you will have a degeneracy problem. We saw this with the standard "Hash1b" (Hash->links) string matcher. In short, the problem is if you have many occurances of the same hash, then exact string matching becomes very slow. The standard solution is to truncate your search at some number of maximum steps (aka "amortized hashing"), but that has potentially unbounded cost (though typically low).

We have this problem with LRM and I brushed it under the rug last time. When you are doing "seperate scan" (eg. not incrementally adding to the hash table), then there's no need to have a truncated search, instead you can just have a truncated insert. That is, if you're limitting your search to 10, then don't add 1000 of the same hash and only ever search 10, just add 10. In fact on my test files it's not terrible to limit the LRM search to just 1 (!).

But I'm not happy with that as a general solution because there is a potential for huge inefficiency. The really bad degenerate case looks something like this :


LRM hash length is 32 or whatever
Lots of strings in the file of length 32 have the same hash value
You only add 100 or so to the hash
One of the ones you didn't add would have provided a really good match

Typically, missing that match is not a disaster, because at the next byte you will roll to a new hash and look that up, and so on, so if you miss a 128k long match, you will usually find a (128k - 256) long match 256 bytes later. But it is possible to miss it for a long time if you are unlucky, and I like my inefficiency to be bounded. The more common bad case is that you get matches just a bit shorter than possible, and that happens many times, and it adds up to compression lost. eg. say hash length is 16 and there are 24 byte matches possible, but due to the reduced search you only find 16 or 18-length matches.

But most importantly, I don't like to throw away compression for no good reason, I want to know that the speed of doing it this approximate way is worth it vs. a more exact matcher.

There are a few obvious solutions with LRM :

1. Push matches backwards :

If you find a match at pos P of length L, that match might also have worked at pos (P-1) for length (L+1), but a match wasn't found there, either because of the approximate search or because hashes are only put in the dictionary every N bytes.

In practice you want to be scanning matches forward (so that you can roll the hash forward, and also so you can carry forward "last match follows" in generate cases), so to implement this you probably want to have a circular window of the next 256 positions or whatever with matches in them.

This is almost free (in terms of speed and memory use) so should really be in any LRM.

2. Multiple Hashes :

The simplest form of this is to do two hashes; like one of length 16 and one of length 64 (or whatever). The shorter hash is the main one you use to find most matches, the longer hash is there to make sure you can find the big matches.

That is, this is trying to reduce the chance that you miss out on a very long match due to truncating the search on the short hash. More generally, to really be scale-invariant, you should have a bunch of levels; length 16,64,256,1024,etc. Unfortunately implementing this the naive way (by simply having several independent hashes and tables) hurts speed by a linear factor.

3. Multiple Non-Redundant Hashes :

The previous scheme has some obvious inefficiencies; why are we doing completely independent hash lookups when in fact you can't match a 64-long hash if you don't match a 16-long hash.

So you can imagine that we would first do a 16-long hash , in a lookup where the hashes have been unique'd (each hash value only occurs once), then for each 16-long hash there is another table of the 64-long hashes that occured for that 16-long hash. So then we look up in the next. If one is found there, we look in the 256-long etc.

An alternative way to imagine this is as a sorted array. For each entry you store a hash of 16,64,256,etc. You compare first on the 16-long hash, then for entries where that is equal you compare on the 64-long hash, etc. So to lookup you first use the 16-long hash and do a sorted array lookup; then in each range of equal hashes you do another sorted array lookup on the 64-long hash, etc.

These methods are okay, but the storage requirements are too high in the naive rep. You can in fact store them compactly but it all gets a bit complicated.

4. Hash-suffix sort :

Of course it should occur to us that what we're doing in #3 is really just a form of coarse suffix sort! Why not just actually use a suffix sort?

One way is like this : for each 16-byte sequence of the file, replace it with a 4-byte U32 hash value, so the array shrinks by 4X. Now suffix-sort this array of hash values, but use a U32 alphabet instead of a U8 alphabet; that is, suffix strings only start on every 4th byte.

To lookup you can use normal sorted-array lookup strategies (binary search, interpolation search, jump-ins + binary or jump-ins + interpolation, etc). So you start with a 16-byte hash to get into the suffix sort, then if you match you use the next 16-byte hash to step further, etc.

9/28/2012

09-28-12 - LZNib on enwik8 with Long Range Matcher

I wrote a "Long Range Matcher" (ala SREP/etc. (see previous blog post )).

I'm using a rolling hash, and a sorted-array lookup. Building the lookup is insanely fast. One problem with it in an Optimal Parsing scenario is that when you get a very long LRM match, you will see it over and over (hence N^2 kind of badness), so I use a heuristic that if I get a match over some threshold (256 or 1024) I don't look for any more in that interval.

For a rolling hash I'm currently using just multiplicative hash with modulus of 2^32 and no additive constant. I have no idea if this is good, I've not had much luck finding good reference material on rolling hashes. (and yes of course I've read the Wikipedia and such easily Googleable stuff; I've not tested buzhash yet; I don't like table lookups for hashing, I needs all my L1 for myself)

LRM builds its search structure by hashing L bytes (LRM hash length is a parameter (default is 12)) every S bytes (S step is a parameter (default is 10)). Memory use is 8 bytes per LRM entry, so a step of 8 would mean the LRM uses memory equal to the size of the file. For large files you have to increase the step. Hash length does not affect memory use.

So anyhoo, I tested on enwik8. This is a test of different dictionary overlaps and LRM settings.

Compression works like this :


I split the file into 8 M chunks. (12 of them)

Chunks are compressed independently (in parallel).

Each chunk preconditions its dictionary with "overlap" bytes preceding the chunk.

Each chunk can also use the LRM to match the entire file preceding its overlap range.

So for each chunk the view of the file is like :

[... LRM matches here ...][ overlap precondition ][ my chunk ][ not my business ]

(note : enwik8 is 100mB (millions of bytes) not 100 MB, which means that 12*8 MB chunks = 96 MB actually covers the 100 mB).

(BTW of course compression is maximized if you don't do any chunking, or set the overlap to infinity; we want chunking for parallelism, and we want overlap to be finite to limit memory use; enwik8 is actually small enough to do infinite overlap, but we want a solution that has bounded memory use for arbitrarily large files)

With no further ado, some data. Varying the amount of chunk dictionary overlap :

overlap MB no LRM LRM
0 32709771 31842119
1 32355536 31797627
2 32203046 31692184
3 32105834 31628054
4 32020438 31568893
5 31947086 31518298
6 31870320 31463842
7 31797504 31409024
8 31731210 31361250
9 31673081 31397825
10 31619846 31355133
11 31571057 31316477
12 31527702 31281434
13 31492445 31253955
14 31462962 31231454
15 31431215 31206202
16 31408009 31189477
17 31391335 31215474
18 31374592 31202448
19 31361233 31192874

0 overlap means the chunks are totally independent. My LRM has a minimum match length of 8 and also must match a hash equal to the rolling hash length. The "with LRM" in the above test used a step of 10 and hash length of 12.

LRM helps less as the overlap gets bigger, because you find the most important matches in the LRM region. Also enwik8 being text doesn't really have that huge repeated block that lots of binary data has. (on many of my binary test files, turning on LRM gives a huge jump because some chunk is completely repeated from the beginning of the file to the end). On text it's more incremental.

We can also look at how compression varies with the LRM step and hash length :

lrm_step lrm_length compressed
0 0 32020443
32 32 31846039
16 32 31801822
16 16 31669798
12 16 31629439
8 16 31566822
12 12 31599906
10 12 31568893
8 12 31529746
6 12 31478409
8 8 31511345
6 8 31457094

(this test was run with 4 MB overlap). On text you really want the shortest hash you can get. That's not true for binary though, 12 or 16 is usually best. Longer than that hurts compression a little but may help speed.

For reference, some other compressors on enwik8 (from Matt's LTCB page )


enwik8
lzham alpha 3 x64 -m4 -d29                    24,954,329
cabarc 1.00.0601  -m lzx:21                   28,465,607
bzip2 1.0.2       -9                          29,008,736
crush 0.01        cx                          32,577,338
gzip 1.3.5        -9                          36,445,248


ADDENDUM : newer LRM, with bubble-back and other improvement :


LZNib enwik8
lots of win from bubble back (LRM Scanner Windowed) :

    32/32 no bubble : 31,522,926
    32/32 wi bubble : 31,445,380
    12/12 wi bubble : 30,983,058
    10/12 no bubble : 31,268,849
    10/12 wi bubble : 30,958,529
     6/10 wi bubble : 30,886,133

9/26/2012

09-26-12 - Optimizing Carefully and Compression Bugs

This Theora update is a good reminder to be careful when optimizing. Apparently they had a massively broken DCT which was just throwing away quality for no good reason.

This may seem really bone-headed but it's very very common. You may recall in some of my past posts I looked into some basic video stuff in great detail and found that almost every major video encoder/decoder is broken on the simple stuff :

the DCT (or whatever transform)
quantization/dequantization
color space conversion
upsample/downsample
translating float math into ints

These things are all relatively trivial, but it's just SO common to get them wrong, and you throw away a bottom bit (or half a bottom bit) when you do so. Any time you are writing a compressor, you need to write reference implementations of all these basic things that you know are right - and check them! And then a crucial thing is : keep the reference implementation around! Ideally you would be able to switch it on from the command line, or failing that with a build toggle, so at anytime you can go back and enable the slow mode and make sure everything works as expected.

(of course a frequent cause of trouble is that people go and grab an optimized integer implementation that they found somewhere, and it either is bad or they use it incorrectly (eg. maybe it assumes data that's biased in a certain way, or centered at 0, or scaled up by *8, or etc))

A lot of this basic stuff in video is very easy to do regression tests on (eg. take some random 8x8 data, dct, quantize, dequantize, idct, measure the error, it should be very low) so there's no excuse to get it wrong. But even very experienced programmers do get it wrong, because they get lazy. They might even start with a reference implementation they know is right, and then they start optimizing and translating stuff into ints or SIMD, and they don't maintain the slow path, and somewhere along the line a mistake slips in and they don't even know it.


I've been thinking about a more difficult problem, which is : how do you deal with bugs in compression algorithms?

I don't mean bugs like "it crashes" or "the decompressor doesn't reproduce the original data" - those are the easy kind of bugs and you just go fix them. I mean bugs that cause the compressor to not work the way you intended, and thus not compress as much as it should.

The very hard thing about these bugs is that you can have them and not even know it; I'm sure I have a hundred of them right now. Frequently they are tiny things like you have a less-than where you should have a less-or-equal.

To avoid them really requires a level of care that most programmers never use. You have to be super vigilant. Any time something surprises you or is a bit fishy, you can't just go "hmm that's weird, oh well, move on to the next task". You have to stop and think and look into it. You have to gather data obsessively.

Any time you implement some new idea and it doesn't give you the compression win you expected, you can't just say "oh well guess that didn't work", you have to treat it like a crash bug, and go set breakpoints and watch your variables and make sure it really is doing what you think; and if it is, then you have to gather stats about how often that code is hit and what the values are, and see where your expectations didn't match reality.


I've really been enjoying working on compression again. It's one of the most fun areas of programming that exists. What makes it great :

1. Clear objective measure of success. You can measure size and speed (or whatever other criteria) and see if you are doing well or not. (lossy compression is harder for this).

2. Decompressors are one extreme of fun "hacker" programming; they have to be very lean; great decompressors are like haikus, they're little pearls that you feel could not get any simpler without ruining them.

3. Compressors, on the other hand, can be big and slow, and you get to pull out all the big guns of algorithms for optimization and searching and so on.

9/24/2012

09-24-12 - LZ String Matcher Decision Tree

Revisiting this to clarify a bit on the question of "I want to do X , which string matcher should I use?"

Starting from the clearest cases to the least clear :

There are two types of parsing, I will call them Optimal and Greedy. What I really mean is Optimal = find matches at every position in the file, and Greedy = find matches only at some positions, and then take big steps over the match. (Lazy parses and such are in the Greedy category).

There are three types of windowing : 1. not windowed (eg. infinite window or we don't really care about windowing), 2. fixed window; eg. you send offsets in 16 bits so you have a 64k window for matches, 3. multi-window or "cascade" ; eg. you send offsets up to 128 in 1 byte , up to 16384 in 2 bytes, etc. and you want to find the best match in each window.

There are two scan cases : 1. Incremental scan; that is, we're matching against the same buffer we are parsing; matches cannot come from in front of our current parse position, 2. Separate scan - the match source buffer is independent from the current parse point, eg. this is the case for precondition dictionaries, or just part of the file that's well before the current parse point.

1. Optimal Parsing , No window, Incremental scan : Suffix Trie is the clear winner here. Suffix Trie is only a super clear winner when you are parsing and matching at the same time, since they are exactly the same work you double your time taken if they are separate. That is, you must be scanning forward, adding strings and getting matches. Suffix Trie can be extended to Cascaded Windowing in an approximate way, by walking to parents in the tree, but doing it exactly breaks the O(N) of the Suffix Trie.

2. Optimal Parsing, No window or single window, Separate Scan : Suffix Array is pretty great here. Separate scan means you can just take the whole buffer you want to match against and suffix sort it.

(BTW this is a general point that I don't think most people get - any time you are not doing incremental update, a sort is a superb search structure. For example it's very rarely best to use a hash table when you are doing separate scan, you should just have a sorted list, possibly with jump-ins)

3. Optimal Parsing, Windowed or Cascaded, Incremental or Separate Scan : there's not an awesome solution for this. One method I use is cascades of suffix arrays. I wrote in the past about how to use Suffix Array Searcher with Incremental Scan (you have to exclude positions ahead of you), and also how to extend it to Windowing. But those method get slow if the percentage of matches allowed gets low; eg. if you have a 1 GB buffer and a 64k window, you get a slowdown proportional to (1GB/64k). To address this I use chunks of suffix array; eg. for a 64k window you might cut the file into 256k chunks and sort each one, then you only have to search in a chunk that's reasonably close to the size of your window. For cascaded windows, you might need multiple levels of chunk size. This is all okay and it has good O(N) performance (eg. no degeneracy disasters), but it's rather complex and not awesome.

Another option for this case is just to use something like Hash->Links and accept its drawbacks. A more complex option is to use a hybrid; eg. for cascaded windows you might use Hash->Links for the small windows, and then Suffix Array for medium size windows, and Suffix Trie for infinite window. For very small windows (4k or less) hash->links (or even just a "cache table") is very good, so it can be a nice supplement to a matcher like suffix trie is not great at cascaded windows.

Addendum : "Suffix Array Sets" is definitely the best solution for this.

4. Greedy Parsing : Here SuffixArray and SuffixTrie both are not awesome, because they are essentially doing all the work of an optimal-style parse (eg. string matching at every position), which is a big waste of time if you only need the greedy matches.

Hash-Link is comparable to the best matcher that I know of for greedy parsing. Yann's MMC is generally a bit faster (or finds better matches at the same speed) but is basically in the same class. The pseudo-binary-tree thing used in LZMA (and I believe it's the same thing that was used in the original PkZip that was patented) is not awesome; sometimes it's slightly faster than hash-link, sometimes slightly slower. All Window relatively easily.

Hash-Link extends very naturally to cascaded windows, because you are always visiting links in order from lowest offset to highest, you can easily find exact best matches in each window of the cascade as you go.

With Greedy Parsing you don't have to worry about degeneracies quite so much, because when you find a very long match you are just going to take it and step over it. (that is, with optimal parse if you find a 32k long match, then at the next step you will find a 32k-1 match, etc. which is a bad N^2 (or N^3) thing if you aren't super careful (eg. use a SuffixTrie with correct "follows" implementation)). However, with lazy parsing you can still hit a mild form of this degeneracy, but you can avoid that pretty easily by just not doing the lazy eval if your first match length is long enough (over 1024 or whatever).

(BTW I'm pretty sure it's possible to do a Suffix Trie with lazy/incremental update for Greedy Parsing; the result should be similar to MMC but provide exact best matches without any degenerate bad cases; it's rather complex and I figure that if I want perfect matching I generally also want Optimal Parsing, so the space of perfect matching + greedy parsing is not that important)

Previous posts on string matching :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 09-30-11 - Don't use memset to zero
cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - More Reliable Timing on Windows
cbloom rants 10-01-11 - String Match Results Part 6
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 10-03-11 - Amortized Hashing
cbloom rants 10-18-11 - StringMatchTest - Hash 1b
cbloom rants 11-02-11 - StringMatchTest Release

09-24-12 - I prefer the old C Preprocessor

It was so much better back when CPP was just text sub. (for context, I'm working on making structs that can encapsulate arbitrary function calls).

I want to be able to use CPP to make lists of N args (N a compile-time variable) like :


(int stuff0, int stuff1, int stuff2)

In MSVC (any old compiler where CPP is just text sub) you can do this quite easily.


#define LIST1(prefix,between)   RR_STRING_JOIN(prefix,1)
#define LIST2(prefix,between)   LIST1(prefix,between) between RR_STRING_JOIN(prefix,2)
#define LIST3(prefix,between)   LIST2(prefix,between) between RR_STRING_JOIN(prefix,3)
#define LIST4(prefix,between)   LIST3(prefix,between) between RR_STRING_JOIN(prefix,4)
#define LIST5(prefix,between)   LIST4(prefix,between) between RR_STRING_JOIN(prefix,5)
#define LIST6(prefix,between)   LIST5(prefix,between) between RR_STRING_JOIN(prefix,6)
#define LIST7(prefix,between)   LIST6(prefix,between) between RR_STRING_JOIN(prefix,7)
#define LIST8(prefix,between)   LIST7(prefix,between) between RR_STRING_JOIN(prefix,8)
#define LIST9(prefix,between)   LIST8(prefix,between) between RR_STRING_JOIN(prefix,9)

#define LISTN(N,prefix,between) RR_STRING_JOIN(LIST,N)(prefix,between)

#define LISTCOMMAS(N,prefix)        LISTN(N,prefix,COMMA)

#define COMMA   ,

and then you can use it like :

#define TestFuncN(N)  void RR_STRING_JOIN(TestFunc,N) ( LISTCOMMAS(N,int arg) );

Similarly for other variants of LIST, and then you can quite neatly construct structs/templates that take N args of N types.

But this doesn't work in compilers (GCC) with the newer standard that says preprocessor tokens have to be C identifiers (or whatever pedantic thing it says). IMO it's another one of those GCC/C99 (C89?) things that breaks old code and takes power away from the programmer and has very little to no benefit. (I guess I just really don't like the strict C standard).

Urg. GCC is like the nit-picky guy on the team who wants to endlessly debate some pointless crap that doesn't actually help anyone.

Is there any way to do this type of thing correctly? I'm so sick of manually making variants for N args every time I want this, the freaking preprocessor is the perfect tool to make all the N-arg variants for me if only they would let me use it.

(see for example : cblib/autoprintf.inl or cblib/callback.h)

To be concrete, a common usage case is something like cblib/callback where you want a struct to encapsulate a member function call. You have to do something like :


    explicit CallbackM3(T_ClassPtr c, T_fun_type f, Arg1 a1 , Arg2 a2, Arg3 a3, double when) : Callback(when)
    {
        ASSERT( c != NULL && f != NULL );
        __p = c;
        _mem_fun = f;
        _arg1 = a1;
        _arg2 = a2;
        _arg3 = a3;
    }

and write variants for every number of args. With the LIST macros I could very easily make variants for N args automatically.

... later ...

Oh well, I just bit the bullet and made a bunch of these :


#if NUM >= 1
    prefix1 
#if NUM >= 2
    prefix2 
#if NUM >= 3
    prefix3 
#if NUM >= 4
    prefix4 
#if NUM >= 5
    prefix5 
#if NUM >= 6
    prefix6 
#if NUM >= 7
    prefix7 
#if NUM >= 8
    prefix8 
#if NUM >= 9
    prefix9 
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif

which is much much worse than the LIST solution but works in GCC. And of course that could be much nicer if you could put preprocessor directives inside macros, cuz then I could just have a macro that does that, instead I have to copy-paste the whole thing all over.

Bleh. How lame is it that C++98 doesn't have a way to encapsulate a function call in a struct. (yes yes I know we finally have lambdas now, well not now but maybe in 10 years or so).

Another thing that would have made this all easier would be if C has a "null" type. Then I could just make templates that always take 10 arguments, and for the fewer-argument variants I could make the later ones be null types. eg. for cases like :


template <typename t1,typename t2,typename t3,typename t4>
struct stuff
{
    t1 m1;
    t2 m2;
    t3 m3;
    t4 m4;
};

I could just have stuff to get a two-member variant. Meh, I guess there are lots of annoying omissions in C++98. Why can't I have "typeof(var)" ? Why can't I say "if ( defined(type) )" ? etc. etc. You would also need the ability to not even try to compile scopes that are compile-time unreachable (eg. stuff in if(0) ).

Hmm.. so one option for this is I could just run a different CPP before compiling; it used to be that CPP was completely independent from the compiler, and you could do whatever you want there, but I'm not sure that's the case any more. (eg. assuming I want things like debugging in the original pre-CPP code to work). Or I could just eat the pain once again and work around GCC yet again.

9/23/2012

09-23-12 - Patches and Deltas

A while ago Jon posted a lament about how bad Steam's patches are. Making small patches seems like something nice for Oodle to do, so I had a look into what the state of the art is for patches/deltas.

To be clear, the idea is computer 1 (receiver) has a previous version of a file (or set of files, but for now we'll just assume it's one file; if not, make a tar), computer 2 (transmitter) has a newer version and wishes to send a minimum patch, which computer 1 can apply to create the newer version.

First of all, you need to generate patches from uncompressed data (or the patch generator needs to be able to do decompression). Once the patch is generated, it should generally be compressed. If you're trying to patch a zip to a zip, there will be lots of different bits even if the contained files are the same, so decompress first before patching.

Second, there are really two classes of this problem, and they're quite different. One class is where the transmitter cannot see the old version that the receiver has; this is the case where there is no authoritative source of data; eg. in rsync. Another class is where the transmitter has all previous versions and can use them to create the diff; this is the case eg. for game developers creating patches to update installed games.

Let's look at each class.

Class 1 : Transmitter can't see previous version

This the case for rsync and Windows RDC (Remote Differential Compression).

The basic way all these methods work is by sending only hashes of chunks of data to each other, and hoping that when the hashes for chunks of the files match, then the bytes actually were the same. These methods are fallible - it is possible to get corrupt data if you have an unlucky hash match.

In more detail about how they work :

The file is divided into chunks. It's important that these chunks are chosen based on the *contents* of the file, not just every 256 bytes or whatever, some fixed size chunking. If you did fixed size chunking, then just adding 1 byte at the head of a file would make every chunk different. You want to use some kind of natural signature to choose the chunk boundaries. (this reminds me rather of SURF type stuff in image feature detection).

I've seen two approaches to finding chunking boundaries :

1. Pick a desired average chunk size of L. Start from the previous chunk end, and look ahead 2*L and compute a hash at each position. The next chunk boundary is set to the local min (or max) of the hash value in that range.

2. Pick a desired average chunk size of 2^B. Make a mask M with B random bits set. Compute a hash at each position in the file; any position with (hash & M) == (M) is a boundary; this should occur once in 2^B bytes, giving you an average chunk len of 2^B.

Both methods can fall apart in degenerate areas, so you could either enforce a maximum chunk size, or you could specifically detect degenerate areas (areas with the same hash at many positions) and handle them as a special case.

So once you have these chunk boundaries, you compute a strong hash for each chunk (MD5 or SHA or whatever; actually any many-bit hash is fine, the cryptographic hashes are widely over-used for this, they are generally slower to compute than an equally strong non-cryptographic hash). Then the transmitter and receiver send these hashes between each other; if the hashes match they assume the bytes match and don't send the bytes. If the hashes differ, they send the bytes for that chunk.

When sending the bytes for a chunk that needs updating, you can use all the chunks that were the same as context for a compressor.

If the file is large, you may wish to use multi-scale chunking. That is, first do a coarse level of chunking at large scale to find big regions that need transmision, then for each of those regions do finer scale chunking. One way to do this is to just use a constant size chunk (1024 bytes or whatever), and to apply the same algorithm to your chunk-signature set; eg. recurse (RDC does this).

Class 2 : Transmitter can see previous version

This case is simple and allows for smaller patches (as well as guaranteed, non-probablistic patches). (you probably want to do some simple hash check to ensure that the previous versions do in fact match).

The simplest way to do this is just to take an LZ77 compressor, take your previous version of the file and put it in your LZ dictionary, then compress the new version of the file. This will do byte-exact string matching and find any parts of the file that are duplicated in the previous version.

(aside : I went and looked for "preload dictionary" options in a bunch of mainstream compressors and couldn't find it in any of them. This is something that every major compressor should have, so if you are the author of FreeArc or 7zip or anything like that, go add a preload dictionary option)

(aside : you could use other compressors than LZ77 of course; for example you could use PPM (or CM) and use the previous version to precondition the model. For large preconditions, the PPM would have to be very high order, probably unbounded order. An unbounded order PPM would be just as good (actually, better) at differential compression than LZ77. The reason why we like LZ77 for this application is that the memory use is very low, and we want to use very large preconditions. In particular, the memory use (in excess of the window itself) for LZ77 compression can be very low without losing the ability to deduplicate large blocks; it's very easy to control, and when you hit memory limits you simply increase the block size that you can deduplicate; eg. up to 1 GB you can find all dupes of 64 bytes or more; from 1-2 GB you can find dupes of 128 bytes or more; etc. this kind of scaling is very hard to do with other compression algorithms)

But for large distributions, you will quickly run into the limits of how many byte-exact matches an LZ77 matcher can handle. Even a 32 MB preload is going to stress most matchers, so you need some kind of special very-large-window matcher to find the large repeated blocks.

Now at this point the approaches for very-large-window matching look an awful lot like what was done in class 1 for differential transmission, but it is really a different problem and not to be confused.

The standard approach is to pick a large minimum match length L (L = 32 or more) for the long-range matches, and to only put them in the dictionary once every N bytes (N 16 or more, scaling based on available memory and the size of the files). So basically every N bytes you compute a hash for the next L bytes and add that to the dictionary. Now when scanning over the new version to look for matches, you compute an L-byte hash at every position (this is fast if you use a rolling hash) and look that up.

One interesting variant of this is out-of-core matching; that is if the previous version is bigger than memory. What you can do is find the longest match using only the hashes, and then confirm it by pulling the previous file bytes back into memory only when you think you have the longest once. (SREP does some things like this; oddly SREP also doesn't include a "preload dictionary" option, or it could be used for making patches)

In the end you're just generating LZ matches though. Note that even though you only make dictionary entries every N bytes for L byte chunks, you can generate matches of arbitrary length by doing byte-by-byte matching off the end of the chunk, and you can even adjust to other offsets by sliding matches to their neighboring bytes. But you might not want to do that; instead for very large offets and match lengths you could just not send some bottom bits; eg. only send a max of 24 bits of offset, but you allow infinite window matches, so over 24 bits of offset you don't send some of the bottom bits.

Special Cases

So far we've only looked at pretty straightforward repeated sequence finding (deduplication). In some cases, tiny changes to original files can make lots of derived bytes differ.

A common case is executables; a tiny change to source code can cause the compiled exe to differ all over. Ideally you would back up to the source data and transmit that diff and regenerate the derived data, but that's not practical.

Some of the diff programs have special case handling for exes that backs out one of the major problems : jump address changes. Basically the problem is if something like the address of memcpy changes (or the location of a vtable, or address of some global variable, etc..), then you'll have diffs all over your file and generating a small patch will be hard.

I speculate that what these diffs do basically is first do the local-jump to absolute-jump transform, and then they create a mapping of the absolute addresses to find the same routine in the new & old files. They send the changed address, like "hey replace all occurances of address 0x104AC with 0x10FEE) so that chunks that only differ by some addresses moving can be counted as unchanged.

(bsdiff does some fancy stuff like this for executables) (ADDENDUM : not really; what bsdiff does is much more general and not as good on exes; see comments)

If you're trying to send small patches of something like lightmaps, you might have areas where you just increased the brightness of a light; that might change very pixel and create a huge diff. It might be possible to express deltas of image (and sound) data as linear transforms (add & scale). An alternative would be finding the original piece of content and just using it as a mocomp source (dictionary precondition) for an image compressor. But at some point the complexity of the diff becomes not worth it.

Links

In no particular order :

-ck hacking lrzip-0.613
ngramhashing - Rolling Hash C++ Library - Google Project Hosting
A new 900GB compression target
Patchdelta compression
Remote diff utility
SREP huge-dictionary LZ77 preprocessor
Long Range ZIP � Freecode
About Remote Differential Compression (Windows)
There is a Better Way. Instead of using fixed sized blocks, use variable sized b... Hacker News
bsdiff windows
ZIDRAV Free Communications software downloads at SourceForge.net
Binary diff (bsdiff)
Data deduplication (exdupe)
xdelta
Tridge (rsync)

BTW some of you have horrible page titles. "binary diff" is not a useful page title, nor is "data deduplication". It's like all the freaking music out there named "track1.mp3".

I have not done an extensive review of the existing solutions. I think bsdiff is very strong, but is limited to relatively small files, since it builds an entire suffix sort. I'm not sure what the best solution is for large file diffs; perhaps xdelta (?). The algorithms in rsync look good but I don't see any variant that makes "class 2" (transmitter has previous version) diffs. It seems neither lrzip nor srep have a "precondition dictionary" option (wtf?). So there you go.

9/22/2012

09-22-12 - Input Streams and Focus Changes

Clearly apps should have an input/render thread which takes input and immediately responds to simple actions even when the app is busy doing processing.

This thread should be able to display the current state of the "world" (whatever the app is managing) and let you do simple things like move/resize windows, scroll, etc. without blocking on complex processing.

Almost every app gets this wrong; even the ones that try (like some web browsers) just don't actually do it; eg. you should never ever get into a situation where you browse to a slow page that has some broken script or something, and that causes your other tabs to become unresponsive. (part of the problem with web browsers these days of course is that scripts are allowed to do input processing, which never should have been allowed, but anyhoo).

Anyway, that's just very basic and obvious. A slightly more advanced topic is how to respond to input when the slow processing causes a change in state which affects input processing.

That is, we see a series of input commands { A B C D ... } and we start doing them, but A is some big slow operation. As long as the commands are complelety indepenent (like "pure" functions) then we can just fire off A, then while it's still running we go ahead and execute B, C, D ...

But if A is something like "open a new window and take focus" , then it's completely ambiguous about whether we should go ahead and execute B,C,D now or not.

I can certainly make arguments for either side.

Argument for "go ahead and process B C D immediately" :

Say for example you're in a web browser and you click on a link as action A. The link is very slow to load so you decide you'll do something else and you center-click some other links on the original page to open them in new tabs. Clearly these inputs should be acted on immediately.

Argument for "delay processing B C D until A is done" :

For similarity we'll assume a web browser again. Say you are trying to log into your bank, which you have done many times. You type in your user name and hit enter. You know that this will load the next page which will put you at a password prompt, so you go ahead and start typing your password. Of course those key presses should be enqueued until the focus change is done.

A proponent of this argument could outline two clear principles :

1. User input should be race free. That is, the final result should not depend on a race between my fingers and the computer. I should get a consistent result even if the processing of commands is subject to random delays. One way to do this is :

2. For keyboard input, any keyboard command which changes key focus should cause all future keyboard input to be enqueued until that focus change is done.

This certainly bugs me on a nearly daily basis. The most common place I hit it is in MSVC because that's where I spend most of my life, and I've developed muscle-memory for common things. So I'll frequently do something like hit "ctrl-F stuff enter" , expecting to do a search for "stuff" , only to be dismayed to see that for some inscrutable reason the find dialog box took longer than usual to open, and instead of searching for "stuff" I instead typed it into my source code and am left with an empty find box.

I think in the case of pure keyboard input in a source code editor, the argument for race-freeness of user input is the right one. I should be able to develop keyboard finger instinctive actions which have consistent results.

However, the counter-example of the slow web browser means that this is not an obvious general rule for user inputs.

The thing I ask myself in these scenarios is "if there was a tiny human inside my computer that was making this decision, could they do it?". If the answer to that question is yes, then it means that there is a solution in theory, it just may not be easy to express as a computer algorithm.

I believe that in this case, 99% of the time a human would be able to tell you if the input should be enqueued or not. For example in the source code "ctrl-F stuff" case - duh, of course he wants stuff to be in the find dialog, not typed into the source code; the human computer would get that right (by saying "enqueue the input, don't process immediately"). Also in the web browser case where I click a slow link and then click other stuff on the original page - again a human would get that right by saying "don't enqueue the input, do process it immediately").

Obviously there are ambiguous cases, but this is an interesting point that I figured out while playing poker that I think most people don't get : the hard decisions don't matter !

Quickly repeating the point for the case of poker (I've written this before) : in poker you are constantly faced with decisions, some easy (in that the right answer is relatively obvious) and some very hard, where the right answer is quite unclear, maybe the right answer is not what the standard wisdom thinks it is, or maybe it requires deep thought. The thing is, the hard decisions don't matter. The reason they are hard is because the EV (expected value) of either line is very close; eg. maybe the EV of raise is 1.1 BB and the EV of call is 1.05 BB ; obviously in analysis you aren't actually figuring out the EV, but just the fact that it's not clear tells you that either line is okay.

The way that people lose value in poker is by flubbing the *easy* decisions. If you fold a full house on the river because you were afraid your opponent had quads, that is a huge error and gives up tons of value. When you fail to do something that is obviously right (like three-betting often enough from the blinds against aggressive late position openers) that is a big error. When you are faced with tricky situations that poker experts would have to debate for some time and still might not agree what the best line is - those are not important situations.

You can of course apply the same situation to politics, and here to algorithms. People love to debate the tricky situations, or to say that "that solution is not the answer because it doesn't work 100% of the time". That's stupid non-productive nit picking.

A common debate game is to make up extreme examples that prove someone's solution is not universal or not completely logical or self-consistent. That's retarded. Similarly, if you have a good solution for case A, and a good (different) solution for case B, a common debate game is to interpolate the cases and find something in the middle where it's ambiguous or neither solution works, and the sophomoric debater contends that this invalidates the solutions. Of course it doesn't, it's still a good solution for case A and case B and if those are the common cases then who cares.

What actually matters is to get the answer right *when there is obviously a right answer*.

In particular with user input response, the user expects the app to respond in the way that it obviously *should* respond when there is an obvious response. If you do something that would be very easy for the app to get right, and it gets it wrong, that is very frustrating. However if you give input that you know is ambiguous, then it's not a bad thing if the app gets it wrong.

09-22-12 - Oodle Beta and Roadmap

Oodle went Beta a few weeks ago (yay). If you're a game developer interested in Oodle you can mail oodle at rad.

I wanted to write about what's in Oodle new and what's coming, as a sort of roadmap for myself and others. This is not a detailed feature list; contact RAD for documents with more details about Oodle features.

Oodle Beta : (now)

What's in Oodle at the moment :

  • OodleLZH and LZHLW : the LZ-huff compressors I have written about repeatedly. Designed for offline compression and very fast low memory decompression in game.

  • PC, PS3 (+SPU) and Xenon support

  • OodleIOQ : Fast low-level async IO

  • OodleIOQStream : Streaming async IO (keeping buffers full for media playback) ; with Bink and Miles drop-in IO layers.

  • OodleFile : Synchronous/buffered IO built on the async IO layer. This is a pluggable thing and there are memory OodleFiles and transparent LZ decompression OodleFiles and so on. Currently the ability to make your own OodleFile type is not exposed, but maybe it will be in 1.1 or so.

  • OodleWork : Work-stealing thread pool system. Includes stackless coroutines. A bit more conscientiously designed than most.

  • OodleHandle : Generic handle/dependency system so you can set up async chains (like load X then decompress it, then process it, etc.), WFMO mixing IO and work and SPU jobs and so on.

  • OodlePackage : "Packages" (archives) with solid archiving and very async decompressors.

  • OodleVFS : (virtual file system) for drive name mapping and per-drive file system overrides; also supports transparent mirroring from one drive to another.

  • VFS patches, for overlaying a set of files over an existing set.

  • Multithreaded memory allocator ; some other handy threading stuff like a tiny mutex; lots of other threading stuff that's in Oodle but not exposed to public API.

  • Oodle Thread Viewer (super-lite Telemetry of sorts) (and real Telemetry integration for people who have that)

Oodle RC / 1.0 : (2012-2013)

  • More platforms (Mac, Linux, iOS)

  • Mostly just loose ends from stuff in the Beta. eg. Packages need to have update scripts to rebuild them when the source content changes. Packages need to be able to have their header in a separate file from the body (for header aggregation). etc.

  • Lots of little stuff needs to get wrapped up; for example I need the ability to cache open file handles so that multiple opens for read on the same handle just use the cache, with some kind of LRU type of thing and a limit on the number of open handles kept.

Oodle 1.1 : (around GDC 2013)

  • All major/current platforms (Mac, Linux, WiiU, ARM/mobile, next-gen consoles)

  • A better cross-platform DVD to HD transparent mirroring system.

  • OodleLZB and LZA (LZ-bytewise and LZ-arithmetic) , for when the LZH's don't quite hit the space/speed tradeoff you need.

  • DVD layout tool and DVD emulator ; back when I was a real game dev I was dismayed at using the various platforms DVD emulators and layout tools, so I think doing a cross-platform one would be awesome. My goal for the emulator is that you should be able to just use your normal content on your disk, but with a layout file to tell me where on the virtual disc each file is I can emulate seeking and such. It should be possible to just flip a switch and test your game as if it was on DVD.

  • PC File Serving Host. Basically a server you can run on your dev PC that will serve files to Oodle on a console so that you get transparent mirroring. (some consoles provide support for this already, some don't; it should be possible to do it for all consoles in a unified way).

  • High level paging / resource manager interface. Oodle 1.0 is very imperative ; C-style. You can certainly make your own paging system on top of the Oodle 1.0 interfaces, but there's been a reasonable amount of queries about an out of the box ready to go resource manager. This would probably be a C++ style interface (perhaps in C though using function pointers).

  • Paging transition package optimizer : you give me a list of resources for each region and a region transition map, I generate transition bundles with constrained optimization.

  • OodleHandle for GPU Compute jobs ; be able to spawn & wait on them in the unified IO/Work system.

Oodle 1.2/2.0 :

  • Image stuff is probably the biggest thing for the major rev of Oodle. DXTC, R/D-lossy DXTC, some kind of lossy image codec, perhaps just a fast JPEG decoder. I have lots of this code already, but it needs a big cleanup and "publicating".

  • Patch generator.

  • Stackful coroutines ? If so, then also a set of primitives that yield the coroutine instead of doing an OS wait; eg. "coMutex" and such that yield if blocked, and of course "coFRead" for files that yields if it can't return immediately.

9/15/2012

09-15-12 - Some compression comparison charts

These charts show the time to load + decompress a compressed file using various compressors.

(the compressors are the ones I can compile into my testbed and run from code, eg. not command line apps; they are all run memory to memory; I tried to run all compressors in max-compression-max-decode-speed mode , eg. turning on heavy options on the encode side. Decode times are generated by running each decompressor 10 times locked to each core of my box (80 total runs) and taking the min time; the cache is wiped before each decode. Load times are simulated by dividing the compressed file size by the disk speed parameter. All decoders were run single threaded.)

They are sorted by fastest decompressor. (the "raw" uncompressed file takes zero time to decompress).

"sum" = the sum of decomp + load times. This is the latency if you load the entire compressed file and then decompress in series.

"max" = the max of decomp & load times. This is the latency if the decomp and load were perfectly parallelizable, and neither ever stalled on the other.

The criterion you actually want to use is something between "sum" and "max", so the idea is you look at them both and kind of interpolate in your mind. (obviously you can replace "disk" with ram or network or "channel")

Discussion :

The compressors are ordered from left to right by speed. If you look at the chart of compressed file sizes, they should be going monotonically downward from left to right. Any time it pops up as you go to the right (eg. at snappy, minilzo, miniz, zlib) that is just a bad compressor; it has no hope of being a good choice in terms of space/speed tradeoff. The only ones that are on the "Pareto frontier" are raw, LZ4, OodleLZH, and LZMA.

Basically what you should see is that on a fast disk (100 mbps (and mb = millions of bytes, not 1024*1024)), a very slow decoder like LZMA does not make a lot of sense, you spend way too much time in decomp. On very slow data channels (like perhaps over the net) it starts to make sense, but you have to get to 5 mbps or slower before it becomes a clear win. (of course there may be other reasons that you want your data very small other than minimizing time to load; eg. if you are exceeding DVD capacity).

On a fast disk, the fast decompressors like LZ4 are appealing. (though even at 100 mbps, OodleLZH has a lower "max"; LZ4 has the best "sum").

Of the fast decoders, LZ4 is just the best. (in fact LZ4-largewindow would be even stronger). Zip is pretty poor; the small window is surely hurting it, it doesn't find enough matches which not only hurts compression, it hurts decode speed. Part of the problem is neither miniz nor zlib have super great decoders with all the tricks.

It's kind of ridiculous that we don't have a single decent mainstream free compression library. Even just zip-largewindow would be at least decent. (miniz could easily be extended to large windows; that would make it a much more competitive compressor for people who don't care about zlib compatibility)

If you are fully utilizing your CPU's, you may need a low-CPU decoder even if it's not the best choice in a vacuum. In fact because of that you should avoid CPU-hungry decoders even if they are the best by some simple measure like time to load. eg. even in cases where LZMA does seem like the right choice, if it's close you should avoid it, because you could use that CPU time for something else. You could say that any compressor that can decode faster than it can load compressed data is "free"; that is, the decode time can be totally hidden by parallelizing with the IO and you can saturate the disk loading compressed data. While that is true it assumes no other CPU work is being done, so does not apply to games. (it does sort of apply to archivers or installers, assuming you aren't using all the cores).

As a rough rule of thumb, compressors that are in the "sweet spot" take time that is roughly on par with the disk time to load their compressed data. That is, maybe half the time, or double the time, but not 1/10th the time of the disk (then they are going too fast, compressing too little, leaving too much on the table), and also not 10X the time of the disk (then they are just going way too slow and you'd be better off with less compression and a faster compressor).


The other thing we can do is draw the curves and see who's on the pareto frontier.

Here I make the Y axis the "effective mbps" to load and then decompress (sequentially). Note that "raw" is an x=y line, because the effective speed equals the disk speed.

Let me emphasize that these charts should be evaluated as information that goes into a decision. You do not just go "hey my disk is 80 mbps let me see which compressor is on top at that speed" and go use that. That's very wrong.

and the log-log (log base 2) :

You can see way down there at the bottom of the log-log, where the disk speed is 1.0 mbps, LZMA finally becomes best. Also note that log2 of 10 is a gigabyte per second, almost the speed of memory.

Some intuition about log-log compressor plots :

Over on the right hand side, all the curves will flatten out and become horizontal. This is the region where the decompress time dominates and the disk speed becomes almost irrelevant (load time is very tiny compared to decompress time). You see LZMA flattens out at relatively low disk speed (at 16 mbps (log2 = 4) it's already pretty flat). The speed over at the far right approaches the speed of just the decompressor running memory to memory.

On the left all the curves become straight lines with a slope of 1 (Y = X + B). In this area their total time is dominated by their loading time, which is just a constant times the disk speed. In a log-log plot this constant multiple becomes a constant addition - the Y intercept of each curve is equal to log2(rawLen/compLen) ; eg someone with 2:1 compression will hit the Y axis at log2(2) = 1.0 . You can see them stacked up hitting the Y axis in order of who gets the most compression.

Another plot we can do is the L2 mean of load time and decode time (sqrt of squares). What the L2 mean does is penalize compressors where the load time and decode time are very different (it favors ones where they are nearly equal). That is, it sort of weights the average towards the max. I think this is actually a better way to rate a compressor for most usages, but it's a little hand-wavey so take it with some skepticism.

9/14/2012

09-14-12 - Things Most Compressors Leave On the Table

It's very appealing to write a "pure" algorithmic compressor which just implements PPM or LZP or whatever in a very data agnostic way and make it quite minimal. But if you do that, you are generally leaving a lot on the table.

There are lots of "extra" things you can do on top of the base pure compressor. It makes it very hard to compare compressors when one of them is doing some of the extra things and another isn't.

I used to only write pure compressors and considered the extra things "cheating", but of course in practical reality they can sometimes provide very good bang-for-the-buck, so you have to do them. (and archivers these days are doing more and more of these things, so you will look bad in comparison if you don't do them).

Trying to dump out a list of things :

Parameter Optimization . Most compressors have some hard-coded parameters; some time it's an obvious one, like in LZMA you can set the # of position bits used in the context. Getting that number right for the particular file can be a big win. Other compressors have hard-coded tweaks that are not so obvious; for example almost all modern PPM or CM compressors use some kind of secondary-statistics table; the hash index made for that table is usually some heuristic, and tweaking it per file can be a big win.

Model Preconditioning . Any time your have a compressor that learns (eg. adaptive statistical coders, or the LZP cache table, or the LZ77 dictionary) - a "pure" compressor starts from an empty memory and then learns the file as it goes. But that's rarely optimal. You can usually get some win by starting from some pre-loaded state; rather than starting from empty and learning the file, you start from "default file" and learn towards the current file. (eg. every binary arithmetic probability should not start at 50% but rather at the expected final probability). And of course you can take this a step further by having a few different preconditions for different file types and selecting one.

Prefilters . BCJ (exe jump transform), deltas, image deltas, table record deltas (Bulat's DELTA), record transposes, various deswizzles, etc. etc. There are lots of prefilters possible, and they can provide very big wins for the amount of time they take. If you don't implement all the prefilters you are at a disadavantage to compressors that do. (for example, RAR has a pretty strong set of prefilters that are enabled by default, which means that RAR actually beats 7zip on lots of files, even though as a pure compressor it's much worse).

Header compression . Anything you send like buffer sizes or compressor parameters can generally be smaller by more advanced modeling. Typically this is just a few bytes total so not important, but it can become important if you transmit incremental headers, or something like static huffman codes. eg. something like Zip that can adapt by resending Huffmans, it's actually important to get that as small as possible, and it's usually something that's neglected because it's outside of the pure compressor.

Data Size Specialization . Most compressors either work well on large buffers or small buffers, not both; eg. if you do an LZSS , you might pick 3 byte offsets for large buffers, but on tiny buffers that's a huge waste; in fact you should use 1 byte offsets at first, and then switch to 2, and then 3. People rarely go to the trouble to have separately tuned algorithms for various buffer sizes.

Data specialization . Compressing text, structured records, images, etc. is actually all very different. You can get major win by special-casing for the major types of data (eg. text has weird features like the top bits tell you the type of character; word-replacing transforms are a big win, as are de-punctuators, etc. etc.).

Decompression . One of the new favorite tricks is decompressing data to compress it. If someone hands you a JPEG or a Zip or whatever and tells you to compress it as well as possible, of course the first thing you have to do is decompress it to undo the bad compressor so you can get back to the original bits.

This is almost all stuff I haven't done yet, so I have some big win in my back pocket if I ever get around to it.

In the compression community, I'm happy to see packages like FreeArc that are gathering together the prefilters so that they can be used with a variety of back-end "pure" compressors.

9/13/2012

09-13-12 - LZNib

LZNib is the straightforward/trivial way to do an LZ77 coder using EncodeMod for variable length numbers and 4-bit nibble aligned IO. That is, literals are always 8 bit; the control word is 4 bits and signals either a literal run len or a match length, using a range division value instead of a flag bit.

eg. if the nibble is < config_divider_lrl , it's a literal run len; if nibble is >= config_divider_lrl, it's a match.

The point of LZNib is to see how much compression is possible while keeping the decode speed close to the fastest of any reasonable compressor (LZ4,snappy,etc).

Testing different values of config_divider_lrl :

name raw config_divider_lrl=4 config_divider_lrl=5 config_divider_lrl=6 config_divider_lrl=7 config_divider_lrl=8 config_divider_lrl=9 config_divider_lrl=10 config_divider_lrl=11 config_divider_lrl=12 best
lzt00 16914 5639 5638 5632 5636 5654 5671 5696 5728 5771 5632
lzt01 200000 199360 199354 199348 199345 199339 199333 199324 199319 199314 199314
lzt02 755121 244328 243844 250836 255146 255177 255257 257754 260107 260597 243844
lzt03 3471552 1746220 1744630 1743728 1743043 1742718 1742728 1743191 1744496 1746915 1742718
lzt04 48649 13932 13939 13968 14015 14120 14184 14268 14319 14507 13932
lzt05 927796 422058 421115 420746 420592 418289 418200 418639 418854 418082 418082
lzt06 563160 414925 414080 412748 412223 409673 408884 408361 408435 407393 407393
lzt07 500000 237756 237318 237004 236910 236771 236949 237381 238091 239176 236771
lzt08 355400 309397 309490 308579 307706 307263 306418 305689 305495 305793 305495
lzt09 786488 302834 303018 303773 304350 305222 307405 308888 310649 314647 302834
lzt10 154624 11799 11785 11792 11800 11821 11843 11866 11885 11923 11785
lzt11 58524 22420 22341 22249 22288 22276 22322 22370 22561 22581 22249
lzt12 164423 28901 28974 28900 28957 29053 29122 29296 29381 29545 28900
lzt13 1041576 1072275 1068614 1058545 1047273 1025641 1025616 1025520 1025404 1024891 1024891
lzt14 102400 52010 51755 51595 51462 51379 51314 51298 51302 51341 51298
lzt15 34664 11846 11795 11767 11760 11740 11740 11756 11831 11837 11740
lzt16 21504 11056 11000 10961 10934 10911 10904 10893 10883 10892 10883
lzt17 53161 20122 20119 20152 20210 20288 20424 20601 20834 21091 20119
lzt18 102400 77317 77307 77274 77045 77037 77020 77006 76964 76976 76964
lzt19 768771 306499 307120 308138 309635 311801 314857 318983 323981 329683 306499
lzt20 1179702 975546 974447 973507 972326 971521 972060 971614 971569 985009 971521
lzt21 679936 99059 99182 99385 99673 100013 100492 101018 101652 102387 99059
lzt22 400000 334796 334533 334357 334027 333860 333733 333543 333501 337864 333501
lzt23 1048576 1029556 1026539 1023978 1021833 1019900 1018124 1016552 1015139 1013815 1013815
lzt24 3471552 1711694 1710524 1708577 1706969 1696663 1695663 1694205 1692996 1688324 1688324
lzt25 1029744 224428 224423 224306 229365 229362 229368 229603 227083 227546 224306
lzt26 262144 240106 239633 239200 238864 238538 238232 237960 237738 237571 237571
lzt27 857241 323147 323098 323274 323133 322050 322068 322799 322182 322573 322050
lzt28 1591760 343555 345586 348549 350601 352455 354077 356025 360583 364438 343555
lzt29 3953035 1445657 1442589 1440996 1440794 1437132 1437593 1440565 1442614 1442914 1437132
lzt30 100000 100668 100660 100656 100655 100653 100651 100651 100643 100643 100643
total 24700817 12338906 12324450 12314520 12308570 12268320 12272252 12283315 12296219 12326039 12212820

The divider at 8 is the same as using a flag bit + 3 bit payload in the nibble.

There does seem to be some win to be had by transmitting divider in the file (but it's not huge). Adaptive divider seems like an interesting thing to explore also, but it will make the decoder slower. Obviously it would be nice to be able to find the optimal divider for each file without just running the compressor N times.


Some comparison to other compressors. I tried to run all compressors with settings for max compression and max decode speed.

I'm having a bit of trouble finding anything to compare LZNib to. Obviously the LZ4+Large Window (LZ4P) that I talked about before is a fair challenger. I thought CRUSH would be the one, but it does very poorly on the big files, indicating that it has a small window (?). If anyone knows of a strong byte-aligned LZ that has large/unbounded window, let me know so I have something to compare against.

(ADDENDUM : tor -c1 can do byte-aligned IO and large windows, but it's a really bad byte-aligned coder)

(obviously things like LZ4 and snappy are not fair challengers because their small window makes them much worse; also note that zlib of course has a small window but is the only one with entropy coding; also several of these are not truly "byte aligned"; they have byte-aligned literals but do matches and control words bit by bit, which is "cheating" in this test (if you're allowed to do variable bit IO you can do much better; hell you may as well do huffman if you're doing variable bit IO) (though I'm also cheating because I believe I'm the only LZ here with a proper optimal parser) (I'm also cheating in a more subtle way, because I have this test set to tweak on and others don't)).

(aside : I don't understand why so many people are still writing small-window LZ's. Yes there is a certain use case for small-window LZ (eg. I did one for use on the SPU), but the most common use case for LZ these days is just "decompress the whole thing into a linear memory buffer", in which case you always have the entire buffer available for matches, so why not use it all?)

raw minilzo snappy quicklz 3 lz4 hc lzopack -9 ULZ c6 crush cx zlib lz4p 332 lznib div8
lzt00 16914 7195 7254 6082 6473 5805 6472 5487 4896 6068 5654
lzt01 200000 198906 198222 200009 198900 198934 199680 222477 198199 198880 199339
lzt02 755121 567421 552599 334625 410695 426187 312889 258902 386203 292427 255177
lzt03 3471552 2456002 2399663 2036985 1820761 1854718 1835797 1938138 1789728 1795951 1742718
lzt04 48649 20602 21170 16894 16709 14858 17359 13869 11903 15584 14120
lzt05 927796 590072 554964 480848 460889 453608 464602 444945 422484 440742 418289
lzt06 563160 536084 516818 563169 493055 490308 428137 432989 446533 419768 409673
lzt07 500000 297306 298207 268255 265688 242029 268271 245662 229426 248500 236771
lzt08 355400 356248 351850 317102 331454 314918 337423 315688 277666 322959 307263
lzt09 786488 460896 472498 372682 344792 345561 329608 304048 325921 325124 305222
lzt10 154624 21355 25960 17249 15139 14013 16238 12540 12577 13299 11821
lzt11 58524 28153 29121 25626 25832 23717 26720 23279 21637 23870 22276
lzt12 164423 52725 53515 38745 33666 31574 36077 29601 27583 30864 29053
lzt13 1041576 1045665 1041684 1041585 1042749 1041633 1048598 1061423 969636 1040033 1025641
lzt14 102400 61394 63638 55124 56525 52102 58509 51491 48155 53395 51379
lzt15 34664 16026 15417 13626 14062 12663 14016 12470 11464 12723 11740
lzt16 21504 12858 13165 11646 12349 11203 12554 11119 10311 11392 10911
lzt17 53161 28075 29415 23478 23141 20979 23829 19877 18518 22028 20288
lzt18 102400 100499 97931 81100 85659 74268 89973 76858 68392 79138 77037
lzt19 768771 495312 524686 411916 363217 360558 333732 302006 312257 335912 311801
lzt20 1179702 1181855 1161896 1037098 1045179 1042190 1013392 952329 952365 993442 971521
lzt21 679936 240528 240244 188446 194075 174892 125322 103608 148267 113461 100013
lzt22 400000 401446 397959 338837 361733 354449 355978 321598 309569 348347 333860
lzt23 1048576 1052692 1048694 1048585 1040701 1030737 1047609 985814 777633 1035197 1019900
lzt24 3471552 2668424 2613405 2425865 2369885 2521469 2040395 2080506 2289316 1934129 1696663
lzt25 1029744 361885 425735 351577 324190 326180 477377 297974 210363 332747 229362
lzt26 262144 261152 259327 244555 246465 242343 252297 237162 222808 244990 238538
lzt27 857241 460703 466747 435522 430350 395926 392139 335655 333120 353497 322050
lzt28 1591760 578289 617498 453170 445806 421804 401166 349753 335243 388712 352455
lzt29 3953035 2570903 2535625 2259281 2235299 2052597 2227835 2013763 1805289 1519904 1437132
lzt30 100000 100397 100015 100009 100394 100033 100782 112373 100020 100393 100653
total 24700817 17231068 17134922 15199691 14815832 14652256 14294776 13573404 13077482 13053476 12268320

BTW I finally got most of these compressors linked into my own test bed for this post. Review of their API's (and some ranting) :

minilzo , quicklz, LZ4 : yes, good, easy. API is basically compress(char *,int len,char *), as it should be.

One minor niggle with these guys : I'm actually leaning towards compress API's taking "void *" now for buffers. It's annoying dealing with API's where somebody thinks "char *" is the way to take an array and someone else wants "const unsigned char *", etc. If it's just memory, take void *, I think. I'm not completely sure about that.

Oh, also "uncompress" ? Where did that come from? It's "decompress". Come on!

Oh, and about LZO : so minilzo is nice and clean, but it only gives you access to the most crappy compressor. To get the good LZO variants you have to use the full LZO sdk which is a nightmare. So I didn't bother and just ran lzopack -9.

miniz : I don't love the single file miniz.c thing; I'd like to have an implementation .c and a header with the externs. It makes it way harder to read the externs and see what is supposed to be public. I also don't like the custom types (mz_uLong and such), I like "int" (I guess "size_t" is okay). But not bad, way better than :

zlib : WTF WTF ridiculous clusterfuck. There's an MSVC make file that seems to make a lib that is not what you want. Then the contrib has some projects for two MSVC versions and not others; then you have to make sure that you get the DLL import and WINAPI defines in your import the same as the built lib. Much hair pulling due to "_deflate not found" link errors.

(BTW I don't entirely blame zlib for that; how many man hours have been wasted due to the linker being so damn unhelpful. Hey jerks who work on the linker, if I'm trying to import "__stdcall deflate" and you only have "__cdecl deflate" then perhaps you could figure out that just maybe I wanted those to hook up. We've only been having this damn problem for the last 20 years. It's the kind of thing where one decent human being on that team would go "hmm I bet we could do something better here". The linker is like the overly precise nerd; if you ask "can you pass the salt" he's like "no, I don't have any salt" , and you're like "fuck you, I can see the salt right next to you" and he's like "nope, no salt", and you're like wtf wtf so you walk around and grab it and you say "okay, what's this?" and he says "that's kosher salt").

Compressed sizes from zlib were slightly smaller than miniz, so I posted them.

lzma : this is a mixed bag; the C++ interface is a total disaster of COM insanity, but you can avoid that and the C interface (LzmaEnc/LzmaDec) is okay. A bit of a mess and you oddly have to handle the props transmission yourself, but so much better than the C++ interface that it looks great in comparison. Very bizarre to expose so many technical encoder settings in the basic API though.

snappy : WTF ? No MSVC build, it uses some crazy STL crap for no reason (iostreams? are you kidding me? using std::string for buffers? WTF), it's all overly complex and bad-C++y. Really gross. There's a mess of files and I can't tell which ones I actually need. Oh, and it generates warnings like crazy; so it's all bloated "clean style" but not actually clean, which is the worst way to be. And then after futzing around with it, I'm rewarded with a really crappy compressor.

I guess I see the point of snappy; it's LZRW1 for modern CPUs. Okay, that's fine. And it is reasonably fast even on incompressible data. So it has a certain use case, perhaps for disk compression, or I guess Google uses it for network data compression or some such. But people have been taking snappy and using it as just a utility data compressor and it is totally terrible for that.

old rants