10/26/2012

10-26-12 - Oodle Rewrite Thoughts

I'm getting increasingly annoyed with the C-style Oodle threading code. It's just such a nightmare to manually manage things like object lifetimes in an async / multi-threaded environment.

Even something as simple as "write part of this buffer to a file" constantly causes me pain, because implied in that operation is "the buffer must not be freed until the write is done" , "the buffer should not be changed in the area being written until the write is done" , and "the file should not be closed until the write is done".

When you first start out and aren't doing a lot of complicated ops, it doesn't seem too bad, you can keep those things in your head; they become "comment-enforced" rules; that is, the code doesn't make itself correct, you have to write comments like "// write is pending, don't free buffer yet" (often you don't actually write the comments, but they're still "comment-enforced" as opposed to "code-enforced").

I think the better way is the very-C++-y Oodle futures .

Oodle futures rely on every object they take as inputs having refcounts, so there is no issue of free before exit. Some key points about the Oodle futures that I think are good :

A. Dependencies are automatic based on your arguments. You depend on anything you take as arguments. If the arguments themselves depend on async ops, then you depend on the chain of ops automatically. This is super-sweet and just removes a ton of bugs. You are then required to write code such that all your dependencies are in the form of function arguments, which at first is a pain in the ass, but actually results in much cleaner code overall because it makes the expression of dependencies really clear (as opposed to just touching some global deep inside your function, which creates a dependency in a really nasty way).

B. Futures create implicit async handles; the async handles in Oodle future are all ref-counted so they clean themselves automatically when you no longer care about them. This is way better than the manual lifetime management in Oodle right now, in which you either have to hold a bunch of handles.

C. It's an easy way to plug in the result of one async op into the input of the next one. It's like an imperative way of using code to do that graph drawing thing ; "this op has an output which goes into this input slot". Without an automated system for this, what I'm doing at the moment is writing lots of little stub functions that just wait on one op, gather up its results and starts the next op. There's no inefficiency in this, it's the same thing the future system does, but it's a pain in the ass.

If I was restarting from scratch I would go even further. Something like :

1. Every object has a refcount AND a read-write lock built into. Maybe the refcount and RW lock count go together in one U32 or U64 which is maintained by lockfree ops.

Refcounting is obvious. Lifetimes of async ops are way too complicated without it.

The RW lock in every object is something that sophomoric programmers don't see the need for. They think "hey it's a simple struct, I fill it on one thread, then pass it to another thread, and he touches it". No no no, you're a horrible programmer and I don't want to work with you. It seems simple at first, but it's just so fragile and prone to bugs any time you change anything, it's not worth it. If every object doesn't just come with an RW lock it's too easy to be lazy and skip adding one, which is very bad. If the lock is uncontended, as in the simple struct handoff case above, then it's very cheap, so just use it anyway.

2. Whenever you start an async op on an object, it takes a ref and also takes either a read lock or write lock.

3. Buffers are special in that you RW lock them in ranges. Same thing with textures and such. So you can write non-overlapping ranges simultaneously.

4. Every object has a list of the ops that are pending on that object. Any time you start a new op on an object, it is delayed until those pending ops are done. Similarly, every op has a list of objects that it takes as input, and won't run until those objects are ready.

The other big thing I would do in a rewrite from scratch is the basic architecture :

1. Write all my own threading primitives (semaphore, mutex, etc) and base them on a single waitset. (I basically have this already).

2. Write stack-ful coroutines.

3. When the low level Wait() is called on a stackful coroutine, instead yield the coroutine.

That way the coroutine code can just use Semaphore or whatever, and when it goes to wait on the semaphore, it will yield instead. It makes the coroutine code exactly the same as non-coroutine code and makes it "composable" (eg. you can call functions and they actually work), which I believe is crucial to real programming. This lets you write stackful coroutine code that does file IO or waits on async ops or whatever, and when you hit some blocking code it just automatically yields the coroutine (instead of blocking the whole worker thread).

This would mean that you could write coroutine code without any special syntax; so eg. you can call the same functions from coroutines as you do from non-coroutines and it Just Works the way you want. Hmm I think I wrote the same sentence like 3 times, but it's significant enough to bear repetition.

10/22/2012

10-22-12 - Windows 7 Start Menu Input Race

I've been super annoyed by some inconsistent behavior in the Windows 7 start menu for a while now. Sometimes I hit "Start - programname - enter" and nothing happens. I just sort of put it down to "god damn Windows is flakey and shit" but I finally realized yesterday exactly what's happening.

It's an input race , as previously discussed here

What happens is, you hit Start, and you get your focus in the type-in-a-program edit box. That part is fine. You type in a program name. At that point it does the search in the start menu thing in the background (it doesn't stall after each key press). In many cases there will be a bit of a delay before it updates the list of matching programs found.

If you hit Enter before it finds the program and highlights it, it just closes the dialog and doesn't run anything. If you wait a beat before hitting enter, the background program-finder will highlight the thing and hitting enter will work.

Very shitty. The start menu should not have keyboard input races. In this case the solution is obvious and trivial - when you hit enter it should wait on the background search task before acting on that key (but if you hit escape it should immediately close the window and abort the task without waiting).

I've long been an advocate of video game programmers doing "flakiness" testing by playing the game at 1 fps, or capturing recordings of the game at the normal 30 fps and then watching them play back at 1 fps. When you do that you see all sorts of janky shit that should be eliminated, like single frame horrible animation pops, or in normal GUIs you'll see things like the whole thing redraw twice in a row, or single frames where GUI elements flash in for 1 frame in the wrong place, etc.

Things like input races can be very easily found if you artificially slow down the program by 100X or so, so that you can see what it's actually doing step by step.

I'm a big believer in eliminating this kind of flakiness. Almost nobody that I've ever met in development puts it as a high priority, and it does take a lot of work for apparently little reward, and if you ask consumers they will never rate it highly on their wish list. But I think it's more important than people realize; I think it creates a feeling of solidness and trust in the application. It makes you feel like the app is doing what you tell it to, and if your avatar dies in the game it's because of your own actions, not because the stupid game didn't jump even though you hit the jump button because there was one frame where it wasn't responding to input.

10-22-12 - LZ-Bytewise conclusions

Wrapping this up + index post. Previous posts in the series :

cbloom rants 09-02-12 - Encoding Values in Bytes Part 1
cbloom rants 09-02-12 - Encoding Values in Bytes Part 2
cbloom rants 09-02-12 - Encoding Values in Bytes Part 3
cbloom rants 09-04-12 - Encoding Values in Bytes Part 4
cbloom rants 09-04-12 - LZ4 Optimal Parse
cbloom rants 09-10-12 - LZ4 - Large Window
cbloom rants 09-11-12 - LZ MinMatchLen and Parse Strategies
cbloom rants 09-13-12 - LZNib
cbloom rants 09-14-12 - Things Most Compressors Leave On the Table
cbloom rants 09-15-12 - Some compression comparison charts
cbloom rants 09-23-12 - Patches and Deltas
cbloom rants 09-24-12 - LZ String Matcher Decision Tree
cbloom rants 09-28-12 - LZNib on enwik8 with Long Range Matcher
cbloom rants 09-30-12 - Long Range Matcher Notes
cbloom rants 10-02-12 - Small note on LZHAM
cbloom rants 10-04-12 - Hash-Link match finder tricks
cbloom rants 10-05-12 - OodleLZ Encoder Speed Variation with Worker Count
cbloom rants 10-07-12 - Small Notes on LZNib
cbloom rants: 10-16-12 - Two more small notes on LZNib

And some little additions :

First a correction/addendum on cbloom rants 09-04-12 - LZ4 Optimal Parse :

I wrote before that going beyond the 15 states needed to capture the LRL overflowing the control byte doesn't help much (or at all). That's true if you only go up to 20 or 30 or 200 states, but if you go all the way to 270 states, so that you capture the transition to needing another byte, there is some win to be had (LZ4P-LO-332 got lztestset to 12714031 with small optimal state set, 12492631 with large state set).

If you just do it naively, it greatly increases memory use and run time. However, I realized that there is a better way. The key is to use the fact that there are so many code-cost ties. In LZ-Bytewise with the large state set, often the coding decision in a large number of states will have the same cost, and furthermore often the end point states will all have the same cost. When this happens, you don't need to make the decision independently for each state, instead you make one decision for the entire block, and you store a decision for a range of states, instead of one for each state.

eg. to be explicit, instead of doing :


in state 20 at pos P
consider coding a literal (takes me to state 21 at pos P+1)
consider various matches (takes me to state 0 at pos P+L)
store best choice in table[P][20]

in state 21 ...

do :

in states 16-260 at pos P
consider coding a literal (takes me to states 17-261 at pos P+1 which I saw all have the same cost)
consider various matches (takes me to state 0 at pos P+L)
store in table[P] : range {16-260} makes decision X

in states 261-263 ...

so you actually can do the very large optimal parse state set with not much increase in run time or memory use.

Second : I did a more complex variant of LZ4P (large window). LZ4P-LO includes "last offset". LZ4P-LO-332 uses a 3-bit-3-bit-2-bit control word (as described previously here : cbloom rants 09-10-12 - LZ4 - Large Window ) ; the 2 bit offset reserves one value for LO and 3 values for normal offsets.

(I consider this an "LZ4" variant because (unlike LZNib) it sends LZ codes as a strictly alternating LRL-ML pairs (LRL can be zero) and the control word of LRL and ML is in one byte)

Slightly better than LZ4P-LO-332 is LZ4P-LO-695 , where the numbering has switched from bits to number of values (so 332 should be 884 for consistency). You may have noticed that 6*9*5 = 270 does not fit in a byte, but that's fixed easily by forbidding some of the possibilities. 6-9-5 = 6 values for literals, 9 for match lengths, and 5 for offsets. The 5 offsets are LO + 2 bits of normal offset. So for example one of the ways that the 270 values is reduced is because an LO match can never occur after an LRL of 0 (the previous match would have just been longer), so those combinations are removed from the control byte.

LZ4P-LO-695 is not competitive with LZNib unless you spill the excess LRL and ML (the amount that is too large to fit in the control word) to nibbles, instead of spilling to bytes as in the original LZ4 and LZ4P. Even with spilling to nibbles, it's no better than LZNib. Doing LZ4P-LO-695, I found a few bugs in LZNib, so its results also got better.

Thirdly, current numbers :

raw lz4 lz4p332 lz4plo695 lznib d8 zlib OodleLZHLW
lzt00 16914 6473 6068 6012 5749 4896 4909
lzt01 200000 198900 198880 198107 198107 198199 198271
lzt02 755121 410695 292427 265490 253935 386203 174946
lzt03 3471552 1820761 1795951 1745594 1732491 1789728 1698003
lzt04 48649 16709 15584 15230 14352 11903 10679
lzt05 927796 460889 440742 420541 413894 422484 357308
lzt06 563160 493055 419768 407437 398780 446533 347495
lzt07 500000 265688 248500 240004 237120 229426 210182
lzt08 355400 331454 322959 297694 302303 277666 232863
lzt09 786488 344792 325124 313076 298340 325921 268715
lzt10 154624 15139 13299 11774 11995 12577 10274
lzt11 58524 25832 23870 22381 22219 21637 19132
lzt12 164423 33666 30864 29023 29214 27583 24101
lzt13 1041576 1042749 1040033 1039169 1009055 969636 923798
lzt14 102400 56525 53395 51328 51522 48155 46422
lzt15 34664 14062 12723 11610 11696 11464 10349
lzt16 21504 12349 11392 10881 10889 10311 9936
lzt17 53161 23141 22028 21877 20857 18518 17931
lzt18 102400 85659 79138 74459 76335 68392 59919
lzt19 768771 363217 335912 323886 299498 312257 268329
lzt20 1179702 1045179 993442 973791 955546 952365 855231
lzt21 679936 194075 113461 107860 102857 148267 83825
lzt22 400000 361733 348347 336715 331960 309569 279646
lzt23 1048576 1040701 1035197 1008638 989387 777633 798045
lzt24 3471552 2369885 1934129 1757927 1649592 2289316 1398291
lzt25 1029744 324190 332747 269047 230931 210363 96745
lzt26 262144 246465 244990 239816 239509 222808 207600
lzt27 857241 430350 353497 315394 328666 333120 223125
lzt28 1591760 445806 388712 376137 345343 335243 259488
lzt29 3953035 2235299 1519904 1451801 1424026 1805289 1132368
lzt30 100000 100394 100393 100010 100013 100020 100001
total 24700817 14815832 13053476 12442709 12096181 13077482 10327927

And comparison charts on the aggregated single file lzt99 :

Speeds are the best of 20 trials on each core; speed is the best of either x86 or x64 (usually x64 is faster). The decode times measured are slightly lower for everybody in this post (vs the last post of this type) because of the slightly more rigorous timing runs. For reference the decode speeds I measured are (mb/s) :


LZ4 :      1715.10235
LZNib :     869.1924302
OodleLZHLW: 287.2821629
zlib :      226.9286645
LZMA :       31.41397495

Also LZNib current enwik8 size : (parallel chunking (8 MB chunks) and LRM 12/12 with bubble)

LZNib enwik8 mml3 : 30719351
LZNib enwik8 stepml : 30548818

(all other LZNib results are for mml3)

10/16/2012

10-16-12 - Two more small notes on LZNib

Followup to Small Notes on LZNib

1. Because cost ties are common, and ties are not actually ties (due to "last offset"), just changing the order that you visit matches can change your compression. eg. if you walk matches from long to short or short to long or low offset to high offset, etc.

Another important way to break ties is for speed. Basically prefer long matches and long literal runs vs. a series of shorter ones that make the same output length. Because the code cost is integer bytes, you can do this pretty easily by just adding a small bias to the cost (one thousandth of a byte or whatever) each time you start a new match or LRL.

(more generally in an ideal world every compressor should have a lagrange parameter for space-speed tradeoff, but that's the kind of thing nobody ever gets around to)

2. Traditional LZ coders did not output matches unless they were cheaper than literals. That is, say you send a match len in 4 bits and an offset in 12 bits, so a match is 2 bytes - you would think that the minimum match length should be 3 - not 2 - because sending a 2 byte match is pointless (it's cheaper or the same cost to send those 2 bytes as literals (cheaper as literals if you are in a literal run-len already)). By using a larger MML, you can send higher match lengths in your 4 bits, so it should be a win.

This is not true if you have "last offset". With LO in your coder, it is often beneficial to send matches which are not a win (vs literals) on their own. eg. in the above example, minimum match length should be 2 in an LO coder.

This is one of those cases where text and binary data differ drastically. If you never tested on structured data you would not see this. Really the nature of LZ compression on text and binary is so different that it's worth considering two totally independent compressors (or at least some different tweaked config vals). Text match offsets fall off very steadily in a perfect curve, and "last offsets" are only used for interrupted matches, not for re-using an offset (and generally don't help that much). Binary match offsets have very sparse histograms with lots of strong peaks at the record sizes in the file, and "last offset" is used often just as a way of cheaply encoding the common record distance.

On text, it is in fact best to use an MML which makes matches strictly smaller than literals.

If I keep at this work in the future I'm sure I'll get around to doing an LZ specifically designed for structured data; it's sort of hopeless trying to find a compromise that works great on both; I see a lot more win possible.

10-16-12 - Thoughts on Bit-Packing Structs Before Compression

If you're trying to transmit some data compactly, and you are *not* using any back-end compression, then it's relatively straightforward to pack the structs through ad-hoc "bit packing" - you just want to squeeze them into as few bits as possible. But if you are going to apply a standard compressor after bit packing, it's a little less clear. In particular, a lot of people make mistakes that result in larger final data than necessary.

To be clear, there are two compression steps :


{ Raw structs } --[ad hoc]--> { Bit Packed } --[compressor]--> { Transmitted Data }

What you actually want to minimize is the size of the final transmitted data, which is not necessarily achieved with the smallest bit packed data.

The ideal scenario is if you know your back-end compressor, simply try a variety of ways of packing and measure the final size. You should always start with completely un-packed data, which often is a reasonable way to go. It's also important to keep in mind the speed hit of bit packing. Compressors (in particular, decompressors) are very fast, so even though your bit-packing may just consist of some simple math, it actually can very easily be much slower than the back-end decompressor. Many people incorrectly spend CPU time doing pre-compression bit-packing, when they would be better off spending that same CPU time by just running a stronger compressor and not doing any twiddling themselves.

The goal of bit-packing should really be to put the data in a form that the compressor can model efficienctly. Almost all compressors assume an 8-bit alphabet, so you want your data to stay in 8-bit form (eg. use bit-aligned packing, don't use non-power-of-2 multiplies to tightly pack values if they will cross a byte boundary). Also almost all compressors, even the best in the world (PAQ, etc) primarily achieve compression by modeling correlation between neighboring bytes. That means if you have data that does not have the property of maximum correlation to its immediate neighbor (and steady falloff) then some swizzling may help, just rearranging bytes to put the correlated bytes near each other and the uncorrelated bytes far away.

Some issues to consider :

1. Lossy bit packing.

Any time you can throw away bits completely, you have a big opportunity that you should exploit (which no back end compressor can ever do, because it sends data exactly). The most common case of this is if you have floats in your struct. Almost always there are several bits in a float which are pure garbage, just random noise which is way below the error tolerance of your app. Those bits are impossible to compress and if you can throw them away, that's pure win. Most floats are better transmitted as something like a 16 bit fixed point, but this requires application-specific knowledge about how much precision is really needed.

Even if you decide you can't throw away those bits, something that can help is just to get them out of the main stream. Having some random bytes mixed in to an otherwise nicely compressible stream really mucks up the order-0 statistics, so just putting them on the side is a nice way to go. eg. you might take the bottom 4 or 8 bits out of each float and just pass them uncompressed.

(in practical bone-head tips, it's pretty common for un-initialized memory to be passed to compressors; eg. if your structs are padded by C so there are gaps between values, put something highly compressible in the gap, like zero or a duplicate of the neighboring byte)

2. Relationships between values.

Any time you have a struct where the values are not completely independent, you have a good opportunity for packing. Obviously there are cases where one value in a struct can be computed from another and should just not be sent.

There are more subtle cases, like if A = 1 then B has certain statistics (perhaps it's usually high), while if A = 0 then B has other statistics (perhaps it's usually low). In these cases there are a few options. One is just to rearrange the transmission order so that A and B are adjacent. Most back end compressors model correlation between values that are adjacent, so putting the most-related values in a struct next to each other will let the back end find that correlation.

There are also often complicated mathematical relationships. A common case is a normalized vector; the 3 values are constrained in a way that the compressor will never be able to figure out (proof that current compressors are still very far away from the ideal of perfect compression). When possible you want to reduce these related values to their minimal set; another common case is rotation matrices, where 9 floats (36 bytes) can be reduced to 3 fixed points (6-9 bytes).

This is really exactly the same as the kinds of variable changes that you want to do physics; when you have a lot of values in a struct that are constrained together in some way, you want to identify the true number of degrees of freedom, and try to convert your values into independent unconstrained variables.

When numerical values are correlated to their neighbors, delta transformation may help. (this particularly helps with larger-than-byte values where a compressor will have a harder time figuring it out)

3. Don't mash together statistics.

A common mistake is to get too aggressive with mashing together values into bits in a way that wrecks the back-end statistical model. Most back end compressors work best if the bytes in the file all have the same probability histogram; that is, they are drawn from the same "source". (as noted in some of the other points, if there are multiple unrelated "sources" in your one data stream, the best thing to do is to separate them from each other in the buffer)

Let me give a really concrete example of this. Say you have some data which has lots of unused space in its bytes, something like :


bytes in the original have values :

0000 + 4 bits from source "A"
0001 + 4 bits from source "B"

(when I say "from source" I mean a random value drawn under a certain probability distribution)

You might be tempted to bit-pack these to compact them before the back end compressor. You might do something like this :


Take the top 4 bits to make a flag bit
Take 8 flag bits and put them in a byte

Then take the 4 bits of either A or B and put them together in the high and low nibble of a byte

eg, in nibbles :

0A 1B 1B 0A 0A 0A 1B 0A 

--[bit packed]-->

01100010 (binary) + ABBAAABA (nibbles)

(and A and B are not the hex numbers but mean 4 bits drawn from that source)

It looks like you have done a nice job of packing, but in fact you've really wrecked the data. The sources A and B had different statistics, and in the original form the compressor would have been able to learn that, because the flag bit was right there in the byte with the payload. But by packing it up tightly what you have done is made a bunch of bytes whose probability model is a mix of {bit flags},{source A},{source B}, which is a big mess.

I guess a related point is :

4. Even straightforward bit packing doesn't work for the reasons you think it does.

Say for example you have a bunch of bytes which only take on the values 0-3 (eg. use 2 bits). You might think that it would be a big win to do your own bit packing before the compressor and cram 4 bytes together into one. Well, maybe.

The issue is that the back end compressor will be able to do that exact same thing just as well. It can see that the bytes only take values 0-3 and thus will send them as 2 bits. It doesn't really need your help to see that. (you could help it if you had say some values that you knew were in 0-3 and some other values you knew were in 0-7, you might de-interleave those values so they are separated in the file, or somehow include their identity in the value so that their statistics don't get mixed up; see #5)

However, packing the bytes down can help in some cases. One is if the values are correlated to their neighbors; by packing them you get more of them near each other, so the correlation is modeled at an effective higher order. (eg. if the back end only used order-0 literals, then by packing you get order-3 (for one of the values anyway)). If the values are not neighbor-correlated, then packing will actually hurt.

(with a Huffman back end packing can also help because it allows you to get fractional bits per original value)

Also for small window LZ, packing down effectively increases the window size. Many people see advantages to packing data down before feeding it to Zip, but largely that is just reflective of the tiny 32k window in Zip (left over from the DOS days and totally insane that we're still using it).

5. Separating values that are independent :

I guess I've covered this in other points but it's significant enough to be redundant about. If you have two different sources (A and B); and there's not much correlation between the two, eg. A's and B's are unrelated, but the A's are correlated to other A's - you should try to deinterleave them.

A common simple case is AOS vs SOA. When you have a bunch of structs, often each value in the struct is more related to the same value in its neighbor struct than to other values within its own struct (eg. struct0.x is related to struct1.x more than to struct0.y). In this case, you should transform from array-of-structs to struct-of-arrays ; that is, put all the .x's together.

For example, it's well known that DXT1 compresses better if you de-interleave the end point colors from the palette interpolation indeces. Note that AOS-SOA transformation is very slow if done naively so this has to be considered as a tradeoff in the larger picture.

More generally when given a struct you want to use app-specific knowledge to pack together values that are strongly correlated and de-interleave values that are not.

10/07/2012

10-07-12 - Small Notes on LZNib

Some little thoughts.

1. It's kind of amazing to me how well LZNib does. (currently 30,986,634 on enwik8 with parallel chunked compress and LRM). I guess it's just the "asymptotic optimality" of LZ77; as the dictionary gets bigger, LZ77 approaches perfect compression (assuming the data source is static, which of course it never is, which is why LZ77 does not in fact approach the best compressor). But anyway, the point is with basic LZ the way matches are encoded becomes less and less important as the window gets bigger (and the average match length thus gets longer).

2. With byte-wise coders you have something funny in the optimal parser than you don't run into much with huffman or arithmetic coders : *ties*. That is, there are frequently many ways to code that have exactly the same code length. (in fact it's not uncommon for *all* the coding choices at a given position to produce the same total length).

You might think ties don't matter but in fact they do. One way you can break a tie is to favor speed; eg. break the tie by picking the encoding that decodes the fastest. But beyond that if your format has some feedback, the tie is important. For example in LZNib the "divider" value could be dynamic and set by feedback from the previous encoding.

In my LZNib I have "last offset" (repeat match), which is affected by ties.

3. My current decoder is around 800 mb/s on my machine. That's almost half the speed of LZ4 (around 1500 mb/s). I think there are a few things I could do to get a little more speed, but it's never going to get all the way. Presumably the main factor is the large window - LZ4 matches mostly come from L1 and if not then they are in L2. LZNib gets a lot of large offsets, thus more cache misses. It might help to do a lagrangian space-speed thing that picks smaller offsets when they don't hurt too much (certainly for breaking ties). (LZNib is also somewhat more branchy than LZ4 which is the other major source of speed loss)

4. One of the nice things about optimal parsing LZNib is that you can strictly pick the set of matches you need to consider. (and there are also enough choices for the optimal parser to make interesting decisions). Offsets can be sent in 12 bits, 20 bits, 28 bits, etc. so for each offset size you just pick the longest match in that window. (this is in contrast to any entropy-coded scheme where reducing to only a few matches is an approximation that hurts compression, or a fixed-size scheme like LZ4 that doesn't give the optimal parser any choices to make)

5. As usual I'm giving up some compression in the optimal parser by not considering all possible lengths for each match. eg. if you find a match of length 10 you should consider only using 3,4,5... ; I don't do that, I only consider lengths that result in a shorter match length code word. That is a small approximation but helps encoder speed a lot.

6. Since LZNib uses "last offset", the optimal parse is only approximate and that is an unsolved problem. Because big groups of offsets code to the same output size, the choice between those offsets should be made by how useful they are in the future as repeat matches, which is something I'm not doing yet.

10/05/2012

10-05-12 - OodleLZ Encoder Speed Variation with Worker Count

Thought I would look into this. One thing I've been wondering is whether putting workers on the hyper-threads helps or not.

Measured speed on enwik8. This is the slow optimal encoder to give it something to do. enwik8 is encoded by breaking into 4 MB chunks (24 of them). Each chunk gets 4 MB of dictionary overlap precondition. Matches before the overlap are found using the LRM (Long Range Matcher). The LRM is created for the whole file and shared between all chunks.

What we see :

The speed dip from 0 to 1 workers is expected, it's the cost of firing up threads and communication and chunking and such. (0 = synchronous, just encode on the main thread).

My machine has 4 real cores and 8 hyper-cores. From 1-4 workers we see not-quite-linear speedup, but big steps. Once we get into the hyperthreads, the benefit is smaller but I'm still seeing steady speedup, which surprises me a bit, I thought it would flatten out more after 4 workers.

(the wiggle at 7 is probably just a random fluctuation in Windows (some service doing something I didn't ask it to do, you bastards); I only ran this test once so the numbers are not very solid; normally I run 40 trials or so when measuring speeds on Windows).

And here's the Oodle ThreadProfile of the encode showing what's happening all the threads :


(click to zoom)

Of course part of the reason for the not-quite-linear speedup is the gap at the end when not all the workers are busy. You can fix that by using smaller chunks, but it's really not anything to worry too much about. While it does affect the latency of this single "encode enwik8" operation, it doesn't affect throughput of the overall system under multiple workloads.


OodleLZHLW enwik8 compressed size variation with different chunkings :


28,326,489   4 MB chunks - no LRM
27,559,112   4 MB chunks with LRM
27,098,361   8 MB chunks with LRM , 4 matches
26,976,079   16 MB chunks , 4 matches
26,939,463   16 MB chunks , 8 matches
26,939,812   16 MB chunks , 8 matches, with thresholds

In each case the amount of overlap is = the chunk size (it's really overlap that affects the amount of compression). After the first one, all others are with LRM. Note that the effective local dictionary size varies as you parse through a chunk; eg. with 4 MB chunks, you start with 4 MB of overlap, so you have an effective 4 MB local window, as you parse your window effectively grows up to a max of 8 MB, so the end of each chunk is better compressed than the beginning.

My LZHLW optimal parse only considers 4 matches normally; as the overlap gets bigger, that becomes a worse compromise. Part of the problem is how those matches are chosen - I just take the 4 longest matches (and the lowest offset at each unique length). Normally this compromise is okay, you get a decent sampling of matches to choose from; on moderate file sizes the cost from going to infinite to 16 to 4 matches is not that great, but as the dictionary gets bigger, you will sometimes fill all 4 matches with high offsets (because they provide the longest match lengths) and not any low offsets to try.

At 16 MB chunks (+16 overlap = 32 MB total window) it becomes necessary to consider more matches. (in fact there's almost no benefit in going from 8 MB to 16 MB chunks without increasing the number of matches).

I tried adding "thresholds"; requiring that some of the matches found be in certain windows, but it didn't help; that merits more investigation. Intuitively it seems to me that the optimal parser wants to be able to choose between some long high-offset matches and some shorter low-offset matches, so the question is how to provide it a few good selections to consider. I think there's definitely some more win possible in my optimal parser by considering more matches, or by having a better heuristic to choose which matches to consider.

10/04/2012

10-04-12 - Hash-Link match finder tricks

Some notes on the standard old Hash->Links match finder for LZ. (See previous posts on StringMatchTest Hash1b and Amortized Hashing or index post here )

Some additional tricks which are becoming more or less standard these days :

1. Carry-forward "follows" matches. Previously discussed, see Hash1b post. (also in the Hash1b post : checking for improvement first).

2. "Good enough length". Once you find a match of length >= GOOD_ENOUGH (256 or 1024 or so), you stop the search. This helps in super-degenerate areas; eg. you are at a big run of zeros and that has occured many times before in your file, you can get into a very bad O(N^2) thing if you aren't careful, so once you find a long match, just take it. Hurts compression very little. (note this is not just a max match length; that does hurt compression a bit more (on super-compressable files))

3. Extra steps when not finding matches. The first place I saw this was in LZ4 and Snappy, dunno where it was done first. The idea is when you fail to find a match, instead of stepping ahead by 1 you step ahead by some variable amount. As you continue to fail to find matches, that variable amount increases. Something like :


ptr += 1 + (numSearchesWithNoMatchFound>>5);

instead of just ptr++. The idea is that on incompressible files (or incompressible portions of files) you stop bothering with all the work to find matches that you won't find anyway. Once you get back to a compressible part, the step resets.

4. Variable "amortize" (truncated hash search). A variant of #3 is to use a variable limit for the amortized hash search. Instead of just stepping over literals and doing no match search at all, you could do a match search but with a very short truncated limit. Alternatively, if you are spending too much time in the match finder, you could reduce the limit (eg. in degenerate cases not helped by the "good enough len"). The amortize limit might vary between 64 and 4096.

The goal of all this is to even out the speed of the LZ encoder.

The ideal scenario for an LZ encoder (greedy parsing) is that it finds a very long match (and thus can step over many bytes without doing any lookup at all), and it finds it in a hash bucket which has very few other entries, or if there are other entries they are very easily rejected (eg. they mismatch on the first byte).

The worst scenario for an LZ encoder (without our tricks) is either : 1. there are tons of long matches, so we go and visit tons of bytes before picking one, or 2. there are no matches (or only a very short match) but we had to look at tons of pointers in our hash bucket to find it, and we will have to do hash lookups many times in the file because we are not finding long matches.

10/02/2012

10-02-12 - Small note on LZHAM

When I did my comparison of various compressors a little while ago, I also tested LZHAM, but I didn't include it in the charts because the numbers I was seeing from it were very strange. In particular, I saw very very slow decode speeds, which surprised me because it seems to test well in other peoples' benchmarks.

So I finally had a deeper look to sort it out. The short answer is that LZHAM has some sort of very long initialization (even for just the decoder) which makes its speed extremely poor on small buffers. I was seeing speeds like 2 MB/sec , much worse than LZMA (which generally gets 10-25 MB/sec on my machine). (this is just from calling lzham_lib_decompress_memory)

On large buffers, LZHAM is in fact pretty fast (some numbers below). The space-speed is very good (on large buffers); it gets almost LZMA compression with much faster decodes. Unfortunately the breakdown on small buffers makes it not a general solution at the moment IMO (it's also very slow on incompressible and nearly-incompressible data). I imagine it's something like the huffman table construction is very slow, which gets hidden on large files but dominates small ones.

Anyhoo, here are some numbers. Decode shows mb/s.

BTW BEWARE : don't pay too much attention to enwik8 results; compressing huge amounts of text is irrelevant to almost all users. The results on lzt99 are more reflective of typical use.

name lzt99 decode
raw 24700820 inf
lz4 14814442 1718.72
zlib 13115250 213.99
oodlelzhlw 10164511 287.54
lzham 10066153 61.24
lzma 9344463 29.77

name enwik8 dec
raw 100000000 inf
lz4 42210253 1032.34
zlib 36445770 186.96
oodlelzhlw 27729121 258.46
lzham 24769055 103.01
lzma 24772996 54.59

(lzma should beat lzham on enwik8 but I can't be bothered to fiddle with all the compress options to find the ones that make it win; this is just setting both to "uber" (and -9) parse level and setting dict size = 2^29 for both)

And some charts for lzt99. See the previous post on how to read the charts .

10-02-12 - Small note on Buffered IO Timing

On Windows, Oodle by default uses OS buffering for reads and does not use OS buffering for writes. I believe this is the right way to go 99% of the time (for games).

(see previous notes on how Windows buffering works and why this is fastest :
cbloom rants 10-06-08 - 2
cbloom rants 10-07-08 - 2
cbloom rants 10-09-08 - 2
)

Not buffering writes also has other advantages besides raw speed, such as not polluting the file cache; if you buffer writes, then first some existing cache page is evicted, then the page is zero'ed, then your bytes are copied in, and finally it goes out to disk. Particularly if you are streaming out large amounts of data, there's no need to dump out a bunch of read-cached data for your write pages (which is what Windows will do because its page allocation strategy is very greedy).

(the major exception to unbuffered writes being best is if you will read the data soon after writing; eg. if you're writing out a file so that some other component can read it in again immediately; that usage is relatively rare, but important to keep in mind)

Anyhoo, this post is a small note to remind myself of a caveat :

If you are benchmarking apps by their time to run (eg. as an exe on a command line), buffered writes can appear to be much much faster. The reason is that the writes are not actually done when the app exits. When you do a WriteFile to a buffered file, it synchronously reserves the page and zeroes it and copies your data in. But the actual writing out to disk is deferred and is done by the Windows cache maintenance thread at some later time. Your app is even allowed to exit completely with those pages unwritten, and they will trickle out to disk eventually.

For a little command line app, this is a better experience for the user - the app runs much faster as far as they are concerned. So you should probably use buffered writes in this case.

For a long-running app (more than a few seconds) that doesn't care much about the edge conditions around shutdown, you care more about speed while your app is running (and also CPU consumption) - you should probable use unbuffered writes.

(the benefit for write throughput is not the only compelling factor, unbuffered writes also consume less CPU due to avoiding a memset and memcpy).

10-02-12 - Small note on Adaptive vs Static Modeling

Even most people who work in compression don't realize this, but in fact in most cases Adaptive Models and Static Models can achieve exactly the same amount of compression.

Let me now try to make that note more precise :

With an adaptive model to really do things right you must :


1. Initialize to a smart/tuned initial condition (not just 50/50 probabilities or an empty model)

2. Train the model with carefully trained rates; perhaps faster learning at first then slowing down;
perhaps different rates in different parts of the model

3. Reset the model smartly at data-change boundaries, or perhaps have multiple learning scales

4. Be careful of making the adaptive model too big for your data; eg. don't use a huge model space
that will be overly sparse on small files, but also don't use a tiny model that can't learn about
big files

With a static model to do things right you must :

1. Transmit the model very compactly, using assumptions about what the model is like typically;
transmit model refreshes as deltas

2. Send model refreshes in the appropriate places; the encoder must optimally choose model refresh points

3. Be able to send variable amounts of model; eg. with order-1 huffman decide which contexts get their
own statistics and which go into a shared group

4. Be able to send the model with varying degrees of precision; eg. be able to approximate when that's better
for the overall size(model) + size(coded bits)

We've seen over and over in compression that these can be the same. For example with linear-prediction lossless image compression, assuming you are doing LSQR fits to make predictors, you can either use the local neighborhood and generate an LSQR in the decoder each time, or you can transmit the LSQR fits at the start of the file. It turns out that either way compression is about the same (!!* BUT only if the encoder in the latter case is quite smart about deciding how many fits to send and how precise they are and what pixels they apply to).

Same thing with coding residuals of predictions in images. You can either do an adaptive coder (which needs to be pretty sophisticated these days; it should have variable learning rates and tiers, ala the Fenwick symbol-rank work; most people do this without realizing it just by having independent statistics for the low values and the high values) or you can create static shaped laplacian models and select a model for each coefficient. It turns out they are about the same.

The trade off is that the static model way needs a very sophisticated encoder which can optimize the total size (sizeof(transmitted model) + sizeof(coded bits)) , but then it gets a simpler decoder.

(caveat : this is not applicable to compressors where the model is huge, like PPM/CM/etc.)

A lot of people incorrectly think that adaptive models offer better compression. That's not really true, but it is *much* easier to write a compressor that achieves good compression with an adaptive model. With static models, there is a huge over-complete set of ways to encode the data, and you need a very complex optimizing encoder to find the smallest rep. (see, eg. video coders).

Even something as simple as doing order-0 Huffman and choosing the optimal points to retransmit the model is a very hard unsolved problem. And that's just the very tip of the iceberg for static models; even just staying with order-0 Huffman you could do much more; eg. instead of retransmitting a whole model, send a delta instead. Instead of sending the delta to the ideal code lens, instead send a smaller delta to non-ideal codelens (that makes a smaller total len); instead of sending new code lens, select from one of your previous huffmans. Perhaps have 16 known huffmans that you can select from and not transmit anything (would help a lot for small buffers). etc. etc. It's very complex.

Another issue with static models is that you really need to boil the data down to its simplest form for static models to work well. For example with images you want to be in post-predictor space with bias adjusted and all that gubbins before using a static model; on text you want to be in post-BWT space or something like that; eg. you want to get as close to decorrelated as possible. With adaptive models it's much easier to just toss in some extra context bits and let the model do the decorrelation for you. Put another way, static models need much more human guidance in their creation and study about how to be minimal, whereas adaptive models work much better when you treat them badly.

9/30/2012

09-30-12 - Long Range Matcher Notes

Some notes on the LRM mentioned last time.

Any time you do a string search based on hashes you will have a degeneracy problem. We saw this with the standard "Hash1b" (Hash->links) string matcher. In short, the problem is if you have many occurances of the same hash, then exact string matching becomes very slow. The standard solution is to truncate your search at some number of maximum steps (aka "amortized hashing"), but that has potentially unbounded cost (though typically low).

We have this problem with LRM and I brushed it under the rug last time. When you are doing "seperate scan" (eg. not incrementally adding to the hash table), then there's no need to have a truncated search, instead you can just have a truncated insert. That is, if you're limitting your search to 10, then don't add 1000 of the same hash and only ever search 10, just add 10. In fact on my test files it's not terrible to limit the LRM search to just 1 (!).

But I'm not happy with that as a general solution because there is a potential for huge inefficiency. The really bad degenerate case looks something like this :


LRM hash length is 32 or whatever
Lots of strings in the file of length 32 have the same hash value
You only add 100 or so to the hash
One of the ones you didn't add would have provided a really good match

Typically, missing that match is not a disaster, because at the next byte you will roll to a new hash and look that up, and so on, so if you miss a 128k long match, you will usually find a (128k - 256) long match 256 bytes later. But it is possible to miss it for a long time if you are unlucky, and I like my inefficiency to be bounded. The more common bad case is that you get matches just a bit shorter than possible, and that happens many times, and it adds up to compression lost. eg. say hash length is 16 and there are 24 byte matches possible, but due to the reduced search you only find 16 or 18-length matches.

But most importantly, I don't like to throw away compression for no good reason, I want to know that the speed of doing it this approximate way is worth it vs. a more exact matcher.

There are a few obvious solutions with LRM :

1. Push matches backwards :

If you find a match at pos P of length L, that match might also have worked at pos (P-1) for length (L+1), but a match wasn't found there, either because of the approximate search or because hashes are only put in the dictionary every N bytes.

In practice you want to be scanning matches forward (so that you can roll the hash forward, and also so you can carry forward "last match follows" in generate cases), so to implement this you probably want to have a circular window of the next 256 positions or whatever with matches in them.

This is almost free (in terms of speed and memory use) so should really be in any LRM.

2. Multiple Hashes :

The simplest form of this is to do two hashes; like one of length 16 and one of length 64 (or whatever). The shorter hash is the main one you use to find most matches, the longer hash is there to make sure you can find the big matches.

That is, this is trying to reduce the chance that you miss out on a very long match due to truncating the search on the short hash. More generally, to really be scale-invariant, you should have a bunch of levels; length 16,64,256,1024,etc. Unfortunately implementing this the naive way (by simply having several independent hashes and tables) hurts speed by a linear factor.

3. Multiple Non-Redundant Hashes :

The previous scheme has some obvious inefficiencies; why are we doing completely independent hash lookups when in fact you can't match a 64-long hash if you don't match a 16-long hash.

So you can imagine that we would first do a 16-long hash , in a lookup where the hashes have been unique'd (each hash value only occurs once), then for each 16-long hash there is another table of the 64-long hashes that occured for that 16-long hash. So then we look up in the next. If one is found there, we look in the 256-long etc.

An alternative way to imagine this is as a sorted array. For each entry you store a hash of 16,64,256,etc. You compare first on the 16-long hash, then for entries where that is equal you compare on the 64-long hash, etc. So to lookup you first use the 16-long hash and do a sorted array lookup; then in each range of equal hashes you do another sorted array lookup on the 64-long hash, etc.

These methods are okay, but the storage requirements are too high in the naive rep. You can in fact store them compactly but it all gets a bit complicated.

4. Hash-suffix sort :

Of course it should occur to us that what we're doing in #3 is really just a form of coarse suffix sort! Why not just actually use a suffix sort?

One way is like this : for each 16-byte sequence of the file, replace it with a 4-byte U32 hash value, so the array shrinks by 4X. Now suffix-sort this array of hash values, but use a U32 alphabet instead of a U8 alphabet; that is, suffix strings only start on every 4th byte.

To lookup you can use normal sorted-array lookup strategies (binary search, interpolation search, jump-ins + binary or jump-ins + interpolation, etc). So you start with a 16-byte hash to get into the suffix sort, then if you match you use the next 16-byte hash to step further, etc.

9/28/2012

09-28-12 - LZNib on enwik8 with Long Range Matcher

I wrote a "Long Range Matcher" (ala SREP/etc. (see previous blog post )).

I'm using a rolling hash, and a sorted-array lookup. Building the lookup is insanely fast. One problem with it in an Optimal Parsing scenario is that when you get a very long LRM match, you will see it over and over (hence N^2 kind of badness), so I use a heuristic that if I get a match over some threshold (256 or 1024) I don't look for any more in that interval.

For a rolling hash I'm currently using just multiplicative hash with modulus of 2^32 and no additive constant. I have no idea if this is good, I've not had much luck finding good reference material on rolling hashes. (and yes of course I've read the Wikipedia and such easily Googleable stuff; I've not tested buzhash yet; I don't like table lookups for hashing, I needs all my L1 for myself)

LRM builds its search structure by hashing L bytes (LRM hash length is a parameter (default is 12)) every S bytes (S step is a parameter (default is 10)). Memory use is 8 bytes per LRM entry, so a step of 8 would mean the LRM uses memory equal to the size of the file. For large files you have to increase the step. Hash length does not affect memory use.

So anyhoo, I tested on enwik8. This is a test of different dictionary overlaps and LRM settings.

Compression works like this :


I split the file into 8 M chunks. (12 of them)

Chunks are compressed independently (in parallel).

Each chunk preconditions its dictionary with "overlap" bytes preceding the chunk.

Each chunk can also use the LRM to match the entire file preceding its overlap range.

So for each chunk the view of the file is like :

[... LRM matches here ...][ overlap precondition ][ my chunk ][ not my business ]

(note : enwik8 is 100mB (millions of bytes) not 100 MB, which means that 12*8 MB chunks = 96 MB actually covers the 100 mB).

(BTW of course compression is maximized if you don't do any chunking, or set the overlap to infinity; we want chunking for parallelism, and we want overlap to be finite to limit memory use; enwik8 is actually small enough to do infinite overlap, but we want a solution that has bounded memory use for arbitrarily large files)

With no further ado, some data. Varying the amount of chunk dictionary overlap :

overlap MB no LRM LRM
0 32709771 31842119
1 32355536 31797627
2 32203046 31692184
3 32105834 31628054
4 32020438 31568893
5 31947086 31518298
6 31870320 31463842
7 31797504 31409024
8 31731210 31361250
9 31673081 31397825
10 31619846 31355133
11 31571057 31316477
12 31527702 31281434
13 31492445 31253955
14 31462962 31231454
15 31431215 31206202
16 31408009 31189477
17 31391335 31215474
18 31374592 31202448
19 31361233 31192874

0 overlap means the chunks are totally independent. My LRM has a minimum match length of 8 and also must match a hash equal to the rolling hash length. The "with LRM" in the above test used a step of 10 and hash length of 12.

LRM helps less as the overlap gets bigger, because you find the most important matches in the LRM region. Also enwik8 being text doesn't really have that huge repeated block that lots of binary data has. (on many of my binary test files, turning on LRM gives a huge jump because some chunk is completely repeated from the beginning of the file to the end). On text it's more incremental.

We can also look at how compression varies with the LRM step and hash length :

lrm_step lrm_length compressed
0 0 32020443
32 32 31846039
16 32 31801822
16 16 31669798
12 16 31629439
8 16 31566822
12 12 31599906
10 12 31568893
8 12 31529746
6 12 31478409
8 8 31511345
6 8 31457094

(this test was run with 4 MB overlap). On text you really want the shortest hash you can get. That's not true for binary though, 12 or 16 is usually best. Longer than that hurts compression a little but may help speed.

For reference, some other compressors on enwik8 (from Matt's LTCB page )


enwik8
lzham alpha 3 x64 -m4 -d29                    24,954,329
cabarc 1.00.0601  -m lzx:21                   28,465,607
bzip2 1.0.2       -9                          29,008,736
crush 0.01        cx                          32,577,338
gzip 1.3.5        -9                          36,445,248


ADDENDUM : newer LRM, with bubble-back and other improvement :


LZNib enwik8
lots of win from bubble back (LRM Scanner Windowed) :

    32/32 no bubble : 31,522,926
    32/32 wi bubble : 31,445,380
    12/12 wi bubble : 30,983,058
    10/12 no bubble : 31,268,849
    10/12 wi bubble : 30,958,529
     6/10 wi bubble : 30,886,133

9/26/2012

09-26-12 - Optimizing Carefully and Compression Bugs

This Theora update is a good reminder to be careful when optimizing. Apparently they had a massively broken DCT which was just throwing away quality for no good reason.

This may seem really bone-headed but it's very very common. You may recall in some of my past posts I looked into some basic video stuff in great detail and found that almost every major video encoder/decoder is broken on the simple stuff :

the DCT (or whatever transform)
quantization/dequantization
color space conversion
upsample/downsample
translating float math into ints

These things are all relatively trivial, but it's just SO common to get them wrong, and you throw away a bottom bit (or half a bottom bit) when you do so. Any time you are writing a compressor, you need to write reference implementations of all these basic things that you know are right - and check them! And then a crucial thing is : keep the reference implementation around! Ideally you would be able to switch it on from the command line, or failing that with a build toggle, so at anytime you can go back and enable the slow mode and make sure everything works as expected.

(of course a frequent cause of trouble is that people go and grab an optimized integer implementation that they found somewhere, and it either is bad or they use it incorrectly (eg. maybe it assumes data that's biased in a certain way, or centered at 0, or scaled up by *8, or etc))

A lot of this basic stuff in video is very easy to do regression tests on (eg. take some random 8x8 data, dct, quantize, dequantize, idct, measure the error, it should be very low) so there's no excuse to get it wrong. But even very experienced programmers do get it wrong, because they get lazy. They might even start with a reference implementation they know is right, and then they start optimizing and translating stuff into ints or SIMD, and they don't maintain the slow path, and somewhere along the line a mistake slips in and they don't even know it.


I've been thinking about a more difficult problem, which is : how do you deal with bugs in compression algorithms?

I don't mean bugs like "it crashes" or "the decompressor doesn't reproduce the original data" - those are the easy kind of bugs and you just go fix them. I mean bugs that cause the compressor to not work the way you intended, and thus not compress as much as it should.

The very hard thing about these bugs is that you can have them and not even know it; I'm sure I have a hundred of them right now. Frequently they are tiny things like you have a less-than where you should have a less-or-equal.

To avoid them really requires a level of care that most programmers never use. You have to be super vigilant. Any time something surprises you or is a bit fishy, you can't just go "hmm that's weird, oh well, move on to the next task". You have to stop and think and look into it. You have to gather data obsessively.

Any time you implement some new idea and it doesn't give you the compression win you expected, you can't just say "oh well guess that didn't work", you have to treat it like a crash bug, and go set breakpoints and watch your variables and make sure it really is doing what you think; and if it is, then you have to gather stats about how often that code is hit and what the values are, and see where your expectations didn't match reality.


I've really been enjoying working on compression again. It's one of the most fun areas of programming that exists. What makes it great :

1. Clear objective measure of success. You can measure size and speed (or whatever other criteria) and see if you are doing well or not. (lossy compression is harder for this).

2. Decompressors are one extreme of fun "hacker" programming; they have to be very lean; great decompressors are like haikus, they're little pearls that you feel could not get any simpler without ruining them.

3. Compressors, on the other hand, can be big and slow, and you get to pull out all the big guns of algorithms for optimization and searching and so on.

9/24/2012

09-24-12 - LZ String Matcher Decision Tree

Revisiting this to clarify a bit on the question of "I want to do X , which string matcher should I use?"

Starting from the clearest cases to the least clear :

There are two types of parsing, I will call them Optimal and Greedy. What I really mean is Optimal = find matches at every position in the file, and Greedy = find matches only at some positions, and then take big steps over the match. (Lazy parses and such are in the Greedy category).

There are three types of windowing : 1. not windowed (eg. infinite window or we don't really care about windowing), 2. fixed window; eg. you send offsets in 16 bits so you have a 64k window for matches, 3. multi-window or "cascade" ; eg. you send offsets up to 128 in 1 byte , up to 16384 in 2 bytes, etc. and you want to find the best match in each window.

There are two scan cases : 1. Incremental scan; that is, we're matching against the same buffer we are parsing; matches cannot come from in front of our current parse position, 2. Separate scan - the match source buffer is independent from the current parse point, eg. this is the case for precondition dictionaries, or just part of the file that's well before the current parse point.

1. Optimal Parsing , No window, Incremental scan : Suffix Trie is the clear winner here. Suffix Trie is only a super clear winner when you are parsing and matching at the same time, since they are exactly the same work you double your time taken if they are separate. That is, you must be scanning forward, adding strings and getting matches. Suffix Trie can be extended to Cascaded Windowing in an approximate way, by walking to parents in the tree, but doing it exactly breaks the O(N) of the Suffix Trie.

2. Optimal Parsing, No window or single window, Separate Scan : Suffix Array is pretty great here. Separate scan means you can just take the whole buffer you want to match against and suffix sort it.

(BTW this is a general point that I don't think most people get - any time you are not doing incremental update, a sort is a superb search structure. For example it's very rarely best to use a hash table when you are doing separate scan, you should just have a sorted list, possibly with jump-ins)

3. Optimal Parsing, Windowed or Cascaded, Incremental or Separate Scan : there's not an awesome solution for this. One method I use is cascades of suffix arrays. I wrote in the past about how to use Suffix Array Searcher with Incremental Scan (you have to exclude positions ahead of you), and also how to extend it to Windowing. But those method get slow if the percentage of matches allowed gets low; eg. if you have a 1 GB buffer and a 64k window, you get a slowdown proportional to (1GB/64k). To address this I use chunks of suffix array; eg. for a 64k window you might cut the file into 256k chunks and sort each one, then you only have to search in a chunk that's reasonably close to the size of your window. For cascaded windows, you might need multiple levels of chunk size. This is all okay and it has good O(N) performance (eg. no degeneracy disasters), but it's rather complex and not awesome.

Another option for this case is just to use something like Hash->Links and accept its drawbacks. A more complex option is to use a hybrid; eg. for cascaded windows you might use Hash->Links for the small windows, and then Suffix Array for medium size windows, and Suffix Trie for infinite window. For very small windows (4k or less) hash->links (or even just a "cache table") is very good, so it can be a nice supplement to a matcher like suffix trie is not great at cascaded windows.

Addendum : "Suffix Array Sets" is definitely the best solution for this.

4. Greedy Parsing : Here SuffixArray and SuffixTrie both are not awesome, because they are essentially doing all the work of an optimal-style parse (eg. string matching at every position), which is a big waste of time if you only need the greedy matches.

Hash-Link is comparable to the best matcher that I know of for greedy parsing. Yann's MMC is generally a bit faster (or finds better matches at the same speed) but is basically in the same class. The pseudo-binary-tree thing used in LZMA (and I believe it's the same thing that was used in the original PkZip that was patented) is not awesome; sometimes it's slightly faster than hash-link, sometimes slightly slower. All Window relatively easily.

Hash-Link extends very naturally to cascaded windows, because you are always visiting links in order from lowest offset to highest, you can easily find exact best matches in each window of the cascade as you go.

With Greedy Parsing you don't have to worry about degeneracies quite so much, because when you find a very long match you are just going to take it and step over it. (that is, with optimal parse if you find a 32k long match, then at the next step you will find a 32k-1 match, etc. which is a bad N^2 (or N^3) thing if you aren't super careful (eg. use a SuffixTrie with correct "follows" implementation)). However, with lazy parsing you can still hit a mild form of this degeneracy, but you can avoid that pretty easily by just not doing the lazy eval if your first match length is long enough (over 1024 or whatever).

(BTW I'm pretty sure it's possible to do a Suffix Trie with lazy/incremental update for Greedy Parsing; the result should be similar to MMC but provide exact best matches without any degenerate bad cases; it's rather complex and I figure that if I want perfect matching I generally also want Optimal Parsing, so the space of perfect matching + greedy parsing is not that important)

Previous posts on string matching :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 09-30-11 - Don't use memset to zero
cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - More Reliable Timing on Windows
cbloom rants 10-01-11 - String Match Results Part 6
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 10-03-11 - Amortized Hashing
cbloom rants 10-18-11 - StringMatchTest - Hash 1b
cbloom rants 11-02-11 - StringMatchTest Release

9/23/2012

09-23-12 - Patches and Deltas

A while ago Jon posted a lament about how bad Steam's patches are. Making small patches seems like something nice for Oodle to do, so I had a look into what the state of the art is for patches/deltas.

To be clear, the idea is computer 1 (receiver) has a previous version of a file (or set of files, but for now we'll just assume it's one file; if not, make a tar), computer 2 (transmitter) has a newer version and wishes to send a minimum patch, which computer 1 can apply to create the newer version.

First of all, you need to generate patches from uncompressed data (or the patch generator needs to be able to do decompression). Once the patch is generated, it should generally be compressed. If you're trying to patch a zip to a zip, there will be lots of different bits even if the contained files are the same, so decompress first before patching.

Second, there are really two classes of this problem, and they're quite different. One class is where the transmitter cannot see the old version that the receiver has; this is the case where there is no authoritative source of data; eg. in rsync. Another class is where the transmitter has all previous versions and can use them to create the diff; this is the case eg. for game developers creating patches to update installed games.

Let's look at each class.

Class 1 : Transmitter can't see previous version

This the case for rsync and Windows RDC (Remote Differential Compression).

The basic way all these methods work is by sending only hashes of chunks of data to each other, and hoping that when the hashes for chunks of the files match, then the bytes actually were the same. These methods are fallible - it is possible to get corrupt data if you have an unlucky hash match.

In more detail about how they work :

The file is divided into chunks. It's important that these chunks are chosen based on the *contents* of the file, not just every 256 bytes or whatever, some fixed size chunking. If you did fixed size chunking, then just adding 1 byte at the head of a file would make every chunk different. You want to use some kind of natural signature to choose the chunk boundaries. (this reminds me rather of SURF type stuff in image feature detection).

I've seen two approaches to finding chunking boundaries :

1. Pick a desired average chunk size of L. Start from the previous chunk end, and look ahead 2*L and compute a hash at each position. The next chunk boundary is set to the local min (or max) of the hash value in that range.

2. Pick a desired average chunk size of 2^B. Make a mask M with B random bits set. Compute a hash at each position in the file; any position with (hash & M) == (M) is a boundary; this should occur once in 2^B bytes, giving you an average chunk len of 2^B.

Both methods can fall apart in degenerate areas, so you could either enforce a maximum chunk size, or you could specifically detect degenerate areas (areas with the same hash at many positions) and handle them as a special case.

So once you have these chunk boundaries, you compute a strong hash for each chunk (MD5 or SHA or whatever; actually any many-bit hash is fine, the cryptographic hashes are widely over-used for this, they are generally slower to compute than an equally strong non-cryptographic hash). Then the transmitter and receiver send these hashes between each other; if the hashes match they assume the bytes match and don't send the bytes. If the hashes differ, they send the bytes for that chunk.

When sending the bytes for a chunk that needs updating, you can use all the chunks that were the same as context for a compressor.

If the file is large, you may wish to use multi-scale chunking. That is, first do a coarse level of chunking at large scale to find big regions that need transmision, then for each of those regions do finer scale chunking. One way to do this is to just use a constant size chunk (1024 bytes or whatever), and to apply the same algorithm to your chunk-signature set; eg. recurse (RDC does this).

Class 2 : Transmitter can see previous version

This case is simple and allows for smaller patches (as well as guaranteed, non-probablistic patches). (you probably want to do some simple hash check to ensure that the previous versions do in fact match).

The simplest way to do this is just to take an LZ77 compressor, take your previous version of the file and put it in your LZ dictionary, then compress the new version of the file. This will do byte-exact string matching and find any parts of the file that are duplicated in the previous version.

(aside : I went and looked for "preload dictionary" options in a bunch of mainstream compressors and couldn't find it in any of them. This is something that every major compressor should have, so if you are the author of FreeArc or 7zip or anything like that, go add a preload dictionary option)

(aside : you could use other compressors than LZ77 of course; for example you could use PPM (or CM) and use the previous version to precondition the model. For large preconditions, the PPM would have to be very high order, probably unbounded order. An unbounded order PPM would be just as good (actually, better) at differential compression than LZ77. The reason why we like LZ77 for this application is that the memory use is very low, and we want to use very large preconditions. In particular, the memory use (in excess of the window itself) for LZ77 compression can be very low without losing the ability to deduplicate large blocks; it's very easy to control, and when you hit memory limits you simply increase the block size that you can deduplicate; eg. up to 1 GB you can find all dupes of 64 bytes or more; from 1-2 GB you can find dupes of 128 bytes or more; etc. this kind of scaling is very hard to do with other compression algorithms)

But for large distributions, you will quickly run into the limits of how many byte-exact matches an LZ77 matcher can handle. Even a 32 MB preload is going to stress most matchers, so you need some kind of special very-large-window matcher to find the large repeated blocks.

Now at this point the approaches for very-large-window matching look an awful lot like what was done in class 1 for differential transmission, but it is really a different problem and not to be confused.

The standard approach is to pick a large minimum match length L (L = 32 or more) for the long-range matches, and to only put them in the dictionary once every N bytes (N 16 or more, scaling based on available memory and the size of the files). So basically every N bytes you compute a hash for the next L bytes and add that to the dictionary. Now when scanning over the new version to look for matches, you compute an L-byte hash at every position (this is fast if you use a rolling hash) and look that up.

One interesting variant of this is out-of-core matching; that is if the previous version is bigger than memory. What you can do is find the longest match using only the hashes, and then confirm it by pulling the previous file bytes back into memory only when you think you have the longest once. (SREP does some things like this; oddly SREP also doesn't include a "preload dictionary" option, or it could be used for making patches)

In the end you're just generating LZ matches though. Note that even though you only make dictionary entries every N bytes for L byte chunks, you can generate matches of arbitrary length by doing byte-by-byte matching off the end of the chunk, and you can even adjust to other offsets by sliding matches to their neighboring bytes. But you might not want to do that; instead for very large offets and match lengths you could just not send some bottom bits; eg. only send a max of 24 bits of offset, but you allow infinite window matches, so over 24 bits of offset you don't send some of the bottom bits.

Special Cases

So far we've only looked at pretty straightforward repeated sequence finding (deduplication). In some cases, tiny changes to original files can make lots of derived bytes differ.

A common case is executables; a tiny change to source code can cause the compiled exe to differ all over. Ideally you would back up to the source data and transmit that diff and regenerate the derived data, but that's not practical.

Some of the diff programs have special case handling for exes that backs out one of the major problems : jump address changes. Basically the problem is if something like the address of memcpy changes (or the location of a vtable, or address of some global variable, etc..), then you'll have diffs all over your file and generating a small patch will be hard.

I speculate that what these diffs do basically is first do the local-jump to absolute-jump transform, and then they create a mapping of the absolute addresses to find the same routine in the new & old files. They send the changed address, like "hey replace all occurances of address 0x104AC with 0x10FEE) so that chunks that only differ by some addresses moving can be counted as unchanged.

(bsdiff does some fancy stuff like this for executables) (ADDENDUM : not really; what bsdiff does is much more general and not as good on exes; see comments)

If you're trying to send small patches of something like lightmaps, you might have areas where you just increased the brightness of a light; that might change very pixel and create a huge diff. It might be possible to express deltas of image (and sound) data as linear transforms (add & scale). An alternative would be finding the original piece of content and just using it as a mocomp source (dictionary precondition) for an image compressor. But at some point the complexity of the diff becomes not worth it.

Links

In no particular order :

-ck hacking lrzip-0.613
ngramhashing - Rolling Hash C++ Library - Google Project Hosting
A new 900GB compression target
Patchdelta compression
Remote diff utility
SREP huge-dictionary LZ77 preprocessor
Long Range ZIP � Freecode
About Remote Differential Compression (Windows)
There is a Better Way. Instead of using fixed sized blocks, use variable sized b... Hacker News
bsdiff windows
ZIDRAV Free Communications software downloads at SourceForge.net
Binary diff (bsdiff)
Data deduplication (exdupe)
xdelta
Tridge (rsync)

BTW some of you have horrible page titles. "binary diff" is not a useful page title, nor is "data deduplication". It's like all the freaking music out there named "track1.mp3".

I have not done an extensive review of the existing solutions. I think bsdiff is very strong, but is limited to relatively small files, since it builds an entire suffix sort. I'm not sure what the best solution is for large file diffs; perhaps xdelta (?). The algorithms in rsync look good but I don't see any variant that makes "class 2" (transmitter has previous version) diffs. It seems neither lrzip nor srep have a "precondition dictionary" option (wtf?). So there you go.

old rants