9/15/2012

09-15-12 - Some compression comparison charts

These charts show the time to load + decompress a compressed file using various compressors.

(the compressors are the ones I can compile into my testbed and run from code, eg. not command line apps; they are all run memory to memory; I tried to run all compressors in max-compression-max-decode-speed mode, eg. turning on heavy options on the encode side. Decode times are generated by running each decompressor 10 times locked to each core of my box (80 total runs) and taking the min time; the cache is wiped before each decode. Load times are simulated by dividing the compressed file size by the disk speed parameter. All decoders were run single threaded.)

They are sorted by fastest decompressor. (the "raw" uncompressed file takes zero time to decompress).

"sum" = the sum of decomp + load times. This is the latency if you load the entire compressed file and then decompress in series.

"max" = the max of decomp & load times. This is the latency if the decomp and load were perfectly parallelizable, and neither ever stalled on the other.

The criterion you actually want to use is something between "sum" and "max", so the idea is you look at them both and kind of interpolate in your mind. (obviously you can replace "disk" with ram or network or "channel")
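As a concrete sketch (toy numbers of my own, not taken from the charts), the two bounds are just:

```c
#include <math.h>

/* Toy model of the two latency bounds described above.
   load time = compressed size / channel speed (consistent units). */
typedef struct {
    double comp_size_mb; /* compressed size, millions of bytes */
    double disk_mbps;    /* channel speed, millions of bytes per second */
    double decode_s;     /* measured decode time, seconds */
} LoadModel;

static double load_time_s(const LoadModel *m) { return m->comp_size_mb / m->disk_mbps; }

/* "sum" : load the whole file, then decompress in series */
static double latency_sum(const LoadModel *m) { return load_time_s(m) + m->decode_s; }

/* "max" : load and decompress perfectly overlapped */
static double latency_max(const LoadModel *m)
{
    double l = load_time_s(m);
    return l > m->decode_s ? l : m->decode_s;
}
```

eg. a 50 mb compressed file on a 100 mbps disk with a 0.25 second decode gives load = 0.5 s, sum = 0.75 s, max = 0.5 s, and the true latency is somewhere in between.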

Discussion :

The compressors are ordered from left to right by speed. If you look at the chart of compressed file sizes, they should be going monotonically downward from left to right. Any time it pops up as you go to the right (eg. at snappy, minilzo, miniz, zlib) that is just a bad compressor; it has no hope of being a good choice in terms of space/speed tradeoff. The only ones that are on the "Pareto frontier" are raw, LZ4, OodleLZH, and LZMA.

Basically what you should see is that on a fast disk (100 mbps (and mb = millions of bytes, not 1024*1024)), a very slow decoder like LZMA does not make a lot of sense, you spend way too much time in decomp. On very slow data channels (like perhaps over the net) it starts to make sense, but you have to get to 5 mbps or slower before it becomes a clear win. (of course there may be other reasons that you want your data very small other than minimizing time to load; eg. if you are exceeding DVD capacity).

On a fast disk, the fast decompressors like LZ4 are appealing. (though even at 100 mbps, OodleLZH has a lower "max"; LZ4 has the best "sum").

Of the fast decoders, LZ4 is just the best. (in fact LZ4-largewindow would be even stronger). Zip is pretty poor; the small window is surely hurting it: it doesn't find enough matches, which not only hurts compression, it hurts decode speed. Part of the problem is that neither miniz nor zlib has a super great decoder with all the tricks.

It's kind of ridiculous that we don't have a single decent mainstream free compression library. Even just zip-largewindow would be at least decent. (miniz could easily be extended to large windows; that would make it a much more competitive compressor for people who don't care about zlib compatibility)

If you are fully utilizing your CPU's, you may need a low-CPU decoder even if it's not the best choice in a vacuum. In fact because of that you should avoid CPU-hungry decoders even if they are the best by some simple measure like time to load. eg. even in cases where LZMA does seem like the right choice, if it's close you should avoid it, because you could use that CPU time for something else. You could say that any compressor that can decode faster than it can load compressed data is "free"; that is, the decode time can be totally hidden by parallelizing with the IO and you can saturate the disk loading compressed data. While that is true it assumes no other CPU work is being done, so does not apply to games. (it does sort of apply to archivers or installers, assuming you aren't using all the cores).

As a rough rule of thumb, compressors that are in the "sweet spot" take time that is roughly on par with the disk time to load their compressed data. That is, maybe half the time, or double the time, but not 1/10th the time of the disk (then they are going too fast, compressing too little, leaving too much on the table), and also not 10X the time of the disk (then they are just going way too slow and you'd be better off with less compression and a faster compressor).


The other thing we can do is draw the curves and see who's on the pareto frontier.

Here I make the Y axis the "effective mbps" to load and then decompress (sequentially). Note that "raw" is an x=y line, because the effective speed equals the disk speed.

Let me emphasize that these charts should be evaluated as information that goes into a decision. You do not just go "hey my disk is 80 mbps let me see which compressor is on top at that speed" and go use that. That's very wrong.

and the log-log (log base 2) :

You can see way down there at the bottom of the log-log, where the disk speed is 1.0 mbps, LZMA finally becomes best. Also note that log2 = 10 is a gigabyte per second, almost the speed of memory.

Some intuition about log-log compressor plots :

Over on the right hand side, all the curves will flatten out and become horizontal. This is the region where the decompress time dominates and the disk speed becomes almost irrelevant (load time is very tiny compared to decompress time). You see LZMA flattens out at relatively low disk speed (at 16 mbps (log2 = 4) it's already pretty flat). The speed over at the far right approaches the speed of just the decompressor running memory to memory.

On the left all the curves become straight lines with a slope of 1 (Y = X + B). In this area the total time is dominated by the load time, so the effective speed is just a constant (the compression ratio) times the disk speed. In a log-log plot this constant multiple becomes a constant addition - the Y intercept of each curve is equal to log2(rawLen/compLen) ; eg. someone with 2:1 compression will hit the Y axis at log2(2) = 1.0 . You can see them stacked up hitting the Y axis in order of who gets the most compression.
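In formula terms (my paraphrase of what the charts plot, not code from the post), the "sum" curve is:

```c
#include <math.h>

/* Effective speed to load-then-decode in series:
   raw / (comp/disk + decode).
   Small disk speed: approaches (raw/comp) * disk, the slope-1 region,
   so in log-log the Y intercept is log2(raw/comp).
   Large disk speed: approaches raw/decode, the flat region. */
static double effective_mbps(double raw_mb, double comp_mb,
                             double decode_s, double disk_mbps)
{
    return raw_mb / (comp_mb / disk_mbps + decode_s);
}
```

A 2:1 compressor with negligible decode time runs at exactly twice the disk speed, ie. it hits the Y axis at log2(2) = 1.0 as described.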

Another plot we can do is the L2 mean of load time and decode time (sqrt of squares). What the L2 mean does is penalize compressors where the load time and decode time are very different (it favors ones where they are nearly equal). That is, it sort of weights the average towards the max. I think this is actually a better way to rate a compressor for most usages, but it's a little hand-wavey so take it with some skepticism.

9/14/2012

09-14-12 - Things Most Compressors Leave On the Table

It's very appealing to write a "pure" algorithmic compressor which just implements PPM or LZP or whatever in a very data agnostic way and make it quite minimal. But if you do that, you are generally leaving a lot on the table.

There are lots of "extra" things you can do on top of the base pure compressor. It makes it very hard to compare compressors when one of them is doing some of the extra things and another isn't.

I used to only write pure compressors and considered the extra things "cheating", but of course in practical reality they can sometimes provide very good bang-for-the-buck, so you have to do them. (and archivers these days are doing more and more of these things, so you will look bad in comparison if you don't do them).

Trying to dump out a list of things :

Parameter Optimization . Most compressors have some hard-coded parameters; sometimes it's an obvious one, like in LZMA you can set the # of position bits used in the context. Getting that number right for the particular file can be a big win. Other compressors have hard-coded tweaks that are not so obvious; for example almost all modern PPM or CM compressors use some kind of secondary-statistics table; the hash index made for that table is usually some heuristic, and tweaking it per file can be a big win.

Model Preconditioning . Any time you have a compressor that learns (eg. adaptive statistical coders, or the LZP cache table, or the LZ77 dictionary) - a "pure" compressor starts from an empty memory and then learns the file as it goes. But that's rarely optimal. You can usually get some win by starting from some pre-loaded state; rather than starting from empty and learning the file, you start from a "default file" and learn towards the current file. (eg. every binary arithmetic probability should not start at 50% but rather at the expected final probability). And of course you can take this a step further by having a few different preconditions for different file types and selecting one.

Prefilters . BCJ (exe jump transform), deltas, image deltas, table record deltas (Bulat's DELTA), record transposes, various deswizzles, etc. etc. There are lots of prefilters possible, and they can provide very big wins for the amount of time they take. If you don't implement all the prefilters you are at a disadvantage to compressors that do. (for example, RAR has a pretty strong set of prefilters that are enabled by default, which means that RAR actually beats 7zip on lots of files, even though as a pure compressor it's much worse).

Header compression . Anything you send like buffer sizes or compressor parameters can generally be smaller by more advanced modeling. Typically this is just a few bytes total so not important, but it can become important if you transmit incremental headers, or something like static huffman codes. eg. something like Zip that can adapt by resending Huffmans, it's actually important to get that as small as possible, and it's usually something that's neglected because it's outside of the pure compressor.

Data Size Specialization . Most compressors either work well on large buffers or small buffers, not both; eg. if you do an LZSS , you might pick 3 byte offsets for large buffers, but on tiny buffers that's a huge waste; in fact you should use 1 byte offsets at first, and then switch to 2, and then 3. People rarely go to the trouble to have separately tuned algorithms for various buffer sizes.

Data specialization . Compressing text, structured records, images, etc. is actually all very different. You can get major win by special-casing for the major types of data (eg. text has weird features like the top bits tell you the type of character; word-replacing transforms are a big win, as are de-punctuators, etc. etc.).

Decompression . One of the new favorite tricks is decompressing data to compress it. If someone hands you a JPEG or a Zip or whatever and tells you to compress it as well as possible, of course the first thing you have to do is decompress it to undo the bad compressor so you can get back to the original bits.

This is almost all stuff I haven't done yet, so I have some big win in my back pocket if I ever get around to it.

In the compression community, I'm happy to see packages like FreeArc that are gathering together the prefilters so that they can be used with a variety of back-end "pure" compressors.

9/13/2012

09-13-12 - LZNib

LZNib is the straightforward/trivial way to do an LZ77 coder using EncodeMod for variable length numbers and 4-bit nibble aligned IO. That is, literals are always 8 bit; the control word is 4 bits and signals either a literal run len or a match length, using a range division value instead of a flag bit.

eg. if the nibble is < config_divider_lrl , it's a literal run len; if nibble is >= config_divider_lrl, it's a match.

The point of LZNib is to see how much compression is possible while keeping the decode speed close to the fastest of any reasonable compressor (LZ4,snappy,etc).
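The control nibble decode is then just a compare (a sketch of the range-division idea above; the EncodeMod length continuation is omitted, and the function names are mine):

```c
/* Classify a 4-bit control nibble under a divider.
   nibble <  divider : literal run, base length = nibble
   nibble >= divider : match, base length code = nibble - divider
   (longer lengths continue with EncodeMod bytes, not shown) */
static int lznib_is_literal_run(unsigned nibble, unsigned divider)
{
    return nibble < divider;
}

static unsigned lznib_base_value(unsigned nibble, unsigned divider)
{
    return nibble < divider ? nibble : nibble - divider;
}
```

With divider = 8 this degenerates to a flag bit plus a 3 bit payload.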

Testing different values of config_divider_lrl :

name raw config_divider_lrl=4 config_divider_lrl=5 config_divider_lrl=6 config_divider_lrl=7 config_divider_lrl=8 config_divider_lrl=9 config_divider_lrl=10 config_divider_lrl=11 config_divider_lrl=12 best
lzt00 16914 5639 5638 5632 5636 5654 5671 5696 5728 5771 5632
lzt01 200000 199360 199354 199348 199345 199339 199333 199324 199319 199314 199314
lzt02 755121 244328 243844 250836 255146 255177 255257 257754 260107 260597 243844
lzt03 3471552 1746220 1744630 1743728 1743043 1742718 1742728 1743191 1744496 1746915 1742718
lzt04 48649 13932 13939 13968 14015 14120 14184 14268 14319 14507 13932
lzt05 927796 422058 421115 420746 420592 418289 418200 418639 418854 418082 418082
lzt06 563160 414925 414080 412748 412223 409673 408884 408361 408435 407393 407393
lzt07 500000 237756 237318 237004 236910 236771 236949 237381 238091 239176 236771
lzt08 355400 309397 309490 308579 307706 307263 306418 305689 305495 305793 305495
lzt09 786488 302834 303018 303773 304350 305222 307405 308888 310649 314647 302834
lzt10 154624 11799 11785 11792 11800 11821 11843 11866 11885 11923 11785
lzt11 58524 22420 22341 22249 22288 22276 22322 22370 22561 22581 22249
lzt12 164423 28901 28974 28900 28957 29053 29122 29296 29381 29545 28900
lzt13 1041576 1072275 1068614 1058545 1047273 1025641 1025616 1025520 1025404 1024891 1024891
lzt14 102400 52010 51755 51595 51462 51379 51314 51298 51302 51341 51298
lzt15 34664 11846 11795 11767 11760 11740 11740 11756 11831 11837 11740
lzt16 21504 11056 11000 10961 10934 10911 10904 10893 10883 10892 10883
lzt17 53161 20122 20119 20152 20210 20288 20424 20601 20834 21091 20119
lzt18 102400 77317 77307 77274 77045 77037 77020 77006 76964 76976 76964
lzt19 768771 306499 307120 308138 309635 311801 314857 318983 323981 329683 306499
lzt20 1179702 975546 974447 973507 972326 971521 972060 971614 971569 985009 971521
lzt21 679936 99059 99182 99385 99673 100013 100492 101018 101652 102387 99059
lzt22 400000 334796 334533 334357 334027 333860 333733 333543 333501 337864 333501
lzt23 1048576 1029556 1026539 1023978 1021833 1019900 1018124 1016552 1015139 1013815 1013815
lzt24 3471552 1711694 1710524 1708577 1706969 1696663 1695663 1694205 1692996 1688324 1688324
lzt25 1029744 224428 224423 224306 229365 229362 229368 229603 227083 227546 224306
lzt26 262144 240106 239633 239200 238864 238538 238232 237960 237738 237571 237571
lzt27 857241 323147 323098 323274 323133 322050 322068 322799 322182 322573 322050
lzt28 1591760 343555 345586 348549 350601 352455 354077 356025 360583 364438 343555
lzt29 3953035 1445657 1442589 1440996 1440794 1437132 1437593 1440565 1442614 1442914 1437132
lzt30 100000 100668 100660 100656 100655 100653 100651 100651 100643 100643 100643
total 24700817 12338906 12324450 12314520 12308570 12268320 12272252 12283315 12296219 12326039 12212820

The divider at 8 is the same as using a flag bit + 3 bit payload in the nibble.

There does seem to be some win to be had by transmitting divider in the file (but it's not huge). Adaptive divider seems like an interesting thing to explore also, but it will make the decoder slower. Obviously it would be nice to be able to find the optimal divider for each file without just running the compressor N times.


Some comparison to other compressors. I tried to run all compressors with settings for max compression and max decode speed.

I'm having a bit of trouble finding anything to compare LZNib to. Obviously the LZ4+Large Window (LZ4P) that I talked about before is a fair challenger. I thought CRUSH would be the one, but it does very poorly on the big files, indicating that it has a small window (?). If anyone knows of a strong byte-aligned LZ that has large/unbounded window, let me know so I have something to compare against.

(ADDENDUM : tor -c1 can do byte-aligned IO and large windows, but it's a really bad byte-aligned coder)

(obviously things like LZ4 and snappy are not fair challengers because their small window makes them much worse; also note that zlib of course has a small window but is the only one with entropy coding; also several of these are not truly "byte aligned"; they have byte-aligned literals but do matches and control words bit by bit, which is "cheating" in this test (if you're allowed to do variable bit IO you can do much better; hell you may as well do huffman if you're doing variable bit IO) (though I'm also cheating because I believe I'm the only LZ here with a proper optimal parser) (I'm also cheating in a more subtle way, because I have this test set to tweak on and others don't)).

(aside : I don't understand why so many people are still writing small-window LZ's. Yes there is a certain use case for small-window LZ (eg. I did one for use on the SPU), but the most common use case for LZ these days is just "decompress the whole thing into a linear memory buffer", in which case you always have the entire buffer available for matches, so why not use it all?)

raw minilzo snappy quicklz 3 lz4 hc lzopack -9 ULZ c6 crush cx zlib lz4p 332 lznib div8
lzt00 16914 7195 7254 6082 6473 5805 6472 5487 4896 6068 5654
lzt01 200000 198906 198222 200009 198900 198934 199680 222477 198199 198880 199339
lzt02 755121 567421 552599 334625 410695 426187 312889 258902 386203 292427 255177
lzt03 3471552 2456002 2399663 2036985 1820761 1854718 1835797 1938138 1789728 1795951 1742718
lzt04 48649 20602 21170 16894 16709 14858 17359 13869 11903 15584 14120
lzt05 927796 590072 554964 480848 460889 453608 464602 444945 422484 440742 418289
lzt06 563160 536084 516818 563169 493055 490308 428137 432989 446533 419768 409673
lzt07 500000 297306 298207 268255 265688 242029 268271 245662 229426 248500 236771
lzt08 355400 356248 351850 317102 331454 314918 337423 315688 277666 322959 307263
lzt09 786488 460896 472498 372682 344792 345561 329608 304048 325921 325124 305222
lzt10 154624 21355 25960 17249 15139 14013 16238 12540 12577 13299 11821
lzt11 58524 28153 29121 25626 25832 23717 26720 23279 21637 23870 22276
lzt12 164423 52725 53515 38745 33666 31574 36077 29601 27583 30864 29053
lzt13 1041576 1045665 1041684 1041585 1042749 1041633 1048598 1061423 969636 1040033 1025641
lzt14 102400 61394 63638 55124 56525 52102 58509 51491 48155 53395 51379
lzt15 34664 16026 15417 13626 14062 12663 14016 12470 11464 12723 11740
lzt16 21504 12858 13165 11646 12349 11203 12554 11119 10311 11392 10911
lzt17 53161 28075 29415 23478 23141 20979 23829 19877 18518 22028 20288
lzt18 102400 100499 97931 81100 85659 74268 89973 76858 68392 79138 77037
lzt19 768771 495312 524686 411916 363217 360558 333732 302006 312257 335912 311801
lzt20 1179702 1181855 1161896 1037098 1045179 1042190 1013392 952329 952365 993442 971521
lzt21 679936 240528 240244 188446 194075 174892 125322 103608 148267 113461 100013
lzt22 400000 401446 397959 338837 361733 354449 355978 321598 309569 348347 333860
lzt23 1048576 1052692 1048694 1048585 1040701 1030737 1047609 985814 777633 1035197 1019900
lzt24 3471552 2668424 2613405 2425865 2369885 2521469 2040395 2080506 2289316 1934129 1696663
lzt25 1029744 361885 425735 351577 324190 326180 477377 297974 210363 332747 229362
lzt26 262144 261152 259327 244555 246465 242343 252297 237162 222808 244990 238538
lzt27 857241 460703 466747 435522 430350 395926 392139 335655 333120 353497 322050
lzt28 1591760 578289 617498 453170 445806 421804 401166 349753 335243 388712 352455
lzt29 3953035 2570903 2535625 2259281 2235299 2052597 2227835 2013763 1805289 1519904 1437132
lzt30 100000 100397 100015 100009 100394 100033 100782 112373 100020 100393 100653
total 24700817 17231068 17134922 15199691 14815832 14652256 14294776 13573404 13077482 13053476 12268320

BTW I finally got most of these compressors linked into my own test bed for this post. Review of their API's (and some ranting) :

minilzo , quicklz, LZ4 : yes, good, easy. API is basically compress(char *,int len,char *), as it should be.

One minor niggle with these guys : I'm actually leaning towards compress API's taking "void *" now for buffers. It's annoying dealing with API's where somebody thinks "char *" is the way to take an array and someone else wants "const unsigned char *", etc. If it's just memory, take void *, I think. I'm not completely sure about that.

Oh, also "uncompress" ? Where did that come from? It's "decompress". Come on!

Oh, and about LZO : so minilzo is nice and clean, but it only gives you access to the most crappy compressor. To get the good LZO variants you have to use the full LZO sdk which is a nightmare. So I didn't bother and just ran lzopack -9.

miniz : I don't love the single file miniz.c thing; I'd like to have an implementation .c and a header with the externs. It makes it way harder to read the externs and see what is supposed to be public. I also don't like the custom types (mz_uLong and such), I like "int" (I guess "size_t" is okay). But not bad, way better than :

zlib : WTF WTF ridiculous clusterfuck. There's an MSVC make file that seems to make a lib that is not what you want. Then the contrib has some projects for two MSVC versions and not others; then you have to make sure that you get the DLL import and WINAPI defines in your import the same as the built lib. Much hair pulling due to "_deflate not found" link errors.

(BTW I don't entirely blame zlib for that; how many man hours have been wasted due to the linker being so damn unhelpful. Hey jerks who work on the linker, if I'm trying to import "__stdcall deflate" and you only have "__cdecl deflate" then perhaps you could figure out that just maybe I wanted those to hook up. We've only been having this damn problem for the last 20 years. It's the kind of thing where one decent human being on that team would go "hmm I bet we could do something better here". The linker is like the overly precise nerd; if you ask "can you pass the salt" he's like "no, I don't have any salt" , and you're like "fuck you, I can see the salt right next to you" and he's like "nope, no salt", and you're like wtf wtf so you walk around and grab it and you say "okay, what's this?" and he says "that's kosher salt").

Compressed sizes from zlib were slightly smaller than miniz, so I posted them.

lzma : this is a mixed bag; the C++ interface is a total disaster of COM insanity, but you can avoid that and the C interface (LzmaEnc/LzmaDec) is okay. A bit of a mess and you oddly have to handle the props transmission yourself, but so much better than the C++ interface that it looks great in comparison. Very bizarre to expose so many technical encoder settings in the basic API though.

snappy : WTF ? No MSVC build, it uses some crazy STL crap for no reason (iostreams? are you kidding me? using std::string for buffers? WTF), it's all overly complex and bad-C++y. Really gross. There's a mess of files and I can't tell which ones I actually need. Oh, and it generates warnings like crazy; so it's all bloated "clean style" but not actually clean, which is the worst way to be. And then after futzing around with it, I'm rewarded with a really crappy compressor.

I guess I see the point of snappy; it's LZRW1 for modern CPUs. Okay, that's fine. And it is reasonably fast even on incompressible data. So it has a certain use case, perhaps for disk compression, or I guess Google uses it for network data compression or some such. But people have been taking snappy and using it as just a utility data compressor and it is totally terrible for that.

9/11/2012

09-11-12 - LZ MinMatchLen and Parse Strategies

I've been thinking about this and want to get my thoughts down while they're (somewhat) clear.

We've seen in the past that changing the LZ "min match len" can give big compression wins on some files. eg. say you have some LZ coder which normally has a minmatchlen of 3, you instead set minmatchlen to 4 so that no length 3 matches are emitted, and you find that the compressed size is smaller. ( see, for example, post on png-alike )

(for concreteness, assume a Huffman coded LZ, so that if you don't emit any matches of length 3, then that symbol takes up no code space; no prefix code is reserved for it if it doesn't occur)

Now, perhaps this is not a surprise in a normal parse, because a normal parse has some heuristic about whether a match is better than a literal (or a lazy match is better, etc.), and if that heuristic doesn't match the file then the normal parse will not find the optimum.

But this is also true for (typical) optimal parsers. You would think that if the matches of length 3 hurt compression, the optimal parser would not choose them. So what's going on?

Well, first of all, the problem is that the matches of length 3 do *help* compression, in a greedy local optimization sense. That is, if you have some assumption about the Huffman code lengths so that you can measure the cost of each choice, and you just ask "what's smaller to code, these 3 literals or this length-3 match?" the answer is the length-3 match. That's what makes this a tricky case; if it was more expensive it wouldn't get chosen.

But you can see what caused the problem - in order to do the optimal parse we had to make an assumption about the initial Huffman code lengths to compute code costs. This biases us towards a certain strategy of parse. The "optimal parse" is then just a local optimization relative to that seed; it can't find radically different parses.

In particular, what's happening with these files is that *if* you generate length-3 matches, they are slightly cheaper than 3 literals, but only barely so. However, if you don't generate any length-3 matches, that makes your other prefix codes shorter; in particular the literals get shorter and the length-4 matches get shorter. The result is a smaller file overall.

With an optimal parser, you could find the better parse by using a different seed Huffman cost, rather than using the hard rule of changing the minMatchLen. What you'd have to do is parse a bunch of files, pick a set of representative Huffman code lengths that are quite different from each other, then you can run your optimal parse seeded from each one and take the best. This is just giving you a bunch of different initial positions for your local search in an attempt to find the global minimum.
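A sketch of that outer loop (hypothetical driver code of mine, not the post's): run the same optimal parse from each seed cost model and keep the best result.

```c
#include <stddef.h>

/* Hypothetical seed for the parse: assumed per-event code costs in bits. */
typedef struct { double literal_bits; double match_bits; } SeedCosts;

/* Stand-in signature for an optimal parse; returns compressed size. */
typedef double (*OptimalParseFn)(const unsigned char *buf, size_t len,
                                 const SeedCosts *seed);

/* Run the local optimization from several starting points and keep the
   smallest output; each seed is a different initial position for the
   search, so radically different parses can be found. */
static double best_of_seeds(const unsigned char *buf, size_t len,
                            OptimalParseFn parse,
                            const SeedCosts *seeds, int num_seeds)
{
    double best = parse(buf, len, &seeds[0]);
    for (int i = 1; i < num_seeds; i++)
    {
        double size = parse(buf, len, &seeds[i]);
        if (size < best) best = size;
    }
    return best;
}

/* Toy parse for illustration only: cost = seed's literal cost per byte. */
static double toy_parse(const unsigned char *buf, size_t len,
                        const SeedCosts *seed)
{
    (void)buf;
    return seed->literal_bits * (double)len;
}
```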

In heuristic lazy parsers (like LZMA ( see this blog post )) there are some tweaks about "should I prefer this match or that one" or "should I prefer a literal or match". You have the same problem there, that the parse strategy affects the statistical coder, and the statistical coder affects the parse strategy. The heuristics are tweaked to guide the parse in a way that we expect to be good, but it's not the best way on all files.

For a while I've had this idea that I call "multi parse". The short description is to run many parses at once and take the best one. This is a bit different from a normal "optimal parse" in the sense that our specific goal is to avoid doing a single local minimization.

For example with an LZMA style heuristic parser, you could run several parses with different constants in the normal match vs repeat match and match vs lazy match decisions, as well as min match len.

The key thing is that you can run a "multi parse" as an inner loop, rather than an outer loop. That is, you run all the parses at once as you step over the file, rather than tweaking parameters on the outside and running the entire compression several times :


tweak outer params :

for each parse strategy S

  for each byte B
    
    use S on B

"multi-parse" :

for each byte B

  for each parse strategy S

    use S on B

This is a big difference, because a lot of work can be shared. In particular, the string matcher only needs to be run once for all the parse strategies. Also in many cases their work is redundant; eg. if you don't find any matches then all strategies must output a literal.

But the real big win comes once you realize that parse strategy doesn't have to be selected for the whole file. It can vary throughout the file, and different strategies may be best in different regions. If you have the timeline of all the strategies laid out, then you can try to find a shortest-path serpentine walk across the strategies.

9/10/2012

09-10-12 - LZ4 - Large Window

Continuing from : LZ4 Optimal Parse (also see the Encoding Values in Bytes series).

Trying the obvious ways to extend LZ4 to large windows.

In my view, the fundamental thing about LZ4 is the strictly alternating sequence of literal-run-len, match-len. That is, in LZ4 to send two matches in a row you must send a literal-run-len of 0 between them. This helps decoder speed a bit because it removes the "is this token a match or literal?" branch and makes you just always go literals-match-literals-match. (though you don't really get to avoid the branch, it's just moved into the looping on the run len).
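The alternating structure makes the decode loop branch-light. Here's a toy decoder in that spirit (simplified by me: no run-length extension bytes for lengths >= 15, and the stream just ends after a final literal run; real LZ4 differs in those details):

```c
#include <stddef.h>

/* Toy LZ4-like decoder: token = (lrl<<4)|mlcode, then lrl literal
   bytes, then a 2-byte little-endian offset and a match of length
   mlcode+4 copied from the already-decoded output. */
static size_t toy_lz4_decode(const unsigned char *src, size_t src_len,
                             unsigned char *dst)
{
    const unsigned char *src_end = src + src_len;
    unsigned char *out = dst;
    while (src < src_end)
    {
        unsigned token = *src++;
        unsigned lrl = token >> 4;
        unsigned ml  = (token & 0xF) + 4;
        while (lrl--) *out++ = *src++;       /* literal run */
        if (src >= src_end) break;           /* final literals, no match */
        unsigned offset = src[0] | (src[1] << 8);
        src += 2;
        const unsigned char *match = out - offset;
        while (ml--) *out++ = *match++;      /* byte-by-byte: overlap-safe */
        /* strictly alternating: next token follows immediately */
    }
    return (size_t)(out - dst);
}
```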

I relax the control word structure to not just be 4 bits of LRL and 4 bits of ML. But I do keep strict bit-wise division of the control word, and strictly byte-by-byte literals and offsets. The obvious contenders are :


LZ4 variants are named like :

# bits of LRL - # bits of ML - # bits of offset : how offset is encoded

we have :

4-4-0 : 16  = "LZ4 classic" , always a 16 bit offset

4-4-0 : 15/23 = one bit of 16 bit offset is reserved to indicate a 3 byte offset (so windows are 1<<15 , 1<<23)

4-4-0 : encodemodWB = use encodemod for offset; send word first then bytes

3-4-1 : 16/24 =  3 bits of LRL, 4 bits of ML, 1 bit in the control flags a 2 or 3 byte offset
     (4-3-1 is strictly worse than 3-4-1) 

3-3-2 : encodemodB = 2 bottom bits of offset go in the control ; remainder send with encodemod bytes
        this is the first variant that can send an offset in only 1 byte

3-3-2-B MML3-4 : same as above, but 1 byte offsets get a min match len of 3 (instead of 4)

And the optimal parse compressed sizes are :

raw 4-4-0 : 16 4-4-0 : 15/23 4-4-0 : encodemodWB 3-4-1 : 16/24 3-3-2 : encodemodB 3-3-2-B MML3-4
lzt00 16914 6444 6444 6444 6551 6186 6068
lzt01 200000 198900 198900 198900 198905 198893 198880
lzt02 755121 410549 321448 314152 307669 315101 292427
lzt03 3471552 1815464 1804312 1804086 1807200 1799418 1795951
lzt04 48649 16461 16463 16461 16564 15938 15584
lzt05 927796 459700 457261 454931 460191 445986 440742
lzt06 563160 492938 429336 428734 431119 429374 419768
lzt07 500000 264112 264594 263954 266882 257550 248500
lzt08 355400 330793 330524 328740 336329 334284 322959
lzt09 786488 340317 327145 323145 321273 326352 325124
lzt10 154624 14845 14706 14627 14687 13714 13299
lzt11 58524 25749 25755 25750 25943 24555 23870
lzt12 164423 32485 32770 32470 32542 31272 30864
lzt13 1041576 1042749 1043586 1042984 1042836 1042747 1040033
lzt14 102400 56478 56734 56535 57516 55182 53395
lzt15 34664 13995 13996 13995 14102 13050 12723
lzt16 21504 12340 12340 12340 12517 11847 11392
lzt17 53161 23025 23167 23025 23163 22374 22028
lzt18 102400 85614 87190 85929 86374 86197 79138
lzt19 768771 359276 345974 339273 334512 337204 335912
lzt20 1179702 1043192 1011629 1004412 1004435 1002099 993442
lzt21 679936 192411 120808 120704 121908 115289 113461
lzt22 400000 361524 356885 353333 353676 353120 348347
lzt23 1048576 1040623 1038648 1034493 1039073 1038802 1035197
lzt24 3471552 2369040 1911004 1907645 1929223 1931560 1934129
lzt25 1029744 324107 324281 323513 323032 332437 332747
lzt26 262144 246334 248360 246587 247177 246667 244990
lzt27 857241 425694 386493 386056 387358 358184 353497
lzt28 1591760 437666 399105 393814 390517 390421 388712
lzt29 3953035 2230095 1563410 1554583 1553331 1537093 1519904
lzt30 100000 100394 100394 100394 100394 100394 100393
total 24700817 14773314 13273662 13212009 13246999 13173290 13053476

Some thoughts : going to large window helps a lot, but the exact scheme doesn't matter much.

Obviously you could do better by going to more complexity; arbitrary dividers instead of integer bit counts; different min match lens for each # of offset bytes (eg. 3 for 1 byte, 4 for 2 bytes, 5 for 3 bytes, etc.).

Probably the biggest next step (which isn't a big speed hit) would be to add a "last offset" (aka "repeat match"). Last offset is a big LHS penalty on the shit platforms, but here our decoder is so simple that we have the potential to write the whole thing in assembly and just keep the last offset in a register.

BTW I don't think I said this explicitly in the last post, but the great thing about having an optimal parser is that it's much *easier* to write an optimal parser for an arbitrary coding scheme than it is to write a well tweaked heuristic. Each time you change the coding, you would have to re-tune the heuristic, whereas with an optimal parser you just plug in the code costs and let it run. Then once you have decided on a format, you can go back and try to find a heuristic that produces a good parse without the slowness of the optimal parse.

9/09/2012

09-09-12 - A Simple Tight-Packed Array

Trivial snippet for a tight-packed array with bit mask indicating which elements exist.

In the same family as this post :

cbloom rants 06-14-11 - A simple allocator

On newer chips with POPCNT this should be reasonably fast (assuming query is common and insert is rare, since insert requires a realloc/memmove or something similar). (and of course everything is a disaster on platforms where variable shift is slow).


struct PackedArray
{
    uint32  mask; // numItems = num_bits_set(mask)
    int32   items[1]; // variable allocation size
};


static inline uint32 num_bits_set( uint32 v )
{
    //return _mm_popcnt_u32(v);
    // from "Bit Twiddling Hacks" :
    v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
    uint32 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
    return c;
}

bool PackedArray_Get(const PackedArray * a,int index,int32 * pgot)
{
    ASSERT( index >= 0 && index < 32 );

    uint32 mask = 1UL << index;
    if ( a->mask & mask )
    {
        // it exists, find it
        uint32 numPreceding = num_bits_set( a->mask & (mask-1) );   
        *pgot = a->items[numPreceding];
        return true;
    }
    else
    {
        return false;
    }
}

bool PackedArray_Put(PackedArray * a,int index,int32 item)
{
    ASSERT( index >= 0 && index < 32 );
    
    uint32 mask = 1UL << index;
    uint32 numPreceding = num_bits_set( a->mask & (mask-1) );   
        
    if ( a->mask & mask )
    {
        // it exists, replace
        a->items[numPreceding] = item;
        return true;
    }
    else
    {
        // have to add it
        // realloc items here or whatever your scheme is
        
        // make room :
        uint32 numFollowing = num_bits_set(a->mask) - numPreceding;
        
        // slide up followers :
        int32 * pitem = a->items + numPreceding;
        memmove(pitem+1,pitem,numFollowing*sizeof(int32));
        
        // put me in
        *pitem = item;
        
        a->mask |= mask;
        return false;
    }
}
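
To make the usage concrete, here's a self-contained demo variant of the above (my own compaction: a fixed 32-slot buffer instead of the variable-size allocation, standard types instead of the uint32/int32 typedefs, and the realloc step dropped; the mask/popcount indexing is unchanged) :

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t mask;        /* bit i set => index i present */
    int32_t  items[32];   /* fixed max capacity for this demo */
} PackedArray32;

static uint32_t pop32(uint32_t v) {
    /* same Bit Twiddling Hacks popcount as in the post */
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
}

/* insert or replace; returns 1 if index already existed */
static int pa_put(PackedArray32 *a, int index, int32_t item) {
    uint32_t bit = 1u << index;
    uint32_t pre = pop32(a->mask & (bit - 1)); /* # of present slots before me */
    if (a->mask & bit) { a->items[pre] = item; return 1; }
    uint32_t post = pop32(a->mask) - pre;
    /* slide up followers to make room */
    memmove(a->items + pre + 1, a->items + pre, post * sizeof(int32_t));
    a->items[pre] = item;
    a->mask |= bit;
    return 0;
}

/* returns 1 and fills *out if index is present */
static int pa_get(const PackedArray32 *a, int index, int32_t *out) {
    uint32_t bit = 1u << index;
    if (!(a->mask & bit)) return 0;
    *out = a->items[pop32(a->mask & (bit - 1))];
    return 1;
}
```

Note that the items stay sorted by index in the tight array, so iteration in index order is just a linear walk over the first num_bits_set(mask) slots.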

(BTW the GCC builtins are annoying. What they should have done is provide a way to test if the builtin actually corresponds to a machine instruction, because if it doesn't I generally don't want their fallback implementation, I'll do my own thank you.)

9/06/2012

09-06-12 - Quick WebP lossless examination

Just had a quick run through the WebP-lossless spec. What I see :

0. Coding is pixel-by-pixel not byte-by-byte. (?)

This is a big change that is not at all clear in their explanation (I didn't pick up on it until I wrote this whole thing and had to come back and add it as item 0). It appears to me that coding is always pixel-by-pixel ; eg. they only send LZ matches for *whole pixels*. If this doesn't hurt compression, it should help decode speed, since coding branches are taken less often.

(ADDENDUM : a related major but not-emphasized change is that they have 3 separate huffmans for R,G,B literals (vs. PNG that just has a single huffman for all literals); note that on RGBA data, you could use LZMA as a back-end for a PNG-alike and the mod-4 literal stats of LZMA would correspond to RGBA; on RGB data, you need a mod-3 which no standard LZ does out of the box).

1. A large set of fixed spatial predictors.

They're basically the PNG predictors. They only allow the 1-ring causal neighbors. They do not include the truly great 1-ring predictor "ClampedGrad". They do not include arbitrary linear-combo predictors (which is very odd since they do include linear-combo color transforms). There's no ability to do more complex predictors using the 2-ring like CALIC (hello 1996 is calling and has some better tech for you). They do include the over-complex Select predictor (similar to PNG Paeth).
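
For reference, ClampedGrad is a one-liner (my sketch; it's the linear gradient N+W-NW clamped to the range of N and W, which makes it the same predictor as the JPEG-LS / LOCO-I "MED") :

```c
#include <assert.h>

/* "ClampedGrad" 1-ring predictor : the linear gradient N + W - NW,
   clamped into [min(N,W), max(N,W)] so it never overshoots an edge */
static int clamped_grad(int N, int W, int NW) {
    int lo = (N < W) ? N : W;
    int hi = (N < W) ? W : N;
    int p = N + W - NW;
    if (p < lo) p = lo;
    if (p > hi) p = hi;
    return p;
}
```

The clamp is what makes it great : on a horizontal or vertical edge the gradient term overshoots, and the clamp snaps the prediction back to the neighbor on the correct side of the edge.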

2. Division of image for coding purposes is in rectangular blocks instead of rows (like PNG).

This is surely a big win and allows the possibility of a very aggressive optimizing encoder. Obviously 2d images are better correlated in rectangular blocks than in long rows.

This is particularly a win in the modern world of very large images; if you have 16k pixels on a row it's silly to divide your image into rows for compression purposes; much better to put 16k pixels in a square block together.

(BTW webp wtf: 14 bit (16k) limit for width and height? Oh yeah, 640k of ram is all anyone needs. File sizes will never be bigger than 32 bits. WTF WTF. How many standards do you have to fuck up before you learn not to put in absolute limits that will be blown away in 5-10 years).

3. Arbitrary color transform.

They allow you to transmit the color transform as a series of lifting steps. I did a bunch of experiments on this in the past (and believe I wrote some rants about it) and found it to be not of much use on average, however in weird/rare cases it can help enormously.

However, this feature, like many others in WebP-LL, makes a super efficient decoder implementation almost impossible. eg. if you know the color transform is LOCO you can write a very nice ASM decoder for that.

BTW WebP-LL looks unfairly better than PNG here because LOCO-PNG is not in the normal PNG standard. (it's in MNG)
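
For reference, the LOCO transform (as in MNG) just lifts green out of red and blue, mod 256 ; a sketch :

```c
#include <assert.h>
#include <stdint.h>

/* LOCO reversible color transform : green is sent as-is,
   red and blue are sent as differences from green (mod 256) */
static void loco_fwd(uint8_t r, uint8_t g, uint8_t b,
                     uint8_t *cr, uint8_t *cg, uint8_t *cb) {
    *cg = g;
    *cr = (uint8_t)(r - g);
    *cb = (uint8_t)(b - g);
}

static void loco_inv(uint8_t cr, uint8_t cg, uint8_t cb,
                     uint8_t *r, uint8_t *g, uint8_t *b) {
    *g = cg;
    *r = (uint8_t)(cr + cg);
    *b = (uint8_t)(cb + cg);
}
```

Because the inverse is just two adds per pixel with no branches, it's exactly the kind of fixed transform you can bake into a tight ASM decoder loop, which is the point above.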

4. 2d LZ77

The LZ77 seems to be just Zip for the most part (except with a 1M window - which makes low memory decoding impossible). Instead of LZ77 offsets being linear in row-scanline order, they reserve some special low offsets to be the local 2d neighborhood (eg. your upward neighbor). Again this looks like it should be a big win for the modern world of long-row images (because in linear order, a long row causes the upward location to be a large offset).

Certainly this should be a win for things like text where the same pattern is repeated vertically. (though for text an even bigger win would be LZ-modulo-offsets; eg. figure out that the natural repeat is an 8x16 block and send offsets that are multiples of that). (and hell for text you should just use JBIG or something from 20 years ago that's designed for text).

(aside : there was work ages ago about doing actual rectangular LZ for images. eg. instead of linear byte matches, you transmit a width & height and paste in a rectangle, and instead of just linear offsets you code to do a {dx,dy} offset. I don't believe it ever caught on due to the complexity, but it is theoretically appealing.)

Sadly they have not taken the opportunity to make an LZ that's actually competitive with 1995 (LZX). At the very least any modern LZ should have "last offset". Also for a codec that is pretty much PC-only there's very little reason to not use arithmetic coding. (I say it's PC-only because of the 1MB window and other forms of non-locality; that will preclude use in embedded devices).

Variable min-match-len also seems to be a big win for image data; transmitting MML in the header would have been nice.

5. Color Cache thingy

It's basically a simple cache table and you transmit the index in the cache table (very old compression technique). Meh, curious how much of a win this is. Very bad for load-hit-store platforms.
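
As I read the spec, the cache is a small power-of-two table indexed by a multiplicative hash of the ARGB value ; the decoder inserts every decoded pixel, and the coder can send a cache index instead of literals when the pixel is in the table. A sketch (the 0x1e35a7bd constant is the one in the spec; the 4-bit table size is an arbitrary demo choice) :

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_BITS 4  /* demo size; the spec allows a range of sizes */

typedef struct { uint32_t slots[1 << CACHE_BITS]; } ColorCache;

/* multiplicative hash of the packed ARGB pixel down to a table index */
static uint32_t cache_hash(uint32_t argb) {
    return (0x1e35a7bdu * argb) >> (32 - CACHE_BITS);
}

static void cache_insert(ColorCache *c, uint32_t argb) {
    c->slots[cache_hash(argb)] = argb;  /* newer pixel evicts older */
}

static uint32_t cache_lookup(const ColorCache *c, uint32_t index) {
    return c->slots[index];
}
```

Note the LHS problem : every decoded pixel does a dependent store into the table, and a cache-index symbol does a dependent load right after, which is why it's bad on those platforms.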

6. Multiple palettes

This is a trivial thing, but the ability to define local palettes on 2d regions is surely a big win for certain types of graphic art web images. (eg. in mixed text/color you may have regions of the image that are basically 2 bit) (allowing lossy detection of black&white could make this even better; lossy YCoCg color transform is something I would have liked to see as well).

7. Pixel formats?

The docs are just terrible on this, I can't tell what they support at all (they say "see the VP8 docs" which are even worse). Any modern format should support N channels and N bits per channel (at least N=8,16,32), as well as floating point channels with encodings like LogInt.

Conclusion :

It seems reasonable and there's nothing massively bone-headed in it. (it's much easier working on lossless (than lossy) because you can just measure the size).

For open formats, it's a very good idea to make them flexible (but not complex - there's a difference), because it lets the kiddies go off and write optimizers and find clever tricks. WebP-LL is pretty good in this regard but could have been better (transmit linear combo spatial predictor for example).

The compression ratio is decently close to state of the art (for fast decoders).

If you're going to make yet another image format, it should be more forward-thinking, since formats tend to either be ignored completely or stick around for 20+ years.

Further thoughts :

Image data can be roughly categorized into 3 types :

1. "Natural images" (continuous tone). This type of image is handled very well by linear predictors (BCIF, TMW, GLICBAWLS, CALIC, MRP, etc. etc.).

2. "Text". What I mean by text is few-colors, and with exactly repeating shapes. eg. some graphical symbol may appear several times in the image, just moved, not rotated or fuzzy. This type of image is handled well by large shape context predictors (JBIG) or 2d-LZ.

3. "Graphic art". This type of image has some repeating patterns, maybe gradients or solid color shapes. PNG and WebP-LL both handle this kind of thing reasonably well. The "natural image" compressors generally do poorly on them.

A real modern network image codec should really have 3 modes for these types of data, selected in rectangular regions. (or, like PAQ, simultaneously modeled and mixed).

ADDENDUM : it's important to distinguish between the compressor and the file format. I think webp-ll as a compressor is actually pretty impressive; they have achieved very good space-speed for the complexity level. As a file format it's not so great.

9/04/2012

09-04-12 - LZ4 Optimal Parse

I wrote an optimal parser for LZ4 to have as a reference point.

It's a very easy format to optimal parse, because the state space is small enough that you can walk the entire dynamic programming table. That is, you just make a table which is :


          <------ all file positions ------>
^       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
states  XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
|       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 
V       XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 

and then you just go fill all the slots. Like LZSS (Storer-Szymanski) it's simplest to walk backwards, that way any value in the table that you need from later positions is already computed.

(* see addendum at end)
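
To illustrate the shape of the backwards walk, here's a toy one-state (LZSS-style) version with a naive O(n^2) match finder and a made-up cost model (a literal costs 1 byte, any match costs a flat 3 bytes) ; the real LZ4 parse adds the literal-run-len state dimension described below :

```c
#include <assert.h>

enum { MINMATCH = 3, LITCOST = 1, MATCHCOST = 3 };

/* Toy LZSS-style optimal parse by backwards dynamic programming :
   cost[i] = cheapest number of output bytes to code data[i..n).
   Walking backwards means every cost[i+len] we need is already filled. */
static int lzss_parse_cost(const unsigned char *data, int n) {
    int cost[256 + 1];            /* demo limit : n <= 256 */
    cost[n] = 0;
    for (int i = n - 1; i >= 0; i--) {
        int best = LITCOST + cost[i + 1];        /* option : code a literal */
        for (int s = 0; s < i; s++) {            /* naive match finder */
            int l = 0;
            while (i + l < n && data[s + l] == data[i + l]) l++;
            for (int ml = MINMATCH; ml <= l; ml++) {
                int c = MATCHCOST + cost[i + ml]; /* option : match of len ml */
                if (c < best) best = c;
            }
        }
        cost[i] = best;
    }
    return cost[0];
}
```

A real implementation would of course use a proper string matcher and record the chosen (len,offset) per slot so the parse can be read out forwards.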

For LZ4 the state is just the literal run len (there is no entropy coder; there is no "last offset"; and there are no carried bits between coding events - the way a match is coded is totally independent of what precedes it). I use 16 states. Whenever you code a literal, the state transition is just state++ , when you code a match the transition is always to state = 0.

There is a small approximation in my optimal parse; I don't keep individual states for literal run lens > 15. That means I do measure the cost jump when you go from 14 to 15 literals (and have to output an extra byte), but I don't measure the cost jump when you go from 15+254 to 15+255.
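
For concreteness, the exact byte cost of one LZ4 sequence, per my understanding of the format (one token byte holds a 4-bit literal count and a 4-bit match length; counts of 15 or more spill into extra bytes in steps of 255; a match always costs a 2-byte offset and min match len is 4) :

```c
#include <assert.h>

/* extra bytes needed beyond the 4-bit token field for a count of n */
static int lenbytes(int n) {
    return (n < 15) ? 0 : 1 + (n - 15) / 255;
}

/* total bytes to code litlen literals followed by a match of matchlen
   (matchlen == 0 means a final literal-only sequence) */
static int seq_cost(int litlen, int matchlen) {
    int c = 1 + litlen + lenbytes(litlen);   /* token + literals + len bytes */
    if (matchlen > 0)
        c += 2 + lenbytes(matchlen - 4);     /* offset + match len bytes */
    return c;
}
```

This is where the cost jumps in the parse come from : going from 14 to 15 literals adds an extra length byte on top of the literal byte, and another length byte appears at each further step of 255.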

The optimal parse can be made very very slightly better by using 20 states or so (instead of 16). Then from state 15-20 you count the cost of sending a literal to be exactly 1 byte (no extra cost in control words or literal run len). At state 20 you count the cost to be 1 byte + (1/234) , that is you add in the amortized cost of the 1-1-1-1 code that will be used to send large literal run lengths. While this is better in theory, on my test set I don't get any win from going to more than 16 states.

Without further ado, the numbers :

raw greedy lazy optimal lz4 -c0 lz4 -c1 lz4 -c2
lzt00 16914 6646 6529 6444 7557 6666 6489
lzt01 200000 198906 198901 198900 198922 198918 198916
lzt02 755121 412064 411107 410549 447200 412085 410705
lzt03 3471552 1827696 1821562 1815464 2024928 1827577 1820776
lzt04 48649 17521 16985 16461 21803 17540 16725
lzt05 927796 484749 462204 459700 518392 484764 460905
lzt06 563160 493815 493056 492938 508281 493838 493071
lzt07 500000 269945 266191 264112 300505 269945 265704
lzt08 355400 332459 332201 330793 335890 332476 331470
lzt09 786488 352802 346700 340317 416708 352821 344807
lzt10 154624 16517 15173 14845 21334 16537 15155
lzt11 58524 26719 26060 25749 30148 26737 25848
lzt12 164423 35168 33504 32485 60450 35187 33682
lzt13 1041576 1042758 1042750 1042749 1045665 1042774 1042765
lzt14 102400 57421 56834 56478 59832 57432 56541
lzt15 34664 14755 14055 13995 15702 14776 14078
lzt16 21504 12503 12396 12340 12908 12528 12365
lzt17 53161 24023 23450 23025 28657 24040 23157
lzt18 102400 86880 86109 85614 88745 86899 85675
lzt19 768771 381018 369110 359276 478748 381033 363233
lzt20 1179702 1051478 1047742 1043192 1073769 1051500 1045195
lzt21 679936 203405 196764 192411 244363 203410 194091
lzt22 400000 363832 362390 361524 371121 363845 361748
lzt23 1048576 1040762 1040758 1040623 1049408 1040778 1040717
lzt24 3471552 2391132 2372991 2369040 2426582 2391145 2369896
lzt25 1029744 328271 326682 324107 370278 328295 324206
lzt26 262144 246951 246876 246334 250159 246972 246481
lzt27 857241 478524 429287 425694 531932 478537 430366
lzt28 1591760 468568 455644 437666 580253 468589 445814
lzt29 3953035 2342525 2259916 2230095 2536827 2342537 2235202
lzt30 100000 100394 100394 100394 100410 100410 100410
total 24700817 15110207 14874321 14773314 16157477 15110591 14816193
raw greedy lazy optimal lz4 -c0 lz4 -c1 lz4 -c2

greedy, lazy, optimal are mine. They all use a suffix array for string searching, and thus always find the longest possible match. Greedy just takes the longest match. Lazy considers a match at the next position also and has a very simple heuristic for preferring it or not. Optimal is the big state table described above.

Yann's lz4 -c2 is a lazy parse that seems to go 3 steps ahead with some funny heuristics that I can't quite follow; I see it definitely considers the transition threshold of matchlen from 18 to 19, and also some other stuff. It uses MMC for string matching. His heuristic parse is quite good; I actually suspect that most of the win of "optimal" over "lz4 -c2" is due to finding better matches, not from making better parse decisions.

(Yann's lz4.exe seems to also add a 16 byte header to every file)

See also previous posts on LZ and optimal parsing :

cbloom rants 10-10-08 - 7 - On LZ Optimal Parsing
cbloom rants 10-24-11 - LZ Optimal Parse with A Star Part 1
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 2
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 3
cbloom rants 12-17-11 - LZ Optimal Parse with A Star Part 4
cbloom rants 01-09-12 - LZ Optimal Parse with A Star Part 5
cbloom rants 11-02-11 - StringMatchTest Release + String Match Post Index
cbloom rants 09-27-08 - 2 - LZ and ACB
cbloom rants 08-20-10 - Deobfuscating LZMA
cbloom rants 09-03-10 - LZ and Exclusions
cbloom rants 09-14-10 - A small note on structured data
cbloom rants 06-08-11 - Tech Todos

(*) ADDENDUM :

It turns out you can optimal parse LZ4 without keeping all the states, that is with just a single LZSS style backwards walk and only a 1-wide dynamic programming table. There are several subtle things that make this possible. See the comments on this post : LZ-Bytewise conclusions

old rants