09-16-08 - 1

I've been trying to figure out what's happened in data compression in the last 10 years. Here's my quick run-down :

H.264 : the new video codec; a very simple 4x4 transform used with highly tuned quantizers and predictive coders, and very sophisticated motion estimation and bit allocation. Many different profiles support different bit rates. There are *TONS* of papers on this beast and every aspect has been highly tuned.

PPMii / PPMd / PPMonstr / Durilca : stuff by Dmitry Shkarin. This was a while ago, it's the successor to PPMZ and very good. Some code and papers are available. One of the things that he does that everyone does now is detect data types and do some special munging to make the data more compressible.

PAQ by Matt Mahoney : bitwise coder with lots of context models and weighting, sort of like CTW, but really more like PPMZ. Uses SSE similar to the SEE in PPMZ. Model weighting is well done, uses many types of models such as models with skips. The best compressor at the moment, very slow. Bitwise compressors like PAQ are somewhat easier to write than byte-wise, because you don't have the issue of escapes. Escapes and sparse contexts are really hard to get perfectly right, with bits you always have a 0 and 1, you just need to put a probability on each.

Preprocessors & dictionaries are super common now. Lots of people use a variant of an "LZP" preprocessor, Shkarin and Mahoney both have one. Most of the top compressors will decompress already packed data so they can recompress it, they do an E8E9 transform on exes, and most are conditioned with some training model. (E8E9 transform changes relative jumps to absolute jumps; that way function calls to "strcpy" or whatever always look the same in bytes, so they're much more compressible). Most compressors also detect binary data semi-automatically and use skip-contexts.

ROLZ : aka "reduced offset LZ" ; apparently this is in RKive and many of the Russians are using it, eg. I believe it's in 7-Zip. I haven't found any actual papers or good descriptions of how they do it exactly, but the basic idea is to do something like an LZP to build a list of only a few possible match indexes using context prediction, write a code to select one of those with context modeling.

There are a bunch of big packages with source code, such as 7-Zip, LZMA, FreeArc, BALZ, and I'm sure many more. While this is cool, they seem to be huge messes with little or poor documentation. Holy crap the Russians write really god awful unreadable code. A lot of them look they were intentionally written for an obfuscated C coding contest, stuff like :

m = H(++*p, i & ( -i >> DS )) && *++p | ( dec(m.[(i>>MS)&0xFF) + t );
These are some actual lines of code from Shkarin's PPMD (and seriously, he is definitely one of the cleanest Russian coders) :
    SubRange.HighCount=(SubRange.scale += (SubRange.LowCount=LoCnt));
        psee2c += 2*(2*NumStats < t+NumMasked)+Flags;
        (FoundState=p)->Freq=(HiCnt += 4);  SummFreq += 4;

In the super high lossless compression area, there's a bunch of talk in the Russian forums such as encode.ru . It's super hard to actually find the good information in there though.

Lapped image transforms. Malvar's page at MS is a good place to start on this. Basically improves quality & compression of block based image coding by using basis functions that extend past the blocks. An alternative viewpoint is that a Lifting based smearing filter is run on the blocks at decompress, and the inverse filter is run at compress.

There's some new interesting stuff on blocksort. Jurgen Abel after BWT , and Michael Maniscalco has a bunch of stuff.

For lossless image compression, it seems to be stuff like GLICBALS and MRP , basically still DPCM but with rigorous linear predictor training.

There are some new arithmetic coders. One I found is "rc.c" in the MRP distribution which is called the "Russian Range coder". It's slightly different than the standard Schindler implementation that I use, but seems to preform almost identically. Another one is coder used by PAQ, the best one seems to be "fpaq0p". Again it's a very slightly different binary arithcoder implementation than the one I use but seems to perform almost identically. The big new one is the strange paper by Jarek Duda called "Asymmetric Binary System". The paper is totally unreadable IMO, it's more useful to try to find people's notes on it around the web, such as this .

No comments:

old rants