11/20/2008

11-20-08 - DXTC Part 3

So we know DXT1 is bad compared to a variable bitrate coder, but it is a small block fixed rate coder, and those two constraints are pretty hard to deal with.

Small block inherently gives up a lot of coding capability, because you aren't allowed to use any cross-block information. H264 and HDPhoto are both small block (4x4) but both make very heavy use of cross-block information for either context coding, DC prediction, or lapping. Even JPEG is not a true small block coder because it has side information in the Huffman table that captures whole-image statistics.

Fixed bitrate blocks inherently give up even more. It kills your ability to do any rate-distortion type of optimization. You can't allocate bits where they're needed. You might have images with big flat sections where you are actually wasting bits (you don't need all 64 bits for a 4x4 block), and then you have other areas that desperately need a few more bits, but you can't give them to them.

So, what if we keep ourselves constrained to the idea of a fixed size block and try to use a better coder? What is the limit on how well you can do with those constraints? I thought I'd see if I could answer that reasonably quickly.

What I made is an 8x8 pixel fixed rate coder. It has zero side information, eg. no per-image tables. (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding to 4 bits per pixel (the same as DXT1) so that I can compare RMSE directly, which is a 32 byte block for 8x8 pixels. It also works pretty well at 24 byte blocks (which is 1 bit per byte of the raw 24-bit RGB, ie. 3 bits per pixel), or 64 byte blocks for high quality, etc.

This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no subsampling of chroma, no huffman table, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.

Obviously going to 8x8 instead of 4x4 is a big advantage, but it seems like a more reasonable size for future hardware. To really improve the coding significantly on 4x4 blocks you would have to start using something like VQ with a codebook which hardware people don't like.

In the table below you'll see that CodeTree and CodeLinear generally provide a nice improvement on the natural images, about 20%. In general they're pretty close to half way between DXTC and the full image coder "cbwave". They have a different kind of perceptual artifact when they have errors - unlike DXTC, which just makes things really blocky, these get the halo ringing artifacts like JPEG (that's inherent in truncating DCT's).

The new coders do really badly on the weird synthetic images from bragzone, like clegg, frymire and serrano. I'd have to fix that if I really cared about these things.

One thing that is encouraging is that this coder does *very* well on the simple synthetic images, like the "linear_ramp" and the "pink_green" and "orange_purple". I think these synthetic images are a lot like game lightmaps, and the new schemes are near lossless on them.

BTW image compression for paging is sort of a whole other issue. For one thing, a per-image table is perfectly reasonable to have, and you could work on something like 32x32 blocks. But more important is that in the short term you still need to provide the image in DXT1 to the graphics hardware. So you either just have to page the data in DXT1 already, or you have to recompress it, and as we've seen here the "real-time" DXT1 recompressors are not high enough quality for ubiquitous use.

ADDENDUM I forgot but you may have noticed the "ryg" in this table is also not the same as previous "ryg" - I fixed a few of the little bugs and you can see the improvement here. It's still not competitive, I think there may be more errors in the best fit optimization portion of the code, but he's got that code so optimized and obfuscated I can't see what's going on.

Oh, BTW the "CB" in this table is different than the previous table; the one here uses 4-means instead of 2-means, seeded from the pca direction, and then I try using each of the 4 means as endpoints. It's still not quite as good as Squish, but it's closer. It does beat Squish on some of the more degenerate images at the bottom, such as "linear_ramp". It also beats Squish on artificial tests like images that contain only 2 colors. For example on linear_ramp2 without optimization, 4-means gets 1.5617 while Squish gets 2.3049 ; most of that difference goes away after annealing though.

RMSE per pixel :

file, Squish opt, Squish, CB opt, CB, ryg, D3DX8, FastDXT, cbwave KLTY, CodeTree, CodeLinear
kodim01.bmp 8.2808 8.3553 8.3035 8.516 8.9185 9.8466 9.9565 2.6068 5.757835 5.659023
kodim02.bmp 6.1086 6.2876 6.1159 6.25 6.8011 7.4308 8.456 1.6973 4.131007 4.144241
kodim03.bmp 4.7804 4.9181 4.7953 4.9309 5.398 6.094 6.4839 1.3405 3.369018 3.50115
kodim04.bmp 5.6913 5.8116 5.7201 5.8837 6.3424 7.1032 7.3189 1.8076 4.254454 4.174228
kodim05.bmp 9.6472 9.7223 9.6766 9.947 10.2522 11.273 12.0156 2.9739 6.556041 6.637885
kodim06.bmp 7.1472 7.2171 7.1596 7.3224 7.6423 8.5195 8.6202 2.0132 5.013081 4.858232
kodim07.bmp 5.7804 5.8834 5.7925 5.9546 6.3181 7.2182 7.372 1.4645 3.76087 3.79437
kodim08.bmp 10.2391 10.3212 10.2865 10.5499 10.8534 11.8703 12.2668 3.2936 6.861067 6.927792
kodim09.bmp 5.2871 5.3659 5.3026 5.4236 5.7315 6.5332 6.6716 1.6269 3.473094 3.479715
kodim10.bmp 5.2415 5.3366 5.2538 5.3737 5.7089 6.4601 6.4592 1.7459 3.545115 3.593297
kodim11.bmp 6.7261 6.8206 6.7409 6.9128 7.3099 8.1056 8.2492 1.8411 4.906141 4.744971
kodim12.bmp 4.7911 4.8718 4.799 4.9013 5.342 6.005 6.0748 1.5161 3.210518 3.231271
kodim13.bmp 10.8676 10.9428 10.9023 11.2169 11.6049 12.7139 12.9978 4.1355 9.044009 8.513297
kodim14.bmp 8.3034 8.3883 8.3199 8.5754 8.8656 9.896 10.8481 2.4191 6.212482 6.222196
kodim15.bmp 5.8233 5.9525 5.8432 6.0189 6.3297 7.3085 7.4932 1.6236 4.3074 4.441998
kodim16.bmp 5.0593 5.1629 5.0595 5.1637 5.5526 6.3361 6.1592 1.546 3.476671 3.333637
kodim17.bmp 5.5019 5.6127 5.51 5.6362 6.0357 6.7395 6.8989 1.7166 4.125859 4.007367
kodim18.bmp 7.9879 8.0897 8.0034 8.225 8.6925 9.5357 9.7857 2.9802 6.743892 6.376692
kodim19.bmp 6.5715 6.652 6.5961 6.7445 7.2684 7.9229 8.0096 2.0518 4.45822 4.353687
kodim20.bmp 5.4533 5.5303 5.47 5.5998 5.9087 6.4878 6.8629 1.5359 4.190565 4.154571
kodim21.bmp 7.1318 7.2045 7.1493 7.3203 7.6764 8.4703 8.6508 2.0659 5.269787 5.05321
kodim22.bmp 6.43 6.5127 6.4444 6.6185 7.0705 8.0046 7.9488 2.2574 5.217884 5.142252
kodim23.bmp 4.8995 5.0098 4.906 5.0156 5.3789 6.3057 6.888 1.3954 3.20464 3.378545
kodim24.bmp 8.4271 8.5274 8.442 8.7224 8.9206 9.9389 10.5156 2.4977 7.618436 7.389021
clegg.bmp 14.9733 15.2566 15.1516 16.0477 15.7163 21.471 32.7192 10.5426 21.797655 25.199576
FRYMIRE.bmp 10.7184 12.541 11.9631 12.9719 12.681 16.7308 28.9283 6.2394 21.543401 24.225852
LENA.bmp 7.138 7.2346 7.1691 7.3897 7.6053 8.742 9.5143 4.288 7.936599 8.465576
MONARCH.bmp 6.5526 6.6292 6.5809 6.7556 7.0313 8.1053 8.6993 1.6911 5.880189 5.915117
PEPPERS.bmp 6.3966 6.5208 6.436 6.6482 6.9006 8.1855 8.8893 2.3022 6.15367 6.228315
SAIL.bmp 8.3233 8.3903 8.3417 8.5561 8.9823 9.7838 10.5673 2.9003 6.642762 6.564393
SERRANO.bmp 6.3508 6.757 6.5572 6.991 7.0722 9.0549 18.3631 4.6489 13.516339 16.036401
TULIPS.bmp 7.5768 7.656 7.5959 7.8172 8.0101 9.3817 10.5873 2.2228 5.963537 6.384049
lena512ggg.bmp 4.8352 4.915 4.8261 4.877 5.1986 6.0059 5.5247 2.054319 2.276361
lena512pink.bmp 4.5786 4.6726 4.581 4.6863 5.0987 5.8064 5.838 3.653436 3.815336
lena512pink0g.bmp 3.7476 3.8058 3.7489 3.8034 4.2756 5.0732 4.8933 4.091045 5.587278
linear_ramp1.BMP 1.4045 2.1243 1.3741 1.6169 2.0939 2.6317 3.981 0.985808 0.984156
linear_ramp2.BMP 1.3377 2.3049 1.3021 1.5617 1.9306 2.5396 4.0756 0.628664 0.629358
orange_purple.BMP 2.9032 3.0685 2.9026 2.9653 3.2684 4.4123 7.937 1.471407 2.585087
pink_green.BMP 3.2058 3.3679 3.2 3.2569 3.7949 4.5127 7.3481 1.247967 1.726312

11/18/2008

11-18-08 - DXTC Part 2

First of all, let's dispel the illusion that some people have that DXT1 is "pretty good". DXT1 is fucking awful. It compresses at 1.33333 bits per byte (4 bits per pixel). That's very large as far as image compressors are concerned. For typical images, around 4.0 bpb is true lossless, around 1.5 is perceptually lossless, and around 0.5 is "very good". In fact wavelet compressors can get as low as 0.1 bpb and achieve about the same quality as DXT1. Despite this I've heard smart people saying that "DXT1 is pretty good". Yes, it is a convenient fixed size block, and yes the decoder is extremely fast and simple, but as far as quality is concerned it is not even close. At 1.33333 it should be near lossless.

Here are some numbers on various DXT1 compressors. The numbers in the table are the RMSE (sqrt of the L2 error). The far right column is a wavelet compressor for comparison; it's not the best wavelet compressor in the world by a long shot, it's "cbwave" which is very old and which I designed for speed, not maximum quality. In any case it gives you an idea how far off DXT1 is. (BTW I always try to show results in RMSE because it is linear in pixel magnitude, unlike MSE or PSNR which is a very weird nonlinear scale). More discussion after the table...

(ADD: the "ryg" results shown here have a bug and should be ignored; see DXTC Post Summary )

file, Squish opt, Squish, CB opt, CB, ryg, D3DX8, FastDXT, cbwave
kodim01.bmp 8.2808 8.3553 8.352 8.6924 9.374 9.8466 9.9565 2.6068
kodim02.bmp 6.1086 6.2876 6.1287 6.3025 7.52 7.4308 8.456 1.6973
kodim03.bmp 4.7804 4.9181 4.8312 5.0225 5.855 6.094 6.4839 1.3405
kodim04.bmp 5.6913 5.8116 5.7285 5.9394 6.9408 7.1032 7.3189 1.8076
kodim05.bmp 9.6472 9.7223 9.707 10.112 10.8934 11.273 12.0156 2.9739
kodim06.bmp 7.1472 7.2171 7.1777 7.44 8.1005 8.5195 8.6202 2.0132
kodim07.bmp 5.7804 5.8834 5.8379 6.0583 6.8153 7.2182 7.372 1.4645
kodim08.bmp 10.2391 10.3212 10.346 10.747 11.3992 11.8703 12.2668 3.2936
kodim09.bmp 5.2871 5.3659 5.3306 5.5234 5.9884 6.5332 6.6716 1.6269
kodim10.bmp 5.2415 5.3366 5.2777 5.4633 5.9377 6.4601 6.4592 1.7459
kodim11.bmp 6.7261 6.8206 6.7643 7.0216 7.8221 8.1056 8.2492 1.8411
kodim12.bmp 4.7911 4.8718 4.8204 4.9863 5.6651 6.005 6.0748 1.5161
kodim13.bmp 10.8676 10.9428 10.925 11.4237 12.402 12.7139 12.9978 4.1355
kodim14.bmp 8.3034 8.3883 8.3398 8.6722 9.4258 9.896 10.8481 2.4191
kodim15.bmp 5.8233 5.9525 5.8568 6.0862 6.6749 7.3085 7.4932 1.6236
kodim16.bmp 5.0593 5.1629 5.0863 5.2851 5.8093 6.3361 6.1592 1.546
kodim17.bmp 5.5019 5.6127 5.5313 5.7358 6.4975 6.7395 6.8989 1.7166
kodim18.bmp 7.9879 8.0897 8.0192 8.3716 9.7744 9.5357 9.7857 2.9802
kodim19.bmp 6.5715 6.652 6.6692 6.91 8.0128 7.9229 8.0096 2.0518
kodim20.bmp 5.4533 5.5303 5.4895 5.6864 6.3457 6.4878 6.8629 1.5359
kodim21.bmp 7.1318 7.2045 7.1724 7.4582 8.1637 8.4703 8.6508 2.0659
kodim22.bmp 6.43 6.5127 6.4644 6.7137 7.8264 8.0046 7.9488 2.2574
kodim23.bmp 4.8995 5.0098 4.9244 5.0906 5.6989 6.3057 6.888 1.3954
kodim24.bmp 8.4271 8.5274 8.4699 8.8564 9.3906 9.9389 10.5156 2.4977
clegg.bmp 14.9733 15.2566 15.1755 15.7168 16.3563 21.471 32.7192 10.5426
FRYMIRE.bmp 10.7184 12.541 12.132 12.8278 12.989 16.7308 28.9283 6.2394
LENA.bmp 7.138 7.2346 7.1763 7.4264 8.1203 8.742 9.5143 4.288
MONARCH.bmp 6.5526 6.6292 6.5949 6.846 7.5162 8.1053 8.6993 1.6911
PEPPERS.bmp 6.3966 6.5208 6.4557 6.677 7.3618 8.1855 8.8893 2.3022
SAIL.bmp 8.3233 8.3903 8.3598 8.6627 9.8685 9.7838 10.5673 2.9003
SERRANO.bmp 6.3508 6.757 6.8385 7.9064 7.5303 9.0549 18.3631 4.6489
TULIPS.bmp 7.5768 7.656 7.6146 7.8786 8.4084 9.3817 10.5873 2.2228

Back to comparing the DXT1 encoders. BTW the test set here is the Kodak image set plus the Waterloo Bragzone image set. The Kodak set is all photographs that are pretty noisy, and there's not a huge difference in the coders. The Bragzone image set has some synthetic images with things like gradients which are harder to compress well, and there you can really dramatically see the bad encoders fall apart. In particular if you look at the results on "clegg" and "frymire" and "serrano" you can see how bad the "FastDXT" coder is.

The "Squish" in the table is the iterative cluster fit with uniform weighting. All coders work on linear RGB error; and the MSE is mean per pixel not per component.

The "CB" encoder is a simple 2-means fit. I seed the means by doing the PCA to find the best fit line. I put a plane perpendicular to that line through the average and take all the points on each half to be the two clusters, average the cluster to seed the means, and then iteratively refine the means by reclustering. Once I have the 2-means I do a simple search to find the best 565 DXT end points to find the two means. There are 3 cases to try :

1. put c0 and c1 = the two means

2. put c2 and c3 = the two means (so c0 and c1 are pushed out)

3. make (c0,c2) straddle mean 1 and (c3,c1) straddle mean 2 - this is best for gaussian clusters around the mean

A 2-means like this is slightly better than doing a pca "range fit" like Simon's fast method or the "ryg" method. If the data was all Gaussian noise, they would be equivalent, but of course it's not. You often get blocks that have a bunch of pixels at the low end which are all exactly the same color ( for example, all perfect black), and then a spread of a bunch of junk at the high end (like some orange, some yellow, etc.). You want to put one end point exactly on perfectly black and put the other endpoint at the center of the cluster on the high end.
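For reference, here's a rough sketch of that 2-means refinement (an illustration only, not the actual coder's code; the PCA axis and the block average are assumed to be computed elsewhere) :

// V3 is just a float RGB triple ; helpers are the obvious componentwise ops
struct V3 { float x,y,z; };
static V3 vadd(V3 a, V3 b) { V3 r = {a.x+b.x, a.y+b.y, a.z+b.z}; return r; }
static V3 vsub(V3 a, V3 b) { V3 r = {a.x-b.x, a.y-b.y, a.z-b.z}; return r; }
static V3 vscale(V3 a, float s) { V3 r = {a.x*s, a.y*s, a.z*s}; return r; }
static float vdot(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// colors : the 16 pixels of the block ; axis : PCA direction ; avg : block average
static void TwoMeans(const V3 colors[16], V3 axis, V3 avg, V3 * pMean0, V3 * pMean1)
{
    V3 mean0 = avg, mean1 = avg;

    for(int iter=0; iter<4; iter++)     // a few reclustering passes is plenty
    {
        V3 sum0 = {0,0,0}, sum1 = {0,0,0};
        int n0 = 0, n1 = 0;
        for(int i=0; i<16; i++)
        {
            // pass 0 : split by the plane through avg perpendicular to the PCA axis
            // later passes : recluster by distance to the current means
            bool inCluster1;
            if ( iter == 0 )
                inCluster1 = vdot( vsub(colors[i],avg), axis ) > 0;
            else
            {
                V3 d0 = vsub(colors[i],mean0), d1 = vsub(colors[i],mean1);
                inCluster1 = vdot(d1,d1) < vdot(d0,d0);
            }
            if ( inCluster1 ) { sum1 = vadd(sum1,colors[i]); n1++; }
            else              { sum0 = vadd(sum0,colors[i]); n0++; }
        }
        if ( n0 ) mean0 = vscale(sum0, 1.f/n0);
        if ( n1 ) mean1 = vscale(sum1, 1.f/n1);
    }

    *pMean0 = mean0; *pMean1 = mean1;
}

The two means are then mapped onto actual 565 end points by the little search over the 3 cases listed above.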

"CB opt" and "Squish opt" take the results of the CB and Squish compressors and then improve them by iterative search. Simon Brown on his page mentions something about doing greedy endpoint refinement but claims it "doesn't converge because the indeces change". That's silly, of course it converges.

To do a greedy search : try wiggling one or both end points in 565 space. Find new best index fit for the new end points. Measure new error. If the error is lower, take the step.

Of course that works and it does converge and it's pretty simple and improves the encoding after Squish.

In fact you can do even better by doing simulated annealing instead of a plain greedy search. We should know by now that any time we have a greedy search like this where we can measure the error and it can get stuck in local minima, we can improve it with something like simulated annealing. I use 256 steps of annealing with a sinusoidal decreasing temperature pattern. I randomly pick a way to wiggle the two end points (you need to consider wiggling both, not just single wiggles). I try the wiggle and measure the error delta. Negative errors (improvements) are always taken, positive errors are taken probabilistically based on the temperature. This is what was done in the "opt" coders above.
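Here's a minimal sketch of that accept/reject loop (not the actual code; BlockError and Wiggle565 are hypothetical helpers, and the exact temperature schedule and scale here are made up). The plain greedy search is the same loop with the probabilistic accept removed :

#include <math.h>
#include <stdlib.h>

// hypothetical helpers, not shown :
extern int BlockError(unsigned short c0, unsigned short c1, const unsigned char block[16][4]);
extern unsigned short Wiggle565(unsigned short c);  // random +-1 step in one of the 5/6/5 fields

void AnnealEndPoints(unsigned short * pC0, unsigned short * pC1, const unsigned char block[16][4])
{
    unsigned short c0 = *pC0, c1 = *pC1;
    int curErr = BlockError(c0,c1,block);

    for(int step=0; step<256; step++)
    {
        // sinusoidally decreasing temperature (the scale here is arbitrary)
        double T = 8.0 * 0.5 * ( 1.0 + cos( 3.14159265 * step / 256.0 ) ) + 0.05;

        // wiggle one or both end points ; a real version randomly picks which
        unsigned short n0 = Wiggle565(c0), n1 = Wiggle565(c1);

        int newErr = BlockError(n0,n1,block);
        int delta = newErr - curErr;

        // improvements are always taken ; losses are taken with probability exp(-delta/T)
        if ( delta <= 0 || ( rand() / (double)RAND_MAX ) < exp( - delta / T ) )
        {
            c0 = n0; c1 = n1; curErr = newErr;
        }
    }

    *pC0 = c0; *pC1 = c1;
}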

Most of the time the annealing search just makes a small improvement, but once in a while (such as on "frymire" and "serrano") it really finds something remarkable and makes a big difference.

Simon's cluster fit alg is very sweet. I didn't really understand it at first from the description of the code, so I'm going to redescribe it for myself here.

The basic idea is you find an assignment of indices, and from that solve a best fit to give you the end points for those indices. If you think about all the colors in the block living in 3d color space, assigning indices is like giving labels to each point. All the points with the same label are a "cluster". Then each cluster is fit with either an end point or one of the interpolated points.

Once you know the clusters, putting the best two end points is a simple linear optimization. You can solve the least squares analytically, it's not like a matrix iteration least squares problem or anything.
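To make that concrete, here's the shape of the analytic solve (my notation; it assumes 4-color mode and that the clusters are labeled 0-3 in order along the line from c0 to c1). Each label carries a weight pair (a,b) from {(1,0), (2/3,1/3), (1/3,2/3), (0,1)}, and you minimize sum_i (a_i*A + b_i*B - x_i)^2 per channel, which is just a 2x2 system :

// weights for the 4 palette slots, in order along the line c0 .. c1
static const float w[4][2] = { {1.f,0.f}, {2.f/3,1.f/3}, {1.f/3,2.f/3}, {0.f,1.f} };

// labels[16] : cluster label per pixel ; x[16] : one color channel of the block
// outputs A (the c0 end) and B (the c1 end) for that channel
static void SolveEndPoints(const int labels[16], const float x[16], float * pA, float * pB)
{
    float aa=0, ab=0, bb=0, ax=0, bx=0;
    for(int i=0; i<16; i++)
    {
        float a = w[labels[i]][0], b = w[labels[i]][1];
        aa += a*a; ab += a*b; bb += b*b;
        ax += a*x[i]; bx += b*x[i];
    }

    // normal equations :  [ aa ab ] [A]   [ax]
    //                     [ ab bb ] [B] = [bx]
    float det = aa*bb - ab*ab;
    if ( det == 0.f ) { *pA = *pB = x[0]; return; }   // degenerate : everything in one cluster

    *pA = ( bb*ax - ab*bx ) / det;
    *pB = ( aa*bx - ab*ax ) / det;
}

In the real thing you solve all three channels with the same 2x2 matrix and then quantize A and B to 565, which is part of why the result is only approximately optimal, as noted below.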

So the trick is how to decide the indices. If you tried them all, it would be 4^16 ways, which is way too many. So what Simon does is create a linear ordering of the points using the PCA axis, then try all clusterings that can be created by planes perpendicular to that linear axis. That's equivalent to just trying all groupings of the indices sorted by their dot along the PCA axis. That is, the clusters are

[0,i) , [i,j) , [j,k) , [k,16)

for all i,j,k
which is a manageable amount to search, and gives you most of the interesting clusterings that you care about. Something that might improve Squish is trying a few different axes and picking the best.

BTW this end point optimization is very approximate. One issue is that it assumes the best index for each point doesn't change. It also of course just uses real number arithmetic to make the 1/3 points, not the actual integer rounding that's done. Those factors are actually pretty big.

11-18-08 - DXTC

(for links to all posts in this series see DXTC Post Summary )

I've been doing a little work on DXT1 compression. The main reason I'm looking at it is because I want to do something *better* than DXT1, and before I do that, I want to make sure I'm doing the DXTC right.

For now let's gather prior art :

The only real decision the coder gets to make is what the two 565 endpoints are. The other details you just have to get right - reconstruct the palette interpolants right, and assign the 16 indices to the closest palette entries right. For the moment I don't care about speed, I just want to make sure the indices are actually the best choices.

All the papers by Ignacio and J.M.P. van Waveren are modern classics. One thing of note : the "simplified" index finder that JMP has in his original paper is wrong. On page 12 he says "Evaluating the above expression reveals that the sub expression ( !b3 & b4 ) can be omitted because it does not significantly contribute to the final result" - apparently that is not true, because when I use the "rewritten" index finder it produces bad indexes. His original version of the index finder bit twiddle is fine.

One thing I've found with the index finding is that the integer and nonlinear effects are pretty strong and any ideas you have about planar geometry fail if you assume that your 4 points are colinear (they are not actually). For example, an obvious idea is to do a dot product test to pick between the 4 points. First dot product with the middle separating plane, then the two planes between the next pairs of points. This doesn't work. Part of the problem is because of the bad rounding in the generation of the 1/3 point. The JMP method is actually comparing vs the true distances, so that part is good.


In more detail, an obvious idea is to do something like this :

int dot = (color - c0) * (c1 - c0);

int bmid = (dot > mid01);
int b02 = (dot > mid02);
int b31 = (dot > mid31);
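// (mid01, mid02, mid31 are presumably precomputed dot values at the three separating
//  planes : the middle of the whole c0-c1 segment, between c0 & c2, and between c3 & c1)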

// using the DXT1 palette order label 0,2,3,1

int threebit = (bmid<<2) + (b02<<1) + b31;

int index = remap[threebit];

// remap is an 8-entry table that remaps from 3 bits to 2 bits to give an index

That "works" just as much as the "ryg" code works or the "Extreme DXT" method of finding indexes work - which is to say that it doesn't work.

FastDXT appears to be an implementation of the id "Real Time DXT" paper.

Ericsson Texture Compression (ETC2) is similar to DXTC but different; this is a pretty good paper, there are some interesting ideas here like the T and H blocks. It gets slightly better quality in some cases. It's obvious that you can beat DXTC by having a base color and a log-scaled delta color, rather than two end points. Using two 565 end points is wasteful; you could for example do a 777 base color and a 444 log scaled delta.

Tight Frame Normal Map Compression is similar to DXT5 (really to 3Dc) with some small improvement on normal maps.

Extreme DXT Compression by Peter Uliciansky is indeed super fast. However, it has some errors. The division method that he uses to assign the indices is not correct. You can improve it by offsetting the end points correctly to put the division quantization boundaries in the right place, but even then it's still not exactly correct unless the diagonal of the color bbox is along {1,1,1}. The error from this is pretty small and he's going for max speed, so it's semi reasonable what he does. Note that using the full range of the bbox for the division but insetting it to make the dxt1 colors (as he does) improves the indices slightly, but it's not actually the correct range scale.

In more detail :

To find indices for the DXTC colors c0 and c1

You should use

base = c0 + (c0 - c1)/6

range = (c1 - c0) * 4/3

index = (color - base) * 4 / range;

or

index = (color - base) * 3 / (c1 - c0);

not :

index = (color - c0) * 4 / (c1 - c0);

But even if you do that it's still wrong because the planes through color space do not actually separate the colors from their closest palette entry.
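In code, the corrected division looks something like this (just a sketch of the formula above, working with the projection onto the c0-c1 axis and scaled to stay in integers; as just said, it's still only approximately right) :

// c0, c1, color are RGB triples (c0,c1 already expanded from 565 back to 888)
int dr = c1[0]-c0[0], dg = c1[1]-c0[1], db = c1[2]-c0[2];
int axisLen2 = dr*dr + dg*dg + db*db;       // |c1-c0|^2 ; guard against c0 == c1 !
int proj = (color[0]-c0[0])*dr + (color[1]-c0[1])*dg + (color[2]-c0[2])*db;

// offset by half a step so the bin boundaries land at the midpoints between
// palette entries, then divide by the step size (= axisLen2/3)
int linear = ( proj * 3 + axisLen2/2 ) / axisLen2;          // 0..3 along the segment
if ( linear < 0 ) linear = 0; else if ( linear > 3 ) linear = 3;

// remap from position along the segment {c0, 1/3, 2/3, c1} to DXT1 index order
static const int remap[4] = { 0, 2, 3, 1 };
int index = remap[linear];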

MSDN compressed format documentation now has the right rounding for the 1/3 points, but the old DirectX8 docs have the wrong rounding. The old docs said


color_2 = (2*color_0 + 1*color_1 + 1)/3
color_3 = (1*color_0 + 2*color_1 + 1)/3

Which is actually better, but is not what the standard or the hardware does.

Simon Brown's Squish is very good. His Cluster Fit idea is elegant and clever and produces quite good quality.

There's a thread on MollyRocket about DXTC with some convenient code from "ryg". It's nice simple code, but it has some errors. The way he finds the indices is the dot product method which is wrong. Also the way he computes the 1/3 and 2/3 colors for the palette using LerpRGB is wrong, it rounds differently than the hardware really does.

One thing that some simple code gets wrong is the way to quantize from 24 bit to 565. You can't just take the top bits, because the 565 values are not reconstructed to the middle of their bins. You need to correct for the offsetting. You can either use the true range, which is something like -4 to 259, or you can skew the bits in the quantizer. The "ryg" code has a perfect quantizer that's very neat in the "As16Bit" function. An example is that the value 250 should map to 30 in 5 bits, not 31.
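A correct round-to-nearest quantizer is tiny; here's a sketch of the idea (not ryg's actual As16Bit code), which gets the 250 example right :

// quantize an 8-bit value to 5 or 6 bits such that the value reconstructed by
// bit replication is as close as possible ; note this is NOT just (v >> 3)
static inline int QuantTo5(int v)  { return ( v * 31 + 127 ) / 255; }
static inline int QuantTo6(int v)  { return ( v * 63 + 127 ) / 255; }

// eg. QuantTo5(250) = 30 : 30 reconstructs to 247, 31 reconstructs to 255,
// and 250 is closer to 247 ; plain truncation (250 >> 3) would give 31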

BTW the YCoCg-DXT5 method is pretty good on current hardware, but I'm not really looking at it. It's obviously inherently wasteful, it's just a way of making use of the current hardware, but it's not really what you'd want to be doing. It's also very large, at around 2.6 bits per byte (8 bits per pixel).

More advanced topics to come.

11/11/2008

11-11-08 - REYES

Gritz' course on Renderman : To Infinity and Beyond is the best reference I've found on REYES. I'm still curious about the details of how to efficiently do the part from micropolygons to pixel colors.

The brute force way takes a micropolygon and puts a bound on it that cover its entire temporal sweep across the current frame. That is then binned into all the pixels it touches. In each pixel, 16 (or so) supersampling locations are checked with precomputed jittered sample locations and times. At each spot where a sample is found, the color & Z are added to a list for later blending. Note with fast moving objects and depth of field, it can get really inefficient, because micropolygons have to get tested in many pixels.

The Pixar Library is a nice collection of papers. There's a new paper I hadn't seen on point-based GI that's based on the Bunnell disk method for realtime GI but enhanced for better quality. It seems extremely simple and the results look good. It's not super accurate in terms of solving the rendering equation with minimal error, but that's not really what we want anyway. Though actually now that I think about it it's just a variant on the ancient "render the world from each surfel's point of view" method for GI and doesn't really add much.

Anyway, back to REYES. Ignoring temporal sampling, the spatial sampling stuff can all be done with a normal rasterizer I'm pretty sure. The micropolygon subdivision and binning is just your fragment rasterizer (a rasterizer is just a device that creates micropolygons that are at most pixel size and makes a grid that's aligned with the screen grid). When you decide a fragment goes in a pel, you don't just stuff the color in the frame buffer, instead you stick it in the list to sample against the stochastic supersampled grid test just like REYES.

But it occurs to me that if you're building up a list of fragments in each pel like this, you may as well do analytic coverage rather than stochastic sampling. When you rasterize your triangles, compute the area of each pel that's covered. This can be done efficiently and incrementally, just like a Wu antialiased line drawer, for the case of only one triangle edge intersecting a pel; if 2 or 3 edges intersect one pel it's more work.

Add the fragment to the pel list, consisting of {color, alpha, Z, area}. Once you have all these fragments sort them back-to-front and accumulate into the framebuffer. Fragments that have the same Z are first gathered together before being alpha-blended onto the current value in the frame buffer. That means that two triangles that share an edge act like a continuous surface and will correctly give you 100% area covers and will correctly not alpha blend with each other. This gives you exact anti-aliasing with alpha blending.
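Here's one possible reading of that resolve step in code (very much a sketch; a real version would merge fragments with a Z tolerance rather than exact equality, and would have to be careful about coverage summing past 1) :

struct Fragment { float r,g,b,a; float z; float area; };   // area = fraction of the pel covered

// frags[] is the per-pel fragment list, already sorted back to front ;
// out[] is the premultiplied RGBA accumulated so far for this pel
static void ResolvePel(const Fragment * frags, int count, float out[4])
{
    int i = 0;
    while( i < count )
    {
        // gather the run of fragments at the same depth : they're treated as one
        // continuous surface, so their coverage areas just add
        float r=0, g=0, b=0, a=0;
        int start = i;
        while( i < count && frags[i].z == frags[start].z )
        {
            float w = frags[i].area * frags[i].a;       // coverage-weighted alpha
            r += frags[i].r * w;  g += frags[i].g * w;  b += frags[i].b * w;  a += w;
            i++;
        }

        // composite the gathered surface over what's already accumulated behind it
        out[0] = r + out[0]*(1-a);
        out[1] = g + out[1]*(1-a);
        out[2] = b + out[2]*(1-a);
        out[3] = a + out[3]*(1-a);
    }
}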

11-11-08 - Mapped Drives

Is there a way to make "Disconnected Network Drives" try to connect from DOS or from code? I'm talking about when you "net use" to make a network path to a drive letter, but you reboot or something so it's disconnected. It does a sort of lazy connect, where if you go click on it in Explorer it will hook right up, but if you go try to dir it from a CLI before you explore to it, you can't get to it.

I'd like to make my CLI file copy utils do something like "if target is a disconnect network drive, wake it up first".


The answer seems to lie somewhere in the functions RealDriveType, IsNetDrive, and WNetRestoreConnection.

These are functions that the DOJ forced MS to expose and the documentation is total balls. This page has better info than MSDN :

http://vbnet.mvps.org/index.html?code/network/isnetdrive.htm

I'm seeing some weird things with them though. For one thing, IsNetDrive is not a light query. It actually stalls and tries to wake up the drive right there, so most of the time all you have to do is call IsNetDrive and that actually will restore the connection. IsNetDrive can stall out for a very long time.

IsNetDrive also lies, as that page I linked details.

I haven't gotten WNetRestoreConnection to do anything useful. It always tells me "The local device name is already in use".

WNetRestoreConnection also pops up an interactive dialog box when it has errors which is fucking evil.


So ... in every case I've tried so far IsNetDrive actually refreshes the connection. I haven't yet found a need to actually call WNetRestoreConnection, which is good because it doesn't seem to work.
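A minimal sketch of the workaround, for what it's worth (this isn't the actual copy util code; binding to IsNetDrive dynamically like this is just one way to get at it, and on older versions of Windows the export may only be available by ordinal - see the page linked above) :

#include <windows.h>

typedef int (WINAPI *IsNetDrive_t)(int iDrive);

// expects an uppercase drive letter ; returns nonzero if the drive is (now) a
// connected network drive. The useful side effect is that the call itself stalls
// and tries to restore the connection.
static int WakeMappedDrive(char driveLetter)
{
    HMODULE hShell = LoadLibraryA("shell32.dll");
    if ( ! hShell ) return 0;

    int ret = 0;
    IsNetDrive_t pIsNetDrive = (IsNetDrive_t) GetProcAddress(hShell, "IsNetDrive");
    if ( pIsNetDrive )
    {
        int iDrive = driveLetter - 'A';     // 0 = A: , 1 = B: , ...
        // this can stall for a long time while it tries to restore the connection
        ret = pIsNetDrive(iDrive);
    }

    FreeLibrary(hShell);
    return ret;
}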

11/03/2008

11-03-08 - Curves

I finally read Ignacio's talk on displaced subdivision with the NV geometry shader pipeline. You can find the whole talk with audio and such online but I like just the PDF. There's a lot of good stuff in there. A lot of the issues are general issues for low-high poly work, it's just that most people so far have been just ignoring the problems and creating meshes with bad discontinuities.

Ignacio's stuff is based on the Charles Loop paper "Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches". It's funny that it's really just a fancy version of the PN Triangles hack; the actual geometry surface made is not continuous, but there's a separate tangent field over it which makes the shading look smooth.

While there I noticed a bunch of Loop papers I hadn't read. The Loop-Blinn thing from 05 was particularly interesting because I've talked to a few people in the last N years about rendering fonts better. Thatcher worked on it a bit at Oddworld, and recently Sean's been doing it at RAD for our new GUI product. The Loop Blinn Paper described how to use a pixel shader to shade quadratic or cubic curves. The cubic stuff seems like rather a pain in the butt, but piecewise quadratic is certainly good enough for glyphs. Also they sort of gloss over the details of anti-aliasing, which is of course crucial; there's not much point in being this careful with curves unless you're going to get the antialiasing right.

The MDK Curvy Blues blog covers the same material and claims there are some mistakes in the Loop-Blinn paper. There's also a nice GLSL conics paper and demo app of the Loop-Blinn technique.

Zheng Qin, Michael McCool and Craig Kaplan seem to be doing a lot of work on this stuff at Waterloo. I sort of assume you're familiar with the simple Valve Alpha-test Distance Field thing; the Qin work is the basis of the Valve technique and is more robust and general, but much more complicated. The Valve technique is probably okay in a game development scenario where artists are monitoring how their work looks in game and are used to tweaking with weird algorithms, but it definitely has parameters that you have to tweak, and the use of single (or even two) distances can cause bad degeneracies if you're not careful, whenever you have details that are close to the size of a pixel in the distance map.

10/10/2008

10-10-08 - On the Art of Good Arithmetic Coder Use


I think it's somewhat misleading to have these arithmetic coder libraries. People think they can take them and just "do arithmetic coding". The reality is that the coder is only a small part of the arithmetic coding process, and it's the easiest part. The other parts are subtle and somewhat of an art.

The other crucial parts are how you model your symbols, how you present symbols to the coder, how you break up the alphabet, and the different types of adaptation.

If you have a large alphabet, like 256 symbols or more, you don't want to just code with that alphabet. It has a few problems. For one, decoding is slow, because you need to do two divides, and then you have to binary search to find your symbol from a cumulative probability. Aside from the speed issue, you're not compressing super well. The problem is you have to assign probabilities to a whole mess of symbols. To get good information on all those symbols, you need to see a lot of coded characters. It's sort of like the DSP sampling problem - to get information on a big range of frequencies in audio you need a very large sampling window, which makes your temporal coherence really poor. In compression, many statistics change very rapidly, so you want quick adaptation, but if you try to do that on a large alphabet you won't be gathering good information on all the symbols. Peter Fenwick has some good old papers on multi-speed adaptation by decomposition for blocksorted values.

In compression most of our alphabets are highly skewed, or at least you can decompose them into a highly skewed portion and a flat portion. Highly skewed alphabets can be coded very fast using the right approaches. First of all you generally want to sort them so that the most probable symbol is 0, the next most probable is 1, etc. (you might not actually run a sort, you may just do this conceptually so that you can address them in order from MPS to LPS). For example, blocksort output or wavelet coefficients already are sorted in this order. Now you can do cumulative probability searches and updates much faster by always starting at symbol 0 and summing from there, so that you rarely walk very far into the tree.

You can also decompose your alphabet into symbols that more accurately represent its "nature" which will give you better adaptation. The natural alphabet is the one that mimics the actual source that generated your code stream. Of course that's impossible to know, but you can make good guesses. Consider an example :


The source generates symbols thusly :

It chooses Source1 or Source2 with probability P
Source1 codes {a,b,c,d} with fixed probabilities P1(a),P1(b),...
Source2 codes {e,f,g,h} with fixed probabilities P2(e),P2(f),...

The probability P of sources 1 and 2 changes over time with a semi-random walk
After each coding event, it either does P += 0.1 * (1 - P) or P -= 0.1 * P


If we tried to just code this with an alphabet {a,b,c,d,e,f,g,h} and track the adaptation
we would have a hell of a time and not do very well.

Instead if we decompose the alphabet and code a binary symbol {abcd,efgh} and then code
each half, we can easily do very well.  The coding of the sub-alphabets {a,b,c,d} and
{e,f,g,h} can adapt very slowly and gather a lot of statistics to learn the probabilities
P1 and P2 very well.  The coding of the binary symbol can adapt quickly and learn the
current state of the random decision probability P.

This may seem rather contrived, but in fact lots of real world sources look exactly like this. For example if you're trying to compress HTML, it switches between basically looking like English text and looking like markup code. Each of those is a separate set of symbols and probabilities. The probabilities of the characters within each set are roughly constant (not really, but they're relatively constant compared to the abruptness of the switch), but where a switch is made is random and hard to predict so the probability of being in one section or another needs to be learned very quickly and adapt very quickly.

We can see how different rates of adaptation can greatly improve compression.
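As a tiny illustration of what "different rates of adaptation" means in code (toy models for illustration, not from any particular coder) - the fast binary model is the kind of thing you'd use for the quickly-changing "which source am I in" decision, and the slow counting model for the stable within-source statistics :

// fast binary model : probability moves a long way after just a few symbols
struct FastBit { unsigned p; };             // P(bit==1) scaled to 0..65536

static void FastBitUpdate(FastBit * m, int bit)
{
    // adaptation rate 1/16
    if ( bit ) m->p += ( 65536 - m->p ) >> 4;
    else       m->p -= m->p >> 4;
}

// slow model : plain counts ; after thousands of symbols this is a very accurate
// estimate of a fixed distribution, and any single symbol barely moves it
struct SlowFreqs { unsigned counts[4]; unsigned total; };

static void SlowFreqsUpdate(SlowFreqs * m, int sym)
{
    m->counts[sym]++;
    m->total++;
}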

Good decomposition also improves coding speed. The main way we get this is by judicious use of binary coding. Binary arithmetic coders are much faster - especially in the decoder. A binary arithmetic decoder can do the decode, modeling, and symbol find all in about 30 clocks and without any division. Compare that to a multisymbol decoder which is around 70 clocks just for the decode (two divides), and that doesn't even include the modeling and symbol finding, which is like 200+ clocks.

Now, on general alphabets you could decompose your multisymbol alphabet into a series of binary arithmetic codes. The best possible way to do this is with a Huffman tree! The Huffman tree tries to make each binary decision as close to 50/50 as possible. It gives you the minimum total code length in binary symbols if you just wrote the Huffman codes, which means it gives you the minimum number of coding operations if you use it to do binary arithmetic coding. That is, you're making a binary tree of coding choices for your alphabet but you're skewing your tree so that you get to the more probable symbols with fewer branches down the tree.

(BTW using the Huffman tree like this is good for other applications. Say for example you're trying to make the best possible binary search tree. Many people just use balanced trees, but that's only optimal if all the entries have equal probability, which is rarely the case. With non-equal probabilities, the best possible binary search tree is the Huffman tree! Many people also use self-balancing binary trees with some sort of cheesy heuristic like moving recent nodes near the head. In fact the best way to do self-balancing binary trees with non equal probabilities is just an adaptive huffman tree, which has logN updates just like all the balanced trees and has the added bonus of actually being the right thing to do; BTW to really get that right you need some information about the statistics of queries; eg. are they from a constant-probability source, or is it a local source with very fast adaptation?).

Anyhoo, in practice you don't really ever want to do this Huffman thing. You sorted your alphabet and you usually know a lot about it so you can choose a good way to code it. You're trying to decompose your alphabet into roughly equal probability coding operations, not because of compression, but because that gives you the minimum number of coding operations.

A very common case is a log2 alphabet. You have symbols from 0 to N. 0 is most probable. The probabilities are close to geometric like {P,P^2,P^3,...} A good way to code this is to write the log2 and then the remainder. The log2 symbol goes from 0 to log2(N) and contains most of the good information about your symbol. The nice thing is the log2 is a very small alphabet, like if N is 256 the log2 only goes up to 9. That means coding it is fast and you can adapt very quickly. The remainder for small log2's is also small and tracks quickly. The remainder at the end is a big alphabet, but that's super rare so we don't care about it.

Most people now code LZ77 offsets and lengths using some kind of semi-log2 code. It's also a decent way to code wavelet or FFT amplitudes. As an example, for LZ77 match lengths you might do a semi-log2 code with a binary arithmetic coder. The {3,4,5,6} is super important and has most of the probability. So first code a binary symbol that's {3456} vs. {7+}. Now if it's 3456 send two more binary codes. If it's {7+} do a log2 code.
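Concretely, that scheme decomposes like this (a sketch; rather than dragging in a whole arithmetic coder it just emits (context, bit) pairs, where each distinct context would be its own adaptive binary model) :

enum { CTX_IS_SHORT, CTX_SHORT_LOW, CTX_SHORT_HIGH, CTX_LOG2_UNARY, CTX_REMAINDER };

struct BitOut { int context; int bit; };

// len >= 3 ; returns the number of binary coding operations emitted
int DecomposeMatchLen(int len, BitOut * out)
{
    int n = 0;
    if ( len <= 6 )
    {
        out[n].context = CTX_IS_SHORT;   out[n++].bit = 1;            // {3,4,5,6}
        out[n].context = CTX_SHORT_LOW;  out[n++].bit = (len-3) & 1;  // two more bits
        out[n].context = CTX_SHORT_HIGH; out[n++].bit = (len-3) >> 1;
    }
    else
    {
        out[n].context = CTX_IS_SHORT;   out[n++].bit = 0;            // {7+}
        int v = len - 6;                                    // v >= 1
        int k = 0; while( (v >> (k+1)) != 0 ) k++;          // k = floor(log2(v))
        for(int i=0;i<k;i++) { out[n].context = CTX_LOG2_UNARY; out[n++].bit = 1; }
        out[n].context = CTX_LOG2_UNARY; out[n++].bit = 0;            // log2 part (unary here)
        for(int b=k-1;b>=0;b--)                                       // remainder bits
        { out[n].context = CTX_REMAINDER; out[n++].bit = (v >> b) & 1; }
    }
    return n;
}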

Another common case is the case that the 0 and 1 are super super probable and everything else is sort of irrelevant. This is common for example in wavelets or DCT images at high compression levels where 90% of the values have been quantized down to 0 or 1. You can do custom things like code run lengths of 0's, or code binary decisions first for {01},{2+} , but actually a decent way to generally handle any highly skewed alphabet is a unary code. A unary code is the huffman code for a geometric distribution in the case of P = 50% , that is {1/2,1/4,1/8,1/16,...} ; we code our symbols with a series of binary arithmetic codings of the unary representation. Note that this does not imply that we are assuming anything about the actual probability distribution matching the unary distribution - the arithmetic coder will adapt and match whatever distribution - it's just that we are optimal in terms of the minimum number of coding operations only when the probability distribution is equal to the unary distribution.

In practice I use four arithmetic coder models :

1. A binary model, I usually use my rung/ladder but you can use the fixed-at-pow2 fractional modeller too.

2. A small alphabet model for 0-20 symbols with skewed distribution. This sorts symbols from MPS to LPS and does linear searches and probability accumulates. It's good for order-N adaptive context modeling, N > 0

3. A Fenwick Tree for large alphabets with adaptive statistics. The Fenwick Tree is a binary tree for cumulative probabilities with logN updates (there's a tiny sketch of it after this list). This is what I use for adaptive order-0 modeling, but really I try to avoid it as much as possible, because as I've said here, large alphabet adaptive modeling just sucks.

4. A Deferred Summation semi-adaptive order-0 model. This is good for the semi-static parts of a decomposed alphabet, such as the remainder portion of a log2 decomposition.
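Since the Fenwick Tree came up in item 3, here's the minimal textbook version (generic code, not the actual model) - the tree array stores partial sums so that both the frequency update and the cumulative-probability query are O(log N) :

#define NSYMS 256

static unsigned tree[NSYMS+1];     // 1-based internally

static void FenwickAdd(int sym, int delta)       // freq[sym] += delta
{
    for(int i = sym+1; i <= NSYMS; i += i & (-i))
        tree[i] += delta;
}

static unsigned FenwickCum(int sym)              // sum of freq[0..sym-1]
{
    unsigned sum = 0;
    for(int i = sym; i > 0; i -= i & (-i))
        sum += tree[i];
    return sum;
}

The decoder side also needs the inverse query (find the symbol containing a given cumulative probability), which can be done with a top-down binary walk of the same array.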

Something I haven't mentioned that's also very valuable is direct modeling of probability distributions. eg. if you know your probabilities are Laplacian, you should just model the Laplacian distribution directly, don't try to model each symbol's probability individually. The easiest way to do this usually is to track the average value, and then use a formula to turn the average into probabilities. In some cases this can also make for very good decoding, because you can make a formula to go from a decoded cumulative probability directly to a symbol.
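For example, for a geometric/Laplacian-shaped distribution over v >= 0 with tracked mean m, one workable parameterization (an illustration of the idea, not the actual model) is P(v) = (1-r) * r^v with r = m/(m+1) :

#include <math.h>

double GeomP(double mean, int v)        // P(symbol == v) , v >= 0
{
    double r = mean / (mean + 1.0);
    return (1.0 - r) * pow(r, v);
}

double GeomCum(double mean, int v)      // P(symbol < v)
{
    double r = mean / (mean + 1.0);
    return 1.0 - pow(r, v);
}

int GeomSymbolFromCum(double mean, double cum)   // invert : decoded cumulative prob -> symbol
{
    double r = mean / (mean + 1.0);
    return (int)( log(1.0 - cum) / log(r) );
}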

ADDENDUM BTW : Ascii characters actually decompose really nicely. The top 3 bits is a "selector" and the bottom 5 bits is a "selection". The probabilities of the bottom 5 bits need a lot of accuracy and change slowly, the probabilities of the top 3 bits change very quickly based on what part of the file you're in. You can beat 8-bit order0 by doing a separated 3-bit then 5-bit. Of course this is how ascii was intentionally designed :


    0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F  0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
0  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
1   SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
2   @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
3   `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~ DEL

10-10-08 - On LZ Optimal Parsing



So the LZ-Huffman I've done for RAD/Oodle does some new stuff with optimal parsing. I want to write up what I'm doing so I remember it, and I'm also going to try to write about the semi-optimal parsing that a lot of people are doing now.

First let's review a bit. In LZ coding you have a lot of choices about how you code things. Greedily choosing the longest match at each point does not produce the best code stream. There is a very good ancient paper by Storer and Szymanski that introduced optimal parsing, and their method is both fast and 100% optimal, but it is only "optimal" in the sense of coding the fewest number of literals possible. See my old notes on Storer-Szymanski optimal parsing. Now, for the old LZSS coder that used fixed bit coding (8 bit literals, 24 bit matches, or whatever), that optimality is also the same as code length optimality. But in reality what we care about is code length optimality - we want to produce the shortest possible file - which is not necessarily the same as coding the fewest possible literals.

There are sort of two interesting domains of modern LZ coders, and optimal parsing matters a whole lot to both of them. One domain is the super fast, super simple LZ coder like my LZH. Optimal parsing is important to LZH because the matches are so important to compression, you really want to get the best matches you can. The other domain is something like LZMA or BALZ (an ROLZ) that does a lot of context coding of literals. In those cases, the context coder for literals is so good that you really want to only code good matches, and often the literals will be so cheap you want to send them as literals; also the cost of literals and matches is highly variable for those coders.

Optimal parsing is a path search in a graph. For game developers you can think about A* as I talk about this. You start yourself at the head of the file, and your goal is to walk to the end of the file in the shortest possible path. From each node in the graph, your exit paths are the possible coding operations at that point in the file, eg. write a literal, write match X of length Y. Each coding operation has a certain cost which is equal to the number of bits output if you do that coding operation, and then you step along that link to the next point in the file as specified by the match length.

Note that with the original LZSS we only needed to consider either matching or not matching at each position in the file, since any time you wrote a match the best thing to do was to write the maximum match length. With true code-optimal parsing, you have to consider all possible match offsets and all possible match lengths at each location (!!). This is a huge number of coding choices, but we can rule out almost all of them. (more on this later).

So, the brute force straightforward optimal parser should be obvious at this point. Start with an empty code stream and push position 0 on the stack. Repeat forever : pop the stack. Try all coding operations at the position specified in the stack. Push all of those coding operations on the stack with the code stream they made, and the position of the end of the match. This works fine and produces a 100% optimal parse, but is insanely slow. If you have N coding choices at each position it makes N^L operations for a length L file.

With adaptive coding (eg. you're using an adaptive entropy coder for literals, matches, and offsets), you cannot get a 100% optimal parse unless you do this full search, because it does not have the optimal subgraph property. That is if you have the optimal path from A to C, and that path goes through B, that optimal path is not necessarily the same as taking the optimal path from A to B and the optimal path from B to C. To speed things up we are going to start making approximations, and we are going to generally assume a looser version of the optimal subgraph property. What this means in terms of LZ coding is that if you are at a certain position in the file, the best way to code the whole file does not necessarily include the best way to code from {begin} to the current position.

Let's revisit the original LZSS parse now. LZSS does no entropy coding, so it doesn't matter how you get to a given byte of the stream, you will code the remainder in the same way. This is obviously a "Dynamic Programming" problem. You have shared sub-portions of the problem that can be collapsed by storing their result in a table. The optimal path from point P to the end is the same regardless of how you get to P. You could now proceed by doing the brute force approach and search depth-first, and each time you find a path, you store all the sub-paths that you visited, and then you reuse them instead of recomputing them. Or you can do it a more elegant way which is identical, which is to walk backwards through the file and at each position P store the optimal path from the P to the end. When you find the optimal path at P obviously all the future subgraphs that you might need are already done because you're going backwards. So, we now think of the LZSS optimal parse as a way of solving the graph search using dynamic programming.

The case of a static entropy coder is simpler so let me focus on that first. Offsets and lengths and literals are written in some # of bits that is known in advance and we wish to make the shortest possible code stream. We can do a sort of LZSS backwards parse to do this quickly. It's not as simple as LZSS because we cannot assume that the best match choice at each step is the longest one.

Optimal Parser for Static Statistics :

We're going to walk backwards like LZSS. At each position already considered (future positions in the file), we have already found the optimal choice at that spot. We have various coding choices at the current position, either literal or various {offset,length} possibilities. The total cost of each of those choices is : {Cost to code the current choice} + {Table cost at (current pos + match length)} , that is the already computed cost to finish the walk after you step ahead by the match length.

It should be obvious that we can solve this just by considering every coding operation possible at each position and walk backwards. That gives us a total cost of O(N*L) , which is a lot better than O(N^L) , but it's still slow because N is very large. So let's start making approximations.
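Before getting to the approximations, here's the skeleton of that backwards walk for the static case (a sketch; the cost helpers and the per-position match candidate lists are assumed to exist, and pruning those candidate lists down is what the rest of this is about) :

struct Match  { int offset, length; };
struct Choice { bool isMatch; int offset, length; };

// assumed helpers : static code costs in bits
extern double CostLiteral(int pos);
extern double CostOffset(int offset);
extern double CostLength(int len);
// assumed : pruned match candidates at each position
extern Match matches[][8];
extern int numMatches[];

void OptimalParseBackwards(int fileLen, double costTable[], Choice choices[])
{
    costTable[fileLen] = 0;
    for(int p = fileLen-1; p >= 0; p--)
    {
        // coding a literal is always possible
        double best = CostLiteral(p) + costTable[p+1];
        choices[p].isMatch = false;

        // then try the match candidates at p
        for(int m = 0; m < numMatches[p]; m++)
        {
            int off = matches[p][m].offset;
            int len = matches[p][m].length;
            double c = CostOffset(off) + CostLength(len) + costTable[p+len];
            if ( c < best )
            {
                best = c;
                choices[p].isMatch = true;
                choices[p].offset = off;
                choices[p].length = len;
            }
        }

        costTable[p] = best;
    }
    // the code stream is then emitted by walking forwards from 0 following choices[]
}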

What are all these coding choices that we have? We found various offsets that we matched at least min match length (usually 3). This gives us a bunch of possible offsets and the maximum match length at each offset : {Off1,Len1} , {Off2,Len2}. In general we also have to consider all lengths >= 3 at each offset.

The first approximation is the assumption that for any given match length, the lowest code cost comes from the smallest possible offset that matches with that length. That is, we assume that if Off1 < Off2 , then cost(Off1) <= cost(Off2). This is certainly true for all non-pathological files in the real world. This kind of assumption is also self-satisfying, that is by making this assumption, the optimal parser will prefer shorter offsets, which makes them more likely, which makes their code length shorter. It's a positive feedback loop; I'll talk more about feedback later.

With this approximation, we no longer need to consider all offsets and all match lengths. Now we only need to consider all match lengths, and the offset for any length is chosen for us - it's the lowest offset that codes that length. Our code choices now look like :


you find a match of length 7 at Off1, length 11 at Off2, length 22 at Off3
with Off1 < Off2 < Off3

choose from :

Length 3-7 at Off1
Length 8-11 at Off2
Length 12-22 at Off3

Okay, so we could do this, but it's still too slow. We need to reduce our choices further.

One option is to only consider the longest possible length at each offset. That is, assume that if you code a length at a given offset, the best choice is the longest length at that offset. So in our example you would only choose from {7 at Off1, 11 at Off2, 22 at Off3}. That is sort of okay, but it actually gives up a lot of compression. It misses too many good choices. Apparently there are some good coding paths that involve taking a shorter than maximal match.

To get to the really cool optimization, we need to stop and think about this intuitively. Why would you not write a maximal match? It should almost always give us a shorter total code len, because writing a longer match is very cheap and lets us not code symbols after the match. The cost of a given match is


Cost(offset) + Cost(match length) + CostTable( at pos + match length )

Let's assume for now that the cost of the offset and the match length are separable (they aren't used to predict each other). The cost of each match length is an entropy code that generally codes shorter lengths in fewer bits. It's sub-linear in match length, but it is generally increasing. However, it increases less than the average cost of future symbols. Note that for long match lengths, the cost of len and (len+1) may be exactly equal depending on how you code your lengths. For example if you use a semi-log2 code, then the cost of many large match lengths is identical. That means coding longer matches is free in terms of the current code cost.

The CostTable generally increases backwards linearly proportional to the average entropy. That is :


CostTable[ {end} ] = 0
CostTable[ end - 1 ] = cost to code from (end-1) to end
CostTable[ end - N ] = N * H

on average ; where H is the average entropy of one symbol

However, even though CostTable is linearly increasing backwards on average, it is not always. Consider for example what CostTable looks like in the neighborhood of a very good very long match. Say at some position P this long match is available with length L. At position (P+1) the best coding choice will be that same match with length (L-1), same at (P+2) etc. The cost table in this area will look like :

CostTable[ P-1 ] = Cost to code literal at (P-1) + Cost(L) + CostTable[ P + L ];
CostTable[ P   ] = Cost(L  ) + CostTable[ P + L ];
CostTable[ P+1 ] = Cost(L-1) + CostTable[ P + L ];
CostTable[ P+2 ] = Cost(L-2) + CostTable[ P + L ];
CostTable[ P+3 ] = Cost(L-3) + CostTable[ P + L ];

If L is large, like say 100 or more, the Cost(L) and the Cost(L-1) will be very very similar. In this region, CostTable[] is increasing as you step backwards, but it's only increasing very slowly, far more slowly than the average slope of H. We see that CostTable goes backwards like N*H in a line, but it deviates up and down across the line. Some areas you code worse than average entropy, some areas you code much better than average entropy. The areas where you code better than average entropy are candidates for choosing a shorter than maximal match length!

What are we really doing as we step back through the file and choose the best match length? We have this constant table Cost(L) that tells us the cost of coding each match length. This table is increasing, it's smallest at Cost(3) and largest at Cost(infinity). We find the maximum match length at a given offset and the minimum at a given offset (the minimum is the smallest length that can't be coded from a lower offset). We take the Cost(L) table in that range and add it onto CostTable() in that range, then pick the lowest value after the sum. Generally CostTable increases backwards faster than Cost(L) increases forwards, but not always! As we saw in the example above with a long match, CostTable can increase backwards slower than Cost(L) increases forwards.

So, how do we find these spots quickly? Well, when we are looking at coding at spot P, we have already visited all the CostTable spots > P. What we'd like to know is, is CostTable ahead of us increasing sublinearly? And if so, where is it MOST sublinear? That is, where is CostTable[P] furthest below (end - P)*H ? That will tell us the most likely spot to code a less than maximal match.

We could build a binary search tree for these spots as we go backwards, which would be logN and all, but I used a simpler solution. At each spot P, I store the best sublinear spot preceding that spot (preceding in file order, actually occurring later in the backwards walk). That is :


I chose a coding at pos P and updated CostTable[P]

BestSpotPreceding[P] = P;
int step = 1;
while( CostTable[P] + step * TinyCost < CostTable[P + step] )
{
 BestSpotPreceding[P + step] = P;
 step++;
}

Normally when CostTable is increasing a lot as you step back, this does not get triggered at all. That is, CostTable[P+1] is usually about H smaller than CostTable[P]. But in weird cases, the cost of coding this symbol was almost free, so BestSpotPreceding points back to us. BestSpotPreceding gets updated in waves that terminate at cheap code spots. Normally you keep stepping back, then you hit a really cheap code spot and a wave percolates forward, filling out BestSpotPreceding, then you step some more normally, then you hit a cheap code spot and it waves through again. Sort of like carries in an arithmetic coder. Anyway, this is technically still O(N*L), but it's a very cheap operation and it's way way less than N*L in practice.

So, how do we use this? Instead of considering only the maximum length at each offset ( {7 at Off1, 11 at Off2, 22 at Off3} in our previous example ), we also consider matches at BestSpotPreceding[] in that offset. That is :

Look in 3-7 at Off1
Consider a 7 at Off1
Also consider BestSpotPreceding[Pos + 7] at Off1 (if it's >= 3 )

Look in 8-11 at Off2
Consider an 11 at Off2
Also consider BestSpotPreceding[Pos + 11] at Off2 (if it's >= 8)

Look in 12-22 at Off3
Consider a 22 at Off3
Also consider BestSpotPreceding[Pos + 22] at Off3 (if it's >= 12)

And those are all the matches we have to consider. I did this and found it hurt less than 0.001 bpb compared to the full optimal parse that considers all possible coding operations at each position.

ADDENDUM #1 : I should mention there's another cool way to get this optimization if you don't want to do the BestSpotPreceding thing. We again want to ramp up our intuition for the problem. The cases where we need to make a non-trivial decision about match lengths are cases when we are writing two matches in a row, both greater than minimum length, and they overlap. That means there's a big range of ways we can code the same block. For example :

 

At pos P we find a match of length 7
At pos P+4 we find a match of length 10

The area from pos P to [P+14] can be coded with two matches, the ways are :

{4,10}
{5,9}
{6,8}
{7,7}

The area of overlap of the matches all codes to the same offset, it's the same offset that was found at P+4, but if you match past that at P, then you just match less of it. The amount of overlap is the number of different coding choices we have.

We're coding the same two offsets and skipping the same number of symbols, so the only difference in code cost is the cost to code the match lengths. The code costs of the choices are :


Cost(len 4) + Cost(len 10)
Cost(len 5) + Cost(len 9)
Cost(len 6) + Cost(len 8)
Cost(len 7) + Cost(len 7)

But we don't even need to compute these. The cost of coding a match len is almost always increasing in len, and increasing more slowly for larger lens, that is :

(len1 > len2) implies codecost( len1 ) >= codecost( len2 )

(len1 > len2) implies ( codecost( len1 +1 ) - codecost(len1) ) <= ( codecost( len2 + 1 ) - codecost( len2 ) )

That is, code costs go up for larger lens, but they go up more slowly. It's like a sqrt(x) function - sharp at the beginning and then levels out. That means the lowest total code cost is the one that's most skewed. You want to extend the longer match as much as possible because that's very cheap, and you want to shorten the shorter match as much as possible because that saves you a lot. Intuitively you can think the difference between coding a len of 3 and 4 is very big, but the difference between 511 and 512 is tiny tiny.

In our example that means the best choice is :

Cost(len 4) + Cost(len 10)

The simple heuristic to do this in general is :


Consider coding options at pos P :
Find the max match len possible at each offset
Consider coding with the max match len

// also try the shortest len that still lands on the same following match :
len = max match len
while ( len > MML && matchoffset[ P + len - 1 ] == matchoffset[ P + len ] )
 len--;
if ( len != max match len )
 consider coding at len      // (MML = minimum match length)

Note the similarity with the BestSpotPreceding heuristic. We try coding with our longest match len. Then we also try coding with the *shortest* match len that still reaches the same match after we jump forward. We're trying the two most skew possibilities, our match as long as possible and our match as short as possible, constrained to be getting the same match following. (matchoffset is already computed because we're going backwards).

ADDENDUM #2 : More early outs. LowestCostPreceding. @@ todo : write this.

There are some more minor issues we need to consider.

The first one is that if we are doing a Huffman (or static arithmetic) coder to entropy code the length, offset & literal, then we can't actually know the code length in advance. And the way we choose to do the coding will affect the statistics which will affect the code lengths, which affects how we choose to do the coding, ad infinitum. It's a feedback loop.

As usual in these kinds of optimization searches, we make the semi-static assumption. We make a good guess for the code lengths, find the optimal parse for those code lengths, use that parse to guess new code lengths, optimal parse again, etc. Hopefully this smoothly converges. Well, it doesn't. The problem is that the coding alphabet is actually pretty sparse. On medium sized files there are not that many literals, and there are plenty of offsets and match lengths that you might not code at all in one particular encoding choice. For example, in one pass you might not code any matches of length {31} at all. Now when you do the next pass you need something reasonable in there. If you use the actual code lengths, you can strangely skew how you do the coding, and the search can actually ping-pong a lot and not settle down.

To improve this, I suggest smoothing the counts in the early passes. That is, the chance of a length 31 match should be somewhere between the chance of a length 30 and a length 32 match, regardless of the coincidence that we happened not to code with it in the last pass. To do this, I fit a Laplacian probability model to the lengths and offsets, and in the early passes I blend that model in to smooth out and regularize the distribution. As I do more passes I decrease the weight of that blended simple model and use the raw counts more.
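For concreteness, the blend might look something like the sketch below. None of this is my actual code; the geometric ("Laplacian-ish") model, the blend weight, and all the names are made up for illustration :

 #include <math.h>

 // hedged sketch : regularize sparse match-len counts between passes
 // rawCount[i]  = times len (MINLEN+i) was coded in the last pass
 // blend        = weight of the simple model ; ~0.5 early, -> 0 in later passes
 // lambda       = decay of the fitted geometric model
 void SmoothLenCosts( const int * rawCount, double * codeCostBits, int numLens,
                      double lambda, double blend )
 {
  double total = 0, norm = 0;
  for ( int i = 0; i < numLens; i++ ) { total += rawCount[i]; norm += pow(lambda,i); }

  for ( int i = 0; i < numLens; i++ )
  {
   double rawP   = ( total > 0 ) ? rawCount[i] / total : 0.0;
   double modelP = pow(lambda,i) / norm;
   double p = (1.0 - blend) * rawP + blend * modelP;  // never zero as long as blend > 0
   codeCostBits[i] = - log2( p );   // estimated cost to code this len in the next pass
  }
 }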

In practice, you can avoid doing multiple passes of the actual optimal parser without hurting compression too much. You can do a first pass with just a normal non optimal parser and build statistics from that. Smooth those statistics with a regularizing simple model. Then do the optimal parse with those statistics, and you're done. Doing more passes will improve the coding decisions, though.

The other issue is that it's still a bit slow. Most of the slowness comes from the areas where very long matches are possible. If there's a long match of length L at pos P, we also consider (L-1) at (P+1), and (L-2) at (P+2), etc., and for each of those we generally find many closer offsets to also consider, roughly O(L) of them, which makes us O(L^2) over that region. On super degenerate files like "pic" in the Calgary Corpus, this makes us very slow. However, it's easy to fix. In the area covered by the long match, it really doesn't matter much what we do, because all of the choices are good ones. In general, areas of super high compression matter much less to the whole file; you want to maximize your wins on the hard-to-compress spots, because in terms of output bytes they are more important. So, in these big long match areas, we just cut out the whole area inside the long match, and we always code the long match at the beginning. (Actually I allow coding at P and P+1 up to P+K for a small K; K is like 4 and L is like 100.)
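The early-out is easy to bolt on as a pre-pass over the already-found matches; something like this sketch (LONG_LEN and K are made-up thresholds in the spirit of the L ~ 100, K ~ 4 above, and the array names are placeholders) :

 // hedged sketch : mark the interior of very long matches so the optimal
 //  parse doesn't do full decisions there
 for ( int P = 0; P < N; P++ )
 {
  if ( longestMatchLen[P] >= LONG_LEN )
  {
   // allow real coding decisions only at P .. P+K ;
   //  positions strictly inside the long match just ride it out
   for ( int Q = P + K + 1 ; Q < P + longestMatchLen[P] ; Q++ )
    skipPos[Q] = true;
  }
 }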

Optimal Parser for Adaptive Coders :

Now, with real modern coders you aren't doing static statistics, they're adapting as you go. You could just make a guess at the overall statistics and use the semi-static optimal parser that I just described, and then code the stream with your adaptive coders. That's better than not optimal parsing at all, but it's giving up a lot. Adaptive coders get a big win by changing their statistics as they scan through the file, and an optimal parser can make much better decisions if it uses the actual local statistics. For example, in one part of the file, literals might code very well, and the optimal parser should favor literals more, in another part of the file there might be lots of very short match lengths, and the optimal parser should code for that, etc.

One way to solve this is to use a different kind of semi-static approximation and optimal parse forward in blocks. For each block, you pretend your coders are static and use some fixed code lengths. Now, run the semi-static optimal parser that I described above to find the optimal coding for the next 1000 bytes or so. Then step through that coding and output the codes with your adaptive entropy coder. Only output 900 bytes or so - don't go all the way to the end of the block, so that you avoid edge effects. Now use the current statistics to find the optimal parse for the next 1000 bytes, and repeat. This in fact works very well, and I believe it was first used by RKive many years ago. He called it "dynamic programming", which is sort of right in the sense that all backward optimal parses are a type of dynamic programming.
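In outline that loop looks something like the sketch below. The chunk sizes come from the description above, but the function names (SnapshotCodeLengths, OptimalParseBackwards, EmitOps) are purely illustrative :

 // hedged sketch : adaptive coding driven by chunked semi-static optimal parses
 int P = 0;
 while ( P < fileLen )
 {
  // freeze the adaptive coder's current statistics into fixed code costs
  SnapshotCodeLengths( coder, costs );

  // backwards optimal parse of the next ~1000 bytes using those fixed costs
  int parseEnd = P + 1000; if ( parseEnd > fileLen ) parseEnd = fileLen;
  OptimalParseBackwards( P, parseEnd, costs, parse );

  // emit only the first ~900 bytes of that parse with the adaptive coder, so
  //  decisions made near the stale far edge of the chunk are thrown away
  int emitEnd = P + 900; if ( emitEnd > fileLen ) emitEnd = fileLen;
  P = EmitOps( coder, parse, P, emitEnd );
 }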

There's another form of adaptive optimal parse that a lot of people do now using forward walking. It's called "flexible parsing", and I believe a form of this is used in LZMA (add: no, that's not LZMA).

The simplest form of "flexible parse" is just the old "lazy evaluation" of Zip. Lazy matching walks through the file and considers either writing {match} or {literal + match at next pos}. It's a slightly non-greedy parser. Flexible parsing is basically the same thing, but we consider a few different choices.

To do this, we run through the whole file first and find the best match at each position. (You don't actually have to do this in practice; you could just look ahead some amount N from the current position to avoid multiple passes.) In any case, the match search at each pos is only done once, and we have the matches ready far enough in the future to do the flexible parse.

So, we consider a few coding choices at the current location. We choose the one that gives the lowest code cost to the end of the file. To make good decisions, it needs to also consider the coding after you take the current operation. That is, it's like a recursive function :


CodeCostRecursive( P )
{
 consider each of the N choices at P :
  cost of choice n = Cost(code n) + CodeCostRecursive( P + len(n) )
 return the best (lowest cost) of the N
}

Now of course if we actually did this full recursion we would be doing the whole O(N^L) tree search that we didn't want to do. But we don't need to do that, because we're only using this to make the decision at the current byte. That is, we can use the "short horizon" assumption - coding choices far in the future don't matter that much to our current coding choice. It's like only considering a few moves ahead in a Chess program (which is a bad thing to do there, but a good thing here). In fact, you only need to look 2 moves ahead here to make good coding choices.

Also, you don't need to consider very many N choices. N = 4 or 8 is plenty. Certainly consider coding a literal and coding the longest possible match, and then heuristically pick some other likely good choices in between (such as the longest matches available with some shorter offsets, or the longest matches possible from previous-match-references if you have those).

How do we terminate the recursion? After 2 steps (considering one op and then one more op after that) we just need an estimate of the cost of coding the rest of the file. That's easy, it's just :


CodeCostEstimate( P ) = ( {end} - P ) * H

That is, just assume the remainder is coded at the entropy. The entropy you use here should be a conservative overestimate of the entropy; for example just use the entropy of what you've done so far and add a fudge factor. This ensures that you actually prefer writing matches and don't do something strange if you start the file in a very low entropy region.

This flexible parse sounds kind of expensive, but in fact with N = 4, it's only 16 table lookups to compute all the code lengths, then we pick the best one. This of course is not a true optimal parse, but it's pretty damn good, and the fact that it's using the true current adaptive statistics makes up for the fact that it's not doing the best possible parse.
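Put together, the decision at each position looks something like the sketch below. This is my sketch of the scheme just described, not anybody's shipping code; GatherChoices(), Cost(), the Choice struct, and the globals fileLen and conservativeH are all stand-ins :

 // hedged sketch : two-ply flexible parse decision at pos P
 typedef struct { int len; /* plus offset, literal flag, etc. */ } Choice;

 double TailEstimate( int P )
 {
  return ( fileLen - P ) * conservativeH;   // remainder coded "at the entropy"
 }

 int PickOpAt( int P )
 {
  Choice c1[8]; int n1 = GatherChoices( P, c1 );   // N = 4 or 8 choices
  double bestCost = 1e99; int best = 0;

  for ( int i = 0; i < n1; i++ )
  {
   int P2 = P + c1[i].len;

   // second ply : best op starting where the first op lands
   Choice c2[8]; int n2 = GatherChoices( P2, c2 );
   double tail = TailEstimate( P2 );      // fallback, e.g. near end of file
   for ( int j = 0; j < n2; j++ )
   {
    double t = Cost( c2[j] ) + TailEstimate( P2 + c2[j].len );
    if ( t < tail ) tail = t;
   }

   double cost = Cost( c1[i] ) + tail;
   if ( cost < bestCost ) { bestCost = cost; best = i; }
  }
  return best;   // index of the op actually emitted at P with the adaptive coder
 }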

ADDENDUM : I should note that this kind of forward parse is also better at making good decisions with PMR's (Previous Match References). PMR's are cheap to code offsets, so they should always be considered in the optimal parse when they are possible matches. Going backwards you can't make good decisions about trying to use PMR's because you haven't done the earlier parts of the file yet.

10/08/2008

10-08-08 : Arithmetic coders throw away accuracy in lots of little places.


10-08-08

Arithmetic coders throw away accuracy in lots of little places. A full precision range update would be :


low += (symlow * range) / symtot;
range = (symrange * range) / symtot;

Instead we do :

U32 r = range / symtot;
low += symlow * r;
range = symrange * r;

This has changed the parentheses of the multiplication, which lets us do the math in 32 bits and also saves a divide. It's inaccurate and wastes some range, so we check for being at the end of the range and put the wasted space there :

if ( (symlow + symrange) == symtot )
{
 // top symbol in the alphabet : give it everything up to the end of range
 U32 t = symlow * (range / symtot);
 low += t;
 range -= t;
}
else
{
 U32 r = range / symtot;
 low += symlow * r;
 range = symrange * r;
}

We can see the waste by comparing the range update for the top symbol in the two cases :

case1 = range - symlow * (range / symtot);
case2 = symrange * (range / symtot);

waste = case1 - case2;
waste = range - symlow * (range / symtot) - symrange * (range / symtot);
waste = range - (symtot - symrange) * (range / symtot) - symrange * (range / symtot);
waste = range - symtot * (range / symtot);
waste = range % symtot;

Putting the waste on the last symbol is not great. To make the most of it, you should really reorganize your alphabet so that the most probable symbol is the last one. Symtot is usually around (1<<14) which means the waste is on average (1<<13). Range is between (1<<24) and (1<<32) so the fractional waste is between 2^-11 and 2^-19. This appears to come out to about a waste of 1 byte every 4096 on text with the alphabet rearranged to put the space (0x20) at the end.

I should note that many many people have arithcoders that do this check to put the wasted range on the last symbol, but they *don't* do the work to make sure the last symbol is the MPS (the most probable symbol). If you don't do that, it's rather pointless. With english text for example you might be putting the extra range on the symbol "255" which basically never happens, so you are doing work for nothing.
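If your alphabet isn't naturally ordered that way, a tiny remap table fixes it. This is just a sketch of the idea for a static or semi-static model (with a fully adaptive model the MPS can move around, which needs more care); count[], remap[] and numSyms are illustrative names :

 // hedged sketch : permute the alphabet so the MPS gets the last slot,
 //  which is where the wasted range goes
 int mps = 0;
 for ( int s = 1; s < numSyms; s++ )
  if ( count[s] > count[mps] ) mps = s;

 for ( int s = 0; s < numSyms; s++ ) remap[s] = s;
 remap[mps] = numSyms - 1;        // eg. space (0x20) -> last slot for english text
 remap[numSyms - 1] = mps;
 // encode remap[sym] ; the decoder applies the inverse permutation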

Sean asked a good question that had me a little confused yesterday. Why do some of the renormalization updates shift in 1's to "high" and some don't? You might see :


while ( range < MinRange )
{
 *outptr++ = low >> 24;
 low <<= 8;
 range <<= 8;
}


or :


while ( range < MinRange )
{
 *outptr++ = low >> 24;
 low <<= 8;
 range <<= 8;
 range += 255;
}

I thought maybe there was some subtle math reason why you would choose not to put in the 255. Nope. It's just another approximation + optimization. Shifting in the 1 bits to range is the more accurate correct way to do it, but at that point range is already in 2^24 to 2^32, so adding on 255 doesn't amount to much. In practice on my real file tests it doesn't even save 1 byte. I guess it should save something like 1 every 2^20. So, most people just leave it out to save the 1 instruction.

Here are some numbers :


r:\bigtest.bin : inLen : 25651384

True Entropy : 18090226.9 : 5.642

// these are all using cum prob max = 1<<14 and doing a divide for range

radArith cycles: 64.9
radArith : encLen : 18090853 = 5.642 bpb

// radArith is a Michael Schindler 31 bit range encoder with check to put waste range on top sym

Arithmetic_Codec cycles: 58.2
Arithmetic_Codec : encLen : 18091236 = 5.642 bpb

// Arithmetic_Codec is FastAC as modified by me with no check for waste on top sym

rrArithCoder cycles: 57.3
rrArithCoder : encLen : 18091236 = 5.642 bpb

// rrArithCoder is my reimplemented 32 bit range coder similar to FastAC ; yay I beat them by 1 clock !

rrArithBinaryModel cycles: 35.1
rrArithBinaryModel : encLen : 18844782 = 5.877 bpb

// rrArithBinaryModel is a bit-by-bit adaptive model + coder ; that's cycles *per bit* , it's 280 cycles/byte

rrArithCoderRawBits cycles: 24.1
rrArithCoderRawBits : encLen : 25651385 = 8.000 bpb

// rrArithCoderRawBits just puts the bytes out through the arithmetic coder
// this basically just measures the speed of the renorm loop

rrHuffman cycles: 14.5
rrHuffman : encLen : 18179876 = 5.670 bpb

// Huffman encoding is still 4X faster and doesn't cost you a whole lot in compression efficiency

10/07/2008

10-07-08 : A little more on arithmetic coding ...


10-07-08

A little more on arithmetic coding ...

Frédéric Kayser sent me the link to this FastAC page with a bunch of papers and source code by Amir Said. It's pretty good stuff, I thought I'd give a synopsis of what's in there :

The coder is almost a vanilla Michael Schindler style "range coder". There are a few little tweaks in there that differ from the standard implementation.

1. Instead of a "virtual queue" for the bytes that might carry, he goes ahead and just writes them out, and then if a carry is detected he goes back through the output bytes and increments them while they're 0xFF. This violates one of the rules of standard arithmetic coder design, which is that you never touch output bytes after you stream them out, so that they can be transmitted or written to disk or whatever. Now of course if you're just outputting to a big buffer in memory his way is totally fine, and it appears to be faster than bothering with all the variables to account for the virtual queue.
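The carry fix-up is just a little loop over the bytes already sitting in the output buffer; something like this sketch of the idea (not Said's actual code) :

 // hedged sketch : propagate a carry back through already-written output bytes
 void PropagateCarry( unsigned char * outptr )
 {
  unsigned char * p = outptr - 1;           // last byte written so far
  while ( *p == 0xFF ) { *p = 0x00; p--; }  // 0xFF + carry rolls over to 0x00
  *p += 1;                                  // first non-0xFF byte absorbs the carry
 }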

2. He actually uses the full 32 bits for range instead of 31 bits like in the standard Schindler. We use 31 bits so that we can see the carry in the top bit. Said has a neat little trick for detecting the carry :


U32 old_low = range_low;

range = range_high - range_low;
range_high = range_low + symhigh * (range / symtot);
range_low += symlow * (range / symtot);

if ( range_low < old_low ) // carry

(This maybe could be even faster if there were an elegant way to directly get the carry flag in C). Using the full 32 bits is nice because it makes you byte aligned and eliminates some of the ugly bit junk you have to do with shifting by 7 and by 1 to get that last bit right. This lets you avoid a LOT of shift and mask work which is a big win.

Now, as for the models, his "Adaptive_Data_Model" is 100% exactly my Deferred Summation model from DCC96. You can read about the original DefSum in my papers section or get the ancient code in the statistical coders section. From what I can tell there's nothing new or interesting in the modeling: it's 100% just deferred summation to get you a power-of-two CumProbMax so you can turn your division into a shift, he uses my fast decode table, etc.

Anyway, the implementation is pretty good and the source code is pretty clean, so it's a decent place to start. However, I don't recommend actually using it as is. The models are terrible and the API interface is very bad.

The reason why "Deferred Summation" never amounted to much is that arithmetic coding a big hunk of symbols with order-0 modeling is kind of pointless. You just never actually want to do that. If you're doing arithmetic coding, you're also doing higher order modeling or LZ coding, or something, so that you don't just have static statistics. Providing API's that are locked to these silly models is lame. The correct API interfaces for an arithmetic coder are the ones I have in "arithc" , just

Encode( symlow, symfreq, symtot );
EncodeTotIsPow2( symlow, symfreq, symtotshift );

The binary models are also silly. He does deferred summation for the bit model as well, which is not necessary. You can either do the implicitly rescaling fixed-tot bit model that I just wrote about recently, or you can use a rung-ladder model like I have in crblib which uses a table to find power of 2 probabilities from a state.
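For reference, the usual shift-update form of a fixed-total bit model is tiny. This sketch is the common formulation (it may differ in details from the model referred to above), written against the EncodeTotIsPow2 style API; TOT_SHIFT and RATE are illustrative constants :

 // hedged sketch : adaptive bit model with a fixed power-of-two total
 #define TOT_SHIFT 12   // symtot = 1<<12
 #define RATE      5    // adaptation speed

 // p0 = probability of a zero bit, kept in [1, (1<<TOT_SHIFT)-1]
 void EncodeBit( U32 * p0, int bit )
 {
  if ( bit ) EncodeTotIsPow2( *p0, (1u<<TOT_SHIFT) - *p0, TOT_SHIFT );
  else       EncodeTotIsPow2( 0, *p0, TOT_SHIFT );

  // implicit rescale : counts decay geometrically, the total stays fixed
  if ( bit ) *p0 -= *p0 >> RATE;
  else       *p0 += ( (1u<<TOT_SHIFT) - *p0 ) >> RATE;
 }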

Said's codec is in fact very fast. It's almost 10% faster than any arithmetic coder I've ever seen. That's pretty damn good. I'm going to modify it a bit to provide what I think are reasonable API's and post it.

Okay here it is : cb_Arithmetic_Codec.zip ; a tiny .h with a modified version of FastAC. This is pretty rough, I tried to modify the original FastAC code as little as possible.
