Small-block coding inherently gives up a lot of coding capability, because you aren't allowed to use any cross-block information. H.264 and HD Photo are both small-block (4x4) coders, but both make very heavy use of cross-block information for context coding, DC prediction, or lapping. Even JPEG is not a true small-block coder, because it has side information in the Huffman table that captures whole-image statistics.
Fixed-bitrate blocks inherently give up even more: they kill your ability to do any rate-distortion optimization, because you can't allocate bits where they're needed. An image might have big flat sections where you're actually wasting bits (a flat 4x4 block doesn't need all 64 bits), while other areas desperately need a few more bits, but you can't give those areas any.
So, what if we keep the constraint of a fixed-size block and try to use a better coder? What is the limit on how well you can do under those constraints? I thought I'd see if I could answer that reasonably quickly.
What I made is an 8x8-pixel fixed-rate coder. It has zero side information, e.g. no per-image tables (it does have about 16 constants that are used for all images). Each block is coded to a fixed bit rate. Here I'm coding at 4 bits per pixel (the same rate as DXT1) so that I can compare RMSE directly, which means a 32-byte block for 8x8 pixels. It also works pretty well at 24-byte blocks (3 bits per pixel, i.e. 1 bit per byte of raw 24-bit RGB), at 64 bytes for high quality, etc.
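Just to make the rate arithmetic concrete, here's the bits-per-pixel math for those block sizes (a trivial sketch; the helper name is mine, not anything from the coder):

```cpp
#include <cstdio>

// Rate arithmetic for fixed-size 8x8 blocks (illustration only).
const int kPixelsPerBlock = 8 * 8;

int BitsPerPixel(int blockBytes)
{
    return blockBytes * 8 / kPixelsPerBlock;
}

int main()
{
    printf("32-byte block: %d bpp (same rate as DXT1)\n", BitsPerPixel(32));
    printf("24-byte block: %d bpp (1 bit per byte of raw 24-bit RGB)\n", BitsPerPixel(24));
    printf("64-byte block: %d bpp (the high-quality setting)\n", BitsPerPixel(64));
    return 0;
}
```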
This 8x8 coder does a lossless YCoCg transform and a lossy DCT. Unlike JPEG, there is no quantization, no chroma subsampling, no Huffman tables, etc. Coding is via an embedded bitplane coder with zerotree-style context prediction. I haven't spent much time on this, so the coding schemes are very rough. CodeTree and CodeLinear are two different coding techniques, and neither one is ideal.
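The post doesn't say which YCoCg variant this is, but for the transform to be lossless it would have to be something like the standard reversible integer YCoCg-R form, sketched below. This is my assumption, not necessarily the exact code in this coder:

```cpp
// Reversible integer YCoCg-R transform (the standard lossless form;
// an assumption -- the text above only says "lossless YCoCg").
// Note: Co and Cg gain one bit of dynamic range over the RGB inputs.
void RGBtoYCoCgR(int r, int g, int b, int &y, int &co, int &cg)
{
    co = r - b;
    int t = b + (co >> 1);
    cg = g - t;
    y  = t + (cg >> 1);
}

// Exact inverse: reconstructs RGB bit-for-bit.
void YCoCgRtoRGB(int y, int co, int cg, int &r, int &g, int &b)
{
    int t = y - (cg >> 1);
    g = cg + t;
    b = t - (co >> 1);
    r = b + co;
}
```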
Obviously going to 8x8 instead of 4x4 is a big advantage in itself, but 8x8 also seems like a more reasonable block size for future hardware. To really improve the coding significantly on 4x4 blocks, you would have to start using something like VQ with a codebook, which hardware people don't like.
In the table below you'll see that CodeTree and CodeLinear generally provide a nice improvement on the natural images, about 20%. In general they're pretty close to halfway between DXTC and the full-image coder "cbwave". When they have errors, they produce a different kind of perceptual artifact: unlike DXTC, which just makes things really blocky, these get halo/ringing artifacts like JPEG (it's inherent in truncating DCTs).
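The ringing is easy to demonstrate in one dimension: DCT a hard step edge, zero out the high-frequency coefficients, and the reconstruction oscillates around the edge (Gibbs overshoot). A tiny self-contained demo; nothing here is from the coders above:

```cpp
#include <cmath>
#include <cstdio>

const int N = 8;
const double PI = 3.14159265358979323846;

// Orthonormal 8-point DCT-II.
void dct(const double *x, double *X)
{
    for (int k = 0; k < N; k++)
    {
        double s = 0;
        for (int n = 0; n < N; n++)
            s += x[n] * cos(PI * (n + 0.5) * k / N);
        X[k] = s * sqrt((k == 0 ? 1.0 : 2.0) / N);
    }
}

// Matching inverse (DCT-III).
void idct(const double *X, double *x)
{
    for (int n = 0; n < N; n++)
    {
        double s = 0;
        for (int k = 0; k < N; k++)
            s += X[k] * sqrt((k == 0 ? 1.0 : 2.0) / N) * cos(PI * (n + 0.5) * k / N);
        x[n] = s;
    }
}

int main()
{
    double x[N] = { 0, 0, 0, 0, 255, 255, 255, 255 }; // hard step edge
    double X[N], y[N];
    dct(x, X);
    for (int k = 4; k < N; k++)  // throw away the high frequencies
        X[k] = 0;
    idct(X, y);
    for (int n = 0; n < N; n++)  // reconstruction over/undershoots near the edge
        printf("%3.0f -> %7.2f\n", x[n], y[n]);
    return 0;
}
```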
The new coders do really badly on the weird synthetic images from bragzone, like clegg, frymire and serrano. I'd have to fix that if I really cared about these things.
One thing that is encouraging is that this coder does *very* well on the simple synthetic images, like "linear_ramp", "pink_green", and "orange_purple". I think these synthetic images are a lot like game lightmaps, and the new schemes are near-lossless on them.
BTW, image compression for paging is sort of a whole other issue. For one thing, a per-image table is perfectly reasonable to have there, and you could work on something like 32x32 blocks. More importantly, in the short term you still need to provide the image in DXT1 to the graphics hardware. So you either have to page the data already in DXT1, or you have to recompress it, and as we've seen here the "real-time" DXT1 recompressors are not high enough quality for ubiquitous use.
ADDENDUM: you may have noticed that the "ryg" in this table is also not the same as the previous "ryg"; I fixed a few of its little bugs, and you can see the improvement here. It's still not competitive. I think there may be more errors in the best-fit optimization portion of the code, but he's got that code so optimized and obfuscated that I can't see what's going on.
Oh, BTW, the "CB" in this table is different from the one in the previous table; the one here uses 4-means instead of 2-means, seeded from the PCA direction, and then I try using each of the 4 means as endpoints. It's still not quite as good as Squish, but it's closer. It does beat Squish on some of the more degenerate images at the bottom, such as "linear_ramp". It also beats Squish on artificial tests like images that contain only 2 colors. For example, on linear_ramp2 without optimization, 4-means gets 1.5617 while Squish gets 2.3049; most of that difference goes away after annealing, though.
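For what it's worth, here's how I'd sketch that seeding scheme: find the principal axis of the block's colors, spread four seeds evenly along it, run a few Lloyd iterations, and then evaluate the resulting means as endpoint candidates. Everything below is mine, reconstructed from the description rather than taken from the actual CB coder; in particular I'm reading "using each of the 4 means as endpoints" as trying pairs of means against each other.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

Vec3 operator-(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
Vec3 operator+(Vec3 a, Vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
Vec3 operator*(Vec3 a, double s) { return { a.x * s, a.y * s, a.z * s }; }
double Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Principal axis of the color cloud via power iteration on the covariance.
Vec3 PrincipalAxis(const std::vector<Vec3> &pts, Vec3 mean)
{
    double c[3][3] = {};
    for (Vec3 p : pts)
    {
        double d[3] = { p.x - mean.x, p.y - mean.y, p.z - mean.z };
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                c[i][j] += d[i] * d[j];
    }
    Vec3 a = { 1, 1, 1 };
    for (int iter = 0; iter < 8; iter++)
    {
        Vec3 t = { c[0][0] * a.x + c[0][1] * a.y + c[0][2] * a.z,
                   c[1][0] * a.x + c[1][1] * a.y + c[1][2] * a.z,
                   c[2][0] * a.x + c[2][1] * a.y + c[2][2] * a.z };
        double len = std::sqrt(Dot(t, t));
        if (len > 0) a = t * (1.0 / len);
    }
    return a;
}

// 4-means over a block's pixels, seeded evenly along the PCA direction.
// Each pair of the returned means would then be scored as DXT1 endpoint
// candidates, keeping the pair with the lowest block error (per the text).
std::vector<Vec3> FourMeansPcaSeeded(const std::vector<Vec3> &pts)
{
    Vec3 mean = { 0, 0, 0 };
    for (Vec3 p : pts) mean = mean + p;
    mean = mean * (1.0 / pts.size());

    Vec3 axis = PrincipalAxis(pts, mean);
    double lo = 1e30, hi = -1e30;
    for (Vec3 p : pts)
    {
        double t = Dot(p - mean, axis);
        lo = std::min(lo, t);
        hi = std::max(hi, t);
    }

    std::vector<Vec3> means(4);
    for (int i = 0; i < 4; i++) // seeds at the 1/8, 3/8, 5/8, 7/8 points of the span
        means[i] = mean + axis * (lo + (hi - lo) * (i + 0.5) / 4.0);

    for (int iter = 0; iter < 8; iter++) // standard Lloyd iterations
    {
        std::vector<Vec3> sum(4, Vec3{ 0, 0, 0 });
        std::vector<int> cnt(4, 0);
        for (Vec3 p : pts)
        {
            int best = 0;
            double bestd = 1e30;
            for (int i = 0; i < 4; i++)
            {
                double d = Dot(p - means[i], p - means[i]);
                if (d < bestd) { bestd = d; best = i; }
            }
            sum[best] = sum[best] + p;
            cnt[best]++;
        }
        for (int i = 0; i < 4; i++)
            if (cnt[i]) means[i] = sum[i] * (1.0 / cnt[i]);
    }
    return means;
}
```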
|file|Squish opt|Squish|CB opt|CB|ryg|D3DX8|FastDXT|cbwave KLTY|CodeTree|CodeLinear|