3/27/2012

03-27-12 - DXT is not enough

I've been sent this GDC talk a few times now, so I will write some notes about it. (BTW thanks to all who sent it; I don't really follow game dev news much, so I rely on you all to tell me when there's something interesting I should know about).

There's nothing wrong with the general subject matter of the talk, and the points are along the right track in a vague sort of way, but there's just absolutely no effort put into it. I put more effort into the average blog post. If you aren't going to actually do the testing against other methods and measurement on a real (public) data set and see if your ideas are actually good, then just don't do a talk.

Anyhoo, a quick summary of the talk in plain text :

JPEG and then realtime DXTC is kind of bonkers (I agree). To make DXTC smaller, he applies Zip, then pre-huffman, pre-delta of colors, rearranging the colors & indices to be in two separate blocks, and then "codebooks", and finally 8x8 and even 16x16 blocks.

There are a lot of problems with this talk. The first is the assumption that you are using Zip on the back end. (BTW Zip is not a type of LZW at all). Zip is ancient and has a tiny window (32 KB); there's no reason to use Zip. If you just use a better back end, most of what he does next is irrelevant. Essentially a lot of what he does (such as the codebooks and the pre-huffman) is just a way of extending Zip, effectively making the sliding window larger.

Second, whenever you are doing these things you need to consider the memory use and processor use tradeoffs.

For example, reorganizing the DXTC data to separate the colors and the indices does in fact help. (I do it in Oodle, optionally). But that doesn't make it a clear win. It actually takes a huge amount of CPU. Just swizzling memory around like that can be slower than a very advanced LZ decompressor (unless you are lucky enough to be on a PC which has an amazing big cache, amazing fast memory, and an amazing out of order processor that can hide cache misses). So you have to consider what is the speed cost of doing that reorg vs. other ways you could use the CPU time to improve compression (eg. running LZMA or LZX or whatever instead of Zip). Even on a PC, the reorg will ruin large block write-combining. For me, reorg took me from 500 MB/s to 300 MB/s or so, and the gain is only a few percent, not obviously worth it. (my back end is much better than Zip so the gains are much smaller, or not there at all).
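
For reference, the reorg itself is tiny. Here is a minimal Python sketch, assuming the standard DXT1 block layout (8 bytes per 4x4 block: two 16-bit endpoint colors, then 32 bits of 2-bit indices). The names `split_dxt1` and `merge_dxt1` are made up for illustration; this is not Oodle's code, and Python obviously says nothing about the speed cost discussed above.

```python
def split_dxt1(blocks: bytes) -> tuple[bytes, bytes]:
    """Separate DXT1 blocks into one color stream and one index stream."""
    if len(blocks) % 8 != 0:
        raise ValueError("DXT1 data must be a whole number of 8-byte blocks")
    colors = bytearray()
    indices = bytearray()
    for off in range(0, len(blocks), 8):
        colors += blocks[off:off + 4]       # two 16-bit endpoint colors
        indices += blocks[off + 4:off + 8]  # sixteen 2-bit selectors
    return bytes(colors), bytes(indices)


def merge_dxt1(colors: bytes, indices: bytes) -> bytes:
    """Inverse of split_dxt1: re-interleave into C1 I1 C2 I2 ... order."""
    if len(colors) != len(indices) or len(colors) % 4 != 0:
        raise ValueError("streams must be equal length and a multiple of 4 bytes")
    out = bytearray()
    for off in range(0, len(colors), 4):
        out += colors[off:off + 4]
        out += indices[off:off + 4]
    return bytes(out)
```

The compressor then sees all the colors followed by all the indices, and the decoder has to run the equivalent of `merge_dxt1` on its output, which is where the scattered memory traffic comes from.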

The only real idea in the talk is going to 8x8 blocks. That is in fact a valid thing to do, but here the rigor of the talk completely falls apart, and instead we get "look at this blurry slide in a terrible lighting environment, can you see the loss?". Errmmm, okay. To be fair it's no worse than the typical GDC graphics talk where you get "look at this picture of my global illumination technique, can you see any artifacts?" ; ermm, well you have chosen the scene to show me, and I'm a hundred feet away looking at a slide, and I can't zoom in or examine the areas I think look funny, so of course it should look good, but in fact, yes I do see lighting artifacts!

Any time you start introducing loss you have to ask : how does this loss I'm injecting compare to other ways I could reduce bitrate and increase loss? An easy one to check is just to halve the resolution of your image (in both dimensions). That's a 4:1 compression, and quite often looks just fine visually (eg. on smooth data it is one of the best possible ways to create loss). And of course since we're in this domain you need to compare against JPEG-DXTC.
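
The half-resolution baseline is cheap to generate. A minimal sketch, assuming a grayscale image stored as rows of ints with even dimensions and a plain 2x2 box filter (the choice of filter is my assumption here; any reasonable downsample filter would do for the comparison):

```python
def halve_resolution(image: list[list[int]]) -> list[list[int]]:
    """Average each 2x2 group of pixels, halving width and height (4:1 fewer pixels)."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("this sketch assumes even dimensions")
    return [
        [
            # +2 rounds to nearest instead of truncating
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1] + 2) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]
```

Compress the result to DXT1, upsample it back at display time, and measure the error against the original. That is the number any new lossy scheme has to beat at the same bit rate.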

CRUNCH addresses this subject much better, even doing some actual tests, and it has some much more interesting ideas.

See my previous writings on DXTC in general.

Now some actual rigor :

DXT1 is a 4 bpp (bits per pixel) format. Additional lossless compression can get you below 4 bpp, but getting to 1 bpp is unrealistic. Here I will show the results for compressors of increasing compression : zip, rrlzhlw, and lzma. The "reorg" here is just separating colors and indices; other reorgs do not help if the back end compressor is rrlzhlw or better.

Compressed size in bpp (lower is better) :

                 zip  rrlzhlw    lzma  reorg zip  reorg rrlzhlw  reorg lzma
kodim01.bmp    3.187    2.962   2.786      2.980          2.794       2.683
kodim02.bmp    2.984    2.738   2.574      2.703          2.575       2.484
kodim03.bmp    2.768    2.534   2.373      2.494          2.344       2.254
kodim04.bmp    3.167    2.931   2.751      2.913          2.727       2.625
kodim05.bmp    3.463    3.272   3.155      3.238          3.108       2.999
kodim06.bmp    3.039    2.827   2.626      2.755          2.635       2.514
kodim07.bmp    2.862    2.622   2.489      2.634          2.469       2.366
kodim08.bmp    3.416    3.197   3.073      3.211          3.041       2.936
kodim09.bmp    2.919    2.701   2.497      2.658          2.525       2.400
kodim10.bmp    3.074    2.838   2.644      2.803          2.638       2.525
kodim11.bmp    3.001    2.827   2.655      2.791          2.668       2.542
kodim12.bmp    2.860    2.645   2.446      2.583          2.451       2.343
kodim13.bmp    3.517    3.331   3.182      3.299          3.159       3.042
kodim14.bmp    3.296    3.104   2.940      3.078          2.922       2.803
kodim15.bmp    3.067    2.835   2.675      2.798          2.632       2.547
kodim16.bmp    2.779    2.565   2.362      2.543          2.401       2.276
kodim17.bmp    3.077    2.849   2.659      2.788          2.653       2.544
kodim18.bmp    3.495    3.315   3.181      3.255          3.106       3.025
kodim19.bmp    3.090    2.878   2.685      2.827          2.698       2.571
kodim20.bmp    2.667    2.486   2.302      2.406          2.308       2.220
kodim21.bmp    3.087    2.893   2.700      2.804          2.712       2.582
kodim22.bmp    3.390    3.213   3.046      3.168          3.005       2.901
kodim23.bmp    3.221    2.985   2.826      2.949          2.758       2.646
kodim24.bmp    3.212    2.986   2.860      3.009          2.826       2.724
clegg.bmp      2.987    2.750   2.598      2.712          2.576       2.459
FRYMIRE.bmp    1.502    1.318   1.224      1.417          1.300       1.209
LENA.bmp       3.524    3.332   3.209      3.304          3.136       3.062
MONARCH.bmp    3.280    3.055   2.916      2.999          2.835       2.741
PEPPERS.bmp    3.381    3.200   3.073      3.131          2.962       2.881
SAIL.bmp       3.425    3.234   3.123      3.197          3.047       2.967
SERRANO.bmp    1.601    1.390   1.289      1.484          1.352       1.260
TULIPS.bmp     3.511    3.270   3.164      3.227          3.061       2.974
total         97.849   91.083  86.083     90.158         85.424      82.105
gain                                       7.691          5.659       3.978

What you should be able to see : reorg zip is roughly the same as rrlzhlw (without reorg), and reorg rrlzhlw is about the same as lzma (without reorg). Note that reorg is *slow* ; rrlzhlw without reorg decodes quite a bit faster than zip with reorg, so speed is not a good reason to prefer that. (I suppose simplicity of coding is one advantage it has). The gain from reorging decreases as you go to better back-ends.
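
To put numbers on that last point, the "gain" row is just the no-reorg total minus the reorg total for each back end. A quick sketch that redoes the arithmetic from the "total" row and also expresses it as a percentage (the names `TOTALS`, `NUM_IMAGES` and `reorg_gain` are mine; 32 is simply the number of image rows in the table):

```python
# (total without reorg, total with reorg), copied from the "total" row
TOTALS = {
    "zip": (97.849, 90.158),
    "rrlzhlw": (91.083, 85.424),
    "lzma": (86.083, 82.105),
}
NUM_IMAGES = 32  # 24 kodim images plus 8 others


def reorg_gain(plain: float, reorg: float) -> tuple[float, float]:
    """Return (absolute gain in summed bpp, gain as a percent of the no-reorg total)."""
    gain = round(plain - reorg, 3)
    return gain, 100.0 * gain / plain


for name, (plain, reorg) in TOTALS.items():
    gain, percent = reorg_gain(plain, reorg)
    print(f"{name:8s} mean {plain / NUM_IMAGES:.3f} -> {reorg / NUM_IMAGES:.3f} bpp, "
          f"gain {gain:.3f} ({percent:.2f}%)")
```

The relative gain works out to roughly 7.9% for zip, 6.2% for rrlzhlw and 4.6% for lzma, so the benefit of the reorg shrinks steadily as the back end gets better.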

I should also point out that doing something like reorg lzma is kind of silly. If you really want the maximum compression of DXTC textures, then surely a domain-specific context-based arithmetic coder will do better, and be faster too. (see for example "Lossless Compression of Already Compressed Textures" , Strom and Wennersten ; not a great paper, just the very obvious application of normal compression techniques to ETC (similar to DXTC) texture data).

In the next post I'll ramble a bit about future possibilities.

3 comments:

ryg said...

So this DXT reorg is literally just switching 32-bit words around, right? (At least for DXT1)

How does your layout look, and when does it run? If you do it after the whole texture has been decoded, then sure, that's gonna hurt (because a full texture probably won't fit inside L1 or L2). But what if you do the reorg on smaller chunks, e.g. 16k/32k blocks? That way you can do the reordering before data drops out of the cache. That's usually the big win.

cbloom said...

The problem is in the decoder.

You need to output something like :

C1I1C2I2C3I3...

C = 32 bits of color
I = 32 bits of index

but your decoder is seeing C1C2C3...I1I2I3....

So when you do the output you have to first write out :

C1..C2..C3..

then go back and fill in the blanks with the indices after you get them.

I guess if you don't gather together all the colors and indices but just gather together 16k/32k chunks so the interleaving can be done in cache, that would fix it, but you would also lose some of the win.

cbloom said...

I suppose the full solution would be to have two separate decoder contexts for colors and indices, and then interleave the encoded streams in chunks of 4k or so.

Then the decoder does :

decode 4k worth of colors from context 1
output C1..C2..C3..
decode 4k worth of indices from context 2
output ..I1..I2..I3..

which should be reasonably fast and makes the encoding act as if they were grouped in larger blocks. (eg. by "separate context" I mean LZ matches in the color context go back into the previous chunks of colors, not into the indices)
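
A rough Python sketch of that scheme, with zlib standing in for the back end (an assumption purely for illustration; the container format with a 4-byte length prefix per chunk and the function names are also invented here). Each stream gets its own compressor object, and a sync flush at every chunk boundary makes the chunk decodable on its own while the history carries over to that stream's next chunk:

```python
import struct
import zlib

CHUNK = 4096  # raw bytes of colors (or indices) per chunk; must be a multiple of 4


def encode_interleaved(colors: bytes, indices: bytes) -> bytes:
    """Alternate compressed color chunks and index chunks, one context per stream."""
    assert len(colors) == len(indices) and len(colors) % 4 == 0
    contexts = (zlib.compressobj(), zlib.compressobj())
    out = bytearray()
    for off in range(0, len(colors), CHUNK):
        for enc, src in zip(contexts, (colors, indices)):
            # Z_SYNC_FLUSH emits everything so far but keeps the match window,
            # so later chunks of this stream can still match into earlier ones.
            packed = enc.compress(src[off:off + CHUNK]) + enc.flush(zlib.Z_SYNC_FLUSH)
            out += struct.pack("<I", len(packed)) + packed
    return bytes(out)


def decode_interleaved(stream: bytes) -> bytes:
    """Decode chunk pairs and write DXT1 blocks in C1 I1 C2 I2 ... order."""
    contexts = (zlib.decompressobj(), zlib.decompressobj())
    out = bytearray()
    pos = 0
    while pos < len(stream):
        parts = []
        for dec in contexts:
            (packed_len,) = struct.unpack_from("<I", stream, pos)
            pos += 4
            parts.append(dec.decompress(stream[pos:pos + packed_len]))
            pos += packed_len
        chunk_colors, chunk_indices = parts
        # The interleave only ever touches one small chunk pair at a time.
        for i in range(0, len(chunk_colors), 4):
            out += chunk_colors[i:i + 4]
            out += chunk_indices[i:i + 4]
    return bytes(out)
```

The zlib streams are never finished here, so no checksum gets verified; a real format would want its own integrity check and a proper header.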
