cbloom rants: 04-05-12 - DXT is not enough

4/05/2012

04-05-12 - DXT is not enough - Part 2

As promised last time, a bit of rambling on the future.

1. R-D optimized DXTC. Sticking with DXT encoding, this is certainly the right way to make DXTC smaller. I've been dancing around this idea for a while, but it wasn't until CRUNCH came out that it really clicked.

Imagine you're doing something like DXT1 + LZ. The DXT1 creates a 4 bpp (bits per pixel) output, and the LZ makes it smaller, maybe to 2-3 bpp. But, depending on what you do in your DXT1, you get different output sizes. For example, obviously, if you make a solid color block that has all indices of 0, then that will be smaller after LZ than a more complex block.
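A quick way to see this effect is to pack a few DXT1 blocks by hand and run them through a general-purpose compressor. The sketch below uses zlib's DEFLATE as a stand-in for the LZ back end (an assumption; no specific coder is named here). The 8-byte block layout (two 16-bit endpoints followed by 32 bits of indices) is the standard DXT1 one; the helper names and the random "complex" blocks are made up for illustration.

```python
import random
import zlib

BLOCK_BYTES = 8  # one DXT1 block: two 16-bit endpoints + sixteen 2-bit indices


def solid_block(color565: int) -> bytes:
    """A solid color block: both endpoints equal, every index 0."""
    return color565.to_bytes(2, "little") * 2 + bytes(4)


def noisy_block(rng: random.Random) -> bytes:
    """Arbitrary endpoints and indices, standing in for a 'complex' block."""
    return bytes(rng.randrange(256) for _ in range(BLOCK_BYTES))


def compressed_size(blocks) -> int:
    """Size in bytes after the (assumed) back end, here zlib at level 9."""
    return len(zlib.compress(b"".join(blocks), 9))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 1024
solid_size = compressed_size([solid_block(0x7BEF)] * n)
noisy_size = compressed_size([noisy_block(rng) for _ in range(n)])

print("raw bytes:      ", n * BLOCK_BYTES)
print("solid after LZ: ", solid_size)
print("noisy after LZ: ", noisy_size)
```

Both inputs are the same size before the back end runs; only the choice of block contents changes how many bytes come out.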

That is, we think of DXT1 as being a fixed size encoding, so the optimizers I wrote for it a while ago were just about optimizing quality. But with a back end, it's no longer a fixed size encoding - some choices are smaller than others.

So the first thing you can do is just to consider size (R) as well as quality (D) when making a choice about how to encode a block for DXTC. Often there are many ways of encoding the same data with only very tiny differences in quality, but they may have very different rates.

One obvious case is when a block only has one or two colors in it. The smallest encoding would be to just send those colors as the end points, so that your indices are only 0 or 1 (selecting the ends). Often a better quality encoding can be found by sending end point colors outside the range of the block, and using indices 2 and 3 to select the interpolated 1/3 and 2/3 points.
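For reference, here is a minimal sketch of how the four palette entries of a DXT1 block come out of the two end points in four-color mode. The function names are hypothetical, and the bit replication and integer rounding are assumptions: real hardware decoders differ slightly in how they round the interpolated colors.

```python
def expand565(c: int) -> tuple:
    """Expand a packed RGB565 color to 8 bits per channel by bit replication."""
    r = (c >> 11) & 0x1F
    g = (c >> 5) & 0x3F
    b = c & 0x1F
    return ((r << 3) | (r >> 2), (g << 2) | (g >> 4), (b << 3) | (b >> 2))


def dxt1_palette(c0: int, c1: int) -> list:
    """Palette for four-color mode (c0 > c1 as packed integers).

    Index 0 and 1 select the end points; index 2 and 3 select the
    points 1/3 and 2/3 of the way from end point 0 to end point 1.
    """
    e0, e1 = expand565(c0), expand565(c1)
    one_third = tuple((2 * a + b) // 3 for a, b in zip(e0, e1))
    two_thirds = tuple((a + 2 * b) // 3 for a, b in zip(e0, e1))
    return [e0, e1, one_third, two_thirds]


# White and black end points, with the two interpolated grays between them.
print(dxt1_palette(0xFFFF, 0x0000))
```

Under these assumptions, a flat gray of (170, 170, 170) shows the trade-off: no 5-bit red value expands to exactly 170, so it cannot be sent as an end point without error, but it falls exactly on index 2 between white and black end points. The price is that the indices are no longer just 0 and 1.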

Even beyond that you might want to try encodings of a block that are definitely "bad" in terms of quality, eg. sending a solid color block when the original data was not solid color. This is intentionally introducing loss to get a lower bit rate.

The correct way to do this is with an R-D optimized encoder. The simplest way to do that is using Lagrange multipliers and optimizing the cost J = R + lambda * D.
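As a minimal sketch of that selection step: given a handful of candidate encodings for one block, each with an estimated rate and a measured distortion, pick the one with the lowest J. The candidate names and all the numbers below are invented for illustration, and how R is estimated is left open.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    rate_bits: float   # R: estimated bits after the back end (hypothetical values)
    distortion: float  # D: e.g. sum of squared errors (hypothetical values)


def rd_cost(c: Candidate, lam: float) -> float:
    """Lagrangian cost J = R + lambda * D."""
    return c.rate_bits + lam * c.distortion


def pick_encoding(candidates, lam: float) -> Candidate:
    """Return the candidate with the smallest J for this lambda."""
    return min(candidates, key=lambda c: rd_cost(c, lam))


candidates = [
    Candidate("solid color", rate_bits=8.0, distortion=900.0),
    Candidate("endpoints only", rate_bits=24.0, distortion=120.0),
    Candidate("full optimize", rate_bits=40.0, distortion=100.0),
]

for lam in (0.01, 0.1, 1.0):
    best = pick_encoding(candidates, lam)
    print(f"lambda={lam}: {best.name} (J={rd_cost(best, lam):.1f})")
```

With J written this way, a small lambda makes rate dominate, so the cheap lossy choice wins; a large lambda makes distortion dominate and pushes the choice back toward the pure quality optimizer.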

There are various difficulties with this in practice. For one thing, exposing lambda is unintuitive to clients. Another is that (good) DXTC encoding is already quite slow, so making the optimization metric be J instead of D makes it even slower. With many simple back-end coders (like LZ), it is hard to measure R for a single block. And adaptive back-ends make parallel DXTC solvers difficult.

2. More generally we should ask why are we stuck with trying to optimize DXTC? I believe the answer is the preferred way that DXTC is treated by current hardware. How could we get away from that?

I believe you could solve it by making the texture fetch more programmable. Currently texture fetch (and decode) is one of the few bits of GPUs that is still totally fixed function. DXTC encoded blocks are fetched and decoded into a special cache on the texture unit. This means that DXTC compressed textures can be directly rendered from, and also that rendering with DXTC compressed textures is actually faster than rendering from RGB textures due to the decreased memory bandwidth needs.

What we want is future hardware to make this part of the pipeline programmable. One possibility is like this : Give the texture unit its own little cache of RGB 4x4 blocks that it can fetch from. When you try to read a bit of texture that's not in the cache, it runs a "texture fetch shader" similar to a pixel shader or whatever, which outputs a 4x4 RGB block. So for example a texture fetch shader could decode DXTC. But it could also decode JPEG, or whatever.
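In software terms, the scheme looks something like the sketch below: a small cache of decoded 4x4 RGB blocks, with a user-supplied "fetch shader" callback that runs only on a miss. Everything here is hypothetical (the class name, the LRU eviction, the capacity, the toy checkerboard shader) and it says nothing about how real hardware would schedule the work.

```python
from collections import OrderedDict


class BlockCache:
    """Cache of decoded 4x4 RGB blocks, filled on demand by a fetch shader."""

    def __init__(self, fetch_shader, capacity=64):
        self.fetch_shader = fetch_shader  # (block_x, block_y) -> 4x4 rows of RGB
        self.capacity = capacity
        self.blocks = OrderedDict()       # insertion order doubles as LRU order
        self.misses = 0

    def texel(self, x, y):
        key = (x // 4, y // 4)
        block = self.blocks.get(key)
        if block is None:
            # Miss: run the programmable decode step for this block.
            self.misses += 1
            block = self.fetch_shader(*key)
            self.blocks[key] = block
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        else:
            self.blocks.move_to_end(key)         # mark as recently used
        return block[y % 4][x % 4]


def checker_shader(bx, by):
    """Toy fetch shader: a procedural checkerboard instead of a real decoder."""
    color = (255, 255, 255) if (bx + by) % 2 == 0 else (0, 0, 0)
    return [[color] * 4 for _ in range(4)]


cache = BlockCache(checker_shader, capacity=2)
print(cache.texel(0, 0), cache.texel(3, 3), cache.texel(4, 0), cache.misses)
```

Swapping `checker_shader` for a DXTC or JPEG block decoder would not change the cache at all, which is the appeal of the idea: the sampler only ever sees decoded 4x4 RGB blocks.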

6 comments:

Tom Forsyth said...: The idea of a texel shader has been kicked around ever since pixel shaders were added, but mainly for clever compositing shaders (multiple layers of detail maps, using heightfield normal maps and turning them into vector ones on cache miss, etc). The problem with the DXT path is it still needs to be really fast, and that usually means custom HW with lots of bit-twiddling and tiny palettes. It's hard to make those operations programmable without hurting perf, or without just adding a bunch of area that nothing else on the chip uses. If you can figure out a decode that can be done with small changes to the standard shader pipeline, then there's an interesting path forwards.; April 5, 2012 at 9:31 PM
alex peterson said...: I like the idea of exposing that part of the pipeline. Give developers a way to get in there and let them be responsible if they slow their application down for the sake of some custom decompression etc. Hopefully we'll see this soon!; April 9, 2012 at 8:36 PM
ryg said...: Even just allowing it adds significant complications to the hardware: cache stuff is kinda timing sensitive wrt sizing stuff correctly, and having completely unpredictable latency in the middle makes it hard. Also if you dispatch to variable-latency shader code you won't get around having multiple outstanding misses at the same time with hit-under-miss processing; it gets pretty gnarly fast.

It's especially awkward because GPU shader cores really have horrible latencies for everything, and really need work in large batches to compensate for it. E.g. on Fermi, a single ALU op has about 20 cycles latency. Decoding a texture L1/L2 cache line's worth of data is a shitty granularity to be spinning up a shader core for, and when you do, it's gonna be several hundreds or even thousands of cycles to complete even a very simple "texture fetch shader" - as if you had missed L2 and did the fetch from memory. It just doesn't fit well in the current architectures.

Fundamentally, caches just need a very different design when you have to assume that a double-digit percentage of your cache lines is allocated-but-outstanding for a sustained period.; April 9, 2012 at 9:08 PM
cbloom said...: Yeah, I should clarify a few points :

1. Obviously the most interesting possibility for texel shader is some combination of decoding, compositing, baking lighting, procedural texture generation, etc. Just decompressing is not that compelling.

2. The point of this thought exercise is not to say necessarily "we should have texel shaders" but to see why DXTC is in this preferred position at the moment, and what exactly would we have to do to remove it.

So, @ryg :

My imagination was that you would batch up work somehow. When vertex/pixel shaders try to fetch a texel, if it's not there they get their stack pushed waiting on that result; once you get a bunch of texel requests you run a batch of them.

It's not that different than the normal vertex shading results cache. In the end it's just dynamic programming, storing computed results and running a shader to fill the slots that are needed. Granted the cache scheme needs to be more complex than the current vertex cache.; April 10, 2012 at 9:57 AM
brunogm said...: Can you comment on this texture compressor, which reuses some DXT hardware in a clever way?

http://pholia.tdi.informatik.uni-frankfurt.de/~philipp/publications/ftcpaper-4.p.pdf; April 20, 2012 at 4:44 PM
Anonymous said...: The main thrust of the linked pdf seems to be the idea of encoding one endpoint explicitly and one as delta from it, allowing more precision when the endpoints are near each other.

The above is one of the features of the BC6H texture format introduced in DX 11. So the paper seems to be of little interest now.; April 25, 2012 at 12:55 PM
