1. R-D optimized DXTC. Sticking with DXT encoding, this is certainly the right way to make DXTC smaller. I've been dancing around this idea for a while, but it wasn't until CRUNCH came out that it really clicked.
Imagine you're doing something like DXT1 + LZ. The DXT1 creates a 4 bpp (bits per pixel) output, and the LZ makes it smaller, maybe to 2-3 bpp. But, depending on what you do in your DXT1, you get different output sizes. For example, obviously, if you make a solid color block that has all indices of 0, then that will be smaller after LZ than a more complex block.
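To make that concrete, here is a minimal sketch that packs DXT1-style 8-byte blocks and runs them through a back end. It assumes Python's zlib (DEFLATE) as a stand-in for "LZ"; the helper names and the random "busy" blocks are made up for illustration and are not taken from any real texture.

```python
import random
import struct
import zlib

def pack_dxt1_block(color0, color1, indices):
    """Pack one DXT1 block: two 16-bit RGB565 endpoints, then sixteen 2-bit indices."""
    bits = 0
    for i, idx in enumerate(indices):
        bits |= (idx & 3) << (2 * i)
    return struct.pack("<HHI", color0, color1, bits)  # 8 bytes = 4 bpp for 16 pixels

def compressed_size(blocks):
    """Size in bytes after the back end (zlib here, standing in for any LZ)."""
    return len(zlib.compress(b"".join(blocks), 9))

rng = random.Random(1234)  # fixed seed so the run is repeatable
NUM_BLOCKS = 1024

# Solid color blocks: identical endpoints, all indices 0.
solid_blocks = [pack_dxt1_block(0x7BEF, 0x7BEF, [0] * 16) for _ in range(NUM_BLOCKS)]

# "Busy" blocks: random endpoints and indices, a worst case for the back end.
busy_blocks = [
    pack_dxt1_block(rng.randrange(65536), rng.randrange(65536),
                    [rng.randrange(4) for _ in range(16)])
    for _ in range(NUM_BLOCKS)
]

raw_size = 8 * NUM_BLOCKS
solid_size = compressed_size(solid_blocks)
busy_size = compressed_size(busy_blocks)
print("raw:", raw_size, "solid after LZ:", solid_size, "busy after LZ:", busy_size)
```

Both inputs are the same size before the back end; only the DXT1 choices differ, and that is what changes the final size.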
That is, we think of DXT1 as being a fixed size encoding, so the optimizers I wrote for it a while ago were just about optimizing quality. But with a back end, it's no longer a fixed size encoding - some choices are smaller than others.
So the first thing you can do is just to consider size (R) as well as quality (D) when making a choice about how to encode a block for DXTC. Often there are many ways of encoding the same data with only very tiny differences in quality, but they may have very different rates.
One obvious case is a block that only has one or two colors in it. The smallest encoding would be to just send those colors as the end points, so your indices are only 0 or 1 (selecting the ends). But often a better quality encoding can be found by sending end point colors outside the range of the block, and using indices 2 and 3 to select the interpolated 1/3 and 2/3 points.
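A rough sketch of the quality side of that trade-off, for a block with two gray levels. It assumes RGB565 end points expanded to 8 bits by bit replication and ideal (unrounded) 1/3 and 2/3 interpolation; real decoders round differently, and the color0 > color1 ordering needed for four-color mode is ignored here. All function names are hypothetical.

```python
def expand5(q):
    """Expand a 5-bit channel value to 8 bits by bit replication."""
    return (q << 3) | (q >> 2)

def expand6(q):
    """Expand a 6-bit channel value to 8 bits by bit replication."""
    return (q << 2) | (q >> 4)

def direct_error(a, b, expand, levels):
    """Squared error when each color is sent as its nearest end point (indices 0/1)."""
    values = [expand(q) for q in range(levels)]
    return min((v - a) ** 2 for v in values) + min((v - b) ** 2 for v in values)

def interpolated_error(a, b, expand, levels):
    """Best squared error when the colors sit on the 1/3 and 2/3 points (indices 2/3).

    Brute force over all end point pairs for one channel. With the indices fixed,
    the channels are independent, so each one can be searched on its own.
    """
    values = [expand(q) for q in range(levels)]
    best = None
    for e0 in values:
        for e1 in values:
            p2 = (2 * e0 + e1) / 3.0  # idealized 1/3 point
            p3 = (e0 + 2 * e1) / 3.0  # idealized 2/3 point
            err = (p2 - a) ** 2 + (p3 - b) ** 2
            if best is None or err < best:
                best = err
    return best

# (expand function, number of levels) for the R, G, B channels of RGB565.
CHANNELS = [(expand5, 32), (expand6, 64), (expand5, 32)]

def block_errors(color_a, color_b):
    """Per-pixel-pair squared error for both encodings, summed over channels."""
    direct = 0.0
    interp = 0.0
    for (expand, levels), a, b in zip(CHANNELS, color_a, color_b):
        direct += direct_error(a, b, expand, levels)
        interp += interpolated_error(a, b, expand, levels)
    return direct, interp

direct_err, interp_err = block_errors((103, 103, 103), (111, 111, 111))
print("end points only:", direct_err, "interpolated:", interp_err)
```

For this pair of grays the interpolated encoding has lower error, but it uses indices 2 and 3 and different end points, which is exactly the kind of choice that can cost more after the back end.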
Even beyond that you might want to try encodings of a block that are definitely "bad" in terms of quality, eg. sending a solid color block when the original data was not solid color. This is intentionally introducing loss to get a lower bit rate.
The correct way to do this is with an R-D optimized encoder. The simplest way to do that is using Lagrange multipliers and optimizing the cost J = R + lambda * D.
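As a minimal sketch of that choice, assuming you already have a rate and a distortion for each candidate encoding of a block. The candidate names and numbers below are invented purely to show how lambda moves the decision; they are not measurements.

```python
def choose_encoding(candidates, lam):
    """Pick the candidate minimizing J = R + lambda * D.

    candidates: list of (name, rate, distortion) tuples.
    lam: the Lagrange multiplier; larger values weight distortion more heavily.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Illustrative numbers only: rate in bits after the back end, distortion as SSE.
candidates = [
    ("solid color",         20, 900),
    ("end points only",     34, 120),
    ("full quality search", 52,  40),
]

pick_small = choose_encoding(candidates, 0.0)[0]    # rate is all that matters
pick_middle = choose_encoding(candidates, 0.05)[0]  # a balance of the two
pick_best = choose_encoding(candidates, 1.0)[0]     # distortion dominates
print(pick_small, "|", pick_middle, "|", pick_best)
```

With the cost written as R + lambda * D, lambda = 0 always picks the smallest encoding and a large lambda converges on the plain quality optimizer.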
There are various difficulties with this in practice. For one thing, exposing lambda is unintuitive to clients. Another is that (good) DXTC encoding is already quite slow, so making the optimization metric be J instead of D makes it even slower. With many simple back-end coders (like LZ), it's hard to measure R for a single block. And adaptive back-ends make parallel DXTC solvers difficult, because the rate of one block depends on the blocks coded before it.
2. More generally we should ask why we are stuck with trying to optimize DXTC. I believe the answer is the preferential treatment that DXTC gets from current hardware. How could we get away from that?
I believe you could solve it by making the texture fetch more programmable. Currently texture fetch (and decode) is one of the few bits of GPUs that is still totally fixed function. DXTC encoded blocks are fetched and decoded into a special cache on the texture unit. This means that DXTC compressed textures can be rendered from directly, and also that rendering with DXTC compressed textures is actually faster than rendering from RGB textures, due to the decreased memory bandwidth needs.
What we want is future hardware to make this part of the pipeline programmable. One possibility is like this : Give the texture unit its own little cache of RGB 4x4 blocks that it can fetch from. When you try to read a bit of texture that's not in the cache, it runs a "texture fetch shader" similar to a pixel shader or whatever, which outputs a 4x4 RGB block. So for example a texture fetch shader could decode DXTC. But it could also decode JPEG, or whatever.
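A toy software model of that idea, just to pin down the control flow. Nothing here corresponds to real hardware or a real API: the class, the LRU eviction policy, the cache capacity, and the example shader are all assumptions made for illustration.

```python
from collections import OrderedDict

class TextureUnit:
    """Toy model: a small cache of decoded 4x4 RGB blocks, filled by a fetch shader."""

    def __init__(self, fetch_shader, capacity=64):
        self.fetch_shader = fetch_shader  # callable (block_x, block_y) -> 4x4 rows of RGB
        self.capacity = capacity          # max number of cached blocks (assumed)
        self.cache = OrderedDict()        # LRU order is an assumption of this sketch
        self.misses = 0

    def sample(self, x, y):
        """Return the RGB texel at (x, y), running the fetch shader on a cache miss."""
        key = (x // 4, y // 4)
        block = self.cache.get(key)
        if block is None:
            self.misses += 1
            block = self.fetch_shader(*key)  # could decode DXTC, JPEG, or whatever
            self.cache[key] = block
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used block
        else:
            self.cache.move_to_end(key)
        return block[y % 4][x % 4]

def checker_fetch_shader(block_x, block_y):
    """Hypothetical fetch shader: a procedural checkerboard, one color per block."""
    value = 255 if (block_x + block_y) % 2 == 0 else 0
    return [[(value, value, value)] * 4 for _ in range(4)]

unit = TextureUnit(checker_fetch_shader, capacity=2)
first = unit.sample(0, 0)        # miss: runs the shader for block (0, 0)
same_block = unit.sample(3, 3)   # hit: same 4x4 block
neighbor = unit.sample(4, 0)     # miss: block (1, 0)
print(first, same_block, neighbor, "misses:", unit.misses)
```

The point of the model is that the rest of the pipeline only ever sees decoded 4x4 RGB blocks; what format sits behind the fetch shader is up to the shader.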