Sean reminded me that he tackled this before; I dunno if he has any notes about it on the public net, but if so he can link them. His basic idea was to do a full solve for the entire down-sampled image. It's quite simple if you think about it. Consider the case of 2X up & down sampling. The bilinear filter upsample will make a high res image where each pixel is a simple linear combo of 4 low res pixels. You take the L2 error :
E = Sum[all high res pixel] ( Original - Upsampled ) ^2
For Sean's full solution approach, you set Upsampled = Bilinear_Upsample( X) , and just solve this for X without any assumption of how X is made from Original. For an N-pixel low res image you have 4N error terms, so it's plenty dense (you could also artificially regularize it more by starting with a low res image that's equal to the box down-sample, and then solve for the deltas from that, and add an extra "Tikhonov" regularization term that presumes small deltas - this would fix any degenerate cases).
I didn't do that. Instead I assumed that I want a discrete local linear filter and solved for what it should be.
A discrete local linear filter is just a bunch of coefficients. It must be symmetric, and it must sum to 1.0 to be mean-preserving (a flat source should be reproduced exactly). Hence it has the form {C2,C1,C0,C0,C1,C2} with C0+C1+C2 = 1/2 (this example has two free coefficients). Obviously the 1-wide case must be {0.5,0.5} ; then you have {C1,0.5-C1,0.5-C1,C1} , etc., with as many taps as you want. You apply it horizontally and then vertically. (In general you could consider asymmetric filters, but I assume H & V use the same coefficients.)
A 1d application of the down-filter is like :
L_n = Sum[k] { C_k * [ H_(2*n-k) + H_(2*n+1+k) ] }
That is : low pixel n = filter coefficients times high res samples centered at (2n + 0.5), going out in both directions.
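In code, the 1D down-filter looks like this. I use the normalization convention of the coefficient arrays later in the post (coefficients sum to 1.0, each weighting the average of its symmetric pair) and clamp-to-edge at the boundaries; both conventions are my assumptions, not taken from the original code:

```c
#include <assert.h>
#include <math.h>

/* clamp-to-edge indexing for the high res signal */
static double edge(const double *x, int n, int i)
{
    if (i < 0) i = 0;
    if (i >= n) i = n - 1;
    return x[i];
}

/* 1D down-filter: low pixel n = Sum_k C_k * average of the symmetric
   high res pair ( H[2n-k] , H[2n+1+k] ).  With this convention the
   stored coefficients sum to 1.0, matching the c_downCoef arrays
   later in the post. */
static void downsample_1d(const double *high, int nhigh,
                          double *low, int nlow,
                          const double *coef, int ncoef)
{
    for (int n = 0; n < nlow; n++) {
        double acc = 0.0;
        for (int k = 0; k < ncoef; k++)
            acc += coef[k] * 0.5 * ( edge(high, nhigh, 2*n - k) +
                                     edge(high, nhigh, 2*n + 1 + k) );
        low[n] = acc;
    }
}
```

With coef = {1.0} this reduces to the plain box downsample.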
Then the bilinear upsample is :
U_(2n) = (3/4) * L_n + (1/4) * L_(n-1)
U_(2n+1) = (3/4) * L_n + (1/4) * L_(n+1)
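The centered-tap bilinear upsample above, as code (the clamp-to-edge rule at the ends is an assumption; the post doesn't specify boundary handling):

```c
#include <assert.h>
#include <math.h>

/* Centered-tap bilinear 2X upsample in 1D :
   U[2n]   = (3/4) L[n] + (1/4) L[n-1]
   U[2n+1] = (3/4) L[n] + (1/4) L[n+1]
   with clamp-to-edge at the ends. */
static void upsample_1d(const double *low, int nlow, double *high)
{
    for (int n = 0; n < nlow; n++) {
        double lm = low[n > 0 ? n - 1 : 0];
        double lp = low[n < nlow - 1 ? n + 1 : nlow - 1];
        high[2*n]     = 0.75 * low[n] + 0.25 * lm;
        high[2*n + 1] = 0.75 * low[n] + 0.25 * lp;
    }
}
```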
Again we just make a squared error term like the above :
E = Sum[n] ( H_n - U_n ) ^2
Substitute the form of L_n into U_n and expand so you just have a matrix equation in terms of H_n and C_k. Then do a solve for the C_k. You can do a least-squares solve here, or you can just directly solve it because there are generally few C's (the matrix is # of C's by # of pixels).
Here's how the error varies with number of free coefficients (zero free coefficients means a pure box downsample) :
r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp
rmse : 15.5437  psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp
rmse : 13.5138  psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp
rmse : 13.2124  psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp
rmse : 13.0839  psnr : 25.8302

You can see there's a big jump from 0 to 1 free coefficient, but only gradually increasing quality after that (though it does keep getting better, as it should).
Two or three free terms (which means a 6 or 8 tap filter) seems like the ideal width to me - wider than that and you're getting very nonlocal which means ringing and overfitting. Optimized on all my test images the best coefficients I get are :
// 8 taps :
static double c_downCoef[4] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };
// 6 taps :
static double c_downCoef[3] = { 1.25 , 0.125, - 0.375 };
(The 6-tap one was obviously so close to those perfect fractions that I just manually rounded it; I assume that if I solved this analytically, that's what I would get. It's not obvious to me what the 8-tap one would be.)
Now, how do these static coefficients compare to doing the lsqr fit to make coefficients per image? They give 99% of the benefit. For example :
// solve :
lena.512.bmp : doing solve exact on 3 x 524288
{ 1.342242526 , -0.028240414 , -0.456030369 , 0.142028257 } // rmse : 10.042138
// static fit :
lena.512.bmp : // rmse : 10.116388
------------
// static fit :
clegg.bmp : // rgb rmse : 50.168 , gray rmse : 40.506
// solve :
fitting : clegg.bmp : doing lsqr on 3 x 1432640 , c_lsqr_damping = 0.010000
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 } // rgb rmse : 50.128 , gray rmse : 40.472
So it seems to me this is in fact a very simple and high quality way to down-sample to make the best reproduction after bilinear upsampling.
I'm not even gonna touch the issue of the [0,255] range clamping or the fact that your low res image should actually be considered discrete, not continuous.
ADDENDUM : it just occurred to me that you might do the bilinear 2X upsampling using offset taps instead of centered taps. That is, centered taps reconstruct like :
+---+      +-+-+
|   |      | | |
|   |  ->  +-+-+
|   |      | | |
+---+      +-+-+

That is, the area of four high res pixels lies directly on one low res pixel. Offset taps do :
+---+      | | |
|   |     -+-+-
|   |  ->  | | |
|   |     -+-+-
+---+      | | |

That is, the center of a low res pixel corresponds directly to a high res pixel.
With centered taps, the bilinear upsample weights in 1d are always (3/4,1/4) then (1/4,3/4) , (so in 2d they are 9/16, etc.)
With offset taps, the weights in 1d are (1) (1/2,1/2) (1) etc... that is, one pixel is just copied and the tweeners are averages.
Offset taps have the advantage that they aren't so severely variance decreasing. Offset taps should use a single-center down-filter of the form :
{C2,C1,C0,C1,C2}
(instead of {C2,C1,C0,C0,C1,C2} ).
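A 1D sketch of the offset-tap pair, with the same hedges as before (clamp-to-edge boundaries and pair-averaged coefficients summing to 1.0 are my assumptions, consistent with the 3-coefficient offset solve below summing to ~1):

```c
#include <assert.h>
#include <math.h>

/* clamp-to-edge indexing */
static double edge(const double *x, int n, int i)
{
    if (i < 0) i = 0;
    if (i >= n) i = n - 1;
    return x[i];
}

/* Single-center down-filter {C2,C1,C0,C1,C2} : the center tap weights
   the coincident high res pixel, each off-center coefficient weights
   the average of its symmetric pair; stored coefficients sum to 1.0. */
static void down1d_offset(const double *h, int nh, double *l, int nl,
                          const double *c, int nc)
{
    for (int n = 0; n < nl; n++) {
        double acc = c[0] * edge(h, nh, 2*n);
        for (int k = 1; k < nc; k++)
            acc += c[k] * 0.5 * (edge(h, nh, 2*n - k) + edge(h, nh, 2*n + k));
        l[n] = acc;
    }
}

/* Offset-tap bilinear 2X upsample: the coincident pixel is copied,
   the tweeners are averages. */
static void up1d_offset(const double *l, int nl, double *h)
{
    for (int n = 0; n < nl; n++) {
        h[2*n]     = l[n];
        h[2*n + 1] = 0.5 * (l[n] + edge(l, nl, n + 1));
    }
}
```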
My tests show single-center/offset up/down is usually slightly worse than symmetric/centered, and occasionally much better. On natural/smooth images (such as the entire Kodak set) it's slightly worse. Picking one at random :
symmetric :
kodim05.bmp : { 1.259980122 , 0.100375561 , -0.378468204 , 0.018112521 } // rmse : 25.526521
offset :
kodim05.bmp : { 0.693510045 , 0.605009745 , -0.214854612 , -0.083665178 } // rgb rmse : 26.034
That pattern holds for all of them. However, on weird images offset taps can be better, for example :
symmetric :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 } // rgb rmse : 50.128 , gray rmse : 40.472
offset :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 0.705825115 , 0.561705835 , -0.267530949 } // rgb rmse : 45.185 , gray rmse : 36.300
so ideally you would choose the better of the two. If you're decompressing in a pixel shader, you need another parameter for whether or not to offset your sampling UVs by 0.5 of a pixel.
ADDENDUM : I got Humus working with a KLT color transform. You just do the matrix transform in the shader after fetching "YUV" (not really YUV any more). It helps on the bad cases, but still doesn't make it competitive; it's better just to go with DXT1 or DXT5-YCoCg in those cases. For example :
On a pure red & blue texture :
Humus YCoCg :
rmse : 11.4551 , psnr : 26.9848
ssim : 0.9529 , perc : 80.3841%

Humus KLT with forced Y = grey :
KLT : Singular values : 56.405628,92.022781,33.752548
KLT : 0.577350,0.577350,0.577350
KLT : -0.707352,0.000491,0.706861
KLT : 0.407823,-0.816496,0.408673
rmse : 11.4021 , psnr : 27.0251
ssim : 0.9508 , perc : 79.9545%

Humus KLT :
KLT : Singular values : 93.250313,63.979282,0.230347
KLT : -0.550579,0.078413,0.831092
KLT : -0.834783,-0.051675,-0.548149
KLT : -0.000035,-0.995581,0.093909
rmse : 5.6564 , psnr : 33.1140
ssim : 0.9796 , perc : 87.1232%
(note the near perfect zero in the last singular value, as it should be)

DXT1 :
rmse : 3.0974 , psnr : 38.3450
ssim : 0.9866 , perc : 89.5777%

DXT5-YCoCg :
rmse : 2.8367 , psnr : 39.1084
ssim : 0.9828 , perc : 88.1917%

So, obviously a big help, but not enough to be competitive. Humus also craps out pretty badly on some images that have single pixel checkerboard patterns (again, any downsampling format, such as JPEG, will fail on the same cases). Not really worth it to mess with the KLT; better to just support one of the other formats as a fallback.
One thing I'm not sure about is just how bad the cost of the two texture fetches is these days.