# cbloom rants

## 6/17/2009

### 06-17-09 - Inverse Box Sampling - Part 2

Okay, in Part 1.5 I asked about the downsample that was the best inverse of bilinear upsampling. I have a solution that pleases me.

Sean reminded me that he tackled this before; I dunno if he has any notes about it on the public net, he can link them. His basic idea was to do a full solve for the entire down-sampled image. It's quite simple if you think about. Consider the case of 2X up & down sampling. The bilinear filter upsample will make a high res image where each pixel is a simple linear combo of 4 low res. You take the L2 error :

E = Sum[all high res pixel] ( Original - Upsampled ) ^2

For Sean's full solution approach, you set Upsampled = Bilinear_Upsample( X) , and just solve this for X without any assumption of how X is made from Original. For an N-pixel low res image you have 4N error terms, so it's plenty dense (you could also artificially regularize it more by starting with a low res image that's equal to the box down-sample, and then solve for the deltas from that, and add an extra "Tikhonov" regularization term that presumes small deltas - this would fix any degenerate cases).

I didn't do that. Instead I assumed that I want a discrete local linear filter and solved for what it should be.

A discrete local linear filter is just a bunch of coefficients. It must be symmetric, and it must sum to 1.0 to be mean-preserving (flat source should reproduce flat exactly). Hence it has the form {C2,C1,C0,C0,C1,C2} with C0+C1+C2 = 1/2. (this example has two free coefficients). Obviously the 1-wide case must be {0.5,0.5} , then you have {C1,0.5-C1,0.5-C1,C1} etc. as many taps as you want. You apply it horizontally and then vertically. (in general you could consider asymetric filters, but I assume H & V use the same coefficients).

A 1d application of the down-filter is like :

L_n = Sum[k] { C_k * [ H_(2*n-k) + H_(2*n+1+k) ] }

That is : Low pixel n = filter coefficients times High res samples centered at (2*n * 0.5) going out both directions.

Then the bilinear upsample is :

U_(2n) = (3/4) * L_n + (1/4) * L_(n-1)

U_(2n+1) = (3/4) * L_n + (1/4) * L_(n+1)

Again we just make a squared error term like the above :

E = Sum[n] ( H_n - U_n ) ^2

Substitute the form of L_n into U_n and expand so you just have a matrix equation in terms of H_n and C_k. Then do a solve for the C_k. You can do a least-squares solve here, or you can just directly solve it because there are generally few C's (the matrix is # of C's by # of pixels).

Here's how the error varies with number of free coefficients (zero free coefficients means a pure box downsample) :

```
r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp  rmse : 15.5437 psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp  rmse : 13.5138 psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp  rmse : 13.2124 psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp  rmse : 13.0839 psnr : 25.8302

```
you can see there's a big jump from 0 to 1 but then only gradually increasing quality after that (though it does keep getting better as it should).

Two or three free terms (which means a 6 or 8 tap filter) seems like the ideal width to me - wider than that and you're getting very nonlocal which means ringing and overfitting. Optimized on all my test images the best coefficients I get are :

```
// 8 taps :

static double c_downCoef[4] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };

// 6 taps :

static double c_downCoef[3] = { 1.25 , 0.125, - 0.375 };

```
(the 6-tap one was obviously so close to those perfect fractions that I just manually rounded it; I assume that if I solved this analytically that's what I would get. The 8-tap one is not so obvious to me what it would be).

Now, how do these static ones compare to doing the lsqr fit to make coefficients per image ? They're 99% of the benefit. For example :

```
// solve :
lena.512.bmp : doing solve exact on 3 x 524288
{ 1.342242526 , -0.028240414 , -0.456030369 , 0.142028257 }  // rmse : 10.042138

// static fit :
lena.512.bmp :  // rmse : 10.116388

------------

// static fit :
clegg.bmp :  // rgb rmse : 50.168 , gray rmse : 40.506

// solve :
fitting : clegg.bmp : doing lsqr on 3 x 1432640 , c_lsqr_damping = 0.010000
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

```
So it seems to me this is in fact a very simple and high quality way to down-sample to make the best reproduction after bilinear upsampling.

I'm not even gonna touch the issue of the [0,255] range clamping or the fact that your low res image should actually be considered discrete, not continuous.

ADDENDUM : it just occured to me that you might do the bilinear 2X upsampling using offset-taps instead of centered taps. That is, centered taps reconstruct like :

```
+---+    +-+-+
|   |    | | |
|   | -> +-+-+
|   |    | | |
+---+    +-+-+

```
That is, the area of four high res pixels lies directly on one low res pixel. Offset taps do :
```
+---+     | |
|   |    -+-+-
|   | ->  | |
|   |    -+-+-
+---+     | |

```
that is, the center of a low res pixel corresponds directly to a high res pixel.

With centered taps, the bilinear upsample weights in 1d are always (3/4,1/4) then (1/4,3/4) , (so in 2d they are 9/16, etc.)

With offset taps, the weights in 1d are (1) (1/2,1/2) (1) etc... that is, one pixel is just copied and the tweeners are averages.

Offset taps have the advantage that they aren't so severely variance decreasing. Offset taps should use a single-center down-filter of the form :

{C2,C1,C0,C1,C2}

My tests show single-center/offset up/down is usually slightly worse than symmetric/centered , and occasionally much better. On natural/smooth images (such as the entire Kodak set) it's slightly worse. Picking one at random :

```
symmetric :
kodim05.bmp : { 1.259980122 , 0.100375561 , -0.378468204 , 0.018112521 }   // rmse : 25.526521

offset :
kodim05.bmp : { 0.693510045 , 0.605009745 , -0.214854612 , -0.083665178 }  // rgb rmse : 26.034

```
that pattern holds for all. However, on weird images it can be better, for example :
```
symmetric :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 1.321164423 , 0.002458499 , -0.381711250 , 0.058088329 }  // rgb rmse : 50.128 , gray rmse : 40.472

offset :
c:\src\testproj>Release\TestProj.exe t:\test_images\color\bragzone\clegg.bmp f
{ 0.705825115 , 0.561705835 , -0.267530949 }  // rgb rmse : 45.185 , gray rmse : 36.300

```
so ideally you would choose the best of the two. If you're decompressing in a pixel shader you need another parameter for whether to offset your sampling UV's by 0.5 of a pixel or not.

ADDENDUM : I got Humus working with a KLT color transform. You just do the matrix transform in the shader after fetching "YUV" (not really YUV any more). It helps on the bad cases, but still doesn't make it competitive. It's better just to go with DXT1 or DXT5-YCoCg in those cases. For example :

On a pure red & blue texture :

```
Humus YCoCg :

rmse : 11.4551 , psnr : 26.9848
ssim : 0.9529 , perc : 80.3841%

Humus KLT with forced Y = grey :

KLT : Singular values : 56.405628,92.022781,33.752548
KLT : 0.577350,0.577350,0.577350
KLT : -0.707352,0.000491,0.706861
KLT : 0.407823,-0.816496,0.408673

rmse : 11.4021 , psnr : 27.0251
ssim : 0.9508 , perc : 79.9545%

Humus KLT  :

KLT : Singular values : 93.250313,63.979282,0.230347
KLT : -0.550579,0.078413,0.831092
KLT : -0.834783,-0.051675,-0.548149
KLT : -0.000035,-0.995581,0.093909

rmse : 5.6564 , psnr : 33.1140
ssim : 0.9796 , perc : 87.1232%

(note the near perfect zero in the last singular value, as it should be)

DXT1 :

rmse : 3.0974 , psnr : 38.3450
ssim : 0.9866 , perc : 89.5777%

DXT5-YCoCg :

rmse : 2.8367 , psnr : 39.1084
ssim : 0.9828 , perc : 88.1917%

```
So, obviously a big help, but not enough to be competitive. Humus also craps out pretty bad on some images that have single pixel checkerboard patterns. (again, any downsampling format, such as JPEG, will fail on the same cases). Not really worth it to mess with the KLT, better just to support one of the other formats as a fallback.

One thing I'm not sure about is just how bad the two texture fetches is these days.

Anonymous said...

It's probably worth comparing your result to what you would get by just applying any of the standard mipmapping filters without paying any attention to the bilerp reconstruction, to have some baseline.

I never published anything since I wasn't actually able to get any interesting results.

I think the interesting aspect of the full solve is the opportunity to introduce non-linearities; if you're doing linear programming to solve the linear least-squares problem, you can introduce additional linear constraints, e.g. you can attempt to limit a given output value to a particular range, say the max range of N of the neighboring source pixels, to try to limit ringing, or similarly you could apply an output constraint x_n0 < xn1 if your analysis indicates that the pixels in that area should be monotonically increasing.

I'm not sure any of this will improve PSNR--ringing tends to occur in the solved output because by displacing a value "too far" the midpoint value tends to come out better--so it's more likely in the service of perceptual quality.

cbloom said...

Comparison to standard filters :

r:\>bmputil mse lenag.256.bmp bilinear_down_up_0.bmp rmse : 15.5437 psnr : 24.3339

r:\>bmputil mse lenag.256.bmp bilinear_down_up_1.bmp rmse : 13.5138 psnr : 25.5494

r:\>bmputil mse lenag.256.bmp bilinear_down_up_2.bmp rmse : 13.2124 psnr : 25.7454

r:\>bmputil mse lenag.256.bmp bilinear_down_up_3.bmp rmse : 13.0839 psnr : 25.8302

BoxFilter
filter 2 : lenag.256.bmp : // rgb rmse : 15.544 , gray rmse : 15.544

CubicFilter
filter 3 : lenag.256.bmp : // rgb rmse : 15.919 , gray rmse : 15.919

MitchellFilter B = 1/3
filter 4 : lenag.256.bmp : // rgb rmse : 14.956 , gray rmse : 14.956

MitchellFilter B = 0
filter 5 : lenag.256.bmp : // rgb rmse : 14.527 , gray rmse : 14.527

GaussianFilter
filter 6 : lenag.256.bmp : // rgb rmse : 15.656 , gray rmse : 15.656

SincFilter
filter 7 : lenag.256.bmp : // rgb rmse : 14.623 , gray rmse : 14.623

Cyan said...

I guess this method is applicable to the "downsampled Co/Cg channels" of Humus texture format, since they are intended to be stretched 2x using bilinear filter.

I wonder if it is suitable for real-time downsizing though.
Simply averaging the 2x2 box is a trivial task. Your results with n=1 seems to prove their is a significant quality gain to look for, but is it achievable in real-time ?

cbloom said...

Well, the simple { 1.25 , 0.125, - 0.375 } filter certainly can be done in real time, and is a decent win. An SSE/MMX implementation should be super fast. It depends exactly where you want to be on the quality/speed tradeoff.

Cyan said...

I'm more on the "speed" part than quality, although it's always possible to borrow cheap quality improvements, as long as RT is not jeopardized.

So on first sight, I tend to agree with your suggestion.

Now, i understand your formula in the context of 1D, but it's less clear for me how it translates into 2D.

If i do understand correctely, for n=2, it means it will be necessary to sample a full 6x6 block to calculate the value of the downsampled inner 2x2 block.
This translates into an equation of 36 input, instead of 4 (for the basic case). At the very least, it means 9x slower, probably more since the coefficients are not as simple.

That's why i'm also interested in the more simple n=1, with just 2 coefficients. It would translate into "only" 16 inputs, still 4x slower, but a kind of middle ground compared to n=2. So if n=2 proves too slow, maybe n=1 will be good enough.
However, the "static" coefficients for n=1 are not present in your article, if i'm not mistaken.

cbloom said...

The way to do 2d filtering is always to do 1d filtering twice. Ryg blog has some good stuff on fast filtering.

You want to be doing MMX loads of 8 pixels at once, then you're just sliding a window along the row doing muls and accumulates.

Cyan said...

The ryg's articles are very interesting to read, although some of the tricks explained are only applicable to the "blur" filter he exposes (especially the "filter in constant time"), and seems unlikely to benefit inverse box sampling. But that's nonetheless very instructive.

By the way, are the static coefficients for n=1 {5/4, -1/4} ?

cbloom said...

This is what's in my old code, but I don't vouch for the 1 term solution. For some reason I recall thinking that 2 terms produced much better results and was the minimum, but I don't recall why exactly I though that.

(by "term" I mean the number of free variables, so "1 term" is actually a 4-tap discrete filter; symmetry and sum determine the remaining taps)

#if DOWNSAMPLE_TERMS == 3

//static double c_downCoef[DOWNSAMPLE_TERMS+1] = { 1.26746, 0.08675, -0.372, 0.01779 };
static double c_downCoef[DOWNSAMPLE_TERMS+1] = { 1.31076, 0.02601875, -0.4001217, 0.06334295 };

#elif DOWNSAMPLE_TERMS == 2

static double c_downCoef[DOWNSAMPLE_TERMS+1] = { 1.25 , 0.125, - 0.375 };

#elif DOWNSAMPLE_TERMS == 1

static double c_downCoef[DOWNSAMPLE_TERMS+1] = { 0.95, 0.05 };

#else

static double c_downCoef[DOWNSAMPLE_TERMS+1] = { 1.0 };

#endif