12-11-10 - Perceptual Notes of the Day

You have to constantly be aware of how you're pooling data. There's no a-priori reason to prefer one way or the other; when a human looks at an image do they see the single worst spot, or do they see the average? If you show someone two images, one with a single very bad error, and another with many small errors, which one do they say is worse?

There are also various places to remap scales. Say you have measured a quantity Q in two images, one way to make an error metric out of that is :

err = Lp-norm {  remap1(  remap2( Q1 ) - remap2( Q2 ) )  }

Note that there are two different remapping spots - remap1 converts the delta into perceptually linear space, while remap2 converts Q into space where it can be linearly delta'd. And then of course you have to tweak the "p" in Lp-norm.

You're faced with more problems as soon as you have more quantities that you wish to combine together to make your metric. Combine them arithmetically or geometrically? Combine before or after spatial pooling? Weight each one in the combination? Combine before or after remapping? etc.

This is something general that I think about all the time, but it also leads into these notes :

For pixel values you have to think about how to map them (remap2). Do you want light linear? (gamma decorrected) or perhaps "perceptually uniform" ala CIE L ?

The funny thing is that perception is a mapping curve that's opposing to gamma correction. That is, the monitor does something like signal ^ 2.2 to make light. But then the eye does something like light ^ (1/3) (or perhaps more like log(light)) to make mental perception. So the net mapping of pixel -> perceptuion is actually pretty close to unity.

On the subject of Lp norms, I wrote this in email (background : a while ago I found that L1 error was much better for my video coder than L2) :

I think L1 was a win mainly because it resulted in better R/D bit distribution.

In my perceptual metric right now I'm using L2 (square error) + masking , which appears to be better still.

I think that basically L1 was good because if you use L2 without masking, then you care about errors in noisy parts of the image too much.

That is, say your image is made of two blocks, one is very smooth, one has very high variation (edges and speckle).

If you use equal quantizers, the noisy block will have a much larger error.

With L2 norm, that means R/D will try very hard to give more bits to the noisy block, because changing 10 -> 9 helps more than changing 2 -> 1.

With L1 norm it balances out the bit distribution a bit more.

But L2-masked norm does the same thing and appears to be better than L1 norm.

Lastly a new metric :

PSNR-HVS-T is just "psnr-hvs" but with thresholds. (*3) In particular, if PSNR-HVS is this very simple pseudo-code :

on all 8x8 blocks
dct1,dct2 = dct of the blocks, scaled by 1/ jpeg quantizers

for i from 0 -> 63 :
 err += square( dct1[i] - dct2[i] )

then PSNR-HVS-T is :

on all 8x8 blocks
dct1,dct2 = dct of the blocks, scaled by 1/ jpeg quantizers

err += w0 * square( dct1[0] - dct2[0] )

for i from 1 -> 63 :
  err += square( threshold( dct1[i] - dct2[i] , T ) )

the big difference between this and "PSNR-HVS-M" is that T is just a constant, not the complex adaptive mask that they compute (*1).

threshold is :

threshold(x,t) = MAX( (x - T), 0 )

The surprising thing is that PSNR-HVS-T is almost as good as PSNR-HVS-M , which is in turn basically the best perceptual metric (*2) that I've found yet. Results :

PSNR-HVS-T Spearman : 0.935155

fit rmse :
PSNR-HVS   : 0.583396  [1,2]
VIF        : 0.576115  [1]
PSNR-HVS-M : 0.564279  [1]
IW-MS-SSIM : 0.563879  [2]
PSNR-HVS-T : 0.557841
PSNR-HVS-M : 0.552151  [2]

In particular, it beats the very complex and highly tweaked out IW-MS-SSIM. While there are other reasons to think that IW-MS-SSIM is a strong metric (in particular, the maximum difference competition work), the claims that they made in the papers about superiority of fitting to MOS data appear to be bogus. A much simpler metric can fit better.

footnotes :

*1 = note on the masking in PSNR-HVS-M : The masked threshold they compute is proportional to the L2 norm of all the DCT AC coefficients. That's very simple and unsurprising, it's a simple form of variance masking which is well known. The funny thing they do is multiply this threshold by a scale :

scale = ( V(0-4,0-4) + V(0-4,4-8) + V(4-8,0-4) + V(4-8,4-8) ) / ( 4*V(0-8,0-8) )

V(x,y) = Variance of pixels in range [x,y] in the block

threshold *= sqrt( scale );

that is, divide the block into four quadrants; take the variance within each quadrant and divide by the variance of the whole block. What this "scale" does is reduce the threshold in areas where the AC activity is caused by large edges. Our basic threshold was set by the AC activity, but we don't want the important edge shapes to get crushed, so we're undoing that threshold when it corresponds to major edges.

Consider various extremes; if the block has no large-scale structure, only random variation all over, then the variance within each quad will equal the variance of the whole, which will make scale ~= 1. On the other hand, if there's a large-scale edge, then the variance on either side of the edge may be quite small, so at least one of the quads has a small variance, and the result is that the threshold is reduced.

Two more subtle follows up to this :

1A : Now, the L2 sum of DCT AC coefficients should actually be proportional to the variance of the block ! That means instead of doing

 threshold^2 ~= (L2 DCT AC) * (Variance of Quads) / (Variance of Blocks)

we could just use (Variance of Quads) directly for the threshold !! That is true, however when I was saying "DCT" earlier I meant "DCT scaled by one over JPEG quantizer" - that is, the DCT has CSF coefficients built into it, which makes it better than just using pixel variances.

1B : Another thought you might have is that we are basically using the DCT L2 AC but then reducing it for big edges. But the DCT itself is actually a decent edge detector, the values are right there in DCT[0,1] and DCT[1,0] !! So the easiest thing would be to just not include those terms in the sum, and then don't do the scale adustment at all!

That is appealing, but I haven't found a formulation of that which performs as well as the original complex form.

(*2) = "best" defined as able to match MOS data, which is a pretty questionable measure of best, however it is the same one that other authors use.

(*3) = as usual, when I say "PSNR-HVS" I actually mean "MSE-HVS", and actually now I'm using RMSE because it fits a little better. Very clear, I know.

1 comment:

ryg said...

There's a pretty amazing page on the design of CELT (Vorbis successor by xiph.org): http://people.xiph.org/~xiphmont/demo/celt/demo.html. It's audio not video, but there's some interesting bits in there. The "constrained energy" bit (trying to preserve total energy in each spectral band) sounds like it should be applicable (in general terms) to image/video coding too.

In fact a lot of it transfers nicely to image coding. The short/long window selection corresponds to adaptive block size determination (to code "transients" better, which in image parlance are of course edges). For low frequencies, maintaining energy and minimizing quantization error do roughly the same thing; for high frequencies, it basically trades ringing for noise, which doesn't sound too bad.

If only every codec came with this kind of high-level design overview... extra bonus points for the "abandoned techniques" part!

old rants