First of all, one simple thing to ask is: what is the "sharpness" of the image? One way to measure this is to compare the image with a blurred (lowpassed, if you prefer) version of itself. Basically, if an image is already blurry (eg. smooth ramps) and you run a blur filter on it, it doesn't change much, but if it was very sharp (eg. a black and white pixel checkerboard), it changes a lot. See for example this page on MTF with nice pictures.

One nice way to measure this is the ratio of the highpass energy over the lowpass energy. To motivate that, here's what we are basically doing :

```
I = original image
L = lowpass of I
H = highpass = I - L
I = L + H
Energy[I] = Energy[L] + Energy[H]
Sharpness = Energy[I] / Energy[L] = 1 + Energy[H] / Energy[L]
```

where "Energy" might be something like transform to Fourier domain and sum the power spectrum. Now, this Sharpness has an implicit cutoff frequency f that is the parameter of the lowpass. So it's really S(f), and we could scan f around and make a chart of the sharpness at various frequencies. To measure preservation of detail, you want to compare S(f) at all f's.
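As a concrete sketch of this ratio (numpy assumed; the box blur here is just a stand-in for whatever lowpass you prefer, and the "Energy" is a plain sum of squares rather than a Fourier power spectrum):

```python
import numpy as np

def box_blur(img, radius=1):
    """Simple separable-equivalent box blur with clamped edges;
    stands in for the lowpass L."""
    k = 2 * radius + 1
    p = np.pad(img, radius, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def sharpness(img):
    """Sharpness = 1 + Energy[H] / Energy[L], with H = I - L."""
    I = img.astype(float)
    L = box_blur(I)
    H = I - L
    return 1.0 + np.sum(H * H) / np.sum(L * L)
```

On a pixel checkerboard this comes out much larger than on a smooth ramp, as the text describes.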

Now we'd like to have something like this that is more discrete and also localized. We want to ask if the detail at a specific spot is preserved.

A natural idea is the (abs or square) sum of laplacian filters. Something like :

```
In a local neighborhood of I :
  Energy = L1 or L2 sum of Laplacian filters on I
  L = blur of I
  Sharpness = Energy[I] / Energy[L]
```

Instead of scanning the lowpass cutoff around, we just picked some single blur amount, but then we can do this in a multi-scale way. Let I0 = original image, I1 = blur of I0, I2 = blur of I1, etc., then S0 = E[I0]/E[I1], S1 = E[I1]/E[I2], etc. To measure preservation of detail at various scales, we compare S0,S1,S2,.. from each image to S0,S1,S2,.. of the other image (on each local neighborhood). That is, we require that the detail level is preserved in that area in the same frequency band.

That is, we make a Gaussian pyramid of images that are blurred more and more, and then we take the energy in each level vs the energy in the parent.

But the laplacian is just the delta of each level from its parent (roughly), something like I0 - I1. So we can just make these delta images, D0 = I0 - I1, D1 = I1 - I2, and then S0 = |D0|/|D1| (just magnitudes, not "energy" measures).
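The delta-pyramid ratios can be sketched like so (numpy assumed; `blur3` is a stand-in pyramid lowpass, and the small epsilon only guards against division by zero):

```python
import numpy as np

def blur3(img):
    """3x3 box blur with clamped edges; stands in for the pyramid lowpass."""
    p = np.pad(img, 1, mode='edge')
    return sum(p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
               for dy in range(3) for dx in range(3)) / 9.0

def detail_ratios(img, levels=3):
    """S_n = |D_n| / |D_{n+1}|, where D_n = I_n - I_{n+1}
    and I_{n+1} = blur(I_n). L1 magnitudes, not squared energies."""
    I = [img.astype(float)]
    for _ in range(levels + 1):
        I.append(blur3(I[-1]))
    D = [I[n] - I[n + 1] for n in range(levels + 1)]
    return [np.abs(D[n]).sum() / (np.abs(D[n + 1]).sum() + 1e-12)
            for n in range(levels)]
```

To compare two images you would compute these ratios per local neighborhood of each and compare them level by level.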

By now the similarity to wavelets should be obvious. The fine detail images are just the high pass parts of the wavelet. So really all we're doing is looking at the L1 or L2 sum of coefficients in each band pass of a wavelet and comparing to the sum in the parent.

But wavelets also suggest something more that we could have done from the beginning - instead of a symmetric lowpass/highpass we can do separate ones for horizontal, vertical, and diagonal. This tells us not just the amount of energy but a bit more about its shape. So instead of just a sharpness Sn we could measure Sn_H, Sn_V and Sn_D using a wavelet. This would be like using a horizontal laplacian [-1, 2, -1], a vertical one, and an "X"-shaped diagonal one.
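A minimal sketch of those three directional filters (numpy assumed; these are the second-difference stencils named above, with L1 sums as the energies):

```python
import numpy as np

def directional_energy(img):
    """L1 energy of horizontal [-1,2,-1], vertical, and "X"-shaped
    diagonal Laplacians over the interior of the image."""
    I = img.astype(float)
    # horizontal second difference (responds to vertical edges)
    H = np.abs(2 * I[:, 1:-1] - I[:, :-2] - I[:, 2:]).sum()
    # vertical second difference (responds to horizontal edges)
    V = np.abs(2 * I[1:-1, :] - I[:-2, :] - I[2:, :]).sum()
    # "X" diagonal: center against its four diagonal neighbors
    C = I[1:-1, 1:-1]
    D = np.abs(4 * C - I[:-2, :-2] - I[:-2, 2:]
                     - I[2:, :-2] - I[2:, 2:]).sum()
    return H, V, D
```

On an image of vertical stripes, for instance, H dominates and V is exactly zero, which is the extra shape information the symmetric filter throws away.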

And wavelets suggest something more - we could just use block transforms to do the same thing. An 8x8 Haar is a wavelet on a local chunk (and an 8x8 DCT has "wavelet structure" too). In particular you can arrange it into frequency bands like so :

```
01223333
11223333
22223333
22223333
33333333
33333333
33333333
33333333
```

and then take the L1 or L2 sum in each region and ask for preservation.
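The diagram above can be generated and used like this (numpy assumed; the bands are dyadic, band n covering coefficients whose max row/col index falls in [2^(n-1), 2^n)):

```python
import numpy as np

def band_map():
    """Band index for each coefficient of an 8x8 transform,
    matching the 0/1/2/3 diagram in the text."""
    m = np.zeros((8, 8), dtype=int)
    for r in range(8):
        for c in range(8):
            k = max(r, c)
            m[r, c] = 0 if k == 0 else int(np.log2(k)) + 1
    return m

def band_sums(coeffs, m=None):
    """L1 sum of transform coefficients in each of the 4 frequency bands."""
    m = band_map() if m is None else m
    return [np.abs(coeffs[m == b]).sum() for b in range(4)]
```

The band populations are 1, 3, 12 and 48 coefficients, so the coarser bands naturally carry more total energy for typical images.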

The similarity to x264's SATD energy metric is obvious. They use Haar and take the L1 sum of the energy in all the frequency bands to measure the total energy in the block. But we can be a lot more specific. In fact it suggests a whole "multi-accuracy" kind of delta.

Do 8x8 Haar or DCT. Compare 8x8 blocks A and B. Add terms :

1. Each of the 64 coefficients should be the same :

```
+= |A_ij - B_ij|
```

2. The sum of each wavelet band should be the same. That is, if you use the diagram above for groups 0-3, then within each group there is H, V, and D; with S(g,H/V/D) = Sum{in group g, H, V or D}, add :

```
+= | S(A,g,H) - S(B,g,H) |
+= | S(A,g,V) - S(B,g,V) |
+= | S(A,g,D) - S(B,g,D) |
for g in 0-3
```

3. Ignore the H,V,D and do the same thing just for the frequency subband sums :

```
+= | S(A,g) - S(B,g) |
```

4. Ignore the frequency subbands and do the sum of all the coefficients :

```
+= | S(A) - S(B) |
```
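A rough sketch of this multi-accuracy delta (numpy assumed; this version uses a plain 3-level Haar and implements terms #1, #3 and #4 only - the H/V/D split of term #2 is omitted for brevity):

```python
import numpy as np

def haar8(block):
    """3-level 2D Haar transform of an 8x8 block, Mallat ordering
    (DC ends up at [0,0], finest details in rows/cols 4-7)."""
    x = block.astype(float).copy()
    n = 8
    while n > 1:
        a = x[:n, :n].copy()
        # rows: pairwise averages then differences
        a = np.hstack([(a[:, 0::2] + a[:, 1::2]) * 0.5,
                       (a[:, 0::2] - a[:, 1::2]) * 0.5])
        # columns: same
        x[:n, :n] = np.vstack([(a[0::2, :] + a[1::2, :]) * 0.5,
                               (a[0::2, :] - a[1::2, :]) * 0.5])
        n //= 2
    return x

def band_map():
    """Band index per coefficient, matching the 0-3 diagram in the text."""
    m = np.zeros((8, 8), dtype=int)
    for r in range(8):
        for c in range(8):
            k = max(r, c)
            m[r, c] = 0 if k == 0 else int(np.log2(k)) + 1
    return m

def multi_delta(A, B):
    """Fine-to-coarse error terms between two 8x8 blocks."""
    Ca, Cb = haar8(A), haar8(B)
    m = band_map()
    d = np.abs(Ca - Cb).sum()                        # 1: per-coefficient
    for g in range(4):                               # 3: per-band sums
        d += abs(np.abs(Ca[m == g]).sum() - np.abs(Cb[m == g]).sum())
    d += abs(np.abs(Ca).sum() - np.abs(Cb).sum())    # 4: total, the "SATD" term
    return d
```

An error that shifts energy within a band hits term #1 but not #3 or #4, so it scores lower than one that deletes the energy outright - which is exactly the preference described below.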

These are error terms that go from fine to coarse. This last one (#4) is the most coarse and is the "SATD". Adding the multiple terms together means that if we have errors that screw up the highest precision test (#1) but preserve the other measures, we prefer that kind of error. eg. we prefer the energy to move somewhere nearby in frequency space rather than just disappear.

Now obviously if your coder works on something like 8x8 blocks then you don't want to run a test metric that is also 8x8 block based, certainly not if it's aligned with the coding blocks. You could run 8x8 test metric blocks that are offset by 4 so they straddle neighbors, or you could do 16x16 test blocks centered on each 8x8 code block, or you could do 8x8 test blocks but do one for every pixel instead of just at aligned locations.
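The offset-by-4 variant is straightforward to sketch (numpy assumed; `metric` is whatever 8x8 block metric you like, here a plain SAD for illustration):

```python
import numpy as np

def overlapped_metric(a, b, metric, block=8, step=4):
    """Sum a block metric over windows offset by `step`, so the 8x8
    test blocks straddle the 8x8 coding-block boundaries."""
    total = 0.0
    for y in range(0, a.shape[0] - block + 1, step):
        for x in range(0, a.shape[1] - block + 1, step):
            total += metric(a[y:y + block, x:x + block],
                            b[y:y + block, x:x + block])
    return total

# illustrative block metric: sum of absolute differences
sad = lambda p, q: np.abs(p.astype(float) - q.astype(float)).sum()
```

With step=1 this becomes the per-pixel sliding version; step=4 is the cheap compromise that still sees across block edges.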

## 3 comments:

Have you looked at work from the computer vision community, particularly saliency functions, and feature (eg, edge) detectors?

After all, pretty much by definition, edges are sharp and sharp things are edges (well, kinda).

You might want to check Peter Kovesi's work. His phase congruency stuff is pretty cool.

http://www.csse.uwa.edu.au/~pk/Research/MatlabFns/#phasecong

Yeah I have; the Phase stuff is pretty interesting, though it's definitely still out of the norm.

BTW after I wrote this I found some papers by Lai and Kuo on Haar Wavelet image quality assessment which are nearly identical to the basic ideas here.

I have one major problem with most of the research on this topic - they make very explicit use of the actual physical brightness of the display and the angular size of pixels, and use those in very precise models of human vision (for eg. spatial frequency contrast response functions and contrast thresholds). The problem with that is that viewing conditions are not all the same, so you can't really know the angular size of a pixel so precisely.

There's tons of papers on this stuff. Some of the most interesting ones I've seen so far are

ones on "Texture Characterization" from the computer vision / image classification camp

stuff on "Perceptual Contrast Enhancement" from the HDR image processing and medical vision camp

but I've got like a hundred papers and haven't read them all yet.

A few more links for the pile:

http://en.wikipedia.org/wiki/Unsharp_masking

http://en.wikipedia.org/wiki/Scale_space

Post a Comment