
03-10-10 - Distortion Measure

What are the things we might put in an ideal distortion measure? This is going to be rather stream-of-consciousness rambling, so beware. Our goal is to make output that "looks like" the input, and also that just looks "good". Most of what I talk about will assume that you are running "D" on something like 4x4 or 8x8 blocks and comparing it to "D" on other blocks, but of course it could be run on a gaussian-windowed patch - just some way of localizing distortion to a region.

I'm going to ignore the more "macroscopic" issues of which frame is more important than another frame, or even which object within a frame is more important - those are very important issues I'm sure, but they can be added on later, and are beyond the scope of current research anyway. I want to talk about the microscopic local distortion rating D. The key thing is that the numerical value of D gives you a way to "score" one distortion against another. This not only lets you choose the way your error looks on a given block (choosing the one with lowest score obviously), it also determines how your bits are allocated around the frame in an R/D framework (bits will go to places that D says are more important).

It should be intuitively obvious that just using D = SSD or SAD is very crude and badly broken. One pixel step of numerical error clearly has very different importance depending on where it is. How might we do better?

1. Value Error. Obviously the plain old metric of "output value - input value" is useful even just as a sanity check and regularizer; it's the background distortion metric that you will then add your other biasing factors to. All other things being equal you do want output pixels to exactly match input pixels. But even here there's a funny issue of what measure you use. Probably something in the L-N norms (L1 = SAD, L2 = SSD). The standard old world metric is L2, because if you optimize for D = L2, then you will minimize your MSE and maximize your PSNR, which is the goal of old literature.

The L-N norms behave differently in the way they rate one error vs another. The higher N is, the more importance it puts on the largest error. eg. L-infinity only cares about the largest error. L-2 cares more about big errors than small ones. That is, L2 makes it better to change 100->99 than 1->0. Obviously you could also do hybrid things like use L1 and then add a penalty term for the largest error if you think minimizing the maximum error is important. I believe that *where* the error occurs is more important than what its value is, as we will discuss later.
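
To make that concrete, here's a minimal sketch of the L-N norm family on a block of pixels. The flat-array layout and function names are just my own for illustration, not from any particular codec :

#include <cmath>
#include <algorithm>

// Lp error over a block; p = 1 is SAD-like, p = 2 is SSD-like.
// As p grows, the largest single pixel error dominates the total.
double block_error_Lp(const unsigned char * in, const unsigned char * out, int count, double p)
{
    double sum = 0;
    for (int i = 0; i < count; i++)
    {
        double e = std::fabs((double)out[i] - (double)in[i]);
        sum += std::pow(e, p);
    }
    return sum;
}

// L-infinity : only the single largest error matters.
double block_error_Linf(const unsigned char * in, const unsigned char * out, int count)
{
    double maxe = 0;
    for (int i = 0; i < count; i++)
        maxe = std::max(maxe, std::fabs((double)out[i] - (double)in[i]));
    return maxe;
}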

2. DC Preservation. Changes in DC are very noticeable. Particularly in video, the eye is usually tracking mainly one or two foreground objects; what that means is that most of the frame we are only seeing with our secondary vision (I don't know a good term for this, it's not exactly peripheral vision since it's right in front of you, but it's not what your brain is focused on, so you see it at way lower detail). For all this stuff that we see with secondary vision, we only see its gross properties, and one of those is the DC. Another issue is that if a bunch of blocks in the source have the same DC, and you change one of them in the output, that is sorely noticeable.

I'm not sure if it's most important to preserve the median or the mean or what exactly. Usually people preserve the mean, but there are certainly cases where that can screw you up. eg. if you have a big field of {80} with a single pixel spike on it, you want to preserve that background {80} everywhere no matter what the spike does in the output. eg. {80,80,255,80,80} -> {80,80,240,80,80} is better than making it go -> {83,83,240,83,83} even though the latter has better mean preservation.
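
Here's a rough sketch of what a DC-preservation term might look like. Whether the right notion of "DC" is the mean or the median (or something else) is exactly the open question, so this just computes both; names and the placeholder blend are hypothetical :

#include <cmath>
#include <vector>
#include <algorithm>

double block_mean(const unsigned char * b, int count)
{
    double sum = 0;
    for (int i = 0; i < count; i++) sum += b[i];
    return sum / count;
}

double block_median(const unsigned char * b, int count)
{
    std::vector<unsigned char> v(b, b + count);
    std::nth_element(v.begin(), v.begin() + count/2, v.end());
    return v[count/2];
}

// Penalize drift of the block's "DC" between input and output.
// For the spike example above, the median-based term prefers keeping the
// {80} background intact, while the mean-based term prefers shifting it;
// which one (or what blend) is right is an open question.
double dc_distortion(const unsigned char * in, const unsigned char * out, int count)
{
    double dMean   = std::fabs(block_mean(in, count)   - block_mean(out, count));
    double dMedian = std::fabs(block_median(in, count) - block_median(out, count));
    return 0.5 * dMean + 0.5 * dMedian; // placeholder blend
}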

3. Edge Preservation. Hard edges, especially long straight lines or smooth curves, are very visible to humans and any artifact in them stands out. The importance of edges varies though; it has something to do with the length of the edge (longer edges are more major visual features) and with the contrast range of the region around the edge : eg. an edge that separates two very smooth sections is super visible, but an edge that's one of many in a bunch of detail is less important (preserving the detail there is important, but the exact shape of the edge is not). eg. a patch of grass or leaves might have tons of edges, but their exact shape is not crucial. An image of hardwood floor has tons of long straight parallel edges and preserving those exactly is very important. The border between objects is typically very important.

Obviously there's the issue of keeping the edges that were in the original and also the issue of not making new edges that weren't in the original. eg. introducing edges at block boundaries or from ringing artifacts or whatever. As with edge preservation, the badness of these introduced edges depends on the neighborhood - it's much worse to make them in a smooth patch than one that's already noisy. (in fact in a noisy patch, ringing artifacts are sort of what you want, which is why JPEG can look better than naive wavelet coders on noisy data).
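
A very rough sketch of an edge term might just compare per-pixel gradient magnitudes, which penalizes both killed edges and newly introduced ones (block boundaries, ringing). A real metric would also weight by the local activity as discussed above; the block layout and names here are assumptions :

#include <cmath>

// Compare simple forward-difference gradient magnitudes of input vs output.
// Block is width x height, row-major.
double edge_distortion(const unsigned char * in, const unsigned char * out,
                       int width, int height)
{
    double sum = 0;
    for (int y = 0; y < height - 1; y++)
    {
        for (int x = 0; x < width - 1; x++)
        {
            int i = y * width + x;
            double gin  = std::fabs((double)in[i+1]  - in[i])  + std::fabs((double)in[i+width]  - in[i]);
            double gout = std::fabs((double)out[i+1] - out[i]) + std::fabs((double)out[i+width] - out[i]);
            // error in edge strength at this pixel, regardless of sign :
            sum += std::fabs(gin - gout);
        }
    }
    return sum;
}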

4. Smooth -> Smooth (and Flat -> Flat). Changing smooth input to not smooth is very bad. Old coders failed hard on this by making block boundaries. Most new coders now handle this inherently, either because they are wavelet or use unblocking or something. There are still some tricky cases though, such as if you have a smooth ramp with a bit of gaussian noise speckle added to it. Visually the eye still sees this as "smooth ramp" (in fact if you squint your eyes the noise speckle goes away completely). It's very important for the output to preserve this underlying smooth ramp; many good modern coders see the noise speckle as "detail" that should be preserved and wind up screwing up the smooth ramp.
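
One crude way to check the "underlying smooth ramp" is to low-pass both blocks (basically the squint test) and measure error on the blurred versions; if the ramp is preserved, this term stays small even when the speckle moves around. The 3x3 box filter here is just an assumption, any low-pass would do :

#include <vector>

// Crude 3x3 box blur, clamping at the block edges.
static std::vector<double> box_blur(const unsigned char * b, int width, int height)
{
    std::vector<double> out(width * height);
    for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
    {
        double sum = 0; int n = 0;
        for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
        {
            int yy = y + dy, xx = x + dx;
            if (yy < 0 || yy >= height || xx < 0 || xx >= width) continue;
            sum += b[yy * width + xx]; n++;
        }
        out[y * width + x] = sum / n;
    }
    return out;
}

// "Squint test" : SSD between low-passed input and low-passed output.
double smooth_distortion(const unsigned char * in, const unsigned char * out,
                         int width, int height)
{
    std::vector<double> bin  = box_blur(in,  width, height);
    std::vector<double> bout = box_blur(out, width, height);
    double sum = 0;
    for (int i = 0; i < width * height; i++)
    {
        double e = bin[i] - bout[i];
        sum += e * e;
    }
    return sum;
}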

5. Detail/Energy Preservation. The eye is very sensitive to whether a region is "noisy" or "detailed", much more so than exactly what that detail is. Some of the JPEG style "threshold of visibility" stuff is misleading because it makes you think the eye is not sensitive to high frequency shapes - true, but you do see that there's "something" there. The usual solution to this is to try to preserve the amount of high frequency energy in a block.

There are various sub-cases of this. There's true noise (or real life video that's very similar to true noise) in which case the exact pixel values don't matter much at all as long as the frequency spectrum and distribution of the noise are reproduced. There's detail that is pretty close to noise, like tree leaves, grass, water, where again the exact pixels are not very important as long as the character of the source is preserved. Then there's "false noise" ; things like human hair or burlap or bricks can look a lot like noise to naive analysis metrics, but are in fact patterned texture in which case messing up the pattern is very visible.

There are two issues here - obviously there's trying to match the source, but there's also the issue of matching your neighbors. If you have a bunch of neighboring source blocks with a certain amount of energy, you want to reproduce that same patch in the output - you don't want to have single blocks with very different energy, because they will stand out. Block energy is almost like DC level in this way.
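
An energy-preservation term deliberately ignores *where* the detail is and only asks that the amount match. A minimal sketch using "AC energy" (sum of squared deviation from the block mean); all names here are mine :

#include <cmath>

// "AC energy" of a block : sum of squared deviation from the block mean.
static double block_ac_energy(const unsigned char * b, int count)
{
    double mean = 0;
    for (int i = 0; i < count; i++) mean += b[i];
    mean /= count;

    double energy = 0;
    for (int i = 0; i < count; i++)
    {
        double d = b[i] - mean;
        energy += d * d;
    }
    return energy;
}

// Penalize the output having more or less detail energy than the input,
// regardless of where in the block that energy sits.
double energy_distortion(const unsigned char * in, const unsigned char * out, int count)
{
    return std::fabs(block_ac_energy(in, count) - block_ac_energy(out, count));
}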

6. Dynamic range / sdev Preservation. Of course related to previous metrics, but you can definitely see when the dynamic range of a region changes. It's very easy to see if a high contrast edge becomes lower contrast. Also in noise/detail areas the main things you notice are the DC, the amount of noise, and the range of the noise. One reason it's so visible is because of optical fusion and its effect on DC brightness. That is, if you remove the bright specks from a background it makes the whole region look darker. Because of gamma correction, {80,120} is not the same brightness as {100,100}. Now theoretically you could do gamma-corrected DC preservation checks, but there are difficulties in trying to be gamma correct in your error metrics, since the gamma remapping sort of does what you want in terms of making changes of dark values relatively more important; maybe you could do gamma-correct DC preservation and then scale it back using gamma to correct for that.

It's unclear to me whether the important thing is the absolute [low,high] range, or the statistical width [mean-sdev,mean+sdev]. Another option would be to sort the values from lowest to highest and look at the distribution; the middle is the median, then you have the low and high tails on each side; you sort of want to preserve the shape of that distribution. For example the input might have the high values in a kind of gaussian falloff tail with most values near median and fewer as it gets higher; then the output should have a similar distribution, but exactly matching the high value is not important. The same block might have all of its low values at exactly 0 ; in that case the output should also have those values at exactly 0.
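
One way to express "preserve the shape of that distribution" is to sort both blocks and compare order statistics; this ignores where values sit spatially and just asks that the spread (min, max, median, and the tails in between) match. Again just a sketch under those assumptions :

#include <cmath>
#include <vector>
#include <algorithm>

// Sort both blocks and compare value-for-value; matching the i'th smallest
// input value against the i'th smallest output value compares the whole
// distribution without caring about spatial position.
double distribution_distortion(const unsigned char * in, const unsigned char * out, int count)
{
    std::vector<unsigned char> a(in,  in  + count);
    std::vector<unsigned char> b(out, out + count);
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    double sum = 0;
    for (int i = 0; i < count; i++)
        sum += std::fabs((double)a[i] - (double)b[i]);
    return sum;
}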


Whatever all the final factors are, you are left with how to scale them and combine them. There are two issues on scaling : power and coefficient. Basically you're going to combine the sub-distortions something like this :


D = Sum_n { Cn * Dn^Pn }

Dn = distortion sub type n
Cn = coefficient n
Pn = power n

The power Pn lets you change the units that Dn are measured in; it lets you change how large values of Dn contribute vs. how small values contribute. The coefficient Cn obviously just overall scales the importance of each Dn vs. the other D's.

It's actually not that hard to come up with a good set of candidate distortion terms like I did above, the problem is once you have them (the various Dn) - what are the Cn and Pn to combine them?
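
Mechanically the combination is trivial; the content is entirely in choosing the Cn and Pn. A sketch, where the Dn values would come from sub-distortion terms like the ones sketched above (the arrays and any values you'd put in them are placeholders, not tuned numbers) :

#include <cmath>

// D = Sum_n { Cn * Dn^Pn }
// dn : the sub-distortion values (value error, DC, edges, energy, ...)
// cn, pn : per-term coefficient and power; finding good ones is the real problem.
double combine_distortions(const double * dn, const double * cn, const double * pn, int numTerms)
{
    double D = 0;
    for (int n = 0; n < numTerms; n++)
        D += cn[n] * std::pow(dn[n], pn[n]);
    return D;
}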
