It's subtle. For one thing, a smooth area is only not "salient" if it's smooth in both the original and the disorted image. If you do something like creating block artifacts, then those block edges are salient (high visual attention factor). Also, they work at sort of different scales. Saliency works on intermediate frequencies - the brain specifically cares about detail at a certain scale that detects edges and shapes, but not speckle/texture and not gross offsets. Contrast or variance masking applies mainly to the aditional frequencies in an area that has a strong medium-scale signal.
MS-SSIM actually has a lot of tweaked out perceptual hackiness. There's a lot of vision modelling implicit in it. They use a Gaussian window with a tweaked out size - this is a model of the spatial filter of the eye. The different scales have different importance contributions - this is a model of the CSF, variable sensitivity and different scales. The whole method of doing a Pearson-style correlation means that they are doing variance masking.
The issue of feeding back a perceptual delta into a non-perceptual compressor is an interesting one.
Say you have a compressor which does R-D optimization, but its D is just MSE. You can pretty easily extend that to do weighted R-D, with a weight either per-pixel or per-block. Initially all the weights are the same (1.0). How do you adjust the weights to optimize an external perceptual metric?
Assume your perceptual metric returns not just a number but an error map, either per pixel or per block.
First of all, what you should *not* do is to simply find the parts with large perceptual error and increase their weight. That would indeed cause the perceptual error at those spots to reduce - but we don't know if that is actually a good thing in an R-D sense. What that would do is even out the perceptual error, and evening it out is not necessarily the best R-D result. This is, however, an easy to get the minimax perceptual error!
Algorithm Minimax Perceptual Error : 1. Initial W_i = 1.0 2. Run weighted R-D non-perceptual compressor (with D = weighted MSE). 3. Find perceptual error map P_i 4. W_i += lambda * P_i (or maybe W_i *= e^(lambda * P_i)) 5. goto 2what about minimizing the total perceptual error?
What you need to do is transform the R/D(MSE) that the compressor uses into R/D(percep). To do that, you need to make W_i ~= D(percep)/D(mse).
If the MSE per block or pixel is M_i, the first idea is just to set W_i = ( P_i / M_i ) . This works well if P and M have a near-linear relationship *locally* (note they don't need to be near-linear *globally* , and the local linear relationship can vary from one location to another).
However, the issue is that the slope of P_i(R) may not be near linear to M_i(R) (as functions of rate) over its entire scale. Obviously for small enough steps, they are always linear.
So, you could run your compress *twice* to create two data points, so you have M1,M2 and P1,P2. Then what you want is :
P ~= P1 + ((P2 - P1)/(M2 - M1)) * (M - M1) slope := (P2 - P1)/(M2 - M1) P ~= slope * M + (P1 - slope * M1) C := (P1 - slope * M1) P ~= slope * M + C I want J = R + lambda * P use J = R + lambda * (slope * M + C) J = R + lambda * slope * M + lambda * C so W_i = slope_i J = R + lambda * W * Mwith the extra lambda * C term ; C is just a constant per block, so it can be ignored for R-D optimization on that block. If you care about the overall J of the image, then you need the sum of all C's for that, which is just :
C_tot = Sum_i (P1_i - slope_i * M1_i)it's obviously just a bias to turn our fudged MSE-multiplied-by-slope into the proper scale of the perceptual error.
I think this is pretty strong, however it's obviously iterative, and you must keep the steps of the iteration small - this is only valid if you don't make very big changes in each step of the iteration.
There's also a practical difficulty that it can be hard to actually generate different M1 and M2 on some blocks. You want to generate both M1 and M2 very near the settings that you will be using for the final compress, because large changes could take you to vastly different perceptual areas. But, when you make small changes to R, some parts of the image may change, but other parts might not actually change at all. This gives you M1=M2 in some spots which is a divide by zero, which is annoying.
In any case I think something like this is an interesting generic way to adapt any old compressor to be perceptual.
Note that this kind of "perceptual importance map" generation cannot help your coder make internal decisions based on perceptual optimization - it can only affect bit allocation. That is, it can help a coder with variable bit allocation (either through truncation or variable quantization) to optimize the bit allocation.
In particular, a coder like H264 has lots of flexibility; you can choose a macroblock mode, an inter block can choose a movec, an intra block can choose a prediction mode, you can do variable quantizers, and you can truncate coefficients. This kind of iterative approach will not help the mode decisions to pick modes which have less perceptual error, but it will help to change the bit allocation to give more bits to blocks where it will help the preceptual error the most.