1. Symmetry. A lot of researchers ask for the property M(X,Y) = M(Y,X). There's absolutely no reason to think this is a good property of a perceptual image metric. You may well want to treat the reference (original) differently than the distorted image in your metric.
Obviously for something like image database similarity metrics you want symmetry, but that's not what we're dealing with here. Also, you can of course trivially make a symmetric metric from an asymmetric one by doing S(X,Y) = A(X,Y) + A(Y,X).
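A minimal Python sketch of that symmetrization. The asymmetric metric here (`asym_error`, which scales the squared error by the reference pixel's magnitude) is a made-up stand-in, not a metric from any paper; only the S(X,Y) = A(X,Y) + A(Y,X) wrapper is the point.

```python
def asym_error(ref, dist):
    """Hypothetical asymmetric metric: squared error relative to the reference.

    The reference sets the scale, so swapping the arguments changes the result.
    """
    return sum((r - d) ** 2 / (1.0 + abs(r)) for r, d in zip(ref, dist))


def symmetrize(metric):
    """Build S(X,Y) = A(X,Y) + A(Y,X) from any two-argument metric A."""
    def sym(x, y):
        return metric(x, y) + metric(y, x)
    return sym


sym_error = symmetrize(asym_error)

x = [10.0, 20.0, 30.0]
y = [12.0, 18.0, 33.0]
print(asym_error(x, y), asym_error(y, x))  # generally different
print(sym_error(x, y), sym_error(y, x))    # equal by construction
```

Note that symmetrizing throws away exactly the information the asymmetric metric was using, namely which image is the reference.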
2. Offset/Scale Invariance. Again, people demand M(X,Y) = M(S*X + O, S*Y + O), that is, invariance to constant scale and offset. There is no reason to think this is good (in fact there is reason to think this is bad). Pixel values have an absolute meaning, and 2->4 is not the same as 4->6.
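To make the 2->4 versus 4->6 point concrete, here is a small Python sketch. Plain MSE is offset invariant (though not scale invariant), so it scores the two steps identically. The second function, `relative_error`, is a hypothetical Weber-style stand-in invented for illustration, in which the same absolute step counts for more at low pixel values. It is not a claim about which behavior is perceptually right, only a demonstration that the two differ.

```python
def mse(x, y):
    """Mean squared error between two equal-length pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def relative_error(ref, dist):
    """Hypothetical Weber-style error: step size relative to the reference level.

    Assumes every reference value is nonzero.
    """
    return sum(abs(r - d) / r for r, d in zip(ref, dist)) / len(ref)


dark_ref, dark_dist = [2.0], [4.0]        # the 2->4 step
bright_ref, bright_dist = [4.0], [6.0]    # the same step after adding an offset of 2

# Identical: MSE cannot tell the two cases apart.
print(mse(dark_ref, dark_dist), mse(bright_ref, bright_dist))
# Different: the relative measure depends on the absolute level.
print(relative_error(dark_ref, dark_dist), relative_error(bright_ref, bright_dist))
```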
3. Rotation Invariance. A lot of people ask for rotation invariance, and are careful to normalize their wavelet basis or DCT or whatever so that they have rotation invariance. Again, bad idea. The human eye has different perception of detail in different directions - strongest horizontally, slightly weaker vertically, and much weaker diagonally. Also real images have by far the most correlation horizontally - people with cameras tend to hold them aligned to the horizon, not at completely random angles. Images are not rotationally invariant in the real world, so why should your metric be?
For example, one simple thing they do in the PSNR-HVS paper is a one-step wavelet transform to make LL, LH, HL and HH bands. You then run whatever metric you want on each of the four bands and weight their contributions like this:
UQI-HVS = 0.5779*U_LL + 0.1582*U_HL + 0.1707*U_LH + 0.0932*U_HH
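A rough Python sketch of that band-weighting scheme, under several assumptions that are not from the paper: a Haar wavelet for the one-step transform, plain MSE as the stand-in per-band metric (the original uses UQI), and a band-naming convention in which the first letter is the horizontal filter. Only the four weights come from the formula above.

```python
# Band weights quoted in the text; they sum to 1.
WEIGHTS = {"LL": 0.5779, "HL": 0.1582, "LH": 0.1707, "HH": 0.0932}


def haar_bands(img):
    """One-level orthonormal 2D Haar transform of a list-of-rows image.

    Assumes even width and height. Band naming is an assumed convention:
    first letter = horizontal filter, second = vertical filter.
    """
    bands = {"LL": [], "HL": [], "LH": [], "HH": []}
    for r in range(0, len(img), 2):
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            bands["LL"].append((a + b + d + e) / 2.0)  # local average
            bands["HL"].append((a - b + d - e) / 2.0)  # horizontal difference
            bands["LH"].append((a + b - d - e) / 2.0)  # vertical difference
            bands["HH"].append((a - b - d + e) / 2.0)  # diagonal difference
    return bands


def mse(x, y):
    """Stand-in per-band metric; any metric on two coefficient lists would do."""
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


def weighted_band_score(ref, dist, band_metric=mse):
    """Run band_metric on each band and combine the results with the fixed weights."""
    rb, db = haar_bands(ref), haar_bands(dist)
    return sum(w * band_metric(rb[name], db[name]) for name, w in WEIGHTS.items())


ref = [[10, 12, 14, 16],
       [11, 13, 15, 17],
       [20, 22, 24, 26],
       [21, 23, 25, 27]]
dist = [[10, 13, 14, 15],
        [11, 13, 16, 17],
        [20, 21, 24, 26],
        [22, 23, 25, 27]]
print(weighted_band_score(ref, dist))
```

The thing to notice is that the HL and LH weights differ and the HH weight is much smaller, so the combined score is deliberately not rotation invariant.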
This kind of issue comes up in non-obvious ways as well. For example, if you do a unitary transform on a block (DCT or any other basis), the L2 norm is preserved, so you might think that the L2 norm is theoretically more logical. However, we have no reason to believe that human vision is invariant under basis-space rotations, so there is no reason to prefer the rotation-invariant norm.
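The norm-preservation claim is easy to check numerically. Below is a small Python sketch using a hand-written orthonormal DCT-II on a made-up 8-sample error signal: the L2 norm is the same in pixel space and DCT space, while the L1 norm is not. The test signal and helper names are illustrative assumptions.

```python
import math


def dct_orthonormal(x):
    """Orthonormal (unitary) DCT-II of a 1D list."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * acc)
    return out


def l1(v):
    """Sum of absolute values."""
    return sum(abs(t) for t in v)


def l2_squared(v):
    """Sum of squares."""
    return sum(t * t for t in v)


# Hypothetical error signal: one pixel in an 8-sample block is off by 8.
error = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]
coeffs = dct_orthonormal(error)

print(l2_squared(error), l2_squared(coeffs))  # equal up to rounding
print(l1(error), l1(coeffs))                  # differ: L1 depends on the basis
```

So L2 is the norm that can't tell which basis the error lives in, which is exactly the property being questioned here.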
4. Not quite in the same category, but another thing that I have my doubts about is the use of things like visual attention focus. The basic idea is that you can predict what part of the image/video people actually care about, and give more bits to those objects. This sounds nice in theory, but in practice you can't know where people will be looking, so if you look at an ensemble of observers who look at a variety of places, you might improve your average score, but you hurt the worst case (people who were looking at the area you took bits away from). In practice there's also the issue that when you take bits away from an unimportant area, you can create artifacts there, and then those artifacts become visually annoying and thus make that area more of a focus. eg. in old crappy MPEG, when you take bits away from the background and give them to the human actors in the foreground, you can make huge blocky ringy awfulness in the background.
A major problem I see in paper after paper is that people don't define their terms clearly. One of the biggest issues is "luma", where people will just start writing equations with an "L" or a "Y" in them without defining exactly what that means. Is it light-linear (eg. CIE Y)? Perceptually uniform "lightness" (CIE L*)? What's the scale? eg. what does L=1 mean?
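To show how far apart these candidate meanings are, here is a Python sketch that computes three different "brightness" numbers for the same pixel: gamma-encoded luma (Y'), light-linear luminance (CIE Y), and CIE 1976 lightness (L*). It assumes the pixel is sRGB-encoded with channels in [0, 1] and uses the Rec. 709 weights; the pixel value itself is arbitrary.

```python
def srgb_to_linear(c):
    """Decode one sRGB-encoded channel in [0, 1] to light-linear."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def cie_lightness(y, y_white=1.0):
    """CIE 1976 lightness L* (scale 0..100) from relative luminance Y."""
    t = y / y_white
    delta = 6.0 / 29.0
    if t > delta ** 3:
        f = t ** (1.0 / 3.0)
    else:
        f = t / (3.0 * delta ** 2) + 4.0 / 29.0
    return 116.0 * f - 16.0


# Rec. 709 / sRGB luminance weights.
WR, WG, WB = 0.2126, 0.7152, 0.0722

r, g, b = 0.8, 0.4, 0.2  # arbitrary sRGB-encoded pixel (assumption for illustration)

# Luma Y': weighted sum of the gamma-encoded values, scale 0..1.
luma = WR * r + WG * g + WB * b
# Luminance CIE Y: weighted sum of the light-linear values, scale 0..1.
lum = WR * srgb_to_linear(r) + WG * srgb_to_linear(g) + WB * srgb_to_linear(b)
# Lightness CIE L*: roughly perceptually uniform, scale 0..100.
lightness = cie_lightness(lum)

print(luma, lum, lightness)
```

The three values differ in both meaning and scale, so an equation written in terms of a bare "L" or "Y" is not reproducible until the paper says which one it is.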