Let's stop for a second and talk about the tests we've been running. First of all, as noted in a comment, these are not the final
tests. This is the preliminary round for me to work out my pipeline and make sure I'm running the compressors right, and to reduce
the competitors to a smaller set. I'll run a final round on a large set of images and post a graph for each image.
What is this MS-SSIM-SCIELAB exactly? Is it a good perceptual metric?
SCIELAB (see 1 and 2) is a color transform that
is "perceptually uniform", i.e. a delta of 1 has the same human visual importance at all locations in the color cube, and is
also spatially filtered to account for the lower spatial resolution of human chroma vision. The filter is done in the opponent-color
domain, and basically the luma gets a sharp filter and the chroma gets a wide filter. Follow the blog posts for more details.
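To make the "sharp filter for luma, wide filter for chroma" idea concrete, here is a rough sketch on a single row of pixels. It is not the real SCIELAB filter: the real thing works in 2D and uses more elaborate kernels whose widths depend on viewing conditions. The function names and both sigma values below are made up for illustration.

```python
import math

# Made-up filter widths (in pixels), chosen only to show the contrast
# between a sharp luma filter and a wide chroma filter.
LUMA_SIGMA = 0.5
CHROMA_SIGMA = 3.0

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel covering about +/- 3 sigma."""
    radius = max(1, int(math.ceil(3.0 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_row(row, sigma):
    """Convolve one row with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out

def filter_opponent_row(luma, chroma1, chroma2):
    """Sharp filter on the luma channel, wide filter on both chroma channels."""
    return (blur_row(luma, LUMA_SIGMA),
            blur_row(chroma1, CHROMA_SIGMA),
            blur_row(chroma2, CHROMA_SIGMA))

# A single-pixel impulse survives mostly intact in luma but is smeared in chroma.
impulse = [0.0] * 10 + [1.0] + [0.0] * 10
luma_f, chroma_f, _ = filter_opponent_row(impulse, impulse, impulse)
print(luma_f[10], chroma_f[10])
```

The upshot is that a one-pixel chroma error mostly gets averaged away before the error is measured, while a one-pixel luma error does not.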
SCIELAB is pretty good at accounting for one particular perceptual factor of error measurement - the difference in importance of
luma vs. chroma and the difference in spatial resolution of luma vs. chroma. But it doesn't account for lots of other perceptual
factors. Because of that, I recognize that using SCIELAB somewhat distorts the test results. What it does is give a bonus
to compressors that are perceptually tuned for this particular issue, and it doesn't care about other issues. More on this later. (*1)
(ADDENDUM : of course using SCIELAB also gives an advantage to people who use a colorspace similar to LAB, such as old JPEG YUV,
and penalizes people who use YCoCg which is not so close. Whether or not you consider this to be fair depends on how accurate you
believe LAB to be as an approximation of how the eye sees).
MS-SSIM is multi-scale SSIM and I use it on the SCIELAB data. There are a few problems with this.
How do you do SSIM for multi-component data?
It's not that clear. You can just do the SSIM on each component and then combine - probably with an arithmetic mean, but if you look
at the SSIM a bit you might get other ideas. The basic factor in SSIM is a term like :
ss = 2 * x*y / (x*x + y*y )
If x and y are the same, this is 1.0 ; the more different they are, the smaller it is. In fact this is a normalized dot product or
a "triangle metric". That is :
ss = ( (x+y)^2 - x*x - y*y ) / (x*x + y*y )
if you pretend x and y are the two short edges of a triangle, this is the squared length of the long edge between them minus the squared
length of each short edge, normalized by the sum of the squared lengths of the short edges.
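A quick numeric check that the two ways of writing the term agree. This is just the scalar formula from the text; the function names are mine.

```python
def ss_term(x, y):
    """Basic SSIM-style similarity term: 2xy / (x^2 + y^2)."""
    return 2.0 * x * y / (x * x + y * y)

def ss_triangle(x, y):
    """The same value written in the 'triangle' form."""
    return ((x + y) ** 2 - x * x - y * y) / (x * x + y * y)

# Both forms should print the same number for every pair.
for x, y in [(1.0, 1.0), (1.0, 3.0), (3.0, 7.0), (250.0, 252.0)]:
    print(x, y, ss_term(x, y), ss_triangle(x, y))
```

Note that both forms blow up at x = y = 0, which is one reason real implementations add a constant to the denominator.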
Now, this has just been scalar, but to go to multi-component SSIM you could easily imagine the right thing to do is to go to vector ops :
ss = 2 * Dot(x,y) / ( LenSqr(x) + LenSqr(y) )
That might in fact be a good way to do n-component SSIM; it's impossible to say without doing human rating tests to see if it's better or not.
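Here is a small sketch of the two options side by side: the per-component term combined with an arithmetic mean, and the vector form. The helper names and the sample pixel values are made up, and this only covers the basic term, not a full SSIM.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def len_sqr(a):
    """Squared length of a vector."""
    return dot(a, a)

def ss_vector(x, y):
    """Vector form: 2 * Dot(x,y) / ( LenSqr(x) + LenSqr(y) )."""
    return 2.0 * dot(x, y) / (len_sqr(x) + len_sqr(y))

def ss_per_channel_mean(x, y):
    """Scalar term on each component, combined with an arithmetic mean."""
    terms = [2.0 * p * q / (p * p + q * q) for p, q in zip(x, y)]
    return sum(terms) / len(terms)

# Hypothetical 3-component pixels: identical in the first (large) component,
# different in the two small ones.
a = (50.0, 2.0, 1.0)
b = (50.0, 1.0, 2.0)
print(ss_per_channel_mean(a, b), ss_vector(a, b))
```

The two disagree quite a bit on this pair: the per-component version scores each small channel on its own scale, while the vector version lets the large component dominate. Which behavior is closer to what people see is exactly the open question.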
Now while we're at it let's make a note on this basic piece of the SSIM.
ss = 2 * x*y / (x*x + y*y ) = 1 - (x-y)^2 / ( x*x + y*y )
we can see that all it's done is take the normal L2 MSE term , and scale it by the inverse magnitude of the values. The reason they do this
is to make SSIM "scale independent" , that is if you replace x and y with sx and sy you get the same number out for SSIM. But in fact what
it does is make errors in low values much more important than errors in high values.
ssim :
delta from 1->3
2*1*3 / ( 1*1 + 3*3 ) = 6 / 10 = 0.6
delta from 250->252
2*250*252 / ( 250*250 + 252*252 ) = 0.999968
Now certainly it is true that a delta of 2 at value 1 is more important than a delta of 2 at value 250 (so called "luma masking") - but is it really *this* much more important?
In terms of error (1 - ssim), the difference is 0.40 vs. 0.000032 , a factor of roughly 12,500 (about 1,250,000 %).
I think not. My conjecture is that this scale-independence aspect of SSIM is wrong; it's over-counting low-value errors vs. high-value errors.
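The arithmetic is easy to reproduce. This sketch uses only the basic term with no stabilizing constant; the function name is mine.

```python
def ss_term(x, y):
    """Basic SSIM-style term with no stabilizing constant."""
    return 2.0 * x * y / (x * x + y * y)

# Same absolute delta of 2, at a low value and at a high value.
err_low = 1.0 - ss_term(1.0, 3.0)
err_high = 1.0 - ss_term(250.0, 252.0)

print("error at 1->3     :", err_low)
print("error at 250->252 :", err_high)
print("ratio             :", err_low / err_high)
```

Without rounding the small error first, the ratio comes out a little above 12,500, which doesn't change the point.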
(ADDENDUM : I should note that real SSIM implementations have a hacky constant term added to the numerator and denominator which reduces this
exaggeration)
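To see how much the constant helps, here is the same term with it added. I'm assuming the usual choice from the SSIM paper, C1 = (K1*L)^2 with K1 = 0.01 and L = 255 for 8-bit data. In a real SSIM this is applied to local means rather than raw pixel values (and there is a second constant for the variance term), so treat this as a sketch of the effect, not of any particular implementation.

```python
K1 = 0.01                  # assumed: the common default
L = 255.0                  # assumed: dynamic range of 8-bit data
C1 = (K1 * L) ** 2         # about 6.5025

def ss_raw(x, y):
    """Basic term with no constant."""
    return 2.0 * x * y / (x * x + y * y)

def ss_stabilized(x, y, c=C1):
    """Same term with the constant added to numerator and denominator."""
    return (2.0 * x * y + c) / (x * x + y * y + c)

for x, y in [(1.0, 3.0), (250.0, 252.0)]:
    print(x, y, 1.0 - ss_raw(x, y), 1.0 - ss_stabilized(x, y))
```

The low-value error shrinks noticeably while the high-value error barely moves, so the exaggeration is reduced but far from gone.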
As usual in the SSIM papers they show that SSIM is better at detecting true visual quality than a straw man opponent - pure RMSE. But what is in SSIM ?
It's plain old MSE with a scaling to make low values count more, and it's got a local "detail" detection term in the form of block sdevs. So if you're
going to make a fair comparison, you should test against something similar. You could easily do RMSE on scaled values (perhaps log-scale values, or
simple sqrt-values) to make low value errors count more, and you could easily add a detail preservation term by measuring local activity and adding an
RMSE-like term for that.
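A minimal sketch of what such a fairer opponent might look like, on a 1D run of pixel values for brevity. Everything here is an assumption for illustration: the sqrt scaling, the 3-sample window, the use of window standard deviation as "activity", and the equal weighting of the two terms. None of it has been tested against human ratings.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

def sqrt_scaled(values):
    """Compress the value range so low-value errors count more (assumes values >= 0)."""
    return [math.sqrt(v) for v in values]

def local_activity(values, radius=1):
    """Standard deviation in a sliding window, one value per sample."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - radius):min(n, i + radius + 1)]
        mean = sum(window) / len(window)
        var = sum((v - mean) ** 2 for v in window) / len(window)
        out.append(math.sqrt(var))
    return out

def scaled_rmse_with_detail(original, decoded, detail_weight=1.0):
    """RMSE on sqrt-scaled values plus an RMSE-like term on local activity."""
    a = sqrt_scaled(original)
    b = sqrt_scaled(decoded)
    value_err = rmse(a, b)
    detail_err = rmse(local_activity(a), local_activity(b))
    return value_err + detail_weight * detail_err

# The same delta of 2 costs more at a low value than at a high one,
# but by a far smaller factor than the raw SSIM term gives.
print(scaled_rmse_with_detail([1, 1, 1, 1], [3, 1, 1, 1]))
print(scaled_rmse_with_detail([250, 250, 250, 250], [252, 250, 250, 250]))
```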
What's the point of even looking at the RMSE numbers? If we just care about perceptual quality, why not just post that?
Well, a few reasons. One, as noted previously, we don't completely trust our perceptual metric, so having the RMSE numbers provides a
bit of a sanity check fallback for that. For another, it lets us sort of check on the perceptual tuning of the compressor. For example
if we find something that does very well on RGB-RMSE but badly on the perceptual metric, that tells us that it has not been perceptually
tuned; it might actually be an okay compressor if it has good RMSE results. Having multiple metrics and multiple bit rates and multiple
test images sort of lets you peer into the function of the compressor a bit.
What's the point of this whole process? Well there are a few purposes for me.
One is to work out a simple reproducible pipeline in which I can test my own compressors and get a better idea of whether they are
reasonably competitive. You can't just compare against other people's published results because so many of the tests are
done on bad data, or without enough details to be reproducible. I'd also like to find a more perceptual metric that I can use.
Another reason is for me to actually test a lot of the claims that people bandy about without much support, like is H264-Intra really
a very good still image coder? Is JPEG2000 really a lame duck that's not worth bothering with? Is JPEG woefully old and easy to beat?
The answers to those and other questions are not so clear to me.
Finally, hopefully I will set up an easy to reproduce test method so that anyone at home can make these results, and then hopefully
we will see other people around the web doing more responsible testing. Not bloody likely, I know, but you have to try.
(*1) : you can see this for example in the x264 -stillimage results, where they are targeting "perceptual error" in a way that I don't
measure. There may be compressors for example which are successfully targeting some types of perceptual error and not targeting
the color-perception issue, and I am unfairly showing them to be very poor.
However, just because this perceptual metric is not perfect doesn't mean we should just give up and use RMSE. You have to use the best
thing you have available at the time.
Generally there are two classes of perceptual error which I am going to just brazenly coin terms for right now : ocular and cognitive.
The old JPEG/ SCIELAB / DCTune type perceptual error optimization is pretty much all ocular. That is, they are involved in studying
the spatial resolution of rods vs. cones, the optic nerve signal masking of high contrast impulses, the thresholds of visibility of
various DCT shapes, etc. It's sort of a raw measure of how the optical signal gets to the brain.
These days we are more interested in the "cognitive" issues. This is more about things like "this looks smudgy" or "there's ringing here"
or "this straight line became non-straight" or "this human face got scrambled". It's more about the things that the active brain focuses
on and notices in an image. If you have a good model for cognitive perception, you can actually make an image that is really screwed up
in an absolute "ocular" sense, but the brain will still say "looks good".
The nice thing about the ocular perceptual optimization is that we can define it exactly and go study it and come up with a bunch of
numbers. The cognitive stuff is way more fuzzy and hard to study and put into specific metrics and values.
Some not very related random links :
Perceptual Image Difference Utility
New cjpeg features
NASA Vision Group - Publications
JPEG 2000 Image Codecs Comparison
IJG swings again, and misses Hardwarebug
How-To Extract images from a video file using FFmpeg - Stream #0
Goldfishy Comparison WebP, JPEG and JPEG XR
DCTune 2.0 README
A brief note on this WebP vs. JPEG test :
Real world analysis of google's webp versus jpg English Hard
First of all he uses a broken JPEG compressor (which is then later fixed). Second he's showing these huge images way scaled down, you have to dig around for a link to
find them in their native sizes. He's using old JPEG-Huff without a post-unblock ; okay, that's fine if you want to compare against ancient JPEG, but you could easily
test against JPEG-Arith with unblocking. But the real problem is the file sizes he compresses to. They're around 0.10 - 0.15 bpp ; to get the JPEGs down to that
size he had to set "quality" to something like 15. That is way outside of the functional range of JPEG. It's abusing the format - the images are actually huge, then
compressed down to a tiny number of bits, and then scaled down to display.
Despite that, it does demonstrate a case where WebP is definitely significantly better - smooth gradients with occasional edges. If WebP is competitive with JPEG on photographs
but beats it solidly on digital images, that is a reasonable argument for its superiority.
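To put the 0.10 - 0.15 bpp range from that test in perspective, here is the bits-per-pixel arithmetic. The image dimensions below are made up (I'm not quoting the sizes from that test), and the function names are mine.

```python
def bits_per_pixel(file_size_bytes, width, height):
    """Compressed size in bits divided by the number of pixels."""
    return file_size_bytes * 8.0 / (width * height)

def bytes_for_bpp(bpp, width, height):
    """File size in bytes needed to hit a target bits-per-pixel."""
    return bpp * width * height / 8.0

# Hypothetical 12 megapixel photo squeezed to 0.125 bpp.
width, height = 4000, 3000
size = bytes_for_bpp(0.125, width, height)
print(size, "bytes for", width, "x", height, "at 0.125 bpp")
print(bits_per_pixel(size, width, height), "bpp")
```

That is a 12 megapixel image in well under 200 KB, which is why the JPEG quality setting has to be pushed so far down to get there.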