10-16-10 - Image Comparison Part 11 - Some Notes on the Tests

Let's stop for a second and talk about the tests we've been running. First of all, as noted in a comment, these are not the final tests. This is the preliminary round for me to work out my pipeline and make sure I'm running the compressors right, and to reduce the competitors to a smaller set. I'll run a final round on a large set of images and post a graph for each image.

What is this MS-SSIM-SCIELAB exactly? Is it a good perceptual metric?

SCIELAB (see 1 and 2) is a color transform that is "perceptually uniform", i.e. a delta of 1 has the same human visual importance at all locations in the color cube, and is also spatially filtered to account for the difference in human chroma spatial resolution. The filter is done in the opponent-color domain, and basically the luma gets a sharp filter and the chroma gets a wide filter. Follow the blog posts for more details.

SCIELAB is pretty good at accounting for one particular perceptual factor of error measurement - the difference in importance of luma vs. chroma and the difference in spatial resolution of luma vs. chroma. But it doesn't account for lots of other perceptual factors. Because of that, I recognize that using SCIELAB somewhat distorts the test results. What it does is give a bonus to compressors that are perceptually tuned for this particular issue, and it doesn't care about other issues. More on this later. (*1) (ADDENDUM : of course using SCIELAB also gives an advantage to people who use a colorspace similar to LAB, such as old JPEG YUV, and penalizes people who use YCoCg, which is not so close. Whether or not you consider this to be fair depends on how accurate you believe LAB to be as an approximation of how the eye sees).

MS-SSIM is multi-scale SSIM and I use it on the SCIELAB data. There are a few issues about this which are problematic.

How do you do SSIM for multi-component data?

It's not that clear. You can just do the SSIM on each component and then combine - probably with an arithmetic mean, but if you look at the SSIM a bit you might get other ideas. The basic factor in SSIM is a term like :

ss = 2 * x*y / (x*x + y*y )

If x and y are the same, this is 1.0 , the more different they are, the smaller it is. In fact this is a normalized dot product or a "triangle metric". That is :

ss = ( (x+y)^2 - x*x - y*y ) / (x*x + y*y )

if you pretend x and y are the two short edges of a triangle, this is the length squared of the long edge between them minus the length squared of each short edge, divided by the sum of the squared lengths of the short edges.

Now, this has just been scalar, but to go to multi-component SSIM you could easily imagine the right thing to do is to go to vector ops :

ss = 2 * Dot(x,y) / ( LenSqr(x) + LenSqr(y) )

That might in fact be a good way to do n-component SSIM, it's impossible to say without doing human rating tests to see if it's better or not.
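To make the two options concrete, here is a minimal sketch of the per-component (arithmetic mean) form next to the vector form. The function names are made up for illustration, and this is only the bare `ss` term on raw values, not a full SSIM (which works on local means and variances and adds stabilizing constants).

```python
def ss_scalar(x, y):
    """Basic scalar term: 2*x*y / (x*x + y*y). Undefined when x == y == 0."""
    return 2.0 * x * y / (x * x + y * y)


def ss_per_component_mean(x, y):
    """Option 1: run the scalar term on each component, then take the mean."""
    return sum(ss_scalar(a, b) for a, b in zip(x, y)) / len(x)


def ss_vector(x, y):
    """Option 2: vector ops, 2*Dot(x,y) / (LenSqr(x) + LenSqr(y))."""
    dot = sum(a * b for a, b in zip(x, y))
    len_sqr = sum(a * a for a in x) + sum(b * b for b in y)
    return 2.0 * dot / len_sqr


# Hypothetical 3-component pixels: identical in the large first component,
# different in the two small ones.
x = (100.0, 2.0, 2.0)
y = (100.0, 4.0, 1.0)

print(ss_per_component_mean(x, y))  # each component counts equally
print(ss_vector(x, y))              # dominated by the large component
```

On this example the per-component mean comes out near 0.87 while the vector form is above 0.999, because the vector form lets the large component swamp the small ones. Which behavior is closer to what people see is exactly the open question.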

Now while we're at it let's make a note on this basic piece of the SSIM.

ss = 2 * x*y / (x*x + y*y ) = 1 - (x-y)^2 /  ( x*x + y*y )

we can see that all it's done is take the normal L2 squared-error term and scale it by the inverse squared magnitude of the values. The reason they do this is to make SSIM "scale independent", that is, if you replace x and y with sx and sy you get the same number out for SSIM. But in fact what it does is make errors in low values much more important than errors in high values.

ssim :

    delta from 1->3

    2*1*3 / ( 1*1 + 3*3 ) = 6 / 10 = 0.6

    delta from 250->252

    2*250*252 / ( 250*250 + 252*252 ) = 0.999968

Now certainly it is true that a delta of 2 at value 1 is more important than a delta of 2 at value 250 (so called "luma masking") - but is it really *this* much more important? In terms of error (1 - ssim), the difference is 0.40 vs. 0.000032, a factor of about 12,500 (1,250,000 %). I think not. My conjecture is that this scale-independence aspect of SSIM is wrong; it's over-counting low-value errors vs. high-value errors. (ADDENDUM : I should note that real SSIM implementations have a hacky constant term added to the numerator and denominator which reduces this exaggeration)
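A quick way to check these numbers, and to see how much the constant term helps, is the sketch below. It assumes the usual SSIM luminance constant for 8-bit data, C1 = (0.01 * 255)^2 = 6.5025, and applies it to raw values rather than to local means as a real SSIM would, so treat it as an illustration only.

```python
def ss(x, y, c=0.0):
    """Basic SSIM-style term, with an optional stabilizing constant c."""
    return (2.0 * x * y + c) / (x * x + y * y + c)


def err(x, y, c=0.0):
    """Error form: 1 - ss, which equals (x-y)^2 / (x*x + y*y + c)."""
    return 1.0 - ss(x, y, c)


# Assumed constant: the standard SSIM C1 for 8-bit data.
C1 = (0.01 * 255) ** 2

for x, y in ((1, 3), (250, 252)):
    print(x, y, err(x, y), err(x, y, C1))

# How much more the low-value delta counts than the high-value delta.
print(err(1, 3) / err(250, 252))
print(err(1, 3, C1) / err(250, 252, C1))
```

With the constant, the low-value error drops from 0.40 to about 0.24 while the high-value error barely moves, so the ratio falls from roughly 12,600 to roughly 7,600. The exaggeration is reduced, but it is still very large.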

As usual in the SSIM papers they show that SSIM is better at detecting true visual quality than a straw man opponent - pure RMSE. But what is in SSIM ? It's plain old MSE with a scaling to make low values count more, and it's got a local "detail" detection term in the form of block sdevs. So if you're going to make a fair comparison, you should test against something similar. You could easily do RMSE on scaled values (perhaps log-scale values, or simple sqrt-values) to make low value errors count more, and you could easily add a detail preservation term by measuring local activity and adding an RMSE-like term for that.
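A rough sketch of what such a "fairer opponent" might look like follows. Everything specific here is an assumption made for illustration: the sqrt scaling, the 1-D signal, the block size of 4, the use of block standard deviation as the activity measure, and the equal weighting of the two terms. None of it has been tuned or validated against human ratings.

```python
import math


def rmse(a, b):
    """Plain root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def scaled_rmse(a, b, scale=math.sqrt):
    """RMSE on scaled values, so deltas at low values count more."""
    return rmse([scale(p) for p in a], [scale(q) for q in b])


def local_sdevs(pixels, block=4):
    """Standard deviation of each non-overlapping block (local activity)."""
    out = []
    for i in range(0, len(pixels) - block + 1, block):
        blk = pixels[i:i + block]
        mean = sum(blk) / block
        out.append(math.sqrt(sum((p - mean) ** 2 for p in blk) / block))
    return out


def scaled_rmse_with_detail(a, b, block=4, detail_weight=1.0):
    """Scaled RMSE plus an RMSE-like penalty on lost or added local activity."""
    value_term = scaled_rmse(a, b)
    detail_term = rmse(local_sdevs(a, block), local_sdevs(b, block))
    return value_term + detail_weight * detail_term


# Same delta of 2 at low and high values: the low one scores much worse.
print(scaled_rmse([1] * 8, [3] * 8), scaled_rmse([250] * 8, [252] * 8))

# A textured signal flattened to its mean: the detail term picks this up.
textured = [0, 10, 0, 10, 0, 10, 0, 10]
flat = [5] * 8
print(scaled_rmse_with_detail(textured, flat))
```

The point is not that this particular metric is good, only that a comparison against it would be more informative than a comparison against raw RMSE.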

What's the point of even looking at the RMSE numbers? If we just care about perceptual quality, why not just post that?

Well, a few reasons. One, as noted previously, we don't completely trust our perceptual metric, so having the RMSE numbers provides a bit of a sanity check fallback for that. For another, it lets us sort of check on the perceptual tuning of the compressor. For example if we find something that does very well on RGB-RMSE but badly on the perceptual metric, that tells us that it has not been perceptually tuned; it might actually be an okay compressor if it has good RMSE results. Having multiple metrics and multiple bit rates and multiple test images sort of lets you peer into the function of the compressor a bit.

What's the point of this whole process? Well there are a few purposes for me.

One is to work out a simple reproducible pipeline in which I can test my own compressors and get a better idea of whether they are reasonably competitive. You can't just compare against other people's published results because so many of the tests are done on bad data, or without enough details to be reproducible. I'd also like to find a more perceptual metric that I can use.

Another reason is for me to actually test a lot of the claims that people bandy about without much support, like is H264-Intra really a very good still image coder? Is JPEG2000 really a lame duck that's not worth bothering with? Is JPEG woefully old and easy to beat? The answers to those and other questions are not so clear to me.

Finally, hopefully I will set up an easy to reproduce test method so that anyone at home can make these results, and then hopefully we will see other people around the web doing more responsible testing. Not bloody likely, I know, but you have to try.

(*1) : you can see this for example in the x264 -stillimage results, where they are targeting "perceptual error" in a way that I don't measure. There may be compressors for example which are successfully targeting some types of perceptual error and not targeting the color-perception issue, and I am unfairly showing them to be very poor.

However, just because this perceptual metric is not perfect doesn't mean we should just give up and use RMSE. You have to use the best thing you have available at the time.

Generally there are two classes of perceptual error which I am going to just brazenly coin terms for right now : ocular and cognitive.

The old JPEG / SCIELAB / DCTune type perceptual error optimization is pretty much all ocular. That is, they are involved in studying the spatial resolution of rods vs. cones, the optic nerve signal masking of high contrast impulses, the thresholds of visibility of various DCT shapes, etc. It's sort of a raw measure of how the optical signal gets to the brain.

These days we are more interested in the "cognitive" issues. This is more about things like "this looks smudgy" or "there's ringing here" or "this straight line became non-straight" or "this human face got scrambled". It's more about the things that the active brain focuses on and notices in an image. If you have a good model for cognitive perception, you can actually make an image that is really screwed up in an absolute "ocular" sense, but the brain will still say "looks good".

The nice thing about the ocular perceptual optimization is that we can define it exactly and go study it and come up with a bunch of numbers. The cognitive stuff is way more fuzzy and hard to study and put into specific metrics and values.

Some not very related random links :

Perceptual Image Difference Utility
New cjpeg features
NASA Vision Group - Publications
JPEG 2000 Image Codecs Comparison
IJG swings again, and misses Hardwarebug
How-To Extract images from a video file using FFmpeg - Stream #0
Goldfishy Comparison WebP, JPEG and JPEG XR

A brief note on this WebP vs. JPEG test :

Real world analysis of google's webp versus jpg English Hard

First of all he uses a broken JPEG compressor (which is then later fixed). Second he's showing these huge images way scaled down, you have to dig around for a link to find them in their native sizes. He's using old JPEG-Huff without a post-unblock ; okay, that's fine if you want to compare against ancient JPEG, but you could easily test against JPEG-Arith with unblocking. But the real problem is the file sizes he compresses to. They're around 0.10 - 0.15 bpp ; to get the JPEGs down to that size he had to set "quality" to something like 15. That is way outside of the functional range of JPEG. It's abusing the format - the images are actually huge, then compressed down to a tiny number of bits, and then scaled down to display.

Despite that, it does demonstrate a case where WebP is definitely significantly better - smooth gradients with occasional edges. If WebP is competitive with JPEG on photographs but beats it solidly on digital images, that is a reasonable argument for its superiority.
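For reference, the bits-per-pixel figures quoted in this note are just the compressed file size in bits divided by the pixel count. A small sketch, where the 4000x3000 image size is a hypothetical stand-in for "huge" and not a number taken from that test:

```python
def bits_per_pixel(file_bytes, width, height):
    """Compressed size in bits divided by the number of pixels."""
    return file_bytes * 8.0 / (width * height)


def bytes_for_bpp(bpp, width, height):
    """File size in bytes needed to hit a target bits-per-pixel rate."""
    return bpp * width * height / 8.0


# Hypothetical 12-megapixel image.
w, h = 4000, 3000

print(bits_per_pixel(187500, w, h))  # a 187,500 byte file is 0.125 bpp
print(bytes_for_bpp(1.0, w, h))      # 1.0 bpp would be 1,500,000 bytes
```

So an image of that size squeezed to 0.10 - 0.15 bpp is getting something like an eighth to a tenth of the bits it would have at 1.0 bpp.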


Anonymous said...

You may want to try running jpegrescan for generating "optimal" progressive jpegs. (Only matters for jpeghuff, though.)

There ought to be a program that tries to optimize the jpeg quantization matrices for ssim output--this is a little unfair since obvious optimizing *for* ssim when you *measure* ssim is bogus, but might be interesting to see, anyway.

cbloom said...

" You may want to try running jpegrescan for generating "optimal" progressive jpegs. "

Looking it up ...
Unpossible, it's some insane unix script. Would be easier to write my own version of that.

"There ought to be a program that tries to optimize the jpeg quantization matrices for ssim output"

Yup. Unclear how to do that other than brute force search of some kind though.

"this is a little unfair since obvious optimizing *for* ssim when you *measure* ssim is bogus"

Well, x264 is set for ssim. There are two separate things you can test - 1. how well can the encoder hit the metric if you tell it the metric and let it optimize for it, and 2. how good is the encoder generically and hope that your metric is weird enough that none of the encoders are trained for your metric.

Anonymous said...

Yep, that's why it's only a little unfair!

Didn't we decide x264's metric was some weird absolute sum of transformed differences that isn't actually the same as SSIM? But maybe it follows it closely, or something.

cbloom said...

"Didn't we decide x264's metric was some weird absolute sum of transformed differences that isn't actually the same as SSIM? But maybe it follows it closely, or something. "

Yeah, but they have lots of tweaks that can scale it and weight it that are turned on by the different "--tune" options. And the main thing --tune ssim seems to do is set some modes for the variance-adaptive quantization.

Unknown said...

Dear cbloom, I am working for Human Monitoring and hipix. Please download the zip file from this link:


Examine the two comparisons of hipix to JPEG. I used Irfanview, saving to JPEG without any additional metadata.

Please DO NOT USE PSNR OR SSIM or any other BLIND metric. Just look at the two images using your eyes. These images are not some casual or carefully picked images. They are two industry widely-used test screens, one a typical SMPTE and the second a classic test image.

I believe that looking at these comparisons - unless you are blind, which I hope you are not – you’ll have to admit that your conclusions regarding hipix are fundamentally wrong. I expect that after the examination of these images you’ll have the decency to apologize for your slandering hipix, based on what I see as a totally unprofessional examination.

Since the source images are attached in the zip file you can recreate the results by yourself.

Anybody who’ll take the trouble to look at these comparisons with JPEG, will tell you that you need no special tools but rather mediocre eyesight to see the huge advantage of hipix over JPEG.

cbloom said...

BTW if you want to send me some Hipix related stuff you can use email if you like. cb at my domain

Please see


I posted a test image there; I think "unless you are blind" you would agree that the JPEG looks much better.

I agree of course that the most important thing in the end is using your own eyes, but if something is worse in RMSE and worse in SSIM it better have some really brilliant perceptual optimization to be better overall. So far to this date there has never been an image compressor like that ever, maybe Hipix is the first one, but it requires some pretty strong proof to say "this is better even though our analytic metrics say it is not".

I mean, Kakadu and x264 both well beat JPEG on both the RMSE and SCIELAB-SSIM metrics, but Hipix did not, so something is very fishy.

As for the PDI test image, I can't find an "original" for it that's not a jpeg, do you have an actual lossless original? I'd like to make my own version of the test images, because you are right the jpeg you sent does look very bad. But obviously you have to do the encoding well to be fair. You should note in the post that I was comparing Hipix to JPEG-PAQ.

I'll look into this more when I get a chance. It would be nice if I had a Hipix command line, because generating tests with the GUI is very annoying.

cbloom said...

Though one thing I notice right off the bat is it looks like you are using a common unfair trick of comparing to JPEG in a bit rate region outside its functional range.

It looks like that image is at 0.35 bpp , and you have to use JPEG quality = 20 to get there.

Everyone knows that JPEG's quality slopes off much faster than modern coders at very low bit rates, so it's easy to make it look very bad.

To do a fair test, please use bit rates around 1.0

A lot of people doing tests these days seem to want to encode very huge images at very low bit rates, which I don't think is a very useful test scenario.

A common mistake that testers make is to crank the bit rate lower and lower with the idea of making the difference easy to see. That in fact might change the results completely, as the best algorithm is not the same at all bit rates.
