RMSE of fit vs. observed MOS data :
RMSE_RGB         : 1.052392
SCIELAB_RMSE     : 0.677143
SCIELAB_MyDelta  : 0.658017
MS_SSIM_Y        : 0.608917
MS_SSIM_IW_Y     : 0.555934
PSNRHVSM_Y       : 0.521825
PSNRHVST_Y       : 0.500940
PSNRHVST_YUV     : 0.480360
MyDctDelta_Y     : 0.476927
MyDctDelta_YUV   : 0.444007
BTW I don't actually use the raw RMSE as posted above. I bias by the sdev of the observed MOS data - that is, smaller sdev = you care about those points more. See previous blog posts on this issue. The sdev biased scores (which is what was posted in previous blog posts) are :
RMSE_RGB         : 1.165620
SCIELAB_RMSE     : 0.738835
SCIELAB_MyDelta  : 0.720852
MS_SSIM_Y        : 0.639153
MS_SSIM_IW_Y     : 0.563823
PSNRHVSM_Y       : 0.551926
PSNRHVST_Y       : 0.528873
PSNRHVST_YUV     : 0.515720
MyDctDelta_Y     : 0.490206
MyDctDelta_YUV   : 0.458081
Combo            : 0.436670 (*)
(* = ADDENDUM : I added "Combo" which is the best linear combo of SCIELAB_MyDelta + MS_SSIM_IW_Y + MyDctDelta_YUV ; it's a static linear combo, obviously you could do better by going all Netflix-Prize-style and treating each metric as an "expert" and doing weighted experts based on various decision attributes of the image; eg. certain metrics will do better on certain types of images so you weight them from that).
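For reference, here is a minimal sketch of what an sdev-biased RMSE could look like. This is an assumption, not the exact formula behind the numbers above (that is in the earlier posts): each point is weighted by 1/sdev and the weights are normalized, so points where the human observers agreed more count for more. The function name `SdevWeightedRmse` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted RMSE of fitted metric scores vs. observed MOS.
// ASSUMPTION: weight = 1/sdev, normalized by the sum of weights.
// The bias actually used for the posted scores may differ.
double SdevWeightedRmse(const std::vector<double>& fitted,
                        const std::vector<double>& mos,
                        const std::vector<double>& sdev)
{
    assert(fitted.size() == mos.size() && mos.size() == sdev.size());
    assert(!mos.empty());

    double weightedSqSum = 0.0;
    double weightSum = 0.0;
    for (std::size_t i = 0; i < mos.size(); ++i)
    {
        assert(sdev[i] > 0.0);             // sdev of the human scores for this image
        const double w = 1.0 / sdev[i];    // smaller sdev = bigger weight
        const double d = fitted[i] - mos[i];
        weightedSqSum += w * d * d;
        weightSum += w;
    }
    return std::sqrt(weightedSqSum / weightSum);
}
```

With all sdevs equal this reduces to the plain RMSE, which is a handy check that the weighting itself is not doing anything strange.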
For a sanity check I made plots ; the X axis is the human observed MOS score, the Y axis is the fitted metric.
Sanity is confirmed. (the RMSE_RGB plot has those horizontal lines because one of the distortion types is RGB random noise at a few fixed RMSE levels - you can see that for the same amount of RGB RMSE noise there are a variety of human MOS scores).
ADDENDUM : if you haven't followed old posts, this is on the TID2008 database (without "exotics"). I really need to find another database to cross-check to make sure I haven't over-trained.
Some quick notes of what worked and what didn't work.
What worked :

Variance Masking of high-frequency detail
Variance Masking of DC deltas
PSNRHVS JPEG-style visibility thresholds
Using the right spatial scale for each piece of the metric (eg. what size window for local sdev, what spatial filter for DC delta)
Space-frequency subband energy preservation
Frequency subband weighting

What didn't work :

Luma Masking
LAB or other color spaces than YUV in most metrics
anything but "Y" as the most important part of the metric
Nonlinear mappings of signal and perception (other than the nonlinear mapping already in gamma correction)
4 comments:
Wow, that looks pretty damn sweet.
It's interesting that the MyDct plot has a few outliers where the visual quality is underestimated, but not where it's significantly overestimated. I wonder if that's just a blip or reproducible with other datasets.
How does the spatial band energy preservation work? Do you partition the DCT coefficients into a set of buckets of roughly similar frequency content and take a weighted L2 norm of the difference, or is it more complicated?
"It's interesting that the MyDct plot has a few outliers where the visual quality is underestimated, "
Yeah, I was wondering what that is. If I was spending more time on this I would look at those images and see what it is about them that's different.
"Do you partition the DCT coefficients into a set of buckets of roughly similar frequency content and take a weighted L2 norm of the difference, or is it more complicated? "
Yep, pretty much just that. Though L1 is better than L2 and the exact composition and weighting of the groups matters. I described an older version of the idea here :
http://cbloomrants.blogspot.com/2010/10/10-30-10-detail-preservation-in-images.html
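A rough sketch of the bucketing idea discussed in this exchange, for a single 8x8 block whose DCT has already been computed. Everything specific here is an assumption for illustration: the four buckets by diagonal index u+v, the names `BandOf`, `BandEnergyL1` and `BandEnergyDelta`, and the reading of "energy preservation" as comparing per-band L1 sums rather than per-coefficient differences. The real grouping and weights are not given in this post.

```cpp
#include <array>
#include <cassert>
#include <cmath>

constexpr int kBlock = 8;     // 8x8 DCT block
constexpr int kNumBands = 4;  // hypothetical number of frequency buckets

using DctBlock = std::array<double, kBlock * kBlock>;  // row-major, index = v*8 + u
using BandValues = std::array<double, kNumBands>;

// Hypothetical bucketing by diagonal index u+v. DC returns -1 (not in any band).
int BandOf(int u, int v)
{
    assert(u >= 0 && u < kBlock && v >= 0 && v < kBlock);
    const int s = u + v;
    if (s == 0) return -1;
    if (s <= 2) return 0;
    if (s <= 4) return 1;
    if (s <= 7) return 2;
    return 3;
}

// L1 "energy" per band : sum of absolute AC coefficients in each bucket.
BandValues BandEnergyL1(const DctBlock& coefs)
{
    BandValues energy{};  // zero-initialized
    for (int v = 0; v < kBlock; ++v)
    {
        for (int u = 0; u < kBlock; ++u)
        {
            const int band = BandOf(u, v);
            if (band >= 0)
                energy[band] += std::fabs(coefs[v * kBlock + u]);
        }
    }
    return energy;
}

// Weighted L1 difference of band energies between original and distorted block.
double BandEnergyDelta(const DctBlock& original,
                       const DctBlock& distorted,
                       const BandValues& weights)
{
    const BandValues a = BandEnergyL1(original);
    const BandValues b = BandEnergyL1(distorted);
    double delta = 0.0;
    for (int i = 0; i < kNumBands; ++i)
        delta += weights[i] * std::fabs(a[i] - b[i]);
    return delta;
}
```

Note that this version only penalizes a change in the amount of energy in each band; a sign flip of a coefficient costs nothing, so it would be used alongside a plain per-coefficient delta, not instead of one.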
I have been unable to find any information regarding how to apply these metrics to YUV data - for example, linear weights to apply to each plane. For instance, although you have a YUV implementation of PSNR-HVS-M, the original Matlab source by the author only works in grayscale.
@bztdlinux - yes, that is one of the issues that is just glossed over by most.
In some of my metrics I use this :
double planeWeights[3] = { 1.0, 0.879837, 0.412461 };
but it's more subtle than that. For example the variance masking usually just comes from the Y plane but affects all 3 planes.
On many of the test sets you can do well just measuring error in Y, because they don't stress chroma-only error well. Really we need much bigger & more varied test sets and good human ratings on them.
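As a concrete illustration of the simple case only, the plane weights above could be applied as a weighted mean of per-plane errors. This is an assumption: the comment does not say whether the weights go on squared error, on the final score, or somewhere else, and the helper name `CombinePlaneMse` is made up. It also ignores the cross-plane effects mentioned above (eg. masking computed from Y).

```cpp
#include <cassert>
#include <cmath>

// Per-plane weights quoted above, in Y, U, V order.
const double planeWeights[3] = { 1.0, 0.879837, 0.412461 };

// ASSUMPTION: combine per-plane mean squared errors as a weighted mean.
// planeMse[0..2] are the MSE of the Y, U and V planes respectively.
double CombinePlaneMse(const double planeMse[3])
{
    double num = 0.0;
    double den = 0.0;
    for (int i = 0; i < 3; ++i)
    {
        assert(planeMse[i] >= 0.0);
        num += planeWeights[i] * planeMse[i];
        den += planeWeights[i];
    }
    return num / den;
}
```

Setting the U and V weights to zero gives back the Y-only metric, which is the case that already does well on many of the current test sets.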