1/12/2011

01-12-11 - ImDiff Sample Run and JXR test

This is the output of a hands-off fully automatic run :


(on lena 512x512 RGB ) :


I was disturbed by how bad JPEG-XR was showing so I went and got the reference implementation from the ISO/ITU standardization committee and built it. It's here .

They provide VC2010 projects, which is annoying, but it built relatively easily in 2005.

Unfortunately, they just give you a bunch of options and not much guide on how to get the best quality for a given bit rate. Dear encoder writers : you should always provide a mode that gives "best rmse" or "best visual quality" for a given bit rate - possibly by optimizing your options. They also only load TIF and PNM ; dear encoder writers : you should prefer BMP, TGA and PNG. TIF is an abortion of an over-complex format (case in point : JXR actually writes invalid TIFs from its decoder (the tags are not sorted correctly)).

There are two ways to control bit-rate, either -F to throw away bit levels or -q to quantize. I tried both and found no difference in quality (except that -F mucks you up at high bit rate). Ideally the encoder would choose the optimal mix of -F and -q for R/D. I used their -d option to set UV quant from Y quant.

There are three colorspace options - y420,y422,y444. I tried them all. With no further ado :

Conclusions :

This JXR encoder is very slightly better than the MS one I was using previously, but doesn't differ significantly. It appears the one I was using previously was in YUV444 color space. Obviously Y444 gives you better RMSE behavior at high bitrate, but hurts perceptual scores.

Clearly the JXR encoders need some work. The good RMSE performance tells us it is not well perceptually optimized. However, even if it was perceptually optimized it is unlikely it would be competitive with the good coders. For example, Kakadu already matches it for RMSE, but kills it on all other metrics.


BTW you may be asking "cbloom, why is it that plain old JPEG (jpg_h) tests so well for you, when other people have said that it's terrible?". Well, there are two main reasons. #1 is they use screwed up encoders like Photoshop that put thumbnails or huge headers in the JPEG. #2 and probably the main reason is that they test at -3 or even -4 logbpp , where jpg_h falls off the quality cliff because of the old-fashioned huffman back end. #3 is that they view the JPEG at some resolution other than 1:1 (under magnification of minification); any image format that is perceptually optimized must be encoded at the viewing resolution.

One of the ways you can get into that super-low-bitrate domain where JPEG falls apart is by using images that are excessively high resolution for your display, so that you are always scaling them down in display. The solution of course is to scale them down to viewing resolutions *before* encoding. (eg. lots of images on the web are actually 1600x1200 images, encoded at JPEG Q=20 or something very low, and then displayed on a web page at a size of 400x300 ; you would obviously get much better results by using a 400x300 image to begin with and encoding at higher quality).

1/10/2011

01-10-11 - Perceptual Metrics

Almost done.

RMSE of fit vs. observed MOS data :


RMSE_RGB             : 1.052392
SCIELAB_RMSE         : 0.677143
SCIELAB_MyDelta      : 0.658017
MS_SSIM_Y            : 0.608917
MS_SSIM_IW_Y         : 0.555934
PSNRHVSM_Y           : 0.521825
PSNRHVST_Y           : 0.500940
PSNRHVST_YUV         : 0.480360
MyDctDelta_Y         : 0.476927
MyDctDelta_YUV       : 0.444007

BTW I don't actually use the raw RMSE as posted above. I bias by the sdev of the observed MOS data - that is, smaller sdev = you care about those points more. See previous blog posts on this issue. The sdev biased scores (which is what was posted in previous blog posts) are :


RMSE_RGB             : 1.165620
SCIELAB_RMSE         : 0.738835
SCIELAB_MyDelta      : 0.720852
MS_SSIM_Y            : 0.639153
MS_SSIM_IW_Y         : 0.563823
PSNRHVSM_Y           : 0.551926
PSNRHVST_Y           : 0.528873
PSNRHVST_YUV         : 0.515720
MyDctDelta_Y         : 0.490206
MyDctDelta_YUV       : 0.458081
Combo                : 0.436670 (*)

(* = ADDENDUM : I added "Combo" which is the best linear combo of SCIELAB_MyDelta + MS_SSIM_IW_Y + MyDctDelta_YUV ; it's a static linear combo, obviously you could do better by going all Netflix-Prize-style and treating each metric as an "expert" and doing weighted experts based on various decision attributes of the image; eg. certain metrics will do better on certain types of images so you weight them from that).

For sanity check I made plots (click for hi res) ; the X axis is the human observed MOS score, the Y axis is the fitted metric :

Sanity is confirmed. (the RMSE_RGB plot has those horizontal lines because one of the distortion types is RGB random noise at a few fixed RMSE levels - you can see that for the same amount of RGB RMSE noise there are a variety of human MOS scores).

ADDENDUM : if you haven't followed old posts, this is on the TID2008 database (without "exotics"). I really need to find another database to cross-check to make sure I haven't over-trained.

Some quick notes of what worked and what didn't work.


What worked :

Variance Masking of high-frequency detail

Variance Masking of DC deltas

PSNRHVS JPEG-style visibility thresholds

Using the right spatial scale for each piece of the metric
  (eg. what size window for local sdev, what spatial filter for DC delta)

Space-frequency subband energy preservation

Frequency subband weighting


What didn't work :

Luma Masking

LAB or other color spaces than YUV in most metrics

anything but "Y" as the most important part of the metric

Nonlinear mappings of signal and perception
  (other than the nonlinear mapping already in gamma correction)

01-10-11 - Perceptual Results - PDI

This is my 766x1200 version of the PDI test image (that I made by scaling down a 3600 tall jpeg one).

Hipix and JPEG-XR are both very bad. I wonder if the JPEG-XR encoder I'm using could be less than the best? If someone knows a reference to the best command line windows JPEG-XR encoder, please post it. The one I'm using is from some Microsoft HD-Photo SDK distribution.

It's interesting to see how the different encoders do on the different metrics. x264, webp and kakadu are all identical under MyDctDelta. Kakadu falls down in SSIM, but does much better on SCIELAB. This tells you that kakadu is not preserving local detail characteristics as well, but is preserving smooth overall DC levels much better.

01-10-11 - Perceptual Results - mysoup

Two sets of compressors cuz I have too many on this image ; jpg_pack is on both charts as a baseline.




As before, we see AIC , Hipix and JPEG-XR are all very poor. WebP is marginally okay (the early encoder I'm using is still very primitive) ; it does well on SSIM, but doesn't have the good low bitrate behavior you would expect from a modern coder. x264 and Kakadu are quite good. My two coders (vtims and newdct) are better than the terrible trio but not close to the good ones (I need to fix my shit).

01-10-11 - Perceptual Results - Moses

Moses is a 1600x1600 photo of a guy.

01-10-11 - Perceptual Metrics Warmup - x264 Settings

Beginning the warmup. Quick guide to these charts :

On the left you will see either "err" or "fit". "err" is the raw score of the metric. "fit" is the metric after fitting to a 0-10 human visual quality scale. SSIM err is actually percent acos angle as usual.

The fit score is 0-10 for 0 = complete ass and 10 = perfect, but I have set the graph range to 3-8 , because that is the domain we normally care about. 8 = very hard to tell the difference.

x264 on mysoup , testing different "tune" options. I'm using my y4m to do the color convert for them which helps a lot.

Well "tune psnr" is in fact best on psnr - you can see a big difference on the RMSE chart. "tune ssim" doesn't seem to actually help much on SSIM, it only beats "tune psnr" at very low bit rate. "tune stillimage" just seems to be broken in my build of x264.

Oh, and I use "xx" to refer to x264 because I can't have numbers in the names of things.

Henceforth we will use tune = ssim. (change : psnr)

ADDENDUM :

I looked into this a little more on another image (also trying x264 with no explicit tune specified) :

You can see that "--tune ssim" does help a tiny bit on MS-SSIM , but it *hurts* IW-MS-SSIM , which is a better metric (it hurts also on MyDctDeltaNew). Though the differences are pretty negligible for our level of study. No explicit tune x264 is much worse. "tune psnr" seems to be the best option according to our best metrics.

01-10-11 - Perceptual Metrics Warmup - JPEG Settings

This is a repeat of old info, just warming up the new system and setting baselines.

Our JPEG is just regular old IJG JPEG with various lossless recompressors ; results on PDI :

As seen before, PAQ has some screwups at low bitrate but is otherwise very close to packjpg, and flat quantization matrix is obviously best for RMSE but worse for visual quality ("flat" here is flat + pack).

From now on we will use jpg_pack as the reference point.

1/05/2011

01-05-11 - QuadraticExtremum

In today's story I do someone's homework for them :

/*
QuadraticExtremum gives you the extremum (either minimum or maximum)
of the quadratic that passes through the three points x0y0,x1y1,x2y2
*/
inline double QuadraticExtremum(
    double x0,double y0,
    double x1,double y1,
    double x2,double y2)
{   
    // warning : this form is pretty but it's numerically very bad
    //  when the three points are near each other
    
    double s2 = x2*x2;
    double s1 = x1*x1;
    double s0 = x0*x0;
    
    double numer = y0*(s1-s2) + y1*(s2-s0) + y2*(s0-s1);
    double denom = y0*(x1-x2) + y1*(x2-x0) + y2*(x0-x1);
    
    double ret = 0.5 * numer / denom;
    
    return ret;
}

01-05-11 - Golden 1d Searches

GoldenSearch1d finds the minimum of some function if you know the finite range to look in. Search1d_ExpandingThenGoldenDown looks in the infinite interval >= 0 . Pretty trivial thing, but handy.

The Golden ratio arises from doing a four point search and trying to reuse evaluations. If your four points are { 0, rho, 1-rho ,1 } and you shrink to the lower three { 0, rho, 1-rho }, then you want to reuse the rho evaluation as your new high interior point, so you require (1-rho)^2 = rho , hence 1-rho = (sqrt(5)-1)/2

BTW because you have various points you could obviously use interpolation search (either linear or quadratic) at various places here and reduce the number of evaluations needed to converge. See Brent's method or Dekker's method .

Also obviously this kind of stuff only works on functions with simple minima.

Also obviously if func is analytic and you can take derivatives of it there are better ways. I use this for evaluating things that aren't simple functions, but have nice shapes (such as running my image compressor with different quantizers).



template< typename t_functor >  
double GoldenSearch1d( t_functor func, double lo, double v_lo, double hi, double v_hi, double minstep )
{
    const double rho = 0.381966;
    const double irho = 1.0 - rho; // = (sqrt(5)-1)/2 

    // four points :
    // [lo,m1,m2,hi]

    double m1 = irho*lo + rho*hi;
    double m2 = irho*hi + rho*lo;

    double v_m1 = func( m1 );
    double v_m2 = func( m2 );
    
    while( (m1-lo) > minstep )
    {
        // step to [lo,m1,m2] or [m1,m2,hi]
        // only one func eval per iteration :
        if ( MIN(v_lo,v_m1) < MIN(v_hi,v_m2) )
        {
            hi = m2; v_hi = v_m2;
            m2 = m1; v_m2 = v_m1;
            m1 = irho*lo + rho*hi;
            v_m1 = func( m1 );
        }
        else
        {
            lo = m1; v_lo = v_m1;
            m1 = m2; v_m1 = v_m2;
            m2 = irho*hi + rho*lo;
            v_m2 = func( m2 );
        }
        
        ASSERT( fequal(m2, irho*hi + rho*lo) );
        ASSERT( fequal(m1, irho*lo + rho*hi) );
    }
    
    // could do a cubic fit with the 4 samples I have now
    //  but they're close together so would be numerically unstable
    
    /*
    //return (lo+hi)/2.0;
    
    if ( v_m1 < v_m2 ) return m1;
    else return m2;
    
    /*/
    // return best of the 4 samples :
    if ( v_lo < v_m1 ) { v_m1 = v_lo; m1 = lo; }
    if ( v_hi < v_m2 ) { v_m2 = v_hi; m2 = hi; }
    
    if ( v_m1 < v_m2 ) return m1;
    else return m2;
    /**/
}

template< typename t_functor >  
double GoldenSearch1d( t_functor func, double lo, double hi, double minstep )
{
    double v_lo = func( lo );
    double v_hi = func( hi );
    
    return GoldenSearch1d( func, lo, v_lo, hi, v_hi, minstep );
}

template< typename t_functor >  
double Search1d_ExpandingThenGoldenDown( t_functor func, double start, double step, double minstep , const int min_steps = 8)
{
    struct Triple
    {
        double t0,f0;
        double t1,f1;
        double t2,f2;
    };
    
    Triple cur;
    cur.t2 = cur.t1 = cur.t0 = start;
    cur.f2 = cur.f1 = cur.f0 = func( start );
        
    int steps = 0;
    
    Triple best = cur;
    
    for(;;)
    {
        cur.t0 = cur.t1; 
        cur.f0 = cur.f1;
        cur.t1 = cur.t2; 
        cur.f1 = cur.f2;
        
        cur.t2 = cur.t1 + step;
        cur.f2 = func( cur.t2 );
        
        if ( cur.f1 <= best.f1 )
        {
            best = cur;
        }
        
        // if we got worst and we're past min_steps :
        if ( cur.f2 > cur.f1 && steps > min_steps )
            break;
                
        const double golden_growth = 1.618034; // 1/rho - 1
        step *= golden_growth; // grow step by some amount
        steps++;
    }
        
    // best is at t1 bracketed in [t0,t2]   
    // could save one function eval by passing in t1,f1 as well
    
    return GoldenSearch1d(func,best.t0,best.f0,best.t2,best.f2,minstep);
}


Usage example :



#define MAKE_FUNCTOR(type,func) \
struct STRING_JOIN(func,_functor) { \
  type operator() (type x) { return func(x); } \
};

double TFunc( double x )
{
    return 100 / ( x + 1) + x;
}

MAKE_FUNCTOR(double,TFunc);

int main(int argc,char *argv[])
{

double t = Search1d_ExpandingThenGoldenDown(TFunc_functor(),0.0,1.0,0.0001);

lprintfvar(t);
lprintf(TFunc(t) , "\n");

return 0;
}

12/11/2010

12-11-10 - Perceptual Notes of the Day

You have to constantly be aware of how you're pooling data. There's no a-priori reason to prefer one way or the other; when a human looks at an image do they see the single worst spot, or do they see the average? If you show someone two images, one with a single very bad error, and another with many small errors, which one do they say is worse?

There are also various places to remap scales. Say you have measured a quantity Q in two images, one way to make an error metric out of that is :


err = Lp-norm {  remap1(  remap2( Q1 ) - remap2( Q2 ) )  }

Note that there are two different remapping spots - remap1 converts the delta into perceptually linear space, while remap2 converts Q into space where it can be linearly delta'd. And then of course you have to tweak the "p" in Lp-norm.

You're faced with more problems as soon as you have more quantities that you wish to combine together to make your metric. Combine them arithmetically or geometrically? Combine before or after spatial pooling? Weight each one in the combination? Combine before or after remapping? etc.

This is something general that I think about all the time, but it also leads into these notes :

For pixel values you have to think about how to map them (remap2). Do you want light linear? (gamma decorrected) or perhaps "perceptually uniform" ala CIE L ?

The funny thing is that perception is a mapping curve that's opposing to gamma correction. That is, the monitor does something like signal ^ 2.2 to make light. But then the eye does something like light ^ (1/3) (or perhaps more like log(light)) to make mental perception. So the net mapping of pixel -> perceptuion is actually pretty close to unity.

On the subject of Lp norms, I wrote this in email (background : a while ago I found that L1 error was much better for my video coder than L2) :


I think L1 was a win mainly because it resulted in better R/D bit distribution.

In my perceptual metric right now I'm using L2 (square error) + masking , which appears to be better still.

I think that basically L1 was good because if you use L2 without masking, then you care about errors in noisy parts of the image too much.

That is, say your image is made of two blocks, one is very smooth, one has very high variation (edges and speckle).

If you use equal quantizers, the noisy block will have a much larger error.

With L2 norm, that means R/D will try very hard to give more bits to the noisy block, because changing 10 -> 9 helps more than changing 2 -> 1.

With L1 norm it balances out the bit distribution a bit more.

But L2-masked norm does the same thing and appears to be better than L1 norm.


Lastly a new metric :

PSNR-HVS-T is just "psnr-hvs" but with thresholds. (*3) In particular, if PSNR-HVS is this very simple pseudo-code :


on all 8x8 blocks
dct1,dct2 = dct of the blocks, scaled by 1/ jpeg quantizers

for i from 0 -> 63 :
 err += square( dct1[i] - dct2[i] )

then PSNR-HVS-T is :

on all 8x8 blocks
dct1,dct2 = dct of the blocks, scaled by 1/ jpeg quantizers

err += w0 * square( dct1[0] - dct2[0] )

for i from 1 -> 63 :
  err += square( threshold( dct1[i] - dct2[i] , T ) )

the big difference between this and "PSNR-HVS-M" is that T is just a constant, not the complex adaptive mask that they compute (*1).


threshold is :

threshold(x,t) = MAX( (x - T), 0 )

The surprising thing is that PSNR-HVS-T is almost as good as PSNR-HVS-M , which is in turn basically the best perceptual metric (*2) that I've found yet. Results :


PSNR-HVS-T Spearman : 0.935155

fit rmse :
PSNR-HVS   : 0.583396  [1,2]
VIF        : 0.576115  [1]
PSNR-HVS-M : 0.564279  [1]
IW-MS-SSIM : 0.563879  [2]
PSNR-HVS-T : 0.557841
PSNR-HVS-M : 0.552151  [2]

In particular, it beats the very complex and highly tweaked out IW-MS-SSIM. While there are other reasons to think that IW-MS-SSIM is a strong metric (in particular, the maximum difference competition work), the claims that they made in the papers about superiority of fitting to MOS data appear to be bogus. A much simpler metric can fit better.

footnotes :

*1 = note on the masking in PSNR-HVS-M : The masked threshold they compute is proportional to the L2 norm of all the DCT AC coefficients. That's very simple and unsurprising, it's a simple form of variance masking which is well known. The funny thing they do is multiply this threshold by a scale :


scale = ( V(0-4,0-4) + V(0-4,4-8) + V(4-8,0-4) + V(4-8,4-8) ) / ( 4*V(0-8,0-8) )

V(x,y) = Variance of pixels in range [x,y] in the block

threshold *= sqrt( scale );

that is, divide the block into four quadrants; take the variance within each quadrant and divide by the variance of the whole block. What this "scale" does is reduce the threshold in areas where the AC activity is caused by large edges. Our basic threshold was set by the AC activity, but we don't want the important edge shapes to get crushed, so we're undoing that threshold when it corresponds to major edges.

Consider various extremes; if the block has no large-scale structure, only random variation all over, then the variance within each quad will equal the variance of the whole, which will make scale ~= 1. On the other hand, if there's a large-scale edge, then the variance on either side of the edge may be quite small, so at least one of the quads has a small variance, and the result is that the threshold is reduced.

Two more subtle follows up to this :

1A : Now, the L2 sum of DCT AC coefficients should actually be proportional to the variance of the block ! That means instead of doing


 threshold^2 ~= (L2 DCT AC) * (Variance of Quads) / (Variance of Blocks)

we could just use (Variance of Quads) directly for the threshold !! That is true, however when I was saying "DCT" earlier I meant "DCT scaled by one over JPEG quantizer" - that is, the DCT has CSF coefficients built into it, which makes it better than just using pixel variances.

1B : Another thought you might have is that we are basically using the DCT L2 AC but then reducing it for big edges. But the DCT itself is actually a decent edge detector, the values are right there in DCT[0,1] and DCT[1,0] !! So the easiest thing would be to just not include those terms in the sum, and then don't do the scale adustment at all!

That is appealing, but I haven't found a formulation of that which performs as well as the original complex form.

(*2) = "best" defined as able to match MOS data, which is a pretty questionable measure of best, however it is the same one that other authors use.

(*3) = as usual, when I say "PSNR-HVS" I actually mean "MSE-HVS", and actually now I'm using RMSE because it fits a little better. Very clear, I know.

old rants