12/09/2010

12-09-10 - Rank Lookup Error

Trying some other methods of testing to make sure the function fit isn't screwing me up too much.

Spearman rank correlations (it's just the Pearson correlation on sort ranks) :


mse.txt                 0.66782
square_ms_ssim_y.txt    0.747733
scielabL1.txt           0.855355
scielabL2.txt           0.868207
vif.txt                 0.87631
ms_ssim_bad_down.txt    0.880403
ms_ssim.txt             0.89391
iw_ms_ssim.txt          0.901085
wsnr.txt                0.90905
mydctdelta.txt          0.932998
my_psnr_hvs_m.txt       0.94082
mydctdeltanew.txt       0.944086

I just thought of a new way to check the fit for this kind of scenario which I think is pretty cool.

You have two vectors of values. One vector is the original which you wish to match, the other is output by your program to try to match the original vector. The problem is that your program outputs in some other scale, different units which may involve some warping of the curve, and you don't know what it is. You wish to find how close your program is to the target without worry about matching that curve.

Well, one way is "rank lookup error" :


given vectors "orig" and "test"

find "test_rank" such that
  r = test_rank[i] means item i is the r'th in sorted order in the vector test

find "sorted_orig" = orig, sorted

Sum{i} :
  err += square( orig[i] - sort_orig[ test_rank[ i ] ] )

that is, the fit value for mapping from test's scale to orig is to find the sort index within test, and lookup the value in the sorted list of originals.

Obviously this isn't quite ideal; it does handle ties and near-ties pretty well though (better than Spearman/Kendall, because you get less error contribution when you get the rank wrong of two items with very similar value). Most importantly it avoids all the fidgetty function fitting stuff.

Here are the scores with "rank lookup rmse" :


mse.txt                 1.310499
square_ms_ssim_y.txt    1.090392
scielabL1.txt           0.853738
scielabL2.txt           0.835508
ms_ssim_bad_down.txt    0.716507
ms_ssim.txt             0.667152
wsnr.txt                0.643821
vif.txt                 0.591508
iw_ms_ssim.txt          0.575474
mydctdelta.txt          0.548946
my_psnr_hvs_m.txt       0.543057
mydctdeltanew.txt       0.494918

if nothing else it's a good sanity check for the fitting stuff.

Also with rank-lookup-error you don't have to worry about transforms like acos for ssim or undo-psnr or anything else that's monotonic.

For comparison, these were the fit scores :


mse.txt                 1.169794
square_ms_ssim_y.txt    1.005849
scielabL1.txt           0.819501
scielabL2.txt           0.808595
ms_ssim_bad_down.txt    0.690689
ms_ssim.txt             0.639193
wsnr.txt                0.632670
vif.txt                 0.576114
iw_ms_ssim.txt          0.563879
my_psnr_hvs_m.txt       0.552151
mydctdelta.txt          0.548938
mydctdeltanew.txt       0.489756

mydctdeltanew is a new one; it just uses mydctdelta for only the AC part of the DCT (excludes the DC which is gross luma differences), and then it adds back on the gross luma difference using a form similar to IW-MS-SSIM (that is, using large filters and then using both variance masking and saliency boosting).

12-09-10 - Perceptual vs TID

My TID2008 score is the RMSE of fitting to match MOS values (0-9) ; fit from perceptual metric to MOS score is of the form Ax + Bx^C , so 0 -> 0 , and for fitting purposes I invert and normalize the scale so that 0 = no error and 1 = maximum error. For fitting I try {value},{acos(value)}, and {log(value)} and use whichever is best.

I also exclude the "exotic" distortions from TID because they are weird and not something well handled by most of the metrics (they favor SSIM which seems to be one of the few that handles them okay). I believe that including them is a mistake because they are very much unlike any distortion you would ever get from a lossy compressor.

I also weight the MOS errors by 1/Variance of each MOS ; basically if the humans couldn't agree on a MOS for a certain image, there's no reason to expect the synthetic metric to agree. (this is also like using an entropy measure for modeling the human data; see here for example)


MSE        : 1.169795  [1,2]
MSE-SCIELAB: 0.808595  [2]
MS-SSIM    : 0.691881  [1,2]
SSIM       : 0.680292  [1]
MS-SSIM-fix: 0.639193  [2] (*)
WSNR       : 0.635510  [1]
PSNR-HVS   : 0.583396  [1,2] (**)
VIF        : 0.576115  [1]
PSNR-HVS-M : 0.564279  [1] (**)
IW-MS-SSIM : 0.563879  [2]
PSNR-HVS-M : 0.552151  [2] (**)
MyDctDelta : 0.548938  [2]

[1] = scores from TID reference "metrics_values"
[2] = scores from me
[1,2] = confirmed scores from me and TID are the same

the majority of these metrics are Y only, using rec601 Y (no gamma correction) (like metric mux).

WARNING : the fitting is a nasty evil business, so these values should be considered plus/minus a few percent confidence. I use online gradient descent (aka single layer neural net) to find the fit, which is notoriously tweaky and sensitive to annoying shit like the learning rate and the order of values and so on.

(*) = MS-SSIM is the reference implementation and I confirmed that I match it exactly and got the same score. As I noted previously , the reference implementation actually uses a point subsample to make the multiscale pyramid. That's obviously a bit goofy, so I tried box subsample instead, and the result is "MS-SSIM-fix" - much better.

(**) = PSNR-HVS have received a PSNR-to-MSE conversion (ye gods I hate PSNR) ; so my "PSNR-HVS-M" is actually "MSE-HVS-M" , I'm just sticking to their name for consistency. Beyond that, for some reason my implementation of PSNR-HVS-M ([2]) is better than theirs ([1]) and I haven't bothered to track down why exactly (0.564 vs 0.552).

WSNR does quite well and is very simple. It makes the error image [orig - distorted] and then does a full image FFT, then multiplies each tap by the CSF (contrast sensitivity function) for that frequency, and returns the L2 norm.

PSNR-HVS is similar to WSNR but uses 8x8 DCT's instead of full-image FFT. And of course the CSF for an 8x8 DCT is just the classic JPEG quantization matrix. That means this is equivalent to doing the DCT and scaling by one over the JPEG quantizers, and then taking the L2 delta (MSE). Note that you don't actually quantize to ints, you stay in float, and the DCT should be done at every pixel location, not just 8-aligned ones.

VIF is the best of the metrics that comes with TID ; it's extremely complex, I haven't bothered to implement it or study it that much.

PSNR-HVS-M is just like PSNR-HVS, but adds masked thresholds. That is, for each tap of the 8x8 DCT, a just noticeable threshold is computed, and errors are only computed above the threshold. The threshold is just proportional to the L2 AC sum of the DCT - this is just variance masking. They do some fiddly shit to compensate for gross variance vs. fine variance. Notably the masking is only used for the threshold, not for scaling above threshold (though the CSF still applies to scaling above threshold).

IW-MS-SSIM is "Information Weighted" MS-SSIM. It's a simple weight term for the spatial pooling which makes high variance areas get counted more (eg. edges matter). As I've noted before, SSIM actually has very heavy variance masking put in (it's like MSE divided by Variance), which causes it to very severely discount errors in areas of high variance. While that might actually be pretty accurate in terms of "visibility", as I noted previously - saliency fights masking effects - that is, the edge areas are more important to human perception of quality. So SSIM sort of over-crushes variance and IW-SSIM puts some weight back on them. The results are quite good, but not as good as the simple DCT-based metrics.

MyDctDelta is some of the ideas I wrote about here ; it uses per-pixel DCT in "JPEG space" (that is, scaled by 1/JPEG Q's), and does some simple contrast-band masking, as well as multi-scale sum deltas.

There's a lot of stuff in MyDctDelta that could be tweaked that I haven't , I have no doubt the score on TID could easily be made much better, but I'm a bit afeared of overtraining. The TID database is not big enough or varied enough for me to be sure that I'm stressing the metric enough.


Another aside about SSIM. The usual presentation says compute the two terms :


V = (2*sigma1*sigma2 + C2)/(sigma1_sq + sigma2_sq + C2);
C = (sigma12 + C3)/(sigma1*sigma2 + C3);

for variance and correlation dot products. We're going to wind up combining these multiplicatively. But then notice that (with the right choice of C3=C2/2)

V*C = (2*sigma12 + C2/2)/(sigma1_sq + sigma2_sq + C2);

so we can avoid computing some terms.

But this is a trick. They've actually changed the computation, because they've changed the pooling. The original SSIM has three terms : mean, variance, and correlation dot products. They are each computed on a local window, so you have the issue of how you combine them into a single score. Do you pool each one spatially and then cross-multiply? Or do you cross-multiply and then pool spatially ?


SSIM = Mean{M} * Mean{V} * Mean{C}

or

SSIM = Mean{M*V*C}

which give quite different values. The "efficient" V*C SSIM winds up using :

SSIM = Mean{M} * Mean{V*C}

which is slightly worse than doing separate means and multiplying at the end (which is what MS-SSIM does).

12/07/2010

12-07-10 - Patents

Ignacio wrote about software patents a while ago and it's got me thinking.

First of all I applaud the general idea that every individual is responsible for acting morally, regardless of the environment they live in. I see far too many people who use the broken system as an excuse to behave badly themselves. eg. lots of people are polluting it doesn't matter how much I hurt the environment, the tax system is broken it doesn't matter how much I cheat, the patent system sucks so I have to be a part of it, etc. I believe this attitude is a facile rationalization for selfish behavior.

In any case, let's get back to the specific topic of patents.

In my youth I used to be the lone pro-patent voice in a sea of anti-patent peers. Obviously I was against the reality of the patent system, which consists of absurd over-general patents on things that aren't actually innovations, and the fact that expert review of technical issues before a court is an absurd farce in which money usually wins. But I was pro the basic idea of patents, partly for purely selfish reasons, because I had these technical inventions I had made and was hoping I could somehow get rich from them, as I saw others do. As a young individual inventor, I believed that getting a patent was my only way to get a fair price for my work.

Now I have a different view for a few reasons :

1. Policy should be made based on the reality of an issue, not your theoretical ideal.

The reality is that patents (and particularly software patents) are ridiculously broken. The court system does not have the means to tell when patents are reasonable or not, and it is unrealistic to think that that can be fixed.

2. The purpose of laws should be to ensure the greatest good for society.

Even if you think software patents are good for the lonely independent developer, that in itself is not reason enough to have them. You have to consider the net benefit to society. I believe that the world would be better off without software patents, but this is a little tricky.

What are the pro-patent arguments ?

One is that they encourage research funding, that companies wouldn't spend money on major research if they didn't have the patent system to ensure a monopoly for their invention.

I find this argument generally absurd. Do you think that IBM or Microsoft really wouldn't fund research that they believe will improve their business if they couldn't patent the result? Companies will fund research any time it is likely profitable; a long-term monopoly from a patent doesn't really change the research equation from "not profitable" to "profitable", it changes it to "ridiculously profitable".

Furthermore, in reality, most of the major tech companies don't actually use their patents to keep monopolies on technology, rather they engage in massive cross-license agreements to get open access to technologies. Patents wind up being a huge friction and cost for these companies, and you have to maintain a big war chest of your own patents to ensure that you can participate in the cross-licensing. The end result of this is yet another oligarchy. The big tech companies form cross-license agreements, and smaller players are frozen out. This is a huge friction to free market innovation.

Now, I believe one legitimate point is that in a world without patents, companies might be more motivated to keep their innovations secret. One pro-patent argument is that it allows companies to patent their work and then publish without fear of losing it. Of course, that is a bit of a false promise, because the value to the public of getting a publication which describes a patented algorithm is dubious. (yes obviously there is some value because you get to read about it, but you don't actually get to use it!).

Finally it is absolutely offensive to me that researchers who receive public funding are patenting their works, or that university professors patent their works. If you receive any public funding, your work should be in the public domain (and it should not be published in the pay-for-access journals of the ACM or IEEE).

But this is an issue that goes beyond software. Are any patents good for society? Certainly medical patents have encouraged lots of expensive research in recent years, and this is often used as a pro-patent argument. But lots of those new expensive drugs have been shown to be no better than their cheap predecessors. Certainly patents provide a massive incentive for drug companies to push the new expensive monopolized product on you, which is a bad effect for society. Would you actually have significantly less useful research if there were no patents? Well, 30-40% of medical research is publicly funded, so that wouldn't go away, and without patents that publicly funded research would be much more efficient, because they could be open and not worry about infringement, and furthermore it would be more focused on research that provides tangible benefits as opposed to research that leads to profits. It's a complicated issue, but it's definitely not obvious that the existance patents actually improve the net quality of medical treatment.

In summary, I believe that patents do accomplish some good, but you have to weigh that against the gains you would get if you had no patents. I believe the good from no patents is greater than the good from patents.


In any case, hoping for patents to go away is probably a pipe dream.

Smaller goals are these :

1. I find it absolutely sick that public universities are patenting things. That needs to stop. Professors/researchers need to take the lead by refusing to patent their inventions.

Any corporation that receives the bullshit "R&D" tax break should be required to make all their patents public domain. Anyone that gets a DoD or NSF research grant should be required to make the results of their work public domain. How can you justify taking public money for R&D and then locking out the public from using the results?

2. Some rich charity dude should create the "public patent foundation" whose goal is to supports the freedom of ideas, and has the big money to fight bullshit patents in court. The PPF could also actively work to publish prior art and even in some cases to apply for patents, which would then be officially released into the public domain.

A more extreme idea would be to make the PPF "viral" like the GPL - build up a big war chest of patents, and then release them all for free use - but only to other people who release their own patents under the same license. All the PPF has to do is get a few important patents and it can force the opening of them all.

(deja vu , I just realized I wrote this before )

12/06/2010

12-06-10 - More Perceptual Notes

To fit an objective synthetic measure to surveyed human MOS scores, the VQEG uses the form :

A1 + A2 * x + A3 * logistic( A4 * x + A5 )

I find much better fits from the simpler form :

B1 + B2 * x ^ B3

Also, as I've mentioned previously, I think the acos of SSIM is a much more linear measure of quality (SSIM is like a Pearson correlation, which is like a dot product, so acos makes it like an angle). This is borne out by fitting to TID2008 - the acos of SSIM fits better to the human scores (though fancy fit functions can hide this - this is most obvious in a linear fit).

But what's more, the VQEG form does not preserve the value of "no distortion", which I think is important. One way to do that would be to inject a bunch of "zero distortion" data points into your training set, but a better way is to use a fit form that ensures it explicitly. In particular, if you remap the measured MOS and your objective score such that 0 = perfect and larger = more distorted, then you can use a fit form like :


C1 * x + C2 * x ^ C3

(C3 >= 0) , so that 0 maps to 0 absolutely, and you still get very good fits (in fact, better than the "B" form with arbitrary intercept). (note that polynomial fit (ax+bx^2+cx^3) is very bad, this is way better).

A lot of people are using Kendall or Spearman rank correlation coefficients to measure perceptual metrics (per VQEG recommendation), but I think that's a mistake. The reason is that ranking is not what we really care about. What we want is a synthetic measure which correctly identifies "this is a big difference" vs. "this is a small difference". If it gets the rank of similar differences a bit wrong, that doesn't really matter, but it does show up as a big difference in Kendall/Spearman. In particular, the *value* difference of scores is what we care about. eg. if it should have been a 6 and you guess 6.1 that's no big deal, but if you guess it's a 9 that's very bad. eg. if your set of MOS scores to match is { 3, 5.9, 6, 6.1, 9 } , then getting the middle 3 in the wrong order is near irrelevant, but the Kendall & Spearman have no accounting for that, they are just about rank order.

The advantage of rank scores of course is that you don't have to do the functional fitting stuff above, which does add an extra bias, because some objective scores might work very well with that particular functional fit, while others might not.

12/02/2010

12-02-10 - Perceptual Metric Rambles of the Day

There's an interesting way in which saliency (visual attention) fights masking (reduction of sensitivity due to local variance). People who weight image metrics by saliency give *more* importance to high contrast areas like edges. People who weight image metrics by masking give *less* importance to high contrast areas. Naively it seems like those would just cancel out.

It's subtle. For one thing, a smooth area is only not "salient" if it's smooth in both the original and the disorted image. If you do something like creating block artifacts, then those block edges are salient (high visual attention factor). Also, they work at sort of different scales. Saliency works on intermediate frequencies - the brain specifically cares about detail at a certain scale that detects edges and shapes, but not speckle/texture and not gross offsets. Contrast or variance masking applies mainly to the aditional frequencies in an area that has a strong medium-scale signal.

MS-SSIM actually has a lot of tweaked out perceptual hackiness. There's a lot of vision modelling implicit in it. They use a Gaussian window with a tweaked out size - this is a model of the spatial filter of the eye. The different scales have different importance contributions - this is a model of the CSF, variable sensitivity and different scales. The whole method of doing a Pearson-style correlation means that they are doing variance masking.

The issue of feeding back a perceptual delta into a non-perceptual compressor is an interesting one.

Say you have a compressor which does R-D optimization, but its D is just MSE. You can pretty easily extend that to do weighted R-D, with a weight either per-pixel or per-block. Initially all the weights are the same (1.0). How do you adjust the weights to optimize an external perceptual metric?

Assume your perceptual metric returns not just a number but an error map, either per pixel or per block.

First of all, what you should *not* do is to simply find the parts with large perceptual error and increase their weight. That would indeed cause the perceptual error at those spots to reduce - but we don't know if that is actually a good thing in an R-D sense. What that would do is even out the perceptual error, and evening it out is not necessarily the best R-D result. This is, however, an easy to get the minimax perceptual error!


Algorithm Minimax Perceptual Error :

1. Initial W_i = 1.0
2. Run weighted R-D non-perceptual compressor (with D = weighted MSE).
3. Find perceptual error map P_i
4. W_i += lambda * P_i
    (or maybe W_i *= e^(lambda * P_i))
5. goto 2

what about minimizing the total perceptual error?

What you need to do is transform the R/D(MSE) that the compressor uses into R/D(percep). To do that, you need to make W_i ~= D(percep)/D(mse).

If the MSE per block or pixel is M_i, the first idea is just to set W_i = ( P_i / M_i ) . This works well if P and M have a near-linear relationship *locally* (note they don't need to be near-linear *globally* , and the local linear relationship can vary from one location to another).

However, the issue is that the slope of P_i(R) may not be near linear to M_i(R) (as functions of rate) over its entire scale. Obviously for small enough steps, they are always linear.

So, you could run your compress *twice* to create two data points, so you have M1,M2 and P1,P2. Then what you want is :


P ~= P1 + ((P2 - P1)/(M2 - M1)) * (M - M1)

slope := (P2 - P1)/(M2 - M1)

P ~= slope * M + (P1 - slope * M1)

C := (P1 - slope * M1)

P ~= slope * M + C

I want J = R + lambda * P

use

J = R + lambda * (slope * M + C)
J = R + lambda * slope * M + lambda * C

so

W_i = slope_i

J = R + lambda * W * M

with the extra lambda * C term ; C is just a constant per block, so it can be ignored for R-D optimization on that block. If you care about the overall J of the image, then you need the sum of all C's for that, which is just :

C_tot = Sum_i (P1_i - slope_i * M1_i)

it's obviously just a bias to turn our fudged MSE-multiplied-by-slope into the proper scale of the perceptual error.

I think this is pretty strong, however it's obviously iterative, and you must keep the steps of the iteration small - this is only valid if you don't make very big changes in each step of the iteration.

There's also a practical difficulty that it can be hard to actually generate different M1 and M2 on some blocks. You want to generate both M1 and M2 very near the settings that you will be using for the final compress, because large changes could take you to vastly different perceptual areas. But, when you make small changes to R, some parts of the image may change, but other parts might not actually change at all. This gives you M1=M2 in some spots which is a divide by zero, which is annoying.

In any case I think something like this is an interesting generic way to adapt any old compressor to be perceptual.

ADDENDUM :

Note that this kind of "perceptual importance map" generation cannot help your coder make internal decisions based on perceptual optimization - it can only affect bit allocation. That is, it can help a coder with variable bit allocation (either through truncation or variable quantization) to optimize the bit allocation.

In particular, a coder like H264 has lots of flexibility; you can choose a macroblock mode, an inter block can choose a movec, an intra block can choose a prediction mode, you can do variable quantizers, and you can truncate coefficients. This kind of iterative approach will not help the mode decisions to pick modes which have less perceptual error, but it will help to change the bit allocation to give more bits to blocks where it will help the preceptual error the most.

12-02-10 - Orphaned Tech Ideas

If you have a really good idea, you have to just go for it right away. Sometimes I have an idea, but I'm busy with something else, so I just write it down, and then try to come back to it later, but when you come back the excitement is gone. You have to pounce on that moment of inspiration.

That said, there are many things I'd like to do right now but probably have to just set aside and never come back to :

DXTC for lightmaps (and other non-photographic data). In particular there should be a way to preserve smooth data with much higher precision.

DXTC optimized for perceptual metrics. Should be easy, I'll probably do this.

cblib::hash_table entry currently always has {hash, key, data} , should change that to a functor with calls { hash(), key(), data() } so that user can choose to store them or not. (in many cases you don't need seperate key and data because the key is the data).

csv util suite. csv -> google chart, csv lsqr fit, csv eval, etc. basically matlab in command line utils that work on csv's. Would let me do processing from batch files and make html logs of compression runs and such.

Texture synthesis from examples. Fast enough for realtime so it can be used with SVT. Should be progressive. I believe this is the right way to do large world texturing. Texture synthesis from procedural/perlin noise stuff is too hard for artists to use, and doing fancy synthesis offline and then storing it ala Id just strikes me as profoundly silly.

Really good JPEG decoder. Maximum-likelihood decompression. Nosratinia and POCS. You can basically remove all blocking/ringing artifacts. This seems to me like a very useful and important piece of software and I'm surprised it doesn't exist.

Luma aided chroma upsample. Could be part of "Really good JPEG decoder" ; also useful for improving video decompression quality.

Finish lossy image comparison test and release the exe which autogens nice HTML reports.

11/30/2010

11-30-10 - Tchebychev

A little utility, and example of how I use templates : (this is not meant to be efficient, it's for offline apps, not realtime use)

template < int n >
struct Tchebychev
{
    double operator () ( double x )
    {
        return 2.0 * x * Tchebychev < n-1>()(x) - Tchebychev < n-2>()(x);
    }
};

template < >
struct Tchebychev < 0 >
{
    double operator () ( double x )
    {
        return 1.0;
    }
};

template < >
struct Tchebychev < 1 >
{
    double operator () ( double x )
    {
        return x;
    }
};

double TchebychevN(int n,double x)
{
    switch(n) {
    case 0 : return Tchebychev<0>()(x); 
    case 1 : return Tchebychev<1>()(x); 
    case 2 : return Tchebychev<2>()(x); 
    case 3 : return Tchebychev<3>()(x); 
    case 4 : return Tchebychev<4>()(x); 
    case 5 : return Tchebychev<5>()(x); 
    case 6 : return Tchebychev<6>()(x); 
    case 7 : return Tchebychev<7>()(x); 
    case 8 : return Tchebychev<8>()(x); 
    case 9 : return Tchebychev<9>()(x); 
    NO_DEFAULT_CASE
    }
    return 0;
}

template < typename functor >
void FindTchebychevCoefficients(functor f,double lo = 0.0, double hi = 1.0, int N = (1<<20))
{
    double PI_over_N = PI/N;

    #define NUM_T   6

    double t[NUM_T] = { 0 };
    
    for(int k=0;k < N;k++)
    {
        double pk = PI_over_N * (k + 0.5);
        double x_k = cos(pk);
        double farg = lo + (hi - lo) * 0.5 * (x_k+1.0);
        double fval = f( farg );
        for(int j=0;j < NUM_T;j++)
        {
            t[j] += fval * cos( pk * j );
        }
    }
    for(int j=0;j < NUM_T;j++)
    {
        t[j] *= 1.0 / N;
        if ( j != 0 )
            t[j] *= 2.0;

        //lprintfvar(t[j]);
        lprintf("t %d: %16.9f , %16.8g\n",j,t[j],t[j]);
    }
    
    double errSqr[NUM_T] = { 0 };
    double errMax[NUM_T] = { 0 };
    for(int i=0;i < N;i++)
    {
        double xunit = (i+0.5)/N;
        double farg = lo + (hi - lo) * xunit;
        double xt = xunit*2.0 - 1.0;
        double y = f(farg);
        double p = 0.0;
        for(int j=0;j < NUM_T;j++)
        {
            p += t[j] * TchebychevN(j,xt);
            errSqr[j] += fsquare(y-p);
            errMax[j] = MAX( errMax[j] , fabs(y-p) );
        }
    }
    for(int j=0;j < NUM_T;j++)
    {
        lprintf("%d : errSqr = %g errMax = %g\n",j,errSqr[j]/N,errMax[j]);
    }
}

You can also very easily find inner products by using the Integrate<> template I posted earlier plus this simple product adaptor :


template < typename f1, typename f2 >
struct FunctorProduct
{
    double operator () ( double x )
    {
        return f1()(x) * f2()(x);
    }
};

(eg. you can trivially find Hermite or Legendre series this way by doing inner products).

Another handy helper is a range remapper :


template < typename functor >
struct RemapFmTo
{
    double m_fmLo; double m_fmHi;
    double m_toLo; double m_toHi;
            
    RemapFmTo( 
            double fmLo, double fmHi,
            double toLo, double toHi )
    : m_fmLo(fmLo), m_fmHi(fmHi), m_toLo(toLo), m_toHi(toHi)
    {
    }
    
    double operator () ( double x )
    {
        double t = (x - m_fmLo) / (m_fmHi - m_fmLo);
        double y = t * (m_toHi - m_toLo) + m_toLo;
        return functor()(y);
    }
};

template < typename functor >
struct RemapUnitTo
{
    double m_toLo; double m_toHi;
            
    RemapUnitTo( 
            double toLo, double toHi )
    :  m_toLo(toLo), m_toHi(toHi)
    {
    }
    
    double operator () ( double x )
    {
        double y = x * (m_toHi - m_toLo) + m_toLo;
        return functor()(y);
    }
};

Now we can trivially do what we did before to find the optimal approximation in a known range :


    struct SqrtOnePlusX
    {
        double operator () ( double x )
        {
            return sqrt( 1 + x );
        }
    };


    RemapUnitTo< SqrtOnePlusX > tf(-0.075f,0.075f);
    FindTchebychevCoefficients( tf );

and the output is :

t 0:      0.999647973 ,       0.99964797
t 1:      0.037519818 ,      0.037519818
t 2:     -0.000352182 ,   -0.00035218223
t 3:      0.000006612 ,   6.6121459e-006
t 4:     -0.000000155 ,  -1.5518238e-007
t 5:      0.000000004 ,   4.0790168e-009
0 : errSqr = 0.000469204 errMax = 0.0378787
1 : errSqr = 5.78832e-008 errMax = 0.000358952
2 : errSqr = 2.12381e-011 errMax = 6.77147e-006
3 : errSqr = 1.18518e-014 errMax = 1.59378e-007
4 : errSqr = 8.23726e-018 errMax = 4.19794e-009
5 : errSqr = 6.54835e-021 errMax = 1.19024e-010

I guess Remez minimax polynomials are better, but this is so easy and it gives you a good starting point, then you can just numerically optimize after this anyway.

ADDENDUM :

obviously the TchebychevN dispatcher sucks and is silly, but you don't actually use it in practice; you know which polynomials you want and you use something like :


double approx = 
    c0 * Tchebychev<0>(x) + 
    c1 * Tchebychev<1>(x) + 
    c2 * Tchebychev<2>(x) + 
    c3 * Tchebychev<3>(x);

which the compiler handles quite well.

I also did Legendre polynomials :


template < int n >
struct Legendre
{
    double operator () ( double x )
    {
        double L_n1 = Legendre < n-1 >()(x);
        double L_n2 = Legendre < n-2 >()(x);
        const double B = (n-1)/(double)n; // n-1/n
        return (1+B)*x*L_n1 - B*L_n2;
    }
};

template< > struct Legendre<0>
{
    double operator () ( double x ) { return 1.0; }
};

template< > struct Legendre<1>
{
    double operator () ( double x ) { return x; }
};


// but ryg's loop variants are mostly just better :

__forceinline double legendre_ryg(int n, double x)
{
  if (n == 0) return 1.0;
  if (n == 1) return x;

  double t = x, t1 = 1, t2;
  for (int i=1; i < n; i++) {
    t2 = t1; 
    t1 = t;
    const double B = (i)/(double)(i+1);
    t = (1+B)*x*t1 - B*t2;
  }
  return t;
}

__forceinline double chebyshev_ryg(int n, double x)
{
  if (n == 0) return 1.0;
  if (n == 1) return x;

  double t = x, t1 = 1, t2;
  for (int i=1; i < n; i++) {
    t2 = t1;
    t1 = t;
    t = 2.0 * x * t1 - t2;
  }
  return t;
}

// you can find Legendre coefficients like so :

    RemapUnitTo< Legendre<0> > ShiftedLegendre0(-1,1);

    double c0 = Integrate(0.0,1.0,steps,MakeFunctorProduct(functor(),ShiftedLegendre0));
    // (don't forget normalization)

// using new helper :

template < typename f1, typename f2 >
FunctorProduct< f1,f2 > MakeFunctorProduct( f1 _f1 , f2 _f2 ) { return FunctorProduct< f1,f2 >(_f1,_f2); }

11-30-10 - Reference

You can do weighted least squares using a normal least squares solver. To solve :

A x = b 

in the least squares sence (that is, minimize |Ax - b|^2) , with weights W , you just are minimizing the error :

E = Sum_i { W_i * ( Sum_j {A_ij * x_j} - b_i )^2 }

E = Sum_i { ( Sum_j { sqrt(W_i) * A_ij * x_j} - sqrt(W_i) * b_i )^2 }

so set

A'_ij = A_ij * sqrt(W_i)
b'_i = b_i * sqrt(W_i)

then

E = Sum_i { ( Sum_j { A'_ij * x_j } - b'_i )^2 }

so we can just do a normal lsqr solve for x using A' and b'.


If you have a Gaussian random event, probability of x is P(x), the (natural) bits to code an event is -ln(P(x)) =


P(x) = 1/(sigma * sqrt(2pi)) * e^( - 0.5 * ((x-c)/sigma)^2 )

H = -ln(P(x)) = ln(sigma * sqrt(2pi)) + 0.5 * ((x-c)/sigma)^2

H = const + ln(sigma) + 0.5 * ((x-c)/sigma)^2

And bits to code events is a good measure of how well events fit a model, which is useful for training a model to observed events. In particular if we ignore the constants and the ln term,

total H = Sum_ij { (1/sigma_i^1) * (x_j - c_i)^2 }

the bit cost measure is just L2 norm weighted by one over sigma^2 - which is intuitive, matching the prediction of very narrow Gaussians is much more important than matching the prediction of very wide ones.


Now that I have autoprintf, I find this very handy :


    #define lprintfvar(var) lprintf(STRINGIZE(var) " : ",var,"\n");

11/29/2010

11-29-10 - Useless hash test

Clocks to do the full hash computation and lookup on char * strings of english words :


hash functions :

 0 : FNV1A_Hash_Jesteress
 1 : FNV1A_HashStr_WHIZ
 2 : FNVHashStr
 3 : HashFNV1
 4 : MurmurHash2_Str
 5 : HashKernighanRitchie
 6 : HashBernstein
 7 : HashLarson
 8 : Hash17
 9 : Hash65599
10 : stb_hash
11 : HashWeinberger
12 : HashRamakrishna
13 : HashOneAtTime
14 : HashCRC

The hash functions are not given the string length, they must find it. For hashes that work on dwords the fast way to do that is using


#define DWORD_HAS_ZERO_BYTE(V)       (((V) - 0x01010101UL) & ~(V) & 0x80808080UL)

Hash functions are all inlined of course. The easy way to do that is to use macros to make functors for your hash functions and then pass them into a templated hash test function, something like this :


#define TEST_HT(hashfunc) \
  struct STRING_JOIN(htops_,hashfunc) : public hash_table_ops_charptr \
   { inline hash_type hash_key(const char_ptr & t) { return hashfunc( t ); } }; \
  typedef hash_table < char_ptr,Null,STRING_JOIN(htops_,hashfunc) > STRING_JOIN(ht_,hashfunc); \
  test_a_hash < STRING_JOIN(ht_,hashfunc)>(STRINGIZE(hashfunc),words,queries);

TEST_HT(FNV1A_HashStr_WHIZ);
TEST_HT(FNVHashStr);

This is using cblib:hash_table which is a reprobing hash table, with the hash values in the table, quadratic reprobing, and with special key values for empty & deleted (not a separate flag array). cblib:hash_table is significantly faster than khash or std::hash_map (though it takes more memory than khash and other similar hash tables due to storing the hash value). (BTW this was studied before extensively; see earlier blog posts; the four major deviations from khash are all small wins : 1. don't use separate flags because they cause an extra cache miss, 2. use cache-friendly quadratic reprobe, not rehashing, 3. use pow2 table size not prime, 4. store hash value).

The four tests in the chart above are :


blue:   StringHash_Test("M:\\CCC\\canterbury\\LCET10.txt","M:\\CCC\\canterbury\\PLRABN12.txt");
yellow: StringHash_Test("m:\\ccc\\book1","m:\\ccc\\book1");
red:    StringHash_Test("M:\\CCC\\words\\Sentence-list_00,032,359_English_The_Holy_Bible.txt","M:\\CCC\\words\\Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd");
green:  StringHash_Test("m:\\ccc\\large\\enwik8","m:\\ccc\\book2");

make hash from words in first arg
lookup words in second arg

(generally second file is much smaller than first, so we get some hits and some misses)

Anyway, the result is that the Whiz and Jesteress variants of FNV1 by Georgi 'Sanmayce' are in fact quite good in this kind of usage. They rely on having fast unaligned dword reads, so they are pretty much Intel-only, which makes them a bit pointless. But here they are :


U32 HashFNV1(const char *key) 
{
    U32 hash = 2166136261;
    while(*key)
        hash = (16777619 * hash) ^ (*key++);
    return hash;
}

inline uint32 FNV1A_HashStr_WHIZ(const char *str)
{
    const uint32 PRIME = 709607;//15607;
    uint32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw) )
            break;

        hash32 = (hash32 ^ dw) * PRIME;
        p += sizeof(uint32);
    }
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

U32 FNV1A_Hash_Jesteress(const char *str)
{
    const U32 PRIME = 709607;
    U32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw1 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw1) )
            break;
        
        p += 4;
        hash32 = hash32 ^ _lrotl(dw1,5);
        
        uint32 dw2 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw2) )
        {
            // finish dw1 without dw2
            hash32 *= PRIME;
            break;
        }
        
        p += 4;
            
        hash32 = (hash32 ^ dw2) * PRIME;        
    }
    
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

I have no doubt these could be massaged to be a bit faster through careful study of the asm (in particular the handling of the 0-3 byte tail doesn't look awesome). An even better thing would be to ensure all your strings are 4-byte aligned and that after the null they have null bytes to fill the final dword.

Anyway, you can pretty much ignore all this, because hash functions have to be tested in-situ (eg. on your exact data, on your hash table implementation, on your platform), but it is what it is.

BTW FNV1 is a lot faster than FNV1a. Also the mixing of Whiz and Jesteress are quite poor and they do have a lot more collisions on some values, but that appears to not matter that much on this particular kind of test.

11/20/2010

11-20-10 - Function approximation by iteration , Part 3 - Reciprocal Square Root

See part 1 or part 2 .

Now that we have our toolbox, we can attack a popular issue in 3d engines - the reciprocal square root (used primarily in normalizing vectors). We want to find y = 1 / sqrt(x) .

Let me first state that I'm not doing all this because it's actually important for optimization. If you care about speed all you have to do is set :


 /arch:SSE2 /fp:fast

and get on with your life. Worrying about this issue is a clear case of pointless micro-optimization. I'm doing this for the mathematical curiosity. Nevertheless, lots of people continue to worry about this :

Some Assembly Required - Timing Square Root
ompf.org � View topic - SSE Patterns
Ken Turkowski's Computer Graphics Papers
Inverse Square Root ECE 250 University of Waterloo
HLNUM.ORG Documentation RSqrt
hlnum frsqrt
High order algorithms to find square roots and inverses
computation of square root & reciprocal calculation

To do reciprocal square root we're going to start with y0 from the IEEE floating point trick. In games this is commonly know as "Carmack's trick from Quake 3" , but it goes back to Greg Walsh apparently :

Beyond3d part 1
Beyond3d part 2

Also on the Slashdot page about that article Paul Hsieh mentions that apparently Kahan knew of this idea, but doesn't know the time frame exactly.

BTW Chris Lomont has the best paper about this IEEE trick for recipsqrt, but the Wikipedia page is a good start

So the initial approximation is :


const int magic = 0x5f375a86;

inline float rsqrtf_walsh( float x )
{
    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    return u.f;
}

(more on the magic number later, but see Lomont for an explination of where it comes from). (BTW I don't mean to imply that it's "magic" by using that name; it's a standard IEEE bit trick).

This has rather large error so we are going to use this as y0 and find an iteration for y1.

We can now use our dimensional analysis method. Our unitless number must be


u = x * y0*y0

and our iteration must be :

y1 = y0 * f(u)

and we'll look at a few possible forms for f :

f1 = A + B * u

f2 = A + B / u

f3 = A + B * (1-u) + C * (1-u)^2

we can confirm that these forms are sane by finding them in other ways.

If we set f(y) = 1/y^2 - x , then the root f(y) = 0 occurs where y = 1/sqrt(x) , so we can use Newton's method to find the roots of f(y). The result is :


y1 = y0 * ( (3/2) - 1/2 x y0^2 )

or :

f1 , with A = 3/2 , B = - 1/2

If we set f(y) = y^2 - 1/x , the root is in the same place and we can use Newton's method and we get :

y1 = 0.5 * ( y0 + 1 / (x*y0) )

or :

f2 , with A = B = 1/2

If we use the first f(y) form and do a "Halley" or "Householder" iteration instead of a Newton iteration, that's form f3 with A = 1 , B = 1/2, C = 3/8.


(maximum % relative error) :

y0 : (no iteration) : 3.436526

y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = - 1/2)

y0 + two Newtons    : 0.000470

y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)

y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2, C = 3/8)

obviously these methods do not have equal computational complexity. In particular, the Babylonian method uses a divide, so it's much slower than the other methods. A serious study of function approximation should compare the quality to the CPU cost to make sure the method is at a good place on the speed/quality curve.

But of course we know the punch line is coming : we don't need to use those values for ABC. Because the initial estimate y0 has a known error range, we can optimize ABC over that range. We won't bother with doing it analytically, we'll just numerically search. The result is :


maximum percent relative error :

y0 : (no iteration) : 3.436526

y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = - 1/2)
method f1 optimal   : 0.087665  ( A = 1.501338 , B = -0.500461 )

y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
method f2 optimal   : 0.030292  ( A = B = 0.499848 )

y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2, C = 3/8)
method f3 optimal   : 0.002523  ( A = 1.0 , B = 0.5011, C = 0.375608 )

by optimizing the coefficients we get roughly a 2X reduction of maximum error *for free* ! (4X for Halley - because it's quadratic; two newton steps would also get a 4X improvement). It's no more CPU work because a float constant multiply is a float constant multiply. We're just using better constants. As noted in the last post - these constants are no longer an iteration that you can repeat over and over, they are optimized for the first step only.

Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no. (* see addendum)


In Quake3 Carmack used 0x5f3759df ; Carmack + 1 Newton has a max relative error of 0.175214 %

In his paper Lomont finds 0x5f375a86 is slightly better ; Lomont + 1 Newton gives 0.175129 %

but that level of tweakage is pretty irrelevant because they missed the big tweak, which is using an optimal iteration instead of a newton iteration :

inline float rsqrtf_cb_method1( float x )
{
    const int magic = 0x5F37592F;
    const float A = 1.501338f;
    const float B = -0.500461f;

    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    float y0 = u.f;

    float Bx = B * x;
    
    return y0 * ( A + Bx * y0 * y0 );
}

rsqrtf_cb_method1 has a relative error of 0.087714 %

the magic number I happened to use is different than Lomont or Carmack, but it's actually irrelevant - you can use anything in 0x5f375000 - 0x5f376000 and you will get the same final error (but you will get slightly different values for A and B). The point is that the bias from the magic number can be balanced by optimizing A or B very effectively. So you don't have to worry about it - just get the magic number vaguely close. (* see addendum)

To demonstrate that this iteration should not be repeated, let's change :


    return y0 * ( A + Bx * y0 * y0 );

to

    float y1 = y0 * ( A + Bx * y0 * y0 );
    return y1 * ( A2 + B2 * x * y1 * y1 );

now the errors for two steps are :

first step is cb_method1             : 0.087714 %

repeat same step ( A2 = A , B2 = B ) : 0.087716 %  (!!)

second step Newton (A2= 3/2, B2=-1/2): 0.000127 %

second step optimal                  : 0.000072 %  (A2 = 1.500000596f , B2 = -0.500000060f)

You should see obviously that the closer you get to the answer, the less optimizing helps, because the Newton iteration (which is based on Taylor series) becomes optimal as epsilon goes to zero.

Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless because you have :


inline float rsqrtf_sse( float x )
{
    return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
}

with max relative error : 0.032585%

One Newton step after this takes you to 0.000029% error ; optimizing the iteration only gets you to 0.000026% so it's very pointless.

ADDENDUM :

Jan Kadlec (rrrola) posted a comment pointing to his prior work on the topic . He finds a better initial magic number and then followup floats by doing a more rigorous search. What I wrote at (*) about the magic number not mattering wasn't quite right. I was doing greedy searches in the optimization, and got stuck in a local minimum. You have to have quite a good non-greedy searcher to find his solution. The exact value of it doesn't matter much, but there is this much better solution if you make a big jump to another neighborhood :

Copy-pasting rrrola's version for reference , with 0.0652 % max relative error.


float inv_sqrt_rrrola(float x)
{
  union { float f; uint32 u; } y = {x};
  y.u = 0x5F1FFFF9ul - (y.u >> 1);
  return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}

Also, something you can perhaps help me with : Pizer had done even earlier work on this topic and roughly found the right (right = Kadlec) constant by finding the magic number that minimizes (1 + e_max)/(1 + e_min). Doing that he gets 0x5F1F1412.

If we say ratio = approx/correct , then what Pizer is minimizing is ratio_max/ratio_min , which is the same as minimizing log(ratio_max) - log(ratio_min) . I can't reproduce Pizer's result and I'm not sure why that would be the thing you want to minimize. Intuitively it does make sense that what you care about is really *log* relative error, not relative error. That is, going +100% is clearly better than going -100% in a relative sense (this is sort of like the fact that doing -100% in the stock market is much worse than doing +100% is good, as I wrote long ago, we should really talk about log-return).

I guess one way to see it is if you think about e as a floating point number; what we really want is for the error of the approximation to show up as far to the right of the decimal as possible. The number of binary digits to the right of the decimal is the log of the error. But that would just mean minimizing log( MAX(e_max,-e_min) ) so something is missing in my understanding of this bit.

old rants