cbloom rants: 11/2010

11/30/2010

11-30-10 - Tchebychev

A little utility, and example of how I use templates : (this is not meant to be efficient, it's for offline apps, not realtime use)


template < int n >
struct Tchebychev
{
    double operator () ( double x )
    {
        return 2.0 * x * Tchebychev < n-1>()(x) - Tchebychev < n-2>()(x);
    }
};

template < >
struct Tchebychev < 0 >
{
    double operator () ( double x )
    {
        return 1.0;
    }
};

template < >
struct Tchebychev < 1 >
{
    double operator () ( double x )
    {
        return x;
    }
};

double TchebychevN(int n,double x)
{
    switch(n) {
    case 0 : return Tchebychev<0>()(x); 
    case 1 : return Tchebychev<1>()(x); 
    case 2 : return Tchebychev<2>()(x); 
    case 3 : return Tchebychev<3>()(x); 
    case 4 : return Tchebychev<4>()(x); 
    case 5 : return Tchebychev<5>()(x); 
    case 6 : return Tchebychev<6>()(x); 
    case 7 : return Tchebychev<7>()(x); 
    case 8 : return Tchebychev<8>()(x); 
    case 9 : return Tchebychev<9>()(x); 
    NO_DEFAULT_CASE
    }
    return 0;
}

template < typename functor >
void FindTchebychevCoefficients(functor f,double lo = 0.0, double hi = 1.0, int N = (1<<20))
{
    double PI_over_N = PI/N;

    #define NUM_T   6

    double t[NUM_T] = { 0 };
    
    for(int k=0;k < N;k++)
    {
        double pk = PI_over_N * (k + 0.5);
        double x_k = cos(pk);
        double farg = lo + (hi - lo) * 0.5 * (x_k+1.0);
        double fval = f( farg );
        for(int j=0;j < NUM_T;j++)
        {
            t[j] += fval * cos( pk * j );
        }
    }
    for(int j=0;j < NUM_T;j++)
    {
        t[j] *= 1.0 / N;
        if ( j != 0 )
            t[j] *= 2.0;

        //lprintfvar(t[j]);
        lprintf("t %d: %16.9f , %16.8g\n",j,t[j],t[j]);
    }
    
    double errSqr[NUM_T] = { 0 };
    double errMax[NUM_T] = { 0 };
    for(int i=0;i < N;i++)
    {
        double xunit = (i+0.5)/N;
        double farg = lo + (hi - lo) * xunit;
        double xt = xunit*2.0 - 1.0;
        double y = f(farg);
        double p = 0.0;
        for(int j=0;j < NUM_T;j++)
        {
            p += t[j] * TchebychevN(j,xt);
            errSqr[j] += fsquare(y-p);
            errMax[j] = MAX( errMax[j] , fabs(y-p) );
        }
    }
    for(int j=0;j < NUM_T;j++)
    {
        lprintf("%d : errSqr = %g errMax = %g\n",j,errSqr[j]/N,errMax[j]);
    }
}

You can also very easily find inner products by using the Integrate<> template I posted earlier plus this simple product adaptor :


template < typename f1, typename f2 >
struct FunctorProduct
{
    double operator () ( double x )
    {
        return f1()(x) * f2()(x);
    }
};

(eg. you can trivially find Hermite or Legendre series this way by doing inner products).

Another handy helper is a range remapper :


template < typename functor >
struct RemapFmTo
{
    double m_fmLo; double m_fmHi;
    double m_toLo; double m_toHi;
            
    RemapFmTo( 
            double fmLo, double fmHi,
            double toLo, double toHi )
    : m_fmLo(fmLo), m_fmHi(fmHi), m_toLo(toLo), m_toHi(toHi)
    {
    }
    
    double operator () ( double x )
    {
        double t = (x - m_fmLo) / (m_fmHi - m_fmLo);
        double y = t * (m_toHi - m_toLo) + m_toLo;
        return functor()(y);
    }
};

template < typename functor >
struct RemapUnitTo
{
    double m_toLo; double m_toHi;
            
    RemapUnitTo( 
            double toLo, double toHi )
    :  m_toLo(toLo), m_toHi(toHi)
    {
    }
    
    double operator () ( double x )
    {
        double y = x * (m_toHi - m_toLo) + m_toLo;
        return functor()(y);
    }
};

Now we can trivially do what we did before to find the optimal approximation in a known range :


    struct SqrtOnePlusX
    {
        double operator () ( double x )
        {
            return sqrt( 1 + x );
        }
    };


    RemapUnitTo< SqrtOnePlusX > tf(-0.075f,0.075f);
    FindTchebychevCoefficients( tf );

and the output is :


t 0:      0.999647973 ,       0.99964797
t 1:      0.037519818 ,      0.037519818
t 2:     -0.000352182 ,   -0.00035218223
t 3:      0.000006612 ,   6.6121459e-006
t 4:     -0.000000155 ,  -1.5518238e-007
t 5:      0.000000004 ,   4.0790168e-009
0 : errSqr = 0.000469204 errMax = 0.0378787
1 : errSqr = 5.78832e-008 errMax = 0.000358952
2 : errSqr = 2.12381e-011 errMax = 6.77147e-006
3 : errSqr = 1.18518e-014 errMax = 1.59378e-007
4 : errSqr = 8.23726e-018 errMax = 4.19794e-009
5 : errSqr = 6.54835e-021 errMax = 1.19024e-010

I guess Remez minimax polynomials are better, but this is so easy and it gives you a good starting point, then you can just numerically optimize after this anyway.

ADDENDUM :

obviously the TchebychevN dispatcher sucks and is silly, but you don't actually use it in practice; you know which polynomials you want and you use something like :


double approx = 
    c0 * Tchebychev<0>(x) + 
    c1 * Tchebychev<1>(x) + 
    c2 * Tchebychev<2>(x) + 
    c3 * Tchebychev<3>(x);

which the compiler handles quite well.

I also did Legendre polynomials :


template < int n >
struct Legendre
{
    double operator () ( double x )
    {
        double L_n1 = Legendre < n-1 >()(x);
        double L_n2 = Legendre < n-2 >()(x);
        const double B = (n-1)/(double)n; // n-1/n
        return (1+B)*x*L_n1 - B*L_n2;
    }
};

template< > struct Legendre<0>
{
    double operator () ( double x ) { return 1.0; }
};

template< > struct Legendre<1>
{
    double operator () ( double x ) { return x; }
};


// but ryg's loop variants are mostly just better :

__forceinline double legendre_ryg(int n, double x)
{
  if (n == 0) return 1.0;
  if (n == 1) return x;

  double t = x, t1 = 1, t2;
  for (int i=1; i < n; i++) {
    t2 = t1; 
    t1 = t;
    const double B = (i)/(double)(i+1);
    t = (1+B)*x*t1 - B*t2;
  }
  return t;
}

__forceinline double chebyshev_ryg(int n, double x)
{
  if (n == 0) return 1.0;
  if (n == 1) return x;

  double t = x, t1 = 1, t2;
  for (int i=1; i < n; i++) {
    t2 = t1;
    t1 = t;
    t = 2.0 * x * t1 - t2;
  }
  return t;
}

// you can find Legendre coefficients like so :

    RemapUnitTo< Legendre<0> > ShiftedLegendre0(-1,1);

    double c0 = Integrate(0.0,1.0,steps,MakeFunctorProduct(functor(),ShiftedLegendre0));
    // (don't forget normalization)

// using new helper :

template < typename f1, typename f2 >
FunctorProduct< f1,f2 > MakeFunctorProduct( f1 _f1 , f2 _f2 ) { return FunctorProduct< f1,f2 >(_f1,_f2); }

11-30-10 - Reference

You can do weighted least squares using a normal least squares solver. To solve :


A x = b

in the least squares sence (that is, minimize |Ax - b|^2) , with weights W , you just are minimizing the error :


E = Sum_i { W_i * ( Sum_j {A_ij * x_j} - b_i )^2 }

E = Sum_i { ( Sum_j { sqrt(W_i) * A_ij * x_j} - sqrt(W_i) * b_i )^2 }

so set

A'_ij = A_ij * sqrt(W_i)
b'_i = b_i * sqrt(W_i)

then

E = Sum_i { ( Sum_j { A'_ij * x_j } - b'_i )^2 }

so we can just do a normal lsqr solve for x using A' and b'.

If you have a Gaussian random event, probability of x is P(x), the (natural) bits to code an event is -ln(P(x)) =


P(x) = 1/(sigma * sqrt(2pi)) * e^( - 0.5 * ((x-c)/sigma)^2 )

H = -ln(P(x)) = ln(sigma * sqrt(2pi)) + 0.5 * ((x-c)/sigma)^2

H = const + ln(sigma) + 0.5 * ((x-c)/sigma)^2

And bits to code events is a good measure of how well events fit a model, which is useful for training a model to observed events. In particular if we ignore the constants and the ln term,


total H = Sum_ij { (1/sigma_i^1) * (x_j - c_i)^2 }

the bit cost measure is just L2 norm weighted by one over sigma^2 - which is intuitive, matching the prediction of very narrow Gaussians is much more important than matching the prediction of very wide ones.

Now that I have autoprintf, I find this very handy :


    #define lprintfvar(var) lprintf(STRINGIZE(var) " : ",var,"\n");

11/29/2010

11-29-10 - Useless hash test

Clocks to do the full hash computation and lookup on char * strings of english words :


hash functions :

 0 : FNV1A_Hash_Jesteress
 1 : FNV1A_HashStr_WHIZ
 2 : FNVHashStr
 3 : HashFNV1
 4 : MurmurHash2_Str
 5 : HashKernighanRitchie
 6 : HashBernstein
 7 : HashLarson
 8 : Hash17
 9 : Hash65599
10 : stb_hash
11 : HashWeinberger
12 : HashRamakrishna
13 : HashOneAtTime
14 : HashCRC

The hash functions are not given the string length, they must find it. For hashes that work on dwords the fast way to do that is using


#define DWORD_HAS_ZERO_BYTE(V)       (((V) - 0x01010101UL) & ~(V) & 0x80808080UL)

Hash functions are all inlined of course. The easy way to do that is to use macros to make functors for your hash functions and then pass them into a templated hash test function, something like this :


#define TEST_HT(hashfunc) \
  struct STRING_JOIN(htops_,hashfunc) : public hash_table_ops_charptr \
   { inline hash_type hash_key(const char_ptr & t) { return hashfunc( t ); } }; \
  typedef hash_table < char_ptr,Null,STRING_JOIN(htops_,hashfunc) > STRING_JOIN(ht_,hashfunc); \
  test_a_hash < STRING_JOIN(ht_,hashfunc)>(STRINGIZE(hashfunc),words,queries);

TEST_HT(FNV1A_HashStr_WHIZ);
TEST_HT(FNVHashStr);

This is using cblib:hash_table which is a reprobing hash table, with the hash values in the table, quadratic reprobing, and with special key values for empty & deleted (not a separate flag array). cblib:hash_table is significantly faster than khash or std::hash_map (though it takes more memory than khash and other similar hash tables due to storing the hash value). (BTW this was studied before extensively; see earlier blog posts; the four major deviations from khash are all small wins : 1. don't use separate flags because they cause an extra cache miss, 2. use cache-friendly quadratic reprobe, not rehashing, 3. use pow2 table size not prime, 4. store hash value).

The four tests in the chart above are :


blue:   StringHash_Test("M:\\CCC\\canterbury\\LCET10.txt","M:\\CCC\\canterbury\\PLRABN12.txt");
yellow: StringHash_Test("m:\\ccc\\book1","m:\\ccc\\book1");
red:    StringHash_Test("M:\\CCC\\words\\Sentence-list_00,032,359_English_The_Holy_Bible.txt","M:\\CCC\\words\\Word-list_00,038,936_English_The Oxford Thesaurus, An A-Z Dictionary of Synonyms.wrd");
green:  StringHash_Test("m:\\ccc\\large\\enwik8","m:\\ccc\\book2");

make hash from words in first arg
lookup words in second arg

(generally second file is much smaller than first, so we get some hits and some misses)

Anyway, the result is that the Whiz and Jesteress variants of FNV1 by Georgi 'Sanmayce' are in fact quite good in this kind of usage. They rely on having fast unaligned dword reads, so they are pretty much Intel-only, which makes them a bit pointless. But here they are :


U32 HashFNV1(const char *key) 
{
    U32 hash = 2166136261;
    while(*key)
        hash = (16777619 * hash) ^ (*key++);
    return hash;
}

inline uint32 FNV1A_HashStr_WHIZ(const char *str)
{
    const uint32 PRIME = 709607;//15607;
    uint32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw) )
            break;

        hash32 = (hash32 ^ dw) * PRIME;
        p += sizeof(uint32);
    }
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

U32 FNV1A_Hash_Jesteress(const char *str)
{
    const U32 PRIME = 709607;
    U32 hash32 = 2166136261;
    const char *p = str;

    for(;;)
    {
        uint32 dw1 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw1) )
            break;
        
        p += 4;
        hash32 = hash32 ^ _lrotl(dw1,5);
        
        uint32 dw2 = *(uint32 *)p;
        if ( DWORD_HAS_ZERO_BYTE(dw2) )
        {
            // finish dw1 without dw2
            hash32 *= PRIME;
            break;
        }
        
        p += 4;
            
        hash32 = (hash32 ^ dw2) * PRIME;        
    }
    
    while( *p )
    {
        hash32 = (hash32 ^ *p) * PRIME;
        p++;
    }
    
    return hash32;
}

I have no doubt these could be massaged to be a bit faster through careful study of the asm (in particular the handling of the 0-3 byte tail doesn't look awesome). An even better thing would be to ensure all your strings are 4-byte aligned and that after the null they have null bytes to fill the final dword.

Anyway, you can pretty much ignore all this, because hash functions have to be tested in-situ (eg. on your exact data, on your hash table implementation, on your platform), but it is what it is.

BTW FNV1 is a lot faster than FNV1a. Also the mixing of Whiz and Jesteress are quite poor and they do have a lot more collisions on some values, but that appears to not matter that much on this particular kind of test.

11/20/2010

11-20-10 - Function approximation by iteration , Part 3 - Reciprocal Square Root

See part 1 or part 2 .

Now that we have our toolbox, we can attack a popular issue in 3d engines - the reciprocal square root (used primarily in normalizing vectors). We want to find y = 1 / sqrt(x) .

Let me first state that I'm not doing all this because it's actually important for optimization. If you care about speed all you have to do is set :


 /arch:SSE2 /fp:fast

and get on with your life. Worrying about this issue is a clear case of pointless micro-optimization. I'm doing this for the mathematical curiosity. Nevertheless, lots of people continue to worry about this :

Some Assembly Required - Timing Square Root
ompf.org � View topic - SSE Patterns
Ken Turkowski's Computer Graphics Papers
Inverse Square Root ECE 250 University of Waterloo
HLNUM.ORG Documentation RSqrt
hlnum frsqrt
High order algorithms to find square roots and inverses
computation of square root & reciprocal calculation

To do reciprocal square root we're going to start with y0 from the IEEE floating point trick. In games this is commonly know as "Carmack's trick from Quake 3" , but it goes back to Greg Walsh apparently :

Beyond3d part 1
Beyond3d part 2

Also on the Slashdot page about that article Paul Hsieh mentions that apparently Kahan knew of this idea, but doesn't know the time frame exactly.

BTW Chris Lomont has the best paper about this IEEE trick for recipsqrt, but the Wikipedia page is a good start

So the initial approximation is :


const int magic = 0x5f375a86;

inline float rsqrtf_walsh( float x )
{
    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    return u.f;
}

(more on the magic number later, but see Lomont for an explination of where it comes from). (BTW I don't mean to imply that it's "magic" by using that name; it's a standard IEEE bit trick).

This has rather large error so we are going to use this as y0 and find an iteration for y1.

We can now use our dimensional analysis method. Our unitless number must be


u = x * y0*y0

and our iteration must be :


y1 = y0 * f(u)

and we'll look at a few possible forms for f :


f1 = A + B * u

f2 = A + B / u

f3 = A + B * (1-u) + C * (1-u)^2

we can confirm that these forms are sane by finding them in other ways.

If we set f(y) = 1/y^2 - x , then the root f(y) = 0 occurs where y = 1/sqrt(x) , so we can use Newton's method to find the roots of f(y). The result is :


y1 = y0 * ( (3/2) - 1/2 x y0^2 )

or :

f1 , with A = 3/2 , B = - 1/2

If we set f(y) = y^2 - 1/x , the root is in the same place and we can use Newton's method and we get :


y1 = 0.5 * ( y0 + 1 / (x*y0) )

or :

f2 , with A = B = 1/2

If we use the first f(y) form and do a "Halley" or "Householder" iteration instead of a Newton iteration, that's form f3 with A = 1 , B = 1/2, C = 3/8.


(maximum % relative error) :

y0 : (no iteration) : 3.436526

y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = - 1/2)

y0 + two Newtons    : 0.000470

y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)

y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2, C = 3/8)

obviously these methods do not have equal computational complexity. In particular, the Babylonian method uses a divide, so it's much slower than the other methods. A serious study of function approximation should compare the quality to the CPU cost to make sure the method is at a good place on the speed/quality curve.

But of course we know the punch line is coming : we don't need to use those values for ABC. Because the initial estimate y0 has a known error range, we can optimize ABC over that range. We won't bother with doing it analytically, we'll just numerically search. The result is :


maximum percent relative error :

y0 : (no iteration) : 3.436526

y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = - 1/2)
method f1 optimal   : 0.087665  ( A = 1.501338 , B = -0.500461 )

y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
method f2 optimal   : 0.030292  ( A = B = 0.499848 )

y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2, C = 3/8)
method f3 optimal   : 0.002523  ( A = 1.0 , B = 0.5011, C = 0.375608 )

by optimizing the coefficients we get roughly a 2X reduction of maximum error *for free* ! (4X for Halley - because it's quadratic; two newton steps would also get a 4X improvement). It's no more CPU work because a float constant multiply is a float constant multiply. We're just using better constants. As noted in the last post - these constants are no longer an iteration that you can repeat over and over, they are optimized for the first step only.

Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no. (* see addendum)


In Quake3 Carmack used 0x5f3759df ; Carmack + 1 Newton has a max relative error of 0.175214 %

In his paper Lomont finds 0x5f375a86 is slightly better ; Lomont + 1 Newton gives 0.175129 %

but that level of tweakage is pretty irrelevant because they missed the big tweak, which is using an optimal iteration instead of a newton iteration :


inline float rsqrtf_cb_method1( float x )
{
    const int magic = 0x5F37592F;
    const float A = 1.501338f;
    const float B = -0.500461f;

    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    float y0 = u.f;

    float Bx = B * x;
    
    return y0 * ( A + Bx * y0 * y0 );
}

rsqrtf_cb_method1 has a relative error of 0.087714 %

the magic number I happened to use is different than Lomont or Carmack, but it's actually irrelevant - you can use anything in 0x5f375000 - 0x5f376000 and you will get the same final error (but you will get slightly different values for A and B). The point is that the bias from the magic number can be balanced by optimizing A or B very effectively. So you don't have to worry about it - just get the magic number vaguely close. (* see addendum)

To demonstrate that this iteration should not be repeated, let's change :


    return y0 * ( A + Bx * y0 * y0 );

to

    float y1 = y0 * ( A + Bx * y0 * y0 );
    return y1 * ( A2 + B2 * x * y1 * y1 );

now the errors for two steps are :


first step is cb_method1             : 0.087714 %

repeat same step ( A2 = A , B2 = B ) : 0.087716 %  (!!)

second step Newton (A2= 3/2, B2=-1/2): 0.000127 %

second step optimal                  : 0.000072 %  (A2 = 1.500000596f , B2 = -0.500000060f)

You should see obviously that the closer you get to the answer, the less optimizing helps, because the Newton iteration (which is based on Taylor series) becomes optimal as epsilon goes to zero.

Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless because you have :


inline float rsqrtf_sse( float x )
{
    return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
}

with max relative error : 0.032585%

One Newton step after this takes you to 0.000029% error ; optimizing the iteration only gets you to 0.000026% so it's very pointless.

ADDENDUM :

Jan Kadlec (rrrola) posted a comment pointing to his prior work on the topic . He finds a better initial magic number and then followup floats by doing a more rigorous search. What I wrote at (*) about the magic number not mattering wasn't quite right. I was doing greedy searches in the optimization, and got stuck in a local minimum. You have to have quite a good non-greedy searcher to find his solution. The exact value of it doesn't matter much, but there is this much better solution if you make a big jump to another neighborhood :

Copy-pasting rrrola's version for reference , with 0.0652 % max relative error.


float inv_sqrt_rrrola(float x)
{
  union { float f; uint32 u; } y = {x};
  y.u = 0x5F1FFFF9ul - (y.u >> 1);
  return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}

Also, something you can perhaps help me with : Pizer had done even earlier work on this topic and roughly found the right (right = Kadlec) constant by finding the magic number that minimizes (1 + e_max)/(1 + e_min). Doing that he gets 0x5F1F1412.

If we say ratio = approx/correct , then what Pizer is minimizing is ratio_max/ratio_min , which is the same as minimizing log(ratio_max) - log(ratio_min) . I can't reproduce Pizer's result and I'm not sure why that would be the thing you want to minimize. Intuitively it does make sense that what you care about is really *log* relative error, not relative error. That is, going +100% is clearly better than going -100% in a relative sense (this is sort of like the fact that doing -100% in the stock market is much worse than doing +100% is good, as I wrote long ago, we should really talk about log-return).

I guess one way to see it is if you think about e as a floating point number; what we really want is for the error of the approximation to show up as far to the right of the decimal as possible. The number of binary digits to the right of the decimal is the log of the error. But that would just mean minimizing log( MAX(e_max,-e_min) ) so something is missing in my understanding of this bit.

11-20-10 - Function approximation by iteration , Part 2

See part 1 .

Let's get to an example. We want y = sqrt(x) , but without actually doing a square root.

You can get an iteration for this in various ad-hoc ways. Let's see one real quick :


y = sqrt(x)

square it

y^2 = x

divide by y

y = x/y

add y

2y = y + x/y

y = 0.5*(y + x/y)

y_1 = 0.5*(y_0 + x/y_0)

and that's an iteration for sqrt. BTW in all these notes I'll be writing y_0 and y_1 as two steps, but I really mean y_n and y_n+1.

You can get this in another funny way :


y = sqrt(x)

y = sqrt( y * (x/y) )

y = geometric_mean( y , (x/y) )

y_1 = arithmetic_mean( y_0, (x/y_0) )

and you can see that it converges because if y is bigger than sqrt(x), then x/y is smaller than sqrt(x), so the mean must be closer to the answer.

But these are sort of ad-hoc weird ways of deriving the iteration. Let's do it in a more formal way which will be more useful in the general case. (BTW this is called the "Babylonian method for sqrt" and you can also get it trivially from Newton's method).

Say we have y_0 which is an approximation of sqrt(x) that we got some other way. It's close to the real answer y = sqrt(x) , so we write :


y = sqrt(x)

y^2 = y_0^2 * (1 + e)

or

y_0 = y / sqrt(1 + e)

e = "epsilon" is small

y_0^2 * (1 + e) = y^2 = x

1+e = x/y_0^2

e = x/y_0^2 - 1

the choice of how I've factored e out from (y/y_0) is pretty arbitrary, there are lots of choices about how you do that and all are valid. In this case the important thing is that e multiplied on y_0, not added (eg. I did not choose y^2 = y_0^2 + e).

So far this has all been exact. Now we can write :


y = sqrt(x) = sqrt( y_0^2 * (1 + e) ) = y_0 * sqrt( 1 + e )

exactly :

y = y_0 * sqrt( 1 + e)

approximately :

y ~= y_0 * approx( sqrt( 1 + e) )

so this is our iteration :

y_1 =  y_0 * approx( sqrt( 1 + e) )

The principle step in these kind of iterations is to take a twiddle-equals for y (an approximation of y) and make than an exact equals (assignment) for an iteration of y_n.

Now the only issue is how we approximate sqrt( 1 + e). First let's do it with a Taylor expansion :


approx( sqrt( 1 + e) ) = 1 + 0.5 * e + O(e^2)

y_1 = y_0 * ( 1 + 0.5 * e )

y_1 = y_0 * ( 1 + 0.5 * (x/y_0^2 - 1) )

y_1 = y_0 * ( 0.5 + 0.5 * x/y_0^2 )

y_1 = 0.5 * ( y_0 + x / y_0 )

which is the Babylonian method.

But as we discussed last time, we can do better. In particular, if we know the error bound of y_0, then we know the range of e, and we can use a better approximation of sqrt( 1 + e). Specifically if we use the IEEE float-int kind of method for finding y_0 :


float y_0(float x)
{
    union { int i; float f; } u;
    u.f = x;
    const int ieee_sqrt_bias = (1<<29) - (1<<22) - 301120;
    u.i = ieee_sqrt_bias + (u.i >> 1);
    return u.f;
}

Then the maximum relative error of y_0 is 3.53% That means (1 - 0.0353) < |y_0/y| < (1 + 0.0353)


(1 - 0.0353) < 1/sqrt( 1 + e) < (1 + 0.0353)
(1 - 0.0353)^2 < 1/( 1 + e) < (1 + 0.0353)^2
(1 - 0.0353)^-2 > ( 1 + e) > (1 + 0.0353)^-2
(1 - 0.0353)^-2 -1 > e > (1 + 0.0353)^-2 -1
0.07452232 > e > -0.06703023

we'll expand the range and just say |e| <= 0.075 ; so instead of just using a Taylor expansion that assumes e is small, we can minimize the maximum error over that range.

We do that by expanding in the basis {1} and {x} (if you like this is the orthogonal Legendre or Tchebychev basis, but that's kind of silly to say when it's two functions). To find the optimal coefficient you just do the integral dot product with sqrt(1+e) over the appropriate range. The result is :


sqrt(1+e) ~= 0.999765  + 0.037516 * ( e/0.075 )

sqrt(1+e) ~= 0.999765  + 0.5002133 * e

y_1 = y_0 * ( 0.999765  + 0.5002133 * e )

y_1 = y_0 * ( 0.999765  + 0.5002133 * (x/y_0^2 - 1) )

y_1 = y_0 * ( 0.4995517  + 0.5002133 * x/y_0^2 )

y_1 = 0.4995517 * y_0 + 0.5002133 * x/y_0

We'll call this the "legendre" step for sqrt. Note that this is no longer an iteration that can be repeated, like a Newton iteration - it is specific to a known range of y_0. This should only be used for the first step after getting y_0.

And what is the result ?


max relative error :

y_0 ieee trick : 3.5276%

y_1 babylonian : 0.0601%

y_1 legendre   : 0.0389%

a nice improvement!

Now that we understand the principles behind how this works, we can just get a bit more hacky and find the answers more directly.

We want to find an iteration for sqrt(x) , and we have some y_0. Let's use a physics "dimensional analysis" kind of argument. We pretend x has units of "Widgets". Then sqrt(x) has units of Widgets^1/2 , as does y_1. We only have x and y_0 to work with, so the only unitless thing we can write is


u = y_0^2/x

(or any power of u), and our iteration must be of the form :


y_1 = y_0 * f(u)

for some function f

now , f can be any function of u, but to minimize the difficulty of the math we specifically choose


f(u) = A * (1/u) + B + C * u

now the issue is just finding the A,B,C which minimize the error.

The Babylonian/Newton iteration is just A = B = 0.5 (C = 0)

Numerical optimization tells me the optimal C is near 0, and optimal AB are near A = B = 0.499850


max relative error :

y_0 ieee trick : 3.5276%

y_1 babylonian : 0.0601%

y_1 legendre   : 0.0389%

y_1 numerical  : 0.0301%

the result is half the error of the babylonian method, for no more work - just a better choice of constant (0.499850 instead of 0.5).

11/19/2010

11-19-10 - Hashes and Cache Tables

Writing some things I've been perking on for a while...

There's a hash function test at strchr.com :

Hash functions An empirical comparison - strchr.com

At first this test looks pretty cool, but on further examination it has a lot of flaws. For one thing, they're basically testing only english words. If what you care about is short english text, then maybe the results are good, but if you care about other things they are not. I do like the fact that they are testing the total time to compute the hash + the time to lookup. However, when you do that then the details of the hash table become crucial, and it seems they are using an externally-chained hash table which is a severe bias (I'd like to see tests using stlport std::hash_map, rdestl, khash, google's "dense hash", or the cblib hash_table, etc. - some standard reference hash implementations). You also need to talk about your hash fill ratio, whether you store the hash value, whether the data is interned, etc. These will affect which hash function is "best".

There are also major architectural issues. For example, the large multiplies of Murmur are not fast on all current platforms. Furthermore, on some platforms you can't do unaligned DWORD reads fast, and you may have to do endian swaps. These issues do not affect all hashes in the same way, so to compare hashes without considering these issues is a mistake.

Also, to do a fair comparison you really need to understand the details of each hash. Some are designed for use only in prime-size tables. Some should be collapsed by xor folding, others should not. You cannot simply plug a bunch of different hash functions into the same code and have a fair comparison.

But perhaps most crucially, the hashes are not optimized to the same amount. Some are unrolled, others are not. Some are reading dwords, some are reading bytes. This is not a fair comparison. Really the only way to do a perfect comparison is to optimize each hash function as much as possible, because hashes have a different degree of capacity for optimization. For example, Adler and Fletcher (which are checksums not hashes, BTW) can easily be made fully 16-byte parallel (see for example asmcommunity Adler32 ; there are also some very good Altivec versions floating around), most hashes cannot be directly parallelized the same way (they can only be "trivially" parallized in the sense of running multiple hashes at once in parallel on interleaved input streams).

(ADDENDUM : actually the biggest problem with the test is probably that the hash functions are called through function pointer, not inlined; this is a severe penalty for the cheaper hash functions; in particular with properly inlined hash functions I ran a similar test and found FNV to just destroy Murmur, which is the opposite of what they find)

Anyway, enough being a grouch, let's get to what I've been thinking about.

One piece of news : there's a (beta) update to Murmur2 called (unsurprisingly) Murmur3 :

MurmurHash3 information and brief performance results

Murmur3 improves the basic mixing a bit to fix some flaws in Murmur2. It also works on larger data elements, which is a mixed blessing. That will make throughput on large keys faster, but hurts performance on small keys. One of the nice things about Murmur2 was how simple it was.

And some links from the past for reference :

What I wrote previously about hash table lookup :

cbloom rants 10-17-08 - 1
cbloom rants 10-19-08 - 1
cbloom rants 10-21-08 - 4
cbloom rants 10-21-08 - 5

My final answer for hashing strings was a variant of FNV, hash value stored in the table, max 70% full, quadratic reprobing.

File hashing :

cbloom rants 08-21-10 - Adler32

My final answer for file hashing was Bob Jenkin's Lookup3, modified to be 4-way SIMD. Since Lookup3 inherently works on 12 bytes at a time, the 4-way simd makes it work on 48 bytes at a time, and I then unroll that 8 times so that I can work on 384 bytes at a time (which is crucially 3 whole 128 byte cache lines).

So anyway. When you're talking about hashes it's important to distinguish what you're using them for. There are lots of variations, but I think there are three primary categories :

1. Hash for hash table lookup (with full resolution, either by in-table probing or external chains).

In this case, collisions are only bad in that they cause you to take more CPU time to probe. You will never fail to find what you are looking up due to collisions. There's a very subtle trade off here between hash quality and the time it takes to compute the hash - the winner will depend intricately on the exact data you are hashing and what kind of hash table implementation you are using. I'm sure this is one of those scenarios where different authors will all claim a different hash function is "best" because they were testing in different scenarios.

2. File checksum/integrity/corruption/hacking testing.

In this case you want the hash function to change any time the file has been changed. There are then two subsets of this : catching accidental corruption and catching intentional corruption. To check the ability to catch accidental corruption, you want to make sure your hash is robust to common types of damage, such as file truncation, injection of zeros, single bit flips, etc. Catching intentional corruption protects you from attacks. We're not talking about cryptographic security, but a good hash should make it very difficult to construct alternative data which is both *valid* and has the same hash.

3. Hashes for "cache tables". This is the main thing I want to talk about.

A "cache table" is a hash table in which collisions are not resolved (or they are resolved only some finite number of times). So far as I know there's not a standard term for this, but we found this Google Code project :

cache-table - Project Hosting on Google Code

And I think "cache table" is a pretty good name for it. It's similar to CPU caches, but without a backing store. That is, a traditional "cache" is basically a "cache table", but then if the data is not found in the cache, you go and get it out of the slow location. We say "cache table" to make it clear that in this case there is no slow location - if it's not in the cache you don't find it.

Cache tables are very popular in data compression. To my knowledge they were first used popularly in LZRW. I then made heavy use of them for LZP (and PPMCB, which is very similar to the way modern compressors use them). They are now the basis of most modern compressors such as PAQ and related context mixers. Cache tables in traditional data compression usage never resize, you simply set your memory usage and that's how big the cache table is.

Usage of a cache table is something like this :


h = hash(key)

look up table[h]

(optionally) check if table[h] is right for key
  if so return its data

optionally :
reprobe, either by stepping h and/or by searching "ways"

# of reprobe steps is finite
if failed to find, return not in table

There are a few optional places. A crucial one is whether you do exact resolution or simply accept collisions. (in the old language of LZP, this is "context confirmation"). That is, for minimum memory use, table[h] might not store anything about key. That means when hashes collide, you can't tell, and you might get back an entry which does not correspond to your key.

Sometimes this is just fine. For example, in LZRW style string matchers, if you get back the wrong data you will simply not get a string match. In other cases you might want some amount of collision resolution, but only probabilistically. For example, PAQ stores only a few bits of key in the table, so when hash collisions occur it can detect them most of the time, but accepts a certain amount of false positives. This issue should always be considered as a trade off of memory usage - is it better to use the memory to resolve collisions more accurately, or just to have a larger table so that fewer collision occur?

The other issue is how you reprobe, and that is closely related to your insertion strategy. When you try to insert and find a collision, you have a variety of choices. You can try all the possible reprobes for that hash value and pick the one that is "best" to replace under some criteria (perhaps such as the last used, or simply the slot that's empty). If you're using "ways" you might just do round-robin insertion in the ways, which is a probabilistic replacement strategy.

Anyway, I think "cache tables" is an interesting topic which I haven't seen explored much in the literature. It's particularly interesting in compression, because when you get a false collision (eg. you retreive data from the table and think it applies to you when it doesn't), that doesn't ruin your compressor, it just makes you code that event with more bits than ideal. So you can measure the cost of collisions as bit excess, and you ideally want to tune collisions to occur for less important data. Hash function collisions in cache tables affect both speed and the quality of results - eg. more collisions means more dropped data - but you also usually care exactly *which* data is dropped.

The question of reprobes within a linear tables vs. ways (multiple independent linear tables) is interesting, and sometimes the optimal solution is to use both. For example, my "medium speed" string matcher for the RAD LZ77 uses two independent hash values (in the same table) and 2 ways. That is, there are two linear tables, and you look at two spots within each table.

I think that Cuckoo hashing for cache tables is a particularly interesting possibility. "Cuckoo cache tables" could do the normal Cuckoo thing (where you have two hashes for each key and you swap data around to make space for an insertion), but you set a maximum number of swaps on insertion (something like 4 or 8), and if you have to do the cuckoo swap step and hit the maximum number, you just stop and let some item fall out of the cache. You can also easily extend this by keeping track of age or quality or something as you do the Cuckoo, and choose to drop the worst item instead of simply dropping the last one in the chain. You don't have to worry about the pathological Cuckoo case of having a cycle you can't resolve - you simply drop an item in that case (in fact you don't even have to detect that case, you'll just go around the cycle up to your maximum finite number of steps).

(ADDENDUM : I tried cuckoo cache tables )

11-19-10 - Games without Strings

Quite a number of major video games have shipped with embarassingly large amounts of time spent in "strcmp" and embarassingly large amounts of memory spent on strings.

I'm of the opinion that games should not ship with strings. However, strings are very useful for development. How should we reconcile this? The basic strategy in all cases will be to replace the string with just an integer GUID or index. Then instead of strcmp you just do == test on the ints, and instead of storing lots of strings in memory, you just have ints to refer to other objects.

First of all, let's say that the premature optimization of making all your references into compact indices from the beginning is a mistake IMO. The main place that strings come up and where they are super useful is as the names of content. That is, textures, models, prefs, NPC types, weapon types, levels, attributes, etc. If I want to make an NPC it's nice to just be able to say, okay use the "ToughMarine" model and give him "AggressiveEnemy" behavior and "BigShotgun" weapon. Often those strings correspond to file names, either directly or indirectly, which is very handy because it lets you add new elements by just making new files, and it lets you find the content that a string corresponds to.

The bad premature optimization that I'm talking about is to work with "databases" of content right from the start rather than loose files. Then your references can be indices. So instead my NPC would use model 7, behavior 3, and weapon 41. This makes your references fast and compact right from the start, but the penalties to development process are just too severe. (some of the old game engines did this, but I believe it is now understood in the industry that this is very bad for source control and large teams and is being phased out).

So, we don't want that method, we want to work with strings during development, but ship without strings.

( some example of unnacceptable workflow in my opinion : when artists have to yell out "who has the texture name database checked out? I have to add a new string to it" , or when there's an error in the content hookup and the game prints out "error : npc 13 couldn't find model 9" (though, kudos for at least printing an error) )

One option is to have a "bake" step that converts all the loose content and string references into these packed databases and indexes. Basically it has to load up every piece of content, see what all the strings in the game are and convert them to an index, then order all the data by its index so you can find it. While this does work, it's pretty painful. Whole-game bake operations like this are pretty disastrous for development, because it means you can't test the real shipping game without doing some big time consuming process. I'm a big believer in being able to run at least portions of the game in their real shipping state all the time during development, not just at the very end. It makes it hard to just change one piece of content or add something and have.

Another option is to have a sort of "incremental baking" from a GUID server. I've seen this work in practice, so it is viable, but it's a little scary. Basically the idea is you keep a central string table that maps unique strings to indices (and vice versa). Each time you make a new piece of content or make a new string, you have to send a packet to the GUID server and get the index for that string. Now you can refer to your piece of content by that index, so you have compact references. While this does certainly work, relying on being able to communicate with the GUID server for development is a bit scary. Also if you ever accidentally get a bug in the GUID system you could corrupt a bunch of content. To deal with both of those issues, it would be nice to keep the original strings around in the content files during development as backup for the central repository.

The option we used for Oddworld Stranger was string hashing. In the shipping game, every 32 bit char * was replaced with a 32 bit integer hash of the char *. Using a hash makes the string -> index mapping deterministic and local, you don't have to talk to your neighbor to make sure you are getting a unique int. This method has lots of advantages and only one (rather large) disadvantage : the possibility of hash collisions. When you get a hash collision you wind up with a perplexing but amusing bug, like you try to put "SinisterHelmet" on a guy and instead you get "RubberChicken" attached to his head because those string hashes collided. During development you need to keep both the hash and the string around.

To handle collisions, we had a debug mode that would load up a level with the strings and the hashes and check if any strings had the same hashes. Note that you don't need to check collisions across the whole game, but only for content that can possibly be loaded at the same time. On Stranger I think we wound up with 2 collisions over the whole course of development. The solution to those collisions was simply to rename some content. eg. "FloppyHat" was colliding with "PurpleSparkle" , so we just renamed it to "PurpleSparkle2" and we were hash collision free. I definitely do not advocate the 32-bit hash method for large scale game development. It's okay on a small team where you can check things like that once in a while and handle it, but with big distributed teams and huge amounts of content it's not viable.

The simplest fix is to go to a 64-bit hash. Then the probability of collision becomes so small that you could just deal with it in the unlikely event that it happens. Note that in the shipping game you never actually generate these hashes, so the speed of the hash function is irrelevant; in the final game they are opaque GUIDs that are used to link pieces of content together.

With the hash+string method there's also an issue of how you store your content and what you do with the string in the shipping version of the game. What we did is just to store both inline in the data files (eg. 4 bytes of hash then the string bytes immediately following), and in the shipping game you just read the string in then immediately discard it. But if you want to be really super optimal and flat-load your data, then you would have to factor out the strings at some point. One option is to make all your data files in chunks, and make one of the chunks be "nonfinal" data, and put the hash -> string mapping in the nonfinal data chunk. Then you can "flat load" the rest of the data.

But none of these solutions is definitively awesome, so I'm curious what other people do or if there is a better solution.

ADDENDUM : I knew I'd seen this written about before, but forgot about it.

Mick West wrote about it for game developer but doesn't really address the difficult issues.

Ivo Beltchev describes CRCView - a VC extension to do GetStr() for CRC's. I wonder how well that works; I've always found the VC expression evaluator to stop working right when I need it most.

Mischief.mayhem puts it together. He suggests using an external SQL database for GUID->String mapping and collision resolution.

BTW I'm not sure why you bother with hashing if you are using a global database. Just make the first string be "1", the second string be "2", etc. But I'm not super convinced that the global database is an okay method.

11/18/2010

11-18-10 - Bleh and TID2008

Bleh, I just spent the whole day trying to track down a bug in my MS-SSIM implementation, that turned out to be a total red herring. I was trying to test my MS-SSIM on the TID2008 database to confirm their results (they find that MS-SSIM is the best by far). (note that perceptual image papers currently have the nasty property that every author demonstrates that their method is the "best" , because they all use different testing methods and on different databases).

Anyway, I was getting totally wrong results, so I went over my MS-SSIM implementation with a fine toothed comb, checked everything against the original Matlab; I found a few deviations, but they were all irrelevant. The problem was I was running the files in the TID database in a different order than the TID test program wanted, so it was matching the wrong file to the wrong file.

As part of that, I downloaded all the implementations of MS-SSIM I could find. The best one is in Metrix Mux . Most of the others have some deviation from the original. For example, many people get the Gaussian window wrong (A Gaussian with sdev 1.5 is e^(- 0.5 * (x/1.5)^2 ) - people leave off the 0.5), others incorrectly apply the window at every pixel position (you should only apply it where the whole window is inside the image, not off any edge), another common flaw is to get the downsample wrong; the way authors do it is with a Daub9 lowpass tap and then *point* downsampling (they use :1:2:end matlab notation which is a point downsample). Anyway, a lot of these details are pretty random. Also of note : Metrix Mux uses rec601 luma and does not gamma correct.

The TID2008 perceptual distortion database is a lot better than the Live database, but it's still not great.

The problem with both of them is that the types of distortions applied is just too small of a subset. Both of them mainly just apply some random noise, and then they both apply JPEG and JPEG2000 distortions.

That's okay if you want a metric that is good at specifically judging the human evaluation of those types of distortions. But it's a big problem in general. It means that metrics which consider other factors are not given credit for their considerations.

For example, TID2008 contains no hue rotations, or images that have constant luma channels but visible detail in chroma. That means that metrics which only evaluate luma fidelity do quite well on TID2008. It has no images where just R or just G or just OpponentBY is modified, so you can't tell anything about the importance of different color channels to perception of error.

TID2008 has 25 images, which is too many really; you only need about 8. Too many of the TID images are the same, in the sense that they are photographs of natural scenes. You only need 2 or 3 of those to be a reasonable representative sample, since natural scene photos have amazingly consistent characteristics. What is needed are more distortion types.

Furthermore, TID has a bunch of distortion types that I believe are bogus; in particular all the "exotic" distortions, such as injecting huge solid color rectangles into the image, or changing the mean luminance. The vast majority of metrics do not handle this kind of distortion well, and TID scores unfairly penalize those metrics. The reason it's bogus is that I believe these types of distortions are irrelevant to what we are doing most of the time, which is measuring compressor artifacts. No compressor will ever make distortions like that.

And also on that thread, too many of the distortions are very large. Again many metrics only work well near the threshold of detection (that is, the distorted image almost looks the same as the original). That limitted function is actually okay, because that's the area that we really care about. The most interesting area of work is near threshold, because that is where we want to be making our lossless compressed data - you want it to be as small as possible, but still pretty close to visually unchanged. By having very huge distortions in your database, you give too many points to metrics that handle those huge distortions, and you penalize

Lastly, because the databases are all too small, any effort to tune to the databases is highly dubious. For example you could easily do something like the Netflix prize winners where you create an ensemble of experts - various image metrics that you combine by estimating how good each metric will be for the current query. But training that on these databases would surely just give you overfitting and not a good general purpose metric. (as a simpler example, MS-SSIM has tons of tweaky hacky bits, and I could easily optimize those to improve the scores on the databases, but that would be highly questionable).

Anyway, I think rather than checking scores on databases, there are two other ways to test metrics that are better :

1. Metric competition as in "Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities". Unfortunately for any kind of serious metrics, the gradient is hard to solve analytically, so it requires numerical optimization and this can be very time consuming. The basic method I would take is to take an image, try a few random steps, measure how they affect metrics M1 and M2, then form a linear combo of steps such that the affect on M2 is zero, while the affect on M1 is maximized. Obviously the metrics are not actually linear, but in the limit of tiny steps they are.

2. Metric RD optimization. Using a very flexible hypothetical image coder (*), do R/D optimization using the metric you wish to test. A very general brute for RD optimizer will dig out any oddities in the metric. You can test metrics against each other by making the image that this compressor will general for 1.0 bits per pixel (or whatever).

* = you shouldn't just use JPEG or something here, because then you are only exploring the space of how the metric rates DCT truncation errors. You want an image compressor that can make lots of weird error shapes, so you can see how the metric rates them. It does not have to be an actual competitive image coder, in fact I conjecture that for this use it would be best to have something that is *too* general. For example one idea is to use an over-complete basis transform coder, let it send coefficients for 64 DCT shapes, and also 64 Haar shapes and maybe some straight edges and linear ramps, etc. H264-Intra at least has the intra predictors, but maybe also in-frame motion vectors would add some more useful freedom.

11/16/2010

11-16-10 - A review of some perceptual metrics

In general this appears to be an active area of research, with tons of good work coming out in just the last few years. It's finally getting the attention it deserves, thank god.

In general there are two major approaches to perceptual metrics : HVS masking/filtering methods , and structural/statistical methods.

The HVS methods are mainly still using something like a per pixel delta, but they are weighting or masking or remapping it to account for various perceptual effects. Largely they are considering the capabilities of the eye, as opposed to what the brain processes and takes from the image. This includes stuff like rod/cone spatial resolution, perceptually linear response to intensity differences, contrast masking, luma masking, relative contrast sensitivity, etc.

The original major HVS work was the JPEG standard itself with its quantization matrices that are based on CSF (contrast sensitivity function) measurements. After that Watson's DCTune and related work added masking and other effects. This has been extended over the years. Probably the strongest modern work is Lin's JND (just noticeable differences) stuff.

JND - Associate Professor Lin Weisi

The JND papers work out complicated models of visual masking to figure out what kind of deltas between images are imperceptible.

Obviously color spaces like CIELAB etc are part of the HVS image difference metric, as is the SCIELAB filter. (I guess CIEDE2000 is the newer version)

A lot of the HVS stuff doesn't really deal well with color - most of the old data is for luma only. The Fairchild/iCAM group is doing a lot of work on perception of color and mainly HVS based metrics :

iCAM An Image Appearance Model
Fairchild Color Publications & Presentations
garrett m. johnson

Garrett Johnson's Thesis is the best reference I have found for lots of data on the HVS with actual useful formulas and numbers.

Many of the newer works use a whole-image FFT in opponent color space so that they can apply a CSF (ala JPEG quantization matrices) and a rod/cone spatial resolution filter (ala SCIELAB).

There are lots of metrics made over the years that incorporate some amount of these HVS factors, for example PSNR-HVS-M :

Nikolay Ponomarenko homepage - PSNR-HVS-M download page

For video there are more HVS factors to consider such as temporal contrast sensitivity and motion masking.

There are also a lot of papers on contrast visibility and color appearance models from the HDR Tone Mapping research guys; in fact there are some good Siggraph courses such as Image Quality Metrics from Siggraph 2000 and Survey of Color for Computer Graphics . There were also a bunch of papers on selective raytracing that gave good details on perceptual metrics.

However, in general the HVS-centric methods are a bit out of favor and "structural methods" appear to be better. Again most of the structural methods work on luma only and ignore color.

The most well-known structure method is SSIM , and there have been many extensions of SSIM that purport to improve it, such as gradient-SSIM and region-pooled-SSIM or feature-type (edge,smooth,texture) pooled SSIM.

The LIVE team has a ton of good papers, with the best ones being on "structural" type metrics :

LIVE
LIVE-Publications Search Results

In particular, they have VIF (Visual Information Fidelity) which is a measure of human-perceptual mutual information. There are a number of modern works using metrics like this for image quality which appears to be a very promising line.

There are various other structural-type metrics, such as doing SVD to find singular vectors and values and so on. So far this work feels to early to be useful.

One interesting work on color I found is SHAME : (Spatial Hue Angle Metric) :

SHAME - Computational Color Imaging Second ... - Google Books

It's basically SCIELAB with the addition of a hue histogram. The hue histogram weights colors that occur more often more. eg. if your image has a big patch of red and you get the color wrong on it, that's visible, but if it only has a few scattered red pixels, then their exact color doesn't matter because relative to their surrounding they will seem "red" even if they are pretty far off.

Some other random links I thought were interesting :

JPEG Post Processing by reapplication of JPEG
NASA ADS Predicting image quality using a modular image difference model
Color FAQ - Frequently Asked Questions Color
[x264-devel] commit New AQ algorithm option (Jason Garrett-Glaser )
MetriX MuX Home Page

11/12/2010

11-12-10 - Some notes on function approximation by iteration

People often blindly use Newton-Raphson method to iteratively improve function approximations. It's worth taking a moment to think about where that comes from and how we can do better.

First of all, iterative function approximation almost always comes from some kind of epsilon expansion (often Taylor series). You assume you are near the desired answer, you set the discrepancy to epsilon, then expand and drop higher powers of epsilon.

For example, Newton-Raphson iteration is this :


You wish to find a root r of f , eg. f(r) = 0

Assume I have x_0 which is close to r, eg, r= x_n + e

0 = f(r) = f(x_n + e) = f(x_n) + e * f'(x_n) + O(e^2)

so

e = -f(x_n)/f'(x_n) + O(e^2)

r = x_n - f(x_n)/f'(x_n) + O(e^2)

therefore

x_n+1 = x_n - f(x_n)/f'(x_n)

is closer to r than x_n

The way we use this to do function approximation is we set up a function f() such that the root of f is at the function we wish to approximate.

For example to find sqrt(A) you would set f(x) = x^2 - A ; this f has roots where x^2 = A , eg. x = sqrt(A).

Note that this method is only really useful when you have a function like sqrt where the function is expensive but its inverse is cheap. This is where the "art" comes in - there are any number of functions you can write which have the root at sqrt(A), but you need to write one that doesn't involve a sqrt; the whole point is to find an iteration that is much cheaper than the full quality function. For example to approximate reciproal sqrt you could also use f(x) = x^2 - 1/A or f(x) = 1/x^2 - A , both are valid but produce different iterations of different complexities.

So, now that we sort of see where Newton-Raphson iteration comes from, we can throw it away. What we really want is a more general idea - start with a point close to the solution, do an epsilon expansion, solve for epsilon, that gives us an iteration. In particular, Newton-Raphson is doing an expansion of the form r = x_n + e , but we could also do r = x_n * (1+e) or r = x_n/(1+e) or whatever. Sometimes these give better iterations (in terms of complexity vs. accuracy).

The next major general point to remember is that all of these methods are based on something like Taylor expansion, and while Taylor expansion is great for infinite series, we should all know that it is *not* optimal for finite series. That is, the first N terms of the Taylor expansion of a function are *not* the best N-term polynomial expansion of that function (except in the limit as epsilon goes to zero).

Over a finite interval, you could use Legendre Polynomials to find the actual best N-term approximation (or any other orthogonal polynomial basis). Since we are on computers here it's often easier to just do a brute force solve for the best coefficients.

This is well known to people who do function approximation (eg. any sane person doing a polynomial approximation of sin(x) over the range [0,pi] is not using a Taylor expansion) (see previous discussion of this 06-13-09 and 06-21-09 ). However, when people do iterative approximation this idea goes out the window for some reason.

In particular, after your initial approximation, you usually know something about the error, whether it is positive, what the bound is. Then you can expand in the error and find the optimal coefficients over that range. This gives you an iterative refinement step which is superior to Newton.

I'll do a followup post and work a specific example.

11/09/2010

11-09-10 - New Technology Fucking Blows

I recently "upgraded" to an HDTV and a PS3 and have been watching some Blu Rays.

WTF. Why didn't you all tell me this is a fucking disaster ?

First of all, the fucking Blu Rays want to connect to the internet. Umm, no you don't. Second, they want to run fucking Java to make their menus even more fucking annoying than the damn DVD menus.

Okay, friggle frack, I hate it, but I can live with that.

Then I find YOU CAN'T FUCKING RESUME on BD-J discs !? WTF "Resume" is like the most basic feature that you need in a movie player, like I'm not allowed to fucking stop my movie and then start it again !? I love the "recommended fixes" from even the official Sony page, to just "skip ahead to where you stopped watching". Are you fucking kidding me? Somebody should be shot for this, it's like the engineers making a video format without audio. You add a bunch of menu features and shit and oh, by the way it doesn't play audio anymore. You can't break the basic fucking functionality when you add features!

Resume Play feature on Blu-ray titles - Blu-ray Forum
Recommend me a blu-ray player with resume after power off [Archive] - DVD Talk Forum
Hugely disappointed by my Blu-ray player. [Archive] - AnandTech Forums
Blu-ray Resume Playback Guide Useful Information Sony Asia Pacific

Okay, so all Blu Ray players have that problem, but the PS3 is actually even worse than most and won't even Resume from normal blu rays unless you fucking say a hail mary and cover it with garlic and press just the right buttons. And god forbid it do something like remember the location you stopped watching even after you power it down or remove the disc, like you know every fucking DVD player from 10 years ago does. ( PS3 missing resume function ).

(aside : that PS3 link is a pretty funny example of people defending their enemies :
"damn just remember the chapter before shutting down."
"quit being lazy and find the chapter where you left off and then press play on that"
"I just leave mine on 24/7 and if I stop a movie, it resumes where I left off until I eject it."
WTF R u serious? )

And of course every time you do watch a Blu Ray, you get the wonderful fucking joy of sitting through a bunch of mandatory fucking previews and other nonsense because they "forbid that operation". Yeah yeah you can usually just skip chapters to get to the menu, but you can't go direct to top menu on most discs, and you can't jump directly to playing.

How to Skip Movie Trailers [Archive] - Audioholics Home Theater Forums
Any way to skip previews on blu-ray - AVForums.com
Any tips to skip DVDBD trailers, warnings, and ads on PS3 - Home4Film
annoying mandatory previews - Blu-ray Forum

As for HDTV, it's mostly good. The fucking Bezel on my LG is glossier than greased shit, which is pretty damn awful. Also HDMI connectors are the fucking devil, they're like bad old USB connectors, they don't go in far enough and are wobbly. I have problems with the HDMI connector making good stable contact at the TV and at the PS3 and the computer.

Locking HDMI Cables and Connectors � Reviews and News from Audioholics
HDMI Connector - High Def Forum - Your High Definition Community & High Definition Resource
anyone else have problems with loose hdmi connection on tv - AVS Forum

Oh, and of course neither the TV nor the PS3 came with cables even though they're like fifty cents OEM price now. That's just fucking cheap ass annoying shit.

So far the only really huge win is that the Netflix app for the PS3 is way better than the one for the HTPC, and of course Netflix movies fucking start playing directly without a bunch of shit, and you can resume them. I would just use Netflix 100% of the time except for the fact that Brownstripe is not reliable, and there's nothing more annoying than being in the middle of a movie and your net connection decides to flake out (I wish I could fucking locally cache the whole movie before watching it to ensure full playback).

11-08-10 - 709 vs 601

The old CIE rec601 standard for Y was :


Y = 0.299 R + 0.587 G + 0.114 B

(as used in JPEG, MPEG, etc.)

The new CIE rec709 standard for Y is :


Y = 0.2126 R + 0.7152 G + 0.0722 B

Modern HDTV's and such are supposed to be built to the 709 standard, which means that the perceived brightness (ADD: I'm playing a little loose with terminology here, I should say the "output luminance" or something like that) should be that linear combo of RGB. eg. new TV's should have brighter greens and darker blues than old TV's. I have no idea if that's actually true in practice, when I go in to Best Buy all the TV's look different, so clearly they are not well standardized.

Those formulas are for *linear* light combination. In codec software we usually use them on *gamma-corrected* values, which makes them semi-bogus. eg. there's not much reason to prefer the 709 coefficients for your video encode even if you know you will be displaying on a 709 spec monitor (if you are sending it signal in RGB), because the monitor should be 709-spec on linear RGB but you are using that matrix on gamma-corrected RGB. I suppose if you hope to decode to a hardware-supported YUV format directly (eg. such as for Overlays on Windows), you would want to be using the same color space as the overlay surface wants, but that's a flaky hopeless hell not worth dealing with.

Two points :

1. This is *NOT* a formula for how important R, G and B are. For example if you have an error metric which you compute on each color channel, do not combine them like


Err_tot = 0.2126 Err_R + 0.7152 Err_G + 0.0722 Err_B

that linear combo is just the contribution to *brightness* , but the human eye sees more than just brightness. It's like you're modeling the eye as only rods and no cones.

(BTW while Blue has a very small contribution to brightness and furthermore only 2% of the cones perceive blue, they are however the most sensitive cones, and their signal is amplified in the brain, so our perception of blue intensity levels (and therefore quantizer step size) is in fact very good; the way that you can get away with killing blue is by giving it much lower spatial resolution )

So, for example, if you are using rec709 Y and then only measuring Y-PSNR , you are severely undercounting the contribution of non-green to error.

2. Now, for what we actually do (data compressors) there's a question of how the difference between 601 and 709 affects us. I'll assume that you are using the color conversion in something like JPEG or H264 which will subsample the chroma and probably has some fixed way to quantize the channels differently.

Choosing 709 or 601 for the conversion matrix are just special cases of doing arbitrary color matrices (such as a KLT), obviously it will depend on the content which one is best. It will also depend on the error metric that you use to define "best" - in particular if you measure error in Y-PSNR using the 709 definition of Y, then it will appear to you that the 709 spec color conversion is best. If you measure error in RGB, then the 601 spec will appear best (usually).

My random guess is that 601 will be better for coding most of the time because it doesn't have such a severe preference for green in the high fidelity non-chroma channel. Basically it's better hedged - when you are given a synthetic image that's all red-blue with no green at all, the 709 matrix is very unfortunate for coding, the 601 is not quite so bad.

Note that using the matrix which matches your reproduction device has basically nothing to do with which matrix is best for coding. The definition of "luminance" for your viewing device refers to *linear* RGB anyway, and in coding we are working on gamma-corrected RGB, so it's best just to think of the color matrix as part of the compressor.

11/05/2010

11-05-10 - Papers

I still use the old PSView once in a while for viewing postscripts (it's the only free standalone PS reader for Windows that I know of; standalone = not requiring Ghostscript nightmare). But it's janky and pretty much obsolete because there are now good fast online PS to PDF converters :

www.ps2pdf.com
neevia.com

ps2pdf is a bit faster but sometimes fails to handle files that neevia does handle.

Also randomly I just found this page of amazing docs by Gernot Hoffmann !!! It's mostly pedantic stuff, but the docs are unbelievably well made, with nice simple writing, tons of details, and FIGURES, omg omg, it's so nice to have actual figures drawing out the plots and geometries of things.

Some of�the great ones I've looked at so far :

cielab - tons of details, figures, gamut plots, etc.

planar projections - also see Euler angles, Gimbal lock, etc. lots of great intro-to-computer-graphics topics.

Another random interesting paper : A Chronology of Interpolation: From Ancient Astronomy to Modern Signal and Image Processing

And some good authors :

Fairchild - Color Publications & Presentations
Boulos - Graphics Publications

11-05-10 - Brief note on Perceptual Metric Mistakes

A lot of perceptual metrics have some weird ideas about things that are important. Some things that are not actually important :

1. Symmetry. A lot of researchers ask for the property M(X,Y) = M(Y,X). There's absolutely no reason to think this is a good property of a perceptual image metric. You may well want to treat the reference (original) differently than the distorted image in your metric.

Obviously for image database similarity metrics or something, you want symmetry, but that's not what we're dealing with here. Also you can of course trivially make a symmetric metric from an asymetric one by doing S(X,Y) = A(X,Y) + A(Y,X)

2. Offset/Scale Invariance. Again people demand M(X,Y) = M(S*X + O,S*Y + O) , that is invariance to constant scale and offset. There is no reason to think this is good (in fact there is reason to think this is bad). Pixel values have an absolute meaning and 2->4 is not the same as 4->6.

3. Rotation Invariance. A lot of people ask for rotation invariance, and are careful to normalize their wavelet basis or DCT or whatever so that they have rotation invariance. Again, bad idea. The human eye has different perception of detail in different directions - strongest horizontally, slightly weaker vertically, and much weaker diagonally. Also real images have by far the most correlation horizontally - people with cameras tend to hold them aligned to the horizon, not at completely random angles. Images are not rotationally invariant in the real world, so why should your metric be?

For example one simple thing they do in the PSNR-HVS paper is to do a one step wavelet transform first to make LL,LH,HL,HH bands, then you run whatever metric you want on the four bands, and then weight their contribution thusly :


UQI-HVS = 0.5779ULL  + 0.1582UHL  + 0.1707ULH  + 0.0932UHH

This kind of issue comes up in non-obvious ways as well. For example if you do a unitary DCT (or any other basis) transform on a block, the L2 norm is preserved, so you might think that L2 norm is theoretically more logical, however we have no reason to believe that human vision is invariant under basis-space rotations, therefore there is no reason to prefer the rotation-invariant norm.

4. Not quite in the same category, but another thing that I have my doubts about is use of things like visual attention focus. The basic idea is you can predict what part of the image/video people actually care about, and give more bits to those objects. This sounds nice in theory, but in practice you can't know where people will be looking, so if you look at an ensemble of observers who look at a variety of places, you might improve your average score, but you hurt the worst case (people who were looking at the area you took bits away from). In practice there's also the issue that when you take bits away from an unimportant area, you can create artifacts there, and then those artifacts become visually annoying and thus make that area more of a focus. eg. in old crappy MPEG when you take bits away from the background and give it to the human actors in the foreground, you can make huge blocky ringy awfulness in the background.

A major problem I see in paper after paper is that people don't define their terms clearly. One of biggest issues is "luma" , where people will just start writing equations with an "L" or a "Y" in it without defining exactly what that means. Is it light-linear (eg. CIE Y) ? Perceptually uniform "lightness" (CIE L)? What's the scale? eg. what does L=1 mean?