Now that we have our toolbox, we can attack a popular issue in 3D engines - the reciprocal square root (used primarily in normalizing vectors). We want to find y = 1/sqrt(x) .
Let me first state that I'm not doing all this because it's actually important for optimization. If you care about speed, all you have to do is set :

/arch:SSE2 /fp:fast

and get on with your life. Worrying about this issue is a clear case of pointless micro-optimization. I'm doing this for the mathematical curiosity. Nevertheless, lots of people continue to worry about this :
Some Assembly Required - Timing Square Root
ompf.org - View topic - SSE Patterns
Ken Turkowski's Computer Graphics Papers
Inverse Square Root ECE 250 University of Waterloo
HLNUM.ORG Documentation RSqrt
hlnum frsqrt
High order algorithms to find square roots and inverses
computation of square root & reciprocal calculation
To do reciprocal square root we're going to start with y0 from the IEEE floating point trick. In games this is commonly known as "Carmack's trick from Quake 3" , but apparently it goes back to Greg Walsh :
Beyond3d part 1
Beyond3d part 2
Also, on the Slashdot page about that article, Paul Hsieh mentions that apparently Kahan knew of this idea, but he doesn't know the exact time frame.
BTW Chris Lomont has the best paper about this IEEE trick for reciprocal square root, but the Wikipedia page is a good start.
So the initial approximation is :
const int magic = 0x5f375a86;
inline float rsqrtf_walsh( float x )
{
union { int i; float f; } u;
u.f = x;
u.i = magic - (u.i>>1);
return u.f;
}
(more on the magic number later, but see Lomont for an explanation of where it comes from).
(BTW I don't mean to imply that it's "magic" by using that name; it's a standard IEEE bit trick).
This has rather large error so we are going to use this as y0 and find an iteration for y1.
We can now use our dimensional analysis method. Our unitless number must be :

u = x * y0*y0

and our iteration must be :

y1 = y0 * f(u)

and we'll look at a few possible forms for f :

f1 = A + B * u
f2 = A + B / u
f3 = A + B * (1-u) + C * (1-u)^2

We can confirm that these forms are sane by finding them in other ways.
If we set f(y) = 1/y^2 - x , then the root f(y) = 0 occurs where y = 1/sqrt(x) , so we can use Newton's method to find the roots of f(y). The result is :

y1 = y0 * ( (3/2) - (1/2) x y0^2 )

or : f1 , with A = 3/2 , B = -1/2

If we set f(y) = y^2 - 1/x , the root is in the same place, and Newton's method gives :

y1 = 0.5 * ( y0 + 1 / (x*y0) )

or : f2 , with A = B = 1/2

If we use the first f(y) form and do a "Halley" or "Householder" iteration instead of a Newton iteration, that's form f3 with A = 1 , B = 1/2 , C = 3/8 .
(maximum % relative error) :

y0 (no iteration)   : 3.436526
y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = -1/2)
y0 + two Newtons    : 0.000470
y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2 , C = 3/8)

Obviously these methods do not have equal computational complexity. In particular, the Babylonian method uses a divide, so it's much slower than the other methods. A serious study of function approximation should compare the quality to the CPU cost to make sure the method is at a good place on the speed/quality curve.
But of course we know the punch line is coming : we don't need to use those values for ABC. Because the initial estimate y0 has a known error range, we can optimize ABC over that range. We won't bother with doing it analytically, we'll just numerically search. The result is :
maximum percent relative error :

y0 (no iteration)   : 3.436526
y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = -1/2)
method f1 optimal   : 0.087665  ( A = 1.501338 , B = -0.500461 )
y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
method f2 optimal   : 0.030292  ( A = B = 0.499848 )
y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2 , C = 3/8)
method f3 optimal   : 0.002523  ( A = 1.0 , B = 0.5011 , C = 0.375608 )

By optimizing the coefficients we get roughly a 2X reduction of maximum error *for free* ! (4X for Halley - because it's quadratic; two Newton steps would also get a 4X improvement.) It's no more CPU work because a float constant multiply is a float constant multiply - we're just using better constants. As noted in the last post, these constants are no longer an iteration that you can repeat over and over; they are optimized for the first step only.
Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no. (* see addendum)
In Quake 3, Carmack used 0x5f3759df ; Carmack + 1 Newton has a max relative error of 0.175214 %. In his paper, Lomont finds that 0x5f375a86 is slightly better ; Lomont + 1 Newton gives 0.175129 %.

But that level of tweakage is pretty irrelevant, because they missed the big tweak, which is using an optimal iteration instead of a Newton iteration :
inline float rsqrtf_cb_method1( float x )
{
const int magic = 0x5F37592F;
const float A = 1.501338f;
const float B = -0.500461f;
union { int i; float f; } u;
u.f = x;
u.i = magic - (u.i>>1);
float y0 = u.f;
float Bx = B * x;
return y0 * ( A + Bx * y0 * y0 );
}
rsqrtf_cb_method1 has a relative error of 0.087714 %
The magic number I happened to use is different from Lomont's or Carmack's, but it's actually irrelevant - you can use anything in 0x5f375000 - 0x5f376000 and you will get the same final error (though you will get slightly different values for A and B). The point is that the bias from the magic number can be balanced very effectively by optimizing A and B. So you don't have to worry about it - just get the magic number vaguely close. (* see addendum)
To demonstrate that this iteration should not be repeated, let's change :
return y0 * ( A + Bx * y0 * y0 );
to
float y1 = y0 * ( A + Bx * y0 * y0 );
return y1 * ( A2 + B2 * x * y1 * y1 );
now the errors for two steps are :

first step is cb_method1                    : 0.087714 %
repeat same step ( A2 = A , B2 = B )        : 0.087716 %  (!!)
second step Newton ( A2 = 3/2 , B2 = -1/2 ) : 0.000127 %
second step optimal                         : 0.000072 %  ( A2 = 1.500000596f , B2 = -0.500000060f )
It should be obvious that the closer you get to the answer, the less optimizing helps, because the Newton iteration (which is based on the Taylor series) becomes optimal as epsilon goes to zero.
Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless because you have :
inline float rsqrtf_sse( float x )
{
return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
}
with max relative error : 0.032585%
One Newton step after this takes you to 0.000029 % error ; optimizing the iteration only gets you to 0.000026 % , so it's quite pointless.
ADDENDUM :
Jan Kadlec (rrrola) posted a comment pointing to his prior work on the topic. He finds a better initial magic number and then follow-up constants by doing a more rigorous search. What I wrote at (*) about the magic number not mattering wasn't quite right. I was doing greedy searches in the optimization and got stuck in a local minimum. You have to have quite a good non-greedy searcher to find his solution. The exact value of it doesn't matter much, but there is a much better solution if you make a big jump to another neighborhood :
Copy-pasting rrrola's version for reference, with 0.0652 % max relative error.
float inv_sqrt_rrrola(float x)
{
union { float f; uint32_t u; } y = {x}; /* uint32_t from <stdint.h> ; the original used a "uint32" typedef */
y.u = 0x5F1FFFF9ul - (y.u >> 1);
return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}
Also, something you can perhaps help me with : Pizer had done even earlier work on this topic and roughly found the right (right = Kadlec) constant by finding the magic number that minimizes (1 + e_max)/(1 + e_min). Doing that he gets 0x5F1F1412.
If we say ratio = approx/correct , then what Pizer is minimizing is ratio_max/ratio_min , which is the same as minimizing log(ratio_max) - log(ratio_min). I can't reproduce Pizer's result, and I'm not sure why that would be the thing you want to minimize. Intuitively it does make sense that what you really care about is *log* relative error, not relative error. That is, going +100% is clearly better than going -100% in a relative sense. (This is sort of like the fact that doing -100% in the stock market is much worse than doing +100% is good; as I wrote long ago, we should really talk about log-return.)
I guess one way to see it is to think about the error e as a floating point number : what we really want is for the error of the approximation to show up as far to the right of the radix point as possible. The number of binary digits to the right of the point is the log of the error. But that would just mean minimizing log( MAX(e_max,-e_min) ) , so something is missing in my understanding of this bit.