Now that we have our toolbox, we can attack a popular issue in 3d engines - the reciprocal square root (used primarily in normalizing vectors). We want to find y = 1 / sqrt(x) .
Let me first state that I'm not doing all this because it's actually important for optimization. If you care about speed all you have to do is set :
/arch:SSE2 /fp:fast

and get on with your life. Worrying about this issue is a clear case of pointless micro-optimization. I'm doing this for the mathematical curiosity. Nevertheless, lots of people continue to worry about this :
Some Assembly Required - Timing Square Root
ompf.org - View topic - SSE Patterns
Ken Turkowski's Computer Graphics Papers
Inverse Square Root ECE 250 University of Waterloo
HLNUM.ORG Documentation RSqrt
hlnum frsqrt
High order algorithms to find square roots and inverses
computation of square root & reciprocal calculation
To do reciprocal square root we're going to start with y0 from the IEEE floating point trick. In games this is commonly known as "Carmack's trick from Quake 3" , but it apparently goes back to Greg Walsh :
Beyond3d part 1
Beyond3d part 2
Also, on the Slashdot page about that article, Paul Hsieh mentions that Kahan apparently knew of this idea, but he doesn't know the exact time frame.
BTW Chris Lomont has the best paper about this IEEE trick for recipsqrt, but the Wikipedia page is a good start
So the initial approximation is :
const int magic = 0x5f375a86;

inline float rsqrtf_walsh( float x )
{
    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    return u.f;
}

(more on the magic number later, but see Lomont for an explanation of where it comes from). (BTW I don't mean to imply that it's "magic" by using that name; it's a standard IEEE bit trick).
This has rather large error so we are going to use this as y0 and find an iteration for y1.
We can now use our dimensional analysis method. Our unitless number must be
u = x * y0*y0

and our iteration must be :
y1 = y0 * f(u)

and we'll look at a few possible forms for f :
f1 = A + B * u
f2 = A + B / u
f3 = A + B * (1-u) + C * (1-u)^2

we can confirm that these forms are sane by finding them in other ways.
If we set f(y) = 1/y^2 - x , then the root f(y) = 0 occurs where y = 1/sqrt(x) , so we can use Newton's method to find the roots of f(y). The result is :
y1 = y0 * ( (3/2) - (1/2) x y0^2 )

or : f1 , with A = 3/2 , B = -1/2

If we set f(y) = y^2 - 1/x , the root is in the same place and we can use Newton's method and we get :
y1 = 0.5 * ( y0 + 1 / (x*y0) )

or : f2 , with A = B = 1/2
If we use the first f(y) form and do a "Halley" or "Householder" iteration instead of a Newton iteration, that's form f3 with A = 1 , B = 1/2, C = 3/8.
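Spelled out as code, those three refinements look something like this (a sketch : y0 comes from rsqrtf_walsh above, and the function names are just for illustration) :

inline float rsqrtf_walsh_newton( float x )      // f1 with A = 3/2 , B = -1/2
{
    float y0 = rsqrtf_walsh(x);
    return y0 * ( 1.5f - 0.5f * x * y0 * y0 );
}

inline float rsqrtf_walsh_babylonian( float x )  // f2 with A = B = 1/2 ; note the divide
{
    float y0 = rsqrtf_walsh(x);
    return 0.5f * ( y0 + 1.f / (x * y0) );
}

inline float rsqrtf_walsh_halley( float x )      // f3 with A = 1 , B = 1/2 , C = 3/8
{
    float y0 = rsqrtf_walsh(x);
    float e = 1.f - x * y0 * y0;                 // e = (1-u)
    return y0 * ( 1.f + 0.5f * e + 0.375f * e * e );
}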
(maximum % relative error) :

y0 (no iteration)   : 3.436526
y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = -1/2)
y0 + two Newtons    : 0.000470
y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2 , C = 3/8)

Obviously these methods do not have equal computational complexity. In particular, the Babylonian method uses a divide, so it's much slower than the other methods. A serious study of function approximation should compare the quality to the CPU cost to make sure the method is at a good place on the speed/quality curve.
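(Those numbers come from a brute-force error scan. A minimal sketch of that kind of measurement - the interface here is just illustrative - is to test every float in [1,4), which is enough because the relative error of these bit-trick methods repeats with every factor of 4 in x :)

#include <math.h>

// sketch : maximum % relative error of a reciprocal-sqrt approximation,
// measured over every float in [1,4)
float max_rel_err_percent( float (*approx)(float) )
{
    double worst = 0.0;
    union { float f; unsigned int i; } u;
    for ( u.f = 1.f; u.f < 4.f; u.i++ )   // step through consecutive floats
    {
        double exact = 1.0 / sqrt( (double) u.f );
        double err = fabs( (double) approx(u.f) - exact ) / exact;
        if ( err > worst ) worst = err;
    }
    return (float)( worst * 100.0 );
}

// eg. max_rel_err_percent( rsqrtf_walsh_newton ) should come out around 0.175126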
But of course we know the punch line is coming : we don't need to use those values for ABC. Because the initial estimate y0 has a known error range, we can optimize ABC over that range. We won't bother with doing it analytically, we'll just numerically search. The result is :
maximum percent relative error :

y0 (no iteration)   : 3.436526
y0 + one Newton     : 0.175126  (f1 with A = 3/2 , B = -1/2)
method f1 optimal   : 0.087665  ( A = 1.501338 , B = -0.500461 )
y0 + one Babylonian : 0.060595  (f2 with A = B = 1/2)
method f2 optimal   : 0.030292  ( A = B = 0.499848 )
y0 + one Halley     : 0.010067  (f3 with A = 1 , B = 1/2 , C = 3/8)
method f3 optimal   : 0.002523  ( A = 1.0 , B = 0.5011 , C = 0.375608 )

By optimizing the coefficients we get roughly a 2X reduction of maximum error *for free* ! (4X for Halley - because it's quadratic; two Newton steps would also get a 4X improvement). It's no more CPU work because a float constant multiply is a float constant multiply. We're just using better constants. As noted in the last post - these constants are no longer an iteration that you can repeat over and over; they are optimized for the first step only.
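The search itself doesn't need to be clever. A minimal sketch of a coarse grid search for the f1 coefficients (the ranges, steps, and subsampling here are arbitrary, and since this uses rsqrtf_walsh's magic number you'll land on slightly different best A,B than the ones above) :

#include <stdio.h>
#include <math.h>

// sketch : brute-force search for the (A,B) of form f1 that minimizes the
// maximum relative error ; in real use you'd refine the grid around the winner
int main()
{
    double bestA = 0, bestB = 0, bestErr = 1.0;

    for ( double A = 1.49; A <= 1.51; A += 0.0005 )
    for ( double B = -0.51; B <= -0.49; B += 0.0005 )
    {
        double worst = 0.0;
        union { float f; unsigned int i; } u;
        for ( u.f = 1.f; u.f < 4.f; u.i += 256 )   // subsampled floats in [1,4) for speed
        {
            float  y0 = rsqrtf_walsh( u.f );
            double y1 = y0 * ( A + B * u.f * y0 * y0 );
            double exact = 1.0 / sqrt( (double) u.f );
            double err = fabs( y1 - exact ) / exact;
            if ( err > worst ) worst = err;
        }
        if ( worst < bestErr ) { bestErr = worst; bestA = A; bestB = B; }
    }

    printf( "A = %f , B = %f , max rel err = %f %%\n", bestA, bestB, bestErr * 100.0 );
    return 0;
}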
Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no. (* see addendum)
In Quake3 Carmack used 0x5f3759df ; Carmack + 1 Newton has a max relative error of 0.175214 %
In his paper Lomont finds 0x5f375a86 is slightly better ; Lomont + 1 Newton gives 0.175129 %

But that level of tweakage is pretty irrelevant because they missed the big tweak, which is using an optimal iteration instead of a Newton iteration :
inline float rsqrtf_cb_method1( float x )
{
    const int magic = 0x5F37592F;
    const float A = 1.501338f;
    const float B = -0.500461f;

    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    float y0 = u.f;

    float Bx = B * x;
    return y0 * ( A + Bx * y0 * y0 );
}

rsqrtf_cb_method1 has a max relative error of 0.087714 %

The magic number I happened to use is different from Lomont's or Carmack's, but it's actually irrelevant - you can use anything in 0x5f375000 - 0x5f376000 and you will get the same final error (but you will get slightly different values for A and B). The point is that the bias from the magic number can be balanced by optimizing A or B very effectively. So you don't have to worry about it - just get the magic number vaguely close. (* see addendum)
To demonstrate that this iteration should not be repeated, let's change :
return y0 * ( A + Bx * y0 * y0 );

to

float y1 = y0 * ( A + Bx * y0 * y0 );
return y1 * ( A2 + B2 * x * y1 * y1 );

Now the errors for two steps are :
first step is cb_method1                    : 0.087714 %
repeat same step ( A2 = A , B2 = B )        : 0.087716 % (!!)
second step Newton ( A2 = 3/2 , B2 = -1/2 ) : 0.000127 %
second step optimal                         : 0.000072 % ( A2 = 1.500000596f , B2 = -0.500000060f )
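For completeness, the two-step version assembled into one function would look something like this (a sketch ; the name is just illustrative, and the constants are the ones from above) :

inline float rsqrtf_two_step( float x )
{
    const int   magic = 0x5F37592F;
    const float A  = 1.501338f    , B  = -0.500461f;     // optimal first step
    const float A2 = 1.500000596f , B2 = -0.500000060f;  // optimal second step

    union { int i; float f; } u;
    u.f = x;
    u.i = magic - (u.i>>1);
    float y0 = u.f;

    float y1 = y0 * ( A  + B  * x * y0 * y0 );
    return     y1 * ( A2 + B2 * x * y1 * y1 );
}

which per the table above lands around 0.000072 % max relative error.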
It should be obvious that the closer you get to the answer, the less optimizing helps, because the Newton iteration (which is based on a Taylor expansion) becomes optimal as epsilon goes to zero.
Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless because you have :
inline float rsqrtf_sse( float x )
{
    return _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
}

with max relative error : 0.032585 %

One Newton step after this takes you to 0.000029 % error ; optimizing the iteration only gets you to 0.000026 % , so it's very pointless.
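(For reference, the "one Newton step after this" version is just something like the following - name illustrative, needs <xmmintrin.h> :)

inline float rsqrtf_sse_newton( float x )
{
    float y0 = _mm_cvtss_f32( _mm_rsqrt_ss( _mm_set_ss( x ) ) );
    return y0 * ( 1.5f - 0.5f * x * y0 * y0 );   // standard Newton step ; ~0.000029 % max error
}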
ADDENDUM :
Jan Kadlec (rrrola) posted a comment pointing to his prior work on the topic. He finds a better initial magic number and better follow-up float constants by doing a more rigorous search. What I wrote at (*) about the magic number not mattering wasn't quite right. I was doing greedy searches in the optimization and got stuck in a local minimum. You have to have quite a good non-greedy searcher to find his solution. The exact value of the magic number doesn't matter much, but there is this much better solution if you make a big jump to another neighborhood :
Copy-pasting rrrola's version for reference , with 0.0652 % max relative error.
float inv_sqrt_rrrola( float x )
{
    union { float f; uint32 u; } y = {x};
    y.u = 0x5F1FFFF9ul - (y.u >> 1);
    return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}
Also, something you can perhaps help me with : Pizer had done even earlier work on this topic and roughly found the right (right = Kadlec) constant by finding the magic number that minimizes (1 + e_max)/(1 + e_min). Doing that he gets 0x5F1F1412.
If we say ratio = approx/correct , then what Pizer is minimizing is ratio_max/ratio_min , which is the same as minimizing log(ratio_max) - log(ratio_min). I can't reproduce Pizer's result and I'm not sure why that would be the thing you want to minimize. Intuitively it does make sense that what you care about is really *log* relative error, not relative error. That is, going +100% is clearly better than going -100% in a relative sense (this is sort of like the fact that doing -100% in the stock market is much worse than doing +100% is good; as I wrote long ago, we should really talk about log-return).
I guess one way to see it is if you think about e as a floating point number; what we really want is for the error of the approximation to show up as far to the right of the decimal as possible. The number of binary digits to the right of the decimal is the log of the error. But that would just mean minimizing log( MAX(e_max,-e_min) ) so something is missing in my understanding of this bit.
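To make that concrete, evaluating Pizer's quantity for a candidate magic number looks something like this (a sketch ; I'm assuming e here is the signed relative error of the raw bit-trick estimate, with no iteration) :

#include <math.h>

// sketch : Pizer's (1 + e_max)/(1 + e_min) for one magic constant ;
// the idea would be to minimize this over candidate magic values
double pizer_metric( int magic )
{
    double e_min = 1.0, e_max = -1.0;
    union { float f; unsigned int i; } u;
    for ( u.f = 1.f; u.f < 4.f; u.i++ )
    {
        union { int i; float f; } y;
        y.f = u.f;
        y.i = magic - (y.i>>1);
        double exact = 1.0 / sqrt( (double) u.f );
        double e = ( (double) y.f - exact ) / exact;   // signed relative error
        if ( e < e_min ) e_min = e;
        if ( e > e_max ) e_max = e;
    }
    return ( 1.0 + e_max ) / ( 1.0 + e_min );
}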
10 comments:
"Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless"
Maybe pointless on x86 where rsqrtss gives you an initial estimate accurate to 12 bits, but not on PPC where frsqrte only gives you 5 bits (curiously, AltiVec vrsqrtefp is much better, I think it's either 10 or 12 bits).
Small error: the B2 you give for "second step optimal" is missing its sign.
Indeed!
Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source. ;)
"Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source. ;)
"
Well I'm not claiming anything is faster, that's not really the point. The "cb_method1" is exactly the same speed as the old Carmack/Walsh/Lomont method.
By far the fastest way is to use your platform's reciprocal sqrt estimate. As ryg points out, if your platform's rsqrte is poor then maybe a "method1" type step after that would be ideal.
Just for completeness ; on Core i7 :
x86 :
carmack/cb1 : 11 clocks
babylonian : 34 clocks
sse rsqrte : 7 clocks
x64 :
carmack/cb1 : 8 clocks
babylonian : 14 clocks
sse rsqrte : 3 clocks
(note that these should be taken with a grain of salt because I haven't carefully analyzed the assembly to make sure they are being treated equally; in particular I measured cb1 to be slightly faster than Walsh (7.5 vs 8.25 clocks on x64), which is just an anomaly of code generation)
Hmm... looking at the ASM, MSVC appears to be automatically unrolling some of my test loops (!!). That's kind of awesome but also annoying because it creates a huge inconsistency in timing, because it depends on what else is going on in the function.
Anyway, here are some more x64 times with what appears to be 4X unrolling being done by the compiler :
rsqrtf_clib : 9.208541
rsqrtf_cb_method1 : 4.134571
rsqrtf_walsh_newton : 4.658854
rsqrtf_walsh_newton_twice : 6.542103
rsqrtf_walsh_halley : 7.093772
rsqrtf_babylonian : 6.710390
rsqrtf_sse_est : 1.986016
x64-Corei7 is so ridiculously fucking fast that it's impossible to time anything on it.
The only way to really test speed reliably these days is "in situ" - you have to put it in your real game and measure it there, because there are too many code generation issues and pipeline issues. (eg. in the real world of consoles, the Walsh floating point trick method is totally impractical due to pipeline stalls)
Yeah, SSE's rsqrt is quite awesome, but I've found a pretty fun use for "magic" floating point tricks in my n-body gravity simulator. If you're doing an inverse square law and you have to normalize a vector, you end up doing something like G * pow(r_dot_r + epsilon, -3/2). You can do this with rsqrt and a bunch of multiplies, but the bit bashing trick is definitely faster. Obviously, the results aren't particularly accurate, but you get the right effect.
> Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no.
That's not right ( http://rrrola.wz.cz/inv_sqrt.html ) - the optimal constant for one Newton is 0x5F1FFFF9 (for f(u) = 0.703952253 * (2.38924456 - u)).
The max relative error for this is 0.0650197%.
Jan Kadlec, I presume?
"That's not right - the optimal constant for one Newton is 0x5F1FFFF9"
Mmm indeed! Nice work.
It looks like I failed to find that because I was only doing greedy optimization and got stuck in a local minimum; it looks like you have a nice little implementation of the DIRECT optimizer.
I've always been a pedant about the danger of local minima, so shame on me for falling for it.
Addendum will be added to the original post.
Yeah, that's me.
If I remember correctly, the magic constant has 4 periodic local minima. Floating point "noise" is evil - I had to make a 2048³ exhaustive search around the best DIRECT result to find these values.