Comments on cbloom rants: 11-20-10 - Function approximation by iteration, Part 3 - Reciprocal Square Root

Jan Řrřola Kadlec (2010-12-04 15:45):
Yeah, that's me.
If I remember correctly, the magic constant has 4 periodic local minima. Floating point "noise" is evil - I had to make a 2048³ exhaustive search around the best DIRECT result to find these values.

cbloom (2010-12-01 09:17):
Jan Kadlec, I presume?
"That's not right - the optimal constant for one Newton is 0x5F1FFFF9"

Mmm indeed! Nice work.

It looks like I failed to find that because I was only doing greedy optimization and got stuck in a local minimum; it looks like you have a nice little implementation of the DIRECT optimizer.

I've always been a pedant about the danger of local minima, so shame on me for falling for it.

Addendum will be added to the original post.

Jan Řrřola Kadlec (2010-12-01 05:19):
> Lastly, let's go back to that magic number 0x5f375a86 that was used in the y0 approximation. Can we do better if we optimize that? The answer is basically no.

That's not right (http://rrrola.wz.cz/inv_sqrt.html) - the optimal constant for one Newton is 0x5F1FFFF9 (for f(u) = 0.703952253 * (2.38924456 - u)).

The max relative error for this is 0.0650197%.

won3d (2010-12-01 02:48):
Yeah, SSE's rsqrt is quite awesome, but I've found a pretty fun use for "magic" floating point tricks in my n-body gravity simulator. If you're doing an inverse square law and you have to normalize a vector, you end up doing something like G * pow(r_dot_r + epsilon, -3/2). You can do this with rsqrt and a bunch of multiplies, but the bit bashing trick is definitely faster.
Obviously, the results aren't particularly accurate, but you get the right effect.

cbloom (2010-11-21 12:44):
Hmm... looking at the ASM, MSVC appears to be automatically unrolling some of my test loops (!!). That's kind of awesome but also annoying because it creates a huge inconsistency in timing, because it depends on what else is going on in the function.

Anyway, here are some more x64 times with what appears to be 4X unrolling being done by the compiler:

rsqrtf_clib : 9.208541
rsqrtf_cb_method1 : 4.134571
rsqrtf_walsh_newton : 4.658854
rsqrtf_walsh_newton_twice : 6.542103
rsqrtf_walsh_halley : 7.093772
rsqrtf_babylonian : 6.710390
rsqrtf_sse_est : 1.986016

x64-Corei7 is so ridiculously fucking fast that it's impossible to time anything on it.

The only way to really test speed reliably these days is "in situ" - you have to put it in your real game and measure it there, because there are too many code generation issues and pipeline issues. (eg. in the real world of consoles, the Walsh floating point trick method is totally impractical due to pipeline stalls)

cbloom (2010-11-21 12:28):
"Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source. ;)"

Well I'm not claiming anything is faster, that's not really the point.
The "cb_method1" is exactly the same speed as the old Carmack/Walsh/Lomont method.

By far the fastest way is to use your platform's reciprocal sqrt estimate. As ryg points out, if your platform's rsqrte is poor then maybe a "method1" type step after that would be ideal.

Just for completeness; on Core i7 :

x86 :

carmack/cb1 : 11 clocks
babylonian : 34 clocks
sse rsqrte : 7 clocks

x64 :

carmack/cb1 : 8 clocks
babylonian : 14 clocks
sse rsqrte : 3 clocks

(note that these should be taken with a grain of salt because I haven't carefully analyzed the assembly to make sure they are being treated equally; in particular I measured cb1 to be slightly faster than Walsh (7.5 vs 8.25 clocks on x64), which is just an anomaly of code generation)

Ben Garney (2010-11-21 11:47):
Do you have actual timings for the various methods? I'm always leery of claims that such and such method is faster without a measured scenario. Even if it is from a trusted source.
;)

cbloom (2010-11-20 20:18):
Indeed!

ryg (2010-11-20 19:36):
Small error: the B2 you give for "second step optimal" is missing its sign.

ryg (2010-11-20 17:13):
"Now before you get excited and go copy-pasting rsqrtf_cb_method1 around, let me stop you. As I said at the beginning, it's pointless"

Maybe pointless on x86 where rsqrtss gives you an initial estimate accurate to 12 bits, but not on PPC where frsqrte only gives you 5 bits (curiously, AltiVec vrsqrtefp is much better, I think it's either 10 or 12 bits).
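Editor's note: Kadlec's result discussed above (the constant 0x5F1FFFF9 with the fitted step f(u) = 0.703952253 * (2.38924456 - u)) can be sketched in C roughly as follows. The two coefficients and the magic constant are taken directly from the comment; the function name and the union-based scaffolding are mine, following the standard bit-hack layout:

```c
#include <stdint.h>

/* One-Newton rsqrt using Kadlec's jointly optimized constants
   (magic 0x5F1FFFF9, step coefficients 0.703952253 and 2.38924456).
   Claimed max relative error: ~0.0650197%. */
static float rsqrt_kadlec(float x)
{
    union { float f; uint32_t u; } y = { x };
    y.u = 0x5F1FFFF9u - (y.u >> 1);   /* bit-hack initial estimate */
    /* modified Newton step: y * 0.703952253 * (2.38924456 - x*y*y) */
    return 0.703952253f * y.f * (2.38924456f - x * y.f * y.f);
}
```

The key difference from the classic method is that the two step coefficients are optimized jointly with the seed constant, rather than fixed at the textbook 1.5 and 0.5.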
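Editor's note: for reference, the "Carmack/Walsh/Lomont" method that the clock timings above refer to is, in its well-known published form, the magic constant 0x5f3759df followed by one textbook Newton step. This sketch uses that published constant (the post itself discusses the 0x5f375a86 variant, and cbloom's "cb_method1" uses different step coefficients):

```c
#include <stdint.h>

/* Classic bit-hack reciprocal square root (Quake III / Walsh / Lomont form):
   seed from the magic constant, then one Newton step y*(1.5 - 0.5*x*y*y).
   Max relative error is roughly 0.175%. */
static float rsqrt_walsh(float x)
{
    union { float f; uint32_t u; } y = { x };
    float xhalf = 0.5f * x;
    y.u = 0x5f3759dfu - (y.u >> 1);          /* initial estimate */
    return y.f * (1.5f - xhalf * y.f * y.f); /* one Newton iteration */
}
```

The same Newton step is also what ryg's last comment suggests applying after a coarse hardware estimate like PPC's frsqrte, with the hardware result replacing the bit-hack seed.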
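Editor's note: won3d's gravity use can be made concrete. The softened inverse-cube factor (r.r + epsilon)^(-3/2) folds the normalization (1/|r|) and the inverse-square law (1/|r|^2) into three multiplies of a single rsqrt. The function name and shape here are hypothetical; 1.0f/sqrtf stands in for whichever fast rsqrt estimate is actually used:

```c
#include <math.h>

/* Softened inverse-cube factor for an n-body gravity kernel:
   (r_dot_r + eps)^(-3/2), computed as rsqrt(r_dot_r + eps) cubed.
   Replace 1.0f/sqrtf with a fast rsqrt estimate for speed. */
static float inv_cube(float r_dot_r, float eps)
{
    float inv = 1.0f / sqrtf(r_dot_r + eps); /* rsqrt estimate goes here */
    return inv * inv * inv;                  /* cube it: ^(-3/2) */
}
```

The acceleration contribution is then G * m * r_vec * inv_cube(dot(r_vec, r_vec), eps), with no separate normalization step.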