Let's get to an example. We want y = sqrt(x) , but without actually doing a square root.

You can get an iteration for this in various ad-hoc ways. Let's see one real quick :

y = sqrt(x) square it y^2 = x divide by y y = x/y add y 2y = y + x/y y = 0.5*(y + x/y) y_1 = 0.5*(y_0 + x/y_0)and that's an iteration for sqrt. BTW in all these notes I'll be writing y_0 and y_1 as two steps, but I really mean y_n and y_n+1.

You can get this in another funny way :

y = sqrt(x) y = sqrt( y * (x/y) ) y = geometric_mean( y , (x/y) ) y_1 = arithmetic_mean( y_0, (x/y_0) )and you can see that it converges because if y is bigger than sqrt(x), then x/y is smaller than sqrt(x), so the mean must be closer to the answer.

But these are sort of ad-hoc weird ways of deriving the iteration. Let's do it in a more formal way which will be more useful in the general case. (BTW this is called the "Babylonian method for sqrt" and you can also get it trivially from Newton's method).

Say we have y_0 which is an approximation of sqrt(x) that we got some other way. It's close to the real answer y = sqrt(x) , so we write :

y = sqrt(x) y^2 = y_0^2 * (1 + e) or y_0 = y / sqrt(1 + e) e = "epsilon" is small y_0^2 * (1 + e) = y^2 = x 1+e = x/y_0^2 e = x/y_0^2 - 1the choice of how I've factored e out from (y/y_0) is pretty arbitrary, there are lots of choices about how you do that and all are valid. In this case the important thing is that e multiplied on y_0, not added (eg. I did not choose y^2 = y_0^2 + e).

So far this has all been exact. Now we can write :

y = sqrt(x) = sqrt( y_0^2 * (1 + e) ) = y_0 * sqrt( 1 + e ) exactly : y = y_0 * sqrt( 1 + e) approximately : y ~= y_0 * approx( sqrt( 1 + e) ) so this is our iteration : y_1 = y_0 * approx( sqrt( 1 + e) )The principle step in these kind of iterations is to take a twiddle-equals for y (an approximation of y) and make than an exact equals (assignment) for an iteration of y_n.

Now the only issue is how we approximate sqrt( 1 + e). First let's do it with a Taylor expansion :

approx( sqrt( 1 + e) ) = 1 + 0.5 * e + O(e^2) y_1 = y_0 * ( 1 + 0.5 * e ) y_1 = y_0 * ( 1 + 0.5 * (x/y_0^2 - 1) ) y_1 = y_0 * ( 0.5 + 0.5 * x/y_0^2 ) y_1 = 0.5 * ( y_0 + x / y_0 )which is the Babylonian method.

But as we discussed last time, we can do better. In particular, if we know the error bound of y_0, then we know the range of e, and we can use a better approximation of sqrt( 1 + e). Specifically if we use the IEEE float-int kind of method for finding y_0 :

float y_0(float x) { union { int i; float f; } u; u.f = x; const int ieee_sqrt_bias = (1<<29) - (1<<22) - 301120; u.i = ieee_sqrt_bias + (u.i >> 1); return u.f; }Then the maximum relative error of y_0 is 3.53% That means (1 - 0.0353) < |y_0/y| < (1 + 0.0353)

(1 - 0.0353) < 1/sqrt( 1 + e) < (1 + 0.0353) (1 - 0.0353)^2 < 1/( 1 + e) < (1 + 0.0353)^2 (1 - 0.0353)^-2 > ( 1 + e) > (1 + 0.0353)^-2 (1 - 0.0353)^-2 -1 > e > (1 + 0.0353)^-2 -1 0.07452232 > e > -0.06703023we'll expand the range and just say |e| <= 0.075 ; so instead of just using a Taylor expansion that assumes e is small, we can minimize the maximum error over that range.

We do that by expanding in the basis {1} and {x} (if you like this is the orthogonal Legendre or Tchebychev basis, but that's kind of silly to say when it's two functions). To find the optimal coefficient you just do the integral dot product with sqrt(1+e) over the appropriate range. The result is :

sqrt(1+e) ~= 0.999765 + 0.037516 * ( e/0.075 ) sqrt(1+e) ~= 0.999765 + 0.5002133 * e y_1 = y_0 * ( 0.999765 + 0.5002133 * e ) y_1 = y_0 * ( 0.999765 + 0.5002133 * (x/y_0^2 - 1) ) y_1 = y_0 * ( 0.4995517 + 0.5002133 * x/y_0^2 ) y_1 = 0.4995517 * y_0 + 0.5002133 * x/y_0We'll call this the "legendre" step for sqrt. Note that this is no longer an iteration that can be repeated, like a Newton iteration - it is specific to a known range of y_0. This should only be used for the first step after getting y_0.

And what is the result ?

max relative error : y_0 ieee trick : 3.5276% y_1 babylonian : 0.0601% y_1 legendre : 0.0389%a nice improvement!

Now that we understand the principles behind how this works, we can just get a bit more hacky and find the answers more directly.

We want to find an iteration for sqrt(x) , and we have some y_0. Let's use a physics "dimensional analysis" kind of argument. We pretend x has units of "Widgets". Then sqrt(x) has units of Widgets^1/2 , as does y_1. We only have x and y_0 to work with, so the only unitless thing we can write is

u = y_0^2/x(or any power of u), and our iteration must be of the form :

y_1 = y_0 * f(u) for some function fnow , f can be any function of u, but to minimize the difficulty of the math we specifically choose

f(u) = A * (1/u) + B + C * unow the issue is just finding the A,B,C which minimize the error.

The Babylonian/Newton iteration is just A = B = 0.5 (C = 0)

Numerical optimization tells me the optimal C is near 0, and optimal AB are near A = B = 0.499850

max relative error : y_0 ieee trick : 3.5276% y_1 babylonian : 0.0601% y_1 legendre : 0.0389% y_1 numerical : 0.0301%the result is half the error of the babylonian method, for no more work - just a better choice of constant (0.499850 instead of 0.5).

## 3 comments:

Part 1 made the point that one should understand what range of input values is most relevant to your program, so that the algorithm's performance can be optimized over the whole of that range.

But in the experiments of Part 2 you don't specify what range is used in the calculation of max relative error. While I haven't tried to replicate these experiments, your A = B = 5.0 to A = B = 4.99985% tweak and the reduction in max relative error is so tiny that I suspect the improvements are semi-random artifacts of an idiosyncratic choice of range.

More generally, you could say that while trying to understand the relevant range of inputs is a good first step, you should go further by trying to quantify the prior probability distribution and decision-theoretic loss function. Minimax error is a bit of a cop-out that's often used when people don't want to think too deeply about their models and data.

BTW, the obvious way to estimate a probability distribution is just to instrument the relevant functions in a game running on realistic data to gather histograms. Smooth the histograms with kernel density estimation or whatever, and re-run your numerical parameter optimizers based on that non-parametric model.

" Part 1 made the point that one should understand what range of input values is most relevant to your program, so that the algorithm's performance can be optimized over the whole of that range.

But in the experiments of Part 2 you don't specify what range is used in the calculation of max relative error. "

Eh, no, you are confused. I quite explicity do work out the range.

The y0 walsh approximation is periodic, that is, the relative area over each power of 2 magnitude is the same (I didn't really explicity say this anywhere in the blog posts because I thought it was obvious). That is, you only have to look at the range of input parameters [1.0,2.0] , and the full range of all floats will be the same.

Given that range of input parameters, you get a certain range of errors from y0.

The function approximation over a range does not apply to the *input* parameter range, but rather the range of the parameter in the function expansion, in this case, epsilon, as has been thoroughly detailed in Part 2.

Post a Comment