# cbloom rants

## 2/09/2017

### strtod

I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at exploring binary for details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using banker rounding.

The marginal rounding value is like this in binary :

```
1|101010....101010|10000000000000000

^ ^              ^ ^- bits below mantissa are exactly at rounding threshold
| |              |
| |              +- bottom bit of mantissa = "banker bit"
| |
| +- 52 bits of mantissa
|
+- implicit 1 bit at top of double

```
in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, each bucket corresponds to one of the doubles that covers that range, this value is the boundary of a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :

```
hi = a u64 with first 64 bits
top bit of "hi" is on

(hi>>11) is the 52 bits of mantissa + implicit top bit

hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit

hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits

At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on

At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00

```

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10. (the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given "70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :

```
lo = current u64 from first 19 digits

hi = lo + (all remaining digits = 99999)*(place value) + (bias for placevalue mult being too low)

final value is in [lo,hi]

```
so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :

```
34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this is right on a rounding threshold; because the banker bit is off it should round down

the correct closest double is
34791611969279739000000.000000

but if you bias that up beyond the digits that MSVC reads :

34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm

this should round up; the closest double is
34791611969279743000000.000000

but old MSVC gets it wrong

-----------

Another example :

2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000

is the double threshold

2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110

is definitely above the threshold, should round up

but if you only read the first 19 base-10 digits :

2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000

you see a value that is below the threshold and would round down

You have to look at the interval of uncertainty -

2022951805990391198000000
2022951805990391199000000

and see that you straddle and boundary and must get more digits.

```
There are two things that need refinement : getting more digits & doing your place value scaling to more bits of precision. You can interatively increase the precision of each, which refines your interval smaller and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of precision.

BTW this needing lots of precision case is exactly the same as the arithmetic coder "underflow" case , where your coding interval looks like, in binary :

```
hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
^               ^               ^- active coding zone, new symbols add at this fixed point position
|               |
|               +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
|
+-- hi and lo the same in top bits, these bits have been sent & renormalized out

```
anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding more, control word, precision, or with the optimizer doing funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE). The best solution to all that is to just not use floats, use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; numbers that fit completely in an int you can just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the rounding mode you want from your strtod)

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them". In fact, round-tripping doubles is easy, since they don't ever have any bits on below the (52+1) in the mantissa, you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined right answers. This is particulary true for things like the CRT. (this also comes up in how the compiler converts float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher (especially since good conversion routines have been around since the Fortran days). (the unwillingess of people to just use existing good code is so bizarre to me, they go off and roll their own and invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

"dtoa" (which contains a strtod) by David M Gay :

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc, mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in the sense of being some apalling spaghetti. This is actual code from dtoa :

```
if (k &= 0x1f) {
k1 = 32 - k;
z = 0;
do {
*x1++ = *x << k | z;
z = *x++ >> k1;
}
while(x < xe);
if ((*x1 = z))
++n1;
}
#else
if (k &= 0xf) {
k1 = 16 - k;
z = 0;
do {
*x1++ = *x << k  & 0xffff | z;
z = *x++ >> k1;
}

```
holy order-of-operations reliance batman! Jeebus.

To build dtoa I use :

```#define IEEE_8087
#pragma warning(disable : 4244 4127 4018 4706 4701 4334 4146)
#define NO_ERRNO
```
Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions (always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example). (though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

Here's my timing on random large (20-300 decimal digits) strings :

```
strtod dtoa.c
ticks = 353

strtod mine
ticks = 267

MSVC 2005 CRT atof
ticks = 1921

MUSL floatscan
ticks = 11445

```

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I use mulq to do 64x64 to 128 multiply , whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually. I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.

The MUSL floatscan slowness is mmrmm. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base10 digits) which is unnecessary most of the time.