1/17/2009

01-17-09 - Float to Int

A while ago the Some Assembly Required blog wrote some good notes about float-to-int. I posted some notes there but I thought I'd try to summarize my thoughts coherently.

What I'd like is a pretty fast float-to-int (ftoi) conversion. The most useful variants are "truncate" (like C, fractions go towards zero), and "round" , that is, fractions go towards the nearest int. We'd like both to be available all the time, and both to be fast. So I want ftoi_trunc and ftoi_round.

First let me say that I hate the FPU control word with a passion. I've had so many bugs because of that fucker over the years. I write some code and test it and everything is fine, and then we actually put in some game and all hell breaks loose. WTF happened? Oh well, I tested it with the default word setup, and now it's running with the FPU set to single precision. The other classic bug is people changing the rounding mode. D3D used to be really bad about this (you could use FPU_PRESERVE but it was a pretty big hit back in the old days with software T&L, not a big deal any more). Or even worse is people who write code intentionally designed to work with the FPU in a non-standard rounding mode (like round to nearest). Then if you call other code that's meant for the normal rounding mode, it fails.

Ok, rant over. Don't mess with the FPU control word.

That means the classic /QIfist really doesn't do that much for us. Yes, it makes ftoi_trunc faster :


int ftoi_trunc(float f) { return (int) f; }

that's fast enough with /QIfist, but you still need round :

int ftoi_round(float f) { return ( f >= 0.f ) ? (int)(f + 0.5f) : (int)(f - 0.5f); }

note that a lot of people just do + 0.5 to round - that's wrong, for negatives you need to go the other way, because the C truncation is *toward zero* not *down*.

Even if you could speed up the round case, I really don't like using compiler options for crucial functionality. I like to make little code snippets that work the way I want regardless of the compiler settings. In particular if I make some code that relies on ftoi being fast I don't want to use C casts and hope they set the compiler right. I want the code to enforce its correctness.

Fortunately the xs routines at stereopsis by Sree Kotay are really good. The key piece is a fast ftoi_round (which I have slightly rejiggered to use the union method of aliasing) :


union DoubleAnd64
{
  uint64    i;
  double    d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5

inline int ftoi_round(const float val)
{
  DoubleAnd64 dunion;
  dunion.d = val + floatutil_xs_doublemagic;
  return (int) dunion.i; // just cast to grab the bottom bits
}

in my tests this runs at almost exactly the same speed as FISTp (both around 7 clocks), and it always works regardless of the FPU control word setting or the compiler options.

Note that this is a "banker's round" not a normal arithmetic rounding where 0.5 always goes up or down - 0.5 goes to the nearest *even* value. So 2.5 goes to 2.0 and 3.5 goes to 4.0 ; eg. 0.5's go up half the time and down half the time. To be more precise, ftoi_round will actually round the same way that bits that drop out of the bottom of the FPU registers during addition round. We can see that's why making a banker_round routine was so easy, because that's what the FPU addition does.

But, we have a problem. We need a truncate (ftoi_trunc). Sree provides one, but it uses a conditional, so it's slow (around 11 clocks in my tests). A better way to get the truncate is to use the SSE intrinsinc :


inline int ftoi_trunc(const float f)
{
  return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

Note that the similar _mm_cvt_ss2si (one t) conversion does banker rounding, but the "magic number" xs method is faster because it pipelines better, and because I'm building for x86 so the cvt stuff has to move the value from FPU to SSE. If you were building with arch:sse and all that, then obviously you should just use the cvt's. (but then you dramatically change the behavior of your floating point code by making it run through float temporaries all the time, instead of implicit long doubles like in x86).

So, that's the system I have now and I'm pretty happy with it. SSE for trunc, magic number for round, and no reliance on the FPU rounding mode or compiler settings, and they're both fast.

For completeness I'll include my versions of the alternatives :


#include < xmmintrin.h >

typedef unsigned __int64 uint64;

union DoubleAnd64
{
  uint64  i;
  double  d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5
static const double floatutil_xs_doublemagicdelta = (1.5e-8);                         //almost .5f = .5f + 1e^(number of exp bit)
static const double floatutil_xs_doublemagicroundeps = (0.5f - floatutil_xs_doublemagicdelta);       //almost .5f = .5f - 1e^(number of exp bit)

// ftoi_round : *banker* rounding!
inline int ftoi_round(const double val)
{
  DoubleAnd64 dunion;
  dunion.d = val + floatutil_xs_doublemagic;
  return (int) dunion.i; // just cast to grab the bottom bits
}

inline int ftoi_trunc(const float f)
{
  return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_round_sse(const float f)
{
  return _mm_cvt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_floor(const double val)
{
    return ftoi_round(val - floatutil_xs_doublemagicroundeps);
}

inline int ftoi_ceil(const double val)
{
    return ftoi_round(val + floatutil_xs_doublemagicroundeps);
}

// ftoi_trunc_xs = Sree's truncate
inline int ftoi_trunc_xs(const double val)
{
  return (val<0) ? ftoi_round(val+floatutil_xs_doublemagicroundeps) : 
                   ftoi_round(val-floatutil_xs_doublemagicroundeps);
}

BTW note the ceil and floor from Sree's XS stuff which are both quite handy and hard to do any other way. Note you might think that you can easily make ceil and floor yourself from the C-style trunc, but that's not true, remember floor is *down* even on negatives. In fact Sree's truncate is literally saying "is it negative ? then ceil, else floor".

Finally : if you're on a console where you have read-modify-write aliasing stall problems the union magic number trick is probably not good for you. But really on a console you're locked into a specific CPU that you completely control, so you should just directly use the right intrinsic for you.

AND : regardless of what you do, please make an ftoi() function of some kind and call that for your conversions, don't just cast. That way it's clear where where you're converting, it's easy to see and search for, it's easy to change the method, and if you use ftoi_trunc and ftoi_round like me it makes it clear what you wanted.

ASIDE : in fact I'm starting to think *all* casts should go through little helper functions to make them very obvious and clear. Two widgets I'm using to do casts are :


// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    return *( (const t_to *) &from );
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
//  eg. for numeric casting to make sure the value fits in the new type
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
    t_to to = static_cast< t_to >(from);
    ASSERT( static_cast< t_fm >(to) == from );
    return to;
}

intptr_t pointer_to_int(void * ptr)
{
    return (intptr_t) ptr;
}

void * int_to_pointer(intptr_t i)
{
    return (void *) i;
}

ADDENDUM : I'm told this version of same_size_bit_cast is better on various platforms. That's why we want it gathered up in one place so it can be easily changed ;)


// same_size_bit_cast casts the bits in memory
//  eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    
    union
    {
        t_fm    fm;
        t_to    to;
    } temp;
    
    temp.fm = from;
    return temp.to;
}

4 comments:

nothings said...

I assume this whole thing has been invented multiple times, but it's nice to see that Herf's stereopsis articles references Hecker's article which cites me for the '1.5' in the magic number. Yay me.

(I remember figuring that out and posting it to a mailing list he and I and Terje were on back when I was at Looking Glass, but I can't actually nail down precisely when it was.)

phernost said...

Has noone ever noticed that it doesn't work.

Try floor 0.9999999999999990008
sub doublemagicroundeps
0.50000001499999902066
add floatutil_xs_doublemagic
6755399441055745
in binary
0100001100111000000000000000000000000000000000000000000000000001
keep lower bits
1 (WRONG)

The correct floor 0.9999999999999990008
sub 0.25
0.7499999999999990008
add floatutil_xs_doublemagic / 2
3377699720527872.5
in binary
0100001100101000000000000000000000000000000000000000000000000001
put lower bits in int
1
shift right 1
0 (correct)

One More Time

Try floor 1.9999999999999988898
sub doublemagicroundeps
1.5000000149999990207
add floatutil_xs_doublemagic
6755399441055746
in binary
0100001100111000000000000000000000000000000000000000000000000010
keep lower bits
2 (WRONG)

The correct floor 1.9999999999999988898
sub 0.25
1.7499999999999988898
add floatutil_xs_doublemagic / 2
3377699720527873.5
in binary
0100001100101000000000000000000000000000000000000000000000000011
put lower bits in int
2
shift right 1
1 (correct)

cbloom said...

Sigh sigh sigh la la la

This only works on single precision *floats* . The closest number to 1.0 that you can have in float is

0.99999988

and it works just fine

phernost said...

Touché, you are right. Since ftoi_floor takes a double I figured you wanted full range support. I may not have read the post thoroughly enough. Although, if this is running on x86 hardware you should probably use long double, if your compiler supports 80bit floats, otherwise every operation must make a round trip to memory for truncation. Something we see with GCC's optimizer, is that all float math is internally in 80bit floats, so all intermediate math must use the 80bit version of magic or use -mfpmath=sse and the correct magic for each type, fun times.

I mainly came about the problem of doing rounding (every kind) in SSE with full range support for each type, and have learned more than I even wanted to. Magic has to be the reverse sign or bad rounding occurs in the range of [magic/3, magic*4/3], plus overflow in the range of [maxval - magic, maxval]. Sigh, modulus with euclidean division, wrapping, and mirroring where cake after figuring all this out.

My attempt:

// should be a constant or result in a constant and an fscale instruction
long double magic(int precision) { return (1.5 * std::pow((long double)2, std::numeric_limits::digits - 2)) * std::pow((long double)2, -precision); }

// the sse version of this ((test & -0.0) ^ value)
FloatType toggleSign(FloatType value, FloatType test) { return test < 0 ? -value : value); }

FloatType roundToEven(FloatType value, FloatType magic)
{
magic = toggleSign(magic * 2, value);
return (value - magic) + magic;
}

FloatType roundDirect(FloatType value, FloatType magic, bool roundUp, bool roundAtHalf, bool asymetric)
{
FloatType quarter = 0.25;
quarter = asymetric ? quarter : thatType::toggleSign(quarter, value);
magic = asymetric ? magic : thatType::toggleSign(magic, value);
value = (roundUp ? value + quarter : value - quarter) - magic;
return (roundUp ^ roundAtHalf ? value + quarter : value - quarter) + magic;
}

FloatType ceil(FloatType value) { return roundDirect(value, magic(0), true, false, true); }
FloatType floor(FloatType value) { return roundDirect(value, magic(0), false, false, true); }
FloatType trunc(FloatType value) { return roundDirect(value, magic(0), false, false, false); }
FloatType rndUp(FloatType value) { return roundDirect(value, magic(0), true, true, true); }
FloatType rndDn(FloatType value) { return roundDirect(value, magic(0), false, true, true); }
FloatType rndAway(FloatType value) { return roundDirect(value, magic(0), true, true, false); }
FloatType rndTo0(FloatType value) { return roundDirect(value, magic(0), false, true, false); }

Ok that was way more than I should have wrote, but maybe it will save someone else from my pain.

old rants