6/13/2009

06-13-09 - Integer division by constants

.. can be turned into multiplies and shifts as I'm sure you know. Often on x86 this is done most efficiently through use of the "mul high" capability (the fact that you get 64 bit multiply output for free and can just grab the top dword).

Sadly, C doesn't give you a way to do mul high, AND stupidly MS/Intel don't seen to provide any intrinsic to do it in a clean way. After much fiddling I've found that this usually works on MSVC :


uint32 Mul32High(uint32 a,uint32 b)
{
    return (uint32)( ((uint64) a * b) >>32);
}

but make sure to check your assembly. This should assemble to just a "mul" and "ret edx".

Now, for the actual divider lookup, there are lots of references out there on the net, but most of them are not terribly useful because 1) lots of them assume expensive multiplies, and 2) most of them are designed to work for the full range of arguments. Often you know that you only need to work on a limited range of one parameter, and you can find much simpler versions for limited ranges.

One excellent page is : Jones on Reciprocal Multiplication (he actually talks about the limited range simplifications, unlike the canonical references).

The best reference I've found is the AMD Optimization Guide. They have a big section about the theory, and also two programs "sdiv.exe" and "udiv.exe" that spit out code for you! Unfortunately it's really fucking hard to find on their web site. sdiv and udiv were shipped on AMD developer CD's and appear to have disappeared from the web. I've found one FTP Archive here . You can find the AMD PDF's on their site, as these links may break : PDF 1 , PDF 2 .

And actually this little CodeProject FindMulShift is not bad; it just does a brute force search for simple mul shift approximations for limited ranges of numerators.

But it's written in a not-useful way. You should just find the shift that gives you maximum range. So I did that, it took two seconds, and here's the result for you :



__forceinline uint32 IntegerDivideConstant(uint32 x,uint32 divisor)
{
    ASSERT( divisor > 0 && divisor <= 16 );
    
    if ( divisor == 1 )
    {
        return x;
    }
    else if ( divisor == 2 )
    {
        return x >> 1;
    }
    else if ( divisor == 3 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 17; 
    }
    else if ( divisor == 4 )
    {
        return x >> 2;
    }
    else if ( divisor == 5 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 18; 
    }
    else if ( divisor == 6 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 18; 
    }
    else if ( divisor == 7 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 19; 
    }
    else if ( divisor == 8 )
    {
        return x >> 3;
    }
    else if ( divisor == 9 )
    {
        ASSERT( x < 73728 );  
        return ( x * 0x0000E38F ) >> 19; 
    }
    else if ( divisor == 10 )
    {
        ASSERT( x < 81920 );  
        return ( x * 0x0000CCCD ) >> 19; 
    }
    else if ( divisor == 12 )
    {
        ASSERT( x < 90112 );  
        return ( x * 0x0000BA2F ) >> 19; 
    }
    else if ( divisor == 13 )
    {
        ASSERT( x < 98304 );  
        return ( x * 0x0000AAAB ) >> 19; 
    }
    else if ( divisor == 13 )
    {
        ASSERT( x < 212992 );  
        return ( x * 0x00004EC5 ) >> 18; 
    }
    else if ( divisor == 14 )
    {
        ASSERT( x < 57344 );  
        return ( x * 0x00012493 ) >> 20; 
    }
    else if ( divisor == 15 )
    {
        ASSERT( x < 74909 );  
        return ( x * 0x00008889 ) >> 19; 
    }
    else if ( divisor == 16 )
    {
        return x >> 4;
    }
    else
    {
        CANT_GET_HERE();
        return x / divisor;
    }
}

Note : an if/else tree is better than a switch() because we're branching on constants. This should all get removed by the compiler. Some compilers get confused by large switches, even on constants, and fail to remove them.

7 seems to be the worst number for all of these methods. It only works here up to 57344 (not quite 16 bits).

4 comments:

? said...

Hi,

Actually compiler already does this for you... ;)

For example this is what compiler generate for:

return a/15;

MSVC:
mov eax, -2004318071
mul DWORD PTR _a$[esp-4]
shr edx, 3
mov eax, edx

GCC:
9 0003 BA898888 movl $-2004318071, %edx
9 88
10 0008 8B4508 movl 8(%ebp), %eax
11 000b F7E2 mull %edx
12 000d C1EA03 shrl $3, %edx
13 0010 89D0 movl %edx, %eax

Anyway, check this out:
http://hackersdelight.org/magic.htm

BKLA

cbloom said...

Sigh.

christer said...

The canonical reference for this stuff is Granlund and Montgomery's "Division by invariant integers using multiplication" from 1994Ö

http://portal.acm.org/citation.cfm?id=178249

Peter Montgomery did the math, Torbjörn did the implementation in gcc.

cbloom said...

"2) most of them are designed to work for the full range of arguments. Often you know that you only need to work on a limited range of one parameter, and you can find much simpler versions for limited ranges. "

old rants