12-05-08 - 64 Bit Multiply

Something that I've found myself wanting to do a lot recently is multiply two 32 bit numbers, and then take the top 32 bit dword from the 64 bit result. In C this looks like :

U32 mul32by32returnHi(U32 a,U32 b)
    U64 product = (U64) a * b;
    U32 ret = (U32)(product >> 32);
    return ret;

That works fine and all, but the C compiler doesn't understand that you're doing something very simple. It generates absolutely disgusting code. In particular, it actually promotes a & b to 64 bit, and then calls a function called "_allmul" in the Windows NT RTL. This allmul function does a bunch of carries and such to make 64-by-64 bit multiply work via the 32+32->64 multiply instruction in x86. You wind up with a function that takes 60 clocks when it could take 6 clocks :

U32 mul32by32returnHi(U32 a,U32 b)
        mov eax, a
        mul b
        mov eax,edx

Now, that's nice and all, the problem is that tiny inline asm functions just don't work in MSVC these days. Calling a big chunk of ASM is fine, but using a tiny bit of ASM causes the optimizer to disable all its really fancy optimizations, and you wind up with a function that's slower overall.

What I'd like is a way to get at some of the x86 capabilities that you can't easily get from C. The other thing I often find myself wanting is a way to get at the carry flag, or something like "sbb" or "adc". The main uses of those are Terje's cool tricks to get the sign of an int or add with saturation

It would be awesome if you could define your own little mini assembly functions that acted just like the built-in "intrinsics". It would be totally fine if you had to add a bunch of extra data to tell the compiler about dependencies or side effects or whatever, if you could just make little assembly widgets that worked as neatly as their own intrinsics it would be totally awesome for little ops.

ADDENDUM : urg ; so I tested Won's suggestion of Int32x32To64 and found it was doing the right thing (actually UInt32x32To64 - and BTW that's just a macro that's identical to my first code snippet - it doesn't seem to be a magic intrinsic, though it does have platform switches so you should probably use that macro). That confused me and didn't match what I was seeing, so I did some more tests...

It turns out the first code snippet of mul32by32returnHi actually *will* compile to the right thing - but only if you call it from simple functions. If you call it from a complex function the compiler gets confused, but if you call it from a little tight test loop function it does the right thing. URG.

Here's what it does if I try to use it in my StringHash interpolation search test code :

            int start = mul32by32returnHi( query.hash, indexrange );
004087A5  xor         eax,eax 
004087A7  push        eax  
004087A8  mov         eax,dword ptr [esp+38h] 
004087AC  push        eax  
004087AD  push        0    
004087AF  push        ebx  
004087B0  mov         dword ptr [esi],ebx 
004087B2  call        _allmul (40EC00h) 
004087B7  mov         ecx,dword ptr [esp+14h] 

You can see it's extending the dwords to 64 bits, pushing them all, calling the function, then grabbing just one dword from the results. Ugh.

And here's what it does in a tight little test loop :

      U32 dw = *dwp++;
      hash = mul32by32returnHi(hash,dw) ^ dw;
00401120  mov         eax,ecx 
00401122  mul         eax,edx 
00401124  add         esi,4 
00401127  xor         edx,ecx 
00401129  mov         ecx,dword ptr [esi] 

Which not only is correct use of 'mul' but also has got crazy reordering of the loop which makes it a lot faster than my inline ASM version.


Autodidactic Asphyxiation said...

Does this work?


Metatron said...

> ... if you could just make little assembly
> widgets that worked as neatly as their own
> intrinsics it would be totally awesome for
> little ops.

Well that's how it's implemented in the gcc, for every assembler-block you write down dependencies, in/out and register-trashing etc.; you also may give variable registers, so the optimizing backend is able to reassign registers and rearrange your assembler-block, and is also able to pipeline-optimize it (means he rips it in pieces and puts it anywhere he likes and which does not violate dependencies).

It's sad that in regard to assembler the VC is (got) a piece of sh*t, if I would have to compile embedded code for a proprietary chip, I wouldn't even start "arguing" which one to use.

Not to speak about x64-assembler, 50% of my white hair is probably related to that one issue.

sharokh said...

if i want multiply 2 64bit number, what i do?
i want assembly code
anybody can help me?
thanx alot

old rants