cbloom rants: 09-06-10 - Cross Platform SIMD

9/06/2010

I did a little SIMD "Lookup3" (Bob Jenkins' hash), and as a side effect, it made me realize that you can almost get away with cross platform SIMD these days. All the platforms do 16-byte SIMD, so that's nice and standard. The capabilities and best ways of doing things aren't exactly the same, but they're pretty close, and you can mostly cover the differences. Obviously to get super maximum optimal you would want to special case per platform, but even then having a base cross-platform SIMD implementation to start from would let you get all your platforms going more easily and identify the key spots to do platform specific work.

Certainly for "trivial SIMDifications" this works very well. Trivial SIMDification is when you have a scalar op that you are doing a lot of, and you change that to doing 4 of them at a time in parallel with SIMD. That is, you never do horizontal ops or anything else funny, just a direct substitution of scalar ops to 4-at-a-time vector ops. This works very uniformly on all platforms.

Basically you have something like :

U32 x,y,z;

    x += y;
    y -= z;
    x ^= y;

and all you do is change the data type :

simdU32 x,y,z;

    x += y;
    y -= z;
    x ^= y;

and now you are doing four of them at once.
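
A minimal sketch of that substitution, to make it concrete. Everything named here (SKETCH_SSE2, Load4, Store4, Mix) is made up for illustration; the SSE2 path uses the standard emmintrin.h intrinsics, and the other path is a plain scalar stand-in so the sketch still builds on non-x86 targets.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
#else
#define SKETCH_SSE2 0
#endif

typedef uint32_t U32;

// Hypothetical 4-lane U32 type, just enough for this example.
struct simdU32
{
#if SKETCH_SSE2
    __m128i m;
#else
    U32 m[4];
#endif
};

inline simdU32 & operator += ( simdU32 & a, const simdU32 & b )
{
#if SKETCH_SSE2
    a.m = _mm_add_epi32(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) a.m[i] += b.m[i];
#endif
    return a;
}

inline simdU32 & operator -= ( simdU32 & a, const simdU32 & b )
{
#if SKETCH_SSE2
    a.m = _mm_sub_epi32(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) a.m[i] -= b.m[i];
#endif
    return a;
}

inline simdU32 & operator ^= ( simdU32 & a, const simdU32 & b )
{
#if SKETCH_SSE2
    a.m = _mm_xor_si128(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) a.m[i] ^= b.m[i];
#endif
    return a;
}

// Unaligned load/store of four U32s.
inline simdU32 Load4( const U32 * p )
{
    simdU32 v;
#if SKETCH_SSE2
    v.m = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
#else
    std::memcpy(v.m, p, 16);
#endif
    return v;
}

inline void Store4( U32 * p, const simdU32 & v )
{
#if SKETCH_SSE2
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), v.m);
#else
    std::memcpy(p, v.m, 16);
#endif
}

// One source body serves both data types: U32 does one value, simdU32 does four.
template <typename T>
inline void Mix( T & x, T & y, const T & z )
{
    x += y;
    y -= z;
    x ^= y;
}
```

Calling Mix on three simdU32 values gives, lane by lane, the same results as calling it on the corresponding U32 values.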

The biggest issue I'm not sure about is how to define the data type.

From a machine point of view, the SIMD register doesn't have a type, so you might be inclined to just expose a typeless "qword" and then put the type information in the operator. eg. have a generic qword and then something like AddQwordF32() or AddQwordU16() . But this is sort of a silly argument. *All* registers are typeless, and yet we find data types in languages to be a convenient way of generating the right instructions and checking program correctness. So it seems like the ideal thing is to really have something like a qwordF32 type, etc for each way to use it.

The problem is how you actually do that. I'm a little scared that anything more than a typedef might lead to bad code generation. The simple typedef method is :


#if SPU
typedef qword simdval;
#elif XENON
typedef __vector4 simdval;
#elif SSE
typedef __m128 simdval;
#endif

But the problem is if you want to make them have their type information, like say :


typedef __vector4 simdF32;
typedef __vector4 simdU32;

then when you make an "operator +" for F32 and one for U32 - the compiler can't tell them apart. (if for no good reason you don't like operator +, you can pretend that says "Add"). The problem is the typedef is not really a first class type, it's just an alias, so it can't be used to select the right function call.
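
A small sketch of why the aliases can't be overloaded on. RawVec is a made-up stand-in for the platform vector type (qword, __vector4, __m128) so the example compiles anywhere, and the Add body is only a placeholder.

```cpp
#include <type_traits>

// Hypothetical stand-in for the platform's raw vector type.
struct RawVec { unsigned char bytes[16]; };

typedef RawVec simdF32;
typedef RawVec simdU32;

// The two names are one and the same type as far as the compiler is concerned.
static_assert(std::is_same<simdF32, simdU32>::value, "typedefs are aliases, not new types");

simdF32 Add( simdF32 a, simdF32 b );   // meant as a float add
simdU32 Add( simdU32 a, simdU32 b );   // meant as an integer add, but only re-declares the line above

// Only one body can exist (a second one would be a redefinition error),
// and it has no way to know which meaning the caller intended.
simdF32 Add( simdF32 a, simdF32 b )
{
    (void)b;
    return a;   // placeholder body
}
```

Taking the address of Add through either signature yields the same function, which is the whole problem.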

Of course one solution is to put the type in the function name, like AddF32, AddU32, etc., but I think that is generally bad code design because it ties operation to data, which should be as independent as possible, and it just creates unnecessary friction in the non-simd to simd port.

If you actually make them a proper type, like :


struct simdF32 { __vector4 m; };
struct simdU32 { __vector4 m; };

then you can do overloading to get the right operation from the data type, eg :


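// SSE version; assumes m is an __m128i in simdU32 and an __m128 in simdF32
RADFORCEINLINE simdU32 operator + ( const simdU32 lhs, const simdU32 rhs )
{
    simdU32 ret = { _mm_add_epi32(lhs.m,rhs.m) };
    return ret;
}

RADFORCEINLINE simdF32 operator + ( const simdF32 lhs, const simdF32 rhs )
{
    simdF32 ret = { _mm_add_ps(lhs.m,rhs.m) };
    return ret;
}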

The problem is that there is some reason to believe that anything but the fundamental type is not handled as well by the compiler. That is, qword,__vector4, etc. get special very good handling by the compiler, and anything else, even a struct which consists of nothing but that item, gets handled worse. I haven't actually seen this myself, but there are various stories around the net indicating this might be true.

I think the typedef way is just too weak to actually be useful; I have to go the struct way and hope that modern compilers can handle it. Fortunately GCC has the very good "vector" keyword thingy, so I don't have to do anything fancy there, and MSVC is now generally very good at handling mini objects (as long as everything inlines).
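
For reference, a compilable sketch of the struct route, under the assumption of an SSE2 target (with a scalar stand-in otherwise). Splat and Lane0 are hypothetical helpers added only so the result can be inspected; arguments are taken by const reference here to sidestep any by-value alignment issues.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
struct simdF32 { __m128  m; };
struct simdU32 { __m128i m; };
#else
#define SKETCH_SSE2 0
// Scalar stand-ins so the sketch still builds without SSE2.
struct simdF32 { float    m[4]; };
struct simdU32 { uint32_t m[4]; };
#endif

// Integer add is picked when both operands are simdU32 ...
inline simdU32 operator + ( const simdU32 & a, const simdU32 & b )
{
    simdU32 r;
#if SKETCH_SSE2
    r.m = _mm_add_epi32(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) r.m[i] = a.m[i] + b.m[i];
#endif
    return r;
}

// ... and float add when both are simdF32.
inline simdF32 operator + ( const simdF32 & a, const simdF32 & b )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_add_ps(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) r.m[i] = a.m[i] + b.m[i];
#endif
    return r;
}

// Broadcast one value to all four lanes.
inline simdF32 Splat( float f )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_set1_ps(f);
#else
    for (int i = 0; i < 4; i++) r.m[i] = f;
#endif
    return r;
}

inline simdU32 Splat( uint32_t u )
{
    simdU32 r;
#if SKETCH_SSE2
    r.m = _mm_set1_epi32(static_cast<int>(u));
#else
    for (int i = 0; i < 4; i++) r.m[i] = u;
#endif
    return r;
}

// Read back lane 0.
inline float Lane0( const simdF32 & v )
{
#if SKETCH_SSE2
    return _mm_cvtss_f32(v.m);
#else
    return v.m[0];
#endif
}

inline uint32_t Lane0( const simdU32 & v )
{
#if SKETCH_SSE2
    return static_cast<uint32_t>(_mm_cvtsi128_si32(v.m));
#else
    return v.m[0];
#endif
}
```

The same "a + b" in source picks a different instruction purely from the operand type, which is what the typedef version can't do.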

Another minor annoying issue is how to support reinterprets, e.g. I have this simdU32 and I want to use it as a simdU16 with no conversion. You can't use the standard C method of taking the address and dereferencing it as the other type (*(simdU16 *)&x), because that might go through memory, which is a big disaster.
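
One possible shape for reinterprets, sketched for SSE2 only: the _mm_castps_si128 / _mm_castsi128_ps intrinsics change the type without generating any instruction. On SSE the integer types all share __m128i, so a simdU32 to simdU16 reinterpret would just rewrap the same member; the float/int case is the one shown. AsU32, AsF32, SplatF32 and Lane0 are made-up names, and the non-SSE path is a plain memcpy stand-in.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
struct simdF32 { __m128  m; };
struct simdU32 { __m128i m; };
#else
#define SKETCH_SSE2 0
struct simdF32 { float    m[4]; };
struct simdU32 { uint32_t m[4]; };
#endif

// Same 128 bits, new type; no value conversion happens.
inline simdU32 AsU32( const simdF32 & v )
{
    simdU32 r;
#if SKETCH_SSE2
    r.m = _mm_castps_si128(v.m);
#else
    std::memcpy(r.m, v.m, sizeof(r.m));
#endif
    return r;
}

inline simdF32 AsF32( const simdU32 & v )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_castsi128_ps(v.m);
#else
    std::memcpy(r.m, v.m, sizeof(r.m));
#endif
    return r;
}

// Helpers for building and inspecting values.
inline simdF32 SplatF32( float f )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_set1_ps(f);
#else
    for (int i = 0; i < 4; i++) r.m[i] = f;
#endif
    return r;
}

inline uint32_t Lane0( const simdU32 & v )
{
#if SKETCH_SSE2
    return static_cast<uint32_t>(_mm_cvtsi128_si32(v.m));
#else
    return v.m[0];
#endif
}

inline float Lane0( const simdF32 & v )
{
#if SKETCH_SSE2
    return _mm_cvtss_f32(v.m);
#else
    return v.m[0];
#endif
}
```

Reinterpreting a splatted 1.0f as U32 gives the IEEE bit pattern 0x3F800000 in every lane, and going back again returns the original float.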

And the last little issue is whether to provide conversion to a typeless qword. One argument for that would be that things like Load() and Store() could be implemented just from qword, and then you could fill all your data types from that if they have conversion. But if you allow implicit conversion to/from typeless, then all your types can be directly made into each other. That usually confuses the hell out of function overloading among other woes.

24 comments:

Mojo said...: Struct containing a single simd value actually works well. It's been quite a while since I've noticed any extraneous loads&stores like the older compilers used to do.

Casting does bork the compiler sometimes though. An inline method which returns a reference to the underlying type works pretty well, the temporary copies can be optimized away.; September 7, 2010 at 2:23 AM
jeskola said...: VC generates very good code for my float4 and int4 structs most of the time. It can handle this too without using memory:

inline int4 float4::reinterpret_int4() const { return int4(*(__m128i *)&x); }

Sometimes it doesn't seem to believe my __restricts though.; September 7, 2010 at 7:24 AM
castano said...: It's been a long time, but IIRC one problem is that msvc does not align struct function arguments properly when passed by value, so you have to be very careful if you rely on that. However, if you use the __m128 data type, the compiler does the right thing. You would think that the align keyword would do the same, but instead it simply gives you an infuriating error when passing aligned structs by value.; September 7, 2010 at 10:29 AM
cbloom said...: "inline int4 float4::reinterpret_int4() const { return int4(*(__m128i *)&x); }"

Yeah this piece concerns me, but I guess I don't have much of a choice for that; have to just do it and cross my fingers.

There are alternatives :

I could call an instruction like OR-with-self or something and hope that gets optimized out.

I could also use the move-through-union method.; September 7, 2010 at 11:19 AM
cbloom said...: "Sometimes it doesn't seem to believe my __restricts though."

Ugh I get this in MSVC and it's infuriating. I spent all of yesterday trying various tricks to make it stop storing temporaries to memory after each loop iteration and couldn't get it to stop.; September 7, 2010 at 11:21 AM
cbloom said...: "msvc does not align struct function arguments properly when passed by value, so you have to be very careful if you rely on that. However, if you use the __m128 data type, the compiler does the right thing."

Yeah there is some problem with this. Also x64 has weird rules about passing m128's.

But I think all this goes away if I just make all my functions FORCEINLINE.

Of course that's not really what you want for more complex functions.; September 7, 2010 at 11:24 AM
cbloom said...: Another few little open questions to me :

do I make separate simdU32 and simdS32 ?

how about variable names? vecU32 ? quadU32 ?; September 7, 2010 at 11:27 AM
jeskola said...: At least simple loops like this usually work well:

void test(float4 * __restrict pf, int4 * __restrict pi, int n)
{
    for (int i = 0; i < n; i++)
        pf[i] = ((pf[i] * 123.0f).reinterpret_int4() ^ pi[i]).reinterpret_float4();
}

loop:
movaps xmm0, XMMWORD PTR [eax]
movdqa xmm2, XMMWORD PTR [ecx+eax]
mulps xmm0, xmm1
pxor xmm0, xmm2
movdqa XMMWORD PTR [eax], xmm0
add eax, 16
dec edx
jne SHORT loop

This looks close to optimal.; September 7, 2010 at 12:52 PM
won3d said...: Alignment is a pain.

Note that there can be hidden state in SSE registers. I believe this is true for K8 and K10, and it might be true for Core i (Core 2 doesn't seem to be affected). I think it has to do with the subnormal state (or some other floating point sub type) which can be cleared if you do certain integer operations, so there might be a penalty to the next floating point op you do.

You'd really have to work hard to do this, though. I happened to be writing some fast r^(-3/2) code for a gravity simulator.

Does MSVC do autovectorization? GCC's has improved greatly recently. I would even consider implementing SIMD primitives as unrolled loops and depending on the vectorizer for that.; September 7, 2010 at 2:21 PM
cbloom said...: "Does MSVC do autovectorization? GCC's has improved greatly recently. I would even consider implementing SIMD primitives as unrolled loops and depending on the vectorizer for that. "

Relying on the compiler to do anything complex is not really viable IMO without some ability to compile-time-assert that it is happening.; September 7, 2010 at 2:29 PM
ryg said...: "I think it has to do with the subnormal state (or some other floating point sub type) which can be cleared if you do certain integer operations, so there might be a penalty to the next floating point op you do."
At some point AMD had shadow/tag bits for this that needed to be recalculated (at a small penalty) on a data type switch. The Core i series does have a penalty for mixing data types too, but for a different reason: SIMD int and FP units are separate and there's a 1 cycle bypass delay to move data across the chip.

"Does MSVC do autovectorization?"
Not that I'm aware of. Not a big fan of this kind of optimization anyway - it tends to work well on simple loops but is very brittle and easy to break by changes that shouldn't make a difference. That's the worst kind of optimization to work with - high variance in execution time between similar versions of source code, unpredictable at the source level, and with lots of external requirements (e.g. alignment restrictions) that are easy to break from a distance without noticing it.; September 7, 2010 at 9:12 PM
ryg said...: Correction: Core2 was the one with the 1-cycle data bypass delay for mixing types, Core i has 2-cycle delays between some units.; September 7, 2010 at 9:15 PM
Jeff Roberts said...: Charles, radvec4.h has a bunch of this awkwardly abstracted...; September 7, 2010 at 9:51 PM
Sam Martin said...: About 4-5 years ago I built a simd vector library for Lionhead using the typedef approach (I believe they still use it) and we also take the same approach at Geomerics.

I spent quite a while looking at the other options, but the generated code on the 3 platforms by the compilers at that point was shocking for anything other than a typedef. This may have changed since, but my gut feeling is that typedefs are still the way forward.

IMO, the lack of some type safety is not really that a big thing in practice - not worth the additional upheaval at any rate.

There are other pros and cons though:

+ you can write fairly decent simd vector code in a nice cross platform style. It doesn't replace platform-specific optimisation, but it's a good first pass.

+ there is a surprising amount of common functionality between the 3 main simd targets beyond the usual */+-. Many minor differences can be abstracted.

+ alignment is painful. You have to fallback to other vector types for unaligned data.

- there are some cross platform hurdles. Xbox declares all the operators for __vector4 in the global namespace for example. Plus minor compiler bugs/quibbles.

- it's way too easy to write unperformant code on platforms without a unified register set by transferring things between floats/ints/vectors. But avoiding this can lead to obfuscated code.

So in summary it's great for simd-ising loops and so on, but in retrospect I'm not sure the (potential) performance gains it offers are worth the costs of using it as a general purpose vector library. I think a straightforward 4-element float array still has the advantage. Not an obvious call though.; September 8, 2010 at 2:58 AM
won3d said...: ryg, thanks for the info! When are you going to start your blog?; September 8, 2010 at 8:58 AM
castano said...: "ryg, thanks for the info! When are you going to start your blog?"

The ryg blog; September 8, 2010 at 11:00 AM
cbloom said...: Sam, thanks for the notes!

"IMO, the lack of some type safety is not really that a big thing in practice - not worth the additional upheaval at any rate."

Well, it does one huge thing, which is to let me use operator+. Without strong types I have to do Add4I , Add4F , etc.

"+ alignment is painful."

I wish I could disable loading my simd types from pointers. eg.

val = *ptr;

is forbidden and you have to manually call LoadAligned() or LoadUnaligned().; September 8, 2010 at 12:08 PM
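
A rough sketch of the explicit-load convention from the comment above. It does not actually forbid dereferencing a pointer to the simd type; it only shows the style where memory is always held as plain element arrays and every load names its alignment. LoadAligned, LoadUnaligned and StoreUnaligned are assumed names; the SSE2 path uses the standard _mm_load_si128 / _mm_loadu_si128 / _mm_storeu_si128 intrinsics.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
struct simdU32 { __m128i m; };
#else
#define SKETCH_SSE2 0
struct simdU32 { uint32_t m[4]; };
#endif

// p must be 16-byte aligned; this is only checked in debug builds.
inline simdU32 LoadAligned( const uint32_t * p )
{
    assert((reinterpret_cast<std::uintptr_t>(p) & 15) == 0);
    simdU32 r;
#if SKETCH_SSE2
    r.m = _mm_load_si128(reinterpret_cast<const __m128i *>(p));
#else
    std::memcpy(r.m, p, 16);
#endif
    return r;
}

// No alignment requirement on p.
inline simdU32 LoadUnaligned( const uint32_t * p )
{
    simdU32 r;
#if SKETCH_SSE2
    r.m = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
#else
    std::memcpy(r.m, p, 16);
#endif
    return r;
}

inline void StoreUnaligned( uint32_t * p, const simdU32 & v )
{
#if SKETCH_SSE2
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), v.m);
#else
    std::memcpy(p, v.m, 16);
#endif
}
```

With an alignas(16) buffer, LoadAligned(buf) is valid while buf+1 has to go through LoadUnaligned.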
ryg said...: Did something like Sam too (albeit more recently), works just fine. I ended up only supporting floats (+logical/compare ops) which sidesteps the type safety issue entirely.

For float the architectures are fairly close to each other, enough to paper over the differences by just exposing the important primitives and emulating them with multi-instruction sequences if necessary (e.g. madd -> mul+add on x86, unaligned loads on Xenon/PS3, or a "splat individual element" primitive that generates the necessary shuffle/permutation masks on x86/PS3).

Integer is more of an issue. Xenon has severely gutted integer SIMD (no int multiplies at all!) and the instructions have bigger differences between the architectures in general. For example, PPC vector shifts are always variable shifts with separate shift amount per vector element, x86 has either an immediate operand or a register parameter, but the shift amount is always the same for all elements. The PPC shifts are sufficiently more expressive to make me want to use them, but that doesn't map well to x86 at all. For 32-bit elements shufps (x86) / vpermwi (Xenon) is usually enough to get by, but for 8- and 16-bit I often really want to use vperm and you don't get that on x86, which usually means a very different dataflow. Most of the integer min/max stuff is only available in fairly recent x86 processors, and same for "horizontal" ops.

Once you take all that out, you're basically down to add/sub (both without carry-out) and shifts with compile-time constant amounts. That's a useless enough subset for me to just not bother :); September 8, 2010 at 9:22 PM
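
A sketch of two of the float primitives mentioned in the comment above, as they might look on x86: a madd emulated with mul+add, and a "splat individual element" that builds its shuffle mask at compile time. MulAdd, SplatLane, Set4 and Store4 are hypothetical names; only the SSE path is real intrinsics, the other path is a scalar stand-in.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
struct simdF32 { __m128 m; };
#else
#define SKETCH_SSE2 0
struct simdF32 { float m[4]; };
#endif

// Build a vector with a in lane 0 through d in lane 3.
inline simdF32 Set4( float a, float b, float c, float d )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_setr_ps(a, b, c, d);
#else
    r.m[0] = a; r.m[1] = b; r.m[2] = c; r.m[3] = d;
#endif
    return r;
}

inline void Store4( float * p, const simdF32 & v )
{
#if SKETCH_SSE2
    _mm_storeu_ps(p, v.m);
#else
    for (int i = 0; i < 4; i++) p[i] = v.m[i];
#endif
}

// a*b + c. SSE2 has no multiply-add instruction, so this is a mul followed by an add.
inline simdF32 MulAdd( const simdF32 & a, const simdF32 & b, const simdF32 & c )
{
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_add_ps(_mm_mul_ps(a.m, b.m), c.m);
#else
    for (int i = 0; i < 4; i++) r.m[i] = a.m[i] * b.m[i] + c.m[i];
#endif
    return r;
}

// Broadcast lane I (0..3) to all four lanes; the shuffle mask is derived from I.
template <int I>
inline simdF32 SplatLane( const simdF32 & v )
{
    static_assert(I >= 0 && I < 4, "lane index out of range");
    simdF32 r;
#if SKETCH_SSE2
    r.m = _mm_shuffle_ps(v.m, v.m, _MM_SHUFFLE(I, I, I, I));
#else
    for (int i = 0; i < 4; i++) r.m[i] = v.m[I];
#endif
    return r;
}
```

A platform with a real madd or a dedicated splat instruction would swap in a single intrinsic behind the same two signatures.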
ryg said...: ...although if you throw in some unpacks it's enough to get through most of the pixel processing in H.264. But that's an exception :); September 8, 2010 at 9:25 PM
cbloom said...: "Once you take all that out, you're basically down to add/sub (both without carry-out) and shifts with compile-time constant amounts. That's a useless enough subset for me to just not bother :)"

Eh, I sort of thought that, but when I was writing the exact same code for the 4th time it occurred to me that this is not the way it should be.

The common stuff is enough to do SIMD hashes, SIMD PNG filters, various simple pixel processing, etc. I think it's probably enough that I can SIMD almost everything I need to in Oodle in a cross-platform way (DXTC encoder, lossless PNG-alike, lossy DCT image compressor, hash, etc.)

If nothing else, I think just having the common typedef for your function protos and for loads & stores and all that basic stuff would save massive amounts of duplication. It would let you do the "#if X86" on the inside of the function where it really matters rather than duplicating the whole code flow path for each platform (which not only is more typing but creates fragile code that is hard to maintain and prone to bugs).; September 8, 2010 at 10:31 PM
ryg said...: "The common stuff is enough to do SIMD hashes, SIMD PNG filters, various simple pixel processing, etc."
Okay, that's way more integer-heavy than the stuff I dealt with. I mostly wanted this for some rendering / animation / collision stuff, and for all that you don't really need integer beyond logical ops anyway.

"If nothing else, I think just having the common typedef for your function protos and for loads & stores and all that basic stuff would save massive amounts of duplication."
Yeah it does, and I used that in several places (e.g. use the dot product instrs on Xenon where you have them, otherwise do a 4x4 transpose + mul/3x madd). It's very nice to be able to drop a couple platform-specific intrinsics in there when they're the best choice, without duplicating the whole thing.

Not too fond of "typesafe" vector stuff in general. It sounds like a good idea, but both the default AltiVec intrinsics and the typed (spu_*) SPU intrinsics are just a PITA to use. It just gets messy, particularly with compare results ("vector bool short"? Yeah right) and unpack-style operations when you don't want to change the data type, just interleave two halves.; September 8, 2010 at 11:31 PM
cbloom said...: "It sounds like a good idea, but both the default AltiVec intrinsics and the typed (spu_*) SPU intrinsics are just a PITA to use."

Yeah I hated the spu_ stuff so much that I mostly just used the raw si_ stuff.

But I'm not sure if that's because it's a bad idea or because it's just a bad implementation.

I think maybe I could have the best of both worlds.

Make an untyped generic simd and Add4F() blah blah calls. Also make a typed simd and provide reinterpret ops to the generic. Let them interop painlessly.

Time to write some code...; September 8, 2010 at 11:56 PM
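
One way the "best of both worlds" layering from the comment above could be sketched: an untyped 128-bit value with the element type in the function name, plus thin typed wrappers whose conversions to and from the untyped value are explicit only. This is a guess at the shape, not the code that was eventually written; simd128, Splat4I and Lane0 are made-up names, and Add4I / Add4F follow the naming used in the comments.

```cpp
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
struct simd128 { __m128i m; };
#else
#define SKETCH_SSE2 0
struct simd128 { uint32_t m[4]; };
#endif

// Untyped layer: the element type lives in the function name.
inline simd128 Add4I( const simd128 & a, const simd128 & b )
{
    simd128 r;
#if SKETCH_SSE2
    r.m = _mm_add_epi32(a.m, b.m);
#else
    for (int i = 0; i < 4; i++) r.m[i] = a.m[i] + b.m[i];
#endif
    return r;
}

inline simd128 Add4F( const simd128 & a, const simd128 & b )
{
    simd128 r;
#if SKETCH_SSE2
    r.m = _mm_castps_si128(_mm_add_ps(_mm_castsi128_ps(a.m), _mm_castsi128_ps(b.m)));
#else
    for (int i = 0; i < 4; i++)
    {
        float fa, fb;
        std::memcpy(&fa, &a.m[i], 4);
        std::memcpy(&fb, &b.m[i], 4);
        fa += fb;
        std::memcpy(&r.m[i], &fa, 4);
    }
#endif
    return r;
}

// Typed layer: conversion from the raw type is explicit, so overloads stay unambiguous.
struct simdU32 { simd128 raw; explicit simdU32( const simd128 & r ) : raw(r) { } };
struct simdF32 { simd128 raw; explicit simdF32( const simd128 & r ) : raw(r) { } };

inline simdU32 operator + ( const simdU32 & a, const simdU32 & b ) { return simdU32(Add4I(a.raw, b.raw)); }
inline simdF32 operator + ( const simdF32 & a, const simdF32 & b ) { return simdF32(Add4F(a.raw, b.raw)); }

// Helpers for building and inspecting raw values.
inline simd128 Splat4I( uint32_t u )
{
    simd128 r;
#if SKETCH_SSE2
    r.m = _mm_set1_epi32(static_cast<int>(u));
#else
    for (int i = 0; i < 4; i++) r.m[i] = u;
#endif
    return r;
}

inline uint32_t Lane0( const simd128 & v )
{
#if SKETCH_SSE2
    return static_cast<uint32_t>(_mm_cvtsi128_si32(v.m));
#else
    return v.m[0];
#endif
}
```

Code that wants operator + wraps the raw value in the type it means; code that doesn't care calls Add4I or Add4F directly, and the .raw member is the way back to the generic value.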
Anonymous said...: >> Ryg: Integer is more of an issue.
>> Xenon has severely gutted integer
>> SIMD [...]

Indeed. Usually in such cases I end up developing an algorithm using plain C, using it for the PC version and writing the SIMD-ed version for both consoles.; September 9, 2010 at 6:13 AM
