
08-06-10 - More SPU

1. CWD instruction aligns the position you give it to the size of the type (byte,word,dword). God dammit. I don't see this documented anywhere. CWD is supposed to generate a shuffle key for word insertion at a position (regA). In fact it generates a shuffle key for insertion at position ( regA & ~3 ). That means I can't use it to do arbitrary four byte moves.

2. Shufb is not mod 32. Presumably because it has those special 0 and FF selections. This is not a huge deal because you can fix it by doing AND splats(31) , but that is enough slowdown that it ruins shufb as a fast path for doing byte indexing. (people use rotqby instead because that is mod 16).
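
For concreteness, the fix looks something like this (a minimal sketch using the spu_intrinsics.h names; the function name is mine) :

    static inline vec_uchar16 shuffle_mod32( vec_uchar16 a, vec_uchar16 b, vec_uchar16 control )
    {
        // force each control byte into [0,31] so shufb indexes the 32 source bytes
        // the way a mod-32 shuffle would ; the extra AND is what kills the fast path
        return spu_shuffle( a, b, spu_and( control, spu_splats( (unsigned char) 31 ) ) );
    }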

A related problem for Shufb is that there's no Byte Add. If you had byte add, and shufb was mod 32, then you could grab a 16-byte portion of two quads just by adding an offset to the identity shuffle mask.

In order to deal with this, you have to first mod your offset down to [0,15] so that you won't overflow, then you have to see if your original offset before modding had the 16's bit set, and if so, swap the register order you pass to shuffle. If you had a byte add and shufb were mod 32, you wouldn't have to do any of that, and it would be way faster to do grabs of arbitrary qwords from two neighboring qwords.

(also there's no fast way to generate the typical shuffle mask {00,01,02,03, ...}. )
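
Here's a sketch of the whole dance, assuming a 32-byte circular window where "lo" holds bytes [0,15] and "hi" holds bytes [16,31] (spu_intrinsics.h names; grab_qword is my name, and the halfword add is only safe because the per-byte sums never exceed 30, so nothing carries across a byte lane) :

    static inline vec_uchar16 grab_qword( vec_uchar16 lo, vec_uchar16 hi, unsigned int offset )
    {
        const vec_uchar16 identity =
            (vec_uchar16){ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };
        // mod down to [0,15] so the control bytes stay in [0,31]
        unsigned int o = offset & 15;
        // no byte add, so fake it with a halfword add of a byte-replicated splat
        vec_ushort8 bump = spu_splats( (unsigned short)( (o << 8) | o ) );
        vec_uchar16 control = (vec_uchar16) spu_add( (vec_ushort8) identity, bump );
        // if the original offset had the 16's bit set, swap the source order so the
        // selectors that run past 15 wrap around into the other register
        return ( offset & 16 ) ? spu_shuffle( hi, lo, control )
                               : spu_shuffle( lo, hi, control );
    }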

3. You can make splats a few different ways; one is to rely on the "spu_splats" builtin, which figures out how to spread the value (usually using FSMB or some such variant). But you can also write it directly using something like :

(vec_int4) { -1,-1,-1,-1 }

which the compiler will either turn into loading a constant or some code to generate it, depending on what it thinks is faster.

Some important constants are hard to generate, the most common being (vec_uchar16){0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} , so you might want to load that into a register and then pass it around if you need it much. (if you are using the stdlib memcpy at all, you can also get it from __Shuffles).
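
For example, something like this built once up front and handed into the hot loop (the variable name is mine) :

    const vec_uchar16 c_shuffle_identity =
        (vec_uchar16){ 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 };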

4. Something that I always want with data compression and bit manipulation is a "shift right with 1 bits coming in". Another one is bit extract and bit insert : grab N bits from position M, or put N bits back at position M. Also just "mask with N bits" would always be nice, so I don't have to do ((1 << N)-1) (or whatever) to generate masks. Obviously the SPU is not intended for bit manipulation. I'm sure getting Blu-Ray working on it was a nightmare.
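
For reference, these are the scalar helpers I keep writing by hand; plain C sketches, nothing SPU-specific, and only valid for shift counts in [0,31] :

    // shift right with 1 bits coming in at the top
    static inline unsigned int shr_ones( unsigned int x, int n )
    {
        return ~( (~x) >> n );
    }

    // grab N bits starting at bit M
    static inline unsigned int bit_extract( unsigned int x, int m, int n )
    {
        return ( x >> m ) & ( (1u << n) - 1 );
    }

    // put the low N bits of v at position M in x
    static inline unsigned int bit_insert( unsigned int x, unsigned int v, int m, int n )
    {
        unsigned int mask = ( (1u << n) - 1 ) << m;
        return ( x & ~mask ) | ( ( v << m ) & mask );
    }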

5. If you build with -fpic (to make your ELF relocatable), then all your static addresses are relative to r126, which means if you just reference some static in your inner loop it will do a ton of math all the time (the add of r126 is not a big deal, but just fetching the address is a lot of work) ; as usual if you just copy it out to a local :


   combinedDecodeTable = s_combinedDecodeTable;

the compiler will "optimize" out the copy and will do all the math to compute s_combinedDecodeTable in your inner loop. So you have to use the "volatile trick" :

    volatile UINTa v_hack;
    v_hack = (UINTa) s_combinedDecodeTable;
    const CombinedDecodeTable * RADRESTRICT const combinedDecodeTable = (const CombinedDecodeTable * RADRESTRICT const) v_hack;

sigh.

6. memcpy seems to be treated differently than __builtin_memcpy by the compiler. They both call the same code, but sometimes calling memcpy() doesn't inline while __builtin_memcpy does. I don't know WTF is up with that (and yes I have -fbuiltins set and all that shit, but maybe there's some GCC juju I'm missing). Also, it is a decent SPU-optimized memcpy, but the compiler doesn't have true "builtin" knowledge of memcpy for SPU the way it does on some platforms - eg. it doesn't know that for types that are inherently 16-byte aligned it can avoid the rollup/rolldown ; it just always calls memcpy. So if you're doing a lot of memcpys of 16-byte aligned stuff you should probably have your own.
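
As a sketch, the aligned version can be as dumb as this (assuming both pointers and the size really are multiples of 16; the name is mine) :

    static inline void memcpy_qwords( void * to, const void * from, int bytes )
    {
        vec_uchar16 * d = (vec_uchar16 *) to;
        const vec_uchar16 * s = (const vec_uchar16 *) from;
        int n = bytes >> 4;
        while ( n-- )
            *d++ = *s++;    // straight qword loads & stores, no rollup/rolldown
    }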

7. This one is a fucking nightmare : I put in some "scheduling barriers" (asm volatile empty) to help me look at the disassembly for optimization (it makes it much easier to trace through because shit isn't spread all over). After debugging a bit, I forgot to take out the scheduling barriers and ran my speed test again -

it was faster.

Right now my code is 5-10% faster with a few scheduling barriers manually inserted (19 clocks per symbol vs 21). That's fucking bananas, it means the compiler's scheduler is fucking up somehow. It sucks because there's no way for me to place them in a systematic way, I have to randomly try moving them around and see what's fastest.
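
For reference, the "scheduling barrier" here is just the usual empty volatile asm, something like this (clobber list to taste) :

    #define SCHEDULING_BARRIER()    __asm__ volatile ( "" : : : "memory" )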

8. Lots of little things are painfully important on the SPU. For example, obviously most of your functions should be inline, but sometimes you have to make them forceinline (and it's a HUGE difference if you don't because once a val goes through the stack it really fucks you), but then sometimes it's faster if it's not inline. Another nasty one is manual scheduling. Obviously the compiler does a lot of scheduling for you (though it fucks up as noted previously), but in many cases it can't figure out that two ops are actually commutative. eg. say you have something like A() which does shared++ , and B() also does shared++ , then the compiler can't tell that the sequence {A B} could be reordered to {B A} , so you have to manually try both ways. Another case is :


A()

if ( x )
{
    B()
    .. more ..
}
else
{
    B()
    .. more ..
}

and again the compiler can't figure out that the B() could be hoisted out of the branches (more normally you would probably write it with B outside the branch, and the compiler won't bring it in). It might be faster inside the branches or outside, so you have to try both. Branch order reorganization can also help a lot. For example :

if ( x )
    if ( y )
        .. xy case
    else
        .. xNy case
else
    if ( y )
        .. Nxy case
    else
        .. NxNy case

obviously you could instead do the if(y) first then the if(x). Again the compiler can't do this for you and it can make a big difference.

This kind of shit affects every platform of course, but on a nice decent humane architecture like a Core iX/x86 it's < 1%. On SPU it's 5-10% which is nothing to sneeze at and forces you to do this annoying shit.

9. One thing that's annoying in general about the SPU (and even to some extent the PPU) is that some code patterns can be *SO* slow that your non-inner-loop parts of the code dominate.

On the wonderful PC/x86 , you basically can just find your hot spots and optimize those little routines, and ignore all your glue and startup code, cuz it will be reasonably fast. Not true on the SPU. I had some setup code that built some acceleration tables that's only called once, and then an inner loop that's called two hundred thousand times. The setup code takes 10% of the total time. On the PC it doesn't even show up as a blip. (On the Xenon you can get this same sort of thing if your setup code happens to do signed-16-bit loads, variable shifts, or load-hit-stores). On the SPU my setup code that was so bad was doing lots of byte reads & writes, and lots of branches.

The point is on the PC/x86 you can find 10% of your code base that you have to worry about for speed, and the other 90% you just make work. On the SPU you have to be aware of speed on almost all of your code. Blurg that blows so bad.

10. More generally, I found it just crucial to have my SPU ELF reloader running all the time. Every time I touch the code in any way, I hit "build" and then look over and see my speed, because my test app is just running on the PS3 all the time, reloading the SPU ELF and running the test. Any time you do the most innocuous thing - you move some code around, change some variable types - check the speed, because it may have made a surprisingly big difference. (*)

(* = there is a disadvantage to this style of optimization, because it does lead you into local minima; that is, I wind up doing a greedy perturbation search in the code space; sometimes in order to find the real minima you have to take a big step backwards).

11. For testing, be careful about what args you pass to system_init ; make sure you aren't sharing your SPUs with the management processes. If you see unreliable timing, something is wrong. (5,1) seems to work okay.

3 comments:

Krzysztof Narkowicz said...

6. Maybe you should try splitting loads and calculations (http://cellperformance.beyond3d.com/articles/2006/04/a-practical-gcc-trick-to-use-during-optimization.html)? Although from my limited experience, the SPU compiler is so bad that any random and strange code change can lead to increased performance :).

ryg said...

"Right now my code is 5-10% faster with a few scheduling barriers manually inserted (19 clocks per symbol vs 21). That's fucking bananas, it means the compiler's scheduler is fucking up somehow."
The GCC scheduler isn't very good.

There's SPA (SPU Pipelining Assembler) which does pretty good register allocation, unrolling and scheduling (software pipelining to work around latencies). But it doesn't integrate well with C code at all; if you want to try it, you basically have to port that function and everything below it to SPA, which is a maintenance nightmare for code that you occasionally want to touch. It's particularly bad if you're targeting multiple platforms, so you absolutely positively need the C++ code version alongside.

That said, once you've distilled your code down to relatively simple loops, it works a lot better to just write it in ASM without worrying about scheduling and unrolling, then let SPA figure out the rest. It's a lot less frustrating than tweaking C++ code in obscure ways to get the compiler to do what you want, anyway.

cbloom said...

" Maybe you should try splitting loads and calculations (http://cellperformance.beyond3d.com/articles/2006/04/a-practical-gcc-trick-to-use-during-optimization.html)"

To be clear : Mike's post is about using the GCC scheduling trick to make it easier to read the assembly so that you can see what's going on and then optimize it.

That's what I was doing when I discovered that it had randomly gotten faster.
