08-04-10 - Initial Learnings of SPU

  • Yes I know I'm 4 (5?) years late.

  • The SPU decrementer runs at the same frequency as mftb on the PPU ; that's 79.8 MHz.
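
    A minimal way to read it, assuming the Cell SDK's spu_mfcio.h helpers (note it counts *down*) :
    
    #include <spu_mfcio.h>  // for spu_write_decrementer / spu_read_decrementer
    
    spu_write_decrementer( 0xFFFFFFFF );        // start it high
    unsigned int t0 = spu_read_decrementer();
    // ... code to time ...
    unsigned int t1 = spu_read_decrementer();
    unsigned int ticks = t0 - t1;               // ticks at 79.8 MHz , same units as mftb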

  • There are two different ways the intrinsics are exposed : the spu_ ones are C++, overloaded and supposed to be type-aware; the si_ ones just expose the instructions directly and are untyped. The spu_ intrinsics are kind of annoying because they're overloaded C++ but don't get the overloads right, so you get compile errors all the time. The vec_ types don't really seem to work right with the intrinsics either. The best practice I've found is to just use the untyped "qword" and the "si_" intrinsics, and cast to a vec_ type if you want to grab a value out (like ((vec_uint4)w)[0] ).
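
    For example, a tiny sketch of that style :
    
    qword a = si_from_int( 7 );
    qword b = si_from_int( 5 );
    qword s = si_a( a, b );                 // word add ; slot 0 holds the int result
    unsigned int r = ((vec_uint4)s)[0];     // cast to a vec_ type to pull the value out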

  • SPU GCC seems to have the same problem as the PPU GCC and the Xenon MSVC compiler : all of them incorrectly optimize out assignments into locals of a different type. I've never seen this problem with MSVC on x86, but maybe that's just because x86 runs at about the same speed no matter what type a variable is. On the PowerPC and SPU, though, performance turns out to be very sensitive to data types. So, I have lots of code that does something like :
    
    int x = s.array[7];     // s is a struct somewhere in memory
    
    for(... many .. )
    {
        .. do stuff with x ..
    }
    
    
    and what I'm seeing is that the "do stuff with x" generates a load from s.array[7] every time - the assignment into my temp has been eliminated completely. On the SPU this shows up when you have a struct member int m_x and you do something like :
    
    qword x = (qword) spu_promote( s.m_x, 0 );
    
    for(... many .. )
    {
        .. do stuff with x ..
    }
    
    
    and the compiler decides that rather than load m_x once and keep it in a register, it will reload it every time. It's like the optimizer has a very early pass that merges variables which have just been renamed by assignment - but it incorrectly ignores what that does to the code gen. Weird.
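
    One hack that sometimes helps (just a sketch, and YMMV by compiler version) : an empty asm that "writes" the temp, so the rename-merge pass can't eliminate it :
    
    qword x = (qword) spu_promote( s.m_x, 0 );
    __asm__ volatile ( "" : "+r" (x) );     // opaque touch : x must now live in a register
    
    for(... many .. )
    {
        .. do stuff with x ..
    }
    
    Anyway this brings us to :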

  • The SPU has only vector registers and instructions. Despite that, basic int math is mostly okay. The "top" ([0]) int in a qword can be used basically like an int register. You get all the basic ops on it and it runs full speed. The big problem is with loads and stores. You can only load/store aligned quadwords, so accessing something like a U32 from a memory location involves some shuffling around. rotqby solves your loads, and cbd/chd/cwd/cdd + shufb solves your stores (there's a sketch at the end of this bullet).

    One annoyance is that since only the top word can be used for things like load addresses, you often lose your SIMD. Say you want to generate 4 addresses and do 4 loads : you can do the math on 4 channels, but then you have to copy the results out of channels 1-3 into slot 0 of other registers (using shufb or rotqby), and that is often more expensive than just doing the math 4 separate times, because the instructions that move data across lanes are slower than the math.
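
    To make that concrete, here's a sketch of the standard scalar load/store idiom (function names are mine) :
    
    // load a U32 from a 4-aligned address :
    static inline unsigned int load_u32( const unsigned int * ptr )
    {
        qword p = si_from_ptr( (void *)ptr );
        qword q = si_lqd( p, 0 );       // aligned load of the containing quadword
        q = si_rotqby( q, p );          // rotate left by (ptr&15) bytes ; word lands in slot 0
        return si_to_uint( q );
    }
    
    // store a U32 : read-modify-write the containing quadword
    static inline void store_u32( unsigned int * ptr, unsigned int val )
    {
        qword p    = si_from_ptr( ptr );
        qword q    = si_lqd( p, 0 );
        qword mask = si_cwd( p, 0 );    // insertion controls for a word at ptr
        q = si_shufb( si_from_uint( val ), q, mask );   // splice val into the right slot
        si_stqd( q, p, 0 );
    }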

  • You can, unlike old SSE, do a lot of ops that are "horizontal" across the quadword. There are nice bit scatters and gathers, as well as byte rotates, shuffles, etc. You can treat the qword as a big 128 bit register and rotate around the whole thing, but it's not fast; a rotate left takes two instructions (rotqbybi and rotqbi) while a rotate right takes three (the same two, plus a subtract to turn it into a left rotate by 128-n).
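
    Eg. a full 128 bit rotate left, as a sketch (bit count in slot 0 of n) :
    
    static inline qword rotl128( qword v, qword n )
    {
        qword t = si_rotqbybi( v, n );  // rotate left by (n>>3) bytes , count taken as bits
        return si_rotqbi( t, n );       // then by the remaining (n&7) bits
    }
    
    // rotate right = rotl128 with (128 - n) , hence the extra subtract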

  • So far as I can tell, __align_hint doesn't do anything for you. For review, __attribute__((aligned(16))) is the way you specify that a certain variable should be aligned by the compiler. __align_hint, on the other hand, lets it know that a given pointer which was not declared as aligned in fact is aligned (you know this as a programmer). In theory it could generate better code from that, but I don't see it, which is related to :

  • It certainly doesn't do magic about combining loads. For example, say you have a pointer s to a struct { int x,y,z,w; }. You tell it __align_hint and then you do s->x ++ , s->y ++. In theory, it should load that whole struct as one single quadword load, do a bunch of work on it, then store it back as one quadword. It won't do that for you. The compiler generates separate loads and shuffles to get out each member, so you have to do all this kind of thing by hand.
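
    Something like this sketch (names mine, the struct assumed 16-aligned) :
    
    struct Four { int x,y,z,w; };       // 16 bytes
    
    void incr_xy( struct Four * s )
    {
        qword p = si_from_ptr( s );
        qword q = si_lqd( p, 0 );       // one load for the whole struct
        vec_int4 delta = { 1, 1, 0, 0 };
        q = si_a( q, (qword)delta );    // bump x and y in a single word add
        si_stqd( q, p, 0 );             // one store back
    }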

  • The SPU has no branch predictor. Or rather it has a "zero bit" branch predictor - it always predicts branch not taken (unless you manually hint it, see later). It is deeply pipelined, so it will run ahead on the expected branch, and then if you actually go the other way, it's a 20 clock penalty. That's not nearly as bad as a missed branch on most chips, so it's nothing to tear up your code over (eg. it's less than an L1 miss on the PPU, which is 40 clocks!).

    To make use of the static branch predictor, organize branches so the likely part comes first (the "if" part is likely, the "else" is unlikely). You can override this with if ( __builtin_expect(expr,1/0) ) ; I use these (there's a small usage sketch at the end of this bullet) :

    
    #define IF_LIKELY(expr)   if ( __builtin_expect(expr,1) )
    #define IF_UNLIKELY(expr) if ( __builtin_expect(expr,0) )
    
    
    but Mike says it's best just to arrange your code that way inherently. The way this is actually implemented on the SPU is with a branch hint instruction. The instruction sequence the SPU sees looks like :
    
    branch hint - says br is likely
    .. stuff ..
    branch
    
    
    if the branch is unlikely there's no need for a hint, so one is only generated to tell the SPU that a branch is likely, ie. that instruction fetch needs to jump ahead to the branch target. The branch hint needs to be 15 clocks or so before the branch. One compile option that turns out to be important is
    
    -mhint-max-nops=2
    
    
    which tells GCC to insert nops to make the branch hint far enough away from the branch. Right now 2 is optimal for me, but YMMV, it's a number you have to fiddle with.

    Now, in theory you can also do dynamic branch prediction. If you can compute the predicate for your branch early enough, you can see if it will be taken or not and either hint or not hint (obviously not using a branch, but using a cmov/select). I have not found a way to make the compiler generate this, so this seems to be reserved for pure assembly in super high performance code.
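
    (the usage sketch promised above ; refill_bits is a made-up slow path)
    
    IF_UNLIKELY( bits_left < need )
    {
        refill_bits();      // rare : predicted not taken , no hint emitted
    }
    // likely path falls straight through with no taken branch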

  • Dual issue ; the SPU is a lot like the old Pentium : it has dual pipes, and different types of instructions go on each pipe. Unfortunately, for data compression almost everything I want is on the odd pipe - all the loads & stores, branches, and all the cross-quadword stuff (shuffles and rotqby and such) are odd pipe. The instructions have to be aligned right (they have to alternate even and odd); if you're writing ASM you need to be super-aware of this; if you're just writing C, the compiler should do it for you; with intrinsics, you're sort of in between. The compiler should theoretically schedule the intrinsics to make them line up with the right pipe, but if you only give it odd pipe intrinsics it can't do anything. Because of that I saw odd things like :
    
    // these two instructions are much faster 
    
        qword vec_item_addr = si_a( vec_fastDecodeTable, vec_peek );
        qword vec_item = si_rotqby( vec_item_data, vec_item_addr );
    
    // than this one !?
    
        qword vec_item = si_rotqby( vec_item_data, vec_peek );
    
    // (presumably the win : rotqby only uses the low 4 bits of the count, so adding
    //  the 16-aligned table address doesn't change the result, and the even-pipe add
    //  gives the odd-pipe rotqby something to dual issue with)
    
    
    Part of the solution to that is to compile with "-mdual-nops=1" which lets the compiler insert nops to fix dual issue alignment. The =1 was the best for me, but again YMMV. BTW one problem with the SPU is that there is code size pressure (just to fit in the 256k), and all these nops do make the code size bigger.

    ADDENDUM : just found an important issue with the dual pipe; instruction fetch is on odd pipe (obvious I guess, all fetch is on odd pipe). That means when you overload the odd pipe, not only are you failing to dual issue - you are fighting with instruction fetch for time. That's a disaster. So pretty much any time you can replace a load/store/rotq/shufb/etc. with a math op on even pipe, it's a win, even if you need 2-3 math instructions to do the same odd pipe instruction.
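
    Eg. even something as small as grabbing the low byte of the preferred slot - an even-pipe mask beats an odd-pipe shuffle :
    
    qword lo = si_andi( q, 0xFF );      // even pipe , vs building a shufb pattern (odd pipe)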

  • Optimizing for the SPU is not as obvious as Insomniac and some others might lead you to believe. If you just replace C with vector ops, you can easily make your code slower. The problem is that the latencies and dependency stalls are not obvious. Latencies are mostly around 2-6, with lots at 4, which seems short until you remember the SPU dual issues, so 4 clocks = 8 instruction slots - that makes it pretty hard to avoid being dependency-bound. For example, this :
    
        4370:   42 7f ff ad     ila $45,65535   # ffff
        4374:   38 8d c8 33     lqx $51,$16,$55 // load quad for codelen
        4378:   18 0d c8 36     a   $54,$16,$55 // 54 = address of codelen
        437c:   18 0d db b5     a   $53,$55,$55 // 53 = peek*2
        4380:   1c 03 5b 34     ai  $52,$54,13  // 52 = align byte for codelen
        4384:   38 83 9a b0     lqx $48,$53,$14 // load qw for table[peek]
        4388:   18 03 9a b2     a   $50,$53,$14 // 50 = address of symbol
        438c:   3b 8d 19 af     rotqby  $47,$51,$52 // rotate to get byte for codelen
        4390:   1c 03 99 31     ai  $49,$50,14  // 49 = adjust to get word symbol
        4394:   14 3f d7 ac     andi    $44,$47,255 // &= 0xFF to get byte 44 = codelen
        4398:   3b 8c 58 2e     rotqby  $46,$48,$49 // rot by to get symbol
        439c:   3b 0b 06 2b     rotqbi  $43,$12,$44   // use
        43a0:   08 03 56 0d     sf  $13,$44,$13   // use bits
        43a4:   18 2b 57 07     and $7,$46,$45   // $7 = symbol , 45 = ffff
        43a8:   39 8b 15 8c     rotqbybi    $12,$43,$44  // use
    
    
    is faster than this :
    
        4360:   0f 60 d7 2c     shli    $44,$46,3   // *= 8 to make table address
        4364:   38 8b 07 aa     lqx $42,$15,$44 // load qw for table
        4368:   18 0b 07 ab     a   $43,$15,$44 // 43 = address of table item
        436c:   3b 8a d5 29     rotqby  $41,$42,$43 // rotate qword to get two dwords at top
        4370:   08 02 d4 8b     sf  $11,$41,$11 // use
        4374:   3b 0a 45 27     rotqbi  $39,$10,$41 // use
        4378:   3f 81 14 87     rotqbyi $7,$41,4    // rotate for symbol
        437c:   39 8a 53 8a     rotqbybi    $10,$39,$41 // use
    
    

    Anyway, the point is if you want to really get serious, you have to map out your dependency chains and look at the latency of each step and try to minimize that, rather than just trying to reduce instructions.
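
    A sketch of the classic fix (names mine) : split a long serial chain into independent chains that overlap in the pipeline, then combine :
    
    // serial : each add stalls ~2 clocks on the one before it
    sum = si_a( sum, x0 );
    sum = si_a( sum, x1 );
    sum = si_a( sum, x2 );
    sum = si_a( sum, x3 );
    
    // split : two chains run in parallel , half the dependency depth for one extra add
    sumA = si_a( sumA, x0 );    sumB = si_a( sumB, x1 );
    sumA = si_a( sumA, x2 );    sumB = si_a( sumB, x3 );
    sum  = si_a( sumA, sumB );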

  • One particularly annoying example of bad code gen. If you do this :
    
    int x,y;
    
    loop {
    
        if ( x <= y )
            ...
    
    }
    
    
    it will *sometimes* do the right thing (the right thing is to make x and y qwords and do all the int ops on the [0] element). However, sometimes it decides that they aren't important enough and it will leave them as ints on the stack, in which case you get loads & shuffles to fetch them.

    More troubling is the fact that you cannot easily trick the compiler into fixing it. You might think that this :

    
    vec_int4 x,y;
    
    loop {
    
        if ( x[0] <= y[0] )
            ...
    
    }
    
    
    would do the trick - you're telling the compiler that you want them to be qwords but will only use the top int part. Nope, that doesn't work; in fact it often makes things *much worse*. In theory, the compiler should be able to tell that I am only ever touching x[0], so when I do
    
    x[0] ++
    
    
    it should be able to know that it can go ahead and increment the *whole* vector x with a plain old add (eg. also increment x[1],x[2],x[3]), but it doesn't always do that, and you wind up getting instructions to extract the portion you're working on. Sigh.

    The only reliable way I have found is to go ahead and use the qword intrinsics :

    
    qword x,y;
    
    loop {
    
        qword comp = si_clgt( x, y );       // unsigned x > y ; si_cgt for signed
        if ( ((vec_uint4)comp)[0] )         // a qword can't be indexed directly
            ...
    
    }
    
    
    but the problem with that is that you now have to be very careful about which instructions you generate for your operations - intrinsics can easily come out slower than the plain C if you aren't thinking about scheduling and pairing. Also, once you start using intrinsics, you can't use much of the [0] form of access any more, because of the above note.

    To be clear, what I would like is a way to say "I'm going to only work on this as an int, but please keep it in a vector and ignore the bottom 3 components and generate the code that is fastest to give me the [0] result" , but there doesn't seem to be a reliable way to make the compiler do that. I dunno, maybe there's a way to get this.
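
    For what it's worth, the full slot-0 pattern ends up looking like this sketch (x0/y0 are the incoming ints) :
    
    qword x = si_from_int( x0 );
    qword y = si_from_int( y0 );
    
    loop {
    
        qword comp = si_clgt( x, y );   // unsigned compare ; si_cgt for signed
        if ( ! si_to_uint( comp ) )     // slot 0 : x <= y
            ...
    
        x = si_ai( x, 1 );              // x++ on all four words ; only slot 0 matters
    }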

2 comments:

MH said...

Most people weren't even using 20% of the SPUs. Probably better now that people have worked with it for a while. I'd still bet most folks are in the 50% to 75% SPU usage range; plenty of room for sloppy code.

  said...

If you have the right tools, writing it in straight assembly (no C/C++) is quite pleasant.

I don't think anybody believes that optimizing for the SPU is easy, but you can get good at it after a while.

As for too many ops on the left vs right, that's what shufb is for :) creative use is key.

I've done hinting dynamic branches in C intrinsics before. Check out the PlayStation Edge code for reference.
