I've finally done my first hardcore optimization pass for PowerPC and discovered a lot of weird quirks,
so I'm going to write them up so I have a record of it all. I'm not gonna talk about the issues
that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).
X. The code gen is just not very good. I'm spoiled by MSVC on the PC: not only is the code gen for the PC quite good, but any mistakes
that it makes are magically hidden by out-of-order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the
best possible register assignments, those just get swallowed up by out-of-order execution whenever there's a branch or memory load to hide
them behind.
In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like
unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the
branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.
Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back
and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate
neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own
function.
X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because
indexing anything bigger than a byte takes an actual multiply or shift to form the address. So for
example if you have a struct of several byte members, you should use SOA (several separate byte arrays) instead of AOS (one array of the larger struct).
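To make the SOA vs AOS point concrete, here's a rough sketch of the two layouts. All the names (ItemAOS, ItemsSOA, MAX_ITEMS, the gather functions) are made up for illustration, and whether the SOA version actually wins depends on what your compiler does with it, so check the assembly.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ITEMS 256  /* hypothetical capacity, for illustration only */

/* AOS: one array of a 4-byte struct. Reading items[i].b needs
   i * sizeof(ItemAOS), i.e. a shift or multiply, before the load. */
typedef struct {
    uint8_t a, b, c, d;
} ItemAOS;

/* SOA: one byte array per member. Reading b[i] can use i directly
   as the byte offset in an indexed load. */
typedef struct {
    uint8_t a[MAX_ITEMS];
    uint8_t b[MAX_ITEMS];
    uint8_t c[MAX_ITEMS];
    uint8_t d[MAX_ITEMS];
} ItemsSOA;

/* Sum member b over a list of (non-linear) indices, AOS layout. */
uint32_t gather_b_aos(const ItemAOS *items, const uint8_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += items[idx[k]].b;  /* index must be scaled by 4 */
    return sum;
}

/* Same thing, SOA layout. */
uint32_t gather_b_soa(const ItemsSOA *items, const uint8_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += items->b[idx[k]];  /* index is already a byte offset */
    return sum;
}
```

Both functions compute the same thing; the only difference is how the address of the byte gets formed.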
X. Inline ASM kills optimization. You'd think that with the code gen being annoying and flaky you could win by doing some manual inline ASM,
but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no-optimization mode, the same way it did on the PC
many years ago before that got fixed.
X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you
can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this).
On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no
out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to minimize dependencies and
open up a time gap between figuring out the address you want (at which point you prefetch it) and actually needing the data at that address.
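A minimal sketch of what prefetching by hand looks like for a linear scan. This uses GCC's __builtin_prefetch as a stand-in (the console compilers have their own ways of issuing the PPC dcbt cache-touch instruction); the PREFETCH macro, the 128-byte CACHE_LINE and the PREFETCH_AHEAD distance are assumptions for illustration, and the right distance is something you'd have to tune.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr) __builtin_prefetch((addr))
#else
#define PREFETCH(addr) ((void)(addr))  /* no-op fallback */
#endif

#define CACHE_LINE     128  /* assumed line size in bytes */
#define PREFETCH_AHEAD 256  /* hypothetical distance, needs tuning */

/* Sum a byte array one cache line at a time, touching a line that is
   PREFETCH_AHEAD bytes in front of the one currently being read. */
uint32_t sum_bytes_prefetched(const uint8_t *data, size_t count)
{
    uint32_t sum = 0;
    size_t i = 0;
    while (i < count) {
        size_t line_end = i + CACHE_LINE;
        if (line_end > count)
            line_end = count;

        /* Address is known now, data is needed later: prefetch here. */
        if (i + PREFETCH_AHEAD < count)
            PREFETCH(&data[i + PREFETCH_AHEAD]);

        for (; i < line_end; i++)
            sum += data[i];
    }
    return sum;
}
```

The prefetch is issued once per line rather than once per byte, so the inner loop stays free of extra work.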
X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz)
is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs.
(I guess lha is a microcoded instruction, so if you use -mwarn-microcode you can get warnings about it.)
PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an
S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.
-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed
with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have
it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some
kind of scope, like :
__speed_critical {
.. code ..
}
and then it should warn about microcode and load hit stores and whatever else within that scope.
X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants
to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course the "register"
keyword does nothing). This is really fucking annoying. It's also an aspect of the first point (flaky code gen) - whether or not it gets registered depends on the phase
of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals: the
compiler really wants to generate a load from the static base mem every time, it won't cache the value in a register.
Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy.
The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't
eliminate the copy) then back into a local, which then gets registered. Ugh.
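Roughly what that volatile round trip looks like, as a sketch. The names (s_scale, s_bias, g_launder, scale_and_sum) are made up, the example uses static globals as the values to be registered, and whether the locals really end up in registers is up to the compiler, so you still have to look at the assembly. Note that the shared volatile makes this not thread safe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statics that the compiler insists on reloading. */
static uint32_t s_scale = 3;
static uint32_t s_bias  = 1;

/* Global volatile scratch: writes and reads of it can't be eliminated. */
volatile uint32_t g_launder;

uint32_t scale_and_sum(const uint32_t *in, size_t count)
{
    /* Copy out through the volatile, then back into plain locals.
       The compiler can no longer see that 'scale' is just s_scale,
       so it has no reason to go back to the static for it. */
    g_launder = s_scale;
    const uint32_t scale = g_launder;
    g_launder = s_bias;
    const uint32_t bias = g_launder;

    uint32_t acc = 0;
    for (size_t i = 0; i < count; i++)
        acc += in[i] * scale + bias;
    return acc;
}
```

The cost is a couple of extra stores and loads up front, which is only worth it when the loop runs long enough to pay it back.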
You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables,
they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.
X. Standard loop special casing. On Xenon they seem to special case the standard
for(int i=0;i < count;i++) { }
kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :
for(int i=0;;)
{
    i++;
    if ( i == count ) break;
}
that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :
for(int i=count;i--;) { }
appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it
wants.
X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make
this go away.
X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at
figuring out that some combined operation can be turned into one instruction. eg. something like this :
x = ( y >> 4 ) & 0xFF;
will get turned into one instruction. Obviously this only works for constant shifts.
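As a small sketch of the constant vs variable distinction (function names are made up): the constant-shift form is the kind of thing a single rotate-and-mask instruction like rlwinm can cover, while the variable form needs a separate shift before the mask. What you actually get is up to the compiler.

```c
#include <stdint.h>

/* Constant shift + mask: a candidate for one rotate-and-mask instruction. */
uint32_t extract_const(uint32_t y)
{
    return (y >> 4) & 0xFF;
}

/* Variable shift (must be < 32): the shift amount isn't known at
   compile time, so this can't be folded into one instruction. */
uint32_t extract_var(uint32_t y, uint32_t shift)
{
    return (y >> shift) & 0xFF;
}
```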
X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case
it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually.
The clean solution to this is to write your own SELECT, that's like :
#define SELECT(condition,val_if_true,val_if_false) ( (condition) ? (val_if_true) : (val_if_false) )
and replace it with Mike's bit masky version for PPC.
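For reference, here's one common way to do a bit-masky select in plain C. This is a generic sketch and not necessarily the version referred to above; the function names are made up. It selects on the sign bit of an integer, and select_ge assumes a - b doesn't overflow.

```c
#include <stdint.h>

/* Returns val_if_neg when x < 0, else val_if_nonneg, with no branch.
   The sign bit of x is smeared into an all-ones or all-zeros mask. */
static inline int32_t select_negative(int32_t x,
                                      int32_t val_if_neg,
                                      int32_t val_if_nonneg)
{
    uint32_t mask = 0u - ((uint32_t)x >> 31);  /* 0xFFFFFFFF if x < 0, else 0 */
    uint32_t r = ((uint32_t)val_if_neg & mask) |
                 ((uint32_t)val_if_nonneg & ~mask);
    return (int32_t)r;  /* two's complement conversion assumed */
}

/* (a >= b) ? val_if_true : val_if_false, via the sign of (a - b).
   Assumes a - b does not overflow. */
static inline int32_t select_ge(int32_t a, int32_t b,
                                int32_t val_if_true,
                                int32_t val_if_false)
{
    return select_negative(a - b, val_if_false, val_if_true);
}
```

Unlike ? :, both values always get evaluated, so this only makes sense when they are cheap and free of side effects.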