7/10/2010

07-10-10 - PowerPC Suxors

I've finally done my first hardcore optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC: not only is the code gen for the PC quite good, but any mistakes it makes are magically hidden by out-of-order PC cores. On the PC, if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get swallowed up by out-of-order execution whenever there's a branch or memory load to hide them behind.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (no lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because indexing a larger object takes an actual multiply or shift to compute the address. So for example if you have a struct of several byte members, you should use SOA (one byte array per member) instead of AOS (one array of the large struct).
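
To make that concrete, here's a minimal sketch of the two layouts (the type and field names are made up); with the AOS form, indexing element i takes a multiply or shift by sizeof(Thing) to form the address, while each SOA access is just a plain byte-indexed load :

typedef unsigned char U8;

// AOS : one array of a multi-byte struct ; thing_aos[i].cost needs i*sizeof(Thing)
struct Thing { U8 flags; U8 type; U8 cost; };
Thing thing_aos[256];

// SOA : one byte array per member ; thing_cost[i] is just base + i
U8 thing_flags[256];
U8 thing_type[256];
U8 thing_cost[256];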

X. Inline ASM kills optimization. You'd think that with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no-optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.
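
For reference, here's a minimal sketch of the kind of hand prefetch you end up writing, assuming a 128-byte cache line and using GCC's __builtin_prefetch as a stand-in for whatever dcbt-style intrinsic your compiler provides; PREFETCH_AHEAD is just a made-up tuning constant :

typedef unsigned char U8;
typedef unsigned int  U32;

#define CACHE_LINE      128
#define PREFETCH_AHEAD  (4*CACHE_LINE)   // how far ahead to fetch ; tune per platform

U32 SumBytes(const U8 * data, int count)
{
    U32 sum = 0;
    for (int i = 0; i < count; i++)
    {
        if ( (i & (CACHE_LINE-1)) == 0 )   // issue one prefetch per cache line
            __builtin_prefetch(data + i + PREFETCH_AHEAD);   // hint only, harmless past the end
        sum += data[i];
    }
    return sum;
}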

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.
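
To illustrate, a tiny sketch of the difference (the arrays and function are made up); per the above, the Xenon compiler generates the expensive lha for the signed load even when the result goes straight into a U8, while declaring the source unsigned lets it use the cheap lhz :

typedef signed short   S16;
typedef unsigned short U16;
typedef unsigned char  U8;

U8 GrabByte(const S16 * src_s16, const U16 * src_u16)
{
    U8 a = (U8) src_s16[7];   // signed source : lha (load-sign-extend), the expensive form
    U8 b = (U8) src_u16[7];   // unsigned source : lhz (load-zero-extend), normal cost
    return (U8)(a + b);
}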

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift, mul, div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :


__speed_critical {
  .. code ..
}

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course the "register" keyword does nothing). This is really fucking annoying. It's also an aspect of the code gen flakiness above - whether or not something gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals: the compiler really wants to generate a load from the static base mem, it won't cache that in a register.

Now you might think "I'll just copy it into a local", but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy), then back into a local, which then gets registered. Ugh.

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.
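
Here's a minimal sketch of that volatile trick; g_registerize and the rest of the names are made up, and this is just the pattern described above, not guaranteed to survive every compiler version :

static int s_table_count;              // a static the compiler insists on reloading every time
static volatile int g_registerize;     // bounce variable ; the copy through a volatile can't be eliminated

void Process(const int * table, int * out_sum)
{
    g_registerize = s_table_count;     // copy the hot value out through the volatile...
    int count = g_registerize;         // ...and back into a local, which then gets a register

    int sum = 0;
    for (int i = 0; i < count; i++)
        sum += table[i];               // no reload of s_table_count inside the loop
    *out_sum = sum;
}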

X. Standard loop special casing. On Xenon they seem to special case the standard


for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
{
    i++;
    if ( i == count ) break;
}

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :


x = ( y >> 4 ) & 0xFF;

will get turned into a single rlwinm (rotate-and-mask). Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually. The clean solution to this is to write your own SELECT, something like :


#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.
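
For the record, one common "bit masky" branch-free formulation for 32-bit ints looks like the following (not necessarily Mike's exact version) : the condition becomes an all-ones or all-zeros mask and the two values get blended.

typedef unsigned int U32;

static inline U32 select_u32(U32 condition, U32 val_if_true, U32 val_if_false)
{
    U32 mask = (U32) -(int)(condition != 0);              // 0xFFFFFFFF if true, 0 if false
    return (val_if_true & mask) | (val_if_false & ~mask);
}

Point SELECT at something like that on PPC and the ? : never turns into a branch.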

17 comments:

  said...

The top 32-bit clears are due to the fact that the heritage of the PS3 is 64-bit, not 32-bit.

  said...

Rolling your own assembly from scratch via an assembler is also very painful. There are very many hard-to-remember scheduling rules to get right.

Anonymous said...

Welcome to the world of PPC!

I'm not sure how much of your post might be under NDA (also I'm not sure how much the console manufacturers would actually care).

Yep, manual prefetching quickly becomes a must, otherwise a simple stupid loop over an array will keep hitting L2 misses. Also virtuals are really costly on this platform: PC CPUs are more than able to run virtual calls quickly thanks to their architecture, but PPC just gets hung up on them.
I ended up rewriting a bunch of stuff as templated/metaprogrammed code to get rid of these virtuals, which somehow some people liked to put inside expensive loops :(

GCC vs. the Xenon compiler: your mileage may vary. A dev compared the asm output for our engine and found the Xenon version to be ~2-5% faster somehow.

I disagree with your comment about stack variables not getting registered: most of my primitive-type variables do end up in registers: loop counters, ptrs, simd-vector classes... (maybe an issue with your compiler flags?)
On the other hand PPC compilers are especially afraid of optimizing member variables into registers due to memory aliasing, so putting often-read members in local variables is a must.

MaciejS said...

Oh yeah, PPC compilers seem to be really bad at utilizing registers/paranoid about aliasing sometimes. I've seen plenty of LHSes just because the compiler decided to store something to memory, then load it back a few instructions later (there was no reason to do that, no aliasing, no volatile, nothing). Usually it's possible to 'fix' it by modifying the C++ code a little bit, but it's still a major PITA. (That's mostly MSVC, can't really comment on GCC/SNC, didn't use it that much).

cbloom said...

"The top 32-bit clears is due to the fact that the heritage of the PS3 is 64-bit, not 32-bit."

Yeah, that's fine; the registers are 64 bit, so you mostly want to work on 64 bit junk. The pointers are 32 bit (or maybe the TLB is 32 bit or something?). Anyway, it seems that any time you do math to compute an address it wants to clear the top 32 before letting you deref. That's pretty fucked; it seems like there are many ways they could have avoided that. Like the memory subsystem could have just always ignored the top 32 bits. Or they could have just not done the masking and relied on you to generate valid addresses. Duh. (Maybe it's a hacky way to keep you from stomping on the kernel, I dunno.)

  said...

Yeah, it was never explained to me why this was the case. At one point I started writing a pre-processor that would go through the assembly code before actual assembling and remove the clears when it was "safe" to do so. That never made it to the light of day, though I still think it's a good idea. If the compiler people can't do it due to whatever reasons, no reason why you can't do it for them.

ryg said...

* Top-clearing: On PS3, the SN systems compiler has an optimizer option where you basically tell it "yeah, I'm not doing any weird stuff with address wraparound". Turn that on, and all of the clearing disappears. This can easily be an instant 10% improvement on loops that process several data streams but don't have much work per element.

* Top-clearing 2: Explicitly making integer variables 64-bit (even if 32 bits would be fine) can be a big win because it avoids the extra clears. Note this goes for all 64-bit architectures, even x64. Also, C/C++ typing semantics can be full of surprises in that regard. One thing I recently encountered in x64 code:
"U8 *SomePtr = BasePtr + (Offset & 0xffff) * 4;" compiled into a shl+add. Well, Offset was a U32, so the whole expression gets evaluated as "U8 *SomePtr = BasePtr + U32((Offset & 0xffff) * 4);" and you get the 32-bit shift because it might overflow - nevermind that it can't because of the masking. Fix: Write it explicitly as "U8 *SomePtr = BasePtr + U64(Offset & 0xffff)*4;" and it gets folded into the addressing mode.

* Really important: don't take the address of an important local or it will end up on the stack and stay there (not sure if that's what you mean by "stack variables", or if you mean extra arguments that are pushed onto the stack). Extra-important when you have a "vertex buffer lock"-like primitive. Always use it like this:

Vertex *vp_;
vb->Lock(..., &vp_);
Vertex * restrict vp = vp_;


and then use vp.

* If you have code that processes data sequentially, you want to end up using the load-update and store-update instructions if possible. These correspond to "x = *(ptr += dist);" and "*(ptr += dist) = x;", respectively. Because of the lack of indexed addressing modes, you'll often end up with a bunch of pointers, and stepping them individually can be a big time sink in tight loops. Note the pre-increment semantics, which are annoying. It would be nice if the compilers always used this automatically (introducing a pointer adjustment at the top of the loop if necessary), but they don't. Extra-bad because converting the code to explicitly do the preincrement makes it a lot harder to read and understand.
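
As a concrete (made-up) example of that explicit pre-increment style : peeling off element 0 handles the pointer adjustment, and the loop body is written in the *(++ptr) form that maps onto the update instructions.

typedef unsigned int U32;

void ScaleCopy(U32 * dst, const U32 * src, int count, U32 scale)
{
    if (count <= 0) return;

    const U32 * s = src;
    U32 * d = dst;
    *d = *s * scale;              // element 0 with plain load/store

    for (int i = 1; i < count; i++)
    {
        U32 x = *(++s);           // hoping for lwzu (load with update)
        *(++d) = x * scale;       // hoping for stwu (store with update)
    }
}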

* Damn variable-distance shifts being microcoded makes bitstream parsing a royal pain. That, and the lack of an integer select/cmov instruction. And the high mispredicted-branch penalty. Argh!

ryg said...

* Register spills are bad, way worse than on x86. Main problem is the relatively long LHS window. A short-time spill can get really costly if it causes a LHS on every iteration!

* Bitfield structs: just say no on PPC (or check the code very closely). In theory PPC has all the nice bitfield-specialized "rotate and do masked insertion" ops and stuff like that. In practice they tend to get sandwiched between loads and stores to make sure you hit every LHS opportunity on the way. Had a nice instance of that in some in-game GUI code where it checked if a widget had focus, set some other internal state based on that, then checked another flag, and so on. A bunch of bools for every widget, nicely bitpacked to save space. Code looked completely innocent but was a total LHS-fest.

* Short functions: You really want them inlined because stack frame setup will cause LHSs too. On PC I sometimes like to have very small (2-3 line) functions in the .cpp because they have unwieldy dependencies (e.g. system headers) that you really don't want to put into public headers included from everywhere. PPC makes this painful. Now you can just turn on LTCG and that gets sorted out automatically, but this is way too heavyweight in terms of turnaround time to use for every single compile - it's usually something you use for "outside" versions. Not a big issue, but annoying.

* Debug builds, goddamnit. PPC compilers seem to think that "debug" means "just do codegen in the most stupid way possible". Yeah, I don't want scheduling, unrolling, partial redundancy elimination or high-level loop transforms in my debug builds. But forcing everything to memory after every statement and not extracting induction variables in loops? That's complete suicide on an architecture with huge LHS penalties and no nontrivial addressing modes. A debug build for a game stops being useful when it's so slow that it's impossible to play.

Anonymous said...

>> Bitfield structs: just say no on PPC

Indeed. Still, it can be a good tradeoff versus L2 cache misses sometimes.

Also the Xenon compiler with full optimization is often able to get rid of all these LHSs on bitfields.

Doing the work manually (loading the bitfield into a local copy, doing all the operations there, then storing the result once at the end) will obviously get rid of the LHSs on both platforms.
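
A small sketch of that manual pattern (the widget flag names are made up) : one load up front, all the flag twiddling on a local copy that stays in a register, one store at the end.

typedef unsigned int U32;

enum { kHasFocus = 1u<<0, kHovered = 1u<<1, kDirty = 1u<<2 };

void UpdateWidgetFlags(U32 * packed_flags, U32 new_focus_bits)
{
    U32 flags = *packed_flags;          // one load up front

    flags &= ~(kHasFocus | kHovered);   // all the bit twiddling stays in a register
    flags |= new_focus_bits;
    if (flags & kHasFocus)
        flags |= kDirty;

    *packed_flags = flags;              // one store at the end
}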

PS: I'm the same "Sly" as in the 3rd comment; just wanted to get rid of the -id suffix, so I changed my OpenId.

  said...

Great to hear that the compiler people have finally implemented a good work-around for the top clears. :)

ryg said...

"Also the Xenon compiler with full optimization is often able to get rid of all these LHS on bitfield."
Yeah, the problem with multiplatform dev is that it's always the worst implementation of a feature on any of your target platform/compiler combos that ultimately makes the decision.

"Doing the work manually (loading the bitfield on the stack, doing all operations locally, then storing the result once for all) will obviously get us rid of LHS on both platforms."
Yep, basically all of these problems are solvable by looking at the generated code and rephrasing the C++ code until it produces decent results.

But that's a disaster in terms of maintainability. You have your code written in a high-level language but there's tons of invisible (in the source code) low-level constraints that are functionally equivalent but make a big difference in performance.

It's far too easy to cause a significant performance penalty just by changing code in a way that should be completely innocent.

"Great to hear that the compiler people have finally implemented a good work-around for the top clears. :)"

Oh, SNC has a bunch of good stuff.

They'll also automatically move single-precision FP computations over to VMX if that avoids load-hit-stores for int<->float conversions, for example.

SNC also tends to compile significantly faster than GCC. And produces code that's both faster and smaller. And their linker is way quicker than GNU ld. Okay, enough advertisement for now. :)

cbloom said...

"
Yep, basically all of these problems are solvable by looking at the generated code and rephrasing the C++ code until it produces decent results.

But that's a disaster in terms of maintainability. You have your code written in a high-level language but there's tons of invisible (in the source code) low-level constraints that are functionally equivalent but make a big difference in performance."

Yeah this is what I was trying to get at about the code gen being a disaster.

Sure you can tweak on the C until you get nice ASM, but then it's bad on the other platform, or you make some little innocuous change and it completely mucks up the code gen. It's just so fragile and you have to constantly keep an eye on it.

It's a total nightmare for robust cross-platform dev.

cbloom said...

"One thing I recently encountered in x64 code:
"U8 *SomePtr = BasePtr + (Offset & 0xffff) * 4;""

BTW a useful note on this :

On x64 the behavior of code like this is different in Debug and Release.

In the MSufSort 64-bit bugs I was looking at, there was a bunch of address math like that.

In Debug builds, the compiler would do the (U32) conversion on the math and it would wrap, so the code would work.

In Release builds, it would sometimes NOT do the U32 conversion, just promote everything to 64 bit. Which is sort of nice, that's usually what you want, but in this case it caused bugs.

So with MSVC on x64 when you are doing mixed 32/64 maths, it will sometimes just promote everything and sometimes not.

cbloom said...

"* Really important: don't take the address of an important local or it will end up on the stack and stay there ....

Vertex *vp_;
vb->Lock(..., &vp_);
Vertex * restrict vp = vp_;
"

Yeah, what we were seeing was cases like this. The problem is I would do something like your copy to vp, but the compiler would completely eliminate the copy to vp and generate later code as if I was just using vp_.

It's like some early pass of the compiler had figured out the copy did nothing and so eliminated the temporary.

eg. I had cases like

int array[32];
FillArray(array);
const register int x = array[7];

and then when I would use "x" later down in the function it would gen a bunch of code to lookup array[7] instead of putting it in a register and leaving it there.

Arseny Kapoulkine said...

?: compilation heavily depends on the compiler.
However, I don't think I've often seen MSVC generate a cmov or anything like it on PC.
GCC (and SNC), on the other hand, sometimes do.

Of course the robust thing is to make a select macro (along with the corresponding version for floats).

cbloom said...

" ?: compilation heavily depends on the compiler. However, I don't think I often saw MSVC generate a cmov or something else for PC. GCC (and SNC), on the other hand, sometimes do this."

Yeah, for sure. VC wouldn't do it much on x86 because their code gen was pretty much 586 compatible. For x64 they do generate cmov pretty reliably.
