x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms.
(addendum : I don't really mean x86 the ISA, I mean x86 as shorthand for the family of modern processors that run x86; in particular
P-Pro through Core i7).
It's not just that the chips are super fast and easy to use. It's that they actually encourage good
software engineering. In order to make the in-order PPC chips run fast you have to abandon everything you've learned about
good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals,
every function has to be inline.
1. Complex addressing.
This is a way bigger deal than I ever expected. There are two important subaspects here :
1.A. being able to do addition and simple muls in the addressing, e.g. [eax + ecx*2]
1.B. being able to use memory locations directly in instructions instead of moving through registers.
Together these make it so that on x86 you don't have to fuck around with loading shit out
to temporaries. Working on variables in structs is almost exactly the same speed as working on locals.
e.g.
x = y;
is
mov eax, ecx
while
x = s.array[i];
is
mov eax, [eax + ecx*4 + 48h]
and those run at almost the same speed !
This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are
this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!
ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad;
it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in
between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the
point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about
changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it.
Instead on the PPC you have to worry about things like the fact that indexing byte arrays is faster than any other size, and that if
you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with
index*4, or whatever.
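To make that concrete, here's a minimal sketch of the two loop styles (the function names are mine, purely for illustration). On x86
you just write the indexed version and the load folds into one complex-addressing instruction; on the in-order PPC you find yourself
considering the pointer version to keep integer address math off the critical path :

#include <cstddef>

// the natural way : index off i and don't worry about it
int sum_indexed(const int* data, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += data[i]; // one load, [data + i*4], on x86
    return sum;
}

// the PPC-flavored rewrite : iterate with a pointer so each step is just
// an increment, not an integer shift/add feeding a load
int sum_pointer(const int* data, size_t count)
{
    int sum = 0;
    const int* end = data + count;
    for (const int* p = data; p != end; p++)
        sum += *p;
    return sum;
}

On x86 these run at essentially the same speed, so you just write whichever is cleanest; on the PPC the difference can be real.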
2. Out of order execution.
Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls
and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.
In particular, the entire approach of referencing things through pointers is impossible in even moderately performant code without
OOE.
The nasty thing that C++ (or any modern language really; Java, C#, etc. are actually much much worse in this
regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate,
so it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86
it actually doesn't hurt much. Why?
Typical C++ code may be something like :
.. stuff ..
this->m_obj->func();
.. stuff ..
this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it
can get rearranged to :
..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
vtable->func();
.. stuff ..
And as long as you have enough stuff to do in between it's no problem.
Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's
not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer
from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).
ADDENDUM : obviously if your code path were completely static, then a compile-time scheduler could do the same thing.
But your code path is not static, and the caches have basically random delays because other threads might be using them too,
so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling
for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr
even if they might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific),
etc. Furthermore, static scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions
for you, since it can schedule across the call. etc. etc.
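To make "speculatively read ptr and *ptr" concrete, here's a hedged sketch (all the names here are made up) of what a static
scheduler would have to emit to compete with OOE :

struct Obj    { int value; };
struct Holder { Obj* m_obj; };

int do_other_work() { return 1; } // stand-in for ".. stuff .."

int maybe_use(Holder* h, bool taken)
{
    // speculative : read ptr and *ptr before we know we'll need them,
    // so the load latency overlaps the other work
    Obj* p = h->m_obj;         // read ptr
    int  v = p ? p->value : 0; // read *ptr (guarded against null)

    int other = do_other_work();

    if (taken)
        return v + other; // the speculation paid off
    return other;         // the loads were wasted, but we never stalled
}

OOE does this rearrangement for you at runtime, on the actual dynamic path, with the actual cache delays.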
ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you
when you do a switch() and also when you make virtual calls. The PPC can only handle branch prediction for static branches; there's
no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.
Final addendum :
Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone
knows that and it's old news and boring.
What I have realized and what I'm trying to say is that these bad old chips are not only slow - it's much worse than that! They cause a
regression in software engineering practice, back to the bad old days when you had to worry about shit like whether you pre-increment
or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. In the last 20 years,
since I started coding on Amigas and 286's where we had to worry about this shit, we moved into an enlightened age where algorithms
were more important than micro bullshit - and now we have regressed.
At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just
to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about
the PC/x86 performance at all, just 100% write your code for the PPC.
There are lots of little places where they differ - for example, on x86 you should write code to take advantage of complex addressing;
you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it, while on PPC this
might hurt a lot. Similarly there are quirks about how you organize your branches, what data types you use (PPC is very sensitive
to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.
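To make the first of those concrete, here's a hedged sketch of the base-variable pattern (the Particle type is hypothetical). On x86
each member access folds into complex addressing off one base pointer; on PPC the same pattern can hurt :

struct Particle { float x, y, vx, vy; };

void integrate(Particle* particles, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        Particle* p = &particles[i]; // one base variable
        p->x += p->vx * dt; // each access is one load/store at [p + offset] on x86
        p->y += p->vy * dt;
    }
}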
Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the
x86, because it will be like a bazillion times faster than the PPC anyway.
Some other PPC notes :
1. False load-hit-stores because of 4k aliasing are an annoying and real problem (only the bottom bits of the address are used for LHS
conflict detection). In particular, they can easily come up when
you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy
between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly
offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.
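Here's a minimal sketch of the offsetting trick (the function name and the choice of offset are up to you; you'd tune the offset on
the real target). The idea is just to push one buffer off the allocator's natural 4k alignment so the two buffers don't collide in
the low address bits the LHS conflict check looks at :

#include <cstdlib>

// allocate 'size' bytes shifted 'offset' bytes off the allocator's natural
// (often 4k-aligned) address; the caller must free (ptr - offset)
unsigned char* alloc_offset(size_t size, size_t offset)
{
    unsigned char* raw = (unsigned char*)malloc(size + offset);
    return raw ? raw + offset : 0;
}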
2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted
before the failure to keep array base addresses in registers, but it also fucks up badly if you *EVER* call a function that shares variables with your hot code.
For example, say you write a function like this :
{
    int x = 0;
    for( ... one million .. )
    {
        .. do lots of stuff using x ..
        x = blah;
    }
    external_func(&x);
}
the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right
before the call :
{
    //int x; // x = r7
    r7 = 0;
    for( ... one million .. )
    {
        .. do lots of stuff using r7 ..
        r7 = blah;
    }
    stack_x = r7;
    external_func(&stack_x);
}
Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :
{
    //int x; // x = r7
    r7 = 0;
    stack_x = r7;
    for( ... one million .. )
    {
        .. do lots of stuff using r7 - stores to stack_x every time ! ..
        r7 = blah;
        stack_x = r7;
    }
    external_func(&stack_x);
}
The conclusion is the same one I came to last time :
When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything
you need as function arguments so you never have to load from globals or constants (even loading static constants seems to be compiled very
badly; you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you
never take the address of). Never call external functions.
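As a minimal sketch of that rule (all the names are hypothetical) :

// everything comes in as arguments, lives in locals whose address is
// never taken, and nothing external is called inside the loop
void scale_span(float* dst, const float* src, int count, float scale)
{
    // 'scale' is passed in rather than read from a global or static
    // constant, so it can stay in a register for the whole loop
    for (int i = 0; i < count; i++)
        dst[i] = src[i] * scale;
}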