Shelving with just one workspace as a way to save your work seems to work okay, but I've been using it to make temp checkins to go between my various computers, and it's not awesome for that. I guess what I really have to do is make real branches for that, but branches scare the bejeesus out of me.
8/05/2010
08-05-10 - P4 Shelf
8/04/2010
08-04-10 - Initial Learnings of SPU
- Yes I know I'm 4 (5?) years late.
- The SPU decrementer runs at the same frequency as mftb on the PPU, that's 79.8 MHz
- There are two different ways the intrinsics are exposed : the spu_ intrinsics are C++ and are supposed to know about types; the si_ intrinsics just expose the instructions directly and are untyped. The spu_ intrinsics are kind of annoying because they are overloaded C++, but the overloads don't get it right, so you get compile errors all the time. The vec_ types don't really seem to work right with the intrinsics either. The best practice I've found is just to use the untyped "qword" and the "si_" intrinsics, and cast to a vec_ type if you want to grab a value out (like ((vec_uint4)w)[0] ).
- SPU GCC seems to have the same problem as the PPU GCC and the Xenon MSVC compiler, in that all of them incorrectly optimize out
assignments to different types. I've never seen this problem on MSVC/x86 , but maybe it's just that x86 is pretty good about running
the same speed no matter what the type of the variable. The problem with the PowerPC and SPU is that performance turns out to be very
sensitive to data types. So, I have lots of code that does something like :
    int x = struct.array[7];
    for(... many .. )
    {
        .. do stuff with x ..
    }

and what I'm seeing is that the "do stuff with x" is generating a load from struct.array[7] every time, and the assignment into my temp has been eliminated completely. On the SPU this shows up when you have a struct member int m_x and you do something like :

    qword x = spu_promote( struct.m_x , 0 );
    for(... many .. )
    {
        .. do stuff with x ..
    }

and the compiler decides that rather than load m_x once and keep it in a register, it will reload it all the time. It's like the optimizer has a very early pass to merge variables that have just been renamed by assignment - but it is incorrectly not accounting for the side effects on code gen. Weird. Anyway this brings us to :

- The SPU has only vector registers and instructions. Despite that, basic int math is mostly okay. The "top" ([0]) int in a qword can
be used basically like an int register. You get all the basic ops on it and it runs full speed. The big problem is with loads and stores.
You can only load/store aligned quadwords, so accessing something like a U32 from a memory location involves some shuffling around.
rotqby solves your loads, and cbd/chd/cwd/cdd + shufb solves your stores.
One annoyance is that since only the top word can be used for things like loads, you often lose your SIMD. Like say you want to generate 4 addresses and do 4 loads, you can do your math on 4 channels, but then you have to copy the results out of channels 1-3 into other registers in slot 0 (using shuf or rotqby), and this often is more expensive than just doing the math 4 separate times. (because the instructions to transfer across lanes are slower than math).
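The aligned-load-plus-rotate idiom can be sketched in portable C (this is an emulation of what lqd + rotqby do, not actual SPU code; the function name is made up, and it assumes the U32 doesn't straddle a quadword boundary, which the real thing handles with a second load) :

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of the SPU load idiom : you may only load aligned 16-byte
   quadwords, so an unaligned U32 load becomes "load the enclosing quadword,
   then rotate the wanted bytes to the front".  The rotqby-style byte rotate
   is emulated with a loop here ; on the SPU it is one instruction. */
static uint32_t load_u32_spu_style(const uint8_t *base, size_t offset)
{
    /* base stands in for a quadword-aligned local store origin */
    const uint8_t *qw = base + (offset & ~(size_t)15); /* aligned quadword */
    size_t shift = offset & 15;                        /* byte offset in it */

    uint8_t rotated[16];
    /* emulate rotqby : rotate the quadword left by 'shift' bytes
       (wraps, so this assumes shift <= 12 for a U32) */
    for (int i = 0; i < 16; i++)
        rotated[i] = qw[(i + shift) & 15];

    uint32_t out;
    memcpy(&out, rotated, 4); /* the value is now in the "top" slot */
    return out;
}
```

Stores are the mirror image : load the quadword, build an insertion mask (cwd on the SPU), shufb the new bytes in, store the whole quadword back.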
- You can, unlike old SSE, do a lot of ops that are "horizontal" across the quadword. There are nice bit scatters and gathers, as well
as byte rotates, shuffles, etc. You can treat the qword as a big 128 bit register and rotate around the whole thing, but it's not fast;
a rotate left takes two instructions (rotqbi and rotqbybi) while a rotate right takes three (rotqbi, rotqbybi, + a subtract, since the SPU only rotates left and the count has to be negated).
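To pin down the operation itself, here's a 128-bit rotate-left done portably with two 64-bit halves (a sketch just to show what the byte-part + bit-part instruction pair computes; the type and function names are made up) :

```c
#include <stdint.h>

/* hi is the most significant half, matching the "top of the qword" idea */
typedef struct { uint64_t hi, lo; } u128;

static u128 rotl128(u128 v, unsigned n)
{
    n &= 127;
    if (n >= 64) { uint64_t t = v.hi; v.hi = v.lo; v.lo = t; n -= 64; }
    if (n == 0) return v;
    u128 r;
    /* each half takes its own shifted bits plus the spill from the other */
    r.hi = (v.hi << n) | (v.lo >> (64 - n));
    r.lo = (v.lo << n) | (v.hi >> (64 - n));
    return r;
}
```

On the SPU the byte-granular part of n goes to rotqbybi and the remaining 0-7 bits go to rotqbi.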
- So far as I can tell, __align_hint doesn't do anything for you. For review, __attribute__ ((aligned (16))) is the way you specify that a certain variable should be aligned by the compiler. __align_hint, on the other hand, lets the compiler know that a given pointer which was not declared as aligned in fact is aligned (you know this as a programmer). In theory it could generate better code from that, but I don't see it, which is related to :
- It certainly doesn't do magic about combining loads. For example, say you have a pointer to a struct that's struct { int x,y,z,w; }.
You tell it __align_hint and then you do struct.x ++ , struct.y ++. In theory, it should load that whole struct as one single quadword
load, do a bunch of work on it, then store it back as one quadword. It won't do that for you. The compiler generates separate loads and
shuffles to get out each member, so you have to do all this kind of thing by hand.
- The SPU has no branch predictor. Or rather it has a "zero bit" branch predictor - it always predicts branch not taken (unless you manually hint it, see
later). It is deeply pipelined, so it will run ahead on the expected branch, and then if you actually go the other way, it's a 20 clock penalty. That's not
nearly as bad as a missed branch on most chips, so it's nothing to tear up your code over (eg. it's less than an L1 miss on the PPU, which is 40 clocks!).
To make use of the static branch predictor, organize branches so the likely part is first (the "if" part is likely, the "else" is unlikely). You can override this with if (__builtin_expect(expr, 1/0)) , I use these :
    #define IF_LIKELY(expr)     if ( __builtin_expect(expr,1) )
    #define IF_UNLIKELY(expr)   if ( __builtin_expect(expr,0) )
but Mike says it's best just to inherently arrange your code. The way this is actually implemented on the SPU is with a branch hint instruction. The instruction sequence sent to the SPU will look like :

    branch hint - says br is likely
    .. stuff ..
    branch

if the branch was unlikely, there's no need for a hint, so the hint is only generated to tell the SPU that the branch is likely, eg. that the instruction fetch needs to jump ahead to the branch target. The branch hint needs to be 15 clocks or so before the branch. One compile option that turns out to be important is

    -mhint-max-nops=2

which tells GCC to insert nops to make the branch hint far enough away from the branch. Right now 2 is optimal for me, but YMMV, it's a number you have to fiddle with.

Now, in theory you can also do dynamic branch prediction. If you can compute the predicate for your branch early enough, you can see whether it will be taken or not, and either hint or not hint (obviously not deciding that with a branch, but with a cmov/select). I have not found a way to make the compiler generate this, so it seems to be reserved for pure assembly in super high performance code.
- Dual issue; The SPU is a lot like old Pentium, it has dual pipes and different types of instructions go on either pipe. Unfortunately, for data compression
almost everything I want is on the odd pipe - all the loads & stores, branches, and all the cross-quadword stuff (shuffles and rotqby and such) are odd pipe.
The instructions have to be aligned right (they have to alternate even and odd); if you're writing ASM you need to be super-aware of this; if you're just writing C, the compiler should do it for you;
with intrinsics, you're sort of in between. The compiler should theoretically schedule the intrinsics to make them line up with the right pipe, but if you only
give it odd pipe intrinsics it can't do anything. Because of that I saw odd things like :
    // these two instructions are much faster :
    qword vec_item_addr = si_a( vec_fastDecodeTable , vec_peek );
    qword vec_item = si_rotqby( vec_item_data , vec_item_addr );

    // than this one !?
    qword vec_item = si_rotqby( vec_item_data , vec_peek );

Part of the solution to that is to compile with "-mdual-nops=1" which lets the compiler insert nops to fix dual issue alignment. The =1 was the best for me, but again YMMV. BTW one problem with the SPU is that there is code size pressure (just to fit in the 256k local store), and all these nops do make the code size bigger.

ADDENDUM : just found an important issue with the dual pipe; instruction fetch is on the odd pipe (obvious I guess, all fetch is on the odd pipe). That means when you overload the odd pipe, not only are you failing to dual issue - you are fighting with instruction fetch for time. That's a disaster. So pretty much any time you can replace a load/store/rotq/shufb/etc. with math ops on the even pipe, it's a win, even if you need 2-3 even pipe math instructions to do the work of one odd pipe instruction.
- Optimizing for SPU is not as obvious as Insomniac and some others might lead you to believe. If you just replace C with
vector ops, you can easily make your code slower. The problem is that the latencies and dependency stalls are not obvious.
Latencies are mostly around 2-6 , with lots at 4, which seems short until you realize that 4 clocks = 8 instructions, which makes
it pretty hard to make sure you are not dependency-bound.
For example , this :
    4370: 42 7f ff ad     ila $45,65535        # ffff
    4374: 38 8d c8 33     lqx $51,$16,$55      // load quad for codelen
    4378: 18 0d c8 36     a $54,$16,$55        // 54 = address of codelen
    437c: 18 0d db b5     a $53,$55,$55        // 53 = peek*2
    4380: 1c 03 5b 34     ai $52,$54,13        // 52 = align byte for codelen
    4384: 38 83 9a b0     lqx $48,$53,$14      // load qw for table[peek]
    4388: 18 03 9a b2     a $50,$53,$14        // 50 = address of symbol
    438c: 3b 8d 19 af     rotqby $47,$51,$52   // rotate to get byte for codelen
    4390: 1c 03 99 31     ai $49,$50,14        // 49 = adjust to get word symbol
    4394: 14 3f d7 ac     andi $44,$47,255     // &= 0xFF to get byte 44 = codelen
    4398: 3b 8c 58 2e     rotqby $46,$48,$49   // rot by to get symbol
    439c: 3b 0b 06 2b     rotqbi $43,$12,$44   // use
    43a0: 08 03 56 0d     sf $13,$44,$13       // use bits
    43a4: 18 2b 57 07     and $7,$46,$45       // $7 = symbol , 45 = ffff
    43a8: 39 8b 15 8c     rotqbybi $12,$43,$44 // use

is faster than this :

    4360: 0f 60 d7 2c     shli $44,$46,3       // *= 8 to make table address
    4364: 38 8b 07 aa     lqx $42,$15,$44      // load qw for table
    4368: 18 0b 07 ab     a $43,$15,$44        // 43 = address of table item
    436c: 3b 8a d5 29     rotqby $41,$42,$43   // rotate qword to get two dwords at top
    4370: 08 02 d4 8b     sf $11,$41,$11       // use
    4374: 3b 0a 45 27     rotqbi $39,$10,$41   // use
    4378: 3f 81 14 87     rotqbyi $7,$41,4     // rotate for symbol
    437c: 39 8a 53 8a     rotqbybi $10,$39,$41 // use

Anyway, the point is that if you want to really get serious, you have to map out your dependency chains, look at the latency of each step, and try to minimize that, rather than just trying to reduce instruction count.
- One particularly annoying example of bad code gen. If you do this :
    int x,y;
    loop
    {
        if ( x <= y ) ...
    }

it will *sometimes* do the right thing (the right thing is to make x and y qwords and do all the int ops on the [0] element). However, sometimes it decides that they aren't important enough, and it leaves them as ints on the stack, in which case you get loads & shuffles to fetch them. More troubling is the fact that you cannot easily trick the compiler into fixing it. You might think that this :

    vec_int4 x,y;
    loop
    {
        if ( x[0] <= y[0] ) ...
    }

would do the trick - you're telling the compiler that I want them to be qwords, but I only use the top int part. Nope, that doesn't work; in fact it often makes things *much worse*. In theory, the compiler should be able to tell that I only ever touch x[0], so when I do

    x[0] ++

it should know that it can go ahead and increment the *whole* vector x with a plain old add (eg. also increment x[1],[2],[3]), but it doesn't always do that, and you wind up getting instructions to extract the portion you're working on. Sigh. The only reliable way I have found is to go ahead and use the qword intrinsics :

    qword x,y;
    loop
    {
        qword comp = si_clgt( x, y );
        if ( comp[0] ) ...
    }

but the problem with that is that you now have to be very careful about which instructions you choose for your operations - you can often be slower than the plain C using intrinsics, because you aren't thinking about scheduling and pairing. Also, once you start using intrinsics, you can't use too much of the [0] form of access any more, because of the above note. To be clear, what I would like is a way to say "I'm going to only work on this as an int, but please keep it in a vector, ignore the bottom 3 components, and generate whatever code is fastest to give me the [0] result", but there doesn't seem to be a reliable way to make the compiler do that. I dunno, maybe there's a way to get this.
8/02/2010
08-02-10 - Work
But more than that, work is like a mental tic that I sometimes indulge in. It's an autistic fugue. You go into this hole where all you can think about is technical issues, and it's horrible, but it also sort of feels good. Like taking a poo, or playing with a loose tooth. Then when I get into this state I just can't stop. I try to relax with N, but all I can think about is spu_shufb, and should I be using __align_hint ? And does DMA invalidate ll-sc reservations? And I have to go back to work.
I imagine it's a bit like having OCD. It's not like the OCD guy really wants to count the lines in the wood grain. But if he walks into the room and doesn't count them, it just eats at his brain - "must count wood grain" - repeating over and over. It's not like working really makes me happy; after a day of solid work I don't feel good, in fact quite the opposite - my brain feels fried from fatigue, and my body is in great pain from sitting too much - but I can't resist it, and if I don't work I just keep thinking "work work work".
08-02-10 - SPU Developing
The whole MSVC editor-is-my-debugger thing is such a massive win that it really hurts to step back from it to the bad old days of separate debugger.
I did one thing that sort of helps - I run my test app in a loop, and when it finishes each test, I unload my SPU image, wait for a key press, then reload it and repeat. This lets me rebuild my SPU ELF and just reload it and repeat. This avoids a lot of test cycle time, but only works until you have a crash so it fails a lot.
The ability to just do stdin/stdout to/from a console from the PS3 is pretty awesome. It lets me write my test apps as if they were just command line apps and run them from my PC.
7/31/2010
07-31-10 - GCC Scheduling Barrier
There's a common belief that an empty volatile asm in GCC is a scheduling barrier :
__asm__ volatile("")
but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec).
What I believe it does do is it splits the "basic blocks"
for compilation, but then after initial optimization there's another merge pass where basic blocks are
combined and they can in fact then schedule against each other.
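For what it's worth, the form GCC does document as a compiler scheduling barrier is the empty asm with a "memory" clobber (GCC/clang extended asm; a minimal sketch - the function name is made up) :

```c
/* The "memory" clobber tells GCC the asm may read or write memory, so it
   cannot cache memory values in registers across this point or move loads
   and stores past it.  Note this is a compiler barrier only : it emits no
   instruction and is not a hardware memory barrier. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

static int g_flag;

static int barrier_demo(void)
{
    g_flag = 1;
    COMPILER_BARRIER();   /* the store above cannot be sunk past this point */
    g_flag = 2;
    COMPILER_BARRIER();
    return g_flag;
}
```

The behavior is of course unchanged; the clobber only constrains what the optimizer may reorder or eliminate.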
The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give guarantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.
Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.
__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :
"In most cases, these builtins are considered a full barrier"
What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )
For example on PS3 __sync_val_compare_and_swap appears to be :
    sync
    lwarx
    .. loop ..
    stwcx
    isync

which means it is a full Seq_Cst operation, like a lock xchg on x86. That would be cool if I actually knew it was always that on all platforms, but failing to clearly document the guaranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics generate isync for Acquire, unlike Xenon which uses lwsync in that spot.)
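The value semantics of the builtin, at least, are easy to pin down (a minimal sketch, GCC/clang only; it's exactly the memory-ordering side that stays underdocumented) :

```c
#include <assert.h>

/* __sync_val_compare_and_swap(ptr, oldval, newval) : if *ptr == oldval,
   write newval ; either way, return the value *ptr had before the call. */
static int cas_demo(void)
{
    int v = 5;

    int old = __sync_val_compare_and_swap(&v, 5, 7);
    /* matched : v becomes 7, old is the previous value 5 */
    assert(old == 5 && v == 7);

    old = __sync_val_compare_and_swap(&v, 5, 9);
    /* mismatched : v is unchanged, old is the current value 7 */
    assert(old == 7 && v == 7);

    return v;
}
```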
(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).
I also stumbled on this thread where Linus rants about GCC .
I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :
    char x;
    x -= y;
    x += z;

What if x = -100 , y = 90 , and z = 130 ? You should get -100 - 90 + 130 = -60. But your intermediate was -190, which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and everything will work out right in the end, but under gcc that is not guaranteed.
See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow
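As a sanity check of the wrap-around arithmetic above, here's the sequence done in unsigned char, where the mod-256 wrap is fully defined (a sketch; the final conversion back to signed char is implementation-defined per the standard, though every mainstream two's complement compiler does the obvious thing) :

```c
/* Walk -100 - 90 + 130 through an 8-bit register.  The intermediate (-190)
   does not fit in 8 bits, but wrapping mod 256 at every step still lands on
   the "right" answer of -60. */
static signed char wrap_demo(void)
{
    unsigned char x = (unsigned char)-100; /* 156 */
    x -= 90;    /* 156 - 90  = 66  ( == -190 mod 256 ) */
    x += 130;   /* 66 + 130  = 196 ( == -60  mod 256 ) */
    return (signed char)x;  /* 196 reinterpreted -> -60 */
}
```

This is exactly the behavior you get "for free" from an 8-bit machine register, and exactly what gcc declines to promise for signed types.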
7/27/2010
07-27-10 - 2d arrays
Say you need to change a stack two dimensional array :

    int array[7][3];

into a dynamically allocated one, and you don't want to change any code. Do you know how? It's :

    int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);

It's a little bit cleaner if you use a typedef :

    typedef int array_t [3];

    int array[7][3];     // the original stack array
    array_t array[7];    // the same thing via the typedef
    array_t * array = (array_t *) malloc(7*sizeof(array_t));  // the heap version

those are all equivalent (*).
You can take a two dimensional array as function arg in a few reasonable ways :
void func(int array[7][3]) { }
void func(int (*array)[3]) { }
void func(int array[][3]) { }
function arg arrays are always passed by address.
2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.
(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).
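Putting the pieces above together (a self-contained sketch; the sizes 7 and 3 and the function names are just the ones from the example) :

```c
#include <stdlib.h>

enum { ROWS = 7, COLS = 3 };

/* takes the array as pointer-to-row : this signature accepts both the
   stack array and the heap version, since arrays decay to row pointers */
static int sum(int (*array)[COLS])
{
    int total = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            total += array[r][c];
    return total;
}

static int demo(void)
{
    /* heap version : one malloc, same [r][c] indexing as the stack array */
    int (*array)[COLS] = (int (*)[COLS]) malloc(sizeof(int) * COLS * ROWS);

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            array[r][c] = r * COLS + c;   /* fills with 0 .. 20 */

    int total = sum(array);
    free(array);
    return total;
}
```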
7/26/2010
07-26-10 - Virtual Functions
First of all, let's be clear that any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not just over-abstracting and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.
How C++ virtual functions are implemented on most compilers :
There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of
this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual
call?
(multiple and virtual inheritance ignored here for now).
    vtable pointer = obj->vtable;
    func_pointer = vtable->func;
    jump func_pointer
Why does this hurt ?
    vtable pointer = obj->vtable;

The load of the vtable pointer may be a cache miss, but that doesn't count against us, since you are working on obj anyway, so you have to take one cache miss there regardless.

    func_pointer = vtable->func;

Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).

    jump func_pointer

Then you jump to a variable address, which may or may not be able to use branch prediction, and can fuck up your CPU's pipelining.
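Those three steps can be spelled out with plain structs and function pointers (a sketch of roughly what the compiler emits, not actual compiler output; all names are made up) :

```c
/* Hand-rolled shared-vtable scheme : each class type has one static table
   of function pointers, and each object carries only a pointer to it. */
typedef struct Obj Obj;
typedef struct { int (*Func)(Obj *); } VTable;
struct Obj { const VTable *vtable; int data; };

static int BaseFunc (Obj *o) { return o->data + 0; }
static int ActorFunc(Obj *o) { return o->data + 100; }

static const VTable s_baseVTable  = { BaseFunc  };
static const VTable s_actorVTable = { ActorFunc };

static int VirtualCall(Obj *o)
{
    const VTable *vt = o->vtable;   /* load 1 : cheap-ish, obj is hot anyway */
    int (*fp)(Obj *) = vt->Func;    /* load 2 : the potential extra cache miss */
    return fp(o);                   /* indirect jump */
}
```

Objects of the same class share one table, which is the memory win; the second dependent load is the price.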
How can virtual calls be removed ?
- A good practice is to never expose virtuals directly. In the base class you should have something like :
void DoStuff(int x);
as the public interface, and then hidden lower down, something like :

    virtual void v_DoStuff(int x) = 0;

    void DoStuff(int x) { v_DoStuff(x); }

this gives you a single controlled call-through point which you can then make non-virtual or whatever at some point. (though, this screws up the next issue:)

- The compiler can do it better than you (in theory anyway). Virtual functions that aren't actually overridden anywhere can obviously be replaced with direct calls. But it's better than that - if you hold a pointer to some type, then as long as nothing *derived* from that type overrides the virtual, it can be called direct. Even if sibling or cousin or parent classes override that virtual, by holding the object by a derived type you can know that you are in a leaf part of the virtual call hierarchy.
No C++ compiler that I know of actually does this, because it requires knowledge of the class hierarchy of the whole program. But Java and C# are doing this now, so hopefully we will get it in C++ soon.
When/if we get this, it is a very powerful optimization, because you can then make functions that take concrete types and get joint dispatch, turning many virtual calls into just one. An example to be clear :

    class Base { virtual int Func() { return 0; } };
    class A : public Base { virtual int Func() { return 1; } };
    class B : public A { int m_data; };
    class C : public Base { virtual int Func() { return 2; } };

    void Test(A * obj) { obj->Func(); }

in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler, because no child of A overrides Func.

- Don't use the virtual call mechanism to hide implementation from interface. Some C++ coders will make a naked pure virtual interface class for their D3DHardware or ResourceManager or whatever, and then make a concrete class that implements those virtuals. This is pretty nice for code cleanliness, but gives you unnecessary virtual calls all over. Note that if we had compiler virtual-call elimination, this method would be fine, since all those virtuals could be eliminated - there is only one implementation of that interface in the program.
- Large/rare classes could use in-object vtables. In some cases, the space savings of sharing the vtable for all instances of a given class is
not a huge win, and in that case you'd rather have the vtable directly in line in the object, because it gives you one less cache miss.
There's no way to make the compiler do this for you, but you can do it reasonably easily with function pointers.
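One way to get the in-object variant with function pointers (a sketch; one pointer per virtual, stored inline, so a call costs one load from the already-hot object instead of a chase through a shared table - the names are made up) :

```cpp
// In-object "vtable" : the function pointers live directly in the object,
// trading memory per instance for one less dependent load on each call.
struct BigRareThing
{
    int (*DoStuffFn)(BigRareThing *, int);  // inline "vtable slot"
    int m_value;

    int DoStuff(int x) { return DoStuffFn(this, x); }
};

static int DoStuffNormal(BigRareThing * self, int x) { return self->m_value + x; }
static int DoStuffFancy (BigRareThing * self, int x) { return self->m_value * x; }
```

Each instance pays sizeof(pointer) per "virtual", which is why this only makes sense for large or rare objects.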
- They can be converted into normal branches. In the DoStuff() example, instead of calling v_DoStuff, you could just branch on the object type and call concrete functions. Like :

    void DoStuff(int x)
    {
        if ( this is Actor ) Actor::DoStuff(x);
        else Base::DoStuff(x);
    }

the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch, which can be predicted even on shitty chips. This is only practical if you have just a few overrides for the function. The shitty thing about this (and many of these patterns) is that it pushes knowledge of the derived class up to the base. One nice variant of this is :
- Rare overrides can use a test. If DoStuff() usually calls the default implementation, but once in a rare while needs an override, you can do that more efficiently with a pattern like :

    void DoStuff(int x)
    {
        if ( base is okay ) Base::DoStuff(x);
        else v_DoStuff(x);
    }

so we check a bool (which could be in a bitfield), and if it's set we use a direct call, else we fall back to the virtual.

- Pushing data up to the parent instead of calling down to the child. A good method is simply to eliminate the virtual through code flow redesign.
For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This at first seems like a good idea - not all objects have positions, and they might want to implement different ways of reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different object types push their different position into m_position. (For example silly people implement attachments by making GetPosition() query the parent position then add some offset; instead you should do that computation only once in your object update and store it into the shared m_position).
Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.
- Specialize for common cases. I sort of noted about this before, but when you're calling a bunch of virtuals
from something, you can change many virtuals into one by dispatching to a func that calls them concretely. eg.
Say you have a function like :

    void func(obj * o) { o->v1(); o->v2(); o->v3(); }

that does a bunch of virtual calls. Instead make a concrete-call version t_func which does no virtuals :

    template <typename T>
    void t_func(T * o) { o->T::v1(); o->T::v2(); o->T::v3(); }

    void func(obj * o)
    {
        // dispatch to the actual type of obj :
        t_func<typeof(o)>( o );
    }

the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable mechanism, which is annoying when you want to write a helper function and not add it to your class definition. There are general solutions to this (see for example in cblib the PrefDispatcher, which does this by creating a separate but parallel class hierarchy to do dispatch) but they are all a bit ugly. A better solution for most games is either to add func() to the vtable or to just know what concrete types you have and do manual dispatch.

- Note that just flattening your hierarchy is not generally a great solution. For example, rather than have a bunch of different types of Crate (ExplodingCrate, FlammableCrate) you might decide to just make one single UberCrate that can be any type of crate. This eliminates virtual calls when you are working on UberCrate, since it is just one concrete type, but it adds a lot of branches (if (crate.isExploding)), often makes the classes fatter in memory, etc. Making objects data-polymorphic instead of type-polymorphic may or may not be a win, depending on the details.
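A compilable version of the joint-dispatch pattern, with the dispatch done manually (a sketch; "typeof(o)" in the pseudo-code above isn't real C++, so a hand-rolled type tag stands in for it, and all names are made up) :

```cpp
struct obj
{
    bool m_isActor;  // manual type tag for the dispatch below
    explicit obj(bool isActor = false) : m_isActor(isActor) {}
    virtual ~obj() {}
    virtual int v1() { return 1; }
    virtual int v2() { return 2; }
};

struct Actor : obj
{
    Actor() : obj(true) {}
    virtual int v1() { return 10; }
    virtual int v2() { return 20; }
};

// Concrete-call version : the T:: qualification makes every call direct
// (non-virtual), so the dynamic dispatches inside collapse to zero.
template <typename T>
int t_func(T * o) { return o->T::v1() + o->T::v2(); }

// One manual dispatch replaces the two virtual calls.
int func(obj * o)
{
    if (o->m_isActor) return t_func(static_cast<Actor *>(o));
    return t_func(o);
}
```

The win scales with how many virtuals t_func makes: one branch up front buys any number of direct calls inside.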
- In general it's good practice to make queries as fast and non-virtual as possible, and push the virtuals to the updates.
07-26-10 - Code Issues
I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available. If they build in rrLZHOptimal, it should automatically provide that option.
I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which registers itself and sets that function pointer to the actual implementation.
This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :
1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that case you will be initializing everything with singletons and so on).
2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever. (this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by default, but with a __noremove__ modifier or something)). There are various tricks to address this but I don't think any of them is very nice. (*)
In general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call. I *HATE* the Install function pattern. It causes a lot of friction when starting a new project, because you have to remember to call all the right Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing, and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.
(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)
ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build - eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would force you to link in every module you might want to use, which is the thing you want to avoid.
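The self-registration pattern described above boils down to something like this (a sketch with made-up function names, not the actual RAD code; here both "files" live in one translation unit) :

```cpp
// "rrLZH.cpp" side : a function pointer that stays NULL unless the
// optional module gets linked into the build.
typedef int (*OptimalParseFn)(int level);
static OptimalParseFn g_optimalParse = 0;

int rrLZH_Compress(int level)
{
    if (g_optimalParse)            // optional module present : use it
        return g_optimalParse(level);
    return level;                  // fallback : fast parse only
}

// "rrLZHOptimal.cpp" side : a static object whose constructor runs at
// CINIT and installs the implementation - no Install() call needed.
static int OptimalParse(int level) { return level * 100; }

namespace {
struct OptimalParseRegistrar
{
    OptimalParseRegistrar() { g_optimalParse = &OptimalParse; }
} s_registrar;   // this is exactly the object linkers love to strip from libs
}
```

The fragility is all in that last comment: if the registrar's translation unit sits unreferenced in a .lib, the linker may drop it and the feature silently vanishes.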
For the PS3 people : what would be the ideal way for me expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS or some such? Or should I just provide code that builds for SPU and let you do your management?
I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same. The repro looks like this. In some header file I have a function sort of like this :
StupidFunction.h :
inline int StupidFunction()
{
return SOME_POUND_DEFINE;
}
Then in two different files I have :
A.cpp :
#define SOME_POUND_DEFINE (0)
#include "StupidFunction.h"
printf("%d\n",StupidFunction());
and :
B.cpp :
#define SOME_POUND_DEFINE (1)
#include "StupidFunction.h"
printf("%d\n",StupidFunction());
and what I get is that both printfs print the same thing (it's random whether it's 0 or 1, depending on build order).
If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and so on, so for all I know their behavior might be correct, but it's damn annoying. It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).
ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) , and example of GCC inline rules insanity
In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.
I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.
The best link that summarizes it IMO is the GCC note on CPP Traditional Mode that describes the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax, in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not need to retokenize).
I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use). It also means you can't do some of the more creative string munging things in the CPP that I enjoy.
In particular, in every CPP except GCC, this works :
    #define V(x) x
    #define CAT(a,b) V(a)V(b)

to concatenate two strings. Note that those strings can be *anything*, unlike the "##" operator, which under GCC has very specific annoying behavior: it must take a valid token on each side and produce a valid token as output (one and only one!).
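For contrast, the portable way that satisfies GCC is deferred ## pasting, which works everywhere as long as the result really is a single valid token (a minimal sketch; the names are made up) :

```c
/* The usual two-level expansion : CAT_ does the paste ; CAT exists so that
   macro arguments get expanded *before* pasting. */
#define CAT_(a,b) a##b
#define CAT(a,b)  CAT_(a,b)

#define PREFIX my_var
/* CAT(PREFIX, _count) expands to the single identifier my_var_count */
static int CAT(PREFIX, _count) = 42;

static int get_count(void) { return my_var_count; }
```

The limitation is exactly the one complained about above: paste two fragments that don't form one valid token and GCC rejects it, where a pure text-sub CPP would happily glue them together.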
In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be constants. For example, lots of code bases have something like "offsetof" :
/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

well, that's illegal under GCC for no damn good reason at all. So you have to do :
/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#else
/* The cast to "char &" below avoids problems with user-defined
   "operator &", which can appear in a POD type. */
#define offsetof(TYPE, MEMBER) \
  (__offsetof__ (reinterpret_cast<size_t> \
    (&reinterpret_cast<const volatile char &> \
      (static_cast<TYPE *> (0)->MEMBER))))
#endif /* C++ */

damn annoying. (code stolen from here ). The problem with this code under GCC is that a "type *" cannot be used in a constant expression.
A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template value argument, because it's just the same as an int. Not on GCC! Because, in fact, they actually implement the standard. There's a change in C++0x which is going to make NULL okay as a pointer template argument - but only NULL, which is also annoying because there are places I would use arbitrary values. (see for example 1 or 2 ).
ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :
template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table
{
that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to
set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC
it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is
to take those template args as ints and cast them to t_key type internally, but that fucking blows.
In general I like to use template args as a way to make the compiler generate different functions for various constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline problem example.
7/21/2010
07-21-10 - x86
It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. In order to make the in-order PPC chips you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.
1. Complex addressing.
This is way more huge than I ever expected. There are two important subaspects here :
1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]
1.B. being able to use memory locations directly in instructions instead of moving through registers.
Together these make it so that on x86 you don't have to fuck around with loading shit out to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables in locals.
e.g.
x = y;           is   mov eax, ecx
x = s.array[i];  is   mov eax, [eax + ecx*4 + 48h]
and those run at almost the same speed !
This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!
ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad, it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.
2. Out of order execution.
Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.
In particular, the entire method of referencing things through pointers is impossible in even moderately performant code without OOE.
The nasty thing that C++ (or any modern language really - Java, C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate, and it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?
Typical C++ code may be something like :

.. stuff ..
this->m_obj->func();
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :
.. stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
vtable->func();
.. stuff ..

And as long as you have enough stuff to do in between, it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).
ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr even if they might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific), etc. Furthermore, compile-time scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions for you, since it can schedule across the call. etc. etc.
ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.
Final addendum :
Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.
What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a regression in software engineering practice back to the bad old days when you had to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. Since I started coding on Amigas and 286's, where we had to worry about this shit, we moved into an enlightened age where algorithms were more important than micro bullshit - and in the last 20 years of progress that's what we built; now we have regressed.
At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.
There are lots of little places where they differ - for example on x86 you should write code to take advantage of complex addressing; you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.
Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.
Some other PPC notes :
1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.
2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :
{
int x = 0;
for( ... one million .. )
{
.. do lots of stuff using x ..
x = blah;
}
external_func(&x);
}
the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right
before the function :
{
//int x; // x = r7
r7 = 0;
for( ... one million .. )
{
.. do lots of stuff using r7 ..
r7 = blah;
}
stack_x = r7
external_func(&stack_x);
}
Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :
{
//int x; // x = r7
r7 = 0;
stack_x = r7;
for( ... one million .. )
{
.. do lots of stuff using r7 - stores to stack_x every time ! ..
r7 = blah;
stack_x = r7;
}
external_func(&stack_x);
}
The conclusion is the same one I came to last time :
When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never have to load from globals or constants (even loading static constants seems to be compiled very badly, so you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.
7/18/2010
07-18-10 - Mystery - Why no isync for Acquire on Xenon -
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming
Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :
First of all, a review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have :
lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).
lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every instruction you would have nice x86-like semantics.
In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.
sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.
Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.
isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).
isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :
"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."
All that "Acquire" memory semantics needs to enforce is #LoadLoad. So lwsync certainly does give you acquire because it has a #LoadLoad, but it also does a lot more that you don't need.
ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.