07-31-10 - GCC Scheduling Barrier

When implementing lock-free threading, you sometimes need a compiler scheduling barrier, which is weaker than a CPU instruction scheduling barrier, or a cache temporal ordering memory barrier.

There's a common belief that an empty volatile asm in GCC is a scheduling barrier :

 __asm__ volatile("")

but that appears to not actually be the case (* or rather, it is actually the case, but not according to spec). What I believe it does do is it splits the "basic blocks" for compilation, but then after initial optimization there's another merge pass where basic blocks are combined and they can in fact then schedule against each other.

The GCC devs seem to be specifically defending their right to schedule across asm volatile by refusing to give guarantees : gcc.gnu.org/bugzilla/show_bug.cgi?id=17884 or gcc-patches/2004-10/msg01048.html . In fact it did appear (in an unclear way) in the docs that they wouldn't schedule across asm volatile, but they claim that's a documentation error.
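For reference, the form of this barrier that usually gets recommended adds a "memory" clobber, which tells GCC that the asm may touch any memory, so it should not keep values cached in registers across it or move loads and stores past it. This is only a sketch and not something from the threads linked above: CompilerBarrier and PublishValue are made-up names, and the std::atomic_signal_fence fallback is the C++0x library call that in practice acts as a compiler-only fence. Neither path emits a CPU fence instruction.

```cpp
#include <atomic>

// Hypothetical helper: a compiler-only barrier. It emits no CPU fence.
inline void CompilerBarrier()
{
#if defined(__GNUC__)
    // empty asm with a "memory" clobber: GCC must assume all memory may have changed
    __asm__ __volatile__("" ::: "memory");
#else
    // C++0x fallback: a fence against the compiler (and signal handlers) only
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Illustration only: ask the compiler not to sink the data store below the flag store.
inline int PublishValue(int * data, volatile int * ready, int value)
{
    *data = value;
    CompilerBarrier();
    *ready = 1;
    return *data;
}
```

Note that this says nothing about what the CPU does with those two stores; on a weakly ordered chip you still need a real memory barrier for that.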

Now they also have the built-in "__sync" stuff . But I don't see any simple compiler scheduling barrier there, and in fact __sync has problems.

__sync is defined to match the Itanium memory model (!?) but then was ported to other platforms. They also do not document the semantics well. They say :

"In most cases, these builtins are considered a full barrier"

What do you mean by "full barrier" ? I think they mean LL,LS,SL,SS , but do they also mean Seq_Cst ? ( there also seem to be some bugs in some of the __sync implementations bugzilla/show_bug.cgi?id=36793 )

For example on PS3 __sync_val_compare_and_swap appears to be :

   .. loop ..

which means it is a full Seq_Cst operation like a lock xchg on x86. That would be cool if I actually knew it always was that on all platforms, but failing to clearly document the guaranteed memory semantics of the __sync operations makes them dangerous. (BTW as an aside, it looks like the GCC __sync intrinsics are generating isync for Acquire unlike Xenon which uses lwsync in that spot).

(note of course PS3 also has the atomic.h platform-specific implementation; the atomic.h has no barriers at all, which pursuant to previous blog Mystery - Does the Cell PPU need Memory Control might actually be the correct thing).

I also stumbled on this thread where Linus rants about GCC .

I think the example someone gives in there about signed int wrapping is pretty terrible - doing a loop like that is fucking bananas and you deserve to suffer for it. However, signed int wrapping does come up in other nasty unexpected places. You actually might be wrapping signed ints in your code right now and not know about it. Say I have some code like :

char x;
x -= y;
x += z;
What if x = -100 , y = 90 , and z = 130 ? You should have -100 - 90 + 130 = -60. But your intermediate was -190 which is too small for char. If your compiler just uses an 8 bit machine register it will wrap and it will all do the right thing, but under gcc that is not guaranteed.

See details on signed-integer-overflow in GCC and notes about wrapv and wrapv vs strict-overflow
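One way to sidestep the question is to make the wrap explicit instead of hoping the compiler does it for you. The sketch below is my own illustration, not something from the linked notes: wrap8 is a hypothetical helper that reduces an int to the signed 8-bit range by going through an unsigned type, where the conversion is defined to be modulo 2^8.

```cpp
#include <stdint.h>

// Hypothetical helper (not from the post): reduce an int to the signed 8-bit
// range with two's-complement wraparound, using only defined behavior.
inline int wrap8(int v)
{
    // int -> uint8_t conversion is defined to be modulo 2^8
    unsigned u = static_cast<uint8_t>(v);
    // map 128..255 back down to -128..-1 by hand
    return ( u < 128u ) ? static_cast<int>(u) : static_cast<int>(u) - 256;
}
```

With the numbers above, wrap8(-100 - 90) gives the wrapped intermediate and wrap8 of that plus 130 lands back on -60, with no step that depends on how the compiler treats signed overflow.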


07-29-10 - Bothersome

I think I've been working too much and I'm stressed out and probably shouldn't be taking it out on the internet.

One of the reasons that I never talk to people is that they almost always bring the conversation down to a low level. It's one of my greatest frustrations. Of course it happens in politics all the time, I want to talk about something like how we could actually get better regulation of corporations; obviously just putting Glass-Steagall back in place would be good, but you have to look at the underlying reason why that went away - too much political influence of the big banks, too much importance put on GDP; but besides that you have to ask why the free market is not regulating itself better. Why are private funds investing in hedge funds that charge such high fees and don't actually beat the S&P 500? Maybe it's just ignorance, or maybe they're getting kickbacks. And why aren't shareholders making sure that executives do what's in the best interest of the company? I think this is one of the most important things, there's a big problem with corporate boards and the whole shareholder-election process that isn't being addressed; boards are full of cronies and are failing in their oversight. When's the last time you ever heard of a board shaking up a major company because it was being run badly? Never!

Anyway, I want to talk about something interesting, and instead I get, "but regulation is bad" , uhh , okay, that view is fine, but how about some actual content, if you want to disagree tell me something interesting. No, "mmm I don't like the idea of bigger government". Umm, okay, why not? how about some ideas about how it could be controlled in other ways. No. The conversation is brought down to a boring low level.

Or, even when you're talking to smart people, they will often go into this annoying pedantic correction mode; they don't want to let you be right about anything so they pick some little irrelevant detail to squabble about, like "urm you didn't actually mean GDP, you meant GNP, that's a common mistake". Umm, okay, maybe that's true, if it is true you could make your correction interesting by explaining yourself a bit, or you could just shut the fuck up because it's not germane to the point I'm trying to make and it just drags the conversation down into semantics.

I also really don't like talking to people about a topic that they obviously don't care enough to actually learn. When I find somebody who really knows something about a topic and can teach me, I am ecstatic, I want to pick their brain, first of all I want to get references and do the reading so I can get up on the background material because I don't want to waste their time going over stuff anybody could teach me. I enjoy teaching myself, but find that people who actually want to learn are very rare. Most people just want to rant about some topic they don't actually know anything about. I'll offer "well if you'd like to learn I can point you to.." oh no, I'll just rant without learning thank you. Okay, I'm done with this conversation. Or you get the people who think they're an expert and because of that don't want to listen to anything you have to say. I mean, I know perfectly well that I think I'm an expert on lots of things that I'm probably not, but I still want to learn. If you actually have new insight that I haven't figured out, that's fantastic, please give it to me.

In other stressful news, one of the neighbors just bought their child a drum kit.

Despite my past complaints about the god damn home improvers, it's actually a very quiet neighborhood. For one thing we aren't afflicted by that common Seattle blight of being near "musicians". (Seattle has perhaps the highest per-capita of amateur bands in the US, which sounds good in theory but is actually fucking terrible, because they practice). It's kind of amusing when you just go for walks around the neighborhood; in our neighborhood there's a guy who plays accordion about a block away who's quite good actually, and a guy who jams on electric guitar really loud about two blocks away. (though it doesn't beat where I lived in SF, where down the block from me some older guys would hang out in their garage and play really good jazz).

Giving a child drums should really be illegal. I mean, you could practice on those "Rock Band" style fake drums until you're decent; once you're decent the sound is not so bad, but there's something in particular about an instrument being played badly that is just excruciating and hard to tune out.


07-27-10 - 2d arrays

Some little things I often forget and have to search for.

Say you need to change a stack two dimensional array :

int array[7][3];

into a dynamically allocated one, and you don't want to change any code. Do you know how? It's :

int (*array) [3] = (int (*) [3]) malloc(sizeof(int)*3*7);

It's a little bit cleaner if you use a typedef :

typedef int array_t [3];

int array[7][3];

array_t array[7];

array_t * array = (array_t *) malloc(7*sizeof(array_t));

those are all equivalent (*).

You can take a two dimensional array as function arg in a few reasonable ways :

void func(int array[7][3]) { }

void func(int (*array)[3]) { }

void func(int array[][3]) { }

function arg arrays are always passed by address.

2-d arrays are indexed [row][col] , which means the first index takes big steps and the second index takes small steps in memory.

(* = if your compiler is some fucking rules nazi, they are microscopically not quite identical, because array[rows][cols] doesn't actually have to be rows*cols ints all in a row (though I'm not sure how this would ever actually not be the case in practice)).
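Putting the pieces above together, here is a small sketch of the typedef form being allocated, indexed exactly like the stack array, and passed to a function that takes a pointer-to-row. row_t, SumRows and Demo2dArray are names I made up for the example.

```cpp
#include <cstdlib>

typedef int row_t[3];   // one row of 3 ints

// takes the array the same way the stack version would be passed
int SumRows(int (*array)[3], int rows)
{
    int total = 0;
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < 3; c++)
            total += array[r][c];
    return total;
}

// Hypothetical demo: fill a heap-allocated [7][3] array with 0..20 and sum it
int Demo2dArray()
{
    row_t * array = (row_t *) std::malloc(7 * sizeof(row_t));
    if ( ! array ) return -1;

    for (int r = 0; r < 7; r++)
        for (int c = 0; c < 3; c++)
            array[r][c] = r * 3 + c;   // same indexing as int array[7][3]

    int total = SumRows(array, 7);
    std::free(array);
    return total;
}
```

The only line that changes relative to the stack version is the declaration; every array[r][c] in the rest of the code stays as it was.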


07-26-10 - Virtual Functions

Previous post on x86/PPC made me think about virtual functions.

First of all, let's be clear that any serious game code base must have *some* type of dynamic dispatch (that is, data-dependent function calls). When people say "avoid virtual functions" it just makes me roll my eyes. Okay, assume I'm not a moron and I'm doing the dynamic dispatch for a good reason, because I actually need to do different things on different objects. The issue is just how you implement it.

How C++ virtual functions are implemented on most compilers :

    There's a "vtable" of function pointers associated with each class type. Individual objects have a pointer to their vtable. The advantage of this is that vtables are shared, but the disadvantage is that you get an extra cache miss. What actually happens when you make a virtual call? (multiple and virtual inheritance ignored here for now).
    vtable pointer = obj->vtable;
    func_pointer = vtable->func
    jump func_pointer

Why does this hurt ?

    vtable pointer = obj->vtable;
    The load of vtable pointer may be a cache miss, but that doesn't count against us: you are working on obj anyway, so you have to take one cache miss there regardless.
    func_pointer = vtable->func
    Then you fetch the func pointer, which is maybe a cache miss (if this type of object has not been used recently).
    jump func_pointer
    Then you jump to a variable address, which may or may not be able to use branch prediction, and which can fuck up your CPU's pipelining.
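To make those three steps concrete, here is a hand-rolled version of the same mechanism with plain structs and function pointers. It is only a sketch of the usual single-inheritance layout, not what any particular compiler emits, and all of the names (Obj, VTable, AddValue, kAddVTable, CallFunc) are invented for the example.

```cpp
struct Obj;

// one shared table of function pointers per "class"
struct VTable
{
    int (*func)(Obj * self, int x);
};

// each object carries a pointer to its class's table
struct Obj
{
    const VTable * vtable;
    int value;
};

static int AddValue(Obj * self, int x) { return self->value + x; }

static const VTable kAddVTable = { AddValue };

inline int CallFunc(Obj * obj, int x)
{
    const VTable * vtable = obj->vtable;          // load 1 : same line as obj, usually
    int (*func_pointer)(Obj *, int) = vtable->func; // load 2 : possible extra cache miss
    return func_pointer(obj, x);                  // jump to variable
}
```

Moving the func pointer out of VTable and straight into Obj gives you the in-object variant: one load fewer, at the cost of a pointer per virtual per instance.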

How can virtual calls be removed ?

  • A good practice is to never expose virtuals directly. In the base class you should have something like :
    void DoStuff(int x);
    as the public interface, and then hidden lower down, something like :
    virtual void v_DoStuff(int x) = 0;
    void DoStuff(int x) { v_DoStuff(x); }
    this gives you a single controlled call-through point which you can then make non-virtual or whatever at some point. (though, this screws up the next issue:)

  • The compiler can do it better than you (in theory anyway). Virtual functions that aren't actually overridden anywhere can obviously be replaced with direct calls. But it's better than that - if you hold a pointer to some type, as long as nothing *derived* from that type overrides the virtual, it can be called direct. Even if sibling or cousin or parent classes override that virtual, by holding it by a derived type you can know that you are in a leaf part of the virtual call.

    No C++ compiler that I know of actually does this, because it requires knowledge of the class hierarchy in the whole program. But Java and C# are doing this now, so hopefully we will get it in C++ soon.

    When/if we get this, it is a very powerful optimization, because you can then make functions that take concrete types and get joint-dispatch where you turn many virtual calls into just one. An example to be clear :

    class Base { public: virtual int Func() { return 0; } };
    class A : public Base { public: virtual int Func() { return 1; } };
    class B : public A { public: int m_data; };
    class C : public Base { public: virtual int Func() { return 2; } };
    void Test(A * obj) { obj->Func(); }
    in this case the virtual call to Func() can be replaced with the direct call A::Func() by the compiler because no child of A overrides Func.

  • Don't use the method of virtual calls to hide implementation from interface. Some C++ coders will make a naked pure virtual interface class for their D3DHardware or ResourceManager or whatever, and then make a concrete class that implements those virtuals. This is pretty nice for code cleanliness, but gives you virtual calls all over that are unnecessary. Note that if we had compiler-virtual-elimination this method would be fine since all those virtuals could be eliminated as there is only one implementation of that interface in the program.

  • Large/rare classes could use in-object vtables. In some cases, the space savings of sharing the vtable for all instances of a given class is not a huge win, and in that case you'd rather have the vtable directly in line in the object, because it gives you one less cache miss. There's no way to make the compiler do this for you, but you can do it reasonably easily with function pointers.

  • They can be converted into normal branches. In the DoStuff() example, instead of calling v_DoStuff, you could just branch on object type and call some concrete functions. Like :
    void DoStuff(int x)
        if ( this is Actor ) Actor::DoStuff(x);
        else Base::DoStuff(x);
    the advantage of this is that you avoid a cache miss for the vtable, and you use a normal branch which can be predicted even on shitty chips.

    This is only practical if you have only a few overrides for the function. The shitty thing about this (and many of these patterns) is that it pushes knowledge of the derived class up to the base. One nice variant of this is :

  • Rare overrides can use a test. If DoStuff() usually calls the default implementation, but once in a rare while needs an override, you can do that more efficiently with a pattern like :
    void DoStuff(int x)
        if ( base is okay ) Base::DoStuff(x);
        else v_DoStuff(x);
    so we check a bool (could be in a bitfield) and if so we use direct call, else we call the virtual.

  • Pushing data up to parent instead of calling down to child. A good method is simply to eliminate the virtual through code flow redesign.

    For example, a lot of bad game engines might have something like "virtual GetPosition()" on the base object. This at first seems like a good idea - not all objects have positions, and they might want to implement different ways of reporting it. In fact it is a bad idea, and you should just store m_position as public data, then have the different object types push their different position into m_position. (For example silly people implement attachments by making GetPosition() query the parent position then add some offset; instead you should do that computation only once in your object update and store it into the shared m_position).

    Most good games have a chunk of the most common data stored directly in the base object type so it can be accessed without virtuals.

  • Specialize for common cases. I sort of noted about this before, but when you're calling a bunch of virtuals from something, you can change many virtuals into one by dispatching to a func that calls them concretely. eg. Say you have a function like :
    void func(obj * o)
    that does a bunch of virtual calls. Instead make a concrete call version t_func which does no virtuals :
    template < typename T >
    void t_func(T * o)
    void func(obj * o)
        dispatch to actual type of obj :
           t_func < typeof(o) >( o );
    the difficulty is the dispatching from func to t_func. C++ has no mechanism to do type-dependent dispatch other than the vtable mechanism, which is annoying when you want to write a helper function and not add it to your class definition. There are general solutions to this (see for example in cblib the PrefDispatcher which does this by creating a separate but parallel class hierarchy to do dispatch) but they are all a bit ugly. A better solution for most games is to either add func() to the vtable or just to know what concrete types you have and do manual dispatch.
  • Note that just flattening your hierarchy is not generally a great solution. For example, rather than have a bunch of different types of Crate (ExplodingCrate, FlammableCrate) you might decide to just make one single UberCrate that can be any type of crate. This eliminates virtual calls when you are working on UberCrate since it is just one concrete type, but it adds a lot of branches (if (crate.isExploding)) , and often makes the classes fatter in memory, etc. Making objects data-polymorphic instead of type-polymorphic may or may not be a win depending on the details.
  • In general it's good practice to make queries as fast and non-virtual as possible, and push the virtuals to the updates.
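As a sketch of two of the patterns above working together (the non-virtual public DoStuff wrapper plus the "rare overrides can use a test" check), something like the following would do. The class names, the m_baseIsOkay flag and the arithmetic in the bodies are all placeholders I picked for the example.

```cpp
class Base
{
public:
    Base() : m_baseIsOkay(true) { }
    virtual ~Base() { }

    // the single public, non-virtual call-through point
    int DoStuff(int x)
    {
        // a qualified name makes this a direct call that never touches the vtable
        if ( m_baseIsOkay ) return Base::v_DoStuff(x);
        // rare path: real virtual dispatch
        else return v_DoStuff(x);
    }

protected:
    virtual int v_DoStuff(int x) { return x + 1; }

    bool m_baseIsOkay;   // could live in a bitfield
};

class RareActor : public Base
{
public:
    RareActor() { m_baseIsOkay = false; }   // opt in to the virtual path

protected:
    virtual int v_DoStuff(int x) { return x * 10; }
};
```

The common case costs one predictable branch on a bool that is in the object you were touching anyway; only the rare types pay for the vtable fetch and the jump to variable.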

07-26-10 - Jeebus

God damn my landlord is unreasonable. I hate all you people so fucking much. Just leave me alone please. I'm really sick of living in someone else's house. I want to be able to do whatever I want with my home.

I dunno, maybe I should just go ahead and buy a house up here. It's okay here I guess, though I don't know if I can stand the winters, or certain other drawbacks. Plus where I want to live if I'm single vs. married with kids is very different.

I know in my head that people who are successful in life are people who just choose a certain plan and commit to it as if they were sure, even though that's totally illogical and there's no reason to be sure of it. You have to act as if you are going to stick with something; every house you move into you should treat as if you are going to be there for life. People who hesitate or hedge tend to get nowhere.

07-26-10 - Code Issues

How do I make a .c/.cpp file that's optional? eg. if you don't build it into your project, then you just don't get the functionality in it, but if you do, then it magically turns itself on and gives you more goodies.

I'll give you a particular example to be concrete, though this is something I often want to do. In the RAD LZH stuff I have various compressors. One is a very complex optimal parser. I want to put that in a separate file. People should be able to just include rrLZH.cpp and it will build and run fine, but the optimal parser will not be available. If they build in rrLZHOptimal, it should automatically provide that option.

I know how to do this in C++. First rrLZH has a function pointer to the rrLZHOptimal which is statically initialized to NULL. The rrLZHOptimal has a CINIT class which registers itself and sets that function pointer to the actual implementation.

This works just fine (it's a standard C++ self-registration paradigm), but it has a few problems in practice :

1. You can run into order-of-initialization issues if you aren't careful. (this is not a problem if you are a good C++ programmer and embrace proper best practices; in that case you will be initializing everything with singletons and so on).

2. It's not portable because of the damn linkers that don't recognize CINIT as a binding function call, so the module can get dropped if it's in a lib or whatever. (this is the main problem ; it would have been nice if C++0x had defined a way to mark CINIT constructions as being required in the link (not by default, but with a __noremove__ modifier or something)). There are various tricks to address this but I don't think any of them is very nice. (*)
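A minimal sketch of that self-registration setup, squeezed into one file so it stands alone. The function names, the signature and the fake "compression" results are all hypothetical and have nothing to do with the real rrLZH code; the hook lives in a function-local static to dodge problem 1. It does nothing about problem 2: if the registrar sits in a lib and nothing references it, the linker can still drop it.

```cpp
#include <cstddef>

// ---- would live in rrLZH.cpp (all names here are made up) ----

typedef int (*OptimalParseFunc)(const char * data, int len);

// function-local static: initialized on first use, so registration order doesn't matter
OptimalParseFunc & OptimalParseHook()
{
    static OptimalParseFunc s_hook = NULL;
    return s_hook;
}

int Compress(const char * data, int len)
{
    // use the optional module if it registered itself
    if ( OptimalParseHook() ) return OptimalParseHook()(data, len);
    return len;   // placeholder for the built-in parser
}

// ---- would live in rrLZHOptimal.cpp ----

static int OptimalParse(const char *, int len) { return len / 2; }   // placeholder

// the CINIT object: its constructor runs before main and installs the hook
struct OptimalRegistrar
{
    OptimalRegistrar() { OptimalParseHook() = OptimalParse; }
};
static OptimalRegistrar s_optimalRegistrar;
```

Delete the bottom half and Compress still builds and runs, it just takes the fallback path.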

In general I like this pattern a lot. The more portable version of this is to have an Install() function that you have to manually call. I *HATE* the Install function pattern. It causes a lot of friction to making a new project because you have to remember to call all the right Installs, and you get a lot of mysterious failures where some function just doesn't work and you have to go back and install the right thing, and you have to install them in the right order (C++ singleton installs mostly take care of order for you). etc.

(* : this is one of those issues that's very annoying to deal with as a library provider vs. an end-application developer. As an app developer you just decide "this is how we're doing it for our product" and you have a method that works for you. As a library developer you have to worry about people not wanting to use the method you have found that works, and how things might behave under various compilers and usage patterns. It sucks.)

ADDENDUM : the problem with the manually-calling Install() pattern is that it puts the selection of features in the wrong & redundant place. There is one place that I want to select my modules, and that is in my build - eg. which files I compile & link, not in the code. The problem with it being in the code is that I can't create shared & generic startup code that just works. I wind up having to duplicate startup code to every app, which is very very bad for maintainability. And of course you can't make a shared "Startup()" function because that would force you to link in every module you might want to use, which is the thing you want to avoid.

For the PS3 people : what would be the ideal way for me expose bits of code that can be run on the SPU? I'm just not sure what people are actually using and how they would like things to be presented to them. eg. should I provide a PPU function that does all the SPU dispatching for you and do all my own SPU management? Is it better if I go through SPURS or some such? Or should I just provide code that builds for SPU and let you do your management?

I've been running into a problem with the MSVC compiler recently where it is incorrectly merging functions that aren't actually the same. The repro looks like this. In some header file I have a function sort of like this :

StupidFunction.h :

inline int StupidFunction()

Then in two different files I have :
A.cpp :

#define SOME_POUND_DEFINE  (0)
#include "StupidFunction.h"


and :
B.cpp :

#define SOME_POUND_DEFINE  (1)
#include "StupidFunction.h"


and what I get is that both printfs print the same thing (random whether it's 0 or 1 depending on build order).

If I put "static" on StupidFunction() it fixes this and does the right thing. I have no idea what the standard says about compilation units and inlines and merging and so on, so for all I know their behavior might be correct, but it's damn annoying. It appears that the exact definition of inline changed in C99, and in fact .cpp and .c have very different rules about inlines (apparently you can extern an inline which is pretty fucked up). (BTW the whole thing with C99 creating different rules that apply to .c vs .cpp is pretty annoying).

ADDENDUM : see comments + slacker.org advice about inline best practice (WTF, ridiculous) , and example of GCC inline rules insanity

In other random code news, I recently learned that the C preprocessor (CPP) is not what I thought.

I always thought of CPP as just a text substitution parser. Apparently that used to be the case (and still is the case for many compilers, such as Comeau and MSVC). But at some point some new standard was introduced that makes the CPP more tightly integrated with the C language. And of course those standards-nazis at GCC now support the standard.

The best link that summarizes it IMO is the GCC note on CPP Traditional Mode that describes the difference between the old and new GCC CPP behavior. Old CPP was just text-sub, New CPP is tied to C syntax, in particular it does tokenization and is allowed to pass that tokenization directly to the compiler (which does not need to retokenize).

I guess the point of this is to save some time in the compile, but IMO it's annoying. It means that abuse of the CPP for random text-sub tasks might not work anymore (that's why they have "traditional mode", to support that use). It also means you can't do some of the more creative string munging things in the CPP that I enjoy.

In particular, in every CPP except GCC, this works :

#define V(x) x
#define CAT(a,b)  V(a)V(b)

to concatenate two strings. Note that those strings can be *anything* , unlike the "##" operator which under GCC has very specific annoying behavior in that it must take a valid token on each side and produce a valid token as output (one and only one!).
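For the narrower case where both sides are valid tokens and the result is a single valid token, the standard "##" paste works on GCC and everywhere else, as long as you go through a second macro so the arguments get expanded first. A small sketch (CAT_IMPL, SLOT and the value_ names are just for illustration); it does not help with the arbitrary-text case described above.

```cpp
// Two-level paste: CAT expands its arguments, CAT_IMPL does the actual ##
#define CAT_IMPL(a,b)  a##b
#define CAT(a,b)       CAT_IMPL(a,b)

#define SLOT 2   // hypothetical macro, just for the demo

static const int CAT(value_, 1)    = 11;   // declares value_1
static const int CAT(value_, SLOT) = 22;   // SLOT expands first : declares value_2
```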

In further "GCC strictness is annoying", it's fucking annoying that they enforce the rule that only ints can be constants. For example, lots of code bases have something like "offsetof" :

/* Offset of member MEMBER in a struct of type TYPE. */
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

well, that's illegal under GCC for no damn good reason at all. So you have to do :

/* Offset of member MEMBER in a struct of type TYPE. */
#ifndef __GNUC__
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
#else
/* The cast to "char &" below avoids problems with user-defined
   "operator &", which can appear in a POD type.  */
#define offsetof(TYPE, MEMBER)                                  \
  (__offsetof__ (reinterpret_cast <size_t>                      \
                 (&reinterpret_cast <const volatile char &>     \
                  (static_cast<TYPE *> (0)->MEMBER))))
#endif /* C++ */

damn annoying. (code stolen from here ). The problem with this code under GCC is that a "type *" cannot be used in a constant expression.
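If you don't need your own macro, the offsetof that ships in the standard header is the low-friction route, since the compiler's own version is required to be usable as a constant. A quick sketch; Packet and header_bytes_t are made-up names for the example.

```cpp
#include <cstddef>

struct Packet   // hypothetical POD struct for illustration
{
    char   tag;
    int    length;
    double payload;
};

// usable as a compile-time constant, e.g. as an array bound
typedef char header_bytes_t[ offsetof(Packet, payload) ];

inline std::size_t LengthOffset() { return offsetof(Packet, length); }
```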

A similar problem comes up in templates. On every other compiler, a const pointer can be used as a template value argument, because it's just the same as an int. Not on GCC! In fact because they actually implement the standard, there's a new standard for C++0x which is going to make NULL okay, but only NULL which is also annoying because there are places I would use arbitrary values. (see for example 1 or 2 ).

ADDENDUM : a concrete example where I need this is my in-place hash table template. It's something like :

template < typename t_key,t_key empty_val,t_key deleted_val >
class hash_table

that is, I hash keys of type t_key and I need a value for "empty" and "deleted" special keys for the client to set. This works great (and BTW is much faster than the STL style of hash_map for many usage patterns), but on GCC it doesn't work if t_key is "char *" because you can't template const pointer values. My work-around for GCC is to take those template args as ints and cast them to t_key type internally, but that fucking blows.
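Roughly, the work-around looks like this. It is a cut-down sketch of just the sentinel handling, not the real hash table: hash_sentinels and its members are hypothetical names, and using intptr_t for the integer arguments is my assumption so that pointer-sized values fit.

```cpp
#include <stdint.h>

// Hypothetical stand-in for the special-key part of the table : the values
// arrive as integers and get cast to t_key internally.
template < typename t_key, intptr_t t_empty, intptr_t t_deleted >
struct hash_sentinels
{
    static t_key empty_val()   { return (t_key) t_empty; }
    static t_key deleted_val() { return (t_key) t_deleted; }

    static bool is_special(t_key k)
    {
        return k == empty_val() || k == deleted_val();
    }
};

typedef hash_sentinels< int, -1, -2 >         int_sentinels;
typedef hash_sentinels< const char *, 0, 1 >  str_sentinels;
```

The cast is the ugly part: the client now has to express a pointer sentinel as a raw integer, which is exactly the sort of thing the typed template argument was supposed to avoid.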

In general I like to use template args as a way to make the compiler generate different functions for various constant values. It's a much cleaner way than the #define / #include method that I used above in the static/inline problem example.
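For comparison, here is what the StupidFunction case could look like with the constant passed as a template argument. The body is a placeholder I invented (the point is only that it depends on the constant), and CallFromA / CallFromB stand in for whatever A.cpp and B.cpp do. As far as I understand it, the #define version gets into trouble because an inline function with external linkage has to have the same definition in every translation unit; with the template each value is a separate function with its own name, so there is nothing to merge.

```cpp
// Hypothetical replacement for the #define / #include trick : the constant
// becomes a template argument, so each value is a distinct function.
template < int t_someValue >
inline int StupidFunction()
{
    return t_someValue ? 100 : 200;   // placeholder body
}

// what A.cpp and B.cpp would call
inline int CallFromA() { return StupidFunction<0>(); }
inline int CallFromB() { return StupidFunction<1>(); }
```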


07-21-10 - x86

x86 is really fucking wonderful and it's a damn shame that we don't have it on all platforms. (addendum : I don't really mean x86 the ISA, I mean x86 as short hand for the family of modern processors that run x86; in particular P-Pro through Core i7).

It's not just that the chips are super fast and easy to use. It's that they actually encourage good software engineering. In order to make the in-order PPC chips run fast you have to abandon everything you've learned about good software practices in the last 20 years. You can't abstract or encapsulate. Everything has to be in locals, every function has to be inline.

1. Complex addressing.

This is way more huge than I ever expected. There are two important subaspects here :

1.A. being able to do addition and simple muls in the addressing, eg. [eax + ecx*2]

1.B. being able to use memory locations directly in instructions instead of moving through registers.

Together these make it so that on x86 you don't have to fuck around with loading shit out to temporaries. It makes working on variables in structs almost exactly the same speed as working on variables in a local.


  x = y;


  mov eax, ecx;


  x = s.array[i];


  mov eax, [eax + ecx*4 + 48h]

and those run at almost the same speed !

This is nice for C and accessing structs and arrays of course, but it's especially important for C++ where lots of things are this-> based. The compiler keeps "this" in a register, and this-> references run at the same speed as locals!

ADDENDUM : the really bad issue with the current PPC chips is that the pipeline from integer computations to load/stores is very bad, it causes a full latency stall. If you have to compute an address and then load from it, and you can't get other instructions in between, it's very slow. The great thing about x86 is not that it's one instruction, it's that it's fast. Again, to be clear, the point here is not that CISC is better or whatever, it's simply that having fast complex addressing you don't have to worry about changes the way you write code. It lets you use structs, it lets you just use for(i) loops and index off i and not worry about it. Instead on the PPC you have to worry about things like indexing byte arrays is faster than any other size, and if you're writing loops and accessing dword arrays maybe you should be iterating with a pointer instead of an index, or maybe iterate with index*4, or whatever.
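To show the kind of rewrite that last sentence is talking about, here is the same dword-array loop written with an index and with a pointer. This is only an illustration of the two source forms; the function names are made up, a decent compiler may well turn one into the other on its own, and I'm making no claim about which wins on any given chip.

```cpp
#include <cstddef>
#include <stdint.h>

// index form : the load address is base + i*4, computed from the loop counter
inline uint32_t SumByIndex(const uint32_t * data, std::size_t count)
{
    uint32_t sum = 0;
    for (std::size_t i = 0; i < count; i++)
        sum += data[i];
    return sum;
}

// pointer form : the pointer itself is the loop variable, so there is no
// separate integer computation feeding the load address
inline uint32_t SumByPointer(const uint32_t * data, std::size_t count)
{
    uint32_t sum = 0;
    for (const uint32_t * p = data, * end = data + count; p != end; ++p)
        sum += *p;
    return sum;
}
```

On x86 you just write the first one and never think about it, which is the whole point.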

2. Out of order execution.

Most people think of OOE as just making things faster and letting you be a bit lazy about code gen and stalls and so on. Yes, that is true, but there's a more significant point I think : OOE makes C++ fast.

In particular, the entire method of referencing things through pointers is impossible in even moderately performant code without OOE.

The nasty thing that C++ (or any modern language really, Java ,C#, etc. are actually much much worse in this regard) does is make you use a lot of pointers, because your data types may be polymorphic or indeterminate, it's often hard to hold them by value. Many people think that's a huge performance problem, but on the PC/x86 it actually doesn't hurt much. Why?

Typical C++ code may be something like :

.. stuff ..
.. stuff ..

this may involve several dependent memory fetches. On an in-order chip this is stall city. With OOE it can get rearranged to :

..stuff ..
temp = this->m_obj;
.. stuff ..
vtable = temp->vtable;
.. stuff ..
.. stuff ..

And as long as you have enough stuff to do in between it's no problem. Now obviously doing lots of random calls through objects and vtables in a row will still make you slow, but that's not a common C++ pattern and it's okay if that's slow. But the common pattern of just getting a class pointer from somewhere then doing a bunch of stuff on it is fast (or fast enough for not-super-low-level code anyway).

ADDENDUM : obviously if your code path was completely static, then a compile-time scheduler could do the same thing. But your code path is not static, and the caches have basically random delays because other threads might be using them too, so no static scheduling can ever be as good. And even beyond that, the compiler is just woefully unable to handle scheduling for these things. For example, to schedule as well as OOE can, you would have to do things like speculatively read ptr and *ptr even if it might only be needed if a certain branch is taken (because if you don't do the prefetching the stall will be horrific) etc. Furthermore, the scheduling can only really compete when all your functions are inline; OOE sort of inlines your functions for you since it can schedule functions across the jump. etc. etc.

ADDENDUM : 3. Another issue that I think might be a big one is the terrible penalty for "jump to variable" on PPC. This hits you when you do a switch() and also when you make virtual calls. It can only handle branch prediction for static branches, there's no "branch target predictor" like modern x86 chips have. Maybe I'll write a whole post about virtual functions.

Final addendum :

Anyway, the whole point of this post was not to make yet another rant about how current consoles are slow or bad chips. Everyone knows that and it's old news and boring.

What I have realized and what I'm trying to say is that these bad old chips are not only slow - much worse than that! They cause a regression in software engineering practice back to the bad old days when you have to worry about shit like whether you pre-increment or post-increment your pointers. They make clean, robust, portable programming catastrophically inefficient. All the things we have made progress on in the last 20 years, since I started coding on Amigas and 286's where we had to worry about this shit, we moved into an enlightened age where algorithms were more important than micro bullshit, and now we have regressed.

At the moment, the PowerPC console targets are *SO* much slower than the PC, that the correct way to write code is just to write with only the PowerPC in mind, and whatever speed you get on x86 will be fine. That is, don't think about the PC/x86 performance at all, just 100% write your code for the PPC.

There are lots of little places where they differ - for example on x86 you should write code to make use of complex addressing, you can have fewer data dependencies if you just set up one base variable and then do lots of referencing off it. On PPC this might hurt a lot. Similarly there are quirks about how you organize your branches or what data types you use (PPC is very sensitive to the types of variables), alignment, how you do loops (preincrement is better for PPC), etc.

Rather than bothering with #if __X86 and making fast paths for that, the right thing is just to write it for PPC and not sweat the x86, because it will be like a bazillion times faster than the PPC anyway.

Some other PPC notes :

1. False load-hit-stores because of the 4k aliasing is an annoying and real problem (only the bottom bits of the address are used for LHS conflict detection). In particular, it can easily come up when you allocate big arrays, because the allocators will tend to give you large memory blocks on 4k alignment. If you then do a memcpy between two large arrays you will get a false LHS on every byte! WTF !?!? The result is that you can get wins by randomly offsetting your arrays when you know they will be used together. Some amount of this is just unavoidable.

2. The (unnamed console) compiler just seems to be generally terrible about knowing when it can keep things in registers and when it can't. I noted before about the failure to load array base addresses, but it also fucks up badly if you *EVER* call a function using common variables. For example, say you write a function like this :

int x = 0;

  for( ... one million .. )
    .. do lots of stuff using x ..
    x = blah;


the correct thing of course is to just keep x in a register through the whole function and not store its value back to the stack until right before the function returns :

//int x; // x = r7
r7 = 0;

  for( ... one million .. )
    .. do lots of stuff using r7 ..
    r7 = blah;

stack_x = r7

Instead what I see is that a store to the stack is done *every time* x is manipulated in the function :

//int x; // x = r7
r7 = 0;
stack_x = r7;

  for( ... one million .. )
    .. do lots of stuff using r7 - stores to stack_x every time ! ..
    r7 = blah;
    stack_x = r7;


The conclusion is the same one I came to last time :

When you write performance-critical code, you need to completely isolate it from function calls, setup code, etc. Try to pass in everything you need as a function argument so you never have to load from globals or constants (even loading static constants seems to be compiled very badly, you have to pass them in to make sure they get into registers), and do everything inside the function on locals (which you never take the address of). Never call external functions.
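A minimal sketch of that shape, in C++. The names (WeightedSumHot, s_weights) are made up for illustration, and whether it actually changes the code gen depends entirely on the compiler; the point is only the structure: the hot loop takes everything as arguments, works on locals, and calls nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical constant table that would normally be read straight from static memory.
static const uint8_t s_weights[4] = { 1, 2, 3, 4 };

// Hot loop, isolated : inputs arrive as arguments, all work is on locals
// whose address is never taken, and no external functions are called.
static uint32_t WeightedSumHot(const uint8_t * data, size_t count,
                               uint32_t w0, uint32_t w1, uint32_t w2, uint32_t w3)
{
    assert(count % 4 == 0); // sketch only handles whole groups of 4
    uint32_t sum = 0;
    for (size_t i = 0; i + 4 <= count; i += 4)
    {
        sum += data[i + 0] * w0 + data[i + 1] * w1
             + data[i + 2] * w2 + data[i + 3] * w3;
    }
    return sum;
}

// Setup wrapper : the only place that touches the static table.
uint32_t WeightedSum(const uint8_t * data, size_t count)
{
    return WeightedSumHot(data, count,
                          s_weights[0], s_weights[1], s_weights[2], s_weights[3]);
}
```

The wrapper does the loads from static memory once, outside the loop, so the inner function has no reason to go back to memory for them.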


07-18-10 - Mystery - Why no isync for Acquire on Xenon -

The POWER architecture docs say that to implement Acquire memory constraint, you should use "isync". The Xbox 360 claims they use "lwsync" to enforce Acquire memory constraint. Which is right? See :

Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Example POWER Implementation for C/C++ Memory Model
PowerPC storage model and AIX programming

Review of the PPC memory control instructions in case you're a lazy fucker who wants to butt in but not actually read the links that I post :

First of all, a review of the PPC memory model. Basically it's very lazy. We are dealing with in-order cores, so the load/store instructions happen in order, but the caches and store buffers are not kept temporally in order. That means an earlier load can get a newer value, and stores can be delayed in the write queue. The result is that loads & stores can go out of order arbitrarily unless you specifically control them. (* one exception is that "consume" order is guaranteed, as it is on all chips but the Alpha; that is, *ptr is always newer than ptr). To control ordering you have :

lwsync = #LoadLoad barrier, #LoadStore barrier, #StoreStore barrier ( NOT #StoreLoad barrier ) ( NOT Sequential Consistency ).

lwsync gives you all the ordering that you have automatically all the time on x86 (x86 gives you every barrier but #StoreLoad for free). If you put an lwsync after every instruction you would have a nice x86-like semantics.

In a hardware sense, lwsync basically affects only my own core; it makes me sequentialize my write queue and my cache reads, but doesn't cause me to make a sync point with all other cores.

sync = All barriers + Sequential Consistency ; this is equivalent to a lock xchg or mfence on x86.

Sync makes all the cores agree on a single sync point (it creates a "total order"), so it's very expensive, especially on very-many-core systems.

isync = #LoadLoad barrier, in practice it's used with a branch and causes a dependency on the load used in the branch. (note that atomic ops use loadlinked-storeconditional so they always have a branch there for you to isync on). In a hardware sense it causes all previous instructions to finish their loads before any future instructions start (it flushes pipelines).

isync seems to be the perfect thing to implement Acquire semantics, but the Xbox 360 doesn't seem to use it and I'm not sure why. In the article linked above they say :

"PowerPC also has the synchronization instructions isync and eieio (which is used to control reordering to caching-inhibited memory). These synchronization instructions should not be needed for normal synchronization purposes."

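All that "Acquire" memory semantics needs to enforce is #LoadLoad and #LoadStore after the acquiring load (nothing later may move above it). The branch + isync idiom covers both, since a store can't be performed before the branch ahead of it resolves. lwsync certainly does give you acquire too because it has #LoadLoad and #LoadStore, but it also does a #StoreStore that you don't need.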
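To make the Acquire requirement concrete, here is a small sketch using C++0x-style std::atomic (an assumption; the console compilers discussed here expose their own intrinsics instead). The names g_payload, g_flag, Producer, Consumer and RunHandoff are invented for the example. How the acquire load gets lowered on PPC (load + branch + isync, or load + lwsync) is up to the compiler.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

static int g_payload = 0;            // plain data handed from one thread to another
static std::atomic<int> g_flag(0);   // publication flag

void Producer(int value)
{
    g_payload = value;                            // ordinary store
    g_flag.store(1, std::memory_order_release);   // Release : earlier stores can't sink below this
}

int Consumer()
{
    // Acquire : later loads and stores can't move above this load.
    while (g_flag.load(std::memory_order_acquire) == 0)
    {
        std::this_thread::yield();
    }
    return g_payload; // guaranteed to see the value stored before the release
}

int RunHandoff(int value)
{
    g_payload = 0;
    g_flag.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread consumer([&seen]() { seen = Consumer(); });
    std::thread producer(Producer, value);
    producer.join();
    consumer.join();
    assert(seen == value); // the acquire/release pair is what makes this hold
    return seen;
}
```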

ADDENDUM : another Xenon mystery : is there a way to make "volatile" act like old fashioned volatile, not new MSVC volatile? eg. if I just want to force the compiler to actually do a memory load or store (and not optimize it out or get from register or whatever), but don't care about it being acquire or release memory ordered.

07-18-10 - Mystery - Does the Cell PPU need Memory Control -

Is memory ordering needed on the PPU at all ?

I'm having trouble finding any information about this, but I did notice a funny thing in Mike Acton's Multithreading Optimization Basics :


// Pentium
#define  AtomicStoreFence() __asm { sfence }
#define  AtomicLoadFence()  __asm { lfence }

// PowerPC
#define  AtomicStoreFence() __asm { lwsync }
#define  AtomicLoadFence()  __asm { lwsync }

// But on the PPU
#define  AtomicStoreFence() 
#define  AtomicLoadFence()

Now, first of all, I should note that his Pentium defines are wrong. So that doesn't inspire a lot of confidence, but Mike is more of a Cell expert than an x86 expert. (I've noted before that thinking sfence/lfence are needed on x86 is a common mistake; this is just another example of the fact that "You should not try this at home!" ; even top experts get the details wrong; it's pretty sick how many random kids on gamedev.net are rolling their own lock-free queues and such these days; just say no to lock-free drugs mmmkay). (recall sfence and lfence only have any effect on non-temporal memory such as write-combined or SSE; normal x86 doesn't need them at all).

Anyway, the issue on the PPU is you have two hardware threads, but only one core, and more importantly, only one cache (and only one store queue (I think)). The instructions are in order, all of the out-of-orderness of these PowerPC chips comes from the cache, so since they are on the same cache, maybe there is no out-of-orderness ? Does that mean that memory accesses act sequential on the PPU ?

Hmmm I'm not confident about this and need more information. The nice thing about Cell being open is there is tons of information about it from IBM but it's just a mess and very hard to find what you want.

Of note - thread switches on the Cell PPU are pretty catastrophically slow, so doing a lot of micro-threading doesn't really make much sense on that platform anyway.

ADDENDUM : BTW I should note that even if the architecture doesn't require memory ordering (such as on x86), doing this :

#define  AtomicStoreFence() 
#define  AtomicLoadFence()

is a bad idea, because the compiler can still reorder things on you. Better to do :

#define  AtomicStoreFence()  _CompilerWriteBarrier() 
#define  AtomicLoadFence()   _CompilerReadBarrier()
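One possible way to spell those compiler-only barriers, as a sketch : C++0x's std::atomic_signal_fence restricts compiler reordering but emits no CPU fence instruction. That is an assumption about your toolchain; on GCC the usual alternative is an asm volatile with a "memory" clobber, and on MSVC _ReadWriteBarrier(). CompilerWriteBarrier / CompilerReadBarrier, Publish and TryConsume are names made up here. This is only enough on hardware that already orders stores and loads for you (the x86 case above).

```cpp
#include <atomic>
#include <cassert>

// Compiler-only barriers : constrain the optimizer, generate no fence instruction.
inline void CompilerWriteBarrier() { std::atomic_signal_fence(std::memory_order_release); }
inline void CompilerReadBarrier()  { std::atomic_signal_fence(std::memory_order_acquire); }

#define AtomicStoreFence()  CompilerWriteBarrier()
#define AtomicLoadFence()   CompilerReadBarrier()

static int g_data = 0;
static std::atomic<int> g_ready(0);

void Publish(int value)
{
    g_data = value;
    AtomicStoreFence();                           // compiler may not move the g_data store below here
    g_ready.store(1, std::memory_order_relaxed);
}

bool TryConsume(int * out)
{
    assert(out != nullptr);
    if (g_ready.load(std::memory_order_relaxed) == 0)
        return false;
    AtomicLoadFence();                            // compiler may not hoist the g_data load above here
    *out = g_data;
    return true;
}
```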

07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -

What memory order constraints do Mutexes really need to enforce ?

This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific questions :

1. Does either Mutex Lock or Unlock need to be Sequentially Consistent? (eg. a global sync/ordering point) (and followup : if they don't *need* to be Seq_Cst , is there a good argument for them to be Seq_Cst anyway?)

2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ? (eg. can Lock just be Acquire and Unlock just be Release ?)

Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared variable.

In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :

y = 1;

Lock(mutex);

  load x
  x ++;
  store x;

Unlock(mutex);

y = 2;

and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom, so in fact the y=1 assignment could be completely eliminated.

Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which allows you to invert the mutex; with try_lock, then lock() must be a full barrier, but unlock() still doesn't need to be.

Joe Duffy mentions this subject but doesn't come to any conclusions. He does argue that it can be confusing if they are not full barriers . I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you touch shared variables inside mutexes and also outside mutexes. I would like to see an example where *well written code* behaves badly.

One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so people are used to it, so it's good to give people what they are used to. Some coders may see "Lock" and think code neither moves in nor out. But on some platforms it does make the mutex much more expensive.

To be concrete, is this a good SpinLock ?

Lock :
    while ( ! CAS( lock , 0 , 1 , memory_order_seq_cst ) ) { /* spin */ }

Unlock :
    StoreRelease( lock , 0 );

    // AtomicExchange( lock, 0 , memory_order_seq_cst );
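For reference, the same SpinLock written out with C++0x-style std::atomic (an assumption, not the CAS / StoreRelease wrappers used above). The class and the kLockOrder / kUnlockOrder constants are invented for this sketch; they are set to the minimal Acquire / Release choice, and swapping kLockOrder to memory_order_seq_cst gives the variant in the question. It is not a claim about which one is right.

```cpp
#include <atomic>
#include <cassert>

// The ordering knobs under discussion. Acquire/Release is the minimal choice.
static const std::memory_order kLockOrder   = std::memory_order_acquire;
static const std::memory_order kUnlockOrder = std::memory_order_release;

class SpinLock
{
public:
    SpinLock() : m_state(0) { }

    // Single CAS attempt 0 -> 1. Returns true if we took the lock.
    bool TryLock()
    {
        int expected = 0;
        return m_state.compare_exchange_strong(expected, 1,
                                               kLockOrder,
                                               std::memory_order_relaxed);
    }

    void Lock()
    {
        while (!TryLock()) { /* spin */ }
    }

    void Unlock()
    {
        assert(m_state.load(std::memory_order_relaxed) == 1); // unlocking an unheld lock is a bug
        m_state.store(0, kUnlockOrder);
    }

private:
    std::atomic<int> m_state; // 0 = free, 1 = held
};
```

Note there is no backoff, fairness or yielding here at all, which is a big part of why you usually want the OS mutex instead.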

One issue that Joe mentions is the issue of fairness and notifying other processors. If you use the non-fencing Unlock, then you aren't immediately giving other spinning cores a chance to grab your lock; you sort of bias towards yourself getting the lock again if you are in high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS has complex mechanisms to prevent live locks and starvation and all that shit.

For more concreteness - Viva64 has a nice analysis of Dmitriy V'jukov's implementation of the Peterson Lock . This is a specific implementation of a lock which does not have *any* sequence point; the Lock() is Acquire_Release ordered (so loads inside can't move up and stores before it can't move in) and Unlock is only Release ordered.

The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind? Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?


07-17-10 - Broken Games

Soccer is broken. There's too little scoring and it's too easy to play a very defensive style. The issue with low scores is not just lack of excitement (that aspect is debatable), the big problem is that in a game that's often 1-0 or 0-0 , it greatly increases the importance of bad calls and flukes. If the scores were more like 7-5 , then 1 slip up wouldn't matter so much. Statistically, the "best" team loses in soccer more often than in any other major sport.

Anyway, it seems like it would be very easy to fix. You just do something to force an attacking style. One random idea I had is you could require that 3 forwards always stay on the opponent's side of midfield. This prevents you from drawing back everyone for defense, and means that the attackers can get a numbers advantage whenever they want to take that risk.

No Limit Hold'em is broken. It's far too profitable and easy to play a very tight style. The fix is very easy - antes. But almost nobody does it outside of the biggest games. (an extra blind would also work well).

Baseball is broken. Not in a game rule system way, I think it actually functions pretty well in that sense. Rather it is broken as a spectator sport because it is just too slow and drawn out. The fix for this is also very easy - put time limits on pitchers and batters. None of this fucking knocking the dirt off your shoes, throwing to first, then asking for more time, oh my god.

Basketball is broken. I wrote about that before so I won't repeat myself.

Rugby is broken. In a lot of ways actually. The rules of the scrum are very hard to enforce, so you constantly get collapsed scrums and balls not put in straight and so on, very messy, not fun to play or watch. The best part of the game is the free running, but it's very easy to win without a good running game at all, just by playing well in the set pieces and kicking, which is really ugly rugby. I don't have any great ideas on how to fix it, but it's definitely broken. Sevens is actually a much better game for the most part.

I guess it's pretty hard to change these things because they are established and have history and fans and so on, and any time you make a change a bunch of morons complain that you're mucking things up, but a bit of tweakage could seriously improve most sports.


Tennis is broken. Power and serving are rewarded too much over control, which makes matches boring. The French Open is generally the best tennis of the year to watch because it's slower. This could be easily fixed by limiting racket technology to 1980 levels or something.

Auto racing is horribly broken. I think F1 is hopeless and boring so I won't talk about that. The Le Mans / GT series are almost interesting, but the stupid rules just make it incomprehensible who has an advantage each year. Some manufacturer can happen to have a car that fits well with the current rule set, and then they dominate for a few years. In many of the series, the cars are so modified that they hardly share any parts with their street origins at all. Like currently the BMW M3's are struggling in the ALMS but winning in the ELMS because of tiny differences in the arcane rules (something about suspension and aero that's allowed).

I think the solution is very easy : let manufacturers bring anything they want, but it has to be available for the public to buy at some fixed price. So rather than all these classes that have all these rules (no 4WD for example in the Le Mans series, and minimum weights and so on) - get rid of the rules and have a $100k series and a $150k series. Let manufacturers make the best car they can make for that price, and if they want to take a loss and bring a car that's got more value than that, they can, as long as the public can buy it. This would really let us see what a $150k M3 can do vs a $150k R8.


07-16-10 - Content

Where the fuck is all this content that we are supposed to have in this age of the internet and vast media ?

News sites covering sports events should have a "spoiler free" mode. They should let you view the information in chronological order (past to present), and let you block how far ahead you want to see. eg. say I have game 3 of the NBA finals on tape, and before I watch it I want to catch up on what happened in game 2, I should be able to mark "don't show me game 3" and go read the news. I'm hitting that particular problem right now with the Tour de F.

Why is there no fucking decent blog of someone telling me news about the Tour? Yes, I know there are plenty of news sites like velonews or cyclingnews or whatever, that's not what I want. I also don't want a "tour diary" from a rider. I want a smart, funny, 3rd party who is following everything and can write about what happens and also some editorial info about the secret dramas. Where is my content?

For ages I've wanted a blog I could follow that was just a well-curated extraction of amusement. I like to see a funny photo or some hot chicks or whatever trashy internet amusement there is, but I don't want to have to slog through the mass of crap that you're bathed in when you go to the massive aggregator sites like milkandcookies or daily* or whatever. Like just one little high quality nugget once a day, why the fuck do I not have that?

The other thing I've wanted forever is a science news site that's targeted at science degree graduates, but not specialists in that exact topic. There's a big gap between popular reporting, which is just woefully low-level, often just wrong, or completely inane (like reporting crackpot fringe loonies as if they are real science), and the full-on rigor and impenetrability of actual research papers. There could be a middle ground, intended for intelligent scientific people, written by people who actually understand the full depth of what they're writing about. The only place I know of to get that is in college magazines; for example the Caltech Engineering&Science magazine that I occasionally get is actually a pretty good source for that depth of material.

In other news, the opening of the Montlake bridge almost every day of the summer so that a few fuckers can get their over-height sailboats through is a really ridiculous slap in the face of any kind of civic sense. I've been on a sailboat and gone from Lake Union to Lake Washington, and it is a delight, but you can get through just fine on a moderate size boat without raising the bridge. You have to almost intentionally get a really tall mast just so you can fuck up the lives of thousands of people when the bridge raising causes traffic to back up onto the 520 and leads to a massive traffic jam. It's really appalling.


07-13-10 - Tech Blurg

How do I atomically store or load 128 bit data on x64 ?

One option is just to use cmpxchg16b to do loads and stores. That's atomic, but seems a bit excessive. I dunno, maybe it's fine. For loads that's simple enough, you just do a cmpxchg16b with 0 and it gives you the value that was there. For stores it's a bit uglier because you have to do a loop and do at least two cmps (one to load, then one to store, which will only succeed if nobody else stored since the load).
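The shape of that load-via-CAS and store-via-CAS-loop can be sketched with std::atomic (an assumption; C++0x atomics, not the raw instruction). LoadViaCas and StoreViaCas are hypothetical names, and the usage below is at 64 bits so it runs anywhere; for a real 128-bit version you'd plug in whatever cmpxchg16b intrinsic your compiler provides, which is not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Load by doing a compare-exchange against 0.
// If the value was 0 it rewrites 0 (no change); otherwise the CAS fails
// and hands back the current value in 'expected'. Either way we have the value.
template <typename T>
T LoadViaCas(std::atomic<T> & target)
{
    assert(target.is_lock_free()); // pointless if this falls back to a lock
    T expected = T();
    target.compare_exchange_strong(expected, T());
    return expected;
}

// Store with a CAS loop : the first CAS usually fails and just fetches the
// current value, the next one succeeds if nobody stored in between.
template <typename T>
void StoreViaCas(std::atomic<T> & target, T value)
{
    T expected = T();
    while (!target.compare_exchange_weak(expected, value))
    {
        // 'expected' now holds what is really there; retry against it
    }
}
```

Usage would be something like : std::atomic<uint64_t> a(5); StoreViaCas(a, uint64_t(9)); uint64_t v = LoadViaCas(a);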

The other option is to use the SSE 128 bit load/store. I *believe* that it is atomic (assuming no cache lines are straddled), however it is important to note that SSE memory ops on x86/x64 are weakly ordered, unlike normal memory ops which are all strongly ordered (every x86 load is #LoadLoad and every store is #StoreStore). So, to make strongly ordered 128 bit load/store from the SSE load store you have to do something like

load :
    sse load 128

store :
    sse store 128

or such. I'm not completely sure that's right though and I'm having trouble finding any information on this. What I need is load_acquire_128 and store_release_128. (yes I know MSVC has intrinsics for LoadAcq_128 and StoreRel_128, but those are only for Itanium). (BTW a lot of people mistakenly think they need to use lfence or sfence with normal code; no no, those are only for SSE and write combined memory).

(ADDENDUM : yes, I think this is correct; movdqa (a = aligned) appears to be the correct atomic way to load/store 128 bits on x86; I'm a little worried that getting the weaker SSE memory model involved will break some of the assumptions about the x86 behavior of access ordering).

In other news, the random differences between GCC and MSVC are fucking annoying. Basically it's the GCC guys being annoying cocks; you know MS is not going to copy your syntax, but you could copy theirs. If you would get your heads out of your asses and stop trying to be holier than Redmond, you would realize it's better for the world if you provide compatible declarations. Shit like making me do __attribute__((always_inline)) instead of just supporting __forceinline is just annoying and pointless. Also, you all need to fix up your damn stdlib to be more like MS. Extensions like vsnprintf should be named _vsnprintf (MS style, correct) (* okay maybe not).

You also can't just use #defines to map the MSVC stuff to GCC, because often the qualifiers have to go in different places, so it's a real mess. BTW not having pragma warning disable is pretty horrendous. And no, putting it on the command line is nowhere near equivalent, you want to be able to turn them on and off for specific bits of code where you know the warning is bogus or innocuous.
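For the forced-inline case specifically, a macro does work because the qualifier can sit in front of the return type on both compilers. A sketch, with MY_FORCEINLINE and AddOne being made-up names; this doesn't help with the qualifiers that really do have to move around.

```cpp
#include <cassert>

// One spelling for forced inlining on both compilers.
#if defined(_MSC_VER)
  #define MY_FORCEINLINE __forceinline
#elif defined(__GNUC__)
  #define MY_FORCEINLINE inline __attribute__((always_inline))
#else
  #define MY_FORCEINLINE inline   // fallback : just a hint
#endif

MY_FORCEINLINE int AddOne(int x)
{
    return x + 1;
}

int AddTwo(int x)
{
    const int result = AddOne(AddOne(x)); // both calls should be inlined here
    assert(result == x + 2);
    return result;
}
```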

The other random problem I have is the printf format for 64 bit int (I64d) appears to be MS only. God damn it.
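One way around the format specifier problem, assuming your toolchain ships the C99 inttypes header (older MSVC versions do not, so there you'd have to define the macro yourself) : the PRId64 macro expands to whatever the right length modifier is on that platform. FormatI64 is a name invented for the sketch.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Format a 64 bit int without hardcoding "I64d" or "lld".
std::string FormatI64(int64_t value)
{
    char buf[32]; // plenty for 20 digits + sign + terminator
    const int written = std::snprintf(buf, sizeof(buf), "%" PRId64, value);
    assert(written > 0 && written < (int)sizeof(buf));
    (void)written;
    return std::string(buf);
}
```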


07-12-10 - My Nifty P4

What's new in MyNifty :

    no stall for down networks
    catch lots more cases that need to do checkouts (especially for the vcproj)
    don't check files for being read-only before trying p4 edit (lets you fix mistakes)
    don't check files for being in project before doing p4 edit (lets you edit source controlled files not in your project)

The result should be that using VC with MyNifty, you basically don't even know that your files are in source control - everything should just autocheckout at the right time.

Go get the original NiftyPlugins at Code.Google and do the install.

Then download MyNifty.zip

Extract MyNifty.zip into the dir where the NiftyP4 DLL is installed. It should be something like :

    C:\Documents and Settings\charlesb\Application Data\Microsoft\MSEnvShared\Addins\

(you may want to save copies of the originals).

Run VC.

Tools -> Addin Manager
NiftyPerforce : enable & startup

Tools -> Options -> Environment -> Documents
"Allow Editing of Read only files " = yes !

Output pane should have a "NiftyPerforce" option; you should see a message there like "NiftyPerforce ( Info): MyNiftyP4! RELEASE"

Nifty should be set to : useSystemEnv = True, autoCheckoutOnEdit = false, autoCheckout on everything else = true. I recommend autoAdd and autoDelete = false.

Make sure your P4 ENV settings are done, or use P4CONFIG files.

07-12-10 - Corporate Inequality

One of the things that bothers me is that all the corporations I deal with basically have free rein to fuck me, and I have nothing I can do back. If I ever do anything wrong, they charge me fees, and they can fuck up severely and I get nothing.

I deposited a check at Wells Fargo once that was pretty old; I had misplaced it and just found it and went and deposited it. They charged me a $25 fee for depositing a check that was too old.

Recently I had some money taken from my First Mutual account through a fraudulent ACH. They of course reimbursed it - but WTF that is not enough. They allowed someone to take money out of my account when the person didn't even sign my name. The name signed is "Lindsey Meadows" or something and First Mutual just let it right on through. I should get to charge them $25 for their gross incompetence.

Anytime anyone bills you wrong, you get to wait for 30 minutes on the phone. If you are *lucky* they will fix the bill. What about fucking compensating me you cocks? But if I make a mistake and send in my monthly payment with a slightly wrong number on the check I get a fee.

I sent a bike by UPS and they checked it out and told me the rate was $65. Two weeks later I get a notice from them that they measured it during shipping and decided it was oversize and they were charging me an extra $70.

Just recently, UPS has completely fucked up delivery of two packages; one I sent they bounced back to me even though the address was completely correct; I had to talk to them on the phone and send it again. They gave me no reimbursement at all (they didn't charge me for the second shipping, but didn't reimburse the first), and of course no compensation for the trouble or the delay. Recently they completely lost a package that was shipped to me; again it was insured and everything so I'll get it fine, but I should be able to drop a $70 charge on them for failed delivery. Oh you screwed up delivery? Okay, that's a $50 fee you owe me. Oh you don't remember agreeing to that? It's on page 97 of the contract you had to sign to accept my package.

My garbage pickup company will occasionally just drop a $10 oversize garbage pickup fee on me. I never put out anything outside the bin, I have no idea what the fuck they are charging me for, maybe some neighbor sneaks shit in or it's just a mistake, but the fact that they can just tack on fees at will without my agreement is the problem.

Of course I signed away permission for them to do that somewhere in the contract. But that is no fucking excuse. Retarded anti-humanists will say "it's your own fault, you had the opportunity to read the contract and you chose to sign it". What? First of all, you can't be serious that I'm supposed to study every fucking contract that's put in my face. Second of all, if I actually didn't sign the contracts that I didn't agree with, I couldn't live anywhere since all the rental agreements are absurd, I couldn't have a phone, a credit card, a bank, utility service, I mean are you fucking stupid? Of course I have to deal with these companies, I have no choice but to sign abusive contracts.

I fucking hate our lack of freedom and independence.

There is no free market solution to these problems. For one thing, there is basically no significant competition in almost every service. Even in sectors that have apparent competition, like say car insurance or banking, sure there are various people to choose from, but in fact they are almost all identical. They all run their business the same way, and none of them is actually good to their consumers.

The only solution is strong government regulation. In particular there are two very simple consumer protection laws that I would like to see :

1. Elimination of non-voluntary fees. All charges to a consumer must be explicitly authorized (and no they cannot be preauthorized on contingencies).

This is extremely powerful, simple, and would make a huge difference. For example it solves bank overdraft abuse. When you try to make an overdraft, the bank has to contact you (by email and cell) and confirm that you want to accept the overdraft (and the $25 fee). If you say no, the charge just bounces (and there is no fee to you). Obviously the same thing would apply to cell phone abuse through roaming or overage minutes or whatever.

Now you the consumer might want to preauthorize $50 a month in fees to give you some wiggle room, but you could choose $0 preauthorized fees if you like.

2. Make user agreements illegal. This is a little trickier because I don't think you can quite ban them completely, so you have to say something about them being "minimal" and "transparent". Maybe you could require that the average person should be able to fully read and understand it within 60 seconds.

Agreements that protect the service provider from lawsuits or that specify settlement through arbitration should simply be illegal.

But this line of thinking is all irrelevant because nothing that significantly reduces corporations' power to fuck us over will ever be done.


07-10-10 - PowerPC Suxors

I finally have done my first hard core optimization for PowerPC and discovered a lot of weird quirks, so I'm going to try to write them up so I have a record of it all. I'm not gonna talk about the issues that have been well documented elsewhere (load-hit-store and restrict and all that nonsense).

X. The code gen is just not very good. I'm spoiled by MSVC on the PC, not only is the code gen for the PC quite good, but any mistakes that it makes are magically hidden by out of order PC cores. On the PC if it generates a few unnecessary moves because it didn't do the best possible register assignments, those just get hidden and swallowed up by out-of-order when you have a branch or memory load to hide them.

In contrast, on the PPC consoles, the code gen is quirky and also very important, because in-order execution means that things like unnecessary moves don't get hidden. You have to really manually worry about shit like what variables get put into registers, how the branch is organized (non-jumping case should be most likely), and even exactly what instructions are done for simple operations.

Basically you wind up in this constant battle with the compiler where you have to tweak the C, look at the assembly, tweak the C, back and forth until you convince it to generate the right code. And that code gen is affected by stuff that's not in the immediate neighborhood - eg. far away in the function - so if you want to be safe you have to extract the part you want to tweak into its own function.

X. No complex addressing (lea). One consequence of this is that byte arrays are special and much faster than arrays of larger objects, because indexing larger objects takes an actual multiply or shift. So for example if you have a struct of several byte members, you should use SOA (several byte arrays) instead of AOS (one array of large struct).
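A small sketch of the two layouts. The struct and function names are invented for the example, and std::vector is just for convenience. In the AOS version element i lives at base + i*4, in the SOA version each member is its own byte array indexed at base + i.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// AOS : one array of a 4-byte struct, indexing needs i * sizeof(struct).
struct ColorAoS { uint8_t r, g, b, a; };

// SOA : one plain byte array per member, indexing is just base + i.
struct ColorsSoA { std::vector<uint8_t> r, g, b, a; };

ColorsSoA ToSoA(const ColorAoS * colors, size_t count)
{
    ColorsSoA soa;
    for (size_t i = 0; i < count; i++)
    {
        soa.r.push_back(colors[i].r);
        soa.g.push_back(colors[i].g);
        soa.b.push_back(colors[i].b);
        soa.a.push_back(colors[i].a);
    }
    assert(soa.r.size() == count);
    return soa;
}

uint32_t SumRedAoS(const ColorAoS * colors, size_t count)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++) sum += colors[i].r; // strided access
    return sum;
}

uint32_t SumRedSoA(const ColorsSoA & colors)
{
    uint32_t sum = 0;
    const uint8_t * r = colors.r.data();
    const size_t count = colors.r.size();
    for (size_t i = 0; i < count; i++) sum += r[i];        // dense byte access
    return sum;
}
```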

X. Inline ASM kills optimization. You think with the code gen being annoying and flaky you could win by doing some manual inline ASM, but on Xenon inline ASM seems to frequently kick the compiler into "oh fuck I give up" no optimization mode, the same way it did on the PC many years ago before that got fixed.

X. No prefetching. On the PC if you scan linearly through an array it will be decently prefetched for you. (in some cases like memcpy you can beat the automatic prefetcher by doing 4k blocks and prefetching backwards, but in general you just don't have to worry about this). On PPC there is no automatic prefetch even for the simplest case so you have to do it by hand all the time. And of course there's no out-of-order so the stall can't be hidden. Because of this you have to rearrange your code manually to create a minimum of dependencies to give it a time gap between figuring out the address you want (then prefetch it) and needing the data at that address.
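A sketch of what prefetching by hand looks like for the simplest linear scan, using GCC's __builtin_prefetch (on other compilers the macro below compiles to nothing; the console SDKs have their own cache-touch intrinsics, not shown). The 128 byte line size and the 4-lines-ahead distance are assumptions you would tune, and SumWithPrefetch is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__)
  #define PREFETCH(addr) __builtin_prefetch(addr)
#else
  #define PREFETCH(addr) ((void)0)
#endif

uint32_t SumWithPrefetch(const uint8_t * data, size_t count)
{
    assert(data != nullptr || count == 0);
    const size_t kLine  = 128;       // assumed cache line size in bytes
    const size_t kAhead = 4 * kLine; // assumed prefetch distance, tune per platform
    uint32_t sum = 0;
    for (size_t base = 0; base < count; base += kLine)
    {
        // Ask for a line we will need a few iterations from now,
        // so the fetch overlaps with the work on the current line.
        if (base + kAhead < count)
            PREFETCH(data + base + kAhead);

        const size_t end = (base + kLine < count) ? (base + kLine) : count;
        for (size_t i = base; i < end; i++)
            sum += data[i];
    }
    return sum;
}
```

For a linear scan the address is known way ahead of time; the hard cases are the ones where you have to restructure the code just to know the address early enough.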

X. Sign extension of smaller data into registers. This one was surprising and bit me bad. Load-sign-extend (lha) is super expensive, while load-unsigned-zero-extend (lhz) is normal cost. That means all your variables need to be unsigned, which fucking sucks because as we know unsigned makes bugs. (I guess this is a microcoded instruction so if you use -mwarn-microcode you can get warnings about it).

PS3 gcc appears to be a lot better than Xenon at generating an lhz when the sign extension doesn't actually matter. eg. I had cases like load an S16 and immediately stuff it into a U8. Xenon would still generate an lha there, but PS3 would correctly just generate an lhz.

-mwarn-microcode is not really that awesome because of course you do have to use lots of microcode (shift,mul,div) so you just get spammed with warnings. What you really want is to be able to comment up your source code with the spots that you *know* generate microcode and have it warn only when it generates microcode where it's unexpected. And actually you really want to mark just the areas you care about with some kind of scope, like :

__speed_critical {
  .. code ..
}

and then it should warn about microcode and load hit stores and whatever else within that scope.

X. Stack variables don't get registered. There appears to be a quirk of the compiler that if you have variables on the stack, it really wants to reference them from the stack. It doesn't matter if they are used a million times in a loop, they won't get a register (and of course "register" keyword does nothing). This is really fucking annoying. It's also an aspect of #1 - whether or not it gets registered depends on the phase of the moon, and if you sneeze the code gen will turn it back into a load from the stack. The same is actually true of static globals, the compiler really wants to generate a load from the static base mem, it won't cache that.

Now you might think "I'll just copy it into a local" , but that doesn't work because the compiler completely eliminates that unnecessary copy. The most reliable way we found to make the compiler register important variables is to copy them into a global volatile (so that it can't eliminate the copy) then back into a local, which then gets registered. Ugh.
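The shape of that trick, as a sketch. The global and the function are hypothetical names, and nothing here guarantees the local really lands in a register; you still have to look at the assembly. Note the shared global makes this unsafe to run from several threads at once.

```cpp
#include <cassert>

// hypothetical global ; volatile so the compiler can't eliminate the store/load pair
static volatile int g_registerLaunder;

inline int SumScaled(const int * values, int count, const int & scaleOnStack)
{
    assert( values != 0 || count == 0 );

    // round trip through the volatile, then into a fresh local ;
    //  the compiler can no longer prove the local is "just a copy" of the stack variable
    g_registerLaunder = scaleOnStack;
    const int scale = g_registerLaunder;

    int sum = 0;
    for(int i=0;i < count;i++)
        sum += values[i] * scale;
    return sum;
}
```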

You might think this is not a big deal, but because the chips are so damn slow, every instruction counts. By not registering the variables, they wind up doing extra loads and adds to get the values out of static or stack mem and generate the offsets and so on.

X. Standard loop special casing. On Xenon they seem to special case the standard

for(int i=0;i < count;i++) { }

kind of loop. If you change that at all, you get fucked. eg. if you just do the same thing but manually, like :

for(int i=0;;)
{
    if ( i == count ) break;
    .. body ..
    i++;
}

that will be much much slower because it loses the special case loop optimization. Even the standard paradigm of backward looping :

for(int i=count;i--;) { }

appears to be slower. This just highlights the need for a specific loop() construct in C which would let the compiler do whatever it wants.

X. Clear top 32s. The PS3 gcc wants to generate a ton of clear-top-32s. Dunno if there's a trick to make this go away.

X. Rotates and shifts. PPC has a lot of instructions for shifting and masking. If you just write the C, it's generally pretty good at figuring out that some combined operation can be turned into one instruction. eg. something like this :

x = ( y >> 4 ) & 0xFF;

will get turned into one instruction. Obviously this only works for constant shifts.

X. The ? : paradigm. As usual on the PC we are spoiled by our fucking wonderful compiler which almost always recognizes ? : as a case it can generate without branches. The PPC seems to have nothing like cmov or a good setge variant, so you have to generate it manually . The clean solution to this is to write your own SELECT , that's like :

#define SELECT(condition,val_if_true,val_if_false)  ( (condition) ? (val_if_true) : (val_if_false) )

and replace it with Mike's bit masky version for PPC.
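Mike's version isn't reproduced here, so as a stand-in here is one common sign-mask form of a branchless select. The function names are made up, and it only handles the "is this value negative" style of condition; it assumes the usual two's complement wrap when converting the result back to a signed type.

```cpp
#include <cassert>
#include <cstdint>

// returns val_if_negative when test < 0, else val_if_not ; no branch in the source
inline int32_t SelectIfNegative(int32_t test, int32_t val_if_negative, int32_t val_if_not)
{
    // top bit of test, smeared across the whole word : 0xFFFFFFFF if test < 0, else 0
    const uint32_t mask = 0u - ( static_cast<uint32_t>(test) >> 31 );

    const uint32_t result = ( static_cast<uint32_t>(val_if_negative) &  mask ) |
                            ( static_cast<uint32_t>(val_if_not)      & ~mask );
    return static_cast<int32_t>(result);
}

// example use : clamp negative values up to zero
inline int32_t ClampToZero(int32_t x)
{
    return SelectIfNegative(x, 0, x);
}
```

A condition like ( a < b ) can often be turned into this form by testing the sign of a - b, as long as that subtraction can't overflow.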

07-10-10 - Clipless pedals

I think clipless pedals are fucking terrible. Yes, they are slightly more efficient, and it does feel kind of nice to be locked in, but for the average amateur cyclist, they are a big problem and way too many people use them.

First of all, the efficiency difference vs. Powergrips or toe clip-strap pedals is pretty small. Those also lock your foot in pretty well and let you spin. When people say "clipless pedals are a huge gain" they are comparing to platform pedals which is fucking ridiculous, you need to compare vs. toe clips. The nice thing about toe clips is you can leave them loose around town, and then when you get out on a long course, you just reach down and pull the strap tight and then you are locked in nice and neat.

The first problem with clipless pedals is ergonomic. Yes, they can be adjusted just right so that you have good geometry and they won't cause any pain - but that is a big pain in the ass, and the average amateur doesn't have the perfect adjustment. The result is knee and hip pain. The extra freedom of strap pedals lets you get a more comfortable position and avoid the pain.

The biggest problem with clipless pedals is that they turn the amateur road rider into a real dick-head. They aren't comfortable with clipping in and out, so they go to great measures to avoid it. They'll hang on to posts at red lights, run lights and stop signs, won't wait in a line of other cyclists, etc. They create a real hazard because they can't get their feet out easily.

Of all the dick cyclists on the road, the yuppie amateur road racer has to be the worst. They're the ones who are all wobbly and don't stop for pedestrians. They ride in big groups and don't get out of the way for cars. They run stop signs and act like they aren't doing anything wrong. They'll often ride way out in the road for no good reason and not get over to let cars by. Of course cyclists should take the lane when they need to for safety reasons, but that is not the case for these turds.

It really makes me sad when I see some out of shape people who are obviously trying to get into cycling, and the fucking shop has set them up with some harsh aluminum frame, with a way too aggressive forward position, and clipless pedals; they're obviously very uncomfortable on their bike, and also out of control.


07-09-10 - Backspace

#define _WIN32_WINNT 0x0501 
#include <windows.h>
#include <psapi.h>

static bool strsame(const char * s1,const char * s2)
{
    for(int i=0;;i++)
    {
        if ( s1[i] != s2[i] )
            return false;
        if ( s1[i] == 0 )
            return true;
    }
}

// __declspec(dllimport) void RtlFillMemory( void *, size_t count, char );
extern "C" 
void * __cdecl memset(void *buf,int val,size_t count)
{
    char * p = (char *) buf;
    for(size_t i=0;i < count;i++)
    {
        p[i] = val;
    }
    return buf;
}

//#undef RtlCopyMemory
//NTSYSAPI void NTAPI RtlCopyMemory( void *, const void *, size_t count );
extern "C" 
void * __cdecl memcpy(void *dst,const void *src,size_t count)
{
    char * d = (char *) dst;
    char * s = (char *) src;
    for(size_t i=0;i < count;i++)
    {
        d[i] = s[i];
    }
    return dst;
}

//int CALLBACK WinMain ( IN HINSTANCE hInstance, IN HINSTANCE hPrevInstance, IN LPSTR lpCmdLine, IN int nShowCmd )
int my_WinMain(void)
{
    bool isExplorer = false;

    HWND active = GetForegroundWindow();

    DWORD procid;
    DWORD otherThread = GetWindowThreadProcessId(active,&procid);

    if ( active )
    {
        // walk up to the top-level parent window :
        HWND cur = active;
        for(;;)
        {
            HWND p = GetParent(cur);
            if ( ! p ) break;
            cur = p;
        }

        char name[1024];
        name[0] = 0;
        if ( GetClassNameA(cur,name,sizeof(name)) )
        {
            //lprintf("name : %s\n",name);
            isExplorer = strsame(name,"CabinetWClass"); // || (rest of this test is missing from the listing)
        }
    }

    if ( isExplorer )
    {
        //lprintf("sending alt-up\n");
        INPUT inputs[4] = { 0 }; // this calls memset
        inputs[0].type = INPUT_KEYBOARD;
        inputs[0].ki.wVk = VK_LMENU;
        inputs[1].type = INPUT_KEYBOARD;
        inputs[1].ki.wVk = VK_UP;
        // send keyups in reverse order :
        inputs[2] = inputs[1]; // this generates memcpy
        inputs[3] = inputs[0];
        inputs[2].ki.dwFlags |= KEYEVENTF_KEYUP;
        inputs[3].ki.dwFlags |= KEYEVENTF_KEYUP;

        // ASSUMED : the send itself is missing from the listing ; SendInput is the natural call here
        SendInput(4,inputs,sizeof(INPUT));
    }
    else
    {
        //lprintf("sending backspace\n");
        // can't use SendInput here cuz it will run me again
        // find the actual window and send a message :
        DWORD myThread = GetCurrentThreadId();
        // ASSUMED : attach to the other thread's input so GetFocus can see its focus window
        AttachThreadInput(myThread,otherThread,TRUE);
        HWND focus = GetFocus();
        if ( ! focus )
            focus = active;

        // the HotKey thingy that I use will still send the KeyUp for Backspace,
        //  so I only need to send the key down :
        //  (some apps respond to keys on KeyUp and would process backspace twice- eg. Firefox)
        // also, if I send the KeyDown/Up together the WM_CHAR gets out of order oddly
        int vk = VK_BACK;
        int ScanKey = MapVirtualKey(vk, 0);
        const LPARAM lpKD = 1 | (ScanKey << 16);
        //const LPARAM lpKU = lpKD | (1UL << 31) | (1UL << 30);

        // ASSUMED : the post itself is missing from the listing ; key down only, per the notes above
        PostMessage(focus,WM_KEYDOWN,vk,lpKD);

        AttachThreadInput(myThread,otherThread,FALSE);
    }

    return 0;
}

extern "C"
int WinMainCRTStartup(void)
{
    return my_WinMain();
}

Bug fixed 07/12 : don't send key up message. Also, build without CRT and the EXE size is under 4k.

Download the EXE


07-08-10 - Remote Dev

I think the OnLive videogames-over-the-internet thing is foolish and unrealistic.

There is, however, something it would be awesome for : dev kits.

If I'm a multi-platform videogame developer, I don't want to have to buy dev kits of every damn system for every person in my company. They cost a fortune and it's a mess to administer. It would be perfectly reasonable to have a shared farm of devkits somewhere out on the net, you can get to them whenever you want and run your game and get semi-interactive gameplay ala OnLive.

It would be vastly superior to the current situation, wherein poor developers wind up buying 2 dev kits for their 40 person team and you have to constantly be going around asking if you can use it. Instead you would always get instant access to some dev kit in the world.

Obviously this isn't a big deal for 1st party, but for the majority of small devs who do little ports it would be awesome. It would also rock for me because then I could work from home and instantly test my RAD stuff on any platform (and not have to ask someone at the office to turn on the dev kit, or make sure no one is using it, or whatever).


07-07-10 - Counterpoint

Dave Moulton is usually right on the money about issues bike related, but when he rants about bicycle helmet laws , I think he's off the money.

Basically he contends that the real problem with bike-car safety is that car drivers do not take the responsibility of their powerful machine seriously enough. Yes, of course, I agree absolutely, but SO WHAT ?

We could have a lot of discussions about the way the world should be, but it's irrelevant and non-productive to pine for things that will never be. Yes, it would be nice if people paid attention to the road when they drove, didn't talk on cell phones, didn't drink coffee and talk to their passenger. A car is a deadly powerful killing machine, and stupid people forget that because it's so comfy and feels safe and is easy to drive and so on. Yes, I think most people would drive better if they had to drive something like an old open-roof roadster where you are exposed to the elements and feel vulnerable. But you are not going to change Americans' driving habits. People want to jump in their giant beast, mash the gas, watch TV while they're driving, and fuck you if you're a pedestrian or cyclist in their way.

Look, drivers are fucking dangerous morons. It doesn't matter if you're on a bike or not - they are constantly running red lights, pulling into crosswalks, not stopping for pedestrians, going the wrong way around roundabouts. Almost every single day I see some major violation of basic traffic laws, and even beyond that there are just constant violations of basic human sense and decency. One of the ones that's really getting my goat recently is how people around here love to blow right through a stop sign and come to stop about ten feet past it, with their nose way out in the intersection (they do this intentionally because they want to get further out into the intersection to see to the sides; the reasonable thing to do would be to first stop at the stop sign and then pull forward to see). So when I'm driving or biking through the intersection, all I see is somebody who blows through a stop sign and is coming right at me (and then they stop just before hitting me).

I personally would love to see the elimination of the entire concept of the vehicular "accident". They are only rarely accidents; it's usually somebody fucking up. The person who crashed should not only have to pay the cost of the accident, but should get a punitive legal punishment such as license suspension or even prison time. For example, the old lady who ran right through a red light and smashed my car should clearly have had her license taken away. In cases of hitting pedestrians you should get jail time. It's almost impossible for a pedestrian to ever be at fault, because even if they do jump out right in front of you - you should always expect them to do that when you are in an area with pedestrians, so you should be going slow and be ready to slam on the brakes. But this is never going to happen so it's a pointless rant.

As for the issue of mandatory helmets - I don't really think it's anything to get riled up about. As a cyclist of course you should choose to wear a helmet even if it's not a law. Obviously making it a law is political weakness. Oh shit cyclists are getting hit by cars - let's restrict the cyclists because god knows we're not gonna restrict the cars. Well duh, of course that's how politics works. But there are perfectly reasonable reasons to make helmets mandatory - the same reason seat belts in cars are mandatory - because it reduces the medical cost which is shared by society.

We can rant all we want about how drivers should pay more attention, be more courteous, be reasonable and intelligent, but it just won't ever happen.

What I would like to see is better ways for me as a cyclist to avoid cars, and me as a driver to avoid cyclists. Part of the problem is that the people who put down the bike lanes are real fucking morons. Right in my neighborhood we have bike lanes or "sharrows" right down the busiest arterial roads, when there are perfectly good quiet back streets that run parallel and would be much better routes for bikes. Personally when I ride, I take the back routes that are very low traffic, but the majority of cyclists are just as retarded as the retarded cars, and they take the road that is "bike recommended" even though it's much worse.

(BTW as a reminder, let me emphasize that the retardation equivalence of bikes and cars in no way excuses the cars from their sins; if you have a fight between someone with a feather and someone with a knife and they are both being dangerous morons with their weapon - the feather guy can be forgiven but the knife guy is a fucking selfish ignorant dick; you often hear self-righteous car morons go on about how the bikes "do bad things too" ; so fucking what? so what? he's poking you with his feather, just ignore him and be careful with your damn knife).


07-05-10 - Country Living

Probably because I've been reading Tanizaki (wonderful) recently, and also because my neighborhood has turned into a construction yard as all the god damn home-improvers have kicked into high gear for the summer, I have been fantasizing a lot recently about living out in the country.

The woods have a wonderful silence to them; the boughs are baffles, muffling sound, making the air heavy and still. I imagine having a clearing in the woods so a bit of light can get in. In the clearing is a japanese style pavilion, dark thick wood braces and paper screens. It is empty of all clutter, my private space, quiet and peaceful, where I can just think and work and be alone.

There are actually lots of huge wooded properties for sale out not too far from Seattle. I think the best nearby place is out in the Snoqualmie river valley, around Duvall/Carnation/Novelty Hill. You can get 40 acres for around $600k which is pretty stonkering. 40 acres is enough that you can put a little building in the middle and not be able to see or hear a neighbor at all. It also seems like a pretty good investment. It's inevitable that the suburbs will get built out to there eventually, and then all that land could be worth a fortune. This is why I've never understood living in traditional suburbs; if you go just another ten miles out you get to real country where you can have big wild property with woods and gardens and isolation, for less money!

But then I start thinking - if I'm going to live in the middle of nowhere, why live in the middle of nowhere near Seattle? It's too far to really go into the city on any kind of regular basis, so I may as well just live in the countryside in CA or Spain or Argentina or somewhere with better weather.

Living in the country is really only okay if I'm married or something. If I'm single I have to be in the city. Even if I am with the woman I love, moving out to the country is sort of like retiring from life. It's changing gears to a very isolated, simple life. That's very appealing to me, but I don't think it's time for that phase of my life just yet.

Lately I have been taking lots of walks around Seattle U. It has pretty nice grounds, with lots of little hidden gardens tucked behind or between buildings where you can stroll or sit. I love the feeling on a college campus. You can just feel the seriousness in the air. Even when there are lots of kids around there's a feeling of quiet and solitude; maybe it's because the big buildings create a sort of echoing canyon that changes the sounds.

I miss having deep intellectual problems to work on that you really have to go and think about for a long time. Even though I'm sort of doing research right now, it's engineering research, where my time needs to be spent at the machine writing code for test cases, it's not theoretical research. It's really a delightful thing to have a hard theoretical problem to work on. You just keep it in the back of your mind and you chew on it for months. You try to come at it in different ways, you search for prior art papers about it. All the time you are thinking about it, and often the big revelation comes when you are taking a hike or something.

07-05-10 - Counterpoint 2

In which I reply to other people's blogs :

Smartness overload ( and addendum ) is purportedly a rant against over-engineering and excessive "smartness".

But let's start there right away. Basically what he wants is to have less smartness in the development of the basic architectural systems, but that requires *MORE* smartness every single time you use the system. For example :

In many situations overgeneralization is a handy excuse for laziness. Managing object lifetimes is one of my pet peeves. It's common to use single, "universal" system for all kinds of situations instead of spending 5 minutes and think.

He's anti-smartness but pro "thinking" each time you write some common code. My view is that "smartness" during development is very very bad. But by that I mean requiring the client (coder) to think and make the right decision each time they do something simple. That inevitably leads to tons of bugs. Having systems that are clear and uniform and simple is a massive win. When I'm trying to write some leaf code, I shouldn't have to worry about basic issues, they should be taken care of. I shouldn't have to write array code by hand every time I need an array, I should use a vector. etc.

Furthermore, he is arguing against general solutions. I don't see how you can possibly argue that having each coder cook up their own systems for lifetime management is a good idea. Uniformity is a massive massive win. Even if you wrote some manual lifetime control stuff that was great, when some co-worker goes into your code and tries to use things they will be lost and have problems. What if you need to pass objects between code that use different schemes? What a mess.

Yet, folks insist on using reference counted pointers or GC everywhere. What? Some of counters can be manipulated from multiple threads? Well, let's make _all_ pointers thread-safe, instead of thinking for another 5 minutes and separating special cases. It may be tempting to have a solution that just works in every case, write it once and use everywhere. Sadly, in many cases it means unnecessary overhead.

Yes! It is very tempting to have a solution that just works in every case! And in fact, having that severely *reduces* the need for smartness, and severely reduces bugs. Yes, if the overhead is excessive that's a problem, but that can be dealt with without destroying good systems.

I think what he's trying to say is something along the lines of "don't use a jackhammer to hammer a nail" or something; that you shouldn't use some very heavy complex machinery when something simple would do the trick. Yes, of course I agree with that, but he also succumbs to the fallacy of taking that way too far and just being anti-jackhammer in general. The problem with that is that you wind up having to basically cook up the heavy machinery from scratch over and over again, which is much worse.

Especially with thread safety issues, I think it is very wrong-headed to suggest that coders should "think" and "not be lazy" in each occurrence of a problem and figure out what exactly they need to thread-protect and how they can do it minimally, etc. To write thread-safe code it is *crucial* to have basic systems and common paradigms that "just work everywhere". Now that doesn't mean that you have to make all smart pointers threadsafe. You could easily have something like "SingleThreadSmartPointer" and "ThreadSafeSmartPointer". An even better mechanism would be to design your threading system such that cross-thread smart pointers aren't necessary. Of course you want sensible efficient systems, but you also want them to package up common actions for you in a guaranteed safe way.

Finally, let's get to the real meat of the specific argument, which is about object lifetime management. He seems to be trashing a bogus smart pointer system in which people are passing around smart pointers all the time, which incurs lots of overhead. This is reminiscent of all the people who think the STL is incredibly slow, just because they are using it wrong. Nobody sensible has a smart pointer system like that. Smart people who use the boost:: pointers will make use of a mix of pointers - scoped_ptr, shared_ptr, auto_ptr, etc. for different lifetime management cases. Obviously the case where a single object always owns another object is trivial. You could use auto_ptr or even just a naked pointer if you don't care about automatic cleanup. The nice thing is that if I later decide I need to share that object, I can change it to shared_ptr, and it is easy to do so (or vice-versa). Even if something is a shared_ptr, you don't have to pass it as a smart pointer. You can require the caller to hold a ref and then pass things as naked pointers. Obviously little helper functions shouldn't take a smart pointer that has to inc and dec a refcount thread-safely, that's just bone headed bad usage, not a flaw of the paradigm.

Now, granted, by not using smart pointers everywhere you are introducing holes in the automaticity where bad coders can cause bugs. Duh. That is what good architecture design is all about - yes if we can make everything magically work everywhere without performance overhead we would love to, but usually we can't so we have to make a compromise. That compromise should make it very easy for the user to write efficient and mistake-free code. See later for more details.

Object lifetime management involves work one way or another. If you use smart pointers or some more lazy type of GC, the amount of work needed for the coder to do every time he works with shared objects is greatly reduced. This makes it easier to write leaf code and reduces bugs.

The idea of using an ID as a weak reference without a smart pointer is basically a no-go in game development IMO. Let me explain why :

First of all, you cannot ever convert the ID to a pointer *ever* because that object might go away while you are using the pointer. eg.

Object * ptr = GetObject( ID ) ; // checks object is alive

// !! in other thread Object is deleted !!

ptr->stuff(); // crash !

So, one solution to this is to only use ID's. This is of course what Windows and lots of other OS'es do for most of their objects. eg. HANDLE, HWND, etc. are actually weak reference ID's, and you can never convert those to pointers, the function calls all take the ID and do the pointer mapping internally. I believe this is not workable because we want to get actual pointers to objects for convenience of development and also efficiency.

Let me also point out that a huge number of windows apps have bugs because of this system. They do things like

HWND w = Find Window Handle somehow

.. do stuff on w ..

// !! w is deleted !!

.. do more stuff on w .. // !! this is no good !

I have some Windows programs that snoop other programs and run into this issue, and I wind up having to check IsWindow(w) all over the place to tell if the window whose ID I'm holding has gone away. It's a mess and very unsafe (particularly because in early versions of windows the ID's can get reused within moderate time spans, so you actually might get a success from IsWindow but have it be a different window!).

Now, of course weak references are great, but IMO the way to make them safe and useful is to combine them with a strong reference. Like :

ObjectPtr ptr = GetObject( ID ); // checks existence of weak ref

the weak ref to pointer mapping only returns a smart pointer, which ensures it is kept alive while you have a pointer. This is just a form of GC of course.

By using a system like this you can be both very efficient and very safe. The system I use is roughly like this :

Object owners use smart pointers.  Of course you could just have a naked pointer or something, but the performance cost of using a
smart pointer here is nil and it just makes things more uniform which is good.

Weak references resolve to smart pointers.

Function calls take naked pointers.  The caller must own the object so it doesn't die during the call.  Note that this almost never
requires any thought - it is true by construction, because in order to call a function on an object, you had to get that object from
somewhere.  You either got it by resolving a weak pointer, or you got it from its owner.

This is highly efficient, easy to use, flexible, and almost never has problems. The only way to break it is to intentionally do screwy things like

Object * ptr = GetObject( ID ).GetPointer();
CallFunc( ptr );

which will get a smart pointer, get the naked pointer off it, and let the smart pointer die.
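To make the three rules concrete, here is a rough sketch using std::shared_ptr and std::weak_ptr as stand-ins. This is not the actual system described above; ObjectTable, Register, Resolve and ReadValue are hypothetical names (Resolve plays the role of GetObject( ID )), and the table itself is not thread-safe as written.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

struct Object
{
    int value = 0;
};

typedef std::shared_ptr<Object> ObjectPtr;  // owners hold these
typedef unsigned ObjectID;                  // the weak reference handed around

// hypothetical table mapping IDs to weak refs
class ObjectTable
{
public:
    ObjectID Register(const ObjectPtr & owner)
    {
        ObjectID id = m_nextID++;
        m_table[id] = owner;
        return id;
    }

    // weak references resolve to smart pointers only ;
    //  null if the object is gone, otherwise it stays alive while the caller holds the result
    ObjectPtr Resolve(ObjectID id) const
    {
        std::unordered_map<ObjectID, std::weak_ptr<Object> >::const_iterator it = m_table.find(id);
        if ( it == m_table.end() )
            return ObjectPtr();
        return it->second.lock();
    }

private:
    std::unordered_map<ObjectID, std::weak_ptr<Object> > m_table;
    ObjectID m_nextID = 1;
};

// function calls take naked pointers ; the caller holds the ref for the duration of the call
inline int ReadValue(const Object * obj)
{
    assert( obj != nullptr );
    return obj->value;
}
```

Typical use is ObjectPtr p = table.Resolve(id); if ( p ) ReadValue( p.get() ); so the smart pointer outlives the call. The screwy case above corresponds to calling Resolve(id).get() and letting the temporary die before the naked pointer is used.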

Now, certainly lots of projects can be written without any complicated lifetime management AT ALL. In particular, many games throughout history have gotten away with having a single phase of the world tick when all object destruction happens; that lets you know that objects never die during the frame, which means you can use much simpler systems. I think if you are *sure* that you can use simpler systems then you should use them - using fancy systems when you don't need them is like using a hash table to implement an array with the index as the hash key. Duh, that's dumb. But if you *DO* need complicated lifetime management, then it is far far better to use a properly engineered and robust system than to do ad-hoc per-usage coding of custom solutions.

Let me make another more general point : every time you have to "think" when you write code is an opportunity to get it wrong. I think many "smart" coders overestimate their ability to write simple code correctly from scratch, so they don't write good robust architectural systems because they know they can just write some code to handle each case. This is bad software engineering IMO.

Actually this leads me into another blog post that I've been contemplating for a while, so we'll continue there ...


07-04-10 - Counterpoint 1

In which I reply to other people's blogs :

The Windows registry vs. INI files is surprisingly pro-registry.

First of all, using the Registry does not actually *fix* any issues with multiple instances that you can't fix pretty easily with INI files. In particular, the contention over the config data is the easy part of the problem. There's an inherent messy ambiguity with settings and multiple instances. If I change settings in instance 1 do I want those to be immediately reflected in instance 2? If so, the only way to fix this is to have all instances constantly checking for new settings (or get some notification). This problem is exactly the same with the registry or INI files. Sure the registry gives you nice safe atomic writes, but you can implement that yourself easily enough, or you could use an off the shelf database. So that is really not much of an argument. In fact, getting changes across instances with INI files could be done pretty neatly using a file change notification to cause reloads (I'm not sure if Windows provides a similar watcher notification mechanism for registry changes). (the system that most people use of dumping settings changes on program exit is equally broken with the registry or INI files).
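For the "implement that yourself" part, the usual trick is to write the new settings to a temp file and then rename it over the real one. A sketch under assumptions: ReplaceFileContents is a made-up name, POSIX rename replaces the target atomically, and on Windows MoveFileExA with MOVEFILE_REPLACE_EXISTING is used as the closest simple equivalent (I'm not claiming a documented atomicity guarantee for it). It also does nothing about flushing to disk or about two writers racing on the same temp name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// write the whole new file next to the old one, then swap it in with a single rename ;
//  readers see either the old settings or the new ones, never a half-written file
inline bool ReplaceFileContents(const std::string & path, const std::string & contents)
{
    const std::string tmp = path + ".tmp";

    FILE * f = std::fopen(tmp.c_str(),"wb");
    if ( ! f )
        return false;

    const bool wrote  = ( std::fwrite(contents.data(),1,contents.size(),f) == contents.size() );
    const bool closed = ( std::fclose(f) == 0 );
    if ( ! wrote || ! closed )
    {
        std::remove(tmp.c_str());
        return false;
    }

#ifdef _WIN32
    return MoveFileExA(tmp.c_str(),path.c_str(),MOVEFILE_REPLACE_EXISTING) != 0;
#else
    return std::rename(tmp.c_str(),path.c_str()) == 0;
#endif
}
```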

Second, storing things like last dialog positions and dumping structs and such is not really appropriate for INI files (or really even for the registry for that matter). The INI file is for things that the user might want to edit by hand, or copy to other machines, or whatever. That other junk is really a logically separate type of data. It's like the NCB in MSVC, which we all know you want to just wipe out from time to time. (in fact making it separate is nice because if I accidentally get my last dialog position off in outer space I can just delete that data). I think the official nice Win32 way to store this data is off in AppData somewhere, but I don't love that either.

Third, the benefits of the INI are massive and understated. 1. text editing is in fact a huge benefit over the registry. It lets you see all the options and edit them in a tool that is friendly and familiar. 2. it lets you do all the things you would do normally on files - eg. I can easily email my INI to friends, I can save backups of settings I made for some purpose, hell I can munge the INI from batch files, I can easily zip it up to save old versions, etc.

And this last one is by far the most important - making programs be "transportable" - that is, they rely on nothing but stuff in dirs under them - is just a massive win. It lets me rearrange my disk, copy programs around without running installers, save versions of programs, etc.

Back in the DOS days, whenever I finished a code project, we would make tape backups (lol tape) of the code * and all the compilers used to build it *. To do that all you had to do was include the dir containing the compiler. Five years later we could pull out projects that used some bizarro compilers that we didn't have any more, and it would all just work because they were fully transportable. The win of that is so massive it dominates the convenience of the registry for the developer.

Which brings us to the most important part : the convenience for the developer is not the issue here! It's what would be nicer for the user. And there INI is just all win. If it's more work for the developer to make that work, we should do that work.
