1/30/2009

01-30-09 - News

Google beats Deer

Jen-Hsun Huang has a bad case of the crazies :

Q: Jeffrey Katzenberg is giving effusive testimonials for Intel's Larrabee graphics chip.

A: You and I don't know what Larrabee can do. It doesn't exist. If you pay for Katzenberg's compute farm, he will tell you your lawnmower is great for making movies. There are no applications that run on Larrabee today. You have to recompile them.

(from Industrial Arithmetic). Am I the only one who thought his name was Jenson Wong? Where did all those H's come from? And, um, yeah, that comment makes a lot of sense, cuz you can totally run your executables on a GPU. WTF?

Oh hi, how's it going? It's me! Every girl ever. at the Phat Phree ; this is ancient but it recently showed up in Best of Craigslist so I thought I should give props to the original.

I Was on the Line ! ; OMG tell it brotha, I got a parking ticket for this once in SLO and it made me so mad. I wasn't even in a handicapped spot, just over the edge of the parking spot lines, all because of some idiot in an Escalade! Sometimes I'll see those douchebags in their new car who intentionally park right in the middle of two spots so their precious piece of shit doesn't get a scratch. It really makes me want to get a crowbar and just smash the fuck out of their car, but I haven't done it yet. And it's always some mediocre car too, like if it was actually something rare and amazing like a Koenigsegg or some shit, I would sort of understand, but no, the fuckers who do this usually drive like a Porsche Boxster or somepin.

Ladybug Letter writes about how to start your Organic farm ; damn I really want to do this. (BTW the related Mariquita Farm has an amazing index of foods; go through the alphabet!)

Lots of people these days are doing silly computational geometry and calling it "architecture". I believe that this stuff is the exact *opposite* of architecture, but it is interesting to look at the pictures .

01-30-09 - SetFileValidData and async writing

Extending and async writing files on Windows is a bad situation. See for example Asynchronous Disk I/O Appears as Synchronous on Windows . Basically you should just not even try, but if you really want to give it a go :

1. Open your file for OVERLAPPED and also for NO_BUFFERING. Buffering helps reads, but it severely hurts writes. It takes my write speeds from 80 MB/sec down to like 30 MB/sec. I suspect that the reason is that buffering causes the pages to be read in before they are written out. (it's sort of like using cached memory - it will fetch in the pages even though you're doing nothing but stomping all over them).

2. Use the undocumented NtSetInformationFile to resize the file to its full size before you write anything. SetEndOfFile will only work for page size granularity, NtSetInformationFile can do arbitrary sizes. BTW this is also better for fragmentation than just writing lots of data onto the end of the file, which can cause NTFS to give you lots of allocation chains.

3. Use SetFileValidData to tell Windows that whatever is in the sectors on disk is okay. If you don't use SetFileValidData, Windows will first zero out each sector before you get to touch it. This is like a security thing, but obviously it's pretty bad for perf to basically write the whole file twice. SetFileValidData will fail unless you first ask for your process to get the right permissions, which of course will only work for processes running as administrator. Okay, I did all that (a sketch of the whole sequence is below). This post is okay but dear god don't read the thread.
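
Putting the three steps together, something like this (a minimal sketch, untested; for simplicity it pre-sizes with the documented SetEndOfFile rather than NtSetInformationFile, and it assumes your token already has SeManageVolumePrivilege enabled) :

#include <windows.h>

HANDLE OpenForAsyncWrite(const wchar_t * name, LONGLONG fullSize)
{
    // 1. overlapped + unbuffered, so writes don't fault pages in first
    //    (NO_BUFFERING means all writes must be sector aligned & sized) :
    HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                    FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);
    if ( h == INVALID_HANDLE_VALUE )
        return h;

    // 2. pre-size the file to its full length before writing anything :
    LARGE_INTEGER size;
    size.QuadPart = fullSize;
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    // 3. tell Windows not to zero the sectors underneath our writes ;
    //    this fails unless SeManageVolumePrivilege was granted (admin) :
    SetFileValidData(h, fullSize);

    return h;
}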

If you do all those things right - your WriteFile() calls will actually be asynchronous. The actual speed win is not huge. Part of the problem is the next issue :

When I do all that, I start hitting some kind of weird OS buffer filling issue. I haven't tried to track down exactly what's happening because I don't really care that much, but what I see is that the writes are totally async and very fast (> 100 MB/sec) for some random amount of time (usually up to about 10 MB of writing or so) and then suddenly randomly start having huge delays. The write speed then goes down to 40 MB/sec or so.

ADDENDUM : when I say "you should not even try" I mean you should just live with WriteFile() being synchronous. It's plenty fast. Just run it from a thread and it still looks async to your thread (you need a thread anyway because OpenFile and CloseFile are very slow and synchronous; in fact the only thing you can rely on actually being fast and async is ReadFile). Also just live with the fact that Windows is zeroing the disk before you write it, everyone else does.
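
For reference, the "live with it" version is tiny; a sketch (hypothetical names) of pushing the blocking WriteFile onto a worker so the calling thread never waits - in real code you'd keep one long-lived IO thread and feed it a queue rather than spawning a thread per write :

#include <windows.h>

struct WriteJob { HANDLE file; const void * buf; DWORD size; };

static DWORD WINAPI WriterThread(LPVOID param)
{
    WriteJob * job = (WriteJob *) param;
    DWORD written = 0;
    // blocks here, on the worker - not in the caller :
    WriteFile(job->file, job->buf, job->size, &written, NULL);
    return 0;
}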

01-30-09 - Stack Tracing on Windows

There are basically 3 ways to capture stack traces on Windows.

1. Manually walking ebp/esp ; this steps back through the frame pointers - it relies on the caller's ebp being pushed on the stack. This is basically what RtlCaptureStackBackTrace or DmCaptureStackBackTrace do, but you can also just write it yourself very easily. The advantage of this is it's reasonably fast. The disadvantage is it doesn't work on all CPU architectures, and it doesn't work with the frame pointer omission optimization.

For info on RtlCaptureStackBackTrace, see Undocumented NT Internals or MSDN
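
A minimal use looks like this (a sketch ; CaptureStackBackTrace is the winnt.h name for RtlCaptureStackBackTrace, available on XP and up) :

#include <windows.h>
#include <stdio.h>

void PrintTrace()
{
    void * frames[32];
    // skip 1 frame so PrintTrace itself isn't in the trace :
    USHORT count = CaptureStackBackTrace(1, 32, frames, NULL);
    for (USHORT i = 0; i < count; i++)
        printf("%2d : %p\n", i, frames[i]);
}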

2. StackWalk64. This is the new API you're supposed to use. The advantage is it works on all CPUs and it even works with frame pointer omission (!). But you can see from that latter fact that it must be very slow. In order to work with FPO it loads the PDB and uses the instruction pointer map to figure out how to trace back. It also can trace through lots of system calls that normal ebp-walking fails on.

See gamedev.net or ExtendedTrace for examples. But it's really too slow.
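
For reference, the core of a StackWalk64 loop goes roughly like this (a sketch ; assumes dbghelp.lib is linked and SymInitialize was called once at startup) :

#include <windows.h>
#include <dbghelp.h>

int WalkStack(void ** out, int maxDepth)
{
    CONTEXT ctx;
    RtlCaptureContext(&ctx);

    STACKFRAME64 frame = { 0 };
#ifdef _M_X64
    DWORD machine = IMAGE_FILE_MACHINE_AMD64;
    frame.AddrPC.Offset    = ctx.Rip;
    frame.AddrFrame.Offset = ctx.Rbp;
    frame.AddrStack.Offset = ctx.Rsp;
#else
    DWORD machine = IMAGE_FILE_MACHINE_I386;
    frame.AddrPC.Offset    = ctx.Eip;
    frame.AddrFrame.Offset = ctx.Ebp;
    frame.AddrStack.Offset = ctx.Esp;
#endif
    frame.AddrPC.Mode = frame.AddrFrame.Mode = frame.AddrStack.Mode = AddrModeFlat;

    int depth = 0;
    while ( depth < maxDepth &&
            StackWalk64(machine, GetCurrentProcess(), GetCurrentThread(),
                        &frame, &ctx, NULL,
                        SymFunctionTableAccess64, SymGetModuleBase64, NULL) )
    {
        // each iteration may hit the PDB via the Sym* callbacks - this is the slow part
        out[depth++] = (void *)frame.AddrPC.Offset;
    }
    return depth;
}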

3. Manual push/pop in prolog/epilog. Uses the C compiler to stick a custom enter/leave on every function that does a push & pop to your own stack tracker. Google Perftools has an option to work this way. The "MemTracer" project works this way (more on MemTracer some day). The nice thing about this is it works on any architecture as long as the prolog/epilog is supported. The disadvantage is it adds a big overhead even on functions that you never trace. That rather sucks. Stacktraces are very rare in my world, so I want to pay the cost of them only when I actually do them, I don't want to be pushing & popping stack info all the time.
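
On MSVC the hook for this is the /Gh and /GH compiler flags, which make every prolog call _penter and every epilog call _pexit. A bare-bones x86 sketch (the shadow-stack recording itself is left hypothetical) :

extern "C" void __cdecl RecordEnter(void * addr); // hypothetical per-thread shadow stack push

// compile the code you want traced with /Gh :
extern "C" __declspec(naked) void __cdecl _penter()
{
    __asm
    {
        pushad                  // save all registers
        mov  eax, [esp+32]      // return address = inside the instrumented function
        push eax
        call RecordEnter
        add  esp, 4
        popad
        ret
    }
}

Note this is x86 MSVC only (naked functions and inline asm don't exist on x64), and it's exactly the "overhead on every function" cost described above.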

01-30-09 - Firefox 3

I finally installed Firefox 3 at home and I kinda wish I hadn't. The only thing I wanted was the improved SVG viewer.

The biggest annoyance so far is that it does the incremental-render thing, which is sort of good, but then it resets the page when it finishes. This means that pages can be "not done loading" for a long time, like 60 seconds, and you can start jamming on them, editing in edit boxes, and then loading "finishes" and BLAMO all your edits are gone.

Also all the crud in the top of the Bookmarks Menu is really annoying. I have to drag my mouse cursor past like 2 inches of shit before I get to my actual bookmarks.

If I was easy to anger like I used to be I would have screamed at my computer. Now I just sigh.

I really fucking hate computers and software. And yet it's my medium, it's the tool I have to use to accomplish my art. I feel kind of like a guitar player who just really fucking despises guitars and yet gets up on stage every night because it's the only way he can create.

1/29/2009

01-28-09 - Insane

I think Republicans have officially jumped the shark. They're not wrong, they're just insane. They now go in the pile of 90% of America that is just literally fucking nuts.

The god nutters. People who believe literally in angels or the literal word of the bible in general.

People who believe in the "Law of Attraction" or "The Secret". Open yourself up to the possibilities of the universe and I'll stuff you full of something.

People who believe in the healing powers of crystals or magnets. (or whatever new loony tunes thing they believe in this week; it changes pretty fast; ions? silver?)

People who buy the ShamWow.

Onto that list goes people who believe that "too much regulation" somehow caused the massive corporate malfeasance, or that with less regulation companies will behave better (!?). People who believe that some homos in their town will somehow turn everyone gay. People who believe that the Ay-rabs are all out to get us.


There's a ton of people that you just cannot talk to about the main topics in their lives. God nuts are a prime example, but so are the crazy "America is Great" crowd. Your only possible responses are "uh yeah, great" (with eye roll) or "you are nuts and here's why" (leading to huge argument), there is just no middle ground of reasonable conversation with them because they paint themselves into a corner of ridiculousness that can only be either ignored or attacked.

I was thinking about this because it comes up in less extreme cases as well. For example when programmers are like "our method is totally awesome" and you try to gently point out some inconsistencies and they're just like "oh no, it's awesome, we used it in X product and it all works out fine". They've painted themselves into a corner of oblivious ridiculousness where your only options for response are "umm, okay" (with eye roll) or "no, look dumbass, your algorithm is N-squared and there's no way that's okay when O(N) solutions for this problem are extremely well known; there were 20 papers published before you did your work all of which had better solutions" (or whatever).

1/28/2009

01-28-09 - Graph Memory

So we have a thing to track memory allocs with a stack trace, la di da, no big whoop. I log it all out to a file. So I wrote a thing to parse them into a hierarchy and spit them out with tabs for tabview . That's awesome.

Then I thought, hey, Atman makes these awesome graphs with "graphviz" so maybe I'll try that. One disadvantage of the pure hierarchy view in tabview is that you can't really see the flow when lines merge back up. That is, call graphs are not *trees* they're *DAGs*. Sometimes the stack hierarchy forks apart but then comes back together again. Graphviz should be able to show this neatly. Graphviz makes "SVG" files that you can just load with Firefox (Firefox 3's support is much better than 2's).
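
Feeding graphviz is trivial - you just print dot text and run it through "dot -Tsvg". A sketch with hypothetical edge records :

#include <stdio.h>

struct Edge { const char * from; const char * to; int kb; };

void WriteDot(FILE * f, const Edge * edges, int count)
{
    fprintf(f, "digraph allocs {\n");
    fprintf(f, "  rankdir=LR;\n"); // left-to-right reads better for call stacks
    for (int i = 0; i < count; i++)
        fprintf(f, "  \"%s\" -> \"%s\" [label=\"%d KB\"];\n",
                edges[i].from, edges[i].to, edges[i].kb);
    fprintf(f, "}\n");
}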

Anyway I made some graphs with various options and it's kinda cool. Here's an example : Allocs1.svg (you'll need to use ctrl-wheel to zoom out to see anything). Apparently if you click this link Firefox does nothing good. You have to download it - then open it with Firefox. WTF. Apparently it's my Verio servers doing the wrong thing . Yegads I hate the web.

Not bad, but I'm a little disappointed with graphviz's layout abilities. In particular the actual cell and edge layout seems very good, but they seem to have literally zero code to try to put the labels in good places.

In a bit of odd deja vu, this was one of the very first things I ever worked on as a professional programmer in 1991; I worked for a GIS company that had an old COBOL GIS database engine that worked with Census data and such; I wrote them a new C front end with graphics for PC's and such, and one of the things you have to do is take this raw street data with lat/long coordinates and street names and do some nice munging to make it look okay; a big part of that is a lot of heuristics about putting the labels for streets in good places (and when to repeat the label, etc.).


ADDENDUM : talking to Sean I realized you really want the graph to have different sizing/color options, be hierarchical, interactive, and stable.

That sounds hard, but I don't think it actually is. The key thing that makes it easy is that there is very good hierarchy in this information, and you can create the graph incrementally. I think that means you can just use simple iterative penalty methods to make the graph stable.

Here's my proposal :

Start with the graph at very coarse granularity ; maybe directory granularity if you have a few directories and that makes sense, else file granularity. Whatever coarse level so you have < 32 nodes or so. Just use a solver like graphviz to make this initial graph.

Now, interactively the user can click any group to expand its hierarchy. When that happens the big cell splits into various pieces. You just create the new pieces inside the parent and make the new edges - and then you just let them time evolve with a semi-physical iterative evolution.

You apply a penalty force for intersection with neighbors to drive the nodes apart so there's no overlap. You similarly apply penalty forces to the edges so they don't cross through nodes. And the edges also act kind of like springs, applying forces to try to be short and straight. Stable 2d physics is a pretty solved problem so you just let them run until they settle down. Note that as they spread apart they can force the other nodes in the graph to move around, but it's all nice and smooth and stable.

I think it's much easier to treat the canvas as just infinitely large and let your nodes move apart all they need to. Graphviz does everything oriented towards being printed on a page which is not necessary for the interactive view.
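
To make the "let it settle" part concrete, here's a toy sketch of one iteration (overlap penalty plus edge springs; the edge-edge forces and the hierarchy bookkeeping are omitted) :

#include <vector>
#include <cmath>

struct Node { float x, y, fx, fy; };
struct Edge { int a, b; };

void SettleStep(std::vector<Node> & nodes, const std::vector<Edge> & edges)
{
    for (size_t i = 0; i < nodes.size(); i++)
        nodes[i].fx = nodes[i].fy = 0;

    // penalty force : push apart any pair closer than minDist
    const float minDist = 50.f;
    for (size_t i = 0; i < nodes.size(); i++)
    for (size_t j = i+1; j < nodes.size(); j++)
    {
        float dx = nodes[i].x - nodes[j].x, dy = nodes[i].y - nodes[j].y;
        float d = sqrtf(dx*dx + dy*dy) + 0.0001f;
        if ( d < minDist )
        {
            float push = (minDist - d) / d;
            nodes[i].fx += dx*push; nodes[i].fy += dy*push;
            nodes[j].fx -= dx*push; nodes[j].fy -= dy*push;
        }
    }

    // springs along edges pull connected nodes toward a rest length
    const float rest = 80.f, k = 0.05f;
    for (size_t e = 0; e < edges.size(); e++)
    {
        Node & A = nodes[edges[e].a];
        Node & B = nodes[edges[e].b];
        float dx = B.x - A.x, dy = B.y - A.y;
        float d = sqrtf(dx*dx + dy*dy) + 0.0001f;
        float pull = k * (d - rest) / d;
        A.fx += dx*pull; A.fy += dy*pull;
        B.fx -= dx*pull; B.fy -= dy*pull;
    }

    // damped Euler step ; call repeatedly until the forces die out
    const float step = 0.5f;
    for (size_t i = 0; i < nodes.size(); i++)
    {
        nodes[i].x += nodes[i].fx * step;
        nodes[i].y += nodes[i].fy * step;
    }
}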


ADDENDUM 2 : after figuring out some evil things I've got graph_viz making better graphs. Here's an example : allocs2 SVG (direct clicking should work now too).

See comments on this post for details.

I'm still not happy with the edge labeling, and the FF SVG navigation is not ideal, but it's useable now.

01-28-09 - More Google Magic

The new code snippet search is fucking awesome. You can browse inside tarballs (!) and you can actually follow links to headers, which is of course crucial.

Now if they would only do browse-symbol too it would fucking rock my universe (so you could click a symbol and go "show me the definition" or "go to instantiation" ).

ADDENDUM : I'm finding that if I search for code from the main Google search it's not showing up in my results, but if I go to the manual codesearch direct page, I can find lots of stuff. Plus you can filter by language. For example I just found that Apple's Javascript Implementation is in "namespace WTF" ; awesome. I'm sure that stands for WebToolsFoundation or something but I like it.

In other Google Madness news. I *HATE* the new Maps feature where when you zoom all the way in it switches to street view. Urg. I run my Firefox with Flash disabled cuz Flash is the fucking devil which puts nonsense on top of web pages, so when I zoom in to street view, Maps just eats its ass and doesn't work any more. And I can't like undo it, because if I hit "back" it just takes me out of maps and I lose my location completely. Urg. One of the many things I hate about web software is I can't just save the version of the tool that I like and keep using that version when they add a bunch of junk I don't want.

01-27-09 - Apartments

Things you should not do in apartments :

Clog dancing.  Irish dancing.  Texan line dancing.  Tap dancing.  Really any kind of jig or stomping.

Amateur Carpentry.  Really anything involving regular hammering.

Learn to play a new instrument.  Especially not a string instrument, or brass, or drums.

Bowling.  Really indoor sports of any sort other than like maybe paper football.

Good noises to hear :


Someone skilled playing piano.  And some nice sonata type stuff, some Chopin maybe, not some banging Rachmaninoff.

People having sex.  But only if I mainly hear a girl moaning gently, no male grunting or bedframe banging.

Kittens.  No, I've never heard kittens, but just the idea of hearing kittens sounds appealing.

Children playing.  A mother singing lullabies.

One bad thing about living in Washington is that if I kill my upstairs neighbor it would be harder to escape to Mexico. Sure I could run to Canada, but it's way more likely I'd be extradited, and I couldn't keep driving through Central America and buy a bar in Colombia and secretly help the rebels while seducing a senorita. Plus I'd have to live in Canada (shudder).

1/27/2009

01-27-09 - Urg Blogger GData

So my Blog poster code looks like :

for (int retries = 0; retries < 4; retries++)
{
    if (retries > 0)
        Console.WriteLine("retrying...");

    try
    {
        createdEntry = service.Insert(blogPostUri, newPost);
        break;
    }
    catch (Exception e) //(...) // Google.GData.Client.GDataRequestException
    {
        createdEntry = null;
        Console.WriteLine("service.Insert exception: " + e);
    }

    Console.WriteLine("Retrying .. press a key");
    Console.ReadKey();
}

I do the retry loop because the connection frequently closes itself and does other weird shit where it drops your request. This has been working for the last N months, but recently I've started getting exceptions even though the post goes up successfully. Urg.

+      [Google.GData.Client.GDataRequestException]   {"Execution of request failed: http://www.blogger.com/feeds/5246987755651065286/posts/default"}   Google.GData.Client.GDataRequestException
+      Data   {System.Collections.ListDictionaryInternal}   System.Collections.IDictionary {System.Collections.ListDictionaryInternal}
      HelpLink   null   string
+      InnerException   {"The underlying connection was closed: The connection was closed unexpectedly."}   System.Exception {System.Net.WebException}
      Message   "Execution of request failed: http://www.blogger.com/feeds/5246987755651065286/posts/default"   string
      Source   "Google.GData.Client"   string
      StackTrace   "   at Google.GData.Client.GDataRequest.Execute()\r\n   at Google.GData.Client.GDataGAuthRequest.Execute(Int32 retryCounter)\r\n   at Google.GData.Client.GDataGAuthRequest.Execute()\r\n   at Google.GData.Client.Service.EntrySend(Uri feedUri, AtomBase baseEntry, GDataRequestType type, AsyncSendData data)\r\n   at Google.GData.Client.Service.Insert(Uri feedUri, AtomEntry newEntry, AsyncSendData data)\r\n   at Google.GData.Client.Service.Insert(Uri feedUri, AtomEntry newEntry)\r\n   at TextBlog.ConsoleApp.PostNewEntry(Service service, Uri blogPostUri, String title, String content, DateTime create, DateTime mod) in C:\\src\\TextBlog\\TextBlog\\Program.cs:line 164"   string
+      TargetSite   {Void Execute()}   System.Reflection.MethodBase {System.Reflection.RuntimeMethodInfo}

Developing for the web sucks donkey balls, cuz you have to talk to other people's services, and if they change anything about the way their stuff works, your stuff breaks. Urg.

Hmm, maybe it's this?


Why am I getting an HTTP 503 error code? 

There is an unspecified rate limit on the number of messages per user
per second. HTTP 503 indicates that the rate limit has been exceeded. To
handle this, consider implementing an exponential backoff algorithm and
an algorithm which interleaves small batches of emails from different
users.
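
If it is the rate limit, the fix the FAQ is pointing at is standard exponential backoff, ideally with some jitter. A generic sketch (not GData-specific, and in C++ rather than the C# above) :

#include <windows.h>
#include <stdlib.h>

// retry with exponentially growing, jittered delays : ~100, 200, 400, ... ms
bool PostWithBackoff(bool (*tryPost)(), int maxRetries)
{
    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        if ( tryPost() )
            return true;
        int ms = 100 << attempt;        // double the wait each time
        Sleep( ms + (rand() % ms) );    // jitter so clients don't retry in lockstep
    }
    return false;
}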


ADDENDUM : the answer seems to be one or both of these :


// http://code.google.com/p/google-gdata/wiki/KeepAliveAndUnderlyingConnectionIsClosed
((GDataRequestFactory)service.RequestFactory).KeepAlive = false;

//int t = ((GDataRequestFactory)service.RequestFactory).Timeout;
// gets and sets the Timeout property used for the created HTTPRequestObject in milliseconds. If you set it to -1 it will stick with the default of the HTTPRequestObject. From MSDN: The number of milliseconds to wait before the request times out. The default is 100,000 milliseconds (100 seconds).

// http://groups.google.com/group/gdata-dotnet-client-library/browse_thread/thread/99434d4d7df88e43?hl=en
System.Net.ServicePointManager.Expect100Continue = false;

Anyway, my blog poster is working again for the moment. But it's only a matter of time before changes in the site fuck me. I don't like that. I'm starting to miss my simple FTP poster that would just never ever change or ever have any problems. I have no patience any more for people "improving" software and breaking my shit. Like it's fine if you want to push new versions or whatever, but I want to keep the versions that work for me. You can't do that anymore. On the one hand you have reliance on OS components which you are forced to update, and on the other hand you have connection to web services which are changing constantly.

01-27-09 - Oddworld Memory Memories

I've been working on some memory allocator related junk recently (just for laughs I made my SmallAllocator in cblib work lock free and timed it running on a bunch of threads; an alloc takes around 200 clocks; I think there may be some cache contention issues). Anyway it reminded me of some funny stuff :

Munch's Oddysee was a wild painful birth; we were shipping for Xbox launch and crunching like mad at the end. We had lots of problems to deal with, like trying to rip as much of the fucking awful NetImmerse Engine out as possible to get our frame rate up from 1 fps (literally 6 months before ship we had areas of the game running at 1 fps) (scene graphs FTL). And then there were the DMusic bugs. Generally I found the Xbox tech support guys - at least the graphics and general functionality guys - to be really awesome, really helpful, etc. Eventually we trapped the DMusic exception in the kernel debugger and it got fixed.

Anyhoo... in the rush to ship and compromise, one of the things we left out was the ability to clean up our memory use. We just leaked and fragmented and couldn't release stuff right, so we could load up a level, but we couldn't do a level transition. Horrible problem you have to fix, right? Nah, we just rebooted the Xbox. When you do a level transition in Munch, you might notice the loading screen pops up and is totally still for a few seconds, and then it starts moving, and the screen flashes briefly during that time. That's your Xbox rebooting, giving us a fresh memory slate to play with. (shortly after launch the Xbox guys forbid developers from using this trick, but quite a few games in the early days shipped with this embarrassing delay in their level load).

For Stranger's Wrath we used my "Fixed Restoring" Small Allocator, which is a standard kind of page-based allocator. We were somewhat crazy and used a hefty amount of allocations; our levels were totally data driven and objects could be different types and sizes depending on designer prefs, so we didn't want to lock into a memory layout. The different levels in the game generally all get near 64 MB, but they use that memory very differently. We wanted to be able to use things like linked lists of little nodes and STL hashes and not worry about our allocator. So the Small Allocator provides a small block allocator with (near) zero size overhead (in the limit of a large # of allocations). That is, there are pages of some fixed size, say 16 KB, and each page is assigned to one size of allocation. Allocations of that size come out of that page just by incrementing a pointer, so they're very fast and there's zero header size (there is size overhead due to not using a whole page all the time).
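
A toy sketch of that scheme (hypothetical names; the real thing also has free lists and page restoring, hence "Fixed Restoring") :

#include <windows.h>

static const int kPageSize = 16*1024;
static const int kMaxSmall = 256;

struct Bucket { char * cur; char * end; };

static Bucket s_buckets[ kMaxSmall/8 + 1 ]; // one bucket per 8-byte size step

void * SmallAlloc(int size) // assumes 0 < size <= kMaxSmall
{
    int b = (size + 7) >> 3;        // round size up to its bucket
    Bucket * bk = &s_buckets[b];
    int step = b << 3;              // the bucket's fixed allocation size
    if ( bk->cur == NULL || bk->cur + step > bk->end )
    {
        // bucket's page is full : grab a fresh page dedicated to this size
        bk->cur = (char *) VirtualAlloc(NULL, kPageSize, MEM_COMMIT, PAGE_READWRITE);
        bk->end = bk->cur + kPageSize;
    }
    void * ret = bk->cur;
    bk->cur += step;                // zero per-alloc header : just bump the pointer
    return ret;
}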

That was all good, except that near the end when all the content started coming together I realized that some of our levels didn't quite fit no matter how hard we squeezed - after hard crunching they were around 65 MB and we needed to get down to 63 or so, and also the Fixed Restoring Allocator was wasting some space due to the page granularity. I was assigning pages to each size of allocation, rounding up to the next alignment of 8 or 16 or 32. Pages would be given out as needed for each size bucket. So if a size bucket was never used, pages for that size would never be allocated.

This kind of scheme is pretty standard, but it can have a lot of waste. Say you allocate a lot of 17 byte items - that gets rounded up to 24 or 32 and you're wasting 7 bytes per item. Another case is that you allocate a 201 byte item - but only once in the entire game ! You don't need to give it a whole page, just let it allocate from the 256-byte item page.

Now in a normal scenario you would just try to use a better general purpose allocator, but being a game dev near shipping you can do funny things. I ran our levels and looked at the # of allocations of each size and generated a big table. Then I just hard-coded the Fixed Restoring allocator to make pages for exactly the sizes of allocations that we do a lot of. So "Stranger's Wrath" shipped with allocations of these sizes :


static const int c_smallBlockSizes[] = { 8,16,24,32,48,64,80,96,116,128,136,156,192,224,256,272 };

(The 116, 136, 156, and 272 are our primary GameObject and renderable types). (yes, we could've also just switched those objects to a custom pool allocator for the object, but that would've been a much bigger code change, which is not something I would want to do very close to shipping when we're trying to be in code lockdown and get to zero bugs).

01-25-09 - Low Level Threading Junk Part 2.02

Some more tidbits before I come back to this :

There is a misconception being widely spread that x86 can reorder reads (Herb Sutter and lots of good people have been repeating this). So far as I can tell that's just not true. The IA-32 spec says that writes don't move past writes nor do reads move past reads. (there are lots of other constraints in there).

x86 plain old load and store are acquire and release. Note that that does NOT mean they have a total order (though if your code is written right for acquire/release semantics you don't need a total order). However volatile (locked) ops on x86 do have a total order.

BTW I am very much not suggesting that you or I write code that relies on the quirks on the x86. It's better to be very careful and generally correct and mark all the places that you are assuming different types of sequencing. If those sequencing commands turn into NOPs on x86, then bully for you. Still, it helps me to actually know what's going on in our systems, and if we're going to say things, let's try to say them right; my god there's so much completely wrong information about this stuff out there. This Gamasutra article is just chock-full of wrong.

I found this guy Anthony Williams who seems to know what's up : intel-memory-ordering-and-c++-memory-model and acquire/release vs. sequential consistency

Also, Bartosz in the famous post where he gets it wrong talks about the different memory model constraints. One is :

# memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won't be moved before it). (yes, even that is not guaranteed on all platforms!)

This is for the dependent read case. This is the same place that Linux uses "smp_read_barrier_depends". The typical scenario is like :


   Object * ptr = s_shared_ptr;
   smp_read_barrier_depends();
   int local = ptr->data;

or in C++0x style :

   Object * ptr = s_shared_ptr.load(memory_order_consume);
   int local = ptr->data;

Note that Bartosz says this in a funny way. The issue is not that the compiler or the CPU reorder buffer can move the dependent read before the pointer read. Obviously that's impossible. The issue is that for purposes of *memory timing* it could look as if the dependent read was moved earlier. Say I write this bad code :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

which is 

   mov ecx   , &s_shared_ptr
   mov eax , [ecx] // eax = ptr
   mov edx , [eax + 4] // data is at ptr+4

Can never be

   mov ecx   , &s_shared_ptr
   mov edx , [eax + 4] // data is at ptr+4
   mov eax , [ecx] // eax = ptr

Obviously there's no way to execute this out of order, the chip and the compiler are not *broken*. But it can look like it went out of order if your cache architecture allows it.

Say some other thread wrote s_shared_ptr->data , then s_shared_ptr. It used a Release so they went to the bus in order. But your chip is crazy dumb and loads cache lines in random order. Your chip reads the line with s_shared_ptr in it, and then your code runs :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

What you see is the *new* value of s_shared_ptr , but the *old* value of ptr->data. Now your dumb cache pulls in ptr->data but it's too late. We see that our code *acted* like it read ptr->data before ptr.

Fortunately this doesn't happen on modern chips (the Alpha is the only chip I know of that can do this). However to be really correct about marking up your code's memory model semantically you should include the memory_order_consume or smp_read_barrier_depends. Then your compiler can turn those into NOPs ;)

Now it's a little bit of a mystery to me exactly how processors manage this. I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.


BTW I really don't like the idea of a C++ atomic<> class, or the Magic Microsoft Volatile. The problem with both of those is they hide where the code is doing really crucial synchronization things. You can have code that looks like :


   newnode->next = node->next;
   node->next = newnode;
   return node->data;

and it's secretely using atomic<> or volatile and it only works because it's relying on those to do the right acquire/release stuff. Dear god.

The really scary thing about both of those is that they are deceptively magical. They work very neatly. Most of the time. And they make it very easy for any programmer who doesn't know anything to go ahead and write some lock free code. Yikes.

I would much rather see everyone continue to use locks, and when you really do need to do lock-free stuff, the C++0x proposal for specifically marking up the semantics needed with memory model constraints is pretty good IMO. It clearly marks what the code requires and what the ordering constraints are.

To say this more concisely : I'd rather see atomic<> functions in the language rather than atomic data types. Because it's really the operations that are "atomic" or not. But it looks like I won't get my way and we're all going to be smothered by an avalanche of broken thready code in the next 10 years.

1/26/2009

01-25-09 - madness

Am I mad to want to share binary data files between 32 bit and 64 bit runs? Binary data with pointer-sized stuff in it, mind you. It would mean using 64-bit pointers somehow in my 32 bit runs, or being more careful about not writing pointers to binary data. Currently I use the method of rebasing pointers so that I can write nice big blocks of structs straight to disk.
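
For reference, the rebasing trick is roughly this (a sketch, not necessarily my exact scheme) : pointers inside the block get converted to offsets from the block start before writing, then fixed up after loading. Making the on-disk field an explicit 64-bit offset would be one way to share files between 32 and 64 bit runs.

#include <stdint.h>
#include <stddef.h>

struct Node { Node * next; int data; };

// before writing : pointer -> offset from block start
// (-1 stands in for NULL so that offset 0 stays a valid target)
void PointerToOffset(Node ** p, const char * base)
{
    *p = (Node *)( *p ? (intptr_t)((const char *)*p - base) : (intptr_t)-1 );
}

// after loading : offset -> pointer
void OffsetToPointer(Node ** p, char * base)
{
    intptr_t off = (intptr_t)*p;
    *p = ( off == -1 ) ? NULL : (Node *)( base + off );
}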

In other news : my TextBlog poster apparently wigged out, but now it's not repro'ing so I can't fix it. Urg. I kind of hate RSS. I hate that when I fuck something up on my blog it gets broadcast all over the universe. It's like somebody xeroxing your first drafts and handing them out as you're writing. No no! quit it! When I'm done writing, then people can come over and look at what I wrote. Class, hand those back to me, everyone pass your xerox down to the person on the end of the row.

01-25-09 - Low Level Threading Junk Part 2.01

I'm going to update Part 2 with some fixes and I'd like to write a Part 3 about some actual lock-free madness, but those may have to wait a tiny bit. In the mean time I thought I would post some links.

These are some very good summaries, mainly by the MSDN guys :

MSDN - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
MSDN - Concurrency What Every Dev Must Know About Multithreaded Apps
MSDN - Understanding Low-Lock Techniques in Multithreaded Apps
MSDN - Synchronization and Multiprocessor Issues (Windows)
Intel - Multiple Approaches to Multithreaded Applications

These articles have a good discussion of the memory models of various processors ; in particular Part 1 has a nice table that shows what each processor does (warning : there may be some slight wrongness in here about x86).

Memory Ordering in Modern Microprocessors, Part I
Memory Ordering in Modern Microprocessors, Part II

Here come a mess of links about memory models and barriers and what's going on. There's a lot of contradiction (!!) so be careful : For example - does LFENCE do anything at all on x86 for normal aligned integral reads? good question... btw SFENCE definitely does do nothing (I think?).

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering
YALFAm (Yet Another Lock Free Approach, maybe) - comp.programming.threads Google Groups
Who Ordered Memory Fences on an x86? - Bartosz Milewski's Programming Cafe
The Double-Checked Locking is Broken Declaration
Synchronization and the Java Memory Model
Re Memory fence instructions on x86
Re LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
Re Intel x86 memory model question 2
Re Intel x86 memory model question 1
Memory Barriers, Compiler Optimizations, etc. - comp.programming.threads Google Groups
LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
cbrumme's WebLog Memory Model

And now some links about some real mad shit. Do not try any of this at home. I've found two mad geniuses : Chris Thomasson ("the AppCore guy") who has got some nutty x86 lock-free synchronization stuff that relies on details of x86 and does safe communication with zero fencing, and Dmitriy V'jukov ("the Relacy guy") who has written a ton of nutty awesome stuff for Intel.

Wait free queue - comp.programming.threads Google Groups
Thin Lock Implementation - Bartosz Milewski's Programming Cafe
Paul E. McKenney Read-Copy Update (RCU)
Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
julian m bucknall Hazard Pointers (careful - wrong stuff in here)
Dr. Dobb's Lock-Free Queues July 1, 2008
Computer Laboratory - Cambridge Lock Free Library
CodeProject A simple Win32 readerswriters lock with reentrance. Free source code and programming help
Atomic Ptr Plus Project
AppCore A Portable High-Performance Thread Synchronization Library
Re Non-blocking queue for the 2 threads (AppCore)
refcount - mostly lock-free word-based atomic reference counting (AppCore)
Asymmetric rw_mutex with atomic-free fast-path for readers - Scalable Synchronization Algorithms Google Groups
Asymmetric Dekker Synchronization

There's also Intel's TBB :

Threading Building Blocks Home

It's a horrible name, it's not "Threading Building Blocks" at all. It's fucking OpenMP. It's not really a lib of little threading helpers, which would be handy, it's like fucking CUDA for CPUs.

Also just to repeat - I am in no way recommending that anybody go down this path. As I have said repeatedly - CRITICAL_SECTION is really fucking fast and smart and does exactly what you want. Just use it. If critical section locks are giving you problems it's because your architecture is wrong, not because the micro-routine is not fast enough.

01-25-09 - Low Level Threading Junk Part 2

Okay, we're starting to learn some primitive threading junk. So let's try to use it to do something useful.

Non-trivial statics in C++ are hideously non-thread-safe. That should be extremely obvious but if you don't see it, read here : The Old New Thing C++ scoped static initialization is not thread-safe, on purpose! . Obviously plain C static initialization of integral types is fine (BTW floats are NOT fine - beware using const floats in C++). Beware all things like this :


Object & Instance() { static Object s; return s; }

void func(int i)
{
    static Object once;
    once.DoStuff(i);
}

Not thread safe. For the most part, we should just not use them. It's very nice to do minimal work in CINIT. Much brouhaha has been made about the "thread safe singleton" (see all these) :

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Double-checked locking - Wikipedia, the free encyclopedia

For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it work during CINIT with complex initialization order dependencies. Now, CINIT we know is run entirely on one thread before any threads are made. (you are fucking nuts if you're making threads during CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in CINIT :


static Object * s_instance = NULL; // okay cuz its a basic type

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_instance = new Object;
    }
    return s_instance;
}

static volatile Object * s_forceUse = GetSingleton();

The forceUse thing just makes sure our GetSingleton is called during cinit, so that we can be sure that we're all made by the time threads come around. The GetSingleton() that we have here is NOT thread safe, but it doesn't need to be ! (btw I assume that "CreateThread" acts as a MemoryBarrier (??))

Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we have to worry about thread-safing the construction.

The easiest way is you know your Singleton won't get used during CINIT. In that case you can just use Critical Section :


static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * s_instance = NULL;

// broken version :

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_crit.Lock();

        s_instance = new Object;
    
        s_crit.Unlock();
    }
    return s_instance;
}

Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by moving the critsec out of the if). Even aside from that the line that assigns s_instance is all wrong because the pointer to s_instance is not necessarily being written atomically, and it might be written before the stuff inside Object is written. What did we learn in Part 1?

// good version :

static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance ) // must be memory_order_consume
    {
        s_crit.Lock();

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        s_crit.Unlock();
    }
    return s_instance;
}

This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again to make sure its still null, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also using Interlocked ensures the pointer is written atomically.

ADDENDUM : I need to add some more notes about this. See comments for now.

Okay, that's all good, but it doesn't work if you now ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :


class Object { } ;
static CRITICAL_SECTION s_crit = { 0 };
AT_STARTUP( InitializeCriticalSection(&s_crit); );
static Object * s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        if ( s_crit.DebugInfo == 0 ) // DebugInfo will always be nonzero for an initialized critsec
        {
            // must be in CINIT !
            s_instance = new Object;
        }
        else
        {
            EnterCriticalSection(&s_crit);

            if ( ! s_instance )
            {
                Object * pNew = new Object;
        
                InterlockedExchangeRelease( &s_instance , pNew );
            }   
    
            LeaveCriticalSection(&s_crit);
        }
    }
    return s_instance;
}

This actually works pretty dandy. Note that in CINIT you might take the upper or lower paths - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine, it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.

Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized so that we don't have to worry about CINIT mumbo jumbo ?

Well, yes it would, and it's pretty trivial to make a standard simple Spin Lock thing :


typedef LONG SpinMutex; // initialize me with = 0

void SpinLock(SpinMutex volatile * mut)
{
    // 0 is unlocked, 1 is locked
    COMPILER_ASSERT( sizeof(SpinMutex) == sizeof(LONG) );
    int spins = 0;
    while( InterlockedCompareExchange((LONG *)mut,1,0) == 1 ) // Acquire
    {
        ++spins;
        if ( spins > CB_SPIN_COUNT )
        {
            Sleep(0);
        }
    }
    // *mut should now be 1 
}
    
void SpinUnlock(SpinMutex volatile * mut)
{
    // *mut should now be 1
    InterlockedDecrement((LONG *)mut); // Release
    // *mut should now be 0
    // if there was a higher priority thread stalled on us, wake it up !    
}

Note that SpinLock will deadlock you if you try to recurse. Okay, now because this thing is static initializable we can use it to make a Singleton and be safe for CINIT :

static SpinMutex s_mutex = 0;
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        SpinLock(&s_mutex);

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        SpinUnlock(&s_mutex);
    }
    return s_instance;
}

And that works too.

Now, people get tempted to use these simple SpinLocks all over. *DON'T DO IT*. It looks like it should be faster than CRITICAL_SECTION but in fact it's way way slower in bad cases.

First of all, what is a CRITICAL_SECTION ? See : Matt Pietrek - Critical Sections Under Windows .

It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different execution patterns the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.

Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for dealing with events and threads in alertable waits - it's the best way to do threading sleeping and wake ups generally. For example, it has mojo to deal with the bad cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one (this prevents a few runaway threads from killing the system).

One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner. That's because of multicore; more about this in a moment.

About performance : (from one of the MSDN articles) :


    * MemoryBarrier was measured as taking 20-90 cycles.
    * InterlockedIncrement was measured as taking 36-90 cycles.
    * Acquiring or releasing a critical section was measured as taking 40-100 cycles.

which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :

    Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
        (so how could you beat it by using InterlockedIncrement instead?)

    When there is contention and the Crit Sec does something more serious (use an Event) it's slow
        but that is exactly what you want !!

In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).

An interesting alternative is this QLock thing by Vladimir Kliatchko : Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007 (QLock) ; it's quite a bit like the MS critsec, in that he uses a simple interlocked op to check for contention, and if it's not free then he makes an Event and goes into a wait state and all that. The code is there on Dr. Dobbs but you have to do the thing where you hit the "Download" in the top bar and then navigate around to try to find it. it's here in MUTEX.LST .


Ok, now a bit more about InitializeCriticalSectionAndSpinCount. The reason you want a spin count is because basically every new machine these days is "multiprocessor" (multicore). That's a big difference from threading on a single core.

If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time. With multiple cores memory can be touched by some other core that is running along at full speed.

Often this doesn't affect the way you write threading code, but it does affect performance issues. It means you can have contention on a way finer scale. In a normal OS single proc environment, you get large time slices so other threads aren't frequently randomly poking at you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times; 1000 spins is about a microsecond.

What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock. Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).
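
Concretely :

// give the critsec a spinner, per the numbers above (1000 spins ~ a microsecond) :
CRITICAL_SECTION s_cs;
InitializeCriticalSectionAndSpinCount(&s_cs, 4000);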

1/25/2009

01-25-09 - Low Level Threading Junk Part 1

Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on with the memory, as opposed to just using the language/OS constructs for safe sharing.

Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be done in 99% of cases. Just use CriticalSection and once in a while use Interlocked and don't worry about the tiny inefficiency; if you do things right they won't be a speed hit. Trying to get things right in lots of low level threading code is a recipe for huge bugs.

I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction : Concurrency What Every Dev Must Know About Multithreaded Apps . I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly what's going on with that.

First of all, let me try to list the various issues we're dealing with :

1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but it's what most people think of as threading.

2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delayed communication channel.

3. CPU out-of-order instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even if they do execute in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write gather buffers, etc.)

4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like "inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap.

5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.

6. The compiler/optimizer caching values or eliminating ops.

I think that's the meat of it. One thing that sort of mucks this all up is that x86 and MSVC >= 2005 are sort of special cases which are much simpler than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+ which means they do lazy/incorrect things that happen to work in that case. Also I'm going to be talking about C++ but there are actually much better memory model controls now in Java and C#. I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as an example.

Almost every single page I've read on this stuff gets it wrong. Even the experts. I always see stuff like this . Where they implement a lock-free fancy doohicky, and then come back later and admit that oh, it doesn't actually work. For example this Julian Bucknall guy has a lot of long articles about lock free stuff, and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from a place like "codeproject.com" or anybody's random blog.

I've read a lot of stuff like :

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.

To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do* which means that the language natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero guarantees of anything. That means you have to do everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C, but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution model" because it includes things like the concept of what's atomic and what can be reordered).

Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical sections now, what kind of threading stuff you can do without critsecs, etc.

Something I am NOT going to talk about is the exact details of the memory model on x86/windows, because I don't think you should be writing code for a specific processor memory model. Try to write code that will always work. x86/windows has strong constraints (stores are not reordered past stores, etc. etc. but forget you know that and don't rely on it).


Let's look at a simple example and see if we can get it right.

Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :


// global variables :
MyStruct * g_work = NULL;


Thread A :

int main(int argc, char ** argv)
{
    g_work = new MyStruct( argc, argv );

    ....
}


void ThreadB()
{

    while( g_work == NULL )
    {
    }

    g_work->DoStuff();
}

Okay, now let's go through it and fix it.

First of all, this line should give you the heebee jeebees :

    g_work = new MyStruct( argc, argv );

It's doing a bunch of work and assigning to a shared variable. There are no guarantees about what order that gets written to memory, so g_work could be assigned before the struct is set up, then ThreadB could start poking into it while I'm still constructing it. We want to release a full object to g_work that we're all done with. We can start trying to fix it by doing :

1.  MyStruct * temp = new MyStruct( argc, argv );
2.  g_work = temp;

that's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the writes to memory done by line 1 are actually finished before line 2 executes.

In Windows we do that by calling MemoryBarrier() :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp;

MemoryBarrier is an intrinsic in MSVC ; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this also actually does two things : 1A. flushes caches and write gather buffers and 1B. puts a fence in the reorder buffer so the processor can't speculate ahead). 2. the MemoryBarrier intrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work before MemoryBarrier ahead of MemoryBarrier.

MemoryBarrier is a full memory fence, it creates an ordering point. In general if you just write memory operations :

    A
    B
    C

you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :
    A
    B
    MemoryBarrier
    C

You get an order constraint : C is always after {AB} , so it might be ABC or BAC.

Another digression about the compiler optimizer fence : in windows you can also control just the compiler optimization with _ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory fence to the processor, it's just a command to the optimizer to not move memory reads or writes across a specific line. I haven't seen a case where I would actually want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm, the compiler automatically treats that as a ReadWriteBarrier (??).

Alright, so we're getting close, we've made a work struct and forced it to flush out before becoming visible to the other thread :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp; <-

What about this last line? It looks innocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32 bit pointer on 32-bit x86, or a 64 bit pointer on 64-bit x86. Also since g_work is just a variable it could get optimized out or deferred or just stored in local cache and not flushed out to the bus, etc.

One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!). Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.

One thing to be careful with about volatile is how you put it on a pointer. Remember to read pointer qualifiers from right to left in C :

    volatile char * var;

    // this is a non-volatile pointer to a volatile char !! probably not what you meant

    char * volatile var;

    // this is a volatile pointer to chars - probably what you meant
    //  (though of course you have to make sure the char memory accesses are synchronized right)

Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable in its meaning, and instead you should manually specify the operations that need to be sync'ed :

On Windows there's a clean way to do this :


    MyStruct * temp = new MyStruct( argc, argv );
    InterlockedExchangePointer( &g_work, temp );

The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op. The Interlocked functions also automatically generate memory barriers and optimizer barriers (!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer forces the MyStruct contents to be fully written out before the pointer is published in g_work.
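And for symmetry, here's a sketch of what the consuming thread might look like on the other side (this reader half is my own illustration, not part of the original example) :

    // g_work is the same global as above :
    MyStruct * volatile g_work = NULL;

    void Consumer()
    {
        // atomically take the pointer and clear the slot :
        MyStruct * mine = (MyStruct *) InterlockedExchangePointer( (PVOID volatile *)&g_work, NULL );
        if ( mine )
        {
            // the Interlocked op was a full barrier (on Windows), so if we
            // see the pointer, all the writes into *mine are visible too
            // ... use mine ...
            delete mine;
        }
    }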

Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).

So if you have :

    A
    B
    C - Release
    D
The actual order could be {A B C D} or {B A C D} or {A B D C} but never {A C B D}. Obviously Acquire and Release are slightly weaker constraints than a full barrier, so they give the processor more room to wiggle, which is good for performance.

So let's do another example of this simple thread passing (stolen from comp.programming.threads) :


int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads

  EnterCriticalSection(&mtx);
  if (! valid) {
    a = something;
    b = something;
    InterlockedExchangeAddRelease(&valid, 1);
    // writes to memory 'a' and 'b' will be done before valid is set to 1
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);

}

int function2 () { // called from many threads

  if (InterlockedExchangeAddAcquire(&valid, 0)) {
    return a + b;
  } else {
    return 0;
  }

} 

int function3 () { // broken

  if ( valid ) {
    return a + b;
  } else {
    return 0;
  }

} 

Now it's easy to fall into the trap of thinking that because we did the "Release", function3 is okay. I mean, by the time function3 sees valid get set to 1, 'a' and 'b' will already be set, so function3 is right, okay? Well, sort of. That would be true *if* function3 were written in assembly so the compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we rely on the x86 constraint of read ordering). In reality, the actual execution of function3 could go something like :


fetch a
fetch b
add b to a
test valid
conditionally set a to 0
return a

which is now reading a and b before 'valid'. Acquire stops this.

Some good links on this basic memory barrier stuff : (read Kang Su in particular)

Kang Su's Blog - volatile, acquire/release, memory fences, and VC2005
Memory barrier - Wikipedia, the free encyclopedia
Acquire and Release Semantics
The Old New Thing - Acquire and release sound like bass fishing terms, but they also apply to memory models
Memory ordering, barriers, acquire and release semantics - niallryan.com
Memory barrier + volatile (C++) - comp.programming.threads - Google Groups

BTW note that volatile does some magic goodness in VC >= 2005 , but *not* on Xenon even with that compiler version. In summary :


_ReadBarrier/_WriteBarrier/_ReadWriteBarrier - just prevents compiler reordering

MemoryBarrier() - compiler barrier plus CPU memory load/store barrier

Interlocked & volatile
    Interlocked functions on Windows automatically generate a compiler & a CPU barrier
    Interlocked on Xbox 360 does not generate a CPU barrier - you need to do it yourself

special volatile thing in MSVC :
    volatile automatically does the right thing on VC >= 2005
    not for Xbox 360 though

ADDENDUM : lol the new Dr. Dobbs has a good article by Herb called volatile vs volatile . It covers a lot of this same territory.

And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering

01-24-09 - linkies

Complexification Gallery - wow really gorgeous images. All made algorithmically with Processing, and most driven by semi-physical-mathematical models. Lots of applets to play with. This is seriously fucking amazing and inspiring.

Killzone 2 making of video is pretty awesome.

Mischief and Ivo Beltchev have some crazy debugger database plugin thing for the string-CRC model. IMO this is fucking nuts. But I do love me some autoexp.dat

We were talking about this the other day and I realized I've forgotten what "Koenig lookup" is and why you need it. Basically it just means that functions are looked up in the namespace of their arguments. So :


namespace my_ns
{
    class Vec3 { ... };

    void func( Vec3 & x )
    {
        ...
    }

};

int main()
{
    my_ns::Vec3 v;
    func(v);
}

works, even though it's calling a func() that's not in the global namespace.

If you think about it a second this is nice, because it means that non-member functions on a class act like they are in the same namespace as that class. This is pretty crucial for non-member operators; it would be syntactically horrific to have to call the operators in the right namespace. But it's nice for other stuff too, it means you don't need to jam everything into member functions in order to make it possible to hand a class out to another namespace.


namespace my_ns
{
    class Vec3 { ... };

    bool operator == (const Vec3 &a , const Vec3 & b) { ... }
};


int main()
{
    my_ns::Vec3 a,b;

    if ( a == b )  // ***
    {
        ...
    }
}
At the line marked *** we're calling my_ns::operator == (Vec3,Vec3) - but how did we get to call a function in my_ns when we're not in that namespace? Koenig lookup.

Now, this really becomes crucial when you start doing generic programming and using templates and namespaces. The reason is your containers in the STL are in the std:: namespace. You are passing in objects that are in your namespace. Obviously the STL containers and algorithms need to get access to the operations in your namespace. The only way they can do that is Koenig lookup - they use the namespace of the type they are operating on. For example, for std::sort to make use of your "operator <" it needs to get to your namespace.
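Here's a tiny self-contained sketch of that (my own toy example) :

    #include <algorithm>
    #include <vector>

    namespace my_ns
    {
        struct Vec3 { float x,y,z; };

        // non-member, lives in my_ns
        bool operator < ( const Vec3 & a, const Vec3 & b )
        {
            return a.x < b.x;
        }
    }

    int main()
    {
        std::vector<my_ns::Vec3> v(10);

        // std::sort lives in std:: , but the "a < b" it does internally
        // finds my_ns::operator < through the namespace of Vec3 - Koenig lookup
        std::sort( v.begin(), v.end() );
        return 0;
    }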

See Herb's good articles :

What's In a Class - The Interface Principle
Namespaces and the Interface Principle
Argument dependent name lookup - Wikipedia, the free encyclopedia

Not exactly related but other good C++ stuff : Koenig has a Blog .

And if you want to get really befuddled about C++ name lookup you can mix this in : Why Not Specialize Function Templates .

Finally, I've had "dos here" for a long time, but of course it really should be "tcc here". Just follow these registry instructions but for the "command" put :


"C:\Program Files\JPSoft\TCCLE9\tcc.exe" /IS cdd %1

1/24/2009

01-22-09 - endir

One of the most useful things I've done recently is "endir.bat" :
endir.bat :

if not exist "%1" endbat
md zz
call mov %1 zz\
call zren zz %1

It puts an object into a dir with the same name as that object. This sounds silly but it's a super useful starting point for lots of things. For example, you have a bunch of music folders named :

Stupid Band Name - Ironic Album Title

You want to make that

Stupid Band Name / Ironic Album Title

that is, a dir inside a dir. You just run "endir" on the original single dir and it makes the dir inside a dir. Now to do the rename you just truncate each name where you want it. (it's impossible to write a general renaming rule for this because people use all kinds of different separators; I've literally seen names like "And_You_Will_Know_Us_By_The_Trail_Of_Dead_Source_Tags_And_Codes").

1/21/2009

01-21-09 - Towards Lock-Free Threading

Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually. Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.

It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.

Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed while the mutex is held. This is just the standard easy old way :


{Thread A} <---> {Communication Region} <---> {Thread B}

Thread A - lock mutex on CR
    change values ; 
    unlock
--- Thread B lock mutex
    read values
    unlock

Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.
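For concreteness, here's a bare-bones sketch of pushing messages through a Win32 SList; the Message struct and function names are made up for illustration :

#include <windows.h>
#include <malloc.h>

struct Message
{
    SLIST_ENTRY entry;  // must be the first member
    int         payload;
};

__declspec(align(MEMORY_ALLOCATION_ALIGNMENT)) SLIST_HEADER g_stackAtoB;

void Init()
{
    InitializeSListHead( &g_stackAtoB );
}

void Send( int payload ) // Thread A
{
    // SList entries must be MEMORY_ALLOCATION_ALIGNMENT aligned :
    Message * m = (Message *) _aligned_malloc( sizeof(Message), MEMORY_ALLOCATION_ALIGNMENT );
    m->payload = payload;
    InterlockedPushEntrySList( &g_stackAtoB, &m->entry );
}

void Receive() // Thread B
{
    // note it's a LIFO - messages pop in reverse order of sending
    while ( PSLIST_ENTRY e = InterlockedPopEntrySList( &g_stackAtoB ) )
    {
        Message * m = (Message *) e;
        // ... apply m->payload to my copy of the communication region ...
        _aligned_free( m );
    }
}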

To keep the code the same, they both get copies of the Communication Region :


{Thread A | Communication Region} <---> {Communication Region | Thread B}

and they send messages via two stacks :

{Thread A | Communication Region} --- stack AtoB --> {Communication Region | Thread B}
{Thread A | Communication Region} <-- stack BtoA --- {Communication Region | Thread B}

The messages you send across the stacks apply the deltas to the communication regions to keep them in sync, just like a networked game.

So for example, if Thread A is the "main thread" and Thread B is a "job doer thread" , a common case might be that the communication region is a list of pending jobs and a list of completed jobs.


Main Thread
    {Pending Jobs}
    {Completed Jobs}

Request Job :
    add to my {Pending Jobs}
    push message on stackAtoB


Worker Thread
    {Pending Jobs}
    {Completed Jobs}

Sit and pop jobs off stackAtoB
put in pending jobs list
work on pending jobs
put result on completed jobs list
push completion message on stackBtoA

The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs are still pending or done yet awaiting examination.

Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free. This way 99% of the code is identical, but it still just talks to a "Communication Region".


ADDENDUM :

I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't page pieces back to main memory as they go.

While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle systems)

Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks for a long time is bad, but you can implement a "lock free" model using locks and it's plenty fast.

That is, this kind of model :


[ xxxxxxxxxx giant shared data xxxxxxxxxx ]

    |          |         |         |
    |          |         |         |
    |          |         |         |

[thread A] [thread B] [thread C] [thread D]

is slow regardless of whether you use locks or not. And this kind of model :

[data A  ] [data B  ] [data C  ] [data D  ] 
     
    |     /    |     /   |       / |
    |    /     |    /    |      /  |
    |   /      |   /     |     /   |

[thread A] [thread B] [thread C] [thread D]

is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.


ADDENDUM #2 :

Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write. The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free methods safe. Let me try to make this clear with an example :

You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work. So you have a communication region like :

// globals :
    bool isDone;
    int workParams[32];
    int workResults[32];

Now a lot of people try to do lock-free work passing trivially by going :

Thread A :

    isDone = false;
    // set up workParams
    // tell Thread B to start

    ... my stuff ...

    if ( isDone )
    {
        // use workResults
    }   

Thread B :

    // read workParams

    // fill out workResults

    isDone = true;

Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs. (I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and results don't all happen together and in-order). Instead we can just pass the object :

struct ThreadMessage
{
    LinkNode link;   // linkage for the message queue
    bool isDone;
    int workParams[32];
    int workResults[32];
};

Thread A :
    ThreadMessage myLocalCopy; // in TLS or stack
    // fill out myLocalCopy
    // push myLocalCopy to thread message queue to thread B

    ... my stuff ...

    if ( pop my message queue )
    {
        myLocalCopy = copy from queue
        // use myLocalCopy.workResults
    }
    
Thread B :
    ThreadMessage myLocalCopy; // in TLS or stack
    myLocalCopy = pop message queue

    // read myLocalCopy.workParams

    // fill out myLocalCopy.workResults
        
    push myLocalCopy to queue for thread A

Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written and read in one big chunk. That is, we move the shared data from one large consistent state to another.
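To tie it together, here's a minimal sketch of the kind of queue that supports this. Note it's just a dumb locked queue - per the "false grail" point above, the model is what matters, and you could swap the guts for an SList later without touching the callers. (MessageQueue and its methods are made-up names; the struct drops the intrusive link since the deque owns the storage) :

#include <windows.h>
#include <deque>

struct ThreadMessage
{
    bool isDone;
    int  workParams[32];
    int  workResults[32];
};

class MessageQueue
{
    CRITICAL_SECTION          m_mtx;
    std::deque<ThreadMessage> m_q;
public:
    MessageQueue()  { InitializeCriticalSection( &m_mtx ); }
    ~MessageQueue() { DeleteCriticalSection( &m_mtx ); }

    void Push( const ThreadMessage & m )
    {
        EnterCriticalSection( &m_mtx );
        m_q.push_back( m ); // the whole struct goes in as one consistent chunk
        LeaveCriticalSection( &m_mtx );
    }

    bool Pop( ThreadMessage * pOut )
    {
        EnterCriticalSection( &m_mtx );
        bool got = ! m_q.empty();
        if ( got ) { *pOut = m_q.front(); m_q.pop_front(); }
        LeaveCriticalSection( &m_mtx );
        return got;
    }
};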

01-20-09 - Spam

The gmail spam filter is pretty good; it's eliminated like 99% of my spam. However, at the same time, it's kind of disappointing. Spam still gets through, and the spam that gets through is just so obvious, how can you not tell it's spam? I know Google doesn't, like, have any smart employees or anything like that, so let me help you guys out :

1. If it's an email selling ED drugs, it's probably spam.

2. If I get an email and mark "report spam" on it - you probably shouldn't send me the *exact same email* the next day.

3. If the email is full of crazy non-text characters or weird ANSI color codes, it's probably spam.

I mean, Google is letting through emails whose entire body is :


Buy Viagra&Cialis&Tramadol....

How can you possibly think that's right?

BTW I think it's very productive to have an atmosphere at your company where anybody can challenge anybody else about the behavior of their functionality. So for example in this case I'd like to see every Google employee forwarding their spam to the spam dev team. But in game dev I'd like to see the artists go up to the tools guy and go "hey dude, our level export takes 30 minutes, WTF." This is one of those things that's very hard to get right. If you have a culture where everyone is friendly and everyone is trying to be the best, it can be welcomed. But if your culture is that everyone is just trying to do the minimum of work and fly under the radar and not talk to each other, then this will be very unwelcome.

01-20-09 - Laptops Part 3

A little more reportage from the laptop front. Holy christ looking for laptops sucks so bad. They're basically all the same inside, but of course people can't just name them clearly based on what they're like. You have to look into every single fucking model in great detail to see what's actually going on with it. Some general thoughts :

802.11n (aka "Wireless N") seems cool; I wasn't really aware of that. The Intel Wifi 5100 and 5300 seem like the way to go. A lot of newer laptops don't have 802.11n which seems pretty dumb.

As much as I hate it, Vista-64 seems like the way to go for the future. All the rotten programmers out there are gonna bloat things up to all hell in the next few years, and I imagine you'll need 8 GB of RAM to run anything in 2010. Hell photoshop already wants 2GB+ for itself. A 32 bit OS locks you into 3 GB which is gonna suck in 5 years.

XP can definitely be found for laptops, but it does greatly reduce your choices. The Dell business line offers "XP downgrades" on most laptops. Lenovo Thinkpads and Toshiba Satellites also have some XP options. Oddly it seems that all of the Mini (10" or less) lappies are on XP.

LED backlighting is brighter, lighter weight, thinner, and uses less power (than the older CCFL backlighting). That all sounds good, *however* I have seen a lot of reports of problems with LED screens. Unfortunately it seems to be one of those things where they vary by screen manufacturer (LG or Samsung or whoever), and every laptop brand sources their screens from various manufacturers, so you never know what you're going to get. The problems I've seen reported are edge bleeding, flickering, brightness variation, and "chunky" appearance. I think it's worth the gamble going with LED now, but you might have to fight your manufacturer for RMA if you get a bad one.

SSD's are clearly awesome for laptops but don't seem quite ready for prime time. Again as with the LED screens, not all SSD's are created equal, and it's hard to tell what brand you're going to get when you buy a laptop (mainly some of the off brands have very very poor write performance). Also, there appear to still be major problems with the bios/driver interaction with SSD's - I've widely seen reports that they completely stall the CPU during heavy write activity, indicating a failure to do asynchronous writes. Oddly the SSD upgrade options for laptops seem uniformly overpriced (just buy your own and stick it in yourself) - but there are some laptops with SSD's included by default that are moderately priced.

As for graphics - I've seen a lot of reports of heat problems with NV 8000 series of mobile chips (See previous laptop post for links). Apparently the 9000 series is better. The ATI chips seem to have less of a problem with heat, however there are many reports of problems with ATI mobile drivers under Vista. Yay. The Intel integrated chips are the best choice if all you care about is battery life.

The NV9400 is an integrated part (integrated with the south bridge / memory controller, all that nForce whatever). It's supposed to be pretty good for power and is a good DX10.1 part. It's what's in the MacBook Pro for example. BTW it's a real LOL to see the tech writers speculate about Apple shunting OS work off to CUDA. Buy the marketing nonsense much? BTW also lots of laptop specs lie about the "graphics memory" with NV 9 chips. NV 9 is a shared-memory model chip because it's built into the memory controller, kind of like the XBoxes, so the "1 Gig" of graphics memory they're claiming just means that the NV 9 is using 1 G of your main memory.

Another interesting new thing in graphics is the Thinkpad T400 Switchable Graphics . The T400 has an ATI HD3470 for games and an Intel integrated X4500 for battery life. Under Vista apparently you can switch with just a mouse click - no reboot. Under XP you have to reboot. This is kind of cool, but it's also scary as hell because apparently it uses a hacked-together FrankenDriver. Personally I think just going with the NV9400 is a better choice. (also, the MacBook Pro has a 9400 -> 9600 switch option; that's weird because the 9400 is good enough IMO. anyway apparently you have to log out and log back in to switch which is kind of lame; it's not quite a reboot, but it's not a mouse click either).

There are a lot of shitty screens out there. I ranted before about the prevalence of 1280x800 15" screens. Oddly, there are also 1920x1200 15" screens now, which is equally insane in the opposite direction. (I assume those laptops come with a complimentary magnifying glass). Many laptop manufacturers now seem determined to fuck up the screens for no reason by taking glossy screens and then PUTTING AN EXTRA GLOSSY FILM ON ALREADY GLOSSY SCREENS. Apple for example does this with the new MacBooks. The manufacturers usually call them "edgeless" or "infinity" screens or something like that, but they should call them "nerfed" or "bricked" screens because they make your laptop useless outside of pitch dark. BTW it's super LOL that the MacBook website shows pictures of the screen with giant reflections across it; the web site design people must have thought it was cool looking and intentional so they're trying to show off the sweet glare. (the MBP is available with matte I gather, but the MB is not). BTW also this is not a question of taste - yes, matte vs. glossy is a reasonable choice that could go either way, but "glossy with extra gloss layer" is not a reasonable choice.

On the plus side, matte screens definitely can be found (or just plain glossy without the extra stupid layer if you like glossy). With quite a lot of searching I found a bunch of 14" and 15" laptops with matte WXGA+ (1440x900) screens. Something I could not find was non-widescreen (1400x1050). Everything is widescreen now. (you can find 1680x1050 aka WSXGA+ in 15" if you want a bit more res)

Intel Core 2 Duo is the way to go, but there's a little trap. Some of them are 25W and some are 35W. Obviously if you care about heat and battery you want the 25W. It seems the secret code here is a "P" in the name and ignore the number, so P8400 or P9500 is good. Like everything else Intel has decided to jump into fucking retarded obfuscated naming land. Oh yes, of course I want a Penryn on Montevina that's a 9500 (beware, some Penryns on Montevina are 35W). Pretty good guide to Intel nomenclature .

The really thin & light laptops are crazy overpriced right now for no reason. It's the same fricking parts, in fact often they use *cheaper* parts like older CPU's and lesser GPU's, but then they charge almost a $1000 markup for being small. For example the Dell Latitude E4300 is around $1700 but a more capable and identical (but larger) Latitude E6400 is around $900. This applies obviously to stuff like the Air and the Envy, but also to pretty much every good Sony Vaio. If you want the super light laptop you have to either shell out, or go off-brand, or wait.

There are countless things to verify that many lappies fail on : heat, noise, touchpad works decently, hinges okay, screen latch (!), solid case, gigabit ethernet, eSATA.

Finally, www.notebookcheck.net was my savior in finding laptop information. They actually have all the models gathered and a pretty clear simple list of specs with each one, so you can decode the laptop model jungle. Notebookcheck reviews actually have noise volume measurements and exact temperature measurements. Booya. They do it in Celsius. 30 C is okay. 35 C or more is uncomfortable.


Now the specific models :

Dell Latitude E6400 is one of my favorites. Available in 14.1" with WXGA+ (1440x900) matte LED. It's a metal case. It's reasonably light and small and has decent battery life. The specs are not top of the line but they're fine. It also has no huge "style" markup price. Price range $800-1100. The E5400 is a good budget choice; it's just a bit heavier and bigger with slightly downgraded components, but can be had for $600. (these have no top-end graphics options, they're for non-gamers)

Thinkpad T400 is probably what I'd buy for myself right now if I was buying something. 14.1" with WXGA+ (1440x900) matte LED. Very good battery life. Good build quality and keyboard, though I have seen reports that they are flimsier and the keyboard is worse than the older Thinkpads like the T61 line. T400 is around $1200 well equipped. BTW the Thinkpad R400 and Ideapad Y430 are both very similar to the T400, but not really much cheaper, so just go with the T400.

Dell Studio 15 is a pretty solid middle of the road choice for a 15" lappy. 1440x900 LED, but sadly only glossy. It's rather heavy and the battery is not great. I rather like the sloping keyboard design of these things, much more ergonomic than the big brick style of the Thinkpads.

HP dv5t is a 15" with 1680x1050. Sadly it's only glossy and offers the retarded "infinity display", which is awesome if you like seeing reflections of the sky. The HP dv4 and dv5 are the cheapest (major brand) laptops with high end GPUs and decent specs. Battery life is notoriously poor for HP's. They use the dedicated NV chips, not the integrated.

The Toshiba Satellites seem okay but have nothing in particular to recommend them over one of these.

The Acer Aspire line is pretty amazing if you want a sub-$500 laptop. The components are basically fine, the only thing sucky about them is shitty plastic build quality. If you're not a style snob, they could be a great choice.

The Sony Vaio Z is a very good ultraportable 13", but expensive ($1699). The cheaper Vaio's like the NR,CS,NZ are not competitive in build quality or specs with the laptops above.

The new "unibody" MacBook (not pro) is actually a decent value; I mean, it's not a good value, but it's not awful. Unfortunately it's only available in awful-gloss.

Lastly, laptop pricing is fucking awful. Because you're locked into the manufacturers, they do weird shit with sales and coupons. The prices can fluctuate wildly from one day to the next. Chances are one of the laptops mentioned here is available with a close to 50% discount right now. So if you are not locked into a particular model and can wait a bit, do so.

1/20/2009

01-20-09 - Tuesday Random

You shouldn't put flags in the top bits of pointers any more, because there are things like /3GB and others that make your address space top-down. *However* you can usually still put flags in the bottom bits of your pointers. If all allocations are 8 or 16 byte aligned you've got 3 or 4 bits of flags in the bottom of your pointers. I like to make a little class called PointerAndFlags that lets you get at the pointer part and the flag parts cleanly.
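Something like this sketch (my own illustration, assuming at least 8-byte alignment so there are 3 free bottom bits) :

    #include <assert.h>
    #include <stddef.h>

    template <typename T>
    class PointerAndFlags
    {
        uintptr_t m_bits; // pointer in the high bits, flags in the low 3
    public:
        PointerAndFlags() : m_bits(0) { }

        void SetPointer( T * ptr )
        {
            assert( ( (uintptr_t)ptr & 7 ) == 0 ); // must be 8-byte aligned
            m_bits = (uintptr_t)ptr | ( m_bits & 7 );
        }
        T * GetPointer() const { return (T *)( m_bits & ~(uintptr_t)7 ); }

        void SetFlag  ( int i )       { assert( i >= 0 && i < 3 ); m_bits |=  ( (uintptr_t)1 << i ); }
        void ClearFlag( int i )       { assert( i >= 0 && i < 3 ); m_bits &= ~( (uintptr_t)1 << i ); }
        bool GetFlag  ( int i ) const { return ( ( m_bits >> i ) & 1 ) != 0; }
    };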

It occurred to me this morning that we almost have a "Game STL" at RAD. Jeff has a really good sort, and I wrapped it up in templates. I have my old vector (pretty trivial, really), and I recently screwed around with hash tables for Oodle which I could easily wrap in templates. Sort, vector and hash_map are 90% of the good stuff that I use in the STL. I could easily build my hash table on top of my vector, which means that there is only one type of object that does allocations, which is pretty cute.

Unfortunately I don't have time to really do that right, and there's not much motivation for RAD to do it. We couldn't sell it; nobody would buy it (even though they should - it's one of those things that actually has a ton of value, but everybody thinks they can write their own, and producers who do the funding don't understand the value even if the programmers do). It's also one of those things where people are foolishly unwilling to settle for compromises. If there's a very good available solution which doesn't make exactly the trade-offs they would have chosen, many programmers go off and roll their own. And anyway, even if we could sell it, we shouldn't; library code is the kind of thing that really needs to be open source, so that a consensus can be formed and everyone's fixes can be integrated, etc.

There's something I've written about previously in regards to poker, but I've been thinking about it a lot recently in other contexts. It's that "tough decisions" don't matter. Basically if a decision is really hard and complex and interesting - then it's probably irrelevant. If the EV of both ways is roughly equal, you don't need to sweat over which choice to take, just flip a coin and take one.

I thought about this the other day watching some interview on the Daily Show where someone was saying how Bill Clinton would sit down with you for hours and talk about policy with you, and how he was a policy nerd and just loved to get into the details. Okay, that's awesome, it's way better than someone who doesn't want to learn or study or think at all, but it's not actually the crucial thing in a president. That is, you don't need a president who gets the really tough issues right. All that really matters is having a president that doesn't *grossly* botch the *easy* decisions.

I think I've written about this in game development and production meetings and such too. I've been in lots of design meetings where we'll get sucked into some long argument about some irrelevant detail where the designers and artists will go over and over for hours about whether the health indicator should be orange or red. The fact is, getting those details right is super irrelevant. There are so many other issues that you're completely screwing up that you don't need to get those little things right.

I was thinking about it just now in terms of evaluating algorithms and STL's and such like. For example with hash tables there are a lot of different implementation choices, you can trade off code complexity vs. speed vs. memory use vs. flexibility. The truth is, almost any reasonable point on that curve would be fine. The only thing that actually matters is that you avoid implementations that are *grossly bad*. That is, they're way off the curve of where the good choices are. Whether you choose something that's slightly faster or not is pretty irrelevant in the scheme of things. (like all of these points, obviously this is dependent on context - decisions that are usually irrelevant can become crucial in some conditions).

1/19/2009

01-19-09 - swap, templates, and junk

I keep running into annoying problems.

One is having macro conflicts. I like to just have stuff like


#define MIN(a,b)    ( (a) < (b) ? (a) : (b) )

for myself, which is fine and all, until you try to mix multiple codebases that all have their own idea of what that macro should be.

(obviously MIN is not actually a big deal because everybody agrees on what that macro should be; a bigger issue is when people #define malloc or SWAP or Log).

The cleanest thing I've come up with is :


1. All #defines in headers must be "namespaced" , like CB_MIN instead of just MIN.

2. Never #define anything that looks like it's in global namespace (eg. no #defining malloc)

3. Make a header called "use_cb.h" which does
    #define MIN     CB_MIN
    #define malloc  cb::malloc

    etc.

Basically this is a hacky way of "namespacing" the macros and then "using the namespace" in the CPP files. Not awesome.


The other issue I'm running into is generic functions and template specialization.

Say you're working with some codebase, let's call it "stupidlib". They have their own ::swap template :


template < typename Type > inline void stupidSwap(Type &a,Type &b)
{
    Type c = a; a = b; b = c;
}

And then they also specialize it for some special cases :

template < >
inline void stupidSwap<stupidObject>(stupidObject &a,stupidObject &b)
{
    a.Swap(b);
}

Now, you are trying to munge stupidlib into a codebase that just uses the normal STL. It's putting objects in containers, and you want it to use std::swap correctly, but when you call std::swap on a stupidObject - it doesn't see that a nice swap specialization has been made.

So far as I can tell, you cannot fix this. It is impossible to make std::swap redirect to stupidSwap for the cases where stupidSwap has been overridden. You have to do it manually for each object type, and any time somebody changes stupidlib your code can silently break.
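The manual glue looks something like this (a sketch; one of these per type, and it silently rots if stupidlib changes) :

template < typename Type > inline void stupidSwap(Type &a,Type &b);

// stand-in for stupidlib's class (it has a member Swap, as above) :
class stupidObject
{
public:
    void Swap( stupidObject & other ) { /* ... fast member-wise swap ... */ }
};

// the per-type glue you have to write by hand -
// specialize std::swap so STL algorithms pick up the fast path :
namespace std
{
    template < >
    inline void swap<stupidObject>( stupidObject & a, stupidObject & b )
    {
        a.Swap(b); // now std::sort etc. on stupidObjects gets the fast swap
    }
}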

Now there is a pretty easy solution to this : never define a generic operation which already exists in the STL - just override the STL.

The problem is that people like to replace bits of the STL all the time. Hell I do it, because the STL headers are so retardedly bloated they kill compile times, so I try to use them minimally.

But that "solution" isn't really a solution anyway - it's just a convention to use the STL defs as bases for overloading when they exist. When you're trying to overload a generic operation that isn't defined in the STL you go back to having the same problem - two codebases can define that operation and various specializations, and you can't mix them.

Generic/template programming simply does not work well when mixing codebases.
