2/10/2009

02-10-09 - How to fight patents

Some rich folks like Gates and Buffett could create a consortium that just patents everything it can. It would also accept "donations" of patents from people like us who want to make sure their work is free ("defensive patents") - e.g. if you invent something, you submit the paper to them and they patent it.

Obviously this institution could just make all its patents free for anyone to use. But it could also use them as a hammer to force open other patents.

For example, you could make all your patents absolutely free for any other institution that does not own any patents. If you do own any patents, then you cannot use our patents unless you also make your patents completely free under the same license. (sort of like the GPL).

This might be too restrictive, but on the plus side if you just got some good patents into this institute, that would force lots of other people to open up theirs under the same license, and in the ideal world it would quickly snowball.

(of course this was the idea of GPL open source code as well, and I would say that has generally been a complete failure; the real result of the GPL is that nobody serious can use GPL code, so all the GPL libraries out there are just wasted).

2/09/2009

02-09-09 - void pointer

Urg WTF why doesn't C++ let void * be implicitly cast to any other pointer type !? It's fucking void * obviously I'm using that because it's untyped or type-unknown so I'm going to be casting it. Forcing me to cast in situations like that where it's obvious just makes me more inclined to cast all the time without thinking about it, which is much worse.

Being too strict about enforcing stupid rules is very negative. It makes people ignore the rules. You need to complain only when it really matters.
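For what it's worth, a tiny sketch of the difference (the function is made up, just for illustration) :

```cpp
#include <cassert>
#include <cstdlib>

// C would accept "int * pi = p;" for a void* p ; C++ will not,
// so you end up writing the cast everywhere :
int roundtrip(int v)
{
    void * p = std::malloc(sizeof(int));
    // int * pi = p;          // legal C, error in C++
    int * pi = (int *)p;      // C++ forces the explicit cast
    *pi = v;
    int out = *pi;
    std::free(p);
    return out;
}
```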

2/04/2009

02-04-09 - Exceptions

I'm not really high on the whole Herb Sutter super-exception-safe everything, but exceptions are really nice for error handling sometimes. They have two big wins in my opinion (vs. just having error return values and checking with if statements) :

1. Consolidating error handling code and putting the error-handling code in the place that actually can respond to the error in a reasonable way.

2. Correct default behavior if you are lazy or screw up and don't handle the error - it automatically propagates out rather than just being silently ignored. Standard if-checking badly violates one of my guiding principles in coding : if you don't write the right code, it just silently breaks; I always want my programs to noisily scream when I don't write the right code.

Anyway, I'm thinking about this because handling Async IO errors is a huge pain. The big standard problem comes from file opens. The thing is, my "fopen" doesn't actually open the file, it just queues an open request that will get done at some time in the future. If the open fails (say because the file doesn't exist), you don't actually see that until you try to get a char off the file and it's no good.

With if-checking it looks like :


File = fopen();
// no need to check error here really because this did nothing

... do other work to give it some time

char c = fgetc(File);
// ! boom this can fail because the file didn't actually open

// so plain fgetc is no good, you have to do :
char c;
EStat status = fgetc(File,&c);
if ( status == error )
   ... clean up file, back out work already done, christ

With exceptions it's a million times cleaner. You just say the File ops can throw various IOExceptions. One of those IOExceptions is "FileNotFound". That exception might be thrown in the fopen() or in the next fgetc() or whatever.

In fact with exceptions and class back-out you can also have it automatically undo the temp work you did :

try {

File = fopen();

SomeClass X;
X.DoStuff();

char c = fgetc(File); // this might throw if the open actually failed

X.Read(File);

// ... okay, all good. now :

World.Apply(X);

} catch ( IOException & e ) {
   // clean up ; X just destructs and backs out its temp work
}

Sadly I can't use exceptions for various reasons (aside from all the practical problems with exceptions, lots of clients have them disabled, so there you go), but my god the error stuff for this async IO is so nasty. One way I'm handling it is by sort of simulating the exception code style without exceptions.

What I do is still use an fgetc() that just returns a char. If the file fails to open, fgetc() just returns zeros, and an error flag is set in the file. That way you can go ahead and write straight line code without checking errors, and then once in a while you can check the error flag. So the code looks like :

{
File = fopen();

SomeClass X;
X.DoStuff();

char c = fgetc(File); // this just sets File.error

X.Read(File);

if ( File.ok )
   World.Apply(X);
// else X will just be destructed and not be loaded into the world

}

Sadly it's not actually so neat in practice. In particular with many layers of wrapping, exceptions just do What You Want magically. With an error flag I have to be pushing it through manually all the time. A lot of my files are actually a raw disk file, wrapped in a double-buffer file layer, wrapped in an LZ decompress layer. Each layer has to push out the status to the parent, and errors can occur at any point in the code because of the async nature.
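To make the flag-propagation idea concrete, here's a hedged sketch ; the classes are made up and vastly simpler than the real raw/double-buffer/LZ layers, but they show the pattern of returning zeros after a failure, latching an error flag, and pulling it up through a wrapping layer :

```cpp
#include <cassert>

// Hypothetical sketch : each layer returns zeros after an error and
// latches a flag that the parent layer pulls up on demand.
struct RawFile
{
    bool error;
    int  pos, size;
    RawFile(int s) : error(false), pos(0), size(s) { }
    char GetC()
    {
        if ( pos >= size ) { error = true; return 0; }  // fail : latch flag, return zero
        return (char)('a' + pos++);
    }
};

struct BufferedFile
{
    RawFile & raw;
    bool error;
    BufferedFile(RawFile & r) : raw(r), error(false) { }
    char GetC()
    {
        char c = raw.GetC();
        error = error || raw.error;   // push the child's status up to this layer
        return c;
    }
    bool Ok() const { return ! error; }
};

bool ReadN(BufferedFile & f, int n)
{
    for(int i=0;i<n;i++) f.GetC();   // straight-line code, no per-call checks
    return f.Ok();                   // check the flag once at the end
}
```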

2/02/2009

02-01-09 - Swap and Templates Part 2

So I'm back into the nightmare of std::swap and template overloading. I've been working on my cb::hash_table a bit, which I built on cb::vector so I'm seeing a lot of stuff with vector. One issue is when the vector changes capacity, you have to copy all the data over and delete the old ones. The normal way to implement vector is to do this with a bunch of uninitialized copy constructor calls, then call the destructor on all the old data. But that copy can be very expensive, and worse if your objects are large in memory it means you can temporarily double your memory use.

So I had the idea to use swap. When the vector capacity changes, first go ahead and default construct everything in the new space so it's valid. Then ::swap the new to the old, and then destruct the old vector. This is of course actually not ideal; what you really want is a single call that would "swap_construct" an object so that you don't have to default construct anything. But anyhoo, for many types of data it's easy to make a very cheap default constructor and a very fast swap.
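A hedged sketch of the grow-by-swap idea ; "Element" and GrowBySwap are made-up names standing in for the real vector internals :

```cpp
#include <cassert>
#include <vector>

// Grow an array by default-construct + swap instead of copy-construct +
// destruct, so heavy contents are moved, never duplicated.
struct Element
{
    std::vector<int> data;           // expensive to copy, cheap to swap
    void Swap(Element & other) { data.swap(other.data); }
};

void GrowBySwap(Element * & arr, int count, int newCapacity)
{
    Element * newArr = new Element[newCapacity];  // default construct : cheap
    for(int i=0;i<count;i++)
        newArr[i].Swap(arr[i]);                   // move contents without copying
    delete [] arr;                                // old elements are now empty, cheap to destruct
    arr = newArr;
}
```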

Apparently this idea is old, and in fact newer versions of the Dinkum STL do this; they call it "Swaptimization", which you can find on that page if you search for swap. BTW it's also obvious that sometimes swapping is worse, eg. for vector< int > - but it's never much worse. If you wanted maximum speed on silly cases like that, you could use a type-traits template to choose whether the class prefers swapping or copying. Actually in my old versions of my own vector I had a lot of type-traits to select the most efficient operation, but I've ripped that all out now. I find it too fragile and just also not important.

(note that most STL implementations now use type traits for acceleration; I think this is really loathsome, because it's not exposed in a standard way, so you can't get to it for your own types - it makes them fast on basic types and their own types, and slow on your types, which blows - the whole idea of generic programming is that you can adapt it to work on your types, but the standards people have recently started giving special treatment to things in the STL; for example the STL gets different name lookup rules now (wtf)). (for type traits in the STL see for example stuff like "_IsOKToMemCpy" and "has_trivial_destructor" in STLport.)

Anyhoo, the problem now is that you need to make sure the right swap() is called. Ruh roh. You can't really correctly overload std::swap in the std namespace; see for example: STL defects #225-227, the PDF post about swap problems by Alan Griffiths, or C++ Standards - The "swap" Problem.

So, what might you do instead? One option is to define your own swap() in your own namespace, and rely on the Koenig lookup to find your swap (cuz your objects are in your namespace). That doesn't work, partly because the STL calls std::swap explicitly. I'm not sure why they do that instead of just calling swap(). Apparently new MSVC uses ADL_Swap or some funny mumbo jumbo to try to fix this.

In any case, I have a hacky solution that I'm using for now.

I go into the STL headers to the definition of std::swap. It should look something like this :

   template < class _Tp >
   inline void swap(_Tp& __a, _Tp& __b) {
     _Tp __tmp = __a;
     __a = __b;
     __b = __tmp;
   }   

I change that to :

   template < class _Tp >
   inline void swap(_Tp& __a, _Tp& __b) {
     MySwap(__a,__b);
   }   

Now in my own code I have MySwap :

template < typename Type >
struct swap_functor
{
   void operator () (Type &a,Type &b)
   {
      Type c = a; a = b; b = c;
   }
};

template< typename Type > inline void MySwap(Type &a,Type &b)
{
   swap_functor< Type >()(a,b);
}

Which redirects MySwap to a functor, and the default implementation is the usual swap.

The reason to go to a functor is that you can take advantage of partial specialization: class templates can be partially specialized while function templates cannot. Thus you can easily partial-specialize for all types of vectors (for example) :

template < class t_entry > 
struct swap_functor< cb::vector< t_entry > >
{
   void operator () ( cb::vector< t_entry > & _Left, cb::vector< t_entry > & _Right)
   {
      _Left.swap(_Right);
   }
};

The reason we do the redirects like this is that when you call anything in std::algorithms, they will call std::swap. First they will hit the STL-provided overloads of std::swap for various STL types (like std::vector and std::string). Then they will fall through to the default, which calls MySwap. MySwap will go to the specializations that I wanted for my types. Finally it will fall back to the default swap.

This is the only way I know of to get specialization for both my types and the STL types. Note that to get the specializations on the std types, you have to call std::swap() - it's the "outermost" swap, and you always need to call to the leaves.

Now, for example, if I wanted to also get somebody else's library integrated, call it "somelib" that implements "somelib::swap" , I would do all the above, but the cb::swap_functor default implementation would be changed to call somelib::swap. Again everyone should call std::swap to get all the specializations.

Obviously this is not ideal because it involves digging into other people's libraries.

BTW using Swap in vectors is obviously semantically much better, even if it isn't an optimization. If you consider default-constructed objects to be "nulls" and objects that have been set up and pushed into a vector to be "initialized", then the standard copy way of growing a vector temporarily makes a bunch of new "initialized" objects. The swap way just moves them and doesn't temporarily change the number. A common case is stuffing something like a shared_ptr ref-counted pointer into a vector - the swap way keeps the ref counts all invariant, while the copy way temporarily bumps the refs up then back down again.

BTW one thing I noticed while testing this is that std::sort doesn't seem to use swap (!?). I always thought it did. Other algorithms, like std::reverse and std::random_shuffle, do call std::swap. sort seems to invoke the copy constructor and assignment a lot. The RAD version rr::sort does work entirely through swap, which means it should beat std::sort very trivially in bad cases, like vector< vector< int > >.

ADDENDUM : thanks to the commenters; I think what I like best is to make my swap_functor default to doing a *byte* swap. Then any object which wants to be swapped and can't stand to be byte swapped (very rare) would override swap_functor. Doing it that way means almost nobody ever has to override swap_functor, it just uses the default (byte swap) almost always.

BTW the cool way to byte swap is something like this :


#pragma pack(push)
#pragma pack(1)
template < int n_count >
  struct Bytes { char bytes[n_count]; };
#pragma pack(pop)

template < typename t_type >
void ByteCopy(t_type * pTo,const t_type & from)
{
   typedef Bytes< sizeof(t_type) > t_bytes;

   *(reinterpret_cast< t_bytes * >(pTo)) = reinterpret_cast< const t_bytes & >(from);
}

template < typename t_type >
void ByteSwap(t_type & a,t_type & b)
{
   typedef Bytes< sizeof(t_type) > t_bytes;

   t_bytes c; // don't use t_type cuz we don't want a constructor or destructor
   ByteCopy(&c,reinterpret_cast< const t_bytes & >(a));
   ByteCopy(&a,b);
   ByteCopy(&b,reinterpret_cast< const t_type & >(c));
}

The nice thing about this is for small types the compiler does the right thing and just generates mov instructions. For large types you would want to call a memswap function, but you would never do this on large types so punt.
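A quick usage sketch ; this is a compacted, self-contained restatement of the ByteSwap above (strict-aliasing pedants beware, but in practice small types compile to plain movs) :

```cpp
#include <cassert>

#pragma pack(push,1)
template < int n_count >
struct Bytes { char bytes[n_count]; };
#pragma pack(pop)

// Same ByteSwap idea as above, compacted so this sketch stands alone :
// raw byte copies through Bytes<>, so no constructor or destructor runs.
template < typename t_type >
void ByteSwap(t_type & a, t_type & b)
{
    typedef Bytes< sizeof(t_type) > t_bytes;
    t_bytes c = reinterpret_cast< t_bytes & >(a);
    reinterpret_cast< t_bytes & >(a) = reinterpret_cast< t_bytes & >(b);
    reinterpret_cast< t_bytes & >(b) = c;
}

struct Point { int x, y; };
```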

1/30/2009

01-30-09 - SetFileValidData and async writing

Extending and async writing files on Windows is a bad situation. See for example Asynchronous Disk I/O Appears as Synchronous on Windows . Basically you should just not even try, but if you really want to give it a go :

1. Open your file with FILE_FLAG_OVERLAPPED and also FILE_FLAG_NO_BUFFERING. Buffering helps reads, but it severely hurts writes; it takes my write speeds from 80 MB/sec down to like 30 MB/sec. I suspect the reason is that buffering causes the pages to be read in before they are written out (it's sort of like using cached memory - it will fetch in the pages even though you're doing nothing but stomping all over them).

2. Use the undocumented NtSetInformationFile to resize the file to its full size before you write anything. SetEndOfFile will only work for page size granularity, NtSetInformationFile can do arbitrary sizes. BTW this is also better for fragmentation than just writing lots of data onto the end of the file, which can cause NTFS to give you lots of allocation chains.

3. Use SetFileValidData to tell Windows that whatever is in the sectors on disk is okay. If you don't use SetFileValidData, Windows will first zero out each sector before you get to touch it. This is a security thing, but obviously it's pretty bad for perf to basically write the whole file twice. SetFileValidData will fail unless you first ask for your process to get the right permissions, which of course will only work for processes running as administrator. Okay, I did all that. This post is okay, but dear god don't read the thread.

If you do all those things right - your WriteFile() calls will actually be asynchronous. The actual speed win is not huge. Part of the problem is the next issue :

When I do all that, I start hitting some kind of weird OS buffer filling issue. I haven't tried to track down exactly what's happening because I don't really care that much, but what I see is that the writes are totally async and very fast (> 100 MB/sec) for some random amount of time (usually up to about 10 MB of writing or so) and then suddenly randomly start having huge delays. The write speed then goes down to 40 MB/sec or so.

ADDENDUM : when I say "you should not even try" I mean you should just live with WriteFile() being synchronous. It's plenty fast. Just run it from a thread and it still looks async to your thread (you need a thread anyway because OpenFile and CloseFile are very slow and synchronous; in fact the only thing you can rely on actually being fast and async is ReadFile). Also just live with the fact that Windows is zeroing the disk before you write it, everyone else does.
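The "just run it from a thread" advice sketched portably ; this uses std::thread and fopen as stand-ins for the Windows calls, and the file name is made up for the demo :

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <thread>

// The blocking write, standing in for synchronous WriteFile :
static void WriteAllBlocking(const char * name, const std::string & data)
{
    FILE * f = std::fopen(name, "wb");
    if ( ! f ) return;
    std::fwrite(data.data(), 1, data.size(), f);
    std::fclose(f);
}

// Kick the write to a worker ; this returns immediately, so it looks
// async to the caller, who later joins (or polls a flag) as "completion".
std::thread BeginWrite(const char * name, const std::string & data)
{
    return std::thread(WriteAllBlocking, name, data);
}
```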

01-30-09 - Stack Tracing on Windows

There are basically 3 ways to capture stack traces on Windows.

1. Manually walking ebp/esp; this steps back through the frame pointers, and it relies on the caller's frame pointer being pushed. This is basically what RtlCaptureStackBackTrace or DmCaptureStackBackTrace does, but you can also just write it yourself very easily. The advantage is that it's reasonably fast. The disadvantage is it doesn't work on all CPU architectures, and it doesn't work with the frame-pointer-omission optimization.

For info on RtlCaptureStackBackTrace, see Undocumented NT Internals or MSDN

2. StackWalk64. This is the new API you're supposed to use. The advantage is it works on all CPUs and it even works with frame pointer omission (!). But you can see from that latter fact that it must be very slow. In order to work with FPO it loads the PDB and uses the instruction pointer map to figure out how to trace back. It also can trace through lots of system calls that normal ebp-walking fails on.

See gamedev.net or ExtendedTrace for examples. But it's really too slow.

3. Manual push/pop in prolog/epilog. Uses the C compiler to stick a custom enter/leave on every function that does a push & pop to your own stack tracker. Google Perftools has an option to work this way. The "MemTracer" project works this way (more on MemTracer some day). The nice thing about this is it works on any architecture as long as the prolog/epilog is supported. The disadvantage is it adds a big overhead even on functions that you never trace. That rather sucks. Stacktraces are very rare in my world, so I want to pay the cost of them only when I actually do them, I don't want to be pushing & popping stack info all the time.

1/28/2009

01-28-09 - Graph Memory

So we have a thing to track memory allocs with a stack trace, la di da, no big whoop. I log it all out to a file. So I wrote a thing to parse them into a hierarchy and spit them out with tabs for tabview . That's awesome.

Then I thought, hey, Atman makes these awesome graphs with "graphviz" so maybe I'll try that. One disadvantage of the pure hierarchy view in tabview is that you can't really see the flow when lines merge back up. That is, call graphs are not *trees* they're *DAGs*. Sometimes the stack hierarchy forks apart but then comes back together again. Graphviz should be able to show this neatly. Graphviz makes "SVG" files that you can just load with Firefox (Firefox 3's support is much better than 2's).

Anyway I made some graphs with various options and it's kinda cool. Here's an example : Allocs1.svg (you'll need to use ctrl-wheel to zoom out to see anything). Apparently if you click this link, Firefox does nothing good - you have to download it, then open it with Firefox. WTF. Apparently it's my Verio servers doing the wrong thing. Yegads I hate the web.

Not bad, but I'm a little disappointed with graphviz's layout abilities. In particular the actual cell and edge layout seems very good, but they seem to have literally zero code to try to put the labels in good places.

In a bit of odd deja vu, this was one of the very first things I ever worked on as a professional programmer in 1991. I worked for a GIS company that had an old COBOL GIS database engine that worked with Census data and such; I wrote them a new C front end with graphics for PCs. One of the things you have to do is take this raw street data with lat/long coordinates and street names and do some nice munging to make it look okay; a big part of that is a lot of heuristics about putting the labels for streets in good places (and when to repeat the label, etc.).


ADDENDUM : talking to Sean I realized you really want the graph to have different sizing/color options, be hierarchical, interactive, and stable.

That sounds hard, but I don't think it actually is. The key thing that makes it easy is that there is very good hierarchy in this information, and you can create the graph incrementally. I think that means you can just use simple iterative penalty methods to make the graph stable.

Here's my proposal :

Start with the graph at very coarse granularity; maybe directory granularity if you have a few directories and that makes sense, else file granularity. Whatever coarse level gets you < 32 nodes or so. Just use a solver like graphviz to make this initial graph.

Now, interactively, the user can click any group to expand its hierarchy. When that happens the big cell splits into various pieces. You just create the new pieces inside the parent and make the new edges - and then you just let them time-evolve with a semi-physical iterative evolution.

You apply a penalty force for intersection with neighbors to drive the nodes apart so there's no overlap. You similarly apply forces so that edges never intersect other edges. And the edges also act kind of like springs, applying forces to try to be short and straight. Stable 2d physics is a pretty solved problem, so you just let it run until things settle down. Note that as nodes spread apart they can force the other nodes in the graph to move around, but it's all nice and smooth and stable.

I think it's much easier to treat the canvas as just infinitely large and let your nodes move apart all they need to. Graphviz does everything oriented towards being printed on a page which is not necessary for the interactive view.
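The penalty-and-spring iteration above can be sketched in a toy 1-d version ; all the names, gains, and rest lengths here are made up, but the structure (push overlapping nodes apart, let edges pull like springs, iterate until settled) is the same :

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Node { double pos, radius; };
struct Edge { int a, b; };

void Relax(std::vector<Node> & nodes, const std::vector<Edge> & edges,
           double restLen, int iters)
{
    for(int it=0; it<iters; it++)
    {
        // penalty force : push apart any pair that overlaps
        for(size_t i=0;i<nodes.size();i++)
        for(size_t j=i+1;j<nodes.size();j++)
        {
            double d = nodes[j].pos - nodes[i].pos;
            double mind = nodes[i].radius + nodes[j].radius;
            if ( std::fabs(d) < mind )
            {
                double dir = (d < 0) ? -1.0 : 1.0;
                double push = 0.25 * (mind - std::fabs(d));
                nodes[i].pos -= push * dir;
                nodes[j].pos += push * dir;
            }
        }
        // spring force : each edge pulls toward its rest length
        for(size_t e=0;e<edges.size();e++)
        {
            double d = nodes[edges[e].b].pos - nodes[edges[e].a].pos;
            double dir = (d < 0) ? -1.0 : 1.0;
            double stretch = std::fabs(d) - restLen;
            nodes[edges[e].a].pos += 0.1 * stretch * dir;
            nodes[edges[e].b].pos -= 0.1 * stretch * dir;
        }
    }
}
```

After enough iterations the chain settles with no overlaps and near-rest-length edges; the 2d version just does the same with vectors instead of scalars.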


ADDENDUM 2 : after figuring out some evil things I've got graph_viz making better graphs. Here's an example : allocs2 SVG (direct clicking should work now too).

See comments on this post for details.

I'm still not happy with the edge labeling, and the FF SVG navigation is not ideal, but it's useable now.

1/27/2009

01-27-09 - Oddworld Memory Memories

I've been working on some memory allocator related junk recently (just for laughs I made my SmallAllocator in cblib work lock free and timed it running on a bunch of threads; an alloc takes around 200 clocks; I think there may be some cache contention issues). Anyway it reminded me of some funny stuff :

Munch's Oddysee was a wild painful birth; we were shipping for Xbox launch and crunching like mad at the end. We had lots of problems to deal with, like trying to rip as much of the fucking awful NetImmerse Engine out as possible to get our frame rate up from 1 fps (literally 6 months before ship we had areas of the game running at 1 fps) (scene graphs FTL). And then there were the DMusic bugs. Generally I found that, dealing with Xbox tech support, the graphics and general functionality guys have been really awesome, really helpful, etc. Eventually we trapped the exception in the kernel debugger and it got fixed.

Anyhoo... in the rush to ship and compromise, one of the things we left out was the ability to clean up our memory use. We just leaked and fragmented and couldn't release stuff right, so we could load up a level, but we couldn't do a level transition. Horrible problem you have to fix, right? Nah, we just rebooted the Xbox. When you do a level transition in Munch, you might notice the loading screen pops up and is totally still for a few seconds, and then it starts moving, and the screen flashes briefly during that time. That's your Xbox rebooting, giving us a fresh memory slate to play with. (shortly after launch the Xbox guys forbade developers from using this trick, but quite a few games in the early days shipped with this embarrassing delay in their level load).

For Stranger's Wrath we used my "Fixed Restoring" Small Allocator, which is a standard kind of page-based allocator. We were somewhat crazy and used a hefty amount of allocations; our levels were totally data driven and objects could be different types and sizes depending on designer prefs, so we didn't want to lock into a memory layout. The different levels in the game generally all get near 64 MB, but they use that memory very differently. We wanted to be able to use things like linked lists of little nodes and STL hashes and not worry about our allocator. So the Small Allocator provides a small block allocator with (near) zero size overhead (in the limit of a large # of allocations). That is, there are pages of some fixed size, say 16 KB, and each page is assigned to one size of allocation. Allocations of that size come out of that page just by incrementing a pointer, so they're very fast and there's zero header size (there is size overhead due to not using a whole page all the time).

That was all good, except that near the end when all the content started coming together I realized that some of our levels didn't quite fit no matter how hard we squeezed - after hard crunching they were around 65 MB and we needed to get down to 63 or so, and also the Fixed Restoring Allocator was wasting some space due to the page granularity. I was assigning pages to each size of allocation, rounding up to the next alignment of 8 or 16 or 32. Pages would be given out as needed for each size bucket. So if a size bucket was never used, pages for that size would never be allocated.

This kind of scheme is pretty standard, but it can have a lot of waste. Say you allocate a lot of 17 byte items - that gets rounded up to 24 or 32 and you're wasting 7 bytes per item. Another case is that you allocate a 201 byte item - but only once in the entire game ! You don't need to give it a whole page, just let it allocate from the 256-byte item page.

Now in a normal scenario you would just try to use a better general purpose allocator, but being a game dev near shipping you can do funny things. I ran our levels and looked at the # of allocations of each size and generated a big table. Then I just hard-coded the Fixed Restoring allocator to make pages for exactly the sizes of allocations that we do a lot of. So "Stranger's Wrath" shipped with allocations of these sizes :


static const int c_smallBlockSizes[] = { 8,16,24,32,48,64,80,96,116,128,136,156,192,224,256,272 };

(The 116, 136, 156, and 272 are our primary GameObject and renderable types). (yes, we could've also just switched those objects to a custom pool allocator for the object, but that would've been a much bigger code change, which is not something I would want to do very close to shipping when we're trying to be in code lockdown and get to zero bugs).
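A hedged mini version of the page-per-size-bucket idea described above ; the names and page size are made up, and a real one also tracks its pages so it can free/restore them, which is omitted here :

```cpp
#include <cassert>
#include <cstdlib>

static const int c_pageSize = 16 * 1024;

// Each size bucket owns pages ; an alloc is just a pointer bump,
// so there is zero per-item header overhead.
struct SizeBucket
{
    int    itemSize;
    char * cur;   // next free byte in the current page
    char * end;   // end of the current page

    SizeBucket(int sz) : itemSize(sz), cur(0), end(0) { }

    void * Alloc()
    {
        if ( cur == 0 || cur + itemSize > end )  // no page yet, or page full
        {
            cur = (char *) std::malloc(c_pageSize);  // grab a fresh page for this size
            end = cur + c_pageSize;
        }
        void * ret = cur;   // the whole alloc is a pointer bump
        cur += itemSize;
        return ret;
    }
};
```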

01-25-09 - Low Level Threading Junk Part 2.02

Some more tidbits before I come back to this :

There is a misconception being widely spread that x86 can reorder reads (Herb Sutter and lots of good people have been repeating this). So far as I can tell that's just not true. The IA-32 spec says that writes don't move past writes nor do reads move past reads. (there are lots of other constraints in there).

x86 plain old load and store are acquire and release. Note that that does NOT mean they have a total order (though if your code is written right for acquire/release semantics you don't need a total order). However volatile (locked) ops on x86 do have a total order.

BTW I am very much not suggesting that you or I write code that relies on the quirks of the x86. It's better to be very careful and generally correct, and mark all the places where you are assuming particular types of sequencing. If those sequencing commands turn into NOPs on x86, then bully for you. Still, it helps me to actually know what's going on in our systems, and if we're going to say things, let's try to say them right; my god there's so much completely wrong information about this stuff out there. This Gamasutra article is just chock-full of wrong.

I found this guy Anthony Williams who seems to know what's up : intel-memory-ordering-and-c++-memory-model and acquire/release vs. sequential consistency

Also, Bartosz in the famous post where he gets it wrong talks about the different memory model constraints. One is :

# memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won't be moved before it (yes, even that is not guaranteed on all platforms!).

This is for the dependent read case. This is the same place that Linux uses "smp_read_barrier_depends". The typical scenario is like :


   Object * ptr = s_shared_ptr;
   smp_read_barrier_depends();
   int local = ptr->data;

or in C++0x style :

   Object * ptr = s_shared_ptr.load(memory_order_consume);
   int local = ptr->data;

Note that Bartosz says this in a funny way. The issue is not that the compiler or the CPU reorder buffer can move the dependent read before the pointer read. Obviously that's impossible. The issue is that for purposes of *memory timing* it could look as if the dependent read was moved earlier. Say I write this bad code :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

which is 

   mov ecx   , &s_shared_ptr
   mov eax , [ecx] // eax = ptr
   mov edx , [eax + 4] // data is at ptr+4

Can never be

   mov ecx   , &s_shared_ptr
   mov edx , [eax + 4] // data is at ptr+4
   mov eax , [ecx] // eax = ptr

Obviously there's no way to execute this out of order, the chip and the compiler are not *broken*. But it can look like it went out of order if your cache architecture allows it.

Say some other thread wrote s_shared_ptr->data , then s_shared_ptr. It used a Release so they went to the bus in order. But your chip is crazy dumb and loads cache lines in random order. Your chip reads the line with s_shared_ptr in it, and then your code runs :


   Object * ptr = s_shared_ptr;
   int local = ptr->data;

What you see is the *new* value of s_shared_ptr , but the *old* value of ptr->data. Now your dumb cache pulls in ptr->data but it's too late. We see that our code *acted* like it read ptr->data before ptr.

Fortunately this doesn't happen on modern chips (the Alpha is the only chip I know of that can do this). However to be really correct about marking up your code's memory model semantically you should include the memory_order_consume or smp_read_barrier_depends. Then your compiler can turn those into NOPs ;)

Now it's a little bit of a mystery to me exactly how processors manage this. I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.
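The publication pattern above, written in (what became) C++11 terms - note this post predates the final standard, and in practice compilers promote memory_order_consume to acquire, so the reader side here just uses acquire :

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Object { int data; };

std::atomic<Object *> s_shared_ptr(0);

void Publish(Object * obj)
{
    obj->data = 7;                                        // plain write first
    s_shared_ptr.store(obj, std::memory_order_release);   // then publish the pointer
}

int Consume()
{
    Object * ptr;
    while ( (ptr = s_shared_ptr.load(std::memory_order_acquire)) == 0 )
        { }                   // spin until the producer publishes
    return ptr->data;         // the release/acquire pair guarantees we see 7
}
```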


BTW I really don't like the idea of a C++ atomic<> class, or the Magic Microsoft Volatile. The problem with both of those is they hide where the code is doing really crucial synchronization things. You can have code that looks like :


   newnode->next = node->next;
   node->next = newnode;
   return node->data;

and it's secretly using atomic<> or volatile, and it only works because it's relying on those to do the right acquire/release stuff. Dear god.

The really scary thing about both of those is that they are deceptively magical. They work very neatly. Most of the time. And they make it very easy for any programmer who doesn't know anything to go ahead and write some lock free code. Yikes.

I would much rather see everyone continue to use locks, and when you really do need to do lock-free stuff, the C++0x proposal for specifically marking up the semantics needed with memory model constraints is pretty good IMO. It clearly marks what the code requires and what the ordering constraints are.

To say this more concisely : I'd rather see atomic<> functions in the language rather than atomic data types. Because it's really the operations that are "atomic" or not. But it looks like I won't get my way and we're all going to be smothered by an avalanche of broken thready code in the next 10 years.

1/26/2009

01-25-09 - Low Level Threading Junk Part 2.01

I'm going to update Part 2 with some fixes and I'd like to write a Part 3 about some actual lock-free madness, but those may have to wait a tiny bit. In the mean time I thought I would post some links.

These are some very good summaries, mainly by the MSDN guys :

MSDN - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
MSDN - Concurrency What Every Dev Must Know About Multithreaded Apps
MSDN - Understanding Low-Lock Techniques in Multithreaded Apps
MSDN - Synchronization and Multiprocessor Issues (Windows)
Intel - Multiple Approaches to Multithreaded Applications

These articles have a good discussion of the memory models of various processors ; in particular Part 1 has a nice table that shows what each processor does (warning : there may be some slight wrongness in here about x86).

Memory Ordering in Modern Microprocessors, Part I
Memory Ordering in Modern Microprocessors, Part II

Here come a mess of links about memory models and barriers and what's going on. There's a lot of contradiction (!!) so be careful : For example - does LFENCE do anything at all on x86 for normal aligned integral reads? good question... btw SFENCE definitely does do nothing (I think?).

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering
YALFAm (Yet Another Lock Free Approach, maybe) - comp.programming.threads Google Groups
Who ordered memory fences on an x86? - Bartosz Milewski's Programming Cafe
The Double-Checked Locking is Broken Declaration
Synchronization and the Java Memory Model
Re Memory fence instructions on x86
Re LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
Re Intel x86 memory model question 2
Re Intel x86 memory model question 1
Memory Barriers, Compiler Optimizations, etc. - comp.programming.threads Google Groups
LFENCE instruction (was [rfc][patch 33] x86 optimise barriers)
cbrumme's WebLog Memory Model

And now some links about some real mad shit. Do not try any of this at home. I've found two mad geniuses : Chris Thomasson ("the AppCore guy"), who has got some nutty x86 lock-free synchronization stuff that relies on details of x86 and does safe communication with zero fencing, and Dmitriy V'jukov ("the Relacy guy"), who has written a ton of nutty awesome stuff for Intel.

Wait free queue - comp.programming.threads Google Groups
Thin Lock Implementation - Bartosz Milewski's Programming Cafe
Paul E. McKenney Read-Copy Update (RCU)
Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
julian m bucknall Hazard Pointers (careful - wrong stuff in here)
Dr. Dobb's Lock-Free Queues July 1, 2008
Computer Laboratory - Cambridge Lock Free Library
CodeProject - A simple Win32 readers/writers lock with reentrance
Atomic Ptr Plus Project
AppCore A Portable High-Performance Thread Synchronization Library
Re Non-blocking queue for the 2 threads (AppCore)
refcount - mostly lock-free word-based atomic reference counting (AppCore)
Asymmetric rw_mutex with atomic-free fast-path for readers - Scalable Synchronization Algorithms Google Groups
Asymmetric Dekker Synchronization

There's also Intel's TBB :

Threading Building Blocks Home

It's a horrible name, it's not "Threading Building Blocks" at all. It's fucking OpenMP. It's not really a lib of little threading helpers, which would be handy, it's like fucking CUDA for CPUs.

Also just to repeat - I am in no way recommending that anybody go down this path. As I have said repeatedly - CRITICAL_SECTION is really fucking fast and smart and does exactly what you want. Just use it. If critical section locks are giving you problems it's because your architecture is wrong, not because the micro-routine is not fast enough.

old rants