01-25-09 - Low Level Threading Junk Part 2.02

Some more tidbits before I come back to this :

There is a misconception being widely spread that x86 can reorder reads (Herb Sutter and lots of good people have been repeating this). So far as I can tell that's just not true. The IA-32 spec says that writes don't move past writes nor do reads move past reads. (there are lots of other constraints in there).

x86 plain old load and store are acquire and release. Note that that does NOT mean they have a total order (though if your code is written right for acquire/release semantics you don't need a total order). However volatile (locked) ops on x86 do have a total order.

BTW I am very much not suggesting that you or I write code that relies on the quirks on the x86. It's better to be very careful and generally correct and mark all the places that you are assuming different types of sequencing. If those sequencing commands turn into NOPs on x86, then bully for you. Still, it helps me to actually know what's going on in our systems, and if we're going to say things, let's try to say them right; my god there's so much completely wrong information about this stuff out there. This Gamasutra article is just chock-full of wrong.

I found this guy Anthony Williams who seems to know what's up : intel-memory-ordering-and-c++-memory-model and acquire/release vs. sequential consistency

Also, Bartosz in the famous post where he gets it wrong talks about the different memory model constraints. One is :

# memory_order_consume: potentially weaker form of memory_order_acquire that enforces ordering of the current load before other operations that are data-dependent on it (for instance, when a load of a pointer is marked memory_order_consume, subsequent operations that dereference this pointer won�t be moved before it (yes, even that is not guaranteed on all platforms!).

This is for the dependent read case. This is the same place that Linux uses "smp_read_barrier_depends". The typical scenario is like :

   Object * ptr = s_shared_ptr;
   int local = ptr->data;

or in C++0x style :

   Object * ptr = s_shared_ptr.load(memory_order_consume);
   int local = ptr->data;

Note that Bartosz says this in a funny way. The issue is not that the compiler or the CPU reorder buffer can move the dependent read before the pointer read. Obviously that's impossible. The issue is that for purposes of *memory timing* it could look as if the dependent read was moved earlier. Say I write this bad code :

   Object * ptr = s_shared_ptr;
   int local = ptr->data;

which is 

   mov ecx   , &s_shared_ptr
   mov eax , [ecx] // eax = ptr
   mov edx , [eax + 4] // data is at ptr+4

Can never be

   mov ecx   , &s_shared_ptr
   mov edx , [eax + 4] // data is at ptr+4
   mov eax , [ecx] // eax = ptr

Obviously there's no way to execute this out of order, the chip and the compiler are not *broken*. But it can look like it went out of order if your cache architecture allows it.

Say some other thread wrote s_shared_ptr->data , then s_shared_ptr. It used a Release so they went to the bus in order. But your chip is crazy dumb and loads cache lines in random order. Your chip reads the line with s_shared_ptr in it, and then your code runs :

   Object * ptr = s_shared_ptr;
   int local = ptr->data;

What you see is the *new* value of s_shared_ptr , but the *old* value of ptr->data. Now your dumb cache pulls in ptr->data but it's too late. We see that our code *acted* like it read ptr->data before ptr.

Fortunately this doesn't happen on modern chips (the Alpha is the only chip I know of that can do this). However to be really correct about marking up your code's memory model semantically you should include the memory_order_consume or smp_read_barrier_depends. Then your compiler can turn those into NOPs ;)

Now it's a little bit of a mystery to me exactly how processors manage this. I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.

BTW I really don't like the idea of a C++ atomic<> class, or the Magic Microsoft Volatile. The problem with both of those is they hide where the code is doing really crucial synchronization things. You can have code that looks like :

   newnode->next = node->next;
   node->next = newnode;
   return node->data;

and it's secretely using atomic<> or volatile and it only works because it's relying on those to do the right acquire/release stuff. Dear god.

The really scary thing about both of those is that they are deceptively magical. They work very neatly. Most of the time. And they make it very easy for any programmer who doesn't know anything to go ahead and write some lock free code. Yikes.

I would much rather see everyone continue to use locks, and when you really do need to do lock-free stuff, the C++0x proposal for specifically marking up the semantics needed with memory model constraints is pretty good IMO. It clearly marks what the code requires and what the ordering constraints are.

To say this more concisely : I'd rather see atomic<> functions in the language rather than atomic data types. Because it's really the operations that are "atomic" or not. But it looks like I won't get my way and we're all going to be smothered by an avalanche of broken thready code in the next 10 years.


Anonymous said...

I think it must be that the caches talk to each other and they must invalidate pages in temporal order or something.

Old school, there's a shared bus, and they implement bus snooping. However, once you start having, say, huge quanties of cores without a single temporally-coherent bus, I have no clue how the magic happens.

I wonder if future processors will be more Alpha-like, or if everyone will agree that that's just too nuts.

(The thing about an implementation that's Alpha-like isn't just the suckiness if you don't fence; if the thing is being that crazily out-of-order in its cache handling, it seems like the fences are going to be really expensive since it's going to have to force the cache to come to a consistent state--even while other stuff is happening in adjacent cores.)

Brian said...

Having atomic data types does make some sense. It could provide some static checking that someone doesn't perform non-atomic operations in an unsafe way on the data.

A lot of researchers are looking at implementation transactional memory models to replace locking. The devil is of course getting the overhead low enough to work.

Brian said...

BTW...As far as future processors with very large numbers of cores, people are talking relaxing cache coherency guarantees. Maybe only some cores will be coherent with respect to others...

It may okay though... There are some processors with large numbers of cores today that use directory based cache coherency systems...

old rants