2/24/2009

02-24-09 - Low Level Threading - Part 3.0

Intro to Lock Free Programming

Disclaimer : don't do it. Okay now let's dive in.

"Lock Free" is what this general group of techniques is usually called, but it's a bit of a misnomer. "Lock Free" just means not using any blocking thread synthronization techniques like mutexes, however the classic lock free technique of try-discard-loop is basically a spin lock. Hence there is another group of techniques called "Wait Free" to indicate when not even that kind of wait is necessary. Another class recently coined is "Obstruction Free". And in fact we don't care if we sometimes take mutexes to resolve sticky spots (a lot of people now are wisely doing this - for example queues where you take a mutex when near empty).

A lot of people take "Lock Free" to mean avoiding the OS mutex because it's too slow. That's not really a great goal and you won't see a huge speedup if you do that. Good Lock Free programming is about redesigning your code flow to minimize contention for shared resources and to spend a minimum amount of time in contention. Obviously one form of contention is fighting over an OS mutex, but fighting over a cache line is contention too and hurts badly.

Our goal is just to do fast & correct thread synchronization. We don't want to cut any corners and introduce possible races. Any tiny speed win from cheating is just not worth it.

In lock free programming we are basically trying to share some data between threads in a way that we can understand sequentially so that we can code it right. In this post I'm going to talk about some simple "lock free" coding patterns and what's going on. In the next post we'll go through some of the simple lock free data structures.

First off let's look at the simple publication pattern and what's really going on. Thread 1 sets up some work and then "publishes" it. Thread 2 waits for the work then does it.


// shared variables :
void * shared_var1;
void * shared_var2;
bool shared_published = false;

//Thread 1:

.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
shared_published = true;

// Thread 2:

while( ! shared_published ) { }
x = shared_var1;
y = shared_var2;
DoWork(x,y);

Already your spidey senses should be tingling. Let's go through and look at the problems :


// shared variables :
// !! should pad a cache line here to keep these away from other data (false sharing)
// !! need to align to required atomic alignment
void * shared_var1;
void * shared_var2;
bool shared_published = false; // !! <- bool is not safe because it might be a char which is not atomic on many systems
// !! should pad a cache line here to keep these away from other data (false sharing)

//Thread 1:

// !! need to ensure all these assignments are atomic
.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !
shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set
// !! can never touch var1 or var2 again

// Thread 2:

while( ! shared_published ) { } // !! read must be acquire (*2)
x = shared_var1;
y = shared_var2;
DoWork(x,y);

Okay, so this simple code already has lots of issues. The exact details of those issues depend on your system, and we don't want to fill the client code with lots of details about the exact system. We want to write client code that is *algorithms* and then have adaptors so that our code is portable and efficient on various systems.

The first little issue mentioned is that the variable loads and stores should be atomic, which just means done all in one go. On the vast majority of systems, 4 and 8 byte types that are aligned to 4 or 8 bytes are atomic. Smaller types are often NOT atomic because the actual load is done by loading a word then masking down to a byte, so beware of bool! Similarly larger types are often not atomic, and 8-byte types that are not 8-byte aligned are not atomic! (on x86, anything that is wholly contained in one cache line is atomic - the memory system moves whole cache lines around, so an access that doesn't straddle two lines is done in one go - more details on hardware later). Anyway, we don't want our client code to know about this. Clearly what we want is some kind of Atomic< > template that does all this for us.

Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;

This Atomic< > just ensures the type is a valid size to be atomic on this platform and is aligned to the necessary alignment for this platform; it also overloads operator = to make sure the assignment is a single atomic op.
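
Just to make the idea concrete, here's a rough sketch of the shape I mean - this assumes MSVC-style __declspec(align) and uses the old negative-array-size trick as a compile-time assert; it's an illustration, not a finished implementation :

template < typename T >
struct Atomic
{
    // compile-time check that T is a size this platform can load & store atomically :
    typedef char size_must_be_4_or_8 [ ( sizeof(T) == 4 || sizeof(T) == 8 ) ? 1 : -1 ];

    // force alignment so that a plain load or store of m_value is one aligned op :
    __declspec(align(8)) T m_value;

    Atomic() : m_value() { }
    Atomic( T x ) : m_value(x) { }

    void operator = ( T x ) { m_value = x; }     // one aligned store
    operator T () const     { return m_value; }  // one aligned load
};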

Next we have this to deal with :


// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !

Remember the compiler optimizer will go nuts reordering things. Instructions don't necessarily actually happen in the order that you write code. The key thing is that *we want that*. The optimizer reordering things is what makes our code fast. We don't want to just turn off the optimizer. For example, we could make all our variables "volatile". I'm not going to talk much about the new magic VC2005+ volatile, but in the old meaning of volatile, all volatile memory ops are executed in program order. So if we make everything volatile we would enforce
shared_var1 <- x
shared_var2 <- y
shared_published <- true;
But really we would be happy to get
shared_var2 <- y
shared_var1 <- x
shared_published <- true;
Or
shared_var2 <- y
shared_var1 <- x
anything_else <- z
shared_published <- true;
We just care about var1 and var2 being done before "published". To make fast code we want the constraints to be as *precise* and as *weak* as possible for correct code.

The key point is at (*1) :


shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set

What we are doing here is "Release" memory semantics. When you do a Store-Release it is a "publishing" store - it ensures that no earlier writes can pass the Release. Obviously this Store-Release has to encapsulate constraints on various parts of the system : it must tell the optimizer not to reorder writes, it must tell the CPU that those writes must be done in order, and it must tell the cache controllers that those affected memory locations must be flushed in order.

Note : you could accomplish this same thing with a barrier :

shared_var1 <- x
shared_var2 <- y

# Write Flush Memory Barrier

shared_published <- true;
but you don't want to ! That's way way more expensive. BTW another way to write the "Release" is in the old style store-store blocking way :
shared_var1 <- x
shared_var2 <- y
#StoreStore
shared_published <- true;
Here #StoreStore means that stores above the # cannot move past stores below the #. In terms of how you specify the program memory constraints, you can either do it with these reorder blocks between the instructions, or with the modifiers *on* the loads and stores. C++0x has chosen to do the latter, so I'll go with that (and it maps more closely to certain hardware). Java uses the #StoreStore kind of method, as does the Linux Kernel with its smp_wmb()/smp_rmb() style barrier macros.
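
For example, the Linux kernel spelling of our publication would look something like this (a sketch using the real smp_wmb()/smp_rmb() macros; the variable names are mine) :

/* Thread 1 : */

shared_var1 = x;
shared_var2 = y;
smp_wmb();              /* #StoreStore : the data stores cannot pass the flag store */
shared_published = 1;

/* Thread 2 : */

while ( ! shared_published ) { }
smp_rmb();              /* #LoadLoad : the flag load must happen before the data loads */
x = shared_var1;
y = shared_var2;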

From now on I am going to write all the Lock Free code using Atomic< > types and special stores and loads. So our basic publication now looks like :


// shared variables :
char pad1[CACHE_LINE];
Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;
char pad2[CACHE_LINE];

//Thread 1:

.. set up x & y ..
StoreRelaxed( &shared_var1 , x );
StoreRelaxed( &shared_var2 , y );

StoreRelease( &shared_published , true );
// now don't touch var1 and var2

// Thread 2:

while( ! LoadAcquire(&shared_published) ) { } // (*2)
x = LoadRelaxed(&shared_var1);
y = LoadRelaxed(&shared_var2);
DoWork(x,y);
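
(Here StoreRelaxed/LoadRelaxed are just the plain aligned store & load through the Atomic< > wrapper, with no ordering constraint at all. The wrappers are implemented per-platform; their declarations are roughly this shape, with names of my own choosing :)

template < typename T > T    LoadRelaxed ( const Atomic<T> * ptr );  // plain atomic load, no ordering
template < typename T > void StoreRelaxed( Atomic<T> * ptr, T x );   // plain atomic store, no ordering
template < typename T > T    LoadAcquire ( const Atomic<T> * ptr );  // atomic load with acquire ordering (see below)
template < typename T > void StoreRelease( Atomic<T> * ptr, T x );   // atomic store with release ordering, as at (*1)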

We haven't talked about the Acquire at (*2), but it's basically just the complement to a StoreRelease. A Load-Acquire ensures that no later loads can move up before the Acquire. In terms of the Java-style barriers these translate to :


x = LoadAcquire(ptr) :
    x = *ptr; // atomically
    #LoadLoad

StoreRelease(ptr,x) :
    #StoreStore
    *ptr = x; // atomically

Note the way they put the barrier on opposite sides of the memory op. (Also recall this is NOT an actual memory "barrier" instruction - it is a blockage for the memory ops, which may or may not actually require a memory barrier on your platform).

Now, obviously Acquire and Release are almost always what you want :


localVar = LoadAcquire( &shared_var );
// acquire prevents any loads inside this block from moving up

x = localVar;
y = z;
/// etc. ...

// Release prevents any stores in here from moving down
StoreRelease(&shared_var,localVar);

So for example, entering a Mutex should obviously be an Acquire, and releasing the Mutex should be Release - that way work you do inside a mutex is done while you own the mutex (or it looks like it was done while you own the mutex). Everyone who uses critsecs is secretly using Acquire/Release to wrap their critsec work.
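
For instance with a Win32 critsec the Acquire/Release comes along for free with Enter/Leave. A little sketch of what I mean (assuming the critsec was initialized somewhere at startup) :

#include <windows.h>

CRITICAL_SECTION g_cs;       // assume InitializeCriticalSection(&g_cs) was called at startup
int g_protected_count = 0;   // only ever touched while holding g_cs

void IncrementProtected()
{
    EnterCriticalSection( &g_cs );   // acts as an Acquire : the work below cannot move up above this
    g_protected_count++;             // so this read-modify-write stays inside the critsec
    LeaveCriticalSection( &g_cs );   // acts as a Release : the work above cannot move down below this
}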

I should note that there is another type of load barrier called "consume" or smp_read_barrier_depends() in Linux ; this is only necessary on processors that don't automatically order dependent reads. The only known processor that doesn't do that is the Alpha, so I'm not going to deal with that issue. If you care about full correctness you should also go through and mark up the loads that feed dependent reads with these barriers. It's really a strange thing, it means :


object = g_shared; (*1)
# consume memory barrier
int x = object->m_var; (*2)

On Alpha, "object->m_var" can be read "before" g_shared is read. WTF !? Obviously it can't actually be read before object is read, because you need the address of object from (*1) before you can even execute (*2)- but it can read older memory at (*2) than it did at (*1). That is, on the Alpha cache lines are not temporally ordered. So say some other processor has changed the address of object and also the data inside object. You read at (*1) and for some reason you get a fresh new cache line there and get the new value of object. When you read at (*2) you use the new value of object of course, but the cache line it points to is old and it hasn't refreshed, so you read the old value. Thus (*2) acts as if it happened earlier in time than (*1).

Like I said, all existing significant chips do not allow this, so we're not going to worry about it, but you can see how all kinds of different parts of the hardware can cause operations to act like they are going out of order. The goal of our Acquire/Release type barriers is to make instructions act as if they are going in order whether they actually do or not. We are trying to specify the memory constraints semantically in our code, and we want the code generator to make the right stuff for the architecture. In C# or Java this is easy to do because they have memory model semantics, in C++ it's messy, you have to write your own. C++0x will fix this but sadly that's a ways out for most of us.
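
For reference, the C++0x version of the publication pattern will look something like this (a sketch against the draft std::atomic interface as I understand it) :

#include <atomic>

void DoWork( void * x, void * y );

std::atomic<void *> shared_var1(0);
std::atomic<void *> shared_var2(0);
std::atomic<int>    shared_published(0);

// Thread 1 :
void Publish( void * x, void * y )
{
    shared_var1.store( x, std::memory_order_relaxed );
    shared_var2.store( y, std::memory_order_relaxed );
    shared_published.store( 1, std::memory_order_release );   // the publishing store
}

// Thread 2 :
void Consume()
{
    while ( ! shared_published.load( std::memory_order_acquire ) ) { }  // spin until published
    void * x = shared_var1.load( std::memory_order_relaxed );
    void * y = shared_var2.load( std::memory_order_relaxed );
    DoWork( x, y );
}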

Now I'll give you a bit of a punchline. On x86 LoadAcquire and StoreRelease are both just "mov". That is, they are 100% free. All (atomic) loads on x86 are acquire and all stores are release (ignoring SSE). It is still important to mark up your code semantics with acquire and release for portability, and also so that the compiler actually generates a single mov and keeps your other operations inside. For example on MSVC x86 you can implement them like this :

 
x = LoadAcquire(ptr) :
    mov x , *ptr;
    ReadBarrier(); // (*1)

StoreRelease(ptr,x) :
    WriteBarrier(); // (*2)
    mov *ptr , x;

Note that the "Barriers" at *1 and *2 are not actual memory barriers, they generate ZERO instructions - those are compiler barriers that stop the compiler from moving stuff around. That is an implementation detail and your client code shouldn't have to know about.

This should clarify the win from using memory model semantics. We asked for what we needed - Acquire and Release - and we got them *for free*. Consider instead if we had not asked for them and had used a true memory fence instead - we would be adding 100+ clocks of expense for no reason.
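
To be concrete, on MSVC/x86 the implementation can be as simple as this (a sketch of a plain void * version, using the _ReadWriteBarrier compiler intrinsic - it emits zero instructions, it just stops the compiler from moving memory ops across it) :

#include <intrin.h>

inline void * LoadAcquire( void * const volatile * ptr )
{
    void * x = *ptr;        // one aligned mov ; x86 loads already have acquire semantics
    _ReadWriteBarrier();    // compiler-only fence : keep later memory ops from moving up above the load
    return x;
}

inline void StoreRelease( void * volatile * ptr, void * x )
{
    _ReadWriteBarrier();    // compiler-only fence : keep earlier memory ops from moving down below the store
    *ptr = x;               // one aligned mov ; x86 stores already have release semantics
}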

On Itanium LoadAcquire needs to actually generate a "ld.acq" which is a bit more expensive than a relaxed load, but not nearly as expensive as a memory fence. On PowerPC an implementation might be :


Load Acquire     ld; cmp; bc; isync
Store Release     lwsync; st

While I'm on this topic I should mention that the x86 has this super magic where loads can actually be executed out of order, but they are retired *as if* they went in order (the actual memory accesses may not happen in order). Because of this a lot of people incorrectly say "the x86 can reorder loads". Well, yes, and no. From a multi-threaded programmer's point of view the loads are not semantically reordered, however in terms of how the processor actually executes them, yes, they do go through the reorder buffer and everything. In order to make this work, the x86 receives notifications whenever a cache line changes, and it invalidates loads that were speculatively done in the wrong order and causes them to be redone in the right order. The x86 reorder buffer can also reorder loads against stores when they do not overlap. Again this usually does not affect semantic program flow (note that LoadAcquire and StoreRelease don't say anything about the #StoreLoad barrier).
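
To make that last #StoreLoad point concrete : if you store your own flag and then need to load somebody else's flag (the classic Dekker-style pattern), Acquire/Release is not enough even on x86 - you need a real full fence there, e.g. MemoryBarrier() on Win32 or a locked op. A little sketch (the function name is mine) :

#include <windows.h>

volatile LONG g_flag_A = 0;
volatile LONG g_flag_B = 0;

// Thread A does this ; Thread B does the mirror image with A & B swapped :
bool SignalAndCheckOther()
{
    g_flag_A = 1;             // store my flag
    MemoryBarrier();          // real full fence (#StoreLoad) : my store must be visible before I load theirs
    return g_flag_B != 0;     // load the other guy's flag
}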

Okay this has gotten long enough so we'll continue in the next part.

4 comments:

Anonymous said...

The try-discard loop is only like a spinlock if you're using it in the same way.

The important point about a try-discard loop is that if you succeed, then you know that you've made some progress, and just as important, if you fail, you know that *someone else has made some progress* (otherwise the global ptr/whatever that you're updating wouldn't have changed -- unless your algo is broken of course). Therefore, whatever happens, *someone* makes progress, and therefore it's a lock free algo.

If you use a spin-lock to do that, then you've achieved pretty much the same thing. Of course there's another little problem though because if you use any kind of lock to do it (spinlock or not), then you run the risk of things like the OS pre-empting your thread just after it's acquired the lock, and ending up blocking *everyone* because now you're not doing anything and no one else can get the lock. If you're updating with a CAS then that obviously can't happen -- you can't be pre-empted *during* the CAS, only before or after it, and in either case everyone else gets to carry on and update stuff while you're blocked.

cbloom said...

Yeah, funny I wrote something very similar for part 4.0

BTW this makes me wonder if when you do use a traditional mutex/critsec - should you bump your thread priority to infinite when you hold it? It seems like in well designed code that holds a critsec for a very short period, getting thread-switched inside the critsec is a pretty huge disaster.

Anonymous said...

Interesting idea. I'd kind of assume changing thread priority will have bad overheads (at least bad for this type of use) due to the scheduler messing with its book-keeping data, but maybe not. Do you have any info/guesses about that?

cbloom said...

My tests indicate that SetThreadPriority is actually super cheap - it appears to just set the value in your thread info block. The scheduler doesn't actually see that you've changed until it gets invoked by something (a Wait or a timer interrupt).

I can't say for sure though. And it is by no means a cheap function call because it does jump into kernel mode, so it would add a lot of overhead to critsecs.
