
01-25-09 - Low Level Threading Junk Part 1

Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on with the memory, as opposed to just using the language/OS constructs for safe sharing.

Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be done in 99% of cases. Just use CriticalSection, and once in a while use Interlocked, and don't worry about the tiny inefficiency; if you do things right they won't be a speed hit. Trying to get things right in lots of low-level threading code is a recipe for huge bugs.
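
For example, the boring-but-safe version of everything in this post is just a critsec around the shared data. A minimal sketch (MyStruct is the example struct we'll use below; the function names are made up) :

    #include <windows.h>

    struct MyStruct; // some work item

    static CRITICAL_SECTION s_mtx;   // call InitializeCriticalSection(&s_mtx) once at startup
    static MyStruct * g_work = NULL; // shared data, only touched inside the critsec

    void PublishWork( MyStruct * p )
    {
        EnterCriticalSection(&s_mtx);
        g_work = p;
        LeaveCriticalSection(&s_mtx);
    }

    MyStruct * TakeWork()
    {
        EnterCriticalSection(&s_mtx);
        MyStruct * p = g_work;
        g_work = NULL;
        LeaveCriticalSection(&s_mtx);
        return p;
    }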

I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction : Concurrency : What Every Dev Must Know About Multithreaded Apps. I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly what's going on with that.

First of all, let me try to list the various issues we're dealing with :

1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but it's what most people think of as threading.

2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delayed communication channel.

3. CPU OOO (out-of-order) instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even if they do execute in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write gather buffers, etc.)

4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like "inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap. (There's a tiny sketch of this right after the list.)

5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.

6. The compiler/optimizer caching values or eliminating ops.
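
To make #4 concrete, here's a tiny sketch (made-up names) of the classic lost-increment race, and the Interlocked fix :

    #include <windows.h>

    LONG g_count = 0;

    // "g_count++" compiles to load, add, store ; another thread can
    // interleave between those and an increment gets lost :
    void AddOneRacy()   { g_count++; }

    // one atomic read-modify-write ; no gap to sneak into :
    void AddOneAtomic() { InterlockedIncrement( &g_count ); }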

I think that's the meat of it. One thing that mucks this all up is that x86 and MSVC >= 2005 are special cases which are much simpler than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+, which means they do lazy/incorrect things that happen to work in that case. Also, I'm going to be talking about C++, but there are actually much better memory model controls now in Java and C#. I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as examples.

Almost every single page I've read on this stuff gets it wrong. Even the experts. I always see stuff like this, where they implement a fancy lock-free doohickey, and then come back later and admit that oh, it doesn't actually work. For example this Julian Bucknall guy has a lot of long articles about lock free stuff, and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from a place like "codeproject.com" or anybody's random blog.

I've read a lot of stuff like :

Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.

To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do*, which means the language natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero guarantees of anything. That means you have to do everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C, but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution model" because it includes things like the concept of what's atomic and what can be reordered).
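
For example, a rough sketch of what manually controlled double checked locking might look like on Windows (Widget/s_instance/s_mtx are made-up names, <windows.h> assumed; the read side here is simplified, and on non-x86 chips it would also need acquire semantics, which is exactly the kind of thing the rest of this post is about) :

    static Widget * volatile s_instance = NULL;
    static CRITICAL_SECTION  s_mtx; // assume initialized elsewhere

    Widget * GetInstance()
    {
        Widget * p = s_instance;      // unsynchronized fast-path read
        if ( p == NULL )
        {
            EnterCriticalSection(&s_mtx);
            p = s_instance;           // re-check under the lock
            if ( p == NULL )
            {
                Widget * temp = new Widget;
                MemoryBarrier();      // make sure the Widget is fully written out...
                s_instance = temp;    // ...before other threads can see the pointer
                p = temp;
            }
            LeaveCriticalSection(&s_mtx);
        }
        return p;
    }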

Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical sections now, what kind of threading stuff you can do without critsecs, etc.

Something I am NOT going to talk about is the exact details of the memory model on x86/Windows, because I don't think you should be writing code for a specific processor memory model. Try to write code that will always work. x86/Windows has strong constraints (stores are not reordered past other stores, etc., but forget you know that and don't rely on it).


Let's look at a simple example and see if we can get it right.

Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :


// global variables :
MyStruct * g_work = NULL;


Thread A :

int main(int argc, char ** argv)
{
    g_work = new MyStruct( argc, argv );

    ....
}


void ThreadB()
{

    while( g_work == NULL )
    {
        // spin, waiting for ThreadA to publish the work
    }

    g_work->DoStuff();
}

Okay, now let's go through it and fix it.

First of all, this line should give you the heebie-jeebies :

    g_work = new MyStruct( argc, argv );

It's doing a bunch of work and assigning to a shared variable. There are no guarantees about what order that gets written to memory, so g_work could be assigned before the struct is set up, and then ThreadB could start poking into it while I'm still constructing it. We want to release a fully constructed object to g_work, one that we're all done with. We can start trying to fix it by doing :
1.  MyStruct * temp = new MyStruct( argc, argv );
2.  g_work = temp;

That's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the writes to memory done by line 1 are actually finished before line 2 executes.

In Windows we do that by calling MemoryBarrier() :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp;

MemoryBarrier is an intrinsic in MSVC; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this also actually does two things : 1A. flushes caches and write gather buffers, and 1B. puts a fence in the reorder buffer so the processor can't speculate ahead). 2. The MemoryBarrier intrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work from before the MemoryBarrier to after it.

MemoryBarrier is a full memory fence; it creates an ordering point. In general, if you just write memory operations :

    A
    B
    C
you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :
    A
    B
    MemoryBarrier
    C

You get an order constraint : C is always after {A,B}, so the observed order might be ABC or BAC - but never, say, CAB.

Another digression about the compiler optimizer fence : in Windows you can also control just the compiler optimization with _ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory fence for the processor; it's just a command to the optimizer not to move memory reads or writes across a specific line. I haven't seen a case where I would actually want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm, the compiler automatically treats that as a ReadWriteBarrier (??).
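
For what it's worth, if you were rolling a fence by hand, the belt-and-suspenders version would pair the two explicitly. A sketch only (32-bit x86 MSVC inline asm; mfence needs SSE2) :

    _ReadWriteBarrier();   // compiler : don't move loads or stores across this line
    __asm { mfence }       // CPU : full load/store fence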

Alright, so we're getting close : we've made a work struct and forced it to flush out before it becomes visible to the other thread :

    MyStruct * temp = new MyStruct( argc, argv );
    MemoryBarrier();
    g_work = temp; // <-

What about this last line? It looks innocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32 bit pointer on 32-bit x86, or a 64 bit pointer on 64-bit x86 (and it has to be aligned - more on that below). Also, since g_work is just a variable, it could get optimized out or deferred or just stored in local cache and not flushed out to the bus, etc.

One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!). Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.
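
The classic symptom here is the spin loop in ThreadB : without something like volatile, the optimizer is allowed to read g_work once and cache it in a register. A sketch of what it may effectively produce :

    // what you wrote :
    while( g_work == NULL ) { }

    // what the optimizer is free to turn it into if g_work isn't volatile :
    MyStruct * reg = g_work;   // read memory once
    while( reg == NULL ) { }   // never touches memory again - spins forever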

One thing to be careful with about volatile is where you put it on a pointer. Remember to read pointer declarations from right to left in C :

    volatile char * var;

    // this is a non-volatile pointer to a volatile char !! probably not what you meant

    char * volatile var;

    // this is a volatile pointer to chars - probably what you meant
    //  (though of course you have to make sure the char memory accesses are synchronized right)

Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable in its meaning, and instead you should manually specify the operations that need to be sync'ed :

On Windows there's a clean way to do this :


    MyStruct * temp = new MyStruct( argc, argv );
    InterlockedExchangePointer( (PVOID volatile *)&g_work, temp ); // cast needed ; it takes a PVOID volatile *

The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op. The Interlocked functions also automatically generate memory barriers and optimizer barriers (!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer forces all the writes that construct MyStruct to finish before g_work is set.
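
Putting the pieces together, a sketch of the fixed version of our example (still Windows-specific; the reader uses a full MemoryBarrier, which is stronger than the acquire we actually need - more on that in a second) :

    MyStruct * volatile g_work = NULL;

    void ThreadA( int argc, char ** argv )
    {
        MyStruct * temp = new MyStruct( argc, argv );
        // atomic publish ; on Windows this also makes the writes that
        // filled out *temp visible before g_work is set :
        InterlockedExchangePointer( (PVOID volatile *)&g_work, temp );
    }

    void ThreadB()
    {
        while( g_work == NULL )
        {
            // spin, waiting for the publish
        }
        MemoryBarrier(); // don't let the reads of *g_work move up above the test
        g_work->DoStuff();
    }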

Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).

So if you have :

    A
    B
    C - Release
    D

The actual order could be {A B C D} or {B A C D} or {A B D C}, but never {A C B D}. Obviously Acquire and Release are slightly weaker constraints than a full barrier, so they give the processor more room to wiggle, which is good for performance.

So let's do another example of this simple thread passing (stolen from comp.programming.threads) :


int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads

  EnterCriticalSection(&mtx);
  if (! valid) {
    a = something;
    b = something;
    InterlockedExchangeAddRelease(&valid, 1);
    // writes to memory 'a' and 'b' will be done before valid is set to 1
  }
  do_other_stuff();
  LeaveCriticalSection(&mtx);

}

int function2 () { // called from many threads

  if (InterlockedExchangeAddAcquire(&valid, 0)) {
    return a + b;
  } else {
    return 0;
  }

} 

int function3 () { // broken

  if ( valid ) {
    return a + b;
  } else {
    return 0;
  }

} 

Now it's easy to fall into the trap of thinking that because we did the "Release", function3 is okay. I mean, by the time function3 sees valid get set to 1, 'a' and 'b' will already be set, so function3 is right, okay? Well, sort of. That would be true *if* function3 were in assembly, so the compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we relied on the x86 constraint of read ordering). In reality, the actual execution of function3 could go something like :


fetch a
fetch b
add b to a
test valid
conditionally set a to 0
return a

which is now reading a and b before 'valid'. The Acquire in function2 stops this.
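
(As an aside : on VC >= 2005 on x86 you could also fix function3 just by declaring valid as volatile, because there MSVC gives volatile reads acquire semantics - but that's exactly the compiler-specific magic I'm telling you not to rely on, and it's not true on Xenon. A sketch, leaning on that MS-specific behavior :)

    volatile LONG valid = 0; // changed declaration ; MSVC >= 2005 on x86 makes reads of this acquire

    int function3_msvc () { // okay ONLY because of the MS-specific volatile magic

      if ( valid ) {
        return a + b;
      } else {
        return 0;
      }

    }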

Some good links on this basic memory barrier stuff : (read Kang Su in particular)

Kang Su's Blog : volatile, acquire/release, memory fences, and VC2005
Memory barrier - Wikipedia, the free encyclopedia
Acquire and Release Semantics
The Old New Thing : Acquire and release sound like bass fishing terms, but they also apply to memory models
Memory ordering, barriers, acquire and release semantics - niallryan.com
Memory barrier + volatile (C++) - comp.programming.threads - Google Groups

BTW note that volatile does some magic goodness in VC >= 2005, but *not* on Xenon even with that compiler version. In summary :


_ReadWriteBarrier/_ReadBarrier/_WriteBarrier - just prevents compiler reordering

MemoryBarrier() - CPU memory load/store barrier

Interlocked & volatile
    Interlocked functions on Windows automatically generate a compiler & a CPU barrier
    Interlocked on Xbox 360 does not generate a CPU barrier - you need to do it yourself (see the sketch after this summary)

special volatile thing in MSVC ?
    volatile automatically does the right thing on VC >= 2005
    not for XBox though
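
So on Xenon you have to add the fence yourself when you publish. Per the Xbox 360 lockless programming docs, it would look something like this (a sketch; __lwsync is the Xbox 360 compiler intrinsic for the PPC lwsync instruction) :

    MyStruct * temp = new MyStruct( argc, argv );
    __lwsync(); // export barrier : the writes to *temp become visible first
    InterlockedExchangePointer( (PVOID volatile *)&g_work, temp );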

ADDENDUM : lol, the new Dr. Dobb's has a good article by Herb Sutter called "volatile vs. volatile". It covers a lot of this same territory.

And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")

Bartosz Milewski - Publication Safety on Multicore
Hans Boehm - Why atomics have integrated ordering constraints
Bartosz Milewski - C Atomics and Memory Ordering

5 comments:

Anonymous said...

again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that

I don't think this is true; when ANSI introduced volatile they also introduced this whole "sequence point" construction to allow them to specify control over ordering, and I can't remember it being used for anything except volatile. Also, volatile was (IIRC) introduced to allow communicating with hardware through memory, so obviously the order of operations is crucial.

So I thought the guarantee of volatile was that all volatile reads and writes have to happen in fixed order, and none can be suppressed or duplicated.

Of course, presumably this is only implemented in the compiler output stream, and it doesn't force the hardware to do anything -- it doesn't output barriers the hardware might need, etc (although I think technically one could argue that this is a bug, since then the implementation isn't actually doing what ANSI specified, and I don't think ANSI distinguishes the compiler from the final execution).

MaciejS said...

Cost of memory barrier varies from platform to platform, it actually is not that bad on X360 (quicker than most interlocked ops for example).

Tom Forsyth said...

I seem to recall that "volatile" is simply a compiler directive on ordering/enregistering. It does not cause memory barrier instructions to be emitted.

On x86, there are many strong guarantees on the ordering of memory accesses w.r.t. the ordering of the assembly you write (there's a big document somewhere detailing exactly what you can expect), so using volatile is 90% of what you need. On processors like Alpha or Itanium where the ordering of memory has very little to do with the order the assembly instructions are in, using volatile is not enough - you need explicit memory barriers.

cbloom said...

http://en.wikipedia.org/wiki/Memory_barrier

Wikipedia says "volatile" memory accesses cannot be reordered against each other by the compiler. But they can be reordered against nonvolatile memory access. And it gives no guarantees about the CPU or memory system reordering.

x86 is special because the x86 cpu will never reorder stores vs stores or reads vs reads. (note that this is only true for scalar code on Intel, SSE load/stores do not have these strict semantics)

Furthermore, MSVC volatile is special because after 2005 they automatically ensure volatile reads are Acquire and volatile writes are Release. That would be sweet if it was consistent, but even MSVC on Xenon doesn't do that.

But as I say I think the best thing is to explicitly indicate where barriers of various kinds are needed in the code rather than relying on the inconsistent magic of volatile.

C++0x will probably have some kind of "atomic" keyword which is what you want volatile to be.

Action Man (the greatest hero of them all) said...

Regarding MemoryBarrier vs _ReadWriteBarrier:

Apparently MemoryBarrier *may* prevent compile-time reordering, but is not guaranteed to.

If this is the case, and if _ReadWriteBarrier doesn't have a run-time cost, then I would recommend using both of these together when a full-fence is required.

There's lots of detail in "Lockless Programming Considerations for Xbox 360 and Microsoft Windows" (http://msdn.microsoft.com/en-us/library/ee418650%28VS.85%29.aspx)
