
01-25-09 - Low Level Threading Junk Part 2

Okay, we're starting to learn some primitive threading junk. So let's try to use it to do something useful.

Non-trivial statics in C++ are hideously non-thread-safe. That should be extremely obvious but if you don't see it, read here : The Old New Thing C++ scoped static initialization is not thread-safe, on purpose! . Obviously plain C static initialization of integral types is fine (BTW floats are NOT fine - beware using const floats in C++). Beware all things like this :


Object & Instance() { static Object s; return s; } // not thread safe : two threads can race on the hidden "is s constructed yet?" flag the compiler generates

void func(int i)
{
    static Object once; // same problem with any non-trivial function-local static
    once.DoStuff(i);
}

Not thread safe. For the most part, we should just not use them. It's very nice to do minimal work in CINIT. Much brouhaha has been made about the "thread safe singleton" (see all these) :

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Double-checked locking - Wikipedia, the free encyclopedia

For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it work during CINIT with complex initialization order dependencies. Now, we know CINIT is run entirely on one thread before any other threads are made. (you are fucking nuts if you're making threads during CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in CINIT :


static Object * s_instance = NULL; // okay cuz its a basic type

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_instance = new Object;
    }
    return s_instance;
}

static volatile Object * s_forceUse = GetSingleton();

The forceUse thing just makes sure our GetSingleton is called during CINIT, so that we can be sure the instance is made by the time any threads come around. The GetSingleton() that we have here is NOT thread safe, but it doesn't need to be ! (btw I assume that "CreateThread" acts as a MemoryBarrier (??))

Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we have to worry about thread-safing the construction.

The easiest case is when you know your Singleton won't get used during CINIT. In that case you can just use a Critical Section :


static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * s_instance = NULL;

// broken version :

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_crit.Lock();

        s_instance = new Object;
    
        s_crit.Unlock();
    }
    return s_instance;
}

Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by moving the critsec out of the if). Even aside from that, the line that assigns s_instance is all wrong, because the pointer to s_instance is not necessarily being written atomically, and it might be written before the stuff inside Object is written. What did we learn in Part 1?
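(Aside : here's a sketch of that simpler fix - move the lock out of the if so it covers both the test and the construction. Correct, but you pay for the critsec on every call, which is exactly what the double-check below is trying to avoid :)

// simple fixed version : lock around everything

Object * GetSingleton()
{
    s_crit.Lock();

    if ( ! s_instance )
    {
        s_instance = new Object;
    }

    Object * ret = s_instance;

    s_crit.Unlock();

    return ret;
}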

// good version :

static CCriticalSection s_crit;     // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance ) // must be memory_order_consume
    {
        s_crit.Lock();

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        s_crit.Unlock();
    }
    return s_instance;
}

This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again to make sure it's still null, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also using Interlocked ensures the pointer is written atomically.
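(BTW "InterlockedExchangeRelease" is a wrapper, not a stock Win32 export. A minimal sketch of what it could be, using the full-barrier InterlockedExchangePointer - stronger than a pure release, but safe :)

inline void InterlockedExchangeRelease( Object * volatile * pDest , Object * pNew )
{
    // full barrier : all the writes that constructed *pNew are visible
    // before the pointer is, and the pointer store itself is atomic
    InterlockedExchangePointer( (PVOID volatile *) pDest , (PVOID) pNew );
}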

ADDENDUM : I need to add some more notes about this. See comments for now.

Okay, that's all good, but it doesn't work if you ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :


class Object { } ;
static CRITICAL_SECTION s_crit = { 0 };
AT_STARTUP( InitializeCriticalSection(&s_crit); );
static Object * s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        if ( s_crit.DebugInfo == 0 ) // DebugInfo will always be nonzero for an initialized critsec
        {
            // must be in CINIT !
            s_instance = new Object;
        }
        else
        {
            EnterCriticalSection(&s_crit);

            if ( ! s_instance )
            {
                Object * pNew = new Object;
        
                InterlockedExchangeRelease( &s_instance , pNew );
            }   
    
            LeaveCriticalSection(&s_crit);
        }
    }
    return s_instance;
}

This actually works pretty dandy. Note that in CINIT you might take either the upper or the lower path - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine ; it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.
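(AT_STARTUP is a little utility macro, not spelled out here. One way to build such a thing - a sketch, using the usual trick of hiding the statement in the constructor of a file-scope static object so it runs during CINIT :)

#define AT_STARTUP_JOIN2(a,b)   a##b
#define AT_STARTUP_JOIN(a,b)    AT_STARTUP_JOIN2(a,b)

// __LINE__ pasting keeps the generated names unique per use :
#define AT_STARTUP( some_code ) \
    namespace { static struct AT_STARTUP_JOIN(StartupCaller_,__LINE__) \
    { AT_STARTUP_JOIN(StartupCaller_,__LINE__)() { some_code } } \
    AT_STARTUP_JOIN(s_startupCaller_,__LINE__); }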

Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized so that we don't have to worry about CINIT mumbo jumbo ?

Well, yes it would, and it's pretty trivial to make a standard simple Spin Lock thing :


typedef LONG SpinMutex; // initialize me with = 0

void SpinLock(SpinMutex volatile * mut)
{
    // 0 is unlocked, 1 is locked
    COMPILER_ASSERT( sizeof(SpinMutex) == sizeof(LONG) );
    int spins = 0;
    while( InterlockedCompareExchange((LONG *)mut,1,0) == 1 ) // Acquire
    {
        ++spins;
        if ( spins > CB_SPIN_COUNT )
        {
            Sleep(0);
        }
    }
    // *mut should now be 1 
}
    
void SpinUnlock(SpinMutex volatile * mut)
{
    // *mut should now be 1
    InterlockedDecrement((LONG *)mut); // Release
    // *mut should now be 0
    // if there was a higher priority thread stalled on us, wake it up !    
}

Note that SpinLock will deadlock you if you try to recurse. Okay, now because this thing is statically initializable we can use it to make a Singleton and be safe for CINIT :

static SpinMutex s_mutex = 0;
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        SpinLock(&s_mutex);

        if ( ! s_instance )
        {
            Object * pNew = new Object;
        
            InterlockedExchangeRelease( &s_instance , pNew );
        }   
    
        SpinUnlock(&s_mutex);
    }
    return s_instance;
}

And that works too.

Now, people get tempted to use these simple SpinLocks all over. *DON'T DO IT*. It looks like it should be faster than CRITICAL_SECTION but in fact it's way way slower in bad cases.

First of all, what is a CRITICAL_SECTION ? See : Matt Pietrek - Critical Sections Under Windows .

It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different execution patterns, the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.

Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for dealing with events and threads in alertable waits - it's the best way to do thread sleeping and wake-ups generally. For example, it has mojo to deal with the bad cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one (this prevents a few runaway threads from killing the system).
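(To make that concrete : a grossly simplified sketch of the spin-then-event shape. The real CRITICAL_SECTION also spins first, tracks the owning thread, handles recursion, and creates the Event lazily on first contention - none of which this toy does :)

struct EventLock
{
    LONG    count;   // 0 = free ; >= 1 = held, extra counts are waiters
    HANDLE  hEvent;  // auto-reset kernel event : CreateEvent(NULL,FALSE,FALSE,NULL)
};

void EventLockEnter( EventLock * lock )
{
    // uncontended case : one interlocked op and we own it
    if ( InterlockedIncrement( &lock->count ) == 1 )
        return;

    // contended case : sleep in the kernel until the owner hands off to us
    WaitForSingleObject( lock->hEvent , INFINITE );
}

void EventLockLeave( EventLock * lock )
{
    // if anyone is waiting, wake exactly one of them
    if ( InterlockedDecrement( &lock->count ) > 0 )
        SetEvent( lock->hEvent );
}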

One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner. That's because of multicore - more about this in a moment.

About performance : (from one of the MSDN articles) :


    * MemoryBarrier was measured as taking 20-90 cycles.
    * InterlockedIncrement was measured as taking 36-90 cycles.
    * Acquiring or releasing a critical section was measured as taking 40-100 cycles.

which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :

    Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
        (so how could you beat it by using InterlockedIncrement instead?)

    When there is contention and the Crit Sec does something more serious (use an Event) it's slow
        but that is exactly what you want !!

In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).
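(e.g. to bump a shared counter : guarding the increment with a critsec costs an interlocked op for Enter and another for Leave ; the lock-free version is just the one op itself :)

static LONG s_count = 0;

void AddOne()
{
    InterlockedIncrement( &s_count );  // the whole "critical section"
}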

An interesting alternative is this QLock thing by Vladimir Kliatchko : Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007 (QLock) ; it's quite a bit like the MS critsec, in that he uses a simple interlocked op to check for contention, and if it's not free then he makes an Event and goes into a wait state and all that. The code is there on Dr. Dobbs but you have to hit the "Download" in the top bar and then navigate around to find it. It's here in MUTEX.LST .


Ok, now a bit more about InitializeCriticalSectionAndSpinCount. The reason you want a Spin is because basically every new machine these days is "multiprocessor" (multicore). That's a big difference from threading on a single core.

If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time. With multiple cores memory can be touched by some other core that is running along at full speed.

Often this doesn't affect the way you write threading code, but it does affect performance issues. It means you can have contention on a way finer scale. In a normal OS single proc environment, you get large time slices so other threads aren't frequently randomly poking at you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times ; 1000 spins is on the order of a microsecond.

What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock. Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).
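(Setting that up is trivial ; e.g. with a spin count in the range suggested above :)

CRITICAL_SECTION cs;
InitializeCriticalSectionAndSpinCount( &cs , 4000 );  // spin ~4000 times before falling back to the kernel wait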

7 comments:

Anonymous said...

This is a "double-checked lock" that works.

Uh, no it doesn't. At least not according to the java guys, for the java memory model (since C++ doesn't have a memory model)...

http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html

"However, even with a full memory barrier being performed by the thread that initializes the helper object, it still doesn't work.

The problem is that on some systems, the thread which sees a non-null value for the helper field also needs to perform memory barriers."

In other words, you memory barrier to make sure the writes go out in the right order on thread A, but there's no guarantee that thread B still sees them in the same order unless it does a read memory barrier as well.

So say the java guys; I can't say that I have any actual clue.

MaciejS said...

When implementing your own spinning locks it's probably better to introduce exponential backoff instead of just sleeping for a predefined duration. Not that it'll matter much, as spinning locks should only be taken for _very_ quick operations.
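(A sketch of that exponential-backoff idea, using the SpinMutex from the post - an illustration, not code from the comment :)

void SpinLockBackoff( SpinMutex volatile * mut )
{
    int backoff = 1;
    while ( InterlockedCompareExchange( (LONG *)mut , 1 , 0 ) != 0 )
    {
        for ( int i = 0 ; i < backoff ; i++ )
            YieldProcessor();   // pause instruction ; easy on the other hyperthread

        if ( backoff < 4096 )
            backoff *= 2;       // back off exponentially before retrying
        else
            Sleep(0);           // badly contended : give up the time slice
    }
}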

cbloom said...

"In other words, you memory barrier to make sure the writes go out in the right order on thread A, but there's no guarantee that thread B still sees them in the same order unless it does a read memory barrier as well."

I'm pretty sure that MemoryBarrier() does exactly that - it guarantees that other cores see the same barrier.

Anyway, I think you're right that the check of "s_instance" for NULL should probably be an Acquire.

cbloom said...

Okay Sean here's my new understanding :

1. It almost always does work as-is because "s_instance" is a pointer, and any access to it will be dependent, and every platform in existence guarantees ordering of dependent reads.

2. Even barring that fact it would still work on x86 because every read on x86 is read-acquire.

In terms of Linux-speak , it should be protected by smp_read_barrier_depends() , which is a nop on every platform but Alpha.

See :

http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212

I'm gonna try to make sure I have this right then I'll revise the original.

Brian said...

Note that Hans Boehm has a paper in PLDI that examines a possible memory model for C++. It might be interesting to take a look at that.

Anonymous said...

I agree it works on x86, I'm not sure the theory 'every platform in existence guarantees ordering of dependent reads' is actually true.

If I create an object by filling out its "m_x" field through a pointer (temp->m_x = 1), and then write a pointer to it in a global variable 'foo', I generate two instructions:

move mem[base_reg + offset(m_x)],1
move mem[foo],base_reg

and (in this scenario) there will be a memory write barrier between them.

On the reading side, it just does the same thing in the opposite direction:

move base_reg, mem[foo]
move xvalreg, mem[base_reg + offset(m_x)]

but without the memory barrier between them.

Now, I agree, you can't reorder those two reads as generated by the compiler--it needs the value from mem[foo] to even generate the address to fetch the xvalreg from, so even the CPU can't reorder it.

But when we're talking about cache coherency across multiple cores and multiple processors, the fact that it has to *generate* and "compute" the addresses needed in order--that it can't compute the 2nd address until the value at the first address has been loaded, isn't a guarantee (as I understand it).

Say the first address is 0x20000, and the second address (computed by loading the first address) is 0x30000.

We're guaranteed that the CPU is going to pull the value out of the cache for 0x20000 before it pulls the value out of the cache for 0x30000--it can't reorder those reads as they come into the cache.

But from the POV of the cache, here's what happens over time:

- read request of 0x20000
- read result for 0x20000 sent to CPU
- read request of 0x30000
- read result for 0x30000 sent to CPU

The question is, are those two read results required to be enforced against a global ordering if the CPU doesn't issue a read memory barrier. It doesn't matter that the results have to be returned in that order, and that one address was computed from the other; all that matters is, given that pair of requests, what can happen -- how coherent a view of the imaginary consistent global memory store the cache has to provide. If it's allowed to use stale data for 0x30000, then the double-checked locking can still break.

So, I don't know. My vague guess at this is that, in fact, without the read barrier--with only the write barrier--there's no actual guarantee here. Obviously it works on x86, so I'm not sure how exactly you would have a system that this doesn't work on, but this is apparently e.g. all that the java memory model promises.

cbloom said...

Everything I've found says that "dependent reads" are guaranteed order on every modern CPU (see the many links provided).

Now, you are right that it's a little mysterious exactly how that would happen in the circuitry; I've been thinking about the exact same issue that you outline.

I haven't yet found any real clear description of what exactly is going on, but I think the key thing is :

on every current architecture (eg. not Alpha) the ordering of cache transfers is made to appear to happen in the same order as the reads or writes are executed by the CPU.

If you look at the use of smp_read_barrier_depends in the Linux kernel it's for exactly this case. They do stuff like :

ptr = shared_var;
smp_read_barrier_depends();
i = ptr->val;

and as I said before smp_read_barrier_depends is a nop on everything but alpha.

If you find any info about what's actually going on here I'd be interested to hear it. Obviously on x86 there's just a total order which is kind of mad, but on Power there's not a total order and they must do some kind of wait from the cpu to the cache to ensure order here.
