
02-24-09 - Low Level Threading - Part 4.0

Basic lock-free algorithms

We want to write some lock free algorithms to mutate possibly shared data. (Obviously the best possible lock free algorithm is to work on thread local data; hey, it's lock free! The next best lock free method is to work on shared read-only data; hey, you can read it without locking. So we assume you've done that as much as possible, and what remains is that the threads need to write something shared.) The general concept we will follow in all cases is to have a constraint on the data structure that defines its validity, and then to do operations that keep the data structure valid every step of the way.

Other threads may see the changes you are making at any point of your flow, so if you mutate the data in some way that breaks the data structure, and then fix it on the next line, they might see the broken state. So every single operation needs to atomically maintain the validity of the data structure.

So for example, if the data is something stupid like Data { int x; int must_be_negative_x; }, and you write code like :

Data.x = 7;
// (*)
Data.must_be_negative_x = -7;
At the (*) you can get swapped out and Data is left in an invalid state. (Some consumer is sitting around doing int zero = Data.x + Data.must_be_negative_x ;).
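
As it happens, this particular Data fits in 64 bits, so you can get the atomicity for free by treating the pair as one atomic unit. A minimal sketch in C++0x std::atomic terms (Writer/Reader are made-up names just for illustration) :

#include <atomic>
#include <assert.h>

struct Data { int x; int must_be_negative_x; };

// the whole 8-byte struct is one atomic unit :
std::atomic<Data> g_data( Data { 0, 0 } );

void Writer( int v )
{
    // both fields become visible together, never one without the other
    Data d = { v, -v };
    g_data.store( d );
}

void Reader()
{
    // a snapshot of both fields at once
    Data d = g_data.load();
    assert( d.x + d.must_be_negative_x == 0 );  // the invariant always holds
}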

The most obvious way to do this is with large atomic operations that swap the data structure from one good state to another. If you pretend for a moment that we can work on arbitrarily large objects atomically, we could just copy the whole data structure, change it to the new state, and then write it back :


Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...
StoreRelease( &g_sharedData, newData );
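
(In case you want to map these primitives onto real syntax : here's a rough sketch of what they might look like in C++0x <atomic> terms. This mapping is my approximation, not the series' actual definitions; note the real compare_exchange also hands back the observed value through its 'expected' argument, which these sketches skip by taking oldVal by value.)

#include <atomic>

template <typename T>
T LoadAcquire( const std::atomic<T> * pVar )
{
    return pVar->load( std::memory_order_acquire );
}

template <typename T>
void StoreRelease( std::atomic<T> * pVar, T newVal )
{
    pVar->store( newVal, std::memory_order_release );
}

template <typename T>
bool CAS_Acquire( std::atomic<T> * pVar, T oldVal, T newVal )
{
    return pVar->compare_exchange_strong( oldVal, newVal,
                std::memory_order_acquire, std::memory_order_relaxed );
}

template <typename T>
bool CAS_Release( std::atomic<T> * pVar, T oldVal, T newVal )
{
    return pVar->compare_exchange_strong( oldVal, newVal,
                std::memory_order_release, std::memory_order_relaxed );
}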

Note that using Acquire on the load and Release on the store ensures that anything we do in "mutate" is properly scoped. This correctly takes sharedData from old to new in one atomic step so that it stays consistent. However, something else bad can happen: while we were working on it, somebody else may have put their change in too, and we don't want to just stomp on their change with our Store. The solution is to use CAS and retry :

Data localData, newData;  // declared outside the loop so the while can see them
do
{
    localData = LoadAcquire( &g_sharedData );
    newData = localData;
    .. mutate newData ...
} while ( ! CAS_Release( &g_sharedData, localData, newData ) );

Here we take a local copy, make some changes to bring it to a new state, and then we try to publish it to sharedData. However, when we publish, we check to make sure that sharedData is what we saw when we started - it must be equal to localData or we will not write it. If someone else changed it while we were working, we go back to the start and try again. (publish means make visible to other threads; at the time that you publish you must ensure that what you have written is in a consistent valid visible state for other threads).

This is a "change-cas-retry" loop and it's a very common pattern in lock free programming. It ensures that we change data atomically so it is made visible as a unit, and also that all threads that are changing Data get to make their changes public - nobody blindly stomps on someone else's change without looking at it.
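To make the pattern concrete, here's a compilable sketch of a change-cas-retry loop in C++0x std::atomic terms (the increment is just a made-up stand-in for "mutate") :

#include <atomic>

struct Data { int x; int must_be_negative_x; };

std::atomic<Data> g_sharedData( Data { 0, 0 } );

void MutateShared()
{
    Data localData = g_sharedData.load( std::memory_order_acquire );
    Data newData;
    do
    {
        // build the next state from our snapshot :
        newData = localData;
        newData.x += 1;
        newData.must_be_negative_x -= 1;
    }
    // on failure compare_exchange refreshes localData with the current
    // value, so the loop retries against the latest state
    // ( _weak can also fail spuriously - that's the "Abort" idea below )
    while ( ! g_sharedData.compare_exchange_weak( localData, newData,
                std::memory_order_acq_rel, std::memory_order_acquire ) );
}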

Now, as handy as the CAS-retry loop is, it's not really great lock free programming. If you look at it for a second you can see that it's basically the same as a spin lock mutex. If there is no contention, the CAS-retry steps right through just like linear code, the CAS succeeds, and we're done. If there is contention - someone else wrote the data - we loop and try again until there is no contention. Compare to this :


int mutex = 0;
Data g_sharedData;

// spin until we take the lock :
while ( ! CAS_Acquire( &mutex, 0, 1 ) )
{
}
Data localData = LoadAcquire( &g_sharedData );
.. mutate localData ...
StoreRelease( &g_sharedData, localData );
// unlock :
StoreRelease( &mutex, 0 );

Again we do a CAS; if there's contention we loop, and once there's no contention we run through the linear code and publish. The CAS-retry loop is maybe slightly faster than the simple mutex, but if each thread actually gets a hardware core and they are both running, the difference is very small. The big advantage of the CAS-retry method over a plain old mutex comes from the fact that applications and operating systems are very variable.

Consider two threads that are fighting over g_sharedData, both trying to write it. Thread A is in the middle of working on sharedData when suddenly it gets stopped. There are various reasons it might get stopped: maybe someone suspended its thread, maybe it just got swapped out to run another thread on that core, or maybe it had some OS kernel problem that stopped it (like it crashed). (BTW I'm assuming that you know you should not be doing things like file IO inside a mutex - in fact you should avoid any OS calls inside a mutex, or anything that might stall or swap your thread; always hold mutexes for very short, simple operations only.) In any case it is stopped right in the middle of working on sharedData. If we used the simple mutex, thread A is now blocking all other threads from working on sharedData. Our app is halted until thread A can run again. In the CAS-retry method, thread B can go ahead and do its work - it will see no contention, because thread A is no longer changing sharedData.

One of the tricky things we have to deal with is that we have threads running on real separate hardware cores, stepping along at the same time; if your locked area is short they will never stall on each other for long. But we also have threads that share a core with other threads, and those threads do NOT run at the same time, so one of them can appear to be frozen for a very long time when it's swapped out while the other runs. (There are a lot of other ways these scenarios differ; in general we have to handle both and wind up making compromises. For example, with a real hardware thread per core there's no point in ever sleeping at all; with lots of threads on one hardware core you have the exact opposite issue - spinning at all is pointless, lots of micro work can cause thrashing such that you spend almost all your time in thread switches, and things like backoff (sketched below) become crucial.)
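
(To give a flavor of what backoff means, here's a sketch with completely made-up thresholds : spin briefly for the true-hardware-parallelism case, then start yielding and sleeping for the swapped-out case :)

#include <thread>
#include <chrono>

// rough backoff sketch ; the thresholds here are made-up numbers :
void BackOff( int attempt )
{
    if ( attempt < 16 )
    {
        // the owner is probably actually running on another core and
        // will be done very soon : just spin
        // ( on x86 you would execute a pause instruction here )
    }
    else if ( attempt < 32 )
    {
        // maybe the owner is swapped out ; give up our time slice so it can run
        std::this_thread::yield();
    }
    else
    {
        // the owner is clearly not making progress ; actually sleep
        // so we stop burning the core
        std::this_thread::sleep_for( std::chrono::milliseconds(1) );
    }
}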

If you like, you can imagine that a CAS-retry is a lot like a spin-lock that is only held while the contended item is actually being worked on! BTW in both cases here you don't need to sit and spin when there is contention. You can see there's contention and just abandon the change and do something else, though it's somewhat rare that that is actually practical. (This is like TryEnterCriticalSection for those from that world; a sketch follows.) CAS-retry only spins as many times as the number of times that someone else updates the data.
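
A sketch of the abandon-the-change idea, reusing g_sharedData from the change-cas-retry sketch above (TryMutateShared is a made-up name) : attempt the CAS once, and if somebody else got there first, just report it and let the caller decide what to do :

// one-shot version : returns false on contention instead of spinning
// ( the lock free analog of TryEnterCriticalSection )
bool TryMutateShared()
{
    Data localData = g_sharedData.load( std::memory_order_acquire );

    Data newData = localData;
    newData.x += 1;
    newData.must_be_negative_x -= 1;

    // a single strong CAS, no retry loop ; the caller decides what
    // to do if we lost the race
    return g_sharedData.compare_exchange_strong( localData, newData,
                std::memory_order_acq_rel, std::memory_order_acquire );
}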

Now of course in the real world we can't atomically change a large data object, usually we are limited to 32 or 64 bits of atomic data. So the solution of course is to use a pointer and to CAS in the new pointer. This is pretty trivial but you do have to keep in mind that you are consuming & publishing a pointer, and that once you release it, you no longer own it :


for(;;)
{
const Data * localData = LoadAcquire( &g_sharedData );
// localData is a pointer to *shared* read-only data !
// I do not own it, it is owned by g_sharedData

Data * newData = new Data;
.. mutate newData ...
// I make my own local data object to prepare the next revision
// I can do anything I want in here on newData 

newData->stuff = Compute( localData );

// now I am publishing newData -
//    if the CAS succeeds the newData instantly becomes shared
//  and anyone else can be reading it
// the CAS here must have Release memory order so that the memory ops
// I just did in mutate are made public before the change to g_sharedData

if ( CAS_Release( &g_sharedData, localData, newData ) )
    break;

// newData was NOT published so I am still the exclusive owner here
delete newData;

}

(obviously you wouldn't actually new/delete each time, this is just illustrating the flow of ownership).
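
The same flow in C++0x std::atomic terms might look like this (a sketch; Compute is a made-up stand-in, and like the code above it deliberately punts on reclaiming the old object, which is the genuinely hard part) :

#include <atomic>

struct Data { int stuff; };

std::atomic<const Data *> g_sharedData( nullptr );

// made-up stand-in for whatever work builds the next revision :
int Compute( const Data * prev ) { return prev ? prev->stuff + 1 : 0; }

void PublishNewData()
{
    const Data * localData = g_sharedData.load( std::memory_order_acquire );
    for(;;)
    {
        Data * newData = new Data;
        newData->stuff = Compute( localData );

        // release order : the writes to *newData must be visible
        // before the new pointer is
        if ( g_sharedData.compare_exchange_weak( localData, newData,
                std::memory_order_acq_rel, std::memory_order_acquire ) )
            break;

        // not published, so it's still exclusively ours ; clean up and retry
        // ( on failure, localData was refreshed to the current pointer )
        delete newData;
    }
    // note : the object we replaced is leaked here ; real code needs a
    // reclamation scheme before anyone can safely delete it
}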

Next time we'll use CAS-retry to make a simple LIFO stack.

Oh, let me also take a second to introduce the idea of the "Abort". I find it easier to talk about many lock-free algorithms this way. A lot of the time in LF code we go to change something or check something, hoping to find something out, but we see there was contention - the values are changing and we can't get a good result - so we just Abort and retry.

So for example the C++0x CAS, written in this style, would be :


template <typename T>
LF_Result CAS(T * pVar, T & oldVal, T newVal)
{
    // trylock models getting exclusive access to the variable ;
    // if we can't get it, that's contention -> Abort
    if ( ! trylock(pVar) )
        return Abort;

    bool match = (*pVar == oldVal);
    oldVal = *pVar;     // oldVal comes back holding the value we actually saw
    if ( match )
        *pVar = newVal;

    unlock(pVar);

    return match ? Success : Failure;
}

We now return a "LF_Result" which is {Abort, Success, Failure}. Abort means ignore the results and just retry because there was contention. One way to think about a lot of Lock Free algorithms is to imagine that your old deterministic code which only had two boolean possibilities now has 3. (also "void" deterministic code that had no return value at all now has {Abort,Success} options).

As I mentioned before, the usual recourse when you get an Abort is just to retry, and you are guaranteed to only get Aborts as long as there is active contention. But you could also take the Abort and do something else entirely.
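
For example, a retry wrapper around the tri-state CAS above might look like this (SetIfEqual is a made-up name) :

// spins only while there is active contention ( Abort ) ;
// Success / Failure are real answers and end the loop :
template <typename T>
bool SetIfEqual( T * pVar, T expected, T newVal )
{
    for(;;)
    {
        T oldVal = expected;
        LF_Result r = CAS( pVar, oldVal, newVal );  // the CAS sketched above
        if ( r == Abort )
            continue;   // contention, no information ; just retry
        return ( r == Success );
    }
}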
