01-21-09 - Towards Lock-Free Threading

Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually. Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.

It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.

Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed while the mutex is held. This is just the standard easy old way :

{Thread A} <---> {Communication Region} <---> {Thread B}

Thread A - lock mutex on CR
    change values ; 
--- Thread B lock mutex
    read values

Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.

To keep the code the same, they both get copies of the Communication Region :

{Thread A | Communication Region} <---> {Communication Region | Thread B}

and they send messages via two stacks :

{Thread A | Communication Region} --- stack AtoB --> {Communication Region | Thread B}
{Thread A | Communication Region} <-- stack BtoA --- {Communication Region | Thread B}

The messages you send across the stacks apply the deltas to the communication regions to keep them in sync, just like networked game.

So for example, if Thread A is the "main thread" and Thread B is a "job doer thread" , a common case might be that the communication region is a list of pending jobs and a list of completed jobs.

Main Thread
    {Pending Jobs}
    {Completed Jobs}

Request Job :
    add to my {Pending Jobs}
    push message on stackAtoB

Worker Thread
    {Pending Jobs}
    {Completed Jobs}

Sit and pop jobs off stackAtoB
put in pending jobs list
work on pending jobs
put result on completed jobs list
push completion message on stackBtoA

The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs are still pending or done yet awaiting examination.

Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free. This way 99% of the code is identical, but it still just talks to a "Communication Region".


I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't page pieces back to main memory as they go.

While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle systems)

Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks for a long time is bad, but you can implement a "lock free" model using locks and its plenty fast.

That is, this kind of model :

[ xxxxxxxxxx giant shared data xxxxxxxxxx ]

    |          |         |         |
    |          |         |         |
    |          |         |         |

[thread A] [thread B] [thread C] [thread D]

is slow regardless of whether you use locks or not. And this kind of model :

[data A  ] [data B  ] [data C  ] [data D  ] 
    |     /    |     /   |       / |
    |    /     |    /    |      /  |
    |   /      |   /     |     /   |

[thread A] [thread B] [thread C] [thread D]

is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.


Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write. The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free methods safe. Let me try to make this clear with an example :

You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work. So you have a communication region like :

// globals :
    bool isDone;
    int workParams[32];
    int workResults[32];

Now a lot of people try to do lock-free work passing trivially by going :

Thread A :

    isDone = false;
    // set up workParams
    // tell Thread B to start

    ... my stuff ...

    if ( isDone )
        // use workResults

Thread B :

    // read workParams

    // fill out workResults

    isDone = true;

Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs. (I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and results don't all happen together and in-order). Instead we can just pass the object :

struct ThreadMessage
    bool isDone;
    int workParams[32];
    int workResults[32];

Thread A :
    ThreadMessage myLocalCopy; // in TLS or stack
    // fill out myLocalCopy
    // push myLocalCopy to thread message queue to thread B

    ... my stuff ...

    if ( pop my message queue )
        myLocalCopy = copy from queue
        // use myLocalCopy.workResults
Thread B :
    ThreadMessage myLocalCopy; // in TLS or stack
    myLocalCopy = pop message queue

    // read myLocalCopy.workParams

    // fill out myLocalCopy.workResults
    push myLocalCopy to queue for thread S

Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written and read in one big chunk. That is, we move the shared data from one large consistent state to another.

1 comment:

Sean Barrett said...

(Did you change the settings on your blog comments? I had to use a google account to post this, I was using OpenID before...)

I don't really understand your approach... with a mutex'd communication region, you can do stuff like "lock; read value; depending on value do various if()s and write one or another value; unlock". I don't see anyway you can do that under a network delta model, so it seems like you're actually talking about some very specific narrow model, not just 'communication region with a mutex'.

I tend to model shared information as "owned". There's a given chunk of memory that is owned by one thread. One atomic unit of the memory indicates ownership; all threads can read it, but either zero or one can write it. Then the rest of the memory of the chunk can only be read by the owner.

For the "zero-or-one can write it", there's really just two cases: either one thread owns it and only it can write the atomic field (and read/write the rest), or else it's unowned, and any other thread can take ownership of it (but they need to mutex or use an interlocked operation to do so). If you take ownership of a chunk, you need to do a read barrier before reading the rest of the chunk; and vice versa for the owner of the chunk giving up ownership.

I don't know if this is a sufficiently general model, but in the threading stuff I've done it's always been good enough.

So e.g. you can make jobs by somebody creating a job data structure which they own, then releasing ownership of it (to be accessed by work threads) by e.g. setting a flag field to 'READY TO BE WORKED ON'.

Implenting a circular queue holding jobs can also be done under this model; you use two chunks that are just the head/tail of the queue. If there's only one writer, it's the sole owner of the head so no locking is needed to add to the queue; otherwise a thread adding to the queue needs to take ownership of the head field to update it (which means using locking). If there's only one reader, it's the sole owner of the tail so no locking is needed to remove jobs; otherwise you need to take ownership to remove.

Anyway, I don't see any way to implement this model in the approach you're talking about. Maybe that's because this model isn't adaptable to lock-free operation, but I'm not clear what model *does* scale from using mutexes on communication regions to a 'network messaging' model.

old rants