cbloom rants: 02-25-09 - Low Level Threading

LIFO MPMC Stack

Last time we learned about publishing with CAS-retry, so let's use it to make our first LockFree data structure. The simplest one is probably the linked list LIFO (aka a stack) so we'll do that.

There is one global pointer to the top of the stack. Nodes have a next pointer. Something like this :

struct Node
{
    Atomic< Node * > m_next;
};

Atomic< Node * > g_stack = NULL;

This is an MPMC structure (Multi-Producer Multi-Consumer). Basically that means anybody in the world is allowed to be jamming on g_stack at any time and it's all good. Remember that Atomic< > in my writeups here just means that we ensure that data type can be addressed atomically; it doesn't imply anything about memory ordering the way MSVC 2005 volatile does.

Note that "Node" has no data in it. That's intentional - this is meant to be used intrusively. That is, you put Node at the top of your own data structure inline, and then you pass in your own structures. I always try to implement things intrusively at the low level because you can always build non-intrusive on top of intrusive, but not vice-versa. Intrusive is more fundamental and can be much more efficient.
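A minimal sketch of what intrusive use could look like. std::atomic stands in for Atomic< > here, and MyItem, ToNode and FromNode are made-up names for illustration, not part of the original code. The link sits inline as the first member, so converting between the user type and the Node the stack sees is just a pointer cast:

```cpp
#include <atomic>
#include <cassert>
#include <type_traits>

// Stand-in for the article's Node; std::atomic replaces Atomic<> (an assumption).
struct Node
{
    std::atomic<Node *> m_next{nullptr};
};

// Hypothetical user type: the link lives inline as the first member.
struct MyItem
{
    Node link;   // must stay first so both pointers share one address
    int  data;
};

static_assert(std::is_standard_layout<MyItem>::value,
              "the first-member cast below relies on standard layout");

// Convert between the user type and the link that the stack sees.
inline Node *   ToNode(MyItem * item) { return &item->link; }
inline MyItem * FromNode(Node * node) { return reinterpret_cast<MyItem *>(node); }
```

The stack only ever touches the Node part, so it needs no allocation of its own and no knowledge of the payload.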

Other than the Atomic this looks just like a normal stack you would implement. So what would you normally write to push a new node ?


void Push(Node * node)
{
    node->m_next = g_stack;
    // (*)
    g_stack = node;
}

That's pretty simple and it almost works lock free, but there's one problem that should be obvious. "g_stack" is a shared variable that anybody else can be touching, and at (*) it can change out from under us. If another thread pushes its own node at that moment, our store to g_stack stomps on its change and its node is lost.

Obviously any time we might be stomping on someone else's change, a CAS-retry is the natural suggestion, so let's fix it :


LF_Result Push_NoLoop(Node * node)
{
    // load g_stack to a local var, no memory order needed :
    Node * localStack = LoadRelaxed(&g_stack);

    // "node" is wholly owned by me, no one else can see it, so just normal store :
    node->m_next = localStack;

    // try to change g_stack from "localStack" to "node"
    //    if it fails it means there was contention :

    return CAS_Release( &g_stack, localStack, node );

    // CAS has release order so that when g_stack is set to node, 
    //  previous writes inside node are published correctly
    // CAS returns {Success,Failure,Abort}
}

This should look very much like the above Push, we've just been a bit more careful about things and now it works lock free. Of course Push_NoLoop doesn't actually always do the push - to do that we need to loop :


void Push(Node * node)
{
    while( Push_NoLoop(node) != LF_Success )
    {
    }
}

Note that in these retry spins there's not much point in doing sleep-backoff type stuff (maybe there's a tiny bit of a point for super heavy cache contention), but I go ahead and use my SpinBackoff helper for these anyway just so that it can assert if I accidentally write a bug and code an infinite loop.
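The real SpinBackoff helper isn't shown in this post, so the following is only a guess at its shape. The BackOff method and both thresholds are hypothetical. The idea is just to count retries, yield once contention looks real, and assert when the count gets absurd:

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the SpinBackoff helper mentioned above.
struct SpinBackoff
{
    static constexpr int kYieldAfter = 16;       // assumed: spins before yielding
    static constexpr int kMaxSpins   = 1 << 20;  // assumed: "this must be a bug" limit

    int m_count = 0;

    void BackOff()
    {
        ++m_count;
        // A retry loop that spins this long is almost certainly a coding error.
        assert( m_count < kMaxSpins && "probable infinite retry loop" );
        if ( m_count > kYieldAfter )
            std::this_thread::yield();  // let the thread we are fighting make progress
    }
};
```

A retry loop would create one SpinBackoff on the stack and call BackOff() in the loop body each time the CAS fails.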

(Also note that the Push loop should really use the value the failed CAS hands back to reload localStack, instead of doing a fresh load of g_stack on every retry, but I find it more educational to write it this way for now.)
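A rough sketch of that variant, assuming std::atomic in place of the Atomic< > / LoadRelaxed / CAS_Release helpers used in these posts (that mapping is an assumption, not the original code). compare_exchange_weak writes the value it actually found into its first argument when it fails, which is exactly the reload:

```cpp
#include <atomic>
#include <cassert>

struct Node
{
    std::atomic<Node *> m_next{nullptr};
};

std::atomic<Node *> g_stack{nullptr};

void Push(Node * node)
{
    // Initial guess at the top of the stack; no ordering needed yet.
    Node * localStack = g_stack.load(std::memory_order_relaxed);
    do
    {
        // node is still private to this thread, so a relaxed store is enough.
        node->m_next.store(localStack, std::memory_order_relaxed);
        // On failure the CAS below overwrites localStack with the current
        // g_stack, so the loop needs no separate reload.
    }
    while ( ! g_stack.compare_exchange_weak(localStack, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed) );
}
```

The release order on success plays the same role as in Push_NoLoop: the write to node->m_next is published before anyone can see node through g_stack.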

Let's go ahead and write the Pop the same way. Single threaded Pop is :


Node * Pop()
{
    Node * node = g_stack;
    if ( ! node ) return NULL;
    g_stack = node->m_next;
    return node;
}

There are obviously the same issues in here as we had with Push, so we try to fix it up :


LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = LoadAcquire(&g_stack);  // (*1)
    // Acquire memory order here because we will read stuff dependent on localStack

    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success; // empty stack , *pNode is NULL
    
    Node * nextStack = localStack->m_next; // relaxed load is fine here //(*2)

    // try to step stack to g_stack->next if no one else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);

    // if we don't return Success then don't touch *pNode !!
}

and again to make a Pop that always gets a Node you just spin on this. This is similar to what we did before and it mostly works, but there's a really big dirty problem here which will plague all our lock free work.
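For reference, the spinning Pop could be sketched like this, again assuming std::atomic as a stand-in for the helpers in these posts. It is only an illustration of the retry wrapper. It keeps the multi-consumer problem described next, so don't treat it as safe once popped nodes get freed:

```cpp
#include <atomic>
#include <cassert>

struct Node
{
    std::atomic<Node *> m_next{nullptr};
};

std::atomic<Node *> g_stack{nullptr};

enum LF_Result { LF_Success, LF_Failure, LF_Abort };

LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = g_stack.load(std::memory_order_acquire);
    *pNode = localStack;                            // we hope to return this
    if ( localStack == nullptr ) return LF_Success; // empty stack
    Node * nextStack = localStack->m_next.load(std::memory_order_relaxed);
    // localStack is a local copy, so a failed CAS may overwrite it freely.
    bool swapped = g_stack.compare_exchange_strong(localStack, nextStack,
                                                   std::memory_order_release,
                                                   std::memory_order_relaxed);
    return swapped ? LF_Success : LF_Failure;
}

Node * Pop()
{
    Node * node = nullptr;
    while ( Pop_NoLoop(&node) != LF_Success )
    {
        // contention: someone else changed g_stack, so try again
    }
    return node; // NULL means the stack was empty
}
```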

The issue is what do the consumers typically do with nodes after they pop them? They look inside at the data and poke around with it - and then they free them! (or recycle them or something, they don't just leave them alone forever). So a typical consumer might look like :


Node * node = Pop();
if ( node )
{
    int got = node->data;
    free(node);
}

But the node that we just popped used to be the top of the stack. That means it was in g_stack, and many other threads may have grabbed a pointer to it at (*1). We happened to be the first one to return successfully so we got the node back and freed it, but the threads that loaded it at (*1) still have localStack pointing at our node.

Now they walk along, la di da, and at (*2) they try to look inside localStack. BOOM access to freed memory.

Note that this occurred because Pop() is allowed to be multi-consumer. If we restrict our stack and say only one thread is ever allowed to Pop (single consumer), then this problem does not occur. That is, by going to single-consumer we have made all the Pops run single-threaded against each other, so they only need to worry about multi-threading against Push, which is safe.

This is a common and important pattern : Multi-Producer is easy. Multi-Consumer is hard. The general reason is that Producer doesn't have any branches or invalid state. It just makes a new node and sticks it on. Note that in the single-threaded version of the stack, Push always just unconditionally pushes a node. The Pop on the other hand has to deal with the empty case. It is shrinking the size of the data structure, getting towards an invalid empty state, and it is discarding nodes. We'll see more about this again later.

How do we handle this possible access to freed memory? First of all, notice that it is benign. That is, if reading from freed memory just gave us back a random value, that would be okay, because we will always fail the CAS and the bad value that we read will be discarded. So, there are three fixes :

1. Garbage Collection / Safe Memory Reclamation / etc. : allocate Nodes with a special allocator that prevents them from actually being freed while there's any possibility of a thread touching it. SMR is an efficient way to do this in C++ that's quite elegant. SMR can keep around as many extra nodes as there are consumer threads. BTW the simplest form of GC actually works fine in most applications : just keep the nodes around forever. Allocate them from a Node Pool that only ever grows and recycles them, it never actually hands them back to the OS. You probably want some kind of Node Pool / GC allocator if you're passing around a bunch of Node shitlets.

2. Just free it, but don't return it to the OS. If you're using your own malloc you can literally just Free() the node and put it back in the malloc pool and let somebody else grab it and write in it. That's totally fine - you might read random junk out of it, but that doesn't hurt here. You just don't want it to get all the way back to the OS and get marked so you page fault.

3. Catch the page fault.

This last one is rather cute, and is in fact what Windows SList does. If we get a page fault at (*2) reading localStack->m_next it means somebody popped the top node and freed it while we were working. That just means there was contention and we should retry, so we can change Pop easily :


LF_Result Pop_NoLoop(Node ** pNode)
{
    Node * localStack = LoadAcquire(&g_stack);
    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success;

    Node * nextStack;
    __try
    {
        nextStack = localStack->m_next; // possible access violation
    }
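    // MSVC SEH needs a filter expression; only swallow access violations :
    __except( GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION ?
              EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH )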
    {
        return LF_Abort; // page fault - contention - retry
    }

    return CAS_Release(&g_stack,localStack,nextStack);
}

that works great and is plenty fast on Windows, but sadly having a fast hard exception catcher is not portable, so we can't rely on this method. It's also not really that important because we are writing code that we control, so solutions #1 or #2 are totally fine for us. You can see that this technique is very cool for Windows SList because they cannot control how you do your mallocs and frees, so they must support any memory allocation pattern, which includes the memory being totally freed.
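To make the "keep the nodes around forever" flavor of solution #1 concrete, here is a minimal sketch of a pool that only ever grows and recycles. GrowOnlyPool and its methods are hypothetical names, and it cheats by using a mutex, so the allocator itself is not lock free. The point is only that memory handed out once stays mapped and constructed for the life of the pool, so a late read of a recycled node can see junk but cannot fault:

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical grow-only pool: items are created on demand, recycled on Free,
// and only destroyed when the pool itself is destroyed.
template <typename T>
class GrowOnlyPool
{
public:
    T * Alloc()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        if ( m_free.empty() )
        {
            // Nothing to recycle: grow. The pool keeps ownership forever.
            m_all.push_back(std::unique_ptr<T>(new T()));
            return m_all.back().get();
        }
        T * item = m_free.back();
        m_free.pop_back();
        return item;
    }

    void Free(T * item)
    {
        // Recycle only; the memory is never returned to the OS here.
        std::lock_guard<std::mutex> lock(m_mutex);
        m_free.push_back(item);
    }

private:
    std::mutex                      m_mutex;
    std::vector<std::unique_ptr<T>> m_all;   // every item ever created
    std::vector<T *>                m_free;  // items available for reuse
};
```

A consumer would call pool.Free(node) where the earlier example called free(node).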

BTW this whole issue is another reason why GC languages like Java or C# are much easier to write lock-free code in.

Okay, we have our Stack juju going, but there's still a huge problem in here which we will address in the next post.


02-25-09 - Low Level Threading - Part 4.1
