02-25-09 - Low Level Threading - Part 4.2


I'm sure many of you know the remaining problem with the Stack in the last post is ABA. Let me briefly sketch how it bites you :

stack starts empty

Thread1 pushes A
    stack = A ; A->next = NULL

Thread2 starts to pop and sees that head is A ,
    it sees A->next is NULL

Thread1 pops A
    stack = NULL;
Thread1 pushes B
    stack = B; B->next = NULL;
Thread1 pushes A
    stack = A -> B -> NULL;

Thread2 finishes popping
    head is still A , so CAS succeeds
    stack = NULL;
    return A;

!!! What happened to B !!!

Okay, this has been discussed a lot around the net. In the case of these kind of Node pointer based data structures, you could say that the problem is because we are recycling nodes. That is, Thread1 popped A and then recycled that memory and pushed A again. If we never recycled nodes we would not have ABA problems here.

In fact, one solution that some of the Lock Free heavyweights use to this problem is to use special variants of GC that prevent ABA. If you are using a GC like SMR (Safe Memory Reclamation), then the GC knows that Thread2 is holding a pointer to A, so when Thread1 tries to push a new node, it won't be given A (even though A has been freed).

Using a GC that prevents ABA is nice because it makes your LF code simpler and removes ABA all over. I'm not going to go into the details of how to implement SMR but there are some links on it in the big Annotated Links post that I made.

However, I want to stress that ABA is not just a problem with memory recycling. In fact, the ABA problem is because CAS doesn't really do what we asked it to do !!

If you recall back when I introduced CAS I said we were using it to prevent us from stomping changes made by other threads while we were preparing our change :

Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...

// only apply newData if sharedData hasn't changed since we started : 
} while ( ! CAS_Release( &g_sharedData, localData, newData ) )

But that's not what CAS actually does !! CAS checks that the *value* is the same as when we started, not that it hasn't been touched !

What we really want to be checking is a time stamp or a revision count on the memory, or some kind of GUID for the object. We want to know if the object is the same object that we loaded in the beginning of our loop. Obviously if the g_shared thing encapsulates the entire state of the shared object by value, then it is fine to check it with CAS.

The problem with the LIFO stack is that we are using CAS just on the head pointer - which is NOT a valid unique ID for the whole stack. Obviously {A->NULL} and {A->B->NULL} have the same head pointer but are very different stacks. Using CAS just on the head is a bug.

The standard solution to this problem is to use a "uniqueifier" , a sequence id or revision count. This is very common practice any time you're recycling things that you need to be considered different even though their value is the same. We keep our Node the same, but the root of the stack changes :

struct Node_And_Seq
    Node *    p;
    int        s;

Atomic< Node_And_Seq > g_stack = Node_And_Seq(0,0);

Note that we are now making "Node_And_Seq" together the object that must be Atomic. If the pointer is 32 bits, this object is 64 bits and we now rely on having an atomic 64 bit CAS. More on this later.

Now we just change our Pop to also increment sequence and work atomically on Node_And_Seq :

LF_Result Pop_NoLoop(Node ** pNode)
    Node_And_Seq localStack = LoadAcquire(&g_stack);

    *pNode = localStack.p;

    if ( localStack.p == NULL ) return LF_Success; // empty stack , *pNode is NULL
    // get next pointer and bump sequence count :
    Node_And_Seq nextStack = { localStack.p->m_next, localStack.s + 1 };

    // try to step stack to g_stack->next if noone else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);

now the CAS at the end is checking that localStack has the same pointer *and* the same sequence count because it is atomically comparing 64 bits. Also note that the Load at the beginning atomically loads 64 bits so that it always loads the stack head pointer and sequence count together !

If you had written your Pop like this :

LF_Result Pop_NoLoop(Node ** pNode)
    Node * localStack = LoadAcquire(&g_stack.p);
    int localSeq = LoadAcquire(&g_stack.s);



you would have written a race condition !! But of course you wouldn't do that, right? You know that Node_And_Seq must be treated as a single combined object to make it a valid GUID for the whole stack. Well, the classic Fober-Orlarey-Letz paper on LIFO has this very race in it.

(Aside : Lock Free programming is *hard* ; even many of the academic papers detailing lock free algorithms have races!)

Note that I've only talked about changing Pop. We don't need to rev the sequence number in Push, because Push can never cause ABA on its own - you're never pushing a node that's already in the stack unless you have a bug. To cause ABA you must have at least one Pop, and there we rev the sequence id.

Technically under C++0x Push should be changed to also work on Node_And_Seq so that both Push and Pop are doing atomic 64-bit ops on the g_stack shared data variable. On some processors if you address the same memory with different size loads & stores, it is not gauranteed to be atomic. On x86 it's a very small perf win to keep using just the 32-bit head pointer in Push, and it's safe, but I personally have just switched to doing 64-bit all the time because it's better to avoid problems (and it works in Relacy).

Note that if we're doing Node_And_Seq we could also easily make the sequence count 16 bits and adding a 16 bit stack depth counter. That way we could quickly support depth queries. (Windows SList does something like this). Note that those 16 bit accesses may not be atomic, but you should never try to do that anyway - you always do a full 64-bit atomic copy from g_stack to a local variable, then you poke around inside it.

Okay, now this is all good but what happens when we go to 64-bit pointers? We could make Node_And_Seq be 128 bit. New x64 chips have 128-bit CAS so we could use that. The problem is that the first generation of x64 did not have CAS-128 (cmpxchg16b). Fortunately, we don't actually need a 64-bit pointer. There are a few different solutions :

1. We know our memory is always in our application's virtual address range. Our app is never going to have more than maybe 50 bits of virtual address range. So rather than store a pointer in our Stack, we can store an offset from the base of our virtual address range. You could use 48 bits for offset and 16 bits for sequence Id. Note that the bottom 3 bits of a node address must always be zero because nodes must be 64-bit aligned (8 byte), so you actually get 51 bits of offset this way.

2. Our Nodes for Stack are coming from a Node Pool Allocator (GC) of some kind anyway. So again we can just have our Node Pool allocator reserve a virtual memory address range and then we know that our Nodes are always in that range. In fact, it could reserve two gigs of virtual address range for Nodes, and then we can just use 32 bit pointers and not worry about any of this !!

I posted a link earlier about how Windows SList deals with this problem, and it's rather intersting. But it's important to note that your implementation does not have to deal with the same issues as them - you control where your memory is coming from (it's never in the kernel address range for example), they have to be able to link any memory in the system.

No comments:

old rants