02-27-09 - Why is Intel selling me Software -

Intel used to do a pretty good thing where they would write nice like optimized widgets for their chips and give them out as source code to developers for free. Obviously that makes a lot of sense in their business, it sells more chips.

Now they want to charge me for the optimized math routines !? WTF Intel !? And even if I buy it, I don't get source code, I just get DLLs. Not just DLLs, a HUGE MESS of DLLs - 118 of them taking 283 Mega bytes !

Granted, the "IPP" does a whole lot of stuff. It's got jpeg, H264, MDCT, MP3, hell it's got CELP, BWT, it's got like every damn thing in the universe, but it's all useless because it's not cross-platform, it's not source code, and it's a huge complex system.

Give me little isolated code snippets that just do the one thing I want to steal. Don't force giant systems on me.

I want a good little nugget of *functionality* , not a framework. I have my own framework.

I've been thinking about this a lot recently because to some extent the way I'm writing Oodle right now it's becoming a framework. I think it's becoming pretty good, it's getting to be pretty easy to make really cool stuff happen, but nobody wants to buy a framework. Nobody wants to have to buy into a "system" that will make their life easier - they want to have their own framework and just call you for the little bit of hard stuff they don't want to bother with. I need to make sure Oodle does that.

The danger for me is that I really like writing framework code. The way I like to code, I build up a huge base of really strong fundamentals into a whole system that does everything I want, and then I can just easily tack bits together to make application functionality. So maybe 90% of the code is "library" or "framework" and the glue and application is minimized.

Like to do my new vert snapper for galaxy it was like - I already have my hash map, I already have my int geometry quantizer, boom tack it together, it's like 20 lines of code. That's awesome, but to share that code you have to buy into my framework - if you try to pull just the vert snapper, you find that it's got roots that go deep into the earth of the framework and cling onto the vector, the int vector, the int-float math, the hash map, the std::vector, my template util set, etc etc you probably wind up with 20 files.

I wish you could sell mesh algorithms. Just looking back at Galaxy to do the BMP triangle removal makes me miss it; the algorithms are really complex and interesting, the results are concrete and compelling, and there's so much more to do. I just don't see how to make it a product. Like for example a really good UV mapper I think would be a great product, it's hard to do, it doesn't really exist, and everyone needs it. But they don't know they need it.


02-26-09 - Low Level Threading - Table of Contents

Table of Contents since searching Blogger is so awkward :

(the older articles here were written before I totally understood this stuff but I believe they are still correct, if a bit more confused)

Part 1 : Introduction to the issues, volatile, memory barriers, and interlocked ops.

Part 2 : Thread Safe singleton and info about CRITICAL_SECTION.

Part 2.01 : A bunch of links.

Part 2.02 : Clarification of x86 memory model and dependent read race issues in earlier posts.

Part 2.5 : TLS

Annotated Links

Part 3.0 : Intro to lock free programming and memory models.

Part 3.1 : Introducing CAS and cache line issues.

Part 4.1 : LIFO MPMC stack

Part 4.2 : LIFO MPMC stack : ABA

Part 4.3 : LIFO MPMC stack : issues

Part 5.1 : FIFO SPSC queue : simple and fast

Part 6 : Wrap up , Relacy, and the code.

Added 06-12-2012 : See also the newer threading post index .

02-26-09 - Low Level Threading - Part 5.1

SPSC FIFO : the simple message queue

Phew, I've lost the spark that was fueling me to write these, but I want to get one more out before I quit : the SPSC FIFO. This is a simple and useful structure, and I think it's a nice example because it illustrates many different properties from the MPMC stack.

SPSC means single producer single consumer ; one thread makes nodes, another uses them. Immediately we don't need to worry about all the GC issues and node lifetime because we have just a single consumer. We can easily make it two ways just by having an SPSC going in each direction. We can easily make it MPMC by using many of them, eg. for 3 threads to all talk both ways to each other uses 6 SPSC FIFOs. In many cases this is actually much more efficient that using an MPMC structure.

It's a FIFO so there are two ends, and we know only the producer thread is touching the input end, and only the consumer thread is touching the output end. Thus, even though it is a multi-threaded structure, the calls to "enqueue" and "dequeue" are always single-threaded against themselves. This is crucial.

You can also make either end of the SPSC into multi by putting a critsec on that end and changing the convention for who the "owner" is : rather than the "producer" being a certain specific thread, you define it to be the locker of the producer critsec. Now you have an MPSC queue which requires a lock to produce but none to consume. Or you could do SPMC or MPMC this way. This is okay in some cases, not in others. Herb Sutter designed a queue like this in Dr. Dobbs recently (see link in annotated links area). It's certainly better than having a single lock for the whole queue, because if only one thread is producing or consuming they will get the lock with no contention, but obviously it's not lock-free.

You can do a FIFO in an array, but it's actually much harder that way. If you had an infinite amount of memory for your FIFO it would be very simple indeed, but you don't, so you either have to make it circular or dynamically resize the array. Both are possible lock-free (or near-lock-free anyway), but they both break the magic property that the consumer and producer don't fight over the structure. If it's circular you have to worry about the head running into the tail and suddenly you have the producer and consumer contending. So we'll do with nodes and a linked list.

The FIFO is a linked list from the head to the tail :

m_head -> node -> node -> node -> m_tail

We push new nodes on the tail and pop them from the head. We want to do both of these as simply and unconditionally as possible, so we don't allow a totally empty queue. We always have a dummy node in the Queue so that when the queue is empty, head == tail.

In particular it should be clear that the Push operation is very simple and unconditional :

newNode->next = NULL;
m_tail->next = newNode;
m_tail = newNode;

When the Queue starts, m_tail points at the dummy, we just tack something in his next. Then at all times we know we can just jam into the next pointer at the end to add something at the list. The key is partial ownership of the structure by the different threads.

The consumer owns the "m_head" pointer - but not necessarilly the node that m_head points to (if the queue is not empty then it also owns that node). The producer owns the "m_tail" pointer and also the "->next" pointer on the last node (which is also the first node if the queue is empty). The funniest case occurs when the queue has one element in it - the dummy plus one more - in that case the producer owns the "->next" pointer in the last node, but the consumer owns the data in the last node ! More on this later.

Okay let's do the Lock Free Push in detail :

struct SPSC_FIFO_Node
    Atomic< SPSC_FIFO_Node * > m_next;

    void * m_data;

    // optional cache line padding here

struct SPSC_FIFO
    Atomic< SPSC_FIFO_Node * > m_head;

    // cache line padding here

    Atomic< SPSC_FIFO_Node * > m_tail;

    // optional cache line padding here

void Push(SPSC_FIFO * fifo, LF_SPSC_FIFO_Node * node)
    // at this point I own "node" completely
    // I wrote node->m_data before coming in here
    StoreRelaxed(&(node->m_next), NULL); // relaxed store because I own it

    // get the tail pointer - again I own it so relaxed load is fine
    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    // as soon as I set back->next here , node is now visible to the consumer
    // I no longer entirely own node (I do still own node->next)
    // Release here to ensure node->next and node->data are written first
    // (*1)

    // now advance tail -
    // this is just for me, so relaxed is fine

No atomic RMW's at all! No consistent checks - just careful memory ordering. These are slightly subtle :

(*1) is the point where the consumer might be able to get at node. Obviously the stuff inside node must be out to him before he pops node. You do not need any kind of store fence here when you use the Release memory constraint. We're writing data out in a stream :

node->data = A;     // 1 - Relaxed
node->next = NULL;  // 2 - Relaxed
back->next = node;  // 3 - Release

Release means this can go out  1,2,3 or 2,1,3 , but never 1,3,2 or something invalid.

Similarly when the consumer Pops next, he will use an Acquire :

popped = head->next; // 1 - Acquire
head = popped->next; // 2 - Relaxed
data = popped->data; // 3 - Relaxed

Again this can go 1,2,3 or 1,3,2 but never a bad order that would see whack data.

BTW on most architectures (*1) doesn't actually even need to be Release because the loads we care about are dependent, and dependent data automatically is acquire/release on every major architecture, but pretend you heard that.

Also note why we used the dummy - the tail can never be popped. If we didn't have that you could have sequences like :

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    .. now consumer thread runs and pops everything ...
    .. !! back gets freed !!


However, the dummy is a bit weird in pop. After we push a bunch of nodes, it's sitting there at the head with nothing in it. If we pop a node, we pop the dummy. We can't throw it away, because we need a dummy to always be in the queue so that we never pop the last node. A common pattern used in other cases is to repush the dummy. (this is used in the MPMC FIFO and the MPSC FIFO - if you pop the dummy, you just push it back on the other end). But we can't do that here because we are the consumer, not the producer, and this is SPSC, so we don't have access to push. Thus we must use the method of sliding the dummy down the list.

It looks like this :

    // m_head is totally owned by me so I can load relaxed :
    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));
    // front is the current dummy, it is never null
    // front->m_data is garbage

    // (*1) 
    // front->next is the sync point with the producer !
    // if front == tail we are sharing this variable
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));
    if( next == NULL )
        return NULL; // empty

    // next exists ; now the node "next" is mostly owned by me
    //  (the producer owns next->m_next)
    // front will be returned, it is wholly owned by me

    // shift data from next to front :

    void * data = LoadRelaxed(&(next->m_data));

    // update m_head - relaxed, we own it
    // copy data into front

    // next is now the dummy
    //  next->m_data is garbage (it's a dupe of what we return here)

    return front;

Again it's pretty simple, in fact it's harder to understand that it works than to just write it and use it ;)

It's interesting to look at a slightly different possible way to write Pop. This alternative way is often suggested around the net, but I believe it's worse. Instead of checking front->next to check for empty , it checks head == tail :

    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));

    SPSC_FIFO_Node * tail = LoadAcquire(&(fifo->m_tail));
    if ( front == tail )
        return NULL; // empty

    // this is all the same :
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));


    front->m_data = next->m_data;

    return front;

If we use this Pop_Alt with our Push there's a race condition. The problem is in these two lines of Push :




What's the problem?

At (*2) one possibility is that the ->next pointer is set, but m_tail is not updated yet. Thus when Pop_Alt is called, it will see front == m_tail and return empty even though the node is there. That's not the problem. In fact there's nothing wrong with that. We never said that Pop would *immediately* be able to see a new node if the Push wasn't done yet. In fact with Lock Free stuff you can never gaurantee the timing of memory visibility.

The problem at (*2) is that the stores don't have to go in order because we used a relaxed store for m_tail ! That was okay before because Pop never touched tail, but now if they run out of order :

    fifo->m_tail = node

    back->m_next = node

tail gets updated first so Pop_Alt thinks the Queue is not empty. It tries to pop, but ->next is not set yet so it gets NULL ! Remember that "Release" does not prevent stores below it from moving up, only stores above it from moving down.

Obviously the fix is just to ensure that the stores go in the right order :

void Push_Alt(SPSC_FIFO * fifo, LF_SPSC_FIFO_Node * node)
    StoreRelaxed(&(node->m_next), NULL);

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));


    // Release so these don't reorder !

and now Push_Alt and Pop_Alt work okay. But I believe they are not as good as our original version because the consumer and producer are now both touching the "m_tail" variable which means they cache line fight. There's no contention in the sense that they never do a CAS on m_tail or have to loop or block on each other, but just by having Pop look at m_tail is it causes the line to have to sync. (note that this kind of sharing is way way way better than a CAS - this is just a normal cache line propagation and it's read-only on the consumer which is way better than if it was being written from both threads).

And now for the payoff.

If we use our macros and such to convert these actions into the right native operations that respect the memory order constraints, "Push" compiles to this on x86 :

mov [eax],0        // eax is node , [eax] is node->next
mov [eax+4],edi    // edi is the data element , [eax+4] is node->data
mov ecx,[esi+40h]  // [esi+40h] is the fifo head
mov [ecx],eax      // [ecx] is head->next
mov [esi+40h],eax  // head = node

Five moves for a lock-free FIFO push ! Note there are absolutely zero "lock" prefix ops or memory barriers, and yet the memory is gauranteed to be passed to the consumer safely. Also note that you really did need to be careful with the Release semantics because if these movs are not done in the correctly constrained order, this is not thread safe !!

This is the real payoff for why we learned about the memory models and using load/store with constraints. Carefully designed algorithms can be extremely efficient. Obviously the fact that this is so efficient on x86 relies on the strong x86 memory model which says all stores are Release, but more generally if we write the code with the absolute minimum of constraints, that lets the system translate it into the most efficient method possible for the given platform.

One annoying thing about this FIFO is it can't be intrusive. Push must allocate a none and Pop must free one. So you need a custom node allocator recycling thing. However there is no ABA issue because there is no CAS so you don't have to worry about SMR or anything.

One issue is that standard thread-caching mallocs are very very bad for FIFO queues. The problem is you are by design always allocating the node on one thread and freeing it on a different thread. Standard thread-caching mallocs usually optimize freeing on the *same* thread that did the allocation, and they make cross-thread frees slow. In fact, one option is to use an SPSC queue to pass the nodes back to the producer without any data just so that he can get the memory back ! Another option is to use one of Dmitriy Vjukov's custom allocators that works well with lockfree fifos.

Because this SPSC FIFO is so fast, it's totally reasonable to make N-thread multiway communication channels by multiplexing SPSC FIFOs. In fact that's what the experts in the field do. There is a problem, in that you need O(N^2) channels, which becomes bad for the large N of the future, but for N <= 8 or so it's a totally reasonable way to go.

The art of good lock free system design is in combining various simple structures to create a more elaborate information flow pattern. And don't forget to make good use of plain old single threaded structures too! For example, if you are often passing many data items at once, don't push them all one by one. Intead, link them together with a ->next pointer to make a plain old single-threaded linked list, and push the whole linked list with just one lock-free op.

Another cute trick I've seen is to transfer SPSC FIFOs using a more expensive structure like our MPMC LIFO :

Say for example you have N producers and M consumers, N and M both large. The producer wants to jam a bunch of nodes to some consumer and have them worked on, but it doesn't care who. Once some consumer is eating his nodes, he may continue to push them intermittently over time. So I can't just push them all in one big packet. What we do is use one MPMC LIFO or FIFO for all the producers and consumers. Our producer pushes a pointer to an SPSC FIFO. The first consumer that has free time pops the big shared MPMC structure and gets the FIFO. Now he owns it and is the single consumer - now the producer and consumer can talk directly to each other over that link efficiently without going through the slow MPMC structure.

I think I will try to write just one more of these with an introduction to Relacy Race Detector, and maybe a brief sketch of the standard thread pool / work stealing thing that people do now, and I will post my code for all this.


02-25-09 - Low Level Threading - Part 4.3

LIFO MPMC Stack : Performance Issues

Okay, we have our LIFO MPMC stack finally working, let's talk about what it does.

It sucks. If you actually try to use it with a lot of producers and consumers, it will grind to a halt. The problem is very heavy cache line contention.

A stack only has a single entry point - the "top". Anybody who wants to push or pop from it must go through g_stack and they all fight over each other. To do a Push or a Pop each thread has to do a CAS which we've seen before is pretty expensive. But the real problem is the cache lines. Let's look at Push in terms of what it does to cache lines :

LF_Result Push_NoLoop(Node * node)
    // request the cache line with g_stack for my processor
    //  if someone else is doing a CAS we stall for them to release exclusive lock on the line !
    //    then we stall to fetch it from them
    Node * localStack = LoadRelaxed(&g_stack);

    node->m_next = localStack;

    // CAS first requests a reservation of the stack line of g_stack and must stall on all other CAS's
    //    that also evicts the cache line of g_stack from all other processors
    return CAS_Release( &g_stack, localStack, node );
    // CAS is done, g_stack cache line gets flushed out
    //    other processors can now fetch it to their cache

As I mentioned before this isn't a huge disaster on shared-cache multi-core chips, but it's death on real SMP systems without shared caches.

If you even just have one thread that is pushing lots of items and another thread popping lots of items - they will be constantly trading exclusive access to that cache line, forcing the other one to evict it, back and forth over and over. (some day we will talk about a much better way to do an SPSC FIFO).

Because we know this cache line problem is so bad, we should at least put padding around g_stack. Say we have it at file scope with some other variables :

int g_x;
LFStack g_stack;
int g_y;

We know g_stack is thrashing cache - but now anybody touching g_x and g_y is also feeling the pain and causing us more problems. So we should isolate it :

int g_x;

int _pad1[CACHE_LINE_SIZE];
LFStack g_stack;
int _pad2[CACHE_LINE_SIZE];

int g_y;

This way at least our problems are only our problems. It sucks that we're introducing dependence on a hard-coded cache line size. (on x86 CACHE_LINE_SIZE seems to be 64 now).

Note that the Nodes have a similar though smaller issue. If your Nodes are say 16 bytes or so, and they come from a Node Pool Allocator, that allocator typically will put all the nodes in a row with each other. So now you have two different threads that pop off two nodes from a Stack - those nodes will often be directly adjacent in memory. Boom cache line contention. The big issue is always cache line contention.

One solution is to pad up your nodes to CACHE_LINE_SIZE. If you don't want to suffer that penalty of wasted memory use from padding your nodes, you could use a strided allocator, rather than return node[0], node[1], node[2] , instead you return node[0] , node[4] , node[8] , then node[1], node[5], node[9], etc...

Also note that all this CACHE_LINE_SIZE aligning of things is actually really bad for cache use because of the way caches are keyed. It's a well known cache use improvement optimization to make sure you *don't* make objects exactly CACHE_LINE_SIZE offset from each other. So you might want to align your nodes to (CACHE_LINE_SIZE+8). Sigh.

One obvious way to reduce contention on something like this MPMC stack is striping or binning. Rather than just have one stack you have 4. To push a node you use round robin or rand and push on one of them. To Pop you try them all. This helps in the case that your stacks are usually not empty so Pop usually succeeds on it's first try. There are better solutions to this problem though.

One big optimization for the MPMC stack that you can use in some cases is to Pop all. Rather than popping the stack, the consumer just grabs the whole thing using atomic exchange :

Node * Pop_All()
    Node_And_Seq nullStack = { 0, 0 };
    Node_And_Seq localStack = Exchange(&g_stack, nullStack);

    // localStack.p is now exclusively owned by me

    return localStack.p;

This is an unconditional exchange, not a CAS, we don't need to loop or anything. We make the stack empty and take ownership of the whole contents. We can now work on a bunch of nodes in a batch which gets rid of the pain of the cache contention.

ADDENDUM : I should say that there are use cases the MPMC stack are really good for. One is the case that you are mostly just using it from one thread, that thread is doing 90% of the pushing and popping to the stack, so it gets to own the cache line, but once in a while other people need to push or pop things to the stack. It's fine for that, especially if the "once in a while" tends to happen when I'm not heavily hitting it.

Okay, I think we have now talked about the majority of the issues of lock free coding. In part 5 we will talk about the SPSC FIFO which is a very elegant illustration of the advantages of memory models and partial exclusive ownership.

02-25-09 - Low Level Threading - Part 4.2


I'm sure many of you know the remaining problem with the Stack in the last post is ABA. Let me briefly sketch how it bites you :

stack starts empty

Thread1 pushes A
    stack = A ; A->next = NULL

Thread2 starts to pop and sees that head is A ,
    it sees A->next is NULL

Thread1 pops A
    stack = NULL;
Thread1 pushes B
    stack = B; B->next = NULL;
Thread1 pushes A
    stack = A -> B -> NULL;

Thread2 finishes popping
    head is still A , so CAS succeeds
    stack = NULL;
    return A;

!!! What happened to B !!!

Okay, this has been discussed a lot around the net. In the case of these kind of Node pointer based data structures, you could say that the problem is because we are recycling nodes. That is, Thread1 popped A and then recycled that memory and pushed A again. If we never recycled nodes we would not have ABA problems here.

In fact, one solution that some of the Lock Free heavyweights use to this problem is to use special variants of GC that prevent ABA. If you are using a GC like SMR (Safe Memory Reclamation), then the GC knows that Thread2 is holding a pointer to A, so when Thread1 tries to push a new node, it won't be given A (even though A has been freed).

Using a GC that prevents ABA is nice because it makes your LF code simpler and removes ABA all over. I'm not going to go into the details of how to implement SMR but there are some links on it in the big Annotated Links post that I made.

However, I want to stress that ABA is not just a problem with memory recycling. In fact, the ABA problem is because CAS doesn't really do what we asked it to do !!

If you recall back when I introduced CAS I said we were using it to prevent us from stomping changes made by other threads while we were preparing our change :

Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...

// only apply newData if sharedData hasn't changed since we started : 
} while ( ! CAS_Release( &g_sharedData, localData, newData ) )

But that's not what CAS actually does !! CAS checks that the *value* is the same as when we started, not that it hasn't been touched !

What we really want to be checking is a time stamp or a revision count on the memory, or some kind of GUID for the object. We want to know if the object is the same object that we loaded in the beginning of our loop. Obviously if the g_shared thing encapsulates the entire state of the shared object by value, then it is fine to check it with CAS.

The problem with the LIFO stack is that we are using CAS just on the head pointer - which is NOT a valid unique ID for the whole stack. Obviously {A->NULL} and {A->B->NULL} have the same head pointer but are very different stacks. Using CAS just on the head is a bug.

The standard solution to this problem is to use a "uniqueifier" , a sequence id or revision count. This is very common practice any time you're recycling things that you need to be considered different even though their value is the same. We keep our Node the same, but the root of the stack changes :

struct Node_And_Seq
    Node *    p;
    int        s;

Atomic< Node_And_Seq > g_stack = Node_And_Seq(0,0);

Note that we are now making "Node_And_Seq" together the object that must be Atomic. If the pointer is 32 bits, this object is 64 bits and we now rely on having an atomic 64 bit CAS. More on this later.

Now we just change our Pop to also increment sequence and work atomically on Node_And_Seq :

LF_Result Pop_NoLoop(Node ** pNode)
    Node_And_Seq localStack = LoadAcquire(&g_stack);

    *pNode = localStack.p;

    if ( localStack.p == NULL ) return LF_Success; // empty stack , *pNode is NULL
    // get next pointer and bump sequence count :
    Node_And_Seq nextStack = { localStack.p->m_next, localStack.s + 1 };

    // try to step stack to g_stack->next if noone else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);

now the CAS at the end is checking that localStack has the same pointer *and* the same sequence count because it is atomically comparing 64 bits. Also note that the Load at the beginning atomically loads 64 bits so that it always loads the stack head pointer and sequence count together !

If you had written your Pop like this :

LF_Result Pop_NoLoop(Node ** pNode)
    Node * localStack = LoadAcquire(&g_stack.p);
    int localSeq = LoadAcquire(&g_stack.s);



you would have written a race condition !! But of course you wouldn't do that, right? You know that Node_And_Seq must be treated as a single combined object to make it a valid GUID for the whole stack. Well, the classic Fober-Orlarey-Letz paper on LIFO has this very race in it.

(Aside : Lock Free programming is *hard* ; even many of the academic papers detailing lock free algorithms have races!)

Note that I've only talked about changing Pop. We don't need to rev the sequence number in Push, because Push can never cause ABA on its own - you're never pushing a node that's already in the stack unless you have a bug. To cause ABA you must have at least one Pop, and there we rev the sequence id.

Technically under C++0x Push should be changed to also work on Node_And_Seq so that both Push and Pop are doing atomic 64-bit ops on the g_stack shared data variable. On some processors if you address the same memory with different size loads & stores, it is not gauranteed to be atomic. On x86 it's a very small perf win to keep using just the 32-bit head pointer in Push, and it's safe, but I personally have just switched to doing 64-bit all the time because it's better to avoid problems (and it works in Relacy).

Note that if we're doing Node_And_Seq we could also easily make the sequence count 16 bits and adding a 16 bit stack depth counter. That way we could quickly support depth queries. (Windows SList does something like this). Note that those 16 bit accesses may not be atomic, but you should never try to do that anyway - you always do a full 64-bit atomic copy from g_stack to a local variable, then you poke around inside it.

Okay, now this is all good but what happens when we go to 64-bit pointers? We could make Node_And_Seq be 128 bit. New x64 chips have 128-bit CAS so we could use that. The problem is that the first generation of x64 did not have CAS-128 (cmpxchg16b). Fortunately, we don't actually need a 64-bit pointer. There are a few different solutions :

1. We know our memory is always in our application's virtual address range. Our app is never going to have more than maybe 50 bits of virtual address range. So rather than store a pointer in our Stack, we can store an offset from the base of our virtual address range. You could use 48 bits for offset and 16 bits for sequence Id. Note that the bottom 3 bits of a node address must always be zero because nodes must be 64-bit aligned (8 byte), so you actually get 51 bits of offset this way.

2. Our Nodes for Stack are coming from a Node Pool Allocator (GC) of some kind anyway. So again we can just have our Node Pool allocator reserve a virtual memory address range and then we know that our Nodes are always in that range. In fact, it could reserve two gigs of virtual address range for Nodes, and then we can just use 32 bit pointers and not worry about any of this !!

I posted a link earlier about how Windows SList deals with this problem, and it's rather intersting. But it's important to note that your implementation does not have to deal with the same issues as them - you control where your memory is coming from (it's never in the kernel address range for example), they have to be able to link any memory in the system.

02-25-09 - Low Level Threading - Part 4.1


Last time we learned about publishing with CAS-retry, so let's use it to make our first LockFree data structure. The simplest one is probably the linked list LIFO (aka a stack) so we'll do that.

There is one global pointer to the top of the stack. Nodes have a next pointer. Something like this :

struct Node
    Atomic< Node * > m_next;

Atomic< Node * > g_stack = NULL;

This is an MPMC structure (Multi-Producer Multi-Consumer). Basically that means anybody in the world is allowed to be jamming on g_head at any time and it's all good. Remember that Atomic< > in my writeups here just means that we ensure that data type can be addressed atomically, it doesn't imply anything about memory ordering or anything like MSVC 2005 volatile or anything like that.

Note that "Node" has no data in it. That's intentional - this is meant to be used intrusively. That is, you put Node at the top of your own data structure inline, and then you pass in your own structures. I always try to implement things intrusively at the low level because you can always build non-intrusive on top of intrusive, but not vice-versa. Intrusive is more fundamental and can be much more efficient.

Other than the Atomic this looks just like a normal stack you would implement. So what would you normally write to push a new node ?

void Push(Node * node)
    node->m_next = g_stack;
    // (*)
    g_stack = node;

That's pretty simple and it almost works lock free, but there's one problem that should be obvious. "g_stack" is a shared variable that anybody else can be touching , and at (*) it can change so that we are doing something inconsistent - stomping on someone else's change.

Obviously any time we might be stomping on someone else's change, a CAS-retry is the natural suggestion, so let's fix it :

LF_Result Push_NoLoop(Node * node)
    // load g_stack to a local var, no memory order needed :
    Node * localStack = LoadRelaxed(&g_stack);

    // "node" is wholy owned by me, no one else can see it, so just normal store :
    node->m_next = localStack;

    // try to change g_stack from "localStack" to "node"
    //    if it fails it means there was contention :

    return CAS_Release( &g_stack, localStack, node );

    // CAS has release order so that when g_stack is set to node, 
    //  previous writes inside node are published correctly
    // CAS returns {Success,Failure,Abort}

This should look very much like the above Push, we've just been a bit more careful about things and now it works lock free. Of course Push_NoLoop doesn't actually always do the push - to do that we need to loop :

void Push(Node * node)
    while( Push_NoLoop(node) != LF_Success )

Note that in these retry spins there's not much point of doing sleep-backoff type stuff (maybe there's a tiny bit of a point for super heavy cache contention), but I go ahead and use my SpinBackoff helper for these anyway just so that it can assert if I accidentally write a bug and code an infinite loop.

(also note that really the Push loop should be using CAS to reload localStack but I find it more educational to write it this way for now).

Let's go ahead and write the Pop the same way. Single threaded Pop is :

Node * Pop()
    Node * node = g_stack;
    if ( ! node ) return NULL;
    g_stack = node->m_next;
    return node;

There are obviously the same issues in here as we had with Push, so we try to fix it up :

LF_Result Pop_NoLoop(Node ** pNode)
    Node * localStack = LoadAcquire(&g_stack);  // (*1)
    // Acquire memory order here because we will read stuff dependent on localStack

    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success; // empty stack , *pNode is NULL
    Node * nextStack = localStack->m_next; // relaxed load is fine here //(*2)

    // try to step stack to g_stack->next if noone else has changed it
    return CAS_Release(&g_stack,localStack,nextStack);

    // if we don't return Success then don't touch *pNode !!

and again to make a Pop that always gets a Node you just spin on this. This is similar to what we did before and it mostly works, but there's a really big dirty problem here which will plague all our lock free work.

The issue is what do the consumers typically do with nodes after they pop them? They look inside at the data and poke around with it - and then they free them! (or recycle them or something, they don't just leave them alone forever). So a typical consumer might look like :

Node * node = Pop();
if ( node )
    int got = node->data;

But the node that we just popped used to be the top of the stack. That means it was in g_stack, and many other threads may have grabbed a pointer to it at (*1). We happend to be the first one to return successfully so we got the node back and freed it, but at (*1) they still have localStack pointing at our node.

Now they walk along, la di da, and at (*2) they try to look inside localStack. BOOM access to freed memory.

Note that this occurred because Pop() is allowed to be multi-consumer. If we restrict our stack and say only one thread is ever allowed to Pop (single consumer), then this problem does not occur. That is, by going to single-consumer we have made all the Pops run single-threaded against each other, so they only need to worry about multi-threading against Push, which is safe.

This is a common and important pattern : Multi-Producer is easy. Multi-Consumer is hard. The general reason is that Producer doesn't have any branches or invalid state. It just makes a new node and sticks it on. Note that in the single-threaded version of the stack push it always just unconditionally pushes a not. The Pop on the other hand has to deal with the empty case. It is shrinking the size of the data structure getting towards an invalid empty state, and it is discarding nodes. We'll see more about this again later.

How do we handle this possible access to freed memory? First of all, notice that it is benign. That is, if reading from freed memory just gave us back a random value, that would be okay, because we will always fail the CAS and the bad value that we read will be discarded. So, there are three fixes :

1. Garbage Collection / Safe Memory Reclamation / etc. : allocate Nodes with a special allocator that prevents them from actually being freed while there's any possibility of a thread touching it. SMR is an efficient way to do this in C++ that's quite elegant. SMR can keep around as many extra nodes as there are consumer threads. BTW the simplest form of GC actually works fine in most applications : just keep the nodes around forever. Allocate them from a Node Pool that only ever grows and recycles them, it never actually hands them back to the OS. You probably want some kind of Node Pool / GC allocator if you're passing around a bunch of Node shitlets.

2. Just free it, but don't return it to the OS. If you're using your own malloc you can literally just Free() the node and put it back in the malloc pool and let somebody else grab it and write in it. That's totally fine - you might read random junk out of it, but that doesn't hurt here. You just don't want it to get all the way back to the OS and get marked so you page fault.

3. Catch the page fault.

This last one is rather cute, and is in fact what Windows SList does. If we get a page fault at (*2) reading localStack->next it means somebody popped off head and freed it while we were working. That just means there was contention and we should retry, so we can change Pop easily :

LF_Result Pop_NoLoop(Node ** pNode)
    Node * localStack = LoadAcquire(&g_stack);
    *pNode = localStack; // we hope to return this

    if ( localStack == NULL ) return LF_Success;

    Node * nextStack;
        nextStack = localStack->m_next; // possible access violation
        return LF_Abort; // page fault - contention - retry

    return CAS_Release(&g_stack,localStack,nextStack);

that works great and is plenty fast on Windows, but sadly having a fast hard exception catcher is not portable, so we can't rely on this method. It's also not really that important because we are writing code that we control, so solutions #1 or #2 are totally fine for us. You can see that this technique is very cool for Windows SList because they cannot control how you do your mallocs and frees, so they must support any memory allocation pattern, which includes the memory being totally freed.

BTW this whole issue is another reason why GC languages like Java or C# are much easier to write lock-free code in.

Okay, we have our Stack juju going, but there's still a huge problem in here which we will address in the next post.

02-25-09 - Cast with union

Sometimes I need to cast an object to another as a "bit cast" , not a "value cast" (I don't know the good terms for this). The classic way is :

template <typename t_to, typename t_fm>
t_to & same_size_bit_cast( t_fm & from )
    COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    return *( (t_to *) &from );    

Which works fine except that I guess it freaks out GCC and might cause aliasing problems and such. (maybe I can just mark that up right and not have trouble?).

The union cast :

template <typename t_to, typename t_fm>
t_to same_size_bit_cast( t_fm from )
    RR_COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
    union _bit_cast_union
        t_fm fm;
        t_to to;
    _bit_cast_union converter = { from };
    return converter.to;

*almost* works. It works some times, like on basic types it works no problem. If t_to or t_fm has constructors or assignment operators though, it refuses to compile which is damn annoying (it would be handle in C++ if there was a way to say "I want an instance of this object here but don't run any constructors or destructors, just give me the member layout without the associated functions"). Often the whole reason I'm doing this is because I want to convert a more complex object into something I can just treat as a bag of bits. (for example if you want to use the windows Interlocked64 ops to work an LFSList type, you need to cast from a struct to a LONGLONG).

What's worse is that if t_fm or t_to has certain type qualifiers that get attached in the template argument deduction, it fails, and it becomes a big mess of casting back on the user side to figure it out and get that stuff off. (qualifiers like const or volatile or will make the union illegal).

Furthermore there's a nasty difference in that the latter gives you a *copy* while the first gives you a reinterpret pointer to the same memory that you can fiddle with, which is what I'd rather have, but I guess that's verboten these days.


02-24-09 - Low Level Threading - Part 3.1

Atomic RMW and Hardware

The next building block we need is some kind of atomic RMW. We've talking about getting atomic loads & stores and memory ordering, but we need a way to change a value and know that we are changing the original value.

The classic example is two threads trying to increment a shared counter :

Atomic< int > g_counter = 0;

// thread1 :
g_counter ++;

// thread2 :
g_counter ++;
After this runs, g_counter might be 1 or 2 !! The problem is what ++ is actually doing :
g_counter ++ :

temp = load g_counter
temp ++;
store g_counter , temp
Now you should be able to see that both threads can load g_counter at 0, both can increment to 1 in their temp vars, then both store a 1. To fix this we could take a lock :
temp = load g_counter
temp ++;
store g_counter , temp
And that is exactly the Atomic read-modify-write (RMW) that we need. If we use this for g_counter++ then it will always be = 2 after our program runs.

Fortunately almost all hardware provides some kind of cheap atomic micro-lock mechanism to do this. On x86 you have the "lock prefix" instructions (the Interlocked intrinsics directly correspond to lock instructions for the most part). On Power you have a load-linked/store-conditional with something like lwarx-stwcx. On Power you are almost literally writing a micro-lock and unlock on the cache line. On x86 it's more implicit that a lock is taken for a period of time, it pretends to be a simple instruction. A bit more on this later.

The most powerful and useful atomic RMW is probably the CAS (Compare and Swap). I'm going to write CAS using the C++0x style since that is becoming more and more standard around the web and in the literature :

// T must be Atomic
bool CAS(T * pVar, T & oldVal, T newVal)
    lock(pVar);  // "micro-lock" the variable
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    


    return match;

CAS checks if *pVar is = oldVal and if it is, then it stuffs in newVal. Note that oldVal is a *reference* and it also reads the current value of pVar into there. (I really despise that style of coding but it's the way the C++0x standard CAS is defined and you will see people rely on that behavior, so be aware).

Actually C++0x defines CAS a little differently; it can have spurious failures. Basically that means that the micro-lock on the variable can fail and CAS can return false even though *pVar was == oldVal :

bool CAS(T * pVar, T & oldVal, T newVal)
    if ( ! trylock(pVar) )
        return false;
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    


    return match;

On x86 that doesn't really make any sense, but on other architectures where the RMW is implemented using a lock-page loop it gives them more flexibility to be able to fail to lock.

Note that the arguments are in the opposite order of InterlockedCompareExchange which can be a bit confusing.

CAS is super useful and can be used to implement almost anything. For example, CAS could be use to implement increment :

void AtomicIncrement( Atomic * pVar )
        Atomic old = Load( pVar );
        Atomic new = old + 1;
    } while ( ! CAS(pVar,old,new) );
Note that I failed to take advantage of the fact that CAS is reading oldval for me in the second operand, really this should be :
void AtomicIncrement( Atomic * pVar )
    Atomic old = Load( pVar );
        Atomic new = old + 1;
    } while ( ! CAS(pVar,old,new) );
When CAS fails, old is reloaded to the new value, we add 1 to that and try again. Obviously this is a crazy way to implement increment, the hardware probably has a direct atomic increment, this just shows the power of CAS, and also why C++0x made it reload old val in the second arg, it makes stuff like this very handy.

Note that I haven't been talking about the memory order semantics here. The reason is that the client algorithm needs to specify the memory order semantics that are needed for that operation. CAS itself does not have a memory order requirment, you might want to CAS_Acquire or CAS_Release or CAS_Acq_Rel or CAS_Sequential. They all have their uses depending on what the actual operation needs, and they enforce ordering of memory ops outside the CAS against the CAS (which is treated as one op).

The other really basic operation that we will use a lot is "Exchange". Exchange is almost always a lot faster than CAS. It just reads the old value and stuffs the new value atomically.

We are now ready to implement our own simple mutex :

// 0 is unlocked, 1 is locked
typedef Atomic< int > SimpleMutex; 

void SimpleLock(SimpleMutex * mut)
    SpinBackOff backoff;
    while( ! CAS_Sequential(mut,0,1) )
        // mut was not 0 , spin/yield/backoff try again
    // mut should now be 1    
void SimpleUnlock(SimpleMutex volatile * mut)
    // *mut should now be 1

(there are a lot of possible memory semantics you might want for your mutex, this is only one possibility; there are subtle reasons that you want mutex locks to be globally sequential and not just acquire which I won't go into)
(obvious this isn't reentrant or any of that stuff and I'm not suggesting you use this instead of critsec - critsec is better in almost every way, this is just an example)
(backoff is a little helper struct I use that spins a while then sleeps with increasing intervals).

Sequential memory semantics means that memory operations cannot move past it in either direction, and ALSO that there is a global total ordering. That is, if we have three threads doing this :

// thread 1

// thread 2

// thread 3
And each thread runs on its own processor, then processor 1 might see g_global be 2 then 3 then 1. Processor 2 might see g_global be 3 then 2 then 1. Obviously that makes no sense in a total order, but the memory ops are allowed to be seen by the processors in different orders as long as they satisfy the contraints placed on them.

If we make all those stores into StoreSequential, then all processors will see the ops change in the same way. They might them at different times but they see them in the same order. Think about it like a room full of people. Everybody needs to tell everyone else their name. If you just let everyone go and tell their name to other people, then each individual will hear all the others' names in very different order. eg. Mike might hear Alex then Andy then Dave then Chris. Dave might hear Alex then Mike then Chris then Andy. The total order basically means first one person raises their hand, then they say their name to everyone, then the next person raises their hand. Thus everybody hears each name in the same order.

Obviously Sequential ops are a lot more expensive because they require all the processors in the system to agree on a sync point that they will not reorder other sequential operations across.

Another common memory ordering we haven't really talked about is "Acq_Rel" or Acquire+Release. This means that loads can't move up before it, and stores can't move down past it, but loads could move down past it and stores could move up before it. Acq_Rel also doesn't imply a total order. You often want to use Acq_Rel when you are creating a sync point that does both a read and a publish, but doesn't need full Sequential order. For example :

Load(a); // can move down
Store(b); // cannot move down

CAS_Acq_Rel( stuff );

Load(c); // cannot move up
Store(d); // can move up

// so it could change to this :
// (assuming a,b,c,d don't overlap)


CAS_Acq_Rel( stuff );


Acq_Rel vs. Sequential is sort of like lwsync vs hwsync on Power (but not exactly I don't think). Acq_Rel is useful when I am trying to use CAS to both publish something and to get some values.

Okay, I think that's the majority of the tools we need. I'm going to talk a bit about what actually happens on x86, but you can just ignore this part if you want.

We already talked last time about how x86 is already automatically acquire & release, that is, there are #LoadLoad and #StoreStore barriers everywhere - but acquire and release say nothing about how loads move against stores (that would be #LoadStore and #StoreLoad barriers - the notation is that #XY means X must occur before Y). Well it turns out x86 also automatically has #LoadStore ordering. The x86 multiprocessor model was designed to make single-proc code "just work" and they did a fuck of a good job of it. The crucial thing is that Stores coming out of any core come out in the order of the program code. Obviously the #StoreStore barriers everywhere enforce that. What it all means is that Stores must be done at the very end of the x86 execution pipe and retired in order. Thus there's no way for a load to move after a store - in fact we've seen that loads can get executed speculatively, but there's no way to execute a store speculatively because that would invalidate the order coming out of the chip.

The remaining barrier that we don't have is #StoreLoad , to keep loads after stores. Note that #StoreLoad is automatic for access to the same variable (and doesn't cause a Load-Hit-Store stall on x86 because we have the awesome thing where the load can peek into the write buffer for the new value), however it its not automatic for *different* variables. eg.

g_var1 = a;
b = g_var2
The read of var2 can move before the write to var1. As I've said before, this is *good* you want your processor to be able to reorder things. You just need to be able to tell it when not to. On x86 the only way to make a #StoreLoad barrier is with an mfence or any interlocked op.

Interlocked ops on x86 are always sequentially consistent. That makes coding easier, but means they are very expensive. (Itanium has some Acquire & Release interlocked ops ; presumably future Intel stuff will have some weaker memory model instructions). Basically to execute a "lock" instruction the x86 CPU does the pseudo code that I wrote above for CAS. First it goes out to the bus and reserves a cache line (x86 interlocks actually hold whole cache lines which makes any op inside a cache line atomic) then it runs the cmpxchg instruction or whatever you wanted, then it releases the cache line.

When it reserves the cache line, all other processors who try to acquire that line are blocked and might stall. When it releases the cache line, all processors that had work based on that line in flight must dump it. That is, any speculative reads and all computations based on them must be invalidated from the deep execution pipe of modern x86 chips.

On modern multi-core chips with shared cache this is bad, but it's not awful. On SMP machines where the cache line reservation has to go out to bus, it's very bad indeed. And what's really awful is that a CAS on one chip doesn't just slow down that chip - it syncs the bus, and it *really* slows things down if anybody else is touching that cache line.

Even though Interlocked CAS is "lock free" it is by no means cheap, and the people who write really efficient lock free stuff try to do so even without CAS.

Note that the crucial issue here was *contention* over a *cache line*. That is a common theme and is the main thing that you should actually worry about. If you have some shared data item which is heavily contended (lots of threads need to hit it often), then changing it from a CritSec to an Interlocked CAS will help some, but the CPUs are still going to be stalling on each other like crazy. You need to reduce the contention over a single data items. You also need to worry about so-called "false sharing". Any data items on the same cache line count for contention, so even though you were good boy and made your threads work on separate variables, if they are on the same cache line you still pay much of the price because they will sit and fight over that cache line over and over.

Okay, that's enough for basics. Next time we'll look at some techniques for lock free algorithms.

02-24-09 - Low Level Threading - Part 3.0

Intro to Lock Free Programming

Disclaimer : don't do it. Okay now let's dive in.

"Lock Free" is what this general group of techniques is usually called, but it's a bit of a misnomer. "Lock Free" just means not using any blocking thread synthronization techniques like mutexes, however the classic lock free technique of try-discard-loop is basically a spin lock. Hence there is another group of techniques called "Wait Free" to indicate when not even that kind of wait is necessary. Another class recently coined is "Obstruction Free". And in fact we don't care if we sometimes take mutexes to resolve sticky spots (a lot of people now are wisely doing this - for example queues where you take a mutex when near empty).

A lot of people take "Lock Free" to mean avoiding the OS mutex because it's too slow. That's not really a great goal and you won't see a huge speedup if you do that. Good Lock Free programming is about redesigning your code flow to minimize contention for shared resources and to spend a minimum amount of time in contention. Obviously one form of contention is fighting over an OS mutex, but fighting over a cache line is contention too and hurts badly.

Our goal is just to do fast & correct thread synchronization. We don't want to cut any corners and introduce possible races. Any tiny speed win from cheating is just not worth it.

In lock free programming we are basically trying to share some data between threads in a way that we can understand sequentially so that we can code it right. In this post I'm going to talk about some simple "lock free" coding patterns and what's going on. In the next post we'll go through some of the simple lock free data structures.

First off let's look at the simple publication pattern and what's really going on. Thread 1 sets up some work and then "publishes" it. Thread 2 waits for the work then does it.

// shared variables :
void * shared_var1;
void * shared_var2;
bool shared_published = false;

//Thread 1:

.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
shared_published = true;

// Thread 2:

while( ! shared_published ) { }
x = shared_var1;
y = shared_var2;

Already your spidey senses should be tingling. Let's go through and look at the problems :

// shared variables :
// !! should pad a cache line here to keep away from other pages
// !! need to align to required atomic alignment
void * shared_var1;
void * shared_var2;
bool shared_published = false; // !! <- bool is not safe because it might be a char which is not atomic on many systems
// !! should pad a cache line here to keep away from other pages

//Thread 1:

// !! need to ensure all these assignments are atomic
.. set up x & y ..
shared_var1 = x;
shared_var2 = y;
// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !
shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set
// !! can never touch var1 or var2 again

// Thread 2:

while( ! shared_published ) { } // !! read must be acquire (*2)
x = shared_var1;
y = shared_var2;

Okay, so this simple code already has lots of issues. The exact details of those issues depends on your system, and we don't want to make the client code full of lots of details about the exact system. We want to write client code that is *algorithms* and then have adaptors so that our code is portable and efficient on various systems.

The first little issue mentioned is that the variable loads and stores should be atomic, which just means done all in one go. On the vast majority of systems, 4 and 8 byte types that are aligned to 4 or 8 bytes are atomic. Smaller types are often NOT atomic because the actual load is done by loading a word then masking down to a byte, so beware of bool! Similarly larger types are often not atomic, and 8-byte types that are not 8-byte aligned are not atomic! (on x86, anything that is wholely contained in a cache line is atomic - x86 always reads & writes whole cache lines as one atomic op - more details on hardware later). Anyway, we don't want out client code to know about this. Clearly what we want is some kind of Atomic< > template that does all this for us.

Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;

This Atomic< > just ensures the type is a valid size to be atomic on this platform, and is aligned to the necessary alignment for this platform, it also overrides operator = to make sure the assignment is a simple atomic op.

Next we have this to deal with :

// !! make sure the compiler doesn't reorder writing shared_var1 & 2 versus published !

Remember the compiler optimizer will go nuts reordering things. Instructions don't necessarily actually happen in the order that you write code. The key thing is that *we want that*. The optimizer reordering things is what makes our code fast. We don't want to just turn off the optimizer. For example, we could make all our variables "volatile". I'm not going to talk much about the new magic VC2005+ volatile, but in the old meaning of volatile, all volatile memory ops are executive in program order. So if we make everything volatile we would enforce
shared_var1 <- x
shared_var2 <- y
shared_published <- true;
But really we would be happy to get
shared_var2 <- y
shared_var1 <- x
shared_published <- true;
shared_var2 <- y
shared_var1 <- x
anything_else <- z
shared_published <- true;
We just care about var1 and var2 being done before "published". To make fast code we want the constraints to be as *precise* and as *weak* as possible for correct code.

The key point is at (*1) :

shared_published = true; // !! (*1) var1 and var2 must be flushed before published flag is set

What we are doing here is "Release" memory semantics. When you do a Store-Release it is a "publishing" store - it ensures that no earlier writes can pass the Release. Obviously this Store-Release has to encapsulate constraints on various parts of the system : it must tell the optimizer not to reorder writes, it must tell the CPU that those writes must be done in order, and it must tell the cache controllers that those affected memory locations must be flushed in order.

Note : you could accomplish this same thing with a barrier :

shared_var1 <- x
shared_var2 <- y

# Write Flush Memory Barrier

shared_published <- true;
but you don't want to ! That's way way more expensive. BTW another way to write the "Release" is in the old style load-load blocking way :
shared_var1 <- x
shared_var2 <- y
shared_published <- true;
Here #StoreStore means that stores above the # cannot move past stores below the #. In terms of how you specify the program memory constraints, you can either do it with these reorder blocks between the instructions, or with the modifiers *on* the loads and stores. C++0x has chosen to do the latter, so I'll go with that (and it maps more closley to certain hardware). Java uses the #StoreStore kind of method, as does the Linux Kernel with the smp_barrier macros.

From now on I am going to writing all the Lock Free code using Atomic< > types and special stores and loads. So our basically publication now looks like :

// shared variables :
char pad1[CACHE_LINE];
Atomic< void * > shared_var1;
Atomic< void * > shared_var2;
Atomic< int > shared_published = false;
char pad2[CACHE_LINE];

//Thread 1:

.. set up x & y ..
StoreRelaxed( &shared_var1 , x );
StoreRelaxed( &shared_var2 , y );

StoreRelease( &shared_published , true );
// now don't touch var1 and var2

// Thread 2:

while( ! LoadAcquire(&shared_published) ) { } // (*2)
x = LoadRelaxed(&shared_var1);
y = LoadRelaxed(&shared_var2);

We haven't talked about the Acquire at (*2), but it's basically just the complement to a StoreRelease. a Load-Acquire ensures that no later loads can move up before the Acquire. In terms of the Java-style barriers these translate to :

x = LoadAcquire(ptr) :
    x = *ptr; // atomically

StoreRelease(ptr,x) :
    *ptr = x; // atomically

Note the way they put the barrier on opposite sides of the memory op. (Also recall this is NOT an actual memory "barrier" instruction - it is a blockage for the memory ops, which may or may not actually requires a memory barrier on your platform).

Now, obviously Acquire and Release are almost always what you want :

localVar = LoadAcquire( &shared_var );
// acquire prevents any loads inside this block from moving up

x = localVar;
y = z;
/// etc. ...

// Release prevents any stores in here from moving down

So for example, entering a Mutex should obviously be an Acquire, and releasing the Mutex should be Release - that way work you do inside a mutex is done while you own the mutex (or it looks like it was done while you own the mutex). Everyone who uses critsecs is secretly using Acquire/Release to wrap their critsec work.

I should note that there is another type of load barrier called "consume" or smp_barrier_depends() in Linux ; this is only necessary on processors that don't automatically protect dependent reads. The only known processor that doesn't do that is the Alpha, so I'm not going to deal with that issue. If you care about full correctness you should also go through and mark up barriers that do dependent reads. It's really a strange thing, it means :

object = g_shared; (*1)
# consume memory barrier
int x = object->m_var; (*2)

On Alpha, "object->m_var" can be read "before" g_shared is read. WTF !? Obviously it can't actually be read before object is read, because you need the address of object from (*1) before you can even execute (*2)- but it can read older memory at (*2) than it did at (*1). That is, on the Alpha cache lines are not temporally ordered. So say some other processor has changed the address of object and also the data inside object. You read at (*1) and for some reason you get a fresh new cache line there and get the new value of object. When you read at (*2) you use the new value of object of course, but the cache line it points to is old and it hasn't refreshed, so you read the old value. Thus (*2) acts as if it happened earlier in time than (*1).

Like I said, all existing significant chips do not allows this, so we're not going to worry about it, but you can see how all kinds of different parts of the hardware can cause operations to act like they are going out of order. The goal of our Acquire/Release type barriers is to make instructions act as if they are going in order whether they actually do or not. We are trying to specify the memory constrains semantically in our code, and we want the code generator to make the right stuff for the architecture. In C# or Java this is easy to do because they have memory model semantics, in C++ it's messy, you have to write your own. C++0x will fix this but sadly that's a ways out for most of us.

Now I'll give you a bit of a punchline. On x86 LoadAcquire and StoreRelease are both just "mov". That is, they are 100% free. All (atomic) movs on x86 are acquire and all stores are release (ignoring SSE). It is still important to mark up your code semantics with acquire and release for portability, and also so that the compiler actually generates a mov and keeps your other operations inside. For example on MSVC x86 you can implement them like this :

x = LoadAcquire(ptr) :
    mov x , *ptr;
    ReadBarrier(); // (*1)

StoreRelease(ptr,x) :
    WriteBarrier(); // (*2)
    mov *ptr , x;

Note that the "Barriers" at *1 and *2 are not actual memory barriers, they generate ZERO instructions - those are compiler barriers that stop the compiler from moving stuff around. That is an implementation detail and your client code shouldn't have to know about.

This should clarify the win from using memory model semantics. We asked for what we needed - Acquire and Release - and we got them *for free*. Consider instead if we had not asked for them and had used a true memory fence instead - we would be adding 100+ clocks of expensive for no reason.

On Itanium LoadAcquire needs to actually generate a "ld.acq" which is a bit more expensive than a relaxed load, but not nearly as expensive as a memory fence. On PowerPC an implementation might be :

Load Acquire     ld; cmp; bc; isync
Store Release     lwsync; st

While I'm on this topic I should mention that the x86 has this super magic where loads can actually be invoked out of order, but they are still retired *as if* they went in order (they may not actually be retired in order). Because of this a lot of people incorrectly say "the x86 can reorder loads". Well, yes, and no. From a multi-threaded programmer's point of view the loads are not semantically reordered, however yes in terms of how the processor actually executes them, yes they do go through the reorder buffer and everything. In order to make this work, the x86 receives notifications whenever a cache line changes, and it invalidates reads that were speculatively done out of order in the wrong order and causes them to be redone in the right order. The x86 reorder buffer can also reorder loads against stores when they do not overlap. Again this usually does not affect semantic program flow (note that LoadAcquire and StoreRelease don't say anything about the #StoreLoad barrier).

Okay this has gotten long enough so we'll continue in the next part.

02-24-09 - Low Level Threading - Part 4.0

Basic lock-free algorithms

We want to write some lock free algorithms to mutate possibly shared data. (Obviously the best possible lock free algorithm is to work on thread local data; hey it's lock free! the next best lock free method is to work on shared read-only data; hey you can read it without locking; so we assume you've done that as much as possible and what remains is the threads need to write something shared). The general concept that we will follow in all cases is the idea of having a constraint on the data structure to preserve its validity, and then doing operations that keep the data structure valid every step of the way.

Other threads may see the changes you are making at any point of your flow, so if you mutate the data in some way that messes up the data structure, and then fix it on the next line, they might get the bad stuff. So every single operation needs to atomically maintain the validity of the data structure.

So for example if data is something stupid like Data { int x; int must_be_negative_x; } , if you write code like :

Data.x = 7;
// (*)
Data.must_be_negative_x = -7;
At the (*) you can get swapped and Data is in an invalid state. (Some consumer is sitting around doing int zero = Data.x + Data.must_be_negative_x).

One way to do this that's most obvious is with large atomic operations that swap the data structure from one good state to another. If you pretend for a moment that we can work on arbitrarily large objects, we could just copy the whole data structure, change it to the new state, and then write it back :

Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...
StoreRelease( &g_sharedData, newData );

Note that using Acquire/Release here ensures that anything we do in "mutate" is properly scoped. This correctly takes sharedData from old to new in one atomic step so that it stays consistent. However, something else bad can happen. While we were working on it, somebody else may have put their change in too. We don't want to just stomp on their change with our Store. The solution is to use CAS and retry :

Data localData = LoadAcquire( &g_sharedData );
Data newData = localData;
.. mutate newData ...
} while ( ! CAS_Release( &g_sharedData, localData, newData ) )

Here we take a local copy, make some changes to bring it to a new state, and then we try to publish it to sharedData. However, when we publish, we check to make sure that sharedData is what we saw when we started - it must be equal to localData or we will not write it. If someone else changed it while we were working, we go back to the start and try again. (publish means make visible to other threads; at the time that you publish you must ensure that what you have written is in a consistent valid visible state for other threads).

This is a "change-cas-retry" loop and it's a very common pattern in lock free programming. It ensures that we change data atomically so it is made visible as a unit, and also that all threads that are changing Data get to make their changes public - nobody blindly stomps on someone else's change without looking at it.

Now, as handy as the CAS-retry loop is, it's not really great lock free programming. If you look at it a second you can see that it's basically the same as a spin lock mutex. If there is no contention, then the CAS-retry steps right through just like linear code and the CAS succeeds and we're done. If there is contention, someone else wrote data, we loop and try again until there is no contention. Compare to this :

int mutex = 0;
Data g_sharedData;
    while( ! CAS_Acquire(mutex,0,1) )
    Data localData = LoadAcquire( &g_sharedData );
    .. mutate localData ...

Again we do a CAS, if there's contention we loop, once there's no contention we run through the linear code and publish. The CAS-retry loop is maybe slightly faster than the simple mutex, but if each thread actually get a hardware core and they are both running, then the difference is very small. The big advantage of the CAS-retry method over a plain old mutex comes from the fact that applications and operating systems are very variable.

Consider two threads that are fighting over g_sharedData, both trying to write it. Thread A is in the middle of working on sharedData when suddenly it gets stopped. There are various reasons it might get stopped, maybe someone suspended its thread, or maybe it just got swapped out to run another thread on that core, or maybe it had some OS kernel problem that stopped it (like it crashed). (BTW I'm assuming that you know you should not be doing things like File IO inside a mutex - in fact you should avoid any OS calls insidea mutex or anything that might stall or swap your thread; always hold mutexes for very short simple operations only). In any case it is stopped right in the middle of working on sharedData. If we used the simple mutex, then thread A is now blocking all other threads from working on sharedData. Our app is halted until thread A can run again. In the CAS-retry method, thread B can go ahead and do its work - it will see no contention because thread A is no longer changing sharedData.

One of the tricky things we have to deal with is that we have threads that are running on real separate hardware cores so they are stepping along at the same time, and if your locked area is short they will not ever stall on each other for long. But we also have threads that are running on cores with other threads, and those threads do NOT run at the same time, so one of them can appear to be frozen for a very long time when it's swapped out while the other runs. (there are a lot of other ways that these scenarios are quite different; in general we have to handle both and wind up making compromises; for example with a real hardware thread per core there's no point in ever sleeping at all, with lots of threads on one hardware core you have the exact opposite issue - spinning at all is pointless, and lots of micro work can cause thrashing such that you spend almost all your time in thread switches, and things like backoff become crucial).

If you like, you can imagine that a CAS-retry is a lot like a spin-lock which is only held while the contended item is actually being worked on ! BTW in both cases here you don't need to sit and spin when there is contention. You can see there's contention and just abandon the change and do something else. It's somewhat rare that that is actually practical. (this is like TryEnterCriticalSection for those from that world). CAS-retry only spins as many times as the number of times that someone else updates the data.

Now of course in the real world we can't atomically change a large data object, usually we are limited to 32 or 64 bits of atomic data. So the solution of course is to use a pointer and to CAS in the new pointer. This is pretty trivial but you do have to keep in mind that you are consuming & publishing a pointer, and that once you release it, you no longer own it :

const Data * localData = LoadAcquire( &g_sharedData );
// localData is a pointer to *shared* read-only data !
// I do not own it, it is owned by g_sharedData

Data * newData = new Data;
.. mutate newData ...
// I make my own local data object to prepare the next revision
// I can do anything I want in here on newData 

newData->stuff = Compute( localData );

// now I am publishing newData -
//    if the CAS succeeds the newData instantly becomes shared
//  and anyone else can be reading it
// the CAS here must have Release memory order so that the memory ops
// I just did in mutate are made public before the change to g_sharedData

if ( CAS_Release( &g_sharedData, localData, newData ) )

// newData was NOT published so I am still the exclusive owner here
delete newData;


(obviously you wouldn't actually new/delete each time, this is just illustrating the flow of ownership).

Next time we'll use CAS-retry to make a simple LIFO stack.

Oh, let me also take a second to introduce the idea of the "Abort". I find it easier to talk about many Lock-free algorithms this way. A lot of the times in LF code we go to change something or check something, we hope to find something out, but we see there was contention so the values are changing and we can't get a good result, we just Abort and retry.

So for example the C++0x CAS would be :

LF_Result CAS(T * pVar, T & oldVal, T newVal)
    if ( ! trylock(pVar) )
        return Abort;
    bool match = (*pVar == oldVal);
    oldVal = *pVar;
    if ( match )
        *pVar = newVal;    


    return match ? Success : Failure;

We now return a "LF_Result" which is {Abort, Success, Failure}. Abort means ignore the results and just retry because there was contention. One way to think about a lot of Lock Free algorithms is to imagine that your old deterministic code which only had two boolean possibilities now has 3. (also "void" deterministic code that had no return value at all now has {Abort,Success} options).

As I mentioned before the usual recourse when you get an Abort is just to retry, and you are gauranteed to only get Aborts as long as there is active contention. But you could also take the Abort and do something else entirely.


02-23-09 - Low Level Threading - Annotated Linkies

Part 1 : Background

Stuff about C++0x. I now believe that C++0x is the right way to talk about multi-threading. You require the basic type to be atomic (eg. fit in a machine word), and then you add specific memory order constraints to each operation - load, store, or exchange - that constrain the program to be correct. The goal is writing good lock-free code is to find the minimum necessary and sufficient memory constraints to make the program correct. The lighter the constraints that you need to apply, the more efficient it can be.

Multicores and Publication Safety � ��Bartosz Milewski�s Programming Cafe
C++ atomics and memory ordering � ��Bartosz Milewski�s Programming Cafe
Hans Boehm : Why atomics have integrated ordering constraints

Stuff on the basics of "volatile" and memory ordering and how to make the compiler do what you want. The summary is that "volatile" has changed meaning so much that you should probably just not use it at all. The main issues to worry about are : compiler reordering and what the compiler does with your code, memory ordering constraints (cache/bus locking and flushing constraints), CPU reordering of instructions and memory timing vs cpu instruction timing.

It Goes To Eleven The Joys of Compiler and Processor Reordering Why You Technically Need Read-Side Barriers
Kang Su's Blog volatile, acquirerelease, memory fences, and VC2005
Intel - Volatile : Almost Useless for Multi-Threaded Programming Memory ordering, barriers, acquire and release semantics niallryan.com
The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models
Dr. Dobb's volatile vs. volatile January 8, 2009
Acquire and Release Semantics
Memory barrier - Wikipedia, the free encyclopedia
Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8 Visual C++ Language Visual C++ MSDN Forums

The MSDN stuff is all good :

Understanding Low-Lock Techniques in Multithreaded Apps
Rules and Limitations for TLS
Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Synchronization and Multiprocessor Issues (Windows)
Concurrency What Every Dev Must Know About Multithreaded Apps
Break Free of Code Deadlocks in Critical Sections Under Windows
Under the Hood Reduce EXE and DLL Size with LIBCTINY.LIB
Multiple Approaches to Multithreaded Applications - Intel� Software Network

I think that "cmpxchg" on x86 with no lock prefix is equivalent to a C++0x compare_exchange(memory_order_acq_rel) , as opposed to "lock cmpxchg" which is memory_order_seq_cst :

Boehm, Hans - locks on X86
x86 synchronization primitives

This next link I believe holds the secrets to the mysteries of the x86 load reordering. All loads on the x86 are "acquire", which means they appear to go in order. (an "acquire" load means that later loads cannot move before the acquire, but earlier ones could move after it - when all loads are acquire it means they all go in order). However, we all know the x86 can speculatively execute loads out of order. WTF !? Something is seriously wrong with those two facts. In fact, loads are executed out of order, but if writes cause any changes during that time, they are invalidated and the loads are reissued :

details of x86 load reorder

This post gives you some idea of what the processors have to do to enforce the appearance of in-order loads. Also note : "Any memory access that falls within a single cache line is guaranteed to be atomic". (x86 specific)

Gustavo Duarte talks about Cache details

General note : I find it very valuable to understand what is going on under the hood. HOWEVER you should not write code that relies on the details of one processor or OS or compiler implementation. The code should be descriptive about the constraints that it needs, then you can have library routines which enforce that model for the specific OS/compiler/CPU.

Some general articles about how lock-free and CAS are not a panacea. It's important to remember that a CAS (any "lock" instruction) is a bus lock which is very slow - not just for your processor, but for all other processors looking at that cache line! In even moderately contended algorithms, the cost of CAS for N processors goes up like O(N). The means if you tried to implement a huge multiprocessor machine where people used Interlocked ops to synchronize, as you added more and more processors eventually the whole thing would be running in lockstep with every processor waiting on the bus all the time. In fact, a CAS-spin-retry is really fundamentally not much different than a short Critical Section.

Joe Duffy's Weblog - CAS Performance kills you
Dr. Dobb's Maximize Locality, Minimize Contention May 23, 2008
Dr. Dobb's Lock-Free Code A False Sense of Security September 8, 2008
What is the real cost of LOCK on x86 ?

Good article about how Windows SLIST deals with 64-bit pointers and the requirement to do CAS with only a 64-bit atomic in early x64 chips. With 32-bit pointers we often make a {pointer,counter} object in 64 bits and use cmpxchg8b. New chips have cmpxchg16b. Sadly there was a period when Intel made 64-bit chips with no cmpxchg16b.

Alex Ionescu�s Blog � Blog Archive � Behind Windows x64�s 44-bit Virtual Memory Addressing Limit

A bunch of Herb's junk : Herb develops a lock free queue (FIFO) which is basically just a SPSC queue, and then makes an MPMC variant by putting a lock on each end. This is a general technique - any SP or SC structure can be changed to MP or MC by adding a "producer lock" or "consumer lock". SP or SC work by knowing that only one owner is touching that end at a time. That "owner" can either be a specific thread, or it can be designated as the holder of the lock.

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007
Dr. Dobb's Lock-Free Queues July 1, 2008
Dr. Dobb's Writing Lock-Free Code A Corrected Queue September 29, 2008
Dr. Dobb's Measuring Parallel Performance Optimizing a Concurrent Queue December 1, 2008

Big collections of articles :

Dr. Dobb's Go Parallel Features
Computer Laboratory -
AMD Developer Central - Articles & Whitepapers
Sun Scalable Synchronization Research Group
Sun Labs Scalable Synchronization Research Group Publications

Worker pools and queues :

Multithreaded Producer-Consumer The Easy Way
Larry Osterman's WebLog So you need a worker thread pool...
Re Non-blocking queue for the 2 threads
Joe Duffy's WSQ 1
Joe Duffy's WSQ 2
Joe Duffy's WSQ 3
Creating a thread safe producer consumer queue in C++ without using locks - Cluebat-man to the rescue

Some shit about singletons and such :

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004
Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN
Double-checked locking - Wikipedia, the free encyclopedia
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose!
The Old New Thing An auto-reset event is just a stupid semaphore

Major lock-free code bases :

Paul E. McKenney Read-Copy Update (RCU)
Atomic Ptr Plus Project
TBB Home
Download details Parallel Extensions to .NET Framework 3.5 June 2008 CTP
Getting Started with the .NET Task Parallel Library Multi-Core Case Studies
Intel� Performance Libraries - Intel� Software Network
Intel� Thread Checker 3.1 for Windows - Intel� Software Network
Intel� Thread Profiler 3.1 for Windows - Intel� Software Network
Synchronization Algorithm Verificator for C++0x - Intel� Software Network
transforming single-threaded allocators... - Intel� Software Network

Some specific code samples (WARNING : OFTEN WRONG !! I have yet to find any reliable source of lock-free code on the net that you can just grab and use)

opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)
Kernel space Ticket spinlocks - LinuxWorld
CodeProject A simple Win32 readerswriters lock with reentrance. Free source code and programming help
CodeProject Single Producer Single Consumer, with no critical sections. Free source code and programming help
Lock-Free Linked List
SourceForge.net Spin-X Engine
Lock free double ended queue .mischief.mayhem.soap.
Single producersingle consumer bounded queue .mischief.mayhem.soap.
Lock-free, multi-threaded Mandelbrot .mischief.mayhem.soap.

ADDENDUM : Found some more goodies : Java Memory Model guide for compiler writers by Doug Lea. This is a very good overall discussion of memory models and how to implement them on various architectures.

Example implementation of C++0x on Power architecture by Paul McKenney.

fences in C++0x by Peter Dimov.

Joe Duffy on why critsecs need fences

Bartosz on the rare need for fences on x86

Part 2 : V'jukov and Thomasson

Now we get into the part that's actually useful. (note the are places where they "sketch" algorithms - obviously don't try to grab a sketch code snippet and stick it in your code and assume it works)

Any lock-free code that has not been tested with Relacy is basically worthless. Anybody can write lock-free code and it "seems to work" - it's shit, it's garbage, it's no help at all. The hard part is in the testing and absolutely being sure it's correct, just sketching up something that "looks right" can be done by any monkey.

Relacy Race Detector Google Groups

Dmitriy V'jukov who wrote Relacy also has lots of great posts around the net. The most valuable information we will find is by him and one other guy :

Chris Thomasson, author of AppCore : the code in AppCore is mostly ASM and x86 specific. Chris has lots of great writing on lock-free stuff around the net, some more links to him will follow. He believes that doing a lot of this in ASM is easier because you can see the exact ordering of ops and don't have to worry about what kind of madness the compiler is doing. I think he's right to some extent, I wind up spending half my time trying to ensure the compiler generates the right single instruction. For example his SPSC queue is incredibly simple and efficient.

AppCore A Portable High-Performance Thread Synchronization Library
vzoom refcount - mostly lock-free word-based atomic reference counting

Our two heroes often reside here :

Scalable Synchronization Algorithms Google Groups

Discussion with Thomasson where he describes how the Fober LIFO is broken and how to fix it : (also contains his x86 ASM for a LF LIFO and small mention of the fact that stack can touch freed memory, but in a benign way, so that just catching the access violation and making it a retry is a good fix) :

Solving the memory issue of lock-free stack

Thomasson sketch of work-requesting system : (unlike work-stealing , work-requesting lets you used a non-thread-safe stack for your own work items) :

VERY crude work-requesting environment... - comp.programming.threads Google Groups

Thomasson and V'jukov discuss the SPMC WSDQ of Hendler & Lev :

ABP stealing deque algorithm - comp.programming.threads Google Groups

Thomasson sketches a LF SPMC WSDQ (ignore all the junk posts at the top and scroll down to Thomasson and V'jukov) :

Single producer, one-shot work queue implementation, no lock on queue. - comp.programming.threads Google Groups

In this post Thomasson sketches the method of using an MPMC stack and just exchaning the head with NULL to grab the whole thing, then reversing it to make it FIFO instead of LIFO. This is a very efficient and simple way to do an MPSC FIFO. Also more interesting discussion by V'jukov at the bottom about garbage collection and such :

Wait free queue - comp.programming.threads Google Groups

A related thread - this starts off with the MPSC method using stack-swap, but then V'jukov drops a gem of a new MPSC method :

A Correct, Lock-free FIFO with a single CAS per complete putget cycle - Scalable Synchronization Algorithms Google Groups
V'jukov intrusive MPSC node FIFO here

Another WSDQ where Thomasson and V'jukov try to straighten things out :

Double ended queue - review - comp.programming.threads Google Groups

Again discussion starts off with mystification about the x86 details, and leads to Thomasson posting his SPSC and straightening out the clowns :

Memory barrier + volatile (C++) - comp.programming.threads Google Groups

V'jukov's very cool node allocator for FIFO queues that does no atomic ops at all ! (amortized). FIFO's are a very bad case for standard multi-threaded allocators like tcmalloc, because every single block is being allocated on one thread and freed on another, which requires them to do a cross-thread pass-back (or causes heavy slab contention).

Lock-free node allocator for fifo-queues - Scalable Synchronization Algorithms Google Groups

V'jukov's simple example of an SPSC queue :

Single-ProducerSingle-Consumer Queue - Intel� Software Network

V'jukov talks about CAS performance and issues of MPMC and other structures, mentions SList exception catching technique : (go to page 2)

Concurrent stack - Intel� Software Network

Finally : Thomasson's slab allocator

Dmitriy sometimes uses the opposite meaning of "head" and "tail" from how I think and it tilts the hell out of me. I always thing of "head" as where you push, and you pop from the tail. So like stacks only have a head, you push & pop the head, FIFOs you enqueue on the head and dequeue on the tail, he usually does the opposite. That is, in a singly-list linked I think of links as pointing from the head towards the tail, so if you walk links you get from the head to the tail.

ADDENDUM : Found some more goodies :

Thomasson describes vZOOM architecture

V'jukov Memory Passing Queue good for thread caching allocators passing blocks back to the main thread for example, but also other things.

old rants