02-26-09 - Low Level Threading - Part 5.1

SPSC FIFO : the simple message queue

Phew, I've lost the spark that was fueling me to write these, but I want to get one more out before I quit : the SPSC FIFO. This is a simple and useful structure, and I think it's a nice example because it illustrates many different properties from the MPMC stack.

SPSC means single producer single consumer ; one thread makes nodes, another uses them. Immediately we don't need to worry about all the GC issues and node lifetime because we have just a single consumer. We can easily make it two-way just by having an SPSC going in each direction. We can easily make it MPMC by using many of them, eg. for 3 threads to all talk both ways to each other uses 6 SPSC FIFOs. In many cases this is actually much more efficient than using an MPMC structure.

It's a FIFO so there are two ends, and we know only the producer thread is touching the input end, and only the consumer thread is touching the output end. Thus, even though it is a multi-threaded structure, the calls to "enqueue" and "dequeue" are always single-threaded against themselves. This is crucial.

You can also make either end of the SPSC into multi by putting a critsec on that end and changing the convention for who the "owner" is : rather than the "producer" being a certain specific thread, you define it to be the locker of the producer critsec. Now you have an MPSC queue which requires a lock to produce but none to consume. Or you could do SPMC or MPMC this way. This is okay in some cases, not in others. Herb Sutter designed a queue like this in Dr. Dobbs recently (see link in annotated links area). It's certainly better than having a single lock for the whole queue, because if only one thread is producing or consuming they will get the lock with no contention, but obviously it's not lock-free.
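As a sketch of that "lock one end" idea, here's a minimal C++11 version (the names are my own invention, not Sutter's actual code) : a mutex on the produce side defines "the producer" to be whoever holds the lock, while the single consumer still pops with no lock at all, using the same dummy-node linked list developed below :

```cpp
#include <atomic>
#include <mutex>

// MPSC queue sketch : locked produce end, lock-free consume end.
// Hypothetical names; a dummy node keeps the list non-empty.
struct MpscNode {
    std::atomic<MpscNode*> next{nullptr};
    int                    value{0};
};

struct MpscQueue {
    std::mutex             produce_lock; // serializes the producers
    std::atomic<MpscNode*> tail;         // owned by whoever holds the lock
    std::atomic<MpscNode*> head;         // owned by the single consumer

    explicit MpscQueue(MpscNode* dummy) : tail(dummy), head(dummy) {}

    void push(MpscNode* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        std::lock_guard<std::mutex> lock(produce_lock); // become "the" producer
        MpscNode* back = tail.load(std::memory_order_relaxed);
        back->next.store(n, std::memory_order_release); // publish to consumer
        tail.store(n, std::memory_order_relaxed);       // producer-private
    }

    // single consumer : no lock, no CAS; returns the old dummy with the
    // front element's value slid into it, or nullptr if empty
    MpscNode* pop() {
        MpscNode* front = head.load(std::memory_order_relaxed);
        MpscNode* next  = front->next.load(std::memory_order_acquire);
        if (next == nullptr) return nullptr; // empty
        front->value = next->value;          // slide data into the dummy
        head.store(next, std::memory_order_relaxed); // next is the new dummy
        return front;
    }
};
```

Note the consumer path is identical to the pure SPSC consumer; only the producers pay for the lock, and only when they actually contend with each other.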

You can do a FIFO in an array, but it's actually much harder that way. If you had an infinite amount of memory for your FIFO it would be very simple indeed, but you don't, so you either have to make it circular or dynamically resize the array. Both are possible lock-free (or near-lock-free anyway), but they both break the magic property that the consumer and producer don't fight over the structure. If it's circular you have to worry about the head running into the tail and suddenly you have the producer and consumer contending. So we'll go with nodes and a linked list.

The FIFO is a linked list from the head to the tail :

m_head -> node -> node -> node -> m_tail

We push new nodes on the tail and pop them from the head. We want to do both of these as simply and unconditionally as possible, so we don't allow a totally empty queue. We always have a dummy node in the Queue so that when the queue is empty, head == tail.

In particular it should be clear that the Push operation is very simple and unconditional :

newNode->next = NULL;
m_tail->next = newNode;
m_tail = newNode;

When the Queue starts, m_tail points at the dummy, we just tack something in his next. Then at all times we know we can just jam into the next pointer at the end to add something to the list. The key is partial ownership of the structure by the different threads.

The consumer owns the "m_head" pointer - but not necessarily the node that m_head points to (if the queue is not empty then it also owns that node). The producer owns the "m_tail" pointer and also the "->next" pointer on the last node (which is also the first node if the queue is empty). The funniest case occurs when the queue has one element in it - the dummy plus one more - in that case the producer owns the "->next" pointer in the last node, but the consumer owns the data in the last node ! More on this later.

Okay let's do the Lock Free Push in detail :

struct SPSC_FIFO_Node
{
    Atomic< SPSC_FIFO_Node * > m_next;

    void * m_data;

    // optional cache line padding here
};

struct SPSC_FIFO
{
    Atomic< SPSC_FIFO_Node * > m_head;

    // cache line padding here

    Atomic< SPSC_FIFO_Node * > m_tail;

    // optional cache line padding here
};

void Push(SPSC_FIFO * fifo, SPSC_FIFO_Node * node)
{
    // at this point I own "node" completely
    // I wrote node->m_data before coming in here
    StoreRelaxed(&(node->m_next), NULL); // relaxed store because I own it

    // get the tail pointer - again I own it so relaxed load is fine
    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    // as soon as I set back->next here , node is now visible to the consumer
    // I no longer entirely own node (I do still own node->next)
    // Release here to ensure node->next and node->data are written first
    StoreRelease(&(back->m_next), node); // (*1)

    // now advance tail -
    // this is just for me, so relaxed is fine
    StoreRelaxed(&(fifo->m_tail), node);
}

No atomic RMW's at all! No consistency checks - just careful memory ordering. These are slightly subtle :

(*1) is the point where the consumer might be able to get at node. Obviously the stuff inside node must be out to him before he pops node. You do not need any kind of store fence here when you use the Release memory constraint. We're writing data out in a stream :

node->data = A;     // 1 - Relaxed
node->next = NULL;  // 2 - Relaxed
back->next = node;  // 3 - Release

Release means this can go out  1,2,3 or 2,1,3 , but never 1,3,2 or something invalid.

Similarly when the consumer Pops next, he will use an Acquire :

popped = head->next; // 1 - Acquire
head = popped;       // 2 - Relaxed
data = popped->data; // 3 - Relaxed

Again this can go 1,2,3 or 1,3,2 but never a bad order that would see whack data.

BTW on most architectures (*1) doesn't actually even need to be Release, because the loads the consumer does are dependent, and dependent loads act like acquire/release on every major architecture - but pretend you didn't hear that.

Also note why we used the dummy - the tail can never be popped. If we didn't have that you could have sequences like :

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    .. now consumer thread runs and pops everything ...
    .. !! back gets freed !!

    StoreRelease(&(back->m_next), node); // !! writes to freed memory !!


However, the dummy is a bit weird in pop. After we push a bunch of nodes, it's sitting there at the head with nothing in it. If we pop a node, we pop the dummy. We can't throw it away, because we need a dummy to always be in the queue so that we never pop the last node. A common pattern used in other cases is to repush the dummy (this is used in the MPMC FIFO and the MPSC FIFO - if you pop the dummy, you just push it back on the other end). But we can't do that here because we are the consumer, not the producer, and this is SPSC, so we don't have access to push. Thus we must use the method of sliding the dummy down the list.

It looks like this :

    // m_head is totally owned by me so I can load relaxed :
    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));
    // front is the current dummy, it is never null
    // front->m_data is garbage

    // (*1) 
    // front->next is the sync point with the producer !
    // if front == tail we are sharing this variable
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));
    if( next == NULL )
        return NULL; // empty

    // next exists ; now the node "next" is mostly owned by me
    //  (the producer owns next->m_next)
    // front will be returned, it is wholly owned by me

    // shift data from next to front :

    void * data = LoadRelaxed(&(next->m_data));

    // update m_head - relaxed, we own it
    StoreRelaxed(&(fifo->m_head), next);

    // copy data into front :
    front->m_data = data;

    // next is now the dummy
    //  next->m_data is garbage (it's a dupe of what we return here)

    return front;

Again it's pretty simple, in fact it's harder to understand that it works than to just write it and use it ;)

It's interesting to look at a slightly different possible way to write Pop. This alternative way is often suggested around the net, but I believe it's worse. Instead of checking front->next to detect empty, it checks head == tail :

    SPSC_FIFO_Node * front = LoadRelaxed(&(fifo->m_head));

    SPSC_FIFO_Node * tail = LoadAcquire(&(fifo->m_tail));
    if ( front == tail )
        return NULL; // empty

    // this is all the same :
    SPSC_FIFO_Node * next = LoadAcquire(&(front->m_next));

    front->m_data = next->m_data;

    StoreRelaxed(&(fifo->m_head), next);

    return front;

If we use this Pop_Alt with our Push there's a race condition. The problem is in these two lines of Push :

    StoreRelease(&(back->m_next), node); // (*1)
    StoreRelaxed(&(fifo->m_tail), node); // (*2)

What's the problem?

At (*2) one possibility is that the ->next pointer is set, but m_tail is not updated yet. Thus when Pop_Alt is called, it will see front == m_tail and return empty even though the node is there. That's not the problem. In fact there's nothing wrong with that. We never said that Pop would *immediately* be able to see a new node if the Push wasn't done yet. In fact with Lock Free stuff you can never guarantee the timing of memory visibility.

The problem at (*2) is that the stores don't have to go in order because we used a relaxed store for m_tail ! That was okay before because Pop never touched tail, but now if they run out of order :

    fifo->m_tail = node

    back->m_next = node

tail gets updated first so Pop_Alt thinks the Queue is not empty. It tries to pop, but ->next is not set yet so it gets NULL ! Remember that "Release" does not prevent stores below it from moving up, only stores above it from moving down.

Obviously the fix is just to ensure that the stores go in the right order :

void Push_Alt(SPSC_FIFO * fifo, SPSC_FIFO_Node * node)
{
    StoreRelaxed(&(node->m_next), NULL);

    SPSC_FIFO_Node * back = LoadRelaxed(&(fifo->m_tail));

    StoreRelease(&(back->m_next), node);

    StoreRelease(&(fifo->m_tail), node); // Release so these don't reorder !
}
and now Push_Alt and Pop_Alt work okay. But I believe they are not as good as our original version because the consumer and producer are now both touching the "m_tail" variable, which means they cache line fight. There's no contention in the sense that they never do a CAS on m_tail or have to loop or block on each other, but just having Pop look at m_tail causes the line to have to sync. (note that this kind of sharing is way way way better than a CAS - this is just a normal cache line propagation and it's read-only on the consumer, which is way better than if it was being written from both threads).

And now for the payoff.

If we use our macros and such to convert these actions into the right native operations that respect the memory order constraints, "Push" compiles to this on x86 :

mov [eax],0        // eax is node , [eax] is node->next
mov [eax+4],edi    // edi is the data element , [eax+4] is node->data
mov ecx,[esi+40h]  // [esi+40h] is the fifo tail
mov [ecx],eax      // [ecx] is back->next
mov [esi+40h],eax  // tail = node

Five moves for a lock-free FIFO push ! Note there are absolutely zero "lock" prefix ops or memory barriers, and yet the memory is guaranteed to be passed to the consumer safely. Also note that you really did need to be careful with the Release semantics, because if these movs are not done in the correctly constrained order, this is not thread safe !!

This is the real payoff for why we learned about the memory models and using load/store with constraints. Carefully designed algorithms can be extremely efficient. Obviously the fact that this is so efficient on x86 relies on the strong x86 memory model which says all stores are Release, but more generally if we write the code with the absolute minimum of constraints, that lets the system translate it into the most efficient method possible for the given platform.
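For comparison, here's roughly what the whole structure looks like expressed against C++11 std::atomic, which states exactly these minimal constraints portably (the names are mine, this is a sketch, and the padding is omitted; on x86 the release store compiles to a plain mov just like the listing above) :

```cpp
#include <atomic>
#include <cstddef>

// Minimal C++11 sketch of the dummy-node SPSC FIFO from this post.
struct Node {
    std::atomic<Node*> next;
    void*              data;
};

struct SpscFifo {
    std::atomic<Node*> head;  // owned by the consumer
    // (cache line padding would go here in a real version)
    std::atomic<Node*> tail;  // owned by the producer

    explicit SpscFifo(Node* dummy) {
        dummy->next.store(nullptr, std::memory_order_relaxed);
        head.store(dummy, std::memory_order_relaxed);
        tail.store(dummy, std::memory_order_relaxed);
    }

    // producer side : unconditional, no RMW
    void push(Node* node) {
        node->next.store(nullptr, std::memory_order_relaxed); // I own node
        Node* back = tail.load(std::memory_order_relaxed);    // I own tail
        back->next.store(node, std::memory_order_release);    // publish
        tail.store(node, std::memory_order_relaxed);          // private
    }

    // consumer side : slides the dummy down the list;
    // returns the old dummy (caller frees/recycles it), nullptr if empty
    Node* pop(void** out) {
        Node* front = head.load(std::memory_order_relaxed);   // I own head
        Node* next  = front->next.load(std::memory_order_acquire); // sync point
        if (next == nullptr) return nullptr;                  // empty
        *out = next->data;
        head.store(next, std::memory_order_relaxed);          // next is the new dummy
        return front;
    }
};
```

The point is that the source states the weakest constraints the algorithm needs, and the compiler is then free to turn them into plain stores on x86 or into lwsync/dmb style barriers on weaker architectures.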

One annoying thing about this FIFO is it can't be intrusive. Push must allocate a node and Pop must free one. So you need a custom node allocator recycling thing. However there is no ABA issue because there is no CAS, so you don't have to worry about SMR or anything.

One issue is that standard thread-caching mallocs are very very bad for FIFO queues. The problem is you are by design always allocating the node on one thread and freeing it on a different thread. Standard thread-caching mallocs usually optimize freeing on the *same* thread that did the allocation, and they make cross-thread frees slow. In fact, one option is to use an SPSC queue to pass the nodes back to the producer without any data just so that he can get the memory back ! Another option is to use one of Dmitriy Vjukov's custom allocators that works well with lockfree fifos.

Because this SPSC FIFO is so fast, it's totally reasonable to make N-thread multiway communication channels by multiplexing SPSC FIFOs. In fact that's what the experts in the field do. There is a problem, in that you need O(N^2) channels, which becomes bad for the large N of the future, but for N <= 8 or so it's a totally reasonable way to go.

The art of good lock free system design is in combining various simple structures to create a more elaborate information flow pattern. And don't forget to make good use of plain old single threaded structures too! For example, if you are often passing many data items at once, don't push them all one by one. Instead, link them together with a ->next pointer to make a plain old single-threaded linked list, and push the whole linked list with just one lock-free op.
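That batching trick looks something like this (again a C++11 sketch with names of my own) : the producer chains the nodes with plain stores, since it owns all of them privately, and then the single release store publishes the entire chain at once :

```cpp
#include <atomic>

// Sketch : pushing a pre-linked chain of nodes with one lock-free op.
// Same dummy-node SPSC FIFO shape as in this post; hypothetical names.
struct ChainNode {
    std::atomic<ChainNode*> next{nullptr};
    int                     value{0};
};

struct ChainFifo {
    std::atomic<ChainNode*> head;  // consumer-owned
    std::atomic<ChainNode*> tail;  // producer-owned
    explicit ChainFifo(ChainNode* dummy) : head(dummy), tail(dummy) {}

    // push a whole chain [first..last] : the producer linked the
    // intermediate ->next pointers privately with relaxed stores
    void push_chain(ChainNode* first, ChainNode* last) {
        last->next.store(nullptr, std::memory_order_relaxed);
        ChainNode* back = tail.load(std::memory_order_relaxed);
        back->next.store(first, std::memory_order_release); // one publish
        tail.store(last, std::memory_order_relaxed);
    }

    // normal one-at-a-time pop, sliding the dummy
    ChainNode* pop(int* out) {
        ChainNode* front = head.load(std::memory_order_relaxed);
        ChainNode* next  = front->next.load(std::memory_order_acquire);
        if (next == nullptr) return nullptr;
        *out = next->value;
        head.store(next, std::memory_order_relaxed);
        return front;
    }
};
```

The consumer doesn't even have to know the nodes arrived in a batch; it just sees them appear in FIFO order.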

Another cute trick I've seen is to transfer SPSC FIFOs using a more expensive structure like our MPMC LIFO :

Say for example you have N producers and M consumers, N and M both large. The producer wants to jam a bunch of nodes to some consumer and have them worked on, but it doesn't care who. Once some consumer is eating his nodes, he may continue to push them intermittently over time. So I can't just push them all in one big packet. What we do is use one MPMC LIFO or FIFO for all the producers and consumers. Our producer pushes a pointer to an SPSC FIFO. The first consumer that has free time pops the big shared MPMC structure and gets the FIFO. Now he owns it and is the single consumer - now the producer and consumer can talk directly to each other over that link efficiently without going through the slow MPMC structure.

I think I will try to write just one more of these with an introduction to Relacy Race Detector, and maybe a brief sketch of the standard thread pool / work stealing thing that people do now, and I will post my code for all this.
