02-25-09 - Low Level Threading - Part 4.3

LIFO MPMC Stack : Performance Issues

Okay, we have our LIFO MPMC stack finally working, let's talk about what it does.

It sucks. If you actually try to use it with a lot of producers and consumers, it will grind to a halt. The problem is very heavy cache line contention.

A stack only has a single entry point - the "top". Anybody who wants to push or pop must go through g_stack, so every thread fights over it. To do a Push or a Pop each thread has to do a CAS, which we've seen before is pretty expensive. But the real problem is the cache lines. Let's look at Push in terms of what it does to cache lines :

LF_Result Push_NoLoop(Node * node)
{
    // request the cache line with g_stack for my processor
    //  if someone else is doing a CAS we stall for them to release exclusive lock on the line !
    //    then we stall to fetch it from them
    Node * localStack = LoadRelaxed(&g_stack);

    node->m_next = localStack;

    // CAS first requests exclusive ownership of the cache line of g_stack and must stall on all other CAS's
    //    that also evicts the cache line of g_stack from all other processors
    return CAS_Release( &g_stack, localStack, node );
    // CAS is done, g_stack cache line gets flushed out
    //    other processors can now fetch it to their cache
}

As I mentioned before this isn't a huge disaster on shared-cache multi-core chips, but it's death on real SMP systems without shared caches.

If you even just have one thread that is pushing lots of items and another thread popping lots of items - they will be constantly trading exclusive access to that cache line, forcing the other one to evict it, back and forth over and over. (some day we will talk about a much better way to do an SPSC FIFO).

Because we know this cache line problem is so bad, we should at least put padding around g_stack. Say we have it at file scope with some other variables :

int g_x;
LFStack g_stack;
int g_y;

We know g_stack is thrashing cache - but now anybody touching g_x and g_y is also feeling the pain and causing us more problems. So we should isolate it :

int g_x;

int _pad1[CACHE_LINE_SIZE];
LFStack g_stack;
int _pad2[CACHE_LINE_SIZE];

int g_y;

This way at least our problems are only our problems. It sucks that we're introducing dependence on a hard-coded cache line size. (on x86 CACHE_LINE_SIZE seems to be 64 now).
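(As an aside, C doesn't actually guarantee that separate file-scope declarations land contiguously in memory, so if you want the pads to reliably surround the stack you can wrap the whole thing in a struct. A minimal sketch, with a dummy stand-in for the real LFStack :)

```cpp
#include <cstddef>

const size_t CACHE_LINE_SIZE = 64; // current x86, as noted above

struct DummyStack { void * m_top; }; // stand-in for the real LFStack

// Wrapping the pads and the stack in one struct guarantees they are
// actually contiguous, which separate file-scope declarations do not.
struct PaddedStack
{
    char m_pad1[CACHE_LINE_SIZE];
    DummyStack m_stack;
    char m_pad2[CACHE_LINE_SIZE];
};

static PaddedStack g_padded;
```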

Note that the Nodes have a similar though smaller issue. If your Nodes are say 16 bytes or so, and they come from a Node Pool Allocator, that allocator typically will put all the nodes in a row with each other. So now you have two different threads that pop off two nodes from a Stack - those nodes will often be directly adjacent in memory. Boom cache line contention. The big issue is always cache line contention.

One solution is to pad your nodes up to CACHE_LINE_SIZE. If you don't want to suffer the wasted memory from padding your nodes, you could use a strided allocator : rather than returning node[0], node[1], node[2], you return node[0], node[4], node[8], then node[1], node[5], node[9], etc...

Also note that all this CACHE_LINE_SIZE aligning of things is actually really bad for cache use because of the way caches are keyed - set-associative caches index by address bits, so objects placed at identical power-of-two offsets pile into the same cache sets. It's a well known optimization to make sure you *don't* place objects at exactly CACHE_LINE_SIZE offsets from each other. So you might want to align your nodes to (CACHE_LINE_SIZE+8). Sigh.

One obvious way to reduce contention on something like this MPMC stack is striping or binning. Rather than just having one stack you have 4. To Push a node you pick one of them by round robin or rand. To Pop you try them all. This helps in the case that your stacks are usually not empty, so Pop usually succeeds on its first try. There are better solutions to this problem though.
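In sketch form, striping looks something like this. (The per-stripe stack here is just a plain non-thread-safe stand-in for the real lock-free LFStack - the point is the Push/Pop routing, not the stack itself.)

```cpp
#include <atomic>

struct Node { int m_value; Node * m_next; };

// Stand-in for the lock-free LFStack from earlier parts.
struct Stack
{
    Node * m_top = nullptr;
    void Push(Node * n) { n->m_next = m_top; m_top = n; }
    Node * Pop() { Node * n = m_top; if (n) m_top = n->m_next; return n; }
};

const int NUM_STRIPES = 4;

struct StripedStack
{
    Stack m_stripes[NUM_STRIPES];
    std::atomic<unsigned> m_counter{0}; // round-robin cursor

    void Push(Node * n)
    {
        // cheap round robin; relaxed is fine, it only spreads load
        unsigned i = m_counter.fetch_add(1, std::memory_order_relaxed);
        m_stripes[i % NUM_STRIPES].Push(n);
    }

    Node * Pop()
    {
        // try every stripe; usually succeeds on the first if stacks are non-empty
        for (int i = 0; i < NUM_STRIPES; i++)
            if (Node * n = m_stripes[i].Pop())
                return n;
        return nullptr; // all stripes looked empty
    }
};
```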

One big optimization for the MPMC stack that you can use in some cases is to Pop all. Rather than popping the stack, the consumer just grabs the whole thing using atomic exchange :

Node * Pop_All()
{
    Node_And_Seq nullStack = { 0, 0 };
    Node_And_Seq localStack = Exchange(&g_stack, nullStack);

    // localStack.p is now exclusively owned by me

    return localStack.p;
}

This is an unconditional exchange, not a CAS, we don't need to loop or anything. We make the stack empty and take ownership of the whole contents. We can now work on a bunch of nodes in a batch which gets rid of the pain of the cache contention.
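Here's a minimal C++11 sketch of the push / pop-all pair. Note that if Pop_All is the *only* way nodes ever come off the stack, the sequence counter isn't even needed - ABA only bites the conditional single Pop - so a plain pointer exchange works :

```cpp
#include <atomic>

struct Node { int m_value; Node * m_next; };

// Simplified : no ABA counter needed when pop-all is the only consume op.
std::atomic<Node *> g_top{nullptr};

void Push(Node * n)
{
    // standard lock-free push loop; on failure localTop is reloaded
    Node * localTop = g_top.load(std::memory_order_relaxed);
    do { n->m_next = localTop; }
    while ( ! g_top.compare_exchange_weak(localTop, n, std::memory_order_release) );
}

Node * Pop_All()
{
    // unconditional exchange - no loop, we take the whole list
    return g_top.exchange(nullptr, std::memory_order_acquire);
}

// Pop_All hands the list back newest-first; reverse it if you want
// to consume the batch in push order.
Node * Reverse(Node * list)
{
    Node * out = nullptr;
    while (list)
    {
        Node * next = list->m_next;
        list->m_next = out;
        out = list;
        list = next;
    }
    return out;
}
```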

ADDENDUM : I should say that there are use cases the MPMC stack is really good for. One is the case where you are mostly just using it from one thread - that thread does 90% of the pushing and popping, so it gets to own the cache line - but once in a while other threads need to push or pop something. It's fine for that, especially if the "once in a while" tends to happen when the main thread isn't heavily hitting it.

Okay, I think we have now talked about the majority of the issues of lock free coding. In part 5 we will talk about the SPSC FIFO which is a very elegant illustration of the advantages of memory models and partial exclusive ownership.


Anonymous said...

> int _pad1[CACHE_LINE_SIZE];
> LFStack g_stack;
> int _pad2[CACHE_LINE_SIZE];

There is no guarantee that pad1, g_stack and pad2 are contiguous, right? You probably want to wrap that in a struct...

cbloom said...

"There is no guarantee that pad1, g_stack and pad2 are contiguous, right? You probably want to wrap that in a struct..."

I don't know if C actually says they are contiguous but in practice I believe they are 100% of the time.

But yeah I agree I wouldn't want to rely on that. (like for example I dunno if LTCG or some other weird linkages might screw this up)

won3d said...

On recent versions of gcc/ld, you can try adding -fdata-sections to gcc and --gc-sections to ld and see if it strips the unreferenced data like it is supposed to.

cbloom said...

Yeah I don't actually suggest that you have a "g_stack" global at file scope. I just thought it would be simpler to pretend you were talking through a global variable rather than setting up a ThreadSystem Singleton or whatever.

won3d said...

Well, I suppose for static/auto allocated stuff, you could just use the "align" declspec/attribute?

cbloom said...

Not exactly. I've been trying stuff using align and it's a bit of a mess.

align(64) on the *typedef* would cause the type to get padded up to a multiple of 64 bytes. If that's acceptable to you then things are pretty easy.

align(64) on the instance will align its address to 64 but won't prevent other objects from being directly after you.

So, yes if you align both the type and the instance it will get its own cache line.

But there doesn't seem to be any way to mark at the instance that it should be aligned and that the *next* value should also be aligned so that I get the whole line.
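For example, with C++11 alignas (same semantics as the declspec/attribute described above) :

```cpp
// alignas on the type pads sizeof up to 64, so nothing else can be
// placed in the rest of the line after an instance.
struct alignas(64) AlignedStack
{
    void * m_top; // stand-in for the real stack contents
};

// alignas on the instance alone would only align the address; with the
// type also aligned it's redundant but harmless.
alignas(64) static AlignedStack g_stack_aligned;
```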
