5/02/2013

05-02-13 - Simple C++0x style LF structs and adapter for x86-Windows

I've seen a lot of lockfree libraries out there that are total crap. Really weird non-standard ways of doing things, or overly huge and complex.

I thought I'd make a super simple one in the correct modern style. Download : cbliblf.zip

(If you want a big fully functional much-more-complete library, Intel TBB is the best I've seen. The problem with TBB is that it's huge and entangled, and the license is not clearly free for all use).

There are two pieces here :

"cblibCpp0x.h" provides atomic and such in C++0x style for MSVC/Windows/x86 compilers that don't have real C++0x yet. I have made zero attempt to make this header syntatically identical to C++0x, there are various intentional and unintentional differences.

"cblibLF.h" provides some simple lockfree utilities (mostly queues) built on C++0x atomics.

"cblibCpp0x.h" is kind of by design not at all portable. "cblibLF.h" should be portable to any C++0x platform.

WARNING : this stuff is not super well tested because it's not what I use in Oodle. I've mostly copy-pasted this from my Relacy test code, so it should be pretty strong but there may have been some copy-paste errors.

ADDENDUM : In case it's not clear, you do not *want* to use "cblibCpp0x.h". You want to use real Cpp0x atomics provided by your compiler. This is a temporary band-aid so that people like me who use old compilers can get a cpp0x stand-in, so that they can do work using the modern syntax. If you're on a gcc platform that has the __atomic extensions but not C1X, use that.

You should be able to take any of the C++0x-style lockfree code I've posted over the years and use it with "cblibCpp0x.h" , perhaps with some minor syntactic fixes. eg. you could take the fastsemaphore wrapper and put the "semaphore" from "cblibCpp0x.h" in there as the base semaphore.

Here's an example of what the objects in "cblibLF.h" look like :


//=================================================================         
// spsc fifo
//  lock free for single producer, single consumer
//  requires an allocator
//  and a dummy node so the fifo is never empty
template <typename t_data>
struct lf_spsc_fifo_t
{
public:

    lf_spsc_fifo_t()
    {
        // initialize with one dummy node :
        node * dummy = new node;
        m_head = dummy;
        m_tail = dummy;
    }

    ~lf_spsc_fifo_t()
    {
        // should be one node left :
        LF_OS_ASSERT( m_head == m_tail );
        delete m_head;
    }

    void push(const t_data & data)
    {
        node * n = new node(data);
        // n->next == NULL from constructor
        m_head->next.store(n, memory_order_release); 
        m_head = n;
    }

    // returns true if a node was popped
    //  fills *pdata only if the return value is true
    bool pop(t_data * pdata)
    {
        // we're going to take the data from m_tail->next
        //  and free m_tail
        node* t = m_tail;
        node* n = t->next.load(memory_order_acquire);
        if ( n == NULL )
            return false;
        *pdata = n->data; // could be a swap
        m_tail = n;
        delete t;
        return true;
    }

private:

    struct node
    {
        atomic<node *>      next;
        nonatomic<t_data>   data;
        
        node() : next(NULL) { }
        node(const t_data & d) : next(NULL), data(d) { }
    };

    // head and tail are owned by separate threads,
    //  make sure there's no false sharing :
    nonatomic<node *>   m_head;
    char                m_pad[LF_OS_CACHE_LINE_SIZE];
    nonatomic<node *>   m_tail;
};

Download : cbliblf.zip

12 comments:

Brian said...

What is the state of C++11 compilers on Windows? Or rather why are you rolling your own Atomics library? This seems to kill all the benefits of using relaxed atomics as far as compiler optimizations...

Matt said...

Is it possible to avoid heap (free store) allocation?

cbloom said...

"What is the state of C++11 compilers on Windows?"

VC 2012 seems okay.

"Or rather why are you rolling your own Atomics library?"

I use about 20 different compilers and right now 1 of them supports C++11.

The gcc "__atomic" extensions are very similar to C++11 and are slightly more widespread currently. You can very easily build your own atomic template from the gcc __atomic extensions. (not the old __sync extensions)

"This seems to kill all the benefits of using relaxed atomics as far as compiler optimizations..."

True but I cannot imagine a situation where anyone would care about that. You're talking about the scheduler taking a lockfree queue push from 40 clocks to 35 clocks or something like that. The big advantage of relaxed atomics is that they don't require memory barriers on chips where acq/rel is not automatic (eg. PPC), not that the compiler can reorder the instructions.

I've found several compiler bugs where compilers are too aggressive about optimizing around atomics, so I would probably be using compiler barriers around my lockfree stuff no matter what.

(actually I've found that gcc in general is helped a lot by putting scheduler barriers (asm volatile empty statements) around logical chunks of code; it seems to get confused if you give the optimizer too much freedom)

cbloom said...

"Is it possible to avoid heap (free store) allocation?"

Well of course you can replace new/delete with your own allocator that comes from a fixed pool or something.

I really don't like bounded lockfree queues, they generally have very bad wait cases. (I've written several posts about this in the past (for example:

http://cbloomrants.blogspot.com/2011/07/07-09-11-lockfree-thomasson-simple-mpmc.html

http://cbloomrants.blogspot.com/2011/07/07-29-11-spinning.html

)

In general I don't understand this myopic obsession that some people have with avoiding using the heap. The Windows LFH is particularly good.

There is an alternative SPSC I posted a while ago where nodes are recycled from the consumer back to the producer; maybe I'll put that in here.

cbloom said...

Meh, I put an spsc boundq in.

Anonymous said...

"The Windows LFH is particularly good."
Except if you use the Windows Heap, you get the debug heap when running in the debugger, which is not only slow but also completely serializing and leads to absolutely abysmal performance (speedups of >10x just from setting _NO_DEBUG_HEAP=1 aren't uncommon, and that's single-threaded).

Case in point: The Intel Occlusion Culling Sample I looked at a while ago, final load time ~1.4s (warm cache) with regular heap, ~15s with debug heap (but still release build!) on the same machine.

cbloom said...

True enough.

BTW it's been really annoying me that I have a similar problem with my allocator in Oodle. I've got a fast lock-free allocator that's less than 100 clocks on average, but I have to get to it through a pluggable function pointer, and there's optional debug layers and leak tracking and argument validation, and blah blah, so that by the time you do all that, allocations take 1000 clocks or so. It's not important for me but it just annoys me aesthetically that this neat little fast simple kernel is buried in a mountain of crud.

Brian said...

BTW. If anyone is looking for an alternative to Relacy, we have a C11/C++11 model checker at http://demsky.eecs.uci.edu/c11modelchecker.html. It pretty much covers the behaviors allowed by the memory model other than satisfaction cycles.

Matt said...

Fair enough about the heap allocation, was mostly interested whether it's an option (can come in handy).

"I've got a fast lock-free allocator that's less than 100 clocks on average, but I have to get to it through a pluggable function pointer, and there's optional debug layers and leak tracking and argument validation, and blah blah, so that by the time you do all that, allocations take 1000 clocks or so. It's not important for me but it just annoys me aesthetically that this neat little fast simple kernel is buried in a mountain of crud."

Is it C++ (from the other comments sounds like it is).
How many from the "and there's optional debug layers and leak tracking and argument validation" features do you need all at once?

I'm asking, since if you don't need them all at once, you can use templates for code dispatching with _zero_ icache-pollution / _zero_ extra code bloat overhead (this will actually beat macros, etc., since uninstantiated (read: unused) templates will never be compiled; that's not even dead code elimination for release builds, that's code never being there in the first place even for debug builds).

While your pet peeve seems to be folks preferring to avoid heap allocation ( ;] ), mine is folks not using templates where they may fit just fine (this, as any other programming construct, needs some qualifiers, see above). There are surprisingly many cases where you can avoid consequences "textbooks are warning about" (like the icache pollution).

Of course, chances are you have already considered it, and you may be in an unlucky situation where you really do
need to have all of the above combinations of options compiled in, all in one build--then it really seems there's only a choice between pointer chasing (with flushes and other crud killing performance) or code bloat (whether you use macros or templates, if you somehow managed to instantiate and use every single combination, you're going to kill your icache, no doubt about it).

cbloom said...

@Matt - Yeah, I'm a huge fan of template dispatching -

http://cbloomrants.blogspot.com/2010/05/05-27-10-loop-branch-inversion.html

though sadly on many compilers templates are still actually slower than macros, because they get confused about passing variables that can be modified to inline functions. (sigh)

Anyway, if it was my own game, I agree with you that the options could be on compile-time toggles and it would all be fine, but shipping a library it has to be on run-time branches.

I could certainly change "a func ptr call + several branch checks" into just one func ptr call; that is, make a dispatcher func for each combo of branch possibilities. But like I said it's actually all irrelevant because no customer ever wants to use the Oodle allocator directly or indirectly.

cbloom said...

@Brian - CDSChecker looks totally awesome and I'd love to use it instead of Relacy going forward. (it's a more complete test of possible memory reorderings)

If someone wants to port it to Windows, that would be a worthy and appreciated effort.

Brian said...

The problem with porting it to windows is that we actually copy on write snapshot (and rollback) the entire heap and globals and we don't know windows system programming well enough to do that.

old rants