7/20/2011

07-20-11 - Some notes on condition vars

Let's talk about condition vars a bit. To be clear, I won't be talking about the specific POSIX semantics of "condition_var", but rather the more general concept of what a CV is.

As mentioned before, a CV provides a mechanism to wait on a mutex-controlled variable. I've always been a bit confused about why you would want a CV, because you can certainly do the same thing with the much simpler concept of an "Event" - but with an Event there are some tricky races (see waiting on thread events, for example). You can certainly use an Event and build up a system with a kind of "register waiter then wait" mechanism, but what you're doing then is basically building a type of eventcount or CV.

Anyhoo, the most basic test of CV is something like this :


before : x = 0;

thread 1:
    {
        cv.lock();
        x += 1;
        cv.signal_unlock();
    }

thread 2:
    {
        cv.lock();
        while ( x <= 0 )
        {
            cv.unlock_wait_lock();
        }
        x = 7;
        cv.unlock();
    }

after : x == 7;

in words, thread 2 waits for thread 1 to do the inc, then sets x = 7. Now, what if our CV was not legit? For example, what if our unlock_wait_lock was just :
cv.unlock_wait_lock() :
    cv.unlock();
    cv.wait();
    cv.lock();
this code would not work. Why? What could happen is thread 2 runs first and sees x == 0, so it does the unlock. Then thread 1 runs and does x += 1, but there's no one to signal, so it just exits. Then thread 2 does the wait and deadlocks. ruh roh.

In this case, it could be fixed simply by using a Semaphore for the signal and making sure you always Post (inc) even if there is no waiter (that's a way of counting the signals). But that doesn't actually fix it in general - if you had another thread running that could consume wakeups for this CV, it could consume all those wakeups and you would still go into the wait and sleep.

So, it is not required that "unlock_wait_lock" actually be atomic (and in fact, it is not in most CV implementations), but you must not go into the wait if another thread could get in between the unlock and the wait. There are a few ways to accomplish this. One is to use some kind of prepare_wait , something like :

cv.unlock_wait_lock() :
    cv.prepare_wait();
    cv.unlock();
    cv.retire_wait();
    cv.lock();
prepare_wait is inside the mutex so it can run race-free (assuming signal takes the same mutex) ; it could record the signal epoch (the GUID for the signal), and then retire_wait will only actually wait if the epoch is the same. Calls to signal() must always inc the epoch even if there is no waiter. So, with this method if someone sneaks in after your unlock but before your wait, you will not go into the wait and just loop around again. This is one form of "spurious wakeup" and is why unlock_wait_lock should generally be used in a loop. (note that this is not a bad "thread thrashing" type of spurious wakeup - you just don't go to sleep).
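Here's a minimal sketch of this epoch mechanism in portable C++11. All the names are mine (not a real API), and a spin-yield stands in for a real OS wait - the only point is the lost-wakeup logic :

```cpp
// Hypothetical sketch of the prepare_wait / retire_wait epoch idea.
#include <cassert>
#include <atomic>
#include <mutex>
#include <thread>

struct epoch_cv
{
    std::mutex m;                   // the external mutex
    std::atomic<unsigned> epoch{0};

    void lock()   { m.lock(); }
    void unlock() { m.unlock(); }

    // must be called with m held; always incs, even if there's no waiter :
    void signal() { epoch.fetch_add(1, std::memory_order_release); }

    void unlock_wait_lock()
    {
        // prepare_wait : record the epoch, race-free because m is held
        unsigned e = epoch.load(std::memory_order_acquire);
        m.unlock();
        // retire_wait : only actually wait if no signal happened since
        while ( epoch.load(std::memory_order_acquire) == e )
            std::this_thread::yield();
        m.lock();
    }
};
```

If a signal sneaks in between the unlock and the wait, the epoch has already moved, so the spin falls through immediately and you just loop around again - the benign kind of spurious wakeup.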

This seems rather trivial, but it is really the crucial aspect of CV - it's why CV works and why we're interested in it. unlock_wait_lock does not need to be atomic, but it sort of acts like it is, in the sense that a lost signal is not allowed to pass between the unlock and the wait.


The next problem that you will see discussed is the issue of "wait generations". Basically if you have N waiters on a CV, and you go into broadcast() (aka notify_all) , it needs to signal exactly those N waiters. What can happen is a new waiter could come in and steal the wakeup that should go to one of the earlier waiters.

As usual I was a bit confused by this, because it's specific to only certain types of CV implementation. For example, if broadcast blocks new waiters from waiting on the CV, there is no need for generations. If your CV signal and wait track their "wait epoch", there is no problem with generations (the epoch defines the generation, if you like). If your CV is FIFO, there is no problem with generations - you can let new waiters come in during the broadcast, and the correct ones will get the wakeup.

So the generation issue mainly arises if you are trying to use a counting semaphore to implement your CV. In that case your broadcast might do something like count the number of waiters (call it N), then Post (inc) the semaphore N times. Then another waiter comes in, and he gets the dec on the semaphore instead of the guy you wanted.

The reason that people get themselves into this trouble is that they want to make a CV that respects the OS scheduler; obviously a generic CV implementation should. When you broadcast() to N waiters, the highest priority waiter should get the first wakeup. A FIFO CV with an event per thread is very easy to implement (and doesn't have to worry about generations), but is impossible to make respect OS scheduling. The only way to get truly correct OS scheduling is to make all the threads wait on the same single OS waitable handle (either a Semaphore or Event typically). So, now that all your waiters are waiting on the same handle, you have the generation problem.

Now also note that I said before that using epoch or blocking new waiters from entering during broadcast would work. Those are conceptually simple but in practice quite messy. The issue is that the broadcast might actually take a long time - it's under the control of the OS and it's not over until all waiters are awake. You would have to inc a counter by N in broadcast and then dec it each time a thread wakes up, and only when it gets to zero can you allow new waiters in. Not what you want to do.

I tried to think of a specific example where not having wait generations would break the code ; this is the simplest I could come up with :


before : x = 0

thread 0 :

    cv.lock();
    x ++;
    cv.unlock();
    cv.broadcast();

thread 1 :
    cv.lock();
    while ( x == 0 )
        cv.unlock_wait_lock();
    x = 7;
    cv.signal();
    cv.unlock();

thread 2 :
    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();
    x = 37;
    cv.unlock();

after : x == 37

So the threads are supposed to run in sequence, 0,1,2 , and the CV should control that. If thread 2 gets into its wait first, there can't be any problems. The problem arises if thread 1 gets into its wait first but the wakeup goes to thread 2. The bad execution sequence is like this :


thread 1 : cv.lock()
thread 1 : x == 0
thread 1 : cv.unlock_wait ...

thread 0 : cv.lock()
thread 0 : x++;
thread 0 : cv.unlock()
thread 0 : cv.broadcast .. starts .. sees 1 waiter ..

thread 2 : cv.lock
thread 2 : x != 7
thread 2 : cv.unlock_wait - I go into wait

thread 0 : .. cv.broadcast .. wakes thread 2

deadlock

You can look in boost::thread for a very explicit implementation of wait generations. I think it's rather over-complex but it does show you how generations get made (granted some of that is due to trying to match some of the more arcane requirements of the standard interface). I'll present a few simple implementations that I believe are much easier to understand.


A few more general notes on CV's before we get into implementations :

In some places I will use "signal_unlock" as one call instead of signal() and unlock(). One reason is that I want to convey the fact that signal_unlock is trying to be as close to atomic as possible. (we'll show one implementation where it actually is atomic). The other reason is that the normal CV API that allows signal() outside the mutex can create additional overhead. With the separate API you have either :

lock
..blah..
signal
unlock
which can cause "thread thrashing" if the signal wakes the waiter and then immediately blocks it on the lock. Or you can have :
lock
..blah..
unlock
signal
the problem with this for the implementor is that now signal is outside of the mutex and thus the CV is not protected; so you either have to take the mutex inside signal() or protect your guts in some other way; this adds overhead that could be avoided if you knew signal was always called with the mutex held. So both variants of the separate API are bad and the merged one is preferred.

The other issue, as you may have noticed, is that I combined my mutex and CV. This is slightly more elegant for the user, and is easier for the implementor, because some CV implementations can use knowledge of the guts of the mutex they are attached to. But it's not actually desirable.

The reason it's not great is because you do want to be able to use multiple CV's per mutex. (I've heard it said that you might want to use different mutexes with the same CV, but I can't think of an example where you would actually want to do that).

eg. in our broadcast generation example above, it would actually be cleaner to use a single mutex to protect "x" but then use two different CV's - one that you signal when x = 1, and another that you signal when x = 7. The advantage of this is that the CV signalled state corresponds directly to the condition that the waiter is waiting on.
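A sketch of that two-CV arrangement, using the standard C++11 primitives (std::condition_variable has the separate-mutex API rather than the merged one used in this post; the CV names are mine) :

```cpp
// One mutex protecting "x", two CVs - each one corresponds directly
// to the condition its waiter is waiting on.
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv_inced; // signalled when x has been inc'ed
std::condition_variable cv_is7;   // signalled when x == 7
int x = 0;

void thread0()
{
    { std::lock_guard<std::mutex> l(m); x++; }
    cv_inced.notify_all();
}

void thread1()
{
    std::unique_lock<std::mutex> l(m);
    cv_inced.wait(l, []{ return x != 0; });
    x = 7;
    cv_is7.notify_one();
}

void thread2()
{
    std::unique_lock<std::mutex> l(m);
    cv_is7.wait(l, []{ return x == 7; });
    x = 37;
}
```

Because each waiter's predicate is re-checked under the mutex, the threads run in the right order no matter which one starts first.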

In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition" , because it reduces unnecessary wakeups. The same could be said for Events or Semaphores or any wakeup mechanism - using wakeups to kick threads to check themselves is not desirable.

We're going to go off on a tangent a bit now.

What would be ideal in general is to actually be able to sleep a thread on a predicate. Instead of doing this :


    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();

you would do something like :

    cv.wait( { x == 7 } );

where it's implied that the mutex is locked when the predicate is checked. This is actually not hard to implement. In your signal() implementation, you can walk the list of waiters in the waitset (you are holding the mutex during signal) and just call predicate() on each waiter and see if he should wake up.
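Here's a hypothetical sketch of that signal-side predicate checking (all names mine; a spin-yield stands in for a real per-thread event). The signaller walks the waitset under the mutex and wakes only the waiters whose predicate is now true :

```cpp
#include <cassert>
#include <atomic>
#include <functional>
#include <list>
#include <mutex>
#include <thread>

struct pred_cv
{
    std::mutex m; // doubles as the external mutex

    struct waiter
    {
        std::function<bool()> pred;
        std::atomic<bool> wake{false};
    };
    std::list<waiter*> waitset;

    // sleep until pred() is true; pred is always evaluated with m held
    void wait(std::function<bool()> pred)
    {
        std::unique_lock<std::mutex> l(m);
        while ( ! pred() )
        {
            waiter w;
            w.pred = pred;
            waitset.push_back(&w);
            l.unlock();
            while ( ! w.wake.load(std::memory_order_acquire) )
                std::this_thread::yield();
            l.lock();
            // loop around and re-check pred under the mutex
        }
    }

    // must be called with m held, after changing the protected state :
    void signal_in_lock()
    {
        for (auto it = waitset.begin(); it != waitset.end(); )
        {
            if ( (*it)->pred() )
            {
                (*it)->wake.store(true, std::memory_order_release);
                it = waitset.erase(it);
            }
            else
                ++it;
        }
    }
};
```

Note the stack-node lifetime is safe here because the woken waiter has to re-acquire m (which the signaller's caller still holds) before its node can be destroyed.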

The big advantage is that only the threads that actually want to wake up get woken. Dmitry has done an implementation of something like this and calls it fine grained condvar/eventcount (also submitted to TBB as fine grained concurrent monitor ).

As noted before, using multiple CV's can be a primitive way of getting more selective signalling, if you can associate each CV with a certain condition and wait on the right one.

A nice version of this that some operating systems provide is an event with a bit mask. They use something like a 32-bit mask so you can set 32 conditions. Then when you Wait() you wait on a bit mask which lets you choose various conditions to wake for. Then signal passes a bit mask of which to match. This is really a very simplified version of the arbitrary predicate case - in this case the predicate is always (signal_bits & waiting_bits) != 0 (or sometimes you also get the option of (signal_bits & waiting_bits) == waiting_bits so you can do a WAIT_ANY or WAIT_ALL for multiple conditions).
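The two wake predicates, written out (hypothetical helper names, just to pin them down) :

```cpp
#include <cassert>
#include <cstdint>

// WAIT_ANY : wake if any requested condition bit is signalled
inline bool should_wake_any(uint32_t signal_bits, uint32_t waiting_bits)
{
    return (signal_bits & waiting_bits) != 0;
}

// WAIT_ALL : wake only if all requested condition bits are signalled
inline bool should_wake_all(uint32_t signal_bits, uint32_t waiting_bits)
{
    return (signal_bits & waiting_bits) == waiting_bits;
}
```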

The bit-mask events/signals would be awesome to have in general, but they are not available on most OS'es so oh well. (of course you can mimic it on windows by making 32 Events and using WFMO on them all).

Next up, some condvar implementations...

07-20-11 - Some condition var implementations

Okay, time for lots of ugly code posting.

Perhaps the simplest condvar implementation is FIFO and SCHED_OTHER using an explicit waiter list. The waiter list is protected by an internal mutex. It looks like this :

(BTW I'm putting the external mutex in the condvar for purposes of this code, but you may not actually want to do that; see previous notes)


struct cond_var_mutex_list
{
    std::mutex  external_m;
    std::mutex  internal_m;
    // (*1)
    std::list<HANDLE>   waitset;

    cond_var_mutex_list() { }
    ~cond_var_mutex_list() { }
    
    void lock() { external_m.lock($); }
    void unlock() { external_m.unlock($); }
    
    // (*1) should be the event from TLS :
    HANDLE get_event()
    {
        return CreateEvent(NULL,0,0,NULL);
    }
    void free_event(HANDLE h)
    {
        CloseHandle(h);
    }
    
    void unlock_wait_lock() 
    {
        HANDLE h = get_event();
        
        // taking internal_m lock prevents us from racing with signal
        {
        internal_m.lock($);

        waitset.push_back(h);
        
        internal_m.unlock($);
        }
        
        // (*2)
        external_m.unlock($);
        
        WaitForSingleObject(h,INFINITE);
        
        free_event(h);

        // I will often wake from the signal and immediately go to sleep here :
        external_m.lock($);
    }
    
    // could return if one was signalled
    void signal()
    {
        HANDLE h = 0;
        
        // pop a waiter off the front, if any :
        {
        internal_m.lock($);
        
        if ( ! waitset.empty() )
        {
            h = waitset.front();
            waitset.pop_front();
        }
        
        internal_m.unlock($);
        }
        
        if ( h == 0 )
            return;
    
        SetEvent(h);        
    }
    
    // could return # signalled
    void broadcast()
    {
        std::list<HANDLE> local_waitset;
        
        // grab local copy of the waitset
        // this enforces wait generations correctly
        {
        internal_m.lock($);
        
        local_waitset.swap(waitset);

        internal_m.unlock($);
        }

        // (*3)     

        // set events one by one;
        // this is a bit ugly, SCHED_OTHER and thread thrashing
        while( ! local_waitset.empty() )
        {
            HANDLE h = local_waitset.front();
            local_waitset.pop_front();
            SetEvent(h);
        }   
    }

};

I think it's pretty trivial and self-explanatory. A few important notes :

*1 : We use std::list here for simplicity, but in practice a better way would be to have a per-thread struct which contains the per-thread event and forward & back pointers for linking. Then you don't have any dynamic allocations at all. One per-thread event here is all you need, because a thread can only be in one wait at a time. Also there's no event lifetime issue, because each thread only waits on its own event (we'll see issues with this in later implementations). (See for example Thomasson's sketch of such a thing, but it's pretty self-explanatory.)

*2 : This is the crucial line of code for cond-var correctness. The external mutex is unlocked *after* the current thread is put in the waitset. This means that after we unlock the external mutex, even though we don't atomically go into the wait, we won't miss a signal that happens between the unlock and the wait.

*3 : This is where the "wait generation" is incremented. We swap the waiter set to a local copy and will signal the local copy. At this point new waiters can come in, and they will get added to the member variable waitset, but they don't affect our generation.

The nice thing about this style of implementation is that it only needs mutex and auto-reset events, which are probably the most portable of all synchronization primitives. So you can use this on absolutely any platform.

The disadvantage is that it's SCHED_OTHER (doesn't respect OS priorities) and it can have rather more thread switches than necessary.


The next version we'll look at is Thomasson's two-event cond_var. There are a lot of broken versions of this idea around the net, so it's instructive to compare to (what I believe is) a correct one.

The basic idea is that you use two events. One is auto-reset (an auto-reset event is just like a semaphore with a max count of 1); the other is manual reset. signal() sets the auto-reset event to release one thread. broadcast() sets the manual-reset event to release all the threads (opens the gate and leaves it open). Sounds simple enough. The problem is that manual reset events are fraught with peril. Any time you see anyone say "manual reset event" you should think "ruh roh, race likely". However, handling it in this case is not that hard.

The easy way is to use the same trick we used above to handle broadcast with generations - we just swap out the "waitset" (in this case, the broadcast event) when we broadcast(). That way it is associated only with previous waiters, and new waiters can immediately come in and wait on the next generation's manual reset event.

The only ugly bit is handling the lifetime of the broadcast event. We want it to be killed when the last member of its generation is woken, and to get this right we need a little ref-counting mechanism.

So, here it is , based on the latest version from Thomasson of an idea that he posted many times in slightly different forms :


class thomasson_win_condvar
{
    enum { event_broadcast=0, event_signal = 1 };

    struct waitset
    {
        HANDLE m_events[2];
        std::atomic<int> m_refs;

        waitset(HANDLE signalEvent) : m_refs(1)
        {
            // signalEvent is always the same :
            m_events[event_signal] = signalEvent;

            // broadcast is manual reset : (that's the TRUE)
            m_events[event_broadcast] = CreateEvent(NULL, TRUE, FALSE, NULL);
        }
        
        ~waitset()
        {
            RL_ASSERT( m_refs($) == 0 );
    
            //if ( m_events[event_broadcast] )
            CloseHandle(m_events[event_broadcast]);
        }
    };


private:
    VAR_T(waitset*) m_waitset;
    CRITICAL_SECTION m_internal_mutex;
    CRITICAL_SECTION m_external_mutex;
    HANDLE m_signal_event;


public:
    thomasson_win_condvar()
    :   m_waitset(NULL)
    {
        m_signal_event = CreateEvent(NULL,0,0,NULL);
        InitializeCriticalSection(&m_internal_mutex);
        InitializeCriticalSection(&m_external_mutex);
    }

    ~thomasson_win_condvar()
    {
        RL_ASSERT( VAR(m_waitset) == NULL );
        CloseHandle(m_signal_event);
        DeleteCriticalSection(&m_internal_mutex);
        DeleteCriticalSection(&m_external_mutex);
    }


    void dec_ref_count(waitset * w)
    {
        EnterCriticalSection(&m_internal_mutex);

        // if I took waitset's refs to zero, free it
        if (w->m_refs($).fetch_add(-1, std::mo_relaxed) == 1)
        {
            std::atomic_thread_fence(std::mo_acquire,$);
            // detach from the member *before* freeing, not after :
            if ( w == VAR(m_waitset) )
                VAR(m_waitset) = NULL;
            delete w;
        }

        LeaveCriticalSection(&m_internal_mutex);
    }

    // must be called with m_internal_mutex held
    // (note : no LeaveCriticalSection here - every caller below
    //  Leaves after calling this, so also leaving here would
    //  unbalance the critical section)
    void inc_ref_count(waitset * w)
    {
        if ( ! w ) return;

        w->m_refs($).fetch_add(1,std::mo_relaxed);
    }
        
public:
    void lock ()
    {
        EnterCriticalSection(&m_external_mutex);
    }
    void unlock ()
    {
        LeaveCriticalSection(&m_external_mutex);
    }

    void unlock_wait_lock()
    {
        waitset* w;
        
        {
        EnterCriticalSection(&m_internal_mutex);

        // make waitset on demand :
        w = VAR(m_waitset);

        if (! w)
        {
            w = new waitset(m_signal_event);
            VAR(m_waitset) = w;
        }
        else
        {
            inc_ref_count(w);
        }
        
        LeaveCriticalSection(&m_internal_mutex);
        }

        // note unlock of external after waitset update :
        LeaveCriticalSection(&m_external_mutex);

        // wait for *either* event :
        WaitForMultipleObjects(2, w->m_events, false, INFINITE);

        EnterCriticalSection(&m_external_mutex);
        
        dec_ref_count(w);
    }


    void broadcast()
    {
        EnterCriticalSection(&m_internal_mutex);

        // swap waitset to local state :
        waitset* w = VAR(m_waitset);

        VAR(m_waitset) = NULL;

        inc_ref_count(w);
        
        LeaveCriticalSection(&m_internal_mutex);

        // at this point a new generation of waiters can come in,
        //  but they will be on a new waitset

        if (w)
        {
            SetEvent(w->m_events[event_broadcast]);

            // note : broadcast event is actually never cleared (that would be a tricky race)
            // instead the waitset it used is deleted and not used again
            // a new waitset will be made with an un-set broadcast event

            dec_ref_count(w);
        }
    }


    void signal()
    {        
        EnterCriticalSection(&m_internal_mutex);

        waitset* w = VAR(m_waitset);

        inc_ref_count(w);
        
        LeaveCriticalSection(&m_internal_mutex);

        if (w)
        {
            SetEvent(w->m_events[event_signal]);

            dec_ref_count(w);
        }
    }

};

I don't think there's anything too interesting to say about this, the interesting bits are all commented in the code.

Basically the trick for avoiding the evilness of a manual reset event is just to make a new one after you set it to "open" and never try to set it to "closed" again. (of course you could set it to closed and recycle it through a pool instead of allocating a new one each time).

This code can be simplified/optimized in various ways, for example when you signal() you don't actually need to make or delete a waitset at all.

I believe you could also get rid of m_internal_mutex completely with a bit of care. Actually it doesn't take any care; if you require that signal() and broadcast() are always called from within the external lock, then the internal mutex isn't needed at all (the external lock serves to protect the things that it protects, namely the waitset).


The Terekhov condvar in pthreads-win32 (reportedly) uses a barrier to block entry to "wait" for new waiters after you "broadcast" but before all the waiters have woken up. It's a gate that's closed when you broadcast, the waiter count is remembered, and it's opened after they all wake up. This works but does cause thread thrashing; waiters who were blocked will go to sleep on the barrier, then wake up and rush in and immediately go to sleep in the wait on the condvar. (caveat : I haven't actually looked at the pthreads-win32 code other than to see that it's huge and complex and I didn't want to read it)

Doug Schmidt wrote the nice page on Strategies for Implementing POSIX cond vars on Win32 (which describes a lot of bad or broken ways, such as using PulseEvent, or using SetEvent and trying to count down to reset it). The way he implemented it in his ACE package is sort of similar to Terekhov's blocking mechanism. This extraction for qemu is a lot easier to follow than the ACE code and uses the same technique.

At first I didn't think it worked at all, but the secret is that it blocks wake-stealers using the external mutex. The key is that in this implementation, "broadcast" has to be called with the external mutex held. So what happens is broadcast wakes a bunch of guys, then waits on their wakes being done - it's still holding the external mutex. The waiters wake up, dec the count, and then try to lock the mutex and block on that. Eventually they all wake up and set an event so the broadcaster is allowed to resume. Now he leaves broadcast and unlocks the mutex, and the guys who were woken up can now run. Stolen wakeups are prevented because the external mutex is held the whole time, so nobody can get into the CV to even try to wait.

I'm really not a fan of this style of condvar implementation. It causes lots of thread thrashing. It requires every single one of the threads broadcasted-to to wake up and go back to sleep before any one can run. Particularly in the Windows environment where individual threads can lose time for a very long time on multi-core machines, this is very bad.

Thomasson's earlier waitset didn't swap out the waitset in notify_all so it didn't get generations right. (he did later correct versions such as the one above)

Derevyago has posted a lot of condvars that are broken (a manual reset event used with ResetEvent() is immediately smelly, and in this case is in fact broken). He also posted one that works, which is similar to my first one here (FIFO, SCHED_OTHER, manual wait list).

Anthony Williams posted a reasonably simple sketch of a condvar ; it uses a manual reset event per generation which is swapped out on broadcast ; it's functionally identical to the Thomasson condvar, except that Anthony maintains a linked list of waiters instead of just inc/dec'ing a refcount. Anthony didn't provide signal() but it's trivial to do so by adding another manual-reset event.

Dmitry's fine grained eventcount looks like a nice way to build condvar, but I'm scared of it. If somebody ever makes a C++0x/Relacy version of that, let me know.

7/18/2011

07-18-11 - cblib Relacy

Announce : cblib now has its own version of "Relacy".

This is no replacement for the original - go get Dmitry's Relacy Race Detector if you are serious about low level threading, you must have this. It's great. (I'm using 2.3.0).

Now in cblib is "LF/cblibRelacy.h" which is a look-alike for Relacy.

What you do, is you write code for Relacy just like you normally would. That is, you write C++0x but you add $ on all the shared variable accesses (and use rl::backoff and a few other things). You set up a test class and run rl::simulate.

Now you can change your #include "relacy.h" to #include "cblib/lf/cblibRelacy.h" and it should just transparently switch over to my version.

What does my version do differently?

0. First of all, it is no replacement for the real Relacy, which is a simulator that tries many possible races; cblibRelacy uses real OS threads, not fibers, so the tests are not nearly as comprehensive. You still need to do your tests in the real Relacy first to check that your algorithm is correct.

1. It gives you actually usable compiled code. eg. if you take the clh_mutex or anything I've posted recently and combine it with cblibRelacy.h , you have a class you can go and use. (but don't literally do that)

2. It runs its tests on the actual compiled code. That means you aren't testing in a simulator which might hide problems with the code in the real world (eg. if your implementation of atomics has a bug, or if there's a priority inversion caused by the scheduling of real OS threads).

3. Obviously you can test OS primitives that Relacy doesn't support, like Nt keyed events, or threads waiting on IO, etc.

4. It can run a lot more iterations than the real Relacy because it's using real optimized code; for example I found bugs with my ticket locks when the ticket counters overflowed 16 bits, which is way more iterations than you can do in real Relacy.

How does it work :

I create a real Win32 thread for each of the test threads. The atomic ops translate to real atomic ops on the machine. I then run the test many times, and try to make the threads interleave a bit to make problems happen. The threads get stopped and started at different times and in different orders to try to get them to interleave a bit differently on each test iteration.

Optionally (set by cblibRelacy_do_scheduler), I can also use the Relacy $ points to juggle my scheduling, just the same way Relacy does with fibers. Wherever a $ occurs (access to a shared variable), I randomize an operation and might spin a bit or sleep the thread or some other things. This gives you way more interleaving than you would get just from letting the OS do thread switching.

Now as I said this is no substitute for the real Relacy, you'll never get as many fine switches back and forth as he does (well, you could of course, if you made the $ juggle your thread almost always, but that would actually make it much slower than Relacy because he uses fibers to make all the switching less expensive).

One important note - cblibRelacy will not stress your test very well unless you run it with more threads than you have cores. The reason is that if your system is not oversubscribed, then the SwitchToThread() that I use in $ will do nothing.

Also, don't be a daft/difficult commenter. This code is intended as a learning tool, it's obviously not ready to be used directly in a production environment (eg. I don't check any OS return codes, and I intentionally make failures be asserts instead of handling them gracefully). If you want some of these primitives, I suggest you learn from them then write your own versions of things. Or, you know, buy Oodle, which will have production-ready multi-platform versions of lots of stuff.

ADDENDUM : I should also note to be clear : Relacy detects races and other failure types and will tell you about them. cblibRelacy just runs your code. That means to actually make it a test, you need it to do some asserting. For example, with real Relacy you can test your LF Stack by just pushing some nodes and popping some nodes. With cblibRelacy that wouldn't tell you much (unless you crash). You need to write a test that pushes some values, then pops some value and asserts that it got the same things out.

07-18-11 - MCS list-based lock

We did CLH before, but MCS really is better, so let's go through it.

MCS is another list/node based lock from the ancient days. The nodes are linked forwards from head to tail. The "head" currently owns the mutex, and new waiters trying to enter the lock append themselves at the tail.

Unlike CLH, as soon as you leave the mutex in MCS, your node is no longer in use in the chain, so you can free it right away. This also means you can just keep your node on the stack, and no allocations or any goofy gymnastics are needed.

Also like CLH, MCS has the nice performance advantage of only spinning on a local variable (do cache line padding to get the full benefit of this).

The way it works is quite simple. Each node has a "gate" flag which starts "blocked". You spin on your own gate being open. The previous lock-holder will point at you via his "next" pointer. When he unlocks he sets "next->gate" to unlocked, and that allows you to run.

The code is :


// mcs on stack

struct mcs_node
{
    std::atomic<mcs_node *> next;
    std::atomic<int> gate;
    
    mcs_node()
    {
        next($).store(0);
        gate($).store(0);
    }
};

struct mcs_mutex
{
public:
    // tail is null when lock is not held
    std::atomic<mcs_node *> m_tail;

    mcs_mutex()
    {
        m_tail($).store( NULL );
    }
    ~mcs_mutex()
    {
        ASSERT( m_tail($).load() == NULL );
    }
    
    class guard
    {
    public:
        mcs_mutex * m_t;
        mcs_node    m_node; // node held on the stack

        guard(mcs_mutex * t) : m_t(t) { t->lock(this); }
        ~guard() { m_t->unlock(this); }
    };
    
    void lock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        // set up my node :
        // not published yet so relaxed :
        me->next($).store(NULL, std::mo_relaxed );
        me->gate($).store(1, std::mo_relaxed );
    
        // publish my node as the new tail :
        mcs_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        if ( pred != NULL )
        {
            // (*1) race here
            // unlock of pred can see me in the tail before I fill next
            
            // publish me to previous lock-holder :
            pred->next($).store(me, std::mo_release );

            // (*2) pred not touched any more       

            // now this is the spin -
            // wait on predecessor setting my flag -
            rl::linear_backoff bo;
            while ( me->gate($).load(std::mo_acquire) )
            {
                bo.yield($);
            }
        }
    }
    
    void unlock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        mcs_node * next = me->next($).load(std::mo_acquire);
        if ( next == NULL )
        {
            mcs_node * tail_was_me = me;
            if ( m_tail($).compare_exchange_strong( tail_was_me,NULL,std::mo_acq_rel) )
            {
                // got null in tail, mutex is unlocked
                return;
            }
            
            // (*1) catch the race :
            rl::linear_backoff bo;
            for(;;)
            {
                next = me->next($).load(std::mo_acquire);
                if ( next != NULL )
                    break;
                bo.yield($);
            }
        }

        // (*2) - store to next must be done,
        //  so no locker can be viewing my node any more        

        // let next guy in :
        next->gate($).store( 0, std::mo_release );
    }
};

There are two subtle "moments" (in Dmitry's terminology), which have been marked in the code.

The main one is (*1) : this is actually exactly analogous to the "unlocking in progress" state that we talked about with the list-event mutex that I designed earlier. In this case it's a "locking in progress". The state is indicated when "m_tail" has been changed, but "->next" has not yet been filled. Because m_tail is exchanged first (it is the atomic sync point for the lock), your node is published before pred->next is set. So the linked list can be broken into pieces and invalid during this phase. But the unlocker can easily detect it and spin to wait for this phase to pass.

The other important point is (*2) - this is what allows you to keep the node on the stack. Your node can be held by another thread, but by the time you get to *2 you know it must be done with you.

So it has the same FIFO, SCHED_OTHER, etc. complaints as the previous mutexes. But it also has a lot of advantages. The mutex itself is tiny, just one pointer. The fast path is one exchange and one CAS ; that's not the best, but it's okay. But the real advantage is its flexibility.

You can change the spin to a Wait() quite trivially. (just store an Event in the node instead of the "gate" flag)
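For illustration, here's a minimal sketch of that substitution. I'm using a C++11 condition_variable-plus-flag pair as a stand-in for an OS Event, and the names (mcs_wait_node, wait_handoff_demo) are made up for this example, not part of the code above :

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// the node's "gate" flag becomes a waitable event :
struct mcs_wait_node
{
    std::mutex m;
    std::condition_variable cv;
    bool gate_open = false;   // plays the role of the Event
};

// returns what the waiter saw; the signal/wait pair guarantees it
// sees the unlocker's write, so this returns 1
int wait_handoff_demo()
{
    mcs_wait_node node;
    int shared = 0, seen = -1;

    std::thread waiter([&]{
        // in lock(), instead of spinning on me->gate :
        std::unique_lock<std::mutex> lk(node.m);
        node.cv.wait(lk, [&]{ return node.gate_open; });
        seen = shared;   // runs strictly after the signal below
    });
    std::thread unlocker([&]{
        shared = 1;
        // in unlock(), instead of next->gate.store(0) :
        {
            std::lock_guard<std::mutex> lk(node.m);
            node.gate_open = true;
        }
        node.cv.notify_one();
    });
    unlocker.join();
    waiter.join();
    return seen;
}
```

In a real implementation you'd use a per-thread OS event or semaphore rather than a mutex/CV pair, of course; the point is only that the gate variable and the spin get replaced by a signal and a wait on state held in the waiter's own node.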

You can spin and try to CAS m_tail to acquire the lock a bit before waiting if you like.

You can combine spinners and waiters! That's unusual. You can use the exact same mutex for a spinlock or a Wait lock. I'm not sure that's ever a good idea, but it's interesting. (one use for this is if you fail to get a handle to wait on, you can just spin, and your app will degrade somewhat gracefully).

Cool!

7/17/2011

07-17-11 - Per-thread event mutexes

Another class of mutex implementations that we haven't talked about yet are those based on a list of waiting threads. Again this is a general pattern which is useful in lots of threading primitives, so it's useful to talk about.

The basic common idea is that you have some kind of node per thread. This can be in the TLS, it can be on the stack (the stack is a form of TLS, BTW), or if you're the OS it is in the OS thread structure. (btw most implementations of mutexes and waitlists inside the kernel take this form, using a node inside the OS thread structure, but of course they are much simpler because they are in control of the thread switching).

A common advantage of this kind of scheme is that you only need as many waitable handles as threads, and you can have many many mutexes.

So a pseudo-code sketch is something like :


per-thread node for linked list

lock(mutex,node) :
  try to acquire mutex
  if failed, add node to waiting list

unlock(mutex) :
  pop node off waiting list
  if not null, set owner to him
  else set owner to none

If I'm building my own mutex in user mode and don't want to use a previously existing mutex, I need to make the linked list of waiters lock-free. Now, note that the linked list I need here is actually "MPSC" (multi-producer single-consumer), because the consumer is always the thread that currently holds the mutex. (for something to be SC doesn't mean the same thread has to consume always; it means there must only be one consumer at a time, with some mechanism (such as a mutex) ensuring exclusion).

If I'm going to go to the trouble of managing this list, then I want my unlock to provide "direct handoff" - that is, guarantee a waiter gets the lock, so no new locker can sneak in and grab it. I also want my threads to really be able to go to sleep and wake up, since we saw with the ticket lock that if you don't control thread waking with this kind of mutex, you can get the problem that the thread you are handing to is swapped out, which blocks all the other threads from proceeding.

Now I haven't yet specified that the list has to be ordered, but sure let's make it a FIFO so we have a "fair" mutex, and I'll go ahead and use an MPSC_FIFO queue because I have one off the shelf that's tested and ready to go.

So, we obviously need some kind of prepare_wait/doublecheck/wait mechanism in this scenario , so we try something like :


lock(mutex,node) :
{
 try acquire mutex ; if succeeded , return

 prepare_wait : push node to MPSC_FIFO :

 double check :
 try acquire mutex
 if succeeded :
   cancel_wait
   return

 wait;
}

unlock() :
{
 node = pop MPSC_FIFO
 if no node
   set mutex unlocked , return

 set owner to node
 wake node->thread
}

(this doesn't work). But we're on the right track. There's one race that we handle okay :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada
unlocker : set mutex unlocked

locker : push to FIFO
locker : double check acquire
locker : gets it now , cancel wait

that's okay. But there's another race that we don't handle :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada

locker : push to FIFO
locker : double check acquire , doesn't get it

unlocker : set mutex unlocked

locker : wait

ruh-roh ! deadlock.

So, there's one ugly way to fix this. In the unlock, before the pop, you change the mutex status from "locked" to "unlocking in progress". Then when the locker does the double-check , he will fail to get the mutex but he will see the "unlocking in progress" flag, and he can handle it (one way to handle it is by spinning until the state is no longer "in progress").

But this is quite ugly. And we would like our mutex to not have any spins at all, and in particular not spins that can be blocked for the duration of a thread switch. (though to some extent this is inherent in mutexes, so it's not a disaster here, it is something to beware of generally in lockfree design - you don't want to design with exclusive states like "unlock in progress" that block other people if you get swapped out).

So, another way to handle that second bad race is for the double-check to be mutating. If double-check is a fetch_add , then when unlocker sets the mutex to unlocked, there will be a count in there indicating that there was a push after our pop.

Thus we change the "gate" to the mutex to be a counter - we use a high bit to indicate locked state, and the double check does an inc to count waiters :


struct list_mutex3
{
    enum
    {
        UNLOCKED = 0,
        LOCKED = (1<<30),
        WAITER = 1
    };
    
    std::atomic<int> m_state;
    MPSC_FIFO   m_list;

    list_mutex3()
    {
        m_state($).store( UNLOCKED );
        MPSC_FIFO_Open(&m_list);
    }
    ~list_mutex3()
    {
        MPSC_FIFO_Close(&m_list);
    }

    void lock(ThreadNode * tn)
    {
        // optionally do a few spins before going to sleep :
        const int spin_count = 10;
        for(int spins=0;spins<spin_count;spins++)
        {
            int zero = UNLOCKED;
            if ( m_state($).compare_exchange_strong(zero,LOCKED,std::mo_acq_rel) )
                return;
                
            HyperYieldProcessor();
        }
                  
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
        
        // double check :
        int prev = m_state($).fetch_add(WAITER,std::mo_acq_rel);
        if ( prev == UNLOCKED )
        {
            // I got the lock , but set the wrong bit, fix it :
            m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }

        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-LOCKED,std::mo_release);
        ASSERT( prev >= LOCKED );
        if ( prev == LOCKED )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the WAITER inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;

        // we turned off locked, turn it back on, and subtract the waiter we popped :
        prev = m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
        ASSERT( prev < LOCKED && prev >= WAITER );
        SetEvent(tn->m_event);
    }
    
    void cancel_wait(ThreadNode * tn)
    {
        MPSC_FIFO_Fetch( &m_list);
        MPSC_FIFO_Remove(&m_list,tn);
    }
};

I think it's reasonably self-explanatory what's happening. Normally when the mutex is locked, the LOCKED bit is on, and there can be some number of waiters that have inc'ed the low bits. The unlock is reasonably fast because it checks the waiter count and doesn't have to bother with queue pops if there are no waiters.

In the funny race case, what happens is the LOCKED bit turns off, but WAITER gets inc'ed at the same time, so the mutex is still blocked from entry (because the initial CAS to enter is checking against zero, it's not checking the LOCKED bit). During this funny phase that was previously a race, now the unlocker will see that the double-check has happened (and failed) and will proceed into the pop-signal branch.
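To make that concrete, here's a tiny single-threaded walk-through of the state arithmetic in that funny phase, using the same constants as list_mutex3 (funny_race_state is just a name for this demo) :

```cpp
#include <atomic>

enum { UNLOCKED = 0, LOCKED = (1<<30), WAITER = 1 };

// returns the state a new locker observes during the "funny" phase
int funny_race_state()
{
    // mutex locked, and one waiter has done its double-check inc :
    std::atomic<int> state(LOCKED + WAITER);

    // the unlocker turns off the LOCKED bit :
    state.fetch_add(-LOCKED);

    // state is now just WAITER : LOCKED is off, but state != 0,
    // so a new locker's CAS from UNLOCKED still fails :
    int zero = UNLOCKED;
    bool got_in = state.compare_exchange_strong(zero, LOCKED);

    return got_in ? -1 : state.load();
}
```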

Remember that fetch_add returns the value *before* the add. (this sometimes confuses me because the Win32 InterlockedIncrement returns the value *after* the increment).
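A two-line sanity check of that convention :

```cpp
#include <atomic>

// std::atomic fetch_add returns the value *before* the add
int fetch_add_demo()
{
    std::atomic<int> x(5);
    int prev = x.fetch_add(1);       // prev == 5
    return prev * 100 + x.load();    // x is now 6
}
```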

cancel_wait is possible because at that point we own the mutex, thus we are the SC for the MPSC and we can do whatever we want to it. In particular my implementation of MPSC uses Thomasson's trick of building it from MPMC stack and then using exchange to grab all the nodes and reverse them to FIFO order. So Fetch does the exchange and reverse, and then I have a helper that can remove a node. (obviously those should be combined for efficiency). You should be able to do something similar with most MPSC implementations (*).
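Here's a hedged sketch of that construction - not my actual MPSC_FIFO, just the shape of Thomasson's trick: producers push onto a lock-free LIFO stack, and the single consumer exchanges the head for null (grabbing everything pushed so far) and reverses the chain into FIFO order. Once fetched, the consumer owns the sublist privately, which is what makes a Remove possible. All names here are illustrative :

```cpp
#include <atomic>

struct lfs_node { lfs_node * next; int value; };

struct mpsc_fifo
{
    std::atomic<lfs_node *> head{nullptr};

    void push(lfs_node * n)              // any thread
    {
        lfs_node * h = head.load(std::memory_order_relaxed);
        do { n->next = h; }
        while ( ! head.compare_exchange_weak(h, n,
                    std::memory_order_release, std::memory_order_relaxed) );
    }

    lfs_node * fetch()                   // single consumer only
    {
        // grab everything pushed so far :
        lfs_node * n = head.exchange(nullptr, std::memory_order_acquire);
        // reverse LIFO -> FIFO :
        lfs_node * fifo = nullptr;
        while ( n )
        {
            lfs_node * nx = n->next;
            n->next = fifo;
            fifo = n;
            n = nx;
        }
        return fifo;
    }
};

// push 1,2,3 then fetch; encodes the pop order as decimal digits
int fetch_order_demo()
{
    mpsc_fifo q;
    lfs_node a{nullptr,1}, b{nullptr,2}, c{nullptr,3};
    q.push(&a); q.push(&b); q.push(&c);
    int code = 0;
    for ( lfs_node * f = q.fetch() ; f ; f = f->next )
        code = code*10 + f->value;
    return code;    // FIFO : first pushed comes out first
}
```

Note the single-consumer restriction: fetch (and any subsequent remove on the fetched sublist) is only safe from the thread that currently owns the mutex.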

(* = Dmitry has a very clever MPSC implementation ( here : low-overhead mpsc queue - Scalable Synchronization Algorithms ) which you cannot use here. The problem is that Dmitry's MPSC can be temporarily made smaller by a Push. During Push, it goes through a phase where previously pushed nodes are inaccessible to the popper. This is fine if your popper is something like a worker thread that just spins in a loop popping nodes, because it will eventually see them, but in a case like this I need the guarantee that anything previously pushed is definitely visible to the popper).

In the fast path (no contention), lock is one CAS and unlock is one fetch_add (basically the same as a CAS). That's certainly not the cheapest mutex in the world but it's not terrible.

Now, clever readers may have already noticed that we don't actually need the LOCKED bit at all. I left it in because it's a nice illustration of the funny state changes that happen in the race case, but in fact we can set LOCKED=1 , and then all our adds of (LOCKED-WAITER) go away, which gives us the simpler code :


    void lock(ThreadNode * tn)
    {   
        int zero = 0;
        if ( m_state($).compare_exchange_strong(zero,1,std::mo_acq_rel) )
            return;
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                    
        // inc waiter count :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the WAITER inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

and it's obvious that m_state is just an entry count.

In fact you can do an even simpler version that doesn't require cancel_wait :


    void lock(ThreadNode * tn)
    {                       
        // inc waiter count :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // got the lock
            return;
        }
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = NULL;
        rl::backoff bo;
        for(;;)
        {
            pNode = MPSC_FIFO_Pop(&m_list);
            if ( pNode ) break;
            bo.yield($);
        }

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

where you loop in the unlock to catch the race. This last version is not recommended, because it doesn't allow spinning before going to sleep, and requires a loop in unlock.

One more note : all of these suffer from what Thomasson calls "SCHED_OTHER". SCHED_OTHER is a Linux term for one of the schedulers in that OS. What it means in this context is that we are not respecting thread priorities or any more exotic scheduling that OS wants, because each thread here is waiting on its own event (and by "event" I mean "generic OS waitable handle"). If what you really want is a FIFO mutex then that's fine, you got it, but usually you would rather respect the OS scheduler, and to do that you need all your waiting threads to wait on the same handle.

07-17-11 - CLH list-based lock

The multi-way ticket lock we just did is very similar to some classic spin locks. I found this nice page : scalable synchronization pseudocode ( and parent page at cs.rochester ) ( and similar material covered here, but with nice drawings : Mutual Exclusion: Classical Algorithms for Locks (PDF) ).

I wanted to see how the classic MCS queue lock compares to my per-thread-mpsc lock ; the answer is not much. The classic queue locks are really closer kin to the multi-way ticket lock. I'll try to show that now. The MCS lock is probably more well known, but the CLH lock is simpler, so I'll deal with that.

The idea of these locks is to avoid the heavy cache contention inherent to the basic single-variable gate locks. To solve that, the idea is to use a distributed gate; basically one gate variable per waiter, and it's the responsibility of the unlocker to open the gate for the next waiter. So there has to be some kind of linked list so that the unlocker can find the next waiter. And these locks will be inherently FIFO and SCHED_OTHER and all that. (these are really only appropriate for kernels or kernel-like environments)

The CLH algorithm is usually described as a linked list, with the "head" of the list being the node that currently has access to the mutex, and the "tail" being the variable held in the lock struct. When new waiters come in, they tack onto the tail, thus it's FIFO.

There's a node for each waiter, and each node contains the gate for the guy after me :


struct clh_node
{
    // list is linked from tail backwards :
    std::atomic<clh_node *> prev;
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node()
    {
        prev($).store(0);
        succ_must_wait($).store(0);
    }
};

We also need a way of providing a node per-thread... *per-lock*! This is different from my event-queue mutex, which just needs a node *per-thread*; the reason is that the nodes in CLH keep getting used even after you unlock, so you can't immediately reuse them. However, you can free some node when you unlock - just not necessarily the one you passed in. So anyhoo, we need some struct to pass this node in for us, here it is :

struct ThreadNode
{
    std::atomic<clh_node *> pNode;
    
    ThreadNode()
    {
        pNode($).store( new clh_node );
    }
    ~ThreadNode()
    {
        // note that the pNode I delete might not be the one I created
        //  so don't try to hold it by value
        clh_node * n = pNode($).exchange(0, std::mo_relaxed);
        delete n;
    }
};

this could be in the TLS, or it could be in the mutex::guard , or whatever.

Okay, now that we have our helpers we can write the code. When the mutex is held, the tail node will have succ_must_wait = 1 , when you take the lock you stick yourself on the tail and then wait on your predecessor. To unlock the mutex you just set succ_must_wait = 0 on yourself, and that allows the guy after you to go :


struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_tail;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_tail($).store( new clh_node );
    }
    ~clh_mutex()
    {
        clh_node * n = m_tail($).exchange(0);
        delete n;
    }
    
    void lock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        //me->prev($).store(0, std::mo_relaxed );
        clh_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        me->prev($).store(pred, std::mo_relaxed );
        
        // wait on predecessor's flag -
        //  this is why pred can't free himself
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
    }
    
    void unlock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
        
        clh_node * pred = me->prev($).load(std::mo_relaxed);
        me->succ_must_wait($).store( 0, std::mo_release );
        // take pred's node :
        //  this leaves my node allocated, since succ is still looking at it
        I->pNode($).store( pred, std::mo_relaxed );
    }

};

Okay, I think this is reasonably self-explanatory. BTW the reason the classical locks are the way they are is often to avoid test-and-set ops, which older architectures didn't have or made very expensive; here we use only one exchange, and the rest is just loads and stores.

That matches the classical algorithm description, but it's a lot more expensive than necessary. The first thing you might notice is that we don't actually need to store the linked list at all. All we need to do is get "pred" from lock to unlock. So you can either store it in the mutex struct, or put it in the "guard" (ThreadNode in this case) ; I think putting it in the guard is better, but I'm going to put it in the mutex right now because it's more analogous to our next step :


struct clh_node
{
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node() { succ_must_wait($).store(0); }
};

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_lock;
    std::atomic<clh_node *> m_lock_pred;
    std::atomic<clh_node *> m_lock_holder;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store( new clh_node );
        m_lock_pred($).store( 0 );
        m_lock_holder($).store( 0 );
    }
    ~clh_mutex()
    {
        clh_node * n = m_lock($).exchange(0);
        delete n;
    }

    clh_node * alloc_slot()
    {
        return new clh_node;
    }
    void free_slot(clh_node * p)
    {
        delete p;
    }
    
    void lock()
    {
        clh_node * me = alloc_slot();
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        clh_node * pred = m_lock($).exchange(me, std::mo_acq_rel);
        
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        m_lock_holder($).store(me, std::mo_relaxed );
        m_lock_pred($).store(pred, std::mo_relaxed );
    }
    
    void unlock()
    {
        clh_node * me = m_lock_holder($).load(std::mo_relaxed);
        
        clh_node * pred = m_lock_pred($).load(std::mo_relaxed);
        
        me->succ_must_wait($).store( 0, std::mo_release );

        free_slot( pred );
    }

};

and rather than pass in the nodes I just bit the bullet and allocated them. But now the obvious thing to do is make alloc_slot and free_slot just take & return nodes from an array. But then "me" is just stepping a pointer through an array. So our "linked list" should just be a sequence of adjacent elements in an array :

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    #define NUM_WAYS    16
    // should be cache line sized objects :
    std::atomic<int> succ_must_wait[NUM_WAYS];
    std::atomic<int> m_lock;
    VAR_T(int) m_lock_pred;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store(0);
        succ_must_wait[0]($).store(0);
        for(int i=1;i<NUM_WAYS;i++)
        {
            succ_must_wait[i]($).store(1);
        }
        m_lock_pred($) = 0;
    }
    ~clh_mutex()
    {
    }

    void lock()
    {   
        int pred = m_lock($).fetch_add(1, std::mo_acq_rel);
        pred &= (NUM_WAYS-1);
        
        rl::linear_backoff bo;
        while ( succ_must_wait[pred]($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        // m_lock_pred just remembers my index until unlock
        //  could be a local
        m_lock_pred($) = pred;
    }
    
    void unlock()
    {
        int pred = m_lock_pred($);
        int me = (pred+1)&(NUM_WAYS-1);
        
        // recycle this slot :
        succ_must_wait[pred]($).store(1, std::mo_relaxed);
        
        // free my lock :
        succ_must_wait[me]($).store( 0, std::mo_release );
    }

};

(as usual, m_lock_pred doesn't really belong as a member variable in the lock).

But this is exactly "Anderson's array-based queue lock" that we mentioned at the end of the ticket-lock post, and it's also just a CLH lock with the nodes stuck in an array. This suffers from the big problem that you must have enough array entries for the threads that will touch the lock or it doesn't work (what happens is multiple threads can get into the mutex at the same time, i.e. it doesn't actually provide mutual exclusion).

I don't think this is actually useful for anything, but there you go.

07-17-11 - Atman's Multi-way Ticket Lock

But first - Atman sent me a note about the ticket lock which is related and worth sharing.

Imagine we're in a "high performance computing" type environment: we have a mess of threads running, each pinned to a processor, which is the appropriate situation to use something like a ticket lock. Now they all try to get the lock. What actually happens?

N threads run into the mutex and 1 gets the lock. So the remaining (N-1) go into their spin loops :


while ( m_serving($).load(std::mo_acquire) != my_ticket ) ;

but what is this actually doing? When you try to load that shared variable, what it does is something like :

    is this cache line current?
    okay, read the variable from my copy of the cache line

Now the first guy holding the mutex unlocks it. This marks the cache line dirty, and that is propagated to all the other cores. Now the remaining (N-1) spinners try to read m_serving again. But this time it does :

    is this cache line current?
    no it's not, get the new copy of the cache line
    okay, read the variable from my copy of the cache line

The cache line had to be copied around (N-1) times. You can see the pattern: to do N unlocks with N waiters you wind up doing N^2 cache line copies. Obviously this is not okay for large N.

(note that this is why putting a backoff pause in your spin loop can actually be a performance advantage even on non-hyperthreaded cores - it reduces cache line traffic ; also in the special case of the ticket lock, the waiters actually know how far they are from the front of the list, so they can do "proportional backoff" by subtracting "my_ticket" from "now serving" and pausing for that amount of time).
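A sketch of the proportional-backoff computation (the spins-per-slot constant is a made-up tuning parameter) :

```cpp
// how long to pause before re-checking "now serving" :
inline unsigned backoff_spins(unsigned my_ticket, unsigned now_serving)
{
    // unsigned subtraction handles counter wraparound correctly :
    unsigned distance = my_ticket - now_serving;
    const unsigned SPINS_PER_SLOT = 100;   // hypothetical tuning constant
    return distance * SPINS_PER_SLOT;
}
```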

Okay, but we can do better. Leave m_ticket as a single gate variable, but split m_serving into many "ways" (ways in the cache sense). So depending on your ticket # you look at a different serving number. This is just like a very large bakery - rather than have a single "now serving" sign, we have one for odd tickets and one for even tickets ; you stand on the side of the room with the appropriate sign for you and just read that sign.

The code is :


struct ticket_mutex_ways
{
    enum { NUM_WAYS = 16 };
    std::atomic<unsigned int> m_ticket;
    // note : m_serving objects should be cache-line-size padded :
    std::atomic<unsigned int> m_serving[NUM_WAYS];
    VAR_T(int)  m_lock_holder;
    
    ticket_mutex_ways()
    {
        m_ticket($).store( 0 );
        for(int i=0;i<NUM_WAYS;i++)
            m_serving[i]($).store( 0 );
    }
    ~ticket_mutex_ways()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel); // mo_acquire ; *1
    
        int way = me % NUM_WAYS;
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving[way]($).load(std::mo_acquire);
            if ( me == cur )
                break;
        
            bo.yield($);
        }
        
        m_lock_holder($) = me;
    }

    void unlock()
    {
        int next = m_lock_holder($) + 1;
        int way = next % NUM_WAYS;
        
        m_serving[way]($).store(next,std::mo_release);
    }
};

the key thing is that in the spin loop you are only touching the serving variable in your way, and there is no cache contention with up to NUM_WAYS lockers. (as noted - you need cache line size padding between the variables)

(*1 standard caveat here - this only needs to be acquire, but then you need other mechanisms to protect your mutex, so beware)
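One way to get that cache-line padding in portable C++ (64 bytes is an assumption about the line size; std::hardware_destructive_interference_size is the standard's way to ask, where available) :

```cpp
#include <atomic>

// force each "serving" counter onto its own cache line :
struct alignas(64) padded_serving
{
    std::atomic<unsigned int> value;
};

// the NUM_WAYS entries no longer share lines :
padded_serving g_serving[16];
```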

Note that "m_lock_holder" really doesn't belong in the mutex structure; it's an annoyance that I have to put it there; it should just be held on the stack until the unlock. If you use some kind of "guard" class to wrap your mutex lock/unlock it would be more appropriate to store this in the guard. (in fact it is probably good class design to make lock & unlock take the "guard" as a parameter, because that allows more flexibility).

This is pretty cool, I think it's about as fast as you can get if you don't care about your mutex being rather large. One nice thing about it is that you don't need to know your number of CPUs. There are a lot of similar algorithms that break unless NUM_WAYS is >= number of threads. (for example, you can do basically the same thing but just use a bool in each way to indicate locked or not, and that works fine as long as num threads is < num ways; that would be "Anderson's array-based queue lock" BTW). With Atman's algorithm, you can even choose to make WAYS less, and you will be fast as long as the number of contenders is less than the # of ways.

7/16/2011

07-16-11 - Ticket FIFO Mutex

The Linux kernel internally uses a FIFO spinlock that they call "ticket lock". A ticket or "bakery" algorithm is quite a common pattern so we'll have a glance.

The analogy is the easiest way to understand it. There's an atomic ticket machine, when you walk into the shop you grab a ticket (and the machine increments itself). On the wall is a "now serving" sign that counts up as people turn in their tickets.

This can be implemented most obviously using two ints :


struct ticket_mutex2
{
    // (*0)
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;

    ticket_mutex2()
    {
        m_ticket($).store( 0 );
        m_serving($).store( 0 );
    }
    ~ticket_mutex2()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving($).load(std::mo_acquire);
            if ( me == cur )
                return;
        
            bo.yield($);
            
            // (*1)
        }
    }

    void unlock()
    {
        // (*2)
        // my ticket must match m_serving
        // (*3)
        //m_serving($).fetch_add(1,std::mo_release);
        unsigned int cur = m_serving($).load(std::mo_relaxed);
        m_serving($).store(cur+1,std::mo_release);
    }
};

*0 : obviously you could put the two counters into words and mush them in one int (Linux on x86 used to put them into bytes and mush them into one word), but it's actually a better demonstration of the algorithm to have them separate, because it's a weaker constraint. Lockfree algorithms always continue to work if you mush together variables into larger atomic pieces, but rarely continue to work if you separate them into smaller independent atomic pieces. So when you're trying to show the fundamental requirements of an algorithm you should use the minimum mushing-together required.
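For concreteness, here's a sketch of the mushed-together layout: ticket in the low word, now-serving in the high word (the same layout the keyed-event version below uses). Helper names are illustrative, and this sketch ignores low-word overflow into the high word at 0xFFFF, which a real fetch_add_low_word has to deal with :

```cpp
#include <atomic>

// grab a ticket : inc the low word, return the prior state
inline unsigned take_ticket(std::atomic<unsigned> & state)
{
    return state.fetch_add(1, std::memory_order_acq_rel);
}

// does this state's ticket match its now-serving word ?
inline bool ticket_matches_serving(unsigned state)
{
    return (state & 0xFFFF) == (state >> 16);
}

// unlock : inc the now-serving word
inline void advance_serving(std::atomic<unsigned> & state)
{
    state.fetch_add(1u << 16, std::memory_order_release);
}
```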

(BTW I don't remotely claim that any of the things I've posted have the minimum synchronization constraints required by the algorithm, but that is always the goal).

*1 : you might be tempted to put a Wait here using eventcount or something, but you can't. The problem is if multiple threads go to sleep there, only the one thread that has the next ticket will be able to take the lock. So if you use a generic waitset, you might wake the wrong thread, it won't be able to get in, and you will deadlock. More on this in a moment.

*2 : m_serving is actually protected by the mutex, it is only ever modified by the mutex holder. m_ticket is actually the barrier variable for acquiring the lock. When you get the lock you could store your ticket id as a member in the lock struct and at unlock it will be equal to m_serving.

*3 : you can of course use an atomic increment on serving but because of *2 it's not necessary, and a simple load & inc is cheaper on some architectures (and as per *0, it's a weaker constraint so we prefer to demonstrate its correctness here).

Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair), which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread who is next in line gets swapped out, then no currently running thread can get the lock, and we don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is okay in the kernel because the kernel controls which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock.

So if we want to turn this into a FIFO lock that works in user space, we have to have a sleep/wakeup mechanism.

I don't think this is actually an awesome way to write your own FIFO lock, but it's a nice demonstration of the usefulness of NT's Keyed Events, so I'm going to do that.

You need to get the secret functions :


template <typename ret,typename T1,typename T2,typename T3,typename T4>
ret CallNtInternal(const char * funcName,T1 arg1,T2 arg2,T3 arg3,T4 arg4)
{
    typedef ret NTAPI t_func(T1,T2,T3,T4);

    t_func * pFunc = (t_func*) GetProcAddress( LoadLibrary( TEXT("ntdll.dll") ), funcName );
    ASSERT_RELEASE( pFunc != NULL );

    return (*pFunc) (arg1,arg2,arg3,arg4);
}

#define MAKE_NTCALL_4(ret,func,type1,type2,type3,type4) ret func(type1 arg1,type2 arg2,type3 arg3,type4 arg4) { return CallNtInternal<ret>(#func,arg1,arg2,arg3,arg4); }

MAKE_NTCALL_4( LONG,NtCreateKeyedEvent,OUT PHANDLE, IN ACCESS_MASK, IN PVOID, IN ULONG );
MAKE_NTCALL_4( LONG,NtReleaseKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER ); 
MAKE_NTCALL_4( LONG,NtWaitForKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );

and then you can make the lock :


struct ticket_mutex2_keyed
{
    std::atomic<unsigned int> m_state;
    // ticket is bottom word
    // now serving is top word

    HANDLE  m_keyedEvent;

    // keyed event must have bottom bit off :
    enum { WAITKEY_SHIFT = 1 };

    ticket_mutex2_keyed()
    {
        m_state($).store( 0 );
        NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
    }
    ~ticket_mutex2_keyed()
    {
        CloseHandle(m_keyedEvent);
    }

    void lock()
    {
        // grab a ticket and inc :
        unsigned int prev = fetch_add_low_word(m_state($),1);
    
        // if ticket matches now serving I have the lock :
        if ( top_word_matches_bottom(prev) )
            return;
    
        // wait on my ticket :
        unsigned int ticket = prev&0xFFFF;
        intptr_t waitKey = (ticket<<WAITKEY_SHIFT);
        NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
    }

    void unlock()
    {
        // inc now serving :
        unsigned int prev = m_state($).fetch_add((1<<16),std::mo_release);

        // get a local copy of the "now serving" that I published :
        prev += (1<<16);

        // if someone already holds the next ticket (ticket count is ahead of now-serving) :
        if ( ! top_word_matches_bottom(prev) )
        {
            // wake up the one after me in the sequence :
            unsigned int next = (prev>>16);
            intptr_t waitKey = (next<<WAITKEY_SHIFT);
            NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
        }
    }
};

Note that we have had to push our two state variables together now, because previously unlock only touched the "now serving" counter, but now it also has to check against the ticket counter to see if anyone is waiting.

Also note that we are taking advantage of the fact that ReleaseKeyedEvent is blocking. If the Release happens before the Wait, the signal is not lost - the unlocking thread blocks until the Wait is entered.

Exercise for the reader : make it possible for lock to spin a while before going into the wait.

I made use of these self-explanatory helpers :


bool top_word_matches_bottom( unsigned int x )
{
    unsigned int t = _lrotl(x,16);
    return t == x;
}

unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
{
    unsigned int old = state($).load(std::mo_relaxed);
    while ( ! state($).compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::mo_acq_rel) ) { }
    return old;
}

which do what they do.
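Those helpers use MSVC's _lrotl and the Relacy "($)" annotations; in plain portable C++ they'd look something like this (my own portable sketch, not drop-in for the code above) :

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

bool top_word_matches_bottom( uint32_t x )
{
    // rotate by 16 == swap the two 16-bit halves :
    uint32_t t = (x << 16) | (x >> 16);
    return t == x;
}

// add "inc" to the low 16-bit word only, leaving the top word untouched :
uint32_t fetch_add_low_word( std::atomic<uint32_t> & state, uint32_t inc )
{
    uint32_t old = state.load(std::memory_order_relaxed);
    while ( ! state.compare_exchange_weak( old,
                ((old + inc) & 0xFFFF) | (old & 0xFFFF0000),
                std::memory_order_acq_rel ) ) { }
    return old; // previous value, like fetch_add
}
```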

Obviously on Linux you could use futex, but there are too many platforms that have neither KeyedEvent nor futex, which makes using them not very attractive.

Some links :

Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
spinlocks XXXKSE What to do
Linux x86 ticket spinlock
git.kernel.org - linux/kernel/git/torvalds/linux-2.6.git commit
futex(2) - Linux manual page

7/15/2011

07-15-11 - Review of many Mutex implementations

This is gonna be a long one. The point of this is not that you should go off and implement your own mutex (don't). The point is that it's educational to understand this simple case, because the issues are the same in other domain-specific threading problems. A lot of the time people think they are being "safe" by using the OS mutex, but then they still do some atomic CAS on a bool and think "it's no big deal, it's just a CAS and a loop" - and wind up creating all the same races, livelocks, and thrashing without being careful about it.

So, I'm going to present a (hopefully) working implementation of a mutex/lock and then talk about the issues with that type of implementation.

classic single-variable CAS spinlock :


class spinmutex
{
public:

    spinmutex() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        unsigned int prev = 0;
        // (*1)
        while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acquire) )
        //while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acq_rel) )
        {
            b.yield($);
            prev = 0;
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

*1 : I believe the CAS only needs to be acquire, but then you do need some other mechanism to keep stores from moving out the top of the mutex (such as a #loadstore which C++0x doesn't provide), and some way to prevent mutexes from moving to overlap each other (which could lead to deadlock). So it's probably easiest to just make the CAS be acq_rel even though it doesn't need to be. (see previous post on the barriers that we think a mutex needs to provide). Most of the mutexes here have this issue and we won't mention it again.

For some reason people love to implement the basic spinlock with CAS, but in fact you can do it just with exchange :


class spinmutex2
{
public:

    spinmutex2() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        while ( m_lock($).exchange(1,std::mo_acquire) )
        {
            b.yield($);
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

which is cheaper on some platforms.
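If you want to be maximally portable, the exchange version maps directly onto std::atomic_flag - test_and_set is just exchange(true), and atomic_flag is the one type guaranteed lock-free on every C++ implementation. A minimal sketch of mine :

```cpp
#include <atomic>
#include <thread>
#include <cassert>

// spin lock on std::atomic_flag : test_and_set is an exchange(true)
class spinmutex3
{
public:

    void lock()
    {
        while ( m_lock.test_and_set(std::memory_order_acquire) )
            std::this_thread::yield(); // stand-in for a proper backoff
    }

    void unlock()
    {
        m_lock.clear(std::memory_order_release);
    }

private:
    std::atomic_flag m_lock = ATOMIC_FLAG_INIT;
};
```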

So, there are a few problems with spinmutex. The most obvious is that you have to just spin, the threads which can't get in don't go to sleep. The other problem is that it doesn't respect OS scheduling directives (thread priorities) and it's quite un-"fair", in that it doesn't order access at all, and in fact greatly favors the last thread in, since it's most likely to be getting CPU time.

So we want to make it sleep. The pattern for making a lock-free primitive sleep is to change :


while ( ! trylock() ) { spin }

to :

if ( ! trylock() )
{
  register desire to wait

  if ( trylock() ) return (cancel wait)

  wait;
}

that is, a double-checked wait. (and then perhaps loop, depending on whether waking from wait implies the condition). The reason you need to do this is that before the "register waiter" has finished putting you in the wait list, the lock may become open. If you didn't try to acquire the lock again, you would miss the wake signal.
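This double-checked wait shape is exactly what a mutex + condition variable pair packages up for you - the condvar's internal protocol plays the role of "register waiter", and the predicate recheck is the double-check. A sketch of a sleeping lock built that way (all names mine; this is for illustrating the pattern, not a lock you'd ship) :

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <cassert>

// a sleeping lock built from the double-checked-wait pattern.
// (the internal std::mutex guards the waitset + flag,
//  it is not the lock we are implementing)
class sleeping_lock
{
public:

    void lock()
    {
        std::unique_lock<std::mutex> g(m_waitset);
        // wait() re-checks the predicate after every wakeup,
        // under m_waitset - that's the double-check :
        m_cv.wait(g, [this]{ return ! m_locked; });
        m_locked = true;
    }

    void unlock()
    {
        {
            std::lock_guard<std::mutex> g(m_waitset);
            m_locked = false;
        }
        m_cv.notify_one();
    }

private:
    std::mutex m_waitset;
    std::condition_variable m_cv;
    bool m_locked = false;
};
```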

So the easiest way to transform our spin mutex into one that sleeps is with eventcount :

eventcount sleeping exchange lock :


class ecmutex1
{
public:

    ecmutex1() : m_lock(0)
    {
    }
    
    void lock()
    {
        while ( m_lock.exchange(1,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            if ( m_lock.exchange(1,std::memory_order_acquire) == 0 )
                return; // got the lock
            
            // wait for signal :
            m_ec.wait(ec_key);
            // now retry
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock.load(std::memory_order_relaxed) == 1 );
        m_lock.store(0,std::memory_order_release);
        m_ec.signal();
    }

private:
    //std::atomic<unsigned int> m_lock;
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
};

now, this is okay, but there are a few problems.

One is that the signal for the eventcount is just a "hey wake up and see if you can get the lock" , it's not a "hey wake up you have the lock". That means it can suffer from what I call "thrashing" or spurious wakeup (this is not technically "spurious wakeup" , a true "spurious wakeup" would be a wakeup that didn't come from calling signal()). You might wake a thread, it fails to get the lock, and goes right back to sleep. So that sort of sucks.

This issue is closely related to a fairness problem; we might be able to ensure some level of fairness through eventcount, but that is ruined by the fact that spinning threads can jump in and grab the lock before the one we signalled.

Another issue is that we are calling "signal" every time even when there is no waiter. This is a minor issue because your eventcount probably checks for waiters and does nothing (if it's a good implementation - a bad implementation might implement signal by immediately taking a mutex, in which case you really want to avoid calling it if you have no waiters).

Anyway, Thomasson showed how to improve this last little bit of inefficiency. You use one bit to flag locking and one bit to flag waiting, and you only need to signal the eventcount if the waiting bit is on :


class ecmutex2
{
public:

    enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITING = (LOCKED|2) };

    ecmutex2() : m_lock(UNLOCKED)
    {
    }
    
    void lock()
    {
        unsigned int prev = 0;
        // this CAS could be a bit-test-and-set :
        while ( ! m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            // change LOCKED->LOCKED_WAITING (and then we will wait)
            // or change UNLOCKED->LOCKED_WAITING (and we take the lock)
            prev = m_lock.exchange(LOCKED_WAITING,std::memory_order_acquire);
            if ( prev == UNLOCKED )
                return;
                
            m_ec.wait(ec_key);
            
            // now retry
            prev = 0;
        }
    }
    
    void unlock()
    {
        unsigned int local = m_lock.load(std::memory_order_relaxed);
        RL_ASSERT( local & LOCKED );
        m_lock.store(UNLOCKED,std::memory_order_release);
        // could always signal :
        //m_ec.signal();
        // faster because it avoids an atomic :
        unsigned int check = m_lock.load(std::memory_order_relaxed);
        if ( (local|check) & LOCKED_WAITING )
        {
            m_ec.signal();
        }
    }

private:
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
    
};

you have to use a CAS (not an exchange) to take the lock initially, because you can't turn off the WAITING bit. The entire advantage of this method is the fact that in the uncontended case (no waiters), unlock only does a load_relaxed instead of the atomic op needed in eventcount to check if signal is necessary.

Note : in some cases it may be an improvement to spin a bit before going to sleep in the lock() side. It also can be an optimization to spin a bit before signalling in the unlock (to see if the WAITING flag turns off) - however, both of these hurt fairness, they make the mutex more LIFO than FIFO, which can indeed be an optimization in many cases, but is also dangerous (more notes on this issue elsewhere). If a thread was already asleep on the mutex, it will tend to stay asleep forever if there are other awake threads that keep trading the mutex around.

Anyhoo, you can implement the exact same thing using windows Event instead of eventcount :

Three-state mutex using Event :


// Thomasson's simple mutex based on windows event :
struct win_event_mutex
{
    std::atomic<int> m_state; // = 0
    HANDLE m_waitset; // auto reset event; set to false

    win_event_mutex()
    {
        m_state($) = 0;
        m_waitset = CreateEvent(NULL,0,0,NULL);
    }
    ~win_event_mutex()
    {
        CloseHandle(m_waitset);
    }

    void lock()
    {
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            while ( m_state($).exchange(2,rl::mo_acquire) )
            {
                WaitForSingleObject(m_waitset, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,rl::mo_release) == 2 )
        {
            SetEvent(m_waitset);
        }
    }
};

the three states are again "0 = unlocked", "1 = locked (exclusive)" , "2 = contended (locked with waiter)".

(I got this from Thomasson but I believe it's actually an old algorithm; I've seen it discussed in many blogs. There is a slightly subtle state transition where m_state can be 2 (contended) and then someone comes in to lock() and exchanges it to 1 (locked, uncontended). That seems bad, because there is a waiter which might now miss a signal (since we turned off the contended flag) - but in fact it's okay: if that happens, we will then step in and take the lock in the contended state (by exchanging in 2), and when we unlock we will signal the waiter. So this is another way of doing "unfair" acquisition (the later-entering thread gets the lock even though there was already a waiter), but it is not a lost wakeup.)

The unlock is slightly more expensive because it's an exchange instead of just a store. This mutex is "fair" (as fair as Win32 native primitives ever are) for waiters that actually get into the waitset, because all the threads wait on the same Event and thus get the OS priorities and boosts and so on. But it still doesn't hand off the lock in the wakeup and so on. (I'll call a mutex "pseudo-fair" if it at least respects the OS scheduler - that is, if it's at least as fair as the OS mutex.)

BTW this mutex is very similar to the futex-based mutex that Bartosz described here ; in general anywhere you see Win32 event you could use futex on Linux, since futex is a superset of event.

We're going to take a diversion now and look at some other topics in mutex design - in particular, avoiding allocation of OS events unless/until they're actually needed (Win32 CRITICAL_SECTION does this, for example).

Event mutex that makes event on demand :


// Thomasson's simple mutex based on windows event :
// version that does event creation on demand
struct win_event_mutex_ondemand
{
    std::atomic<int> m_state; // = 0
    std::atomic<HANDLE> m_waitset; // auto reset event; set to false

    win_event_mutex_ondemand()
    {
        m_state($) = 0;
        m_waitset($) = 0;
    }
    ~win_event_mutex_ondemand()
    {
        if ( m_waitset($) != 0 )
            CloseHandle(m_waitset($));
    }

    void lock()
    {
        if ( m_state($).exchange(1,std::mo_acq_rel) )
        {
            HANDLE h = m_waitset($).load(std::mo_acquire);
            if ( h == 0 )
            {
                HANDLE newH = CreateEvent(NULL,0,0,NULL);
                if ( m_waitset($).compare_exchange_strong(h,newH,std::mo_acq_rel) )
                {
                     h = newH;
                }
                else
                {
                    // loaded h
                    RL_ASSERT( h != 0 );
                    CloseHandle(newH);
                }
            }
            RL_ASSERT( h != 0 );
            while ( m_state($).exchange(2,std::mo_acq_rel) )
            {
                WaitForSingleObject(h, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,std::mo_acq_rel) == 2 )
        {
            HANDLE h = m_waitset($).load(std::mo_relaxed);
            RL_ASSERT(h != 0 );
            SetEvent(h);
        }
    }
};

This is just the same as the previous win_event_mutex , except that it makes the event on first use, using the modern speculative-creation singleton method.

This works fine, unfortunately it is difficult to ever free the event, so once we make it we have it forever. If your goal is to do something like have 4k mutexes that only use 32 OS events, you can't do it this way. (in general you only need as many waitable handles as you have threads, and you might want to have many more lockable objects than that).

I implemented one way of making a mutex that releases its event when not needed, but it's a bit ugly :

Event mutex that only holds event during contention :


struct win_event_mutex2
{
    struct state // 64 bit double-word 
    {
        // two 32 bit words :
        int lock; HANDLE event; 

        state(int l,HANDLE h) : lock(l), event(h) { } 
        state() : lock(0), event(0) { } 
        bool operator == (const state & rhs) const { return lock == rhs.lock && event == rhs.event; }
    };
    
    std::atomic<state> m_state; // = 0

    win_event_mutex2()
    {
    }
    ~win_event_mutex2()
    {
        state local = m_state($);
        if ( local.event != 0 )
            CloseHandle(local.event);
    }

    void lock()
    {
        HANDLE newH = 0;
            
        state oldState = m_state($).load(std::mo_acquire);
        state newState;
        
        for(;;)
        {
            // increment the lock count :
            newState = oldState;
            newState.lock = oldState.lock+1;
            
            // if there is contention, make sure there is an event to wait on :
            if ( newState.lock > 1 && newState.event == 0 )
            {
                if ( newH == 0 )
                {
                    newH = CreateEvent(NULL,0,0,NULL);
                }
                
                newState.event = newH;              
            }
            else if ( newState.lock == 1 && newState.event == newH )
            {
                newState.event = 0;
            }

            // try to swap in the lock count and event handle at the same time :
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;

        }
        
        if ( newH && newH != newState.event )
        {
            // I made an event but didn't use it
            CloseHandle(newH);
        }
                
        if ( oldState.lock == 0 )
        {
            // I got the lock
            RL_ASSERT( newState.lock == 1 );
            return;
        }
        
        // lock is contended :
        RL_ASSERT( newState.lock > 1 );
        RL_ASSERT( newState.event != 0 );

        WaitForSingleObject(newState.event, INFINITE);
        
        // I own the mutex now!
    }

    void unlock()
    {
        state oldState = m_state($).load(std::mo_acquire);
        RL_ASSERT( oldState.lock >= 1 );
        state newState(0,0);
        
        // at this moment I own the mutex
        
        for(;;)
        {
            // release the lock, and if we're no longer contended remove the event
            RL_ASSERT( oldState.lock >= 1 );
            newState = oldState;
            newState.lock--;
        
            if ( newState.lock == 0 && newState.event != 0 )
            {
                newState.event = 0;
            }
        
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;
        }

        if ( oldState.event )
        {
            if ( newState.event == 0 )
            {
                RL_ASSERT( newState.lock == 0 );

                CloseHandle(oldState.event);
            }
            else
            {
                RL_ASSERT( newState.lock > 0 );
                SetEvent(oldState.event);
            }
        }
    }
};

This is always the cop-out method for implementing lock free algorithms - take the two variables that you need to stay in sync (in this case the lock count and the presence of a waitable event) - and just mush them together into a bigger word and CAS it atomically. That way you don't have to think carefully about all the funny state transition possibilities. I'm sure someone could do a better version of this that's not so expensive in atomic ops on the fast path (no contention).

(btw it's safer / more portable to make "state" just be a uint64 and do the packing manually with shifts and ors, don't use a struct inside std::atomic, it causes too many headaches and is too risky)

(to be clear, the problem with this algorithm is the no-contention fast path is way more expensive than any of our previous mutexes).
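A sketch of what that uint64 packing might look like - here I pack a 32-bit lock count in the top word and a 32-bit event id (eg. an index into an event pool) in the bottom word; the layout and names are my own choice :

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

// pack {lock count, event id} into one uint64 so they CAS together,
// instead of putting a struct inside std::atomic :

inline uint64_t pack_state( uint32_t lock, uint32_t event )
{
    return ( (uint64_t)lock << 32 ) | event;
}
inline uint32_t state_lock ( uint64_t s ) { return (uint32_t)(s >> 32); }
inline uint32_t state_event( uint64_t s ) { return (uint32_t)(s & 0xFFFFFFFFu); }

// eg. the unlock-side decrement becomes one CAS loop on the packed word :
inline uint64_t dec_lock_count( std::atomic<uint64_t> & state )
{
    uint64_t old = state.load(std::memory_order_relaxed);
    uint64_t nw;
    do {
        uint32_t lock = state_lock(old) - 1;
        // drop the event when the last holder leaves :
        uint32_t ev = ( lock == 0 ) ? 0 : state_event(old);
        nw = pack_state(lock,ev);
    } while ( ! state.compare_exchange_weak(old,nw,std::memory_order_acq_rel) );
    return old; // previous packed state
}
```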

Also note : you shouldn't actually use CreateEvent/CloseHandle with something like this, you should have a recycling pool of OS Events that you alloc/free from an event cache of some kind. As I said before, you only need one per thread. If you do use a pool like this, you have to be a bit careful about whether they can be recycled in signalled state, and whether you want to try to Reset them at some point (beware someone could still be in the process of waking up from it), or just make your algorithm work okay with a pre-signalled event, or something.

There is one way this mutex is better than any previous one - when a thread receives a wakeup, it is never bogus; it automatically has the lock when it is woken up. Also under contention there can be no "stolen locks" from threads that jump the queue and quickly grab the atomic - one of the sleeping threads will always be woken.

(btw "pseudo-fairness" like this is not always better; in the case that all your threads are equivalent workers, you actually want a LIFO mutex, because LIFO keeps caches hotter and makes thread switches less likely. However, some code can be stalled indefinitely by LIFO mutexes, so they are very dangerous in the general case.)

Okay, that's enough mutexes for one post, we'll do some followups in a later one.

7/14/2011

07-14-11 - Some obscure threading APIs

Futex :

Futex from a Win32 programmer's perspective is an enhanced version of the windows Event. It has absolutely nothing to do with a "fast user space mutex". It's a much lower level primitive - it's a "waitset" if you like (more about "waitset" in a future post). Basically it lets you put a thread to sleep with a Wait(), or wake up one or more threads. It has several advantages over Win32 Event which make it very nice.

Futex is associated with the address of an int. This means you don't have to actually create a Futex object, any int in your system can be used as the handle for a waitable event. This is nice.

Futex Wait atomically checks the int vs. a certain value. The basic futex Wait op is :

atomically {
  if ( *address == check_value ) Wait();
  else return;
}
this atomic check before waiting is exactly what you want for implementing lots of threading primitives (mutex, conditionvar, etc.) so that's very handy. The normal thing you would write is something like :

thread 1:

if ( queue empty / mutex locked / whatever )
{
    Wait();
}

thread 2:

push queue / unlock mutex / whatever
Signal();

but that contains a race. You can fix it very easily with futex by passing in the condition to check to the Wait, like :

thread 1 :
if ( num_items == 0 )
    futex_wait( &num_items, 0 );

thread 2 :
num_items++;
Signal();

(note that the alternative to this is a prepare_wait/wait pair with a double-check of the condition after the prepare_wait, and signal applies to anyone who has prepared, not anyone who has waited)

Futex Wake can wake up N threads (sadly Win32 event only provides "wake 1" in the robust auto-reset mode). That's nice.

Some reference :
Futexes are tricky (PDF) (no they're not, BTW)
Thin Lock vs. Futex - Bartosz Milewski's Programming Cafe
Mutexes and Condition Variables using Futexes
futex(2) - Linux manual page

Now some Win32 :

SignalObjectAndWait :

SignalObjectAndWait (Win2k+).

This seems pretty hot, because you would think it was atomic (signal and wait in one kernel op), but it is *NOT* :

Note that the "signal" and "wait" are not guaranteed to be performed as an atomic operation. Threads executing on other processors can observe the signaled state of the first object before the thread calling SignalObjectAndWait begins its wait on the second object.

which means it's useless. It's just the same as calling Signal() then Wait(). Maybe it's less likely that you get swapped out between the two calls, which would reduce thread thrashing and also reduce the number of system calls, but it does not help you with correctness issues (an actually atomic SignalAndWait would let you implement things differently and solve some lost-wakeup problems).


thread1 :
  Signal(event1);
  /* (!1) */
  Wait(event2);

thread 2:
  Wait(event1);
  Signal(event2); // (!2)
  Wait(event2); // ! stolen wakeup

So first of all, the separate Signal & Wait is a performance bug: if thread1 loses the CPU at (!1) (or immediately upon signalling), then when execution transfers back to thread1 it goes straight into its Wait - a wasted thread switch, which is why you'd prefer to do these things atomically. But it's even worse than that: at (!2) we intended that Signal to go to thread1 and wake him up, but he isn't in his Wait yet. Because SignalAndWait is not atomic we have a race. If it were atomic, code like this could actually be correct.

Windows NT Keyed Events (NtWaitForKeyedEvent/NtReleaseKeyedEvent) :

This is an NT internal API (it's in NTDLL since Win XP, so despite being internal-only it's actually more usable than the Win2k+ or Vista+ stuff (revision : duh, brain fart, not true, 2k+ is fine, only Vista+ stuff is problematic)). It's a very nice extension to the Win32 Event. Basically it lets you associate a value with the Wait and Signal.

NtWaitForKeyedEvent : analogous to Wait() on an Event, but also takes a value (the value is PVOID but it is just a value, it doesn't deref). Execution is stopped until a signal is received with that particular value.

NtReleaseKeyedEvent : analogous to SetEvent() (on an auto-reset event) - eg. it wakes exactly one thread - except that it actually blocks if no thread is waiting - that is, this always wakes exactly one thread (never zero, which SetEvent can do). Release also takes a value and only wakes threads waiting on that value.

So for example if you want you can use this to make a primitive implementation of futex. You have one global event (futex_event), then FutexWait(address) does WaitForKeyedEvent(futex_event,address) , using address as the value to wait for, and FutexWake(address) does ReleaseKeyedEvent(futex_event,address). (though obviously this is not a proper futex because it can't broadcast and can't check a value and so on).

More usefully, you can create a mutex which applies to an array! Something like :


template<typename T>
struct lockable_vector
{
  HANDLE vector_event; // one keyed event shared by all the slots
  vector<T>   items;
  vector<int> contention;

  T * lock(int i)
  {
    // register interest first, then wait if someone was already in :
    if ( contention[i]++ ) NtWaitForKeyedEvent(vector_event, (PVOID)(i*2), FALSE, NULL);
    return &items[i];
  }

  void unlock(int i)
  {
    // if anyone else registered, hand the slot to them :
    if ( --contention[i] ) NtReleaseKeyedEvent(vector_event, (PVOID)(i*2), FALSE, NULL);
  }
};

(obviously this is just a rough sketch; you have to update "contention" atomically so that you only do the Wait and Release in the right cases) (I'll post a working version of this sometime soon).

(small issue with this : the lowest bit of the key must be unset ; apparently they use it as a flag bit, so you need a 2* or something to use it this way, or just give it aligned pointer addresses as the key)

The point is - you only have one kernel event, and you can mutex on any item in your vector; you can resize the vector and don't have to create/destroy mutexes. Cool! In the typical scenario that you have maybe 2-4 threads accessing an array of 4k items, this is a big win. (in fact this is something that's important in Oodle so I have a few implementations of how to do this without WaitForKeyedEvent).
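You can sketch the same one-waitset-for-many-slots idea portably by letting a single mutex + condvar stand in for the keyed event, with the index as the "key" (all names mine; a real version would avoid the big lock and shard the waitset) :

```cpp
#include <condition_variable>
#include <mutex>
#include <vector>
#include <thread>
#include <cassert>

// lock any slot of a vector using one shared waitset,
// rather than one OS primitive per element :
template<typename T>
struct lockable_vector_sketch
{
    std::mutex m_waitset;            // plays the role of the keyed event
    std::condition_variable m_cv;
    std::vector<T>    items;
    std::vector<char> locked;

    explicit lockable_vector_sketch(size_t n) : items(n), locked(n,0) { }

    T * lock(size_t i)
    {
        std::unique_lock<std::mutex> g(m_waitset);
        // the index is the "key" : sleep until slot i is free
        m_cv.wait(g, [&]{ return ! locked[i]; });
        locked[i] = 1;
        return &items[i];
    }

    void unlock(size_t i)
    {
        {
            std::lock_guard<std::mutex> g(m_waitset);
            locked[i] = 0;
        }
        // notify_all because waiters on other keys share the condvar :
        m_cv.notify_all();
    }
};
```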

KeyedEvent is implemented in the obvious way - when a thread in Win32 waits on a handle it gets stuck in that handle's data structure on a linked list. They just added a "wait_value" field to that linked list. So now when you do ReleaseKeyedEvent instead of just popping off the first thread in the list, it walks through the list in the kernel and tries to find a thread whose "wait_value" matches your signal value.

Some reference :
Slim Reader Writer Locks - Under The Hood - Matt Pietrek - Site Home - MSDN Blogs
NtCreateKeyedEvent
Keyed Events (lockless inc)
Concurrent programming on Windows - Google Books
