7/15/2011

07-15-11 - Review of many Mutex implementations

This is gonna be a long one. The point of this is not that you should go off and implement your own mutex (don't). The point is that it's educational to understand this simple case, because the issues will be the same in other domain-specific threading problems. A lot of the time people think they are being "safe" by using the OS mutex, but then they still do some atomic CAS on a bool and think "it's no big deal, it's just a CAS and a loop", and they are basically creating all the same issues of races and livelocks and thrashing without being careful about it.

So, I'm going to present a (hopefully) working implementation of a mutex/lock and then talk about the issues with that type of implementation.

classic single-variable CAS spinlock :


class spinmutex
{
public:

    spinmutex() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        unsigned int prev = 0;
        // (*1)
        while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acquire) )
        //while ( ! m_lock($).compare_exchange_weak(prev,1,std::mo_acq_rel) )
        {
            b.yield($);
            prev = 0; // compare_exchange wrote the observed value into prev on failure; reset to "unlocked"
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

*1 : I believe the CAS only needs to be acquire, but then you do need some other mechanism to keep stores from moving out the top of the mutex (such as a #loadstore which C++0x doesn't provide), and some way to prevent mutexes from moving to overlap each other (which could lead to deadlock). So it's probably easiest to just make the CAS be acq_rel even though it doesn't need to be. (see previous post on the barriers that we think a mutex needs to provide). Most of the mutexes here have this issue and we won't mention it again.

For some reason people love to implement the basic spinlock with CAS, but in fact you can do it just with exchange :


class spinmutex2
{
public:

    spinmutex2() : m_lock(0)
    {
    }
    
    void lock()
    {
        rl::linear_backoff b;
        while ( m_lock($).exchange(1,std::mo_acquire) )
        {
            b.yield($);
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock($).load(std::mo_relaxed) == 1 );
        m_lock($).store(0,std::mo_release);
    }

private:
    std::atomic<unsigned int> m_lock;
};

which is cheaper on some platforms.

So, there are a few problems with spinmutex. The most obvious is that you have to just spin: the threads that can't get the lock never go to sleep. The other problem is that it doesn't respect OS scheduling directives (thread priorities), and it's quite un-"fair" in that it doesn't order access at all; in fact it greatly favors the last thread in, since that thread is the most likely to be getting CPU time.

So we want to make it sleep. The pattern for making a lock-free primitive sleep is to change :


while ( ! trylock() ) { spin }

to :

if ( ! trylock() )
{
  register desire to wait

  if ( trylock() ) return (cancel wait)

  wait;
}

that is, a double-checked wait (and then perhaps a loop, depending on whether waking from the wait implies the condition). The reason you need the second check is that the lock may become open in the window after the first trylock fails but before "register desire to wait" has actually put you in the wait list; the unlocker sees no waiter and signals nobody, so if you didn't try to acquire the lock again you could sleep waiting for a wakeup that's never coming.
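
To make the "trylock()" in that pattern concrete: for the exchange-based spinlock above it would just be something like this little sketch (the name try_lock is mine; it's not a member of the classes above) :


    // hypothetical try_lock() for spinmutex2 : one attempt, no spinning
    // (this is the "trylock()" used in the pattern above)
    bool try_lock()
    {
        // exchange returns the previous value; 0 means it was unlocked and is now ours
        return m_lock($).exchange(1,std::mo_acquire) == 0;
    }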

So the easiest way to transform our spin mutex into one that sleeps is with eventcount :

eventcount sleeping exchange lock :


class ecmutex1
{
public:

    ecmutex1() : m_lock(0)
    {
    }
    
    void lock()
    {
        while ( m_lock.exchange(1,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            if ( m_lock.exchange(1,std::memory_order_acquire) == 0 )
                return; // got the lock
            
            // wait for signal :
            m_ec.wait(ec_key);
            // now retry
        }
    }
    
    void unlock()
    {
        RL_ASSERT( m_lock.load(std::memory_order_relaxed) == 1 );
        m_lock.store(0,std::memory_order_release);
        m_ec.signal();
    }

private:
    //std::atomic<unsigned int> m_lock;
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
};

now, this is okay, but there are a few problems.

One is that the signal for the eventcount is just a "hey wake up and see if you can get the lock", not a "hey wake up, you have the lock". That means it can suffer from what I call "thrashing" or spurious wakeup (this is not technically "spurious wakeup"; a true spurious wakeup would be a wakeup that didn't come from calling signal() at all). You might wake a thread, it fails to get the lock, and it goes right back to sleep. So that sort of sucks.

This issue is closely related to a fairness problem; we might be able to ensure some level of fairness through eventcount, but that is ruined by the fact that spinning threads can jump in and grab the lock before the one we signalled.

Another issue is that we are calling "signal" every time, even when there is no waiter. This is a minor issue because your eventcount probably checks for waiters and does nothing if there are none (if it's a good implementation; a bad implementation might implement signal by immediately taking a mutex, in which case you really want to avoid calling it when you have no waiters).
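
(If you're wondering what interface "eventcount" provides here, this is a minimal sketch of the semantics being assumed, built naively on a mutex + condvar; the name eventcount_sketch and everything inside it is mine, and a real eventcount would of course be lock-free on the fast path :)


#include <mutex>
#include <condition_variable>

// minimal eventcount sketch - NOT lock-free, just to pin down the interface :
// get() takes a "ticket", wait(key) sleeps only if no signal has happened
// since that ticket was taken, signal() wakes any sleepers
class eventcount_sketch
{
public:
    eventcount_sketch() : m_count(0) { }

    unsigned get()
    {
        std::unique_lock<std::mutex> l(m_mutex);
        return m_count;
    }
    void wait(unsigned key)
    {
        std::unique_lock<std::mutex> l(m_mutex);
        // if a signal already happened since get(), don't sleep :
        while ( m_count == key )
            m_cond.wait(l);
    }
    void signal()
    {
        {
            std::unique_lock<std::mutex> l(m_mutex);
            m_count++;
        }
        m_cond.notify_all();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_cond;
    unsigned m_count;
};


Note that this naive signal() always takes its internal mutex even when nobody is waiting, which is exactly the kind of cost we'd like unlock() to skip.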

Anyway, Thomasson showed how to improve this last little bit of inefficiency. You use one bit to flag locking and one bit to flag waiting, and you only need to signal the eventcount if the waiting bit is on :


class ecmutex2
{
public:

    enum { UNLOCKED = 0, LOCKED = 1, LOCKED_WAITING = (LOCKED|2) };

    ecmutex2() : m_lock(UNLOCKED)
    {
    }
    
    void lock()
    {
        unsigned int prev = 0;
        // this CAS could be a bit-test-and-set :
        while ( ! m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            // double check :
            // change LOCKED->LOCKED_WAITING (and then we will wait)
            // or change UNLOCKED->LOCKED_WAITING (and we take the lock)
            prev = m_lock.exchange(LOCKED_WAITING,std::memory_order_acquire);
            if ( prev == UNLOCKED )
                return;
                
            m_ec.wait(ec_key);
            
            // now retry
            prev = 0;
        }
    }
    
    void unlock()
    {
        unsigned int local = m_lock.load(std::memory_order_relaxed);
        RL_ASSERT( local & LOCKED );
        m_lock.store(UNLOCKED,std::memory_order_release);
        // could always signal :
        //m_ec.signal();
        // faster because it avoids an atomic :
        unsigned int check = m_lock.load(std::memory_order_relaxed);
        if ( (local|check) & LOCKED_WAITING )
        {
            m_ec.signal();
        }
    }

private:
    rl::atomic<unsigned int> m_lock;
    eventcount m_ec;
    
};

you have to use a CAS (not an exchange) to take the lock initially, because you can't turn off the WAITING bit. The entire advantage of this method is the fact that in the uncontended case (no waiters), unlock only does a load_relaxed instead of the atomic op needed in eventcount to check if signal is necessary.
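
(As an aside on the "this CAS could be a bit-test-and-set" comment in the code: here's a sketch of what that might look like with the MSVC intrinsic, pretending m_lock were a raw volatile long; the function name is mine.)


#include <intrin.h>

// hypothetical MSVC-only take on the initial lock attempt : lock bts sets the
// LOCKED bit (bit 0) and returns its previous value, leaving the WAITING bit
// (bit 1) untouched, which is exactly the property we need
inline bool try_lock_bts( volatile long * pLock )
{
    // previous bit value 0 means the lock was free and is now ours
    return _interlockedbittestandset( pLock, 0 ) == 0;
}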

Note : in some cases it may be an improvement to spin a bit before going to sleep in the lock() side. It can also be an optimization to spin a bit before signalling in the unlock (to see if the WAITING flag turns off) - however, both of these hurt fairness, they make the mutex more LIFO than FIFO, which can indeed be an optimization in many cases, but is also dangerous (more notes on this issue elsewhere). If a thread was already asleep on the mutex, it will tend to stay asleep forever if there are other awake threads that keep trading the mutex around.
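
(For illustration, a sketch of the first of those: a drop-in variant of ecmutex2::lock() that spins a bounded amount before sleeping. SPIN_COUNT is a made-up tuning knob, and the slow path is just ecmutex2::lock() repeated verbatim.)


    // hypothetical spin-then-sleep variant of ecmutex2::lock()
    void lock_with_spin()
    {
        const int SPIN_COUNT = 100; // tuning knob, made up
        for ( int spin = 0 ; spin < SPIN_COUNT ; spin++ )
        {
            unsigned int prev = UNLOCKED;
            if ( m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
                return; // got the lock without sleeping
            // a pause/yield hint would go here on real hardware
        }
        // slow path : identical to ecmutex2::lock()
        unsigned int prev = UNLOCKED;
        while ( ! m_lock.compare_exchange_strong(prev,LOCKED,std::memory_order_acquire) )
        {
            unsigned ec_key = m_ec.get();
            prev = m_lock.exchange(LOCKED_WAITING,std::memory_order_acquire);
            if ( prev == UNLOCKED )
                return;
            m_ec.wait(ec_key);
            prev = UNLOCKED;
        }
    }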

Anyhoo, you can implement the exact same thing using windows Event instead of eventcount :

Three-state mutex using Event :


// Thomasson's simple mutex based on windows event :
struct win_event_mutex
{
    std::atomic<int> m_state; // = 0
    HANDLE m_waitset; // auto reset event; set to false

    win_event_mutex()
    {
        m_state($) = 0;
        m_waitset = CreateEvent(NULL,0,0,NULL);
    }
    ~win_event_mutex()
    {
        CloseHandle(m_waitset);
    }

    void lock()
    {
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            while ( m_state($).exchange(2,rl::mo_acquire) )
            {
                WaitForSingleObject(m_waitset, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,rl::mo_release) == 2 )
        {
            SetEvent(m_waitset);
        }
    }
};

the three states are again "0 = unlocked", "1 = locked (exclusive)" , "2 = contended (locked with waiter)".

(I got this from Thomasson, but I believe it's actually an old algorithm; I've seen it discussed in many blogs. There is a slightly subtle state transition where m_state can be 2 (contended) and then someone comes in to lock() and exchanges it to 1 (locked, uncontended). That seems bad, because there is a waiter which might now miss a signal (we turned off the contended flag), but in fact it's okay: if that happens, the newcomer then steps in and takes the lock in the contended state (by exchanging in 2), and when it unlocks it will signal the waiter. So this is another way of doing "unfair" acquisition (the later-entering thread gets the lock even though there was already a waiter), but it is not a lost wakeup.)

The unlock is slightly more expensive because it's an exchange instead of just a store. This mutex is "fair" (as fair as Win32 native primitives ever are) for waiters that actually get into the waitset, because all the threads wait on the same Event and thus get the OS priorities and boosts and so on. But it still doesn't hand off the lock in the wakeup. (I guess I'll call that "pseudo-fair" : a mutex implementation that respects the OS scheduler and is at least as fair as the OS mutex.)

BTW this mutex is very similar to the futex-based mutex that Bartosz described here ; in general anywhere you see Win32 event you could use futex on Linux, since futex is a superset of event.
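
(For reference, here's a rough sketch of the futex flavor of the same three-state mutex, using the raw syscall. This is my paraphrase of the well-known scheme - Drepper's "Futexes Are Tricky" has the canonical version - not code taken from Bartosz's post, and the struct name is made up.)


#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <atomic>

struct futex_mutex_sketch
{
    std::atomic<int> m_state; // 0 = unlocked, 1 = locked, 2 = contended

    futex_mutex_sketch() : m_state(0) { }

    void lock()
    {
        if ( m_state.exchange(1,std::memory_order_acquire) )
        {
            while ( m_state.exchange(2,std::memory_order_acquire) )
            {
                // FUTEX_WAIT atomically re-checks that the state is still 2
                // before sleeping, which prevents the lost-wakeup race
                // (casting the atomic to int* is fine in practice, not strictly portable)
                syscall(SYS_futex, (int *)&m_state, FUTEX_WAIT, 2, NULL, NULL, 0);
            }
        }
    }

    void unlock()
    {
        if ( m_state.exchange(0,std::memory_order_release) == 2 )
        {
            // there was (probably) a waiter; wake one
            syscall(SYS_futex, (int *)&m_state, FUTEX_WAKE, 1, NULL, NULL, 0);
        }
    }
};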

We're going to take a diversion now and look at some other topics in mutex design - in particular, avoiding allocation of OS events unless/until they're actually needed (Win32 CRITICAL_SECTION does this, for example).

Event mutex that makes event on demand :


// Thomasson's simple mutex based on windows event :
// version that does event creation on demand
struct win_event_mutex_ondemand
{
    std::atomic<int> m_state; // = 0
    std::atomic<HANDLE> m_waitset; // auto reset event; set to false

    win_event_mutex_ondemand()
    {
        m_state($) = 0;
        m_waitset($) = 0;
    }
    ~win_event_mutex_ondemand()
    {
        if ( m_waitset($) != 0 )
            CloseHandle(m_waitset($));
    }

    void lock()
    {
        if ( m_state($).exchange(1,std::mo_acq_rel) )
        {
            HANDLE h = m_waitset($).load(std::mo_acquire);
            if ( h == 0 )
            {
                HANDLE newH = CreateEvent(NULL,0,0,NULL);
                if ( m_waitset($).compare_exchange_strong(h,newH,std::mo_acq_rel) )
                {
                     h = newH;
                }
                else
                {
                    // loaded h
                    RL_ASSERT( h != 0 );
                    CloseHandle(newH);
                }
            }
            RL_ASSERT( h != 0 );
            while ( m_state($).exchange(2,std::mo_acq_rel) )
            {
                WaitForSingleObject(h, INFINITE);
            }
        }
    }

    void unlock()
    {
        if ( m_state($).exchange(0,std::mo_acq_rel) == 2 )
        {
            HANDLE h = m_waitset($).load(std::mo_relaxed);
            RL_ASSERT(h != 0 );
            SetEvent(h);
        }
    }
};

This is just the same as the previous win_event_mutex , except that it makes the event on first use, using the modern speculative-creation singleton method.

This works fine; unfortunately it is difficult to ever free the event, so once we make it we have it forever. If your goal is to do something like have 4k mutexes that only use 32 OS events, you can't do it this way. (in general you only need as many waitable handles as you have threads, and you might want to have many more lockable objects than that).

I implemented one way of making a mutex that releases its event when not needed, but it's a bit ugly :

Event mutex that only holds event during contention :


struct win_event_mutex2
{
    struct state // 64 bit double-word 
    {
        // two 32 bit words :
        int lock; HANDLE event; 

        state(int l,HANDLE h) : lock(l), event(h) { } 
        state() : lock(0), event(0) { } 
        bool operator == (const state & rhs) const { return lock == rhs.lock && event == rhs.event; }
    };
    
    std::atomic<state> m_state; // = 0

    win_event_mutex2()
    {
    }
    ~win_event_mutex2()
    {
        state local = m_state($);
        if ( local.event != 0 )
            CloseHandle(local.event);
    }

    void lock()
    {
        HANDLE newH = 0;
            
        state oldState = m_state($).load(std::mo_acquire);
        state newState;
        
        for(;;)
        {
            // increment the lock count :
            newState = oldState;
            newState.lock = oldState.lock+1;
            
            // if there is contention, make sure there is an event to wait on :
            if ( newState.lock > 1 && newState.event == 0 )
            {
                if ( newH == 0 )
                {
                    newH = CreateEvent(NULL,0,0,NULL);
                }
                
                newState.event = newH;              
            }
            else if ( newState.lock == 1 && newState.event == newH )
            {
                newState.event = 0;
            }

            // try to swap in the lock count and event handle at the same time :
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;

        }
        
        if ( newH && newH != newState.event )
        {
            // I made an event but didn't use it
            CloseHandle(newH);
        }
                
        if ( oldState.lock == 0 )
        {
            // I got the lock
            RL_ASSERT( newState.lock == 1 );
            return;
        }
        
        // lock is contended :
        RL_ASSERT( newState.lock > 1 );
        RL_ASSERT( newState.event != 0 );

        WaitForSingleObject(newState.event, INFINITE);
        
        // I own the mutex now!
    }

    void unlock()
    {
        state oldState = m_state($).load(std::mo_acquire);
        RL_ASSERT( oldState.lock >= 1 );
        state newState(0,0);
        
        // at this moment I own the mutex
        
        for(;;)
        {
            // release the lock, and if we're no longer contended remove the event
            RL_ASSERT( oldState.lock >= 1 );
            newState = oldState;
            newState.lock--;
        
            if ( newState.lock == 0 && newState.event != 0 )
            {
                newState.event = 0;
            }
        
            if ( m_state($).compare_exchange_weak(oldState,newState) )
                break;
        }

        if ( oldState.event )
        {
            if ( newState.event == 0 )
            {
                RL_ASSERT( newState.lock == 0 );

                CloseHandle(oldState.event);
            }
            else
            {
                RL_ASSERT( newState.lock > 0 );
                SetEvent(oldState.event);
            }
        }
    }
};

This is always the cop-out method for implementing lock free algorithms - take the two variables that you need to stay in sync (in this case the lock count and the presence of a waitable event) - and just mush them together into a bigger word and CAS it atomically. That way you don't have to think carefully about all the funny state transition possibilities. I'm sure someone could do a better version of this that's not so expensive in atomic ops on the fast path (no contention).

(btw it's safer / more portable to make "state" just be a uint64 and do the packing manually with shifts and ors; don't use a struct inside std::atomic, it causes too many headaches and is too risky)
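
(Something like this sketch, say; the 8/56 bit split is arbitrary, the helper names are mine, and it leans on the fact that handle values are small on Windows in practice :)


#include <windows.h>
#include <stdint.h>

// hypothetical manual packing for "state" : lock count in the top 8 bits,
// event handle value in the low 56 bits
typedef uint64_t packed_state;

inline packed_state pack_state( uint32_t lockCount, HANDLE event )
{
    RL_ASSERT( lockCount < 256 );
    return ( (uint64_t)lockCount << 56 ) | (uint64_t)(uintptr_t)event;
}
inline uint32_t state_lock_count( packed_state s ) { return (uint32_t)( s >> 56 ); }
inline HANDLE   state_event     ( packed_state s ) { return (HANDLE)(uintptr_t)( s & ((1ull<<56)-1) ); }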

(to be clear, the problem with this algorithm is that the no-contention fast path is way more expensive than any of our previous mutexes).

Also note : you shouldn't actually use CreateEvent/CloseHandle with something like this; you should have a recycling pool of OS Events that you alloc/free from an event cache of some kind. As I said before, you only need one per thread. If you do use a pool like this, you have to be a bit careful about whether events can be recycled in signalled state, and whether you want to try to Reset them at some point (beware: someone could still be in the process of waking up from one), or just make your algorithm work okay with a pre-signalled event, or something.
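
(A sketch of the dumbest possible version of such an event cache, just a mutex-protected free list; the name event_cache and all the details are mine, and a real one would want to be lock-free or per-thread :)


#include <windows.h>
#include <vector>
#include <mutex>

struct event_cache
{
    std::mutex          m_mutex;
    std::vector<HANDLE> m_free;

    HANDLE alloc()
    {
        {
            std::lock_guard<std::mutex> l(m_mutex);
            if ( ! m_free.empty() )
            {
                HANDLE h = m_free.back();
                m_free.pop_back();
                return h;
            }
        }
        // cache empty : make a fresh auto-reset event, initially unsignalled
        return CreateEvent(NULL,0,0,NULL);
    }

    void release(HANDLE h)
    {
        // h may still be signalled, or someone may still be waking up from it;
        // this dumb version just shelves it and lets the next user cope
        std::lock_guard<std::mutex> l(m_mutex);
        m_free.push_back(h);
    }

    ~event_cache()
    {
        for ( size_t i = 0 ; i < m_free.size() ; i++ )
            CloseHandle(m_free[i]);
    }
};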

There is one way this mutex is better than any previous one - when a thread receives a wakeup, it is never bogus; it automatically has the lock when it is woken up. Also under contention there can be no "stolen locks" from threads that jump the queue and quickly grab the atomic - one of the sleeping threads will always be woken.

(btw "pseudo-fairness" like this is not always better; in the case that all your threads are equivalent workers, you actually want a LIFO mutex, because LIFO keeps caches hotter and makes thread switches less likely. However, some code can be stalled indefinately by LIFO mutexes, so they are very dangerous in the general case.)

Okay, that's enough mutexes for one post, we'll do some followups in a later one.

5 comments:

Miroslav said...

I think it's better to replace the comments about sizes with STATIC_ASSERTs, so it will not compile if the sizes are different. For example, on Win x64 HANDLE == void*.

cbloom said...

Nah, I never use compile-time asserts. I'm really not a fan of self-checking code.

cbloom said...

Ok, I apologize for the snarkiness, but I am like the *inventor* of using way too many compile-time asserts on everything (look at cblib/Threading.h for what my real code looks like).

Telling me to use more compile-time asserts is like telling Mister Rogers he needs to be nicer to people. Or telling the pope he should get a taller hat. Or telling Charlie Sheen he should really try this thing called cocaine. Or telling Brian Blessed he needs to speak up. Or...

Miroslav said...

Sorry, I was hoping that last part about HANDLE not being 32 bits on one of the popular OSes would make it less rude.
A quick look at Threading.h shows a lot of the usual compiler-specific preprocessor plumbing. A dozen or so compile assertions are OK and don't stand out.

cbloom said...

Yes, obviously if you actually wanted to use that type of lock on Win64, you would make "state" be a U64 and you would put the lock count in the top 8 bits or so, because handle values are always small on windows. (this is the same kind of thing that MS's primitives do internally).
