
07-17-11 - Per-thread event mutexes

Another class of mutex implementations that we haven't talked about yet are those based on a list of waiting threads. Again this is a general pattern which is useful in lots of threading primitives, so it's useful to talk about.

The basic common idea is that you have some kind of node per thread. This can be in TLS, it can be on the stack (the stack is a form of TLS, BTW), or if you're the OS it can be in the OS thread structure. (BTW most implementations of mutexes and waitlists inside the kernel take this form, using a node inside the OS thread structure, but of course they are much simpler because they are in control of thread switching.)

A common advantage of this kind of scheme is that you only need as many waitable handles as threads, and you can have many many mutexes.
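Concretely, the per-thread node used in the code below is something like this (a sketch ; the exact types are assumed, with a Win32 auto-reset event as the per-thread waitable handle) :


// LFSNode is the intrusive link used by the lock-free list ;
// ThreadNode adds the per-thread waitable handle :

struct LFSNode
{
    std::atomic<LFSNode *>  m_next;
};

struct ThreadNode : public LFSNode
{
    HANDLE  m_event;

    ThreadNode()  { m_event = CreateEvent(NULL,FALSE,FALSE,NULL); } // auto-reset, unsignaled
    ~ThreadNode() { CloseHandle(m_event); }
};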

So a pseudo-code sketch is something like :


per-thread node for linked list

lock(mutex,node) :
  try to acquire mutex
  if failed, add node to waiting list and wait

unlock(mutex) :
  pop node off waiting list
  if not null, set owner to him and wake him
  else set owner to none

If I'm building my own mutex in user mode and don't want to use a previously existing mutex, I need to make the linked list of waiters lock-free. Now, note that the linked list I need here is actually "MPSC" (multi-producer single-consumer), because the consumer is always the thread that currently holds the mutex. (Being SC doesn't mean the same thread always has to consume; it means there must only be one consumer at a time, with some mechanism ensuring exclusion (such as the mutex itself).)

If I'm going to go to the trouble of managing this list, then I want my unlock to provide "direct handoff" - that is, guarantee that a waiter gets the lock, so no new locker can sneak in and grab it. I also want my threads to really be able to go to sleep and wake up, since we saw with the ticket lock that if you don't control thread waking with this kind of mutex, you can get the problem that the thread you are handing to is swapped out, which blocks all the other threads from proceeding.

Now I haven't yet specified that the list has to be ordered, but sure let's make it a FIFO so we have a "fair" mutex, and I'll go ahead and use an MPSC_FIFO queue because I have one off the shelf that's tested and ready to go.

So, we obviously need some kind of prepare_wait/double-check/wait mechanism in this scenario, so we try something like :


lock(mutex,node) :
{
 try acquire mutex ; if succeeded , return

 prepare_wait : push node to MPSC_FIFO

 double check :
 try acquire mutex
 if succeeded :
   cancel_wait
   return

 wait;
}

unlock() :
{
 node = pop MPSC_FIFO
 if no node
   set mutex unlocked , return

 set owner to node
 wake node->thread
}

(this doesn't work). But we're on the right track. There's one race that we handle okay :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada
unlocker : set mutex unlocked

locker : push to FIFO
locker : double check acquire
locker : gets it now , cancel wait

that's okay. But there's another race that we don't handle :

locker : tries to acquire mutex, can't

unlocker : pop FIFO , gets nada

locker : push to FIFO
locker : double check acquire , doesn't get it

unlocker : set mutex unlocked

locker : wait

ruh-roh ! deadlock.

So, there's one ugly way to fix this. In the unlock, before the pop, you change the mutex status from "locked" to "unlocking in progress". Then when the locker does the double-check, he will fail to get the mutex but he will see the "unlocking in progress" flag, and he can handle it (one way to handle it is by spinning until the state is no longer "in progress").
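Something like this sketch (the UNLOCK_IN_PROGRESS state is hypothetical and the memory orders are illustrative only, not carefully chosen) :


enum { UNLOCKED = 0, LOCKED = 1, UNLOCK_IN_PROGRESS = 2 };

void unlock()
{
    // block the gate while we decide who gets the lock :
    m_state($).store(UNLOCK_IN_PROGRESS,std::mo_release);

    ThreadNode * tn = (ThreadNode *) MPSC_FIFO_Pop(&m_list);
    if ( tn )
    {
        m_state($).store(LOCKED,std::mo_release); // direct handoff to tn
        SetEvent(tn->m_event);
    }
    else
    {
        m_state($).store(UNLOCKED,std::mo_release);
    }
}

// and in lock(), the double-check after the push becomes :
for(;;)
{
    int state = m_state($).load(std::mo_acquire);
    if ( state == UNLOCK_IN_PROGRESS )
        continue; // spin ; this is the ugly part

    if ( state == UNLOCKED )
    {
        int zero = UNLOCKED;
        if ( m_state($).compare_exchange_strong(zero,LOCKED,std::mo_acq_rel) )
        {
            cancel_wait(tn);
            return;
        }
        continue; // someone else got it ; re-read the state
    }

    break; // LOCKED ; the owner will pop our node, go to sleep
}
WaitForSingleObject(tn->m_event, INFINITE);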

But this is quite ugly. And we would like our mutex not to have any spins at all, and in particular no spins that can be blocked for the duration of a thread switch. (Though to some extent this is inherent in mutexes, so it's not a disaster here, it is something to beware of generally in lock-free design - you don't want to design with exclusive states like "unlock in progress" that block other people if you get swapped out.)

So, another way to handle that second bad race is for the double-check to be mutating. If double-check is a fetch_add , then when unlocker sets the mutex to unlocked, there will be a count in there indicating that there was a push after our pop.

Thus we change the "gate" to the mutex to be a counter - we use a high bit to indicate locked state, and the double check does an inc to count waiters :


struct list_mutex3
{
    enum
    {
        UNLOCKED = 0,
        LOCKED = (1<<30),
        WAITER = 1
    };
    
    std::atomic<int> m_state;
    MPSC_FIFO   m_list;

    list_mutex3()
    {
        m_state($).store( UNLOCKED );
        MPSC_FIFO_Open(&m_list);
    }
    ~list_mutex3()
    {
        MPSC_FIFO_Close(&m_list);
    }

    void lock(ThreadNode * tn)
    {
        // optionally do a few spins before going to sleep :
        const int spin_count = 10;
        for(int spins=0;spins<spin_count;spins++)
        {
            int zero = UNLOCKED;
            if ( m_state($).compare_exchange_strong(zero,LOCKED,std::mo_acq_rel) )
                return;
                
            HyperYieldProcessor();
        }
                  
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
        
        // double check :
        int prev = m_state($).fetch_add(WAITER,std::mo_acq_rel);
        if ( prev == UNLOCKED )
        {
            // I got the lock , but set the wrong bit, fix it :
            m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }

        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-LOCKED,std::mo_release);
        ASSERT( prev >= LOCKED );
        if ( prev == LOCKED )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the WAITER inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;

        // we turned off locked, turn it back on, and subtract the waiter we popped :
        prev = m_state($).fetch_add(LOCKED-WAITER,std::mo_release);
        ASSERT( prev < LOCKED && prev >= WAITER );
        SetEvent(tn->m_event);
    }
    
    void cancel_wait(ThreadNode * tn)
    {
        // we hold the mutex here, so we are the single consumer of the list :
        MPSC_FIFO_Fetch(&m_list);
        MPSC_FIFO_Remove(&m_list,tn);
    }
};

I think it's reasonably self-explanatory what's happening. Normally when the mutex is locked, the LOCKED bit is on, and there can be some number of waiters that have inc'ed the low bits. The unlock is reasonably fast because it checks the waiter count and doesn't have to bother with queue pops if there are no waiters.
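Usage is the obvious thing ; a quick sketch (each thread supplies its own node, which as noted at the top could live on the stack, in TLS, or in the thread struct) :


list_mutex3 g_mutex;

void worker_thread()
{
    ThreadNode tn; // one node per thread ; must live as long as the lock is held

    g_mutex.lock(&tn);
    // ... critical section ...
    g_mutex.unlock();
}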

In the funny race case, what happens is the LOCKED bit turns off, but WAITER gets inc'ed at the same time, so the mutex is still blocked from entry (because the initial CAS to enter is checking against zero, it's not checking the LOCKED bit). During this funny phase that was previously a race, now the unlocker will see that the double-check has happened (and failed) and will proceed into the pop-signal branch.

Remember that fetch_add returns the value *before* the add. (this sometimes confuses me because the Win32 InterlockedIncrement returns the value *after* the increment).
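That is :


std::atomic<int> x(5);
int prev = x.fetch_add(1); // prev == 5 , x is now 6
// InterlockedIncrement would return 6 here, the post-increment value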

cancel_wait is possible because at that point we own the mutex, thus we are the SC for the MPSC and we can do whatever we want to it. In particular my implementation of MPSC uses Thomasson's trick of building it from MPMC stack and then using exchange to grab all the nodes and reverse them to FIFO order. So Fetch does the exchange and reverse, and then I have a helper that can remove a node. (obviously those should be combined for efficiency). You should be able to do something similar with most MPSC implementations (*).

(* = Dmitry has a very clever MPSC implementation (here : "low-overhead mpsc queue" at Scalable Synchronization Algorithms) which you cannot use here. The problem is that Dmitry's MPSC can be temporarily made smaller by a Push. During Push, it goes through a phase where previously pushed nodes are inaccessible to the popper. This is fine if your popper is something like a worker thread that just spins in a loop popping nodes, because it will eventually see them, but in a case like this I need the guarantee that anything previously pushed is definitely visible to the popper.)
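For reference, a minimal sketch of that exchange-and-reverse style of MPSC_FIFO (illustrative only ; the memory orders and details need more care in real code) :


// producers push onto an MPMC lock-free (Treiber) stack ; the single
// consumer exchanges the whole stack out and reverses it into FIFO order :

struct MPSC_FIFO
{
    std::atomic<LFSNode *>  m_pushList;  // shared LIFO ; producers push here
    LFSNode *               m_popList;   // private FIFO ; consumer only
};

void MPSC_FIFO_Open(MPSC_FIFO * q)  { q->m_pushList($).store(NULL); q->m_popList = NULL; }
void MPSC_FIFO_Close(MPSC_FIFO * q) { }

void MPSC_FIFO_Push(MPSC_FIFO * q, LFSNode * node)
{
    LFSNode * head = q->m_pushList($).load(std::mo_relaxed);
    do { node->m_next($).store(head,std::mo_relaxed); }
    while ( ! q->m_pushList($).compare_exchange_weak(head,node,std::mo_release) );
}

// Fetch : grab everything pushed so far and reverse it into the private list :
void MPSC_FIFO_Fetch(MPSC_FIFO * q)
{
    LFSNode * lifo = q->m_pushList($).exchange(NULL,std::mo_acquire);
    while ( lifo )
    {
        LFSNode * next = lifo->m_next($).load(std::mo_relaxed);
        lifo->m_next($).store(q->m_popList,std::mo_relaxed);
        q->m_popList = lifo;
        lifo = next;
    }
}

LFSNode * MPSC_FIFO_Pop(MPSC_FIFO * q)
{
    // only fetch into an empty private list, to preserve FIFO order :
    if ( q->m_popList == NULL )
        MPSC_FIFO_Fetch(q);

    LFSNode * node = q->m_popList;
    if ( node )
        q->m_popList = node->m_next($).load(std::mo_relaxed);
    return node;
}

MPSC_FIFO_Remove is then just a plain walk of m_popList, since only the consumer touches it. Note that a Push here never hides previously pushed nodes from the consumer, which is exactly the guarantee we need.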

In the fast path (no contention), lock is one CAS and unlock is one fetch_add (basically the same as a CAS). That's certainly not the cheapest mutex in the world but it's not terrible.

Now, clever readers may have already noticed that we don't actually need the LOCKED bit at all. I left it in because it's a nice illustration of the funny state changes that happen in the race case, but in fact we can set LOCKED=1, and then all our adds of (LOCKED-WAITER) go away, which gives us the simpler code :


    void lock(ThreadNode * tn)
    {   
        int zero = 0;
        if ( m_state($).compare_exchange_strong(zero,1,std::mo_acq_rel) )
            return;
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                    
        // inc waiter count :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // remove self from wait list :
            cancel_wait(tn);
            return;
        }
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        LFSNode * pNode = MPSC_FIFO_Pop(&m_list);
        // there must be one because the waiter-count inc is after the push
        ASSERT( pNode != NULL );

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

and it's obvious that m_state is just an entry count.

In fact you can do an even simpler version that doesn't require cancel_wait :


    void lock(ThreadNode * tn)
    {                       
        // inc entry count (this is the lock attempt and the waiter registration in one) :
        int prev = m_state($).fetch_add(1,std::mo_acq_rel);
        if ( prev == 0 )
        {
            // got the lock
            return;
        }
                
        // register waiter :
        MPSC_FIFO_Push(&m_list,tn);
                
        // wait :
        WaitForSingleObject(tn->m_event, INFINITE);
        
        // ownership has been passed to me
    }

    void unlock()
    {
        int prev = m_state($).fetch_add(-1,std::mo_release);
        if ( prev == 1 )
        {
            // no waiters
            return;
        }
        
        // try to signal a waiter :
        // here the waiter incs the count *before* pushing its node, so the
        // node may not be visible yet ; spin until it shows up :
        LFSNode * pNode = NULL;
        rl::backoff bo;
        for(;;)
        {
            pNode = MPSC_FIFO_Pop(&m_list);
            if ( pNode ) break;
            bo.yield($);
        }

        // okay, hand off the lock directly to tn :         
        ThreadNode * tn = (ThreadNode *) pNode;
        SetEvent(tn->m_event);
    }

where you loop in the unlock to catch the race. This last version is not recommended, because it doesn't allow spinning before going to sleep, and requires a loop in unlock.

One more note : all of these suffer from what Thomasson calls the "SCHED_OTHER" problem. SCHED_OTHER is a Linux term for one of the schedulers in that OS; what it means in this context is that we are not respecting thread priorities or any more exotic scheduling that the OS wants, because each thread here is waiting on its own event (and by "event" I mean "generic OS waitable handle"). If what you really want is a FIFO mutex then that's fine, you got it, but usually you would rather respect the OS scheduler, and to do that you need all your waiting threads to wait on the same handle.
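For contrast, the simplest version of that - all waiters sharing one auto-reset event, so the OS picks which thread wakes - looks something like this sketch (no per-thread node and no direct handoff ; note it makes a kernel call on every unlock, so a real version would count waiters) :


struct shared_event_mutex
{
    std::atomic<int>  m_state;   // 0 = unlocked , 1 = locked
    HANDLE            m_event;   // ONE auto-reset event shared by all waiters

    shared_event_mutex()  { m_state($).store(0); m_event = CreateEvent(NULL,FALSE,FALSE,NULL); }
    ~shared_event_mutex() { CloseHandle(m_event); }

    void lock()
    {
        // no direct handoff : wake and retry ; the OS chooses which waiter wakes
        while ( m_state($).exchange(1,std::mo_acquire) != 0 )
            WaitForSingleObject(m_event, INFINITE);
    }

    void unlock()
    {
        m_state($).store(0,std::mo_release);
        SetEvent(m_event); // wakes at most one waiter
    }
};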
